command line scanned pdf to text
John G Heim
jheim at math.wisc.edu
Tue Nov 3 09:30:31 EST 2015
Here is the complete script. Sorry I forgot to post it last night. I
turned the machine on as I left this morning and sshed into it from
work. There's some junk in here you may or may not be interested in. You
can pass the script 2 parameters. #1 is the page number. It uses this
number to make the output text file name. Page 99 would be named
p099.txt. If you don't pass it a page number, it looks for files
matching the same pattern and takes the next highest number. So if there
already is a p099.txt, it would create a p100.txt. The second parameter
is the tesseract psm (page segmentation mode) flag. The tesseract man page explains the modes. The
default is 3.
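The naming rule described above can be sketched as two small shell functions. This is only an illustration under stated assumptions: page_name and next_page are made-up helper names, the directory is passed in rather than hard-coded, and page numbers are assumed to be plain decimals without leading zeros.

```shell
# Hypothetical helpers for illustration; the names are not from the script.

# Print the text file name for a page number, e.g. 99 -> p099.txt.
# Assumes a plain decimal number without leading zeros.
page_name() {
    printf 'p%03d.txt\n' "$1"
}

# Print the lowest page number (1..999) with no text file in directory $1.
next_page() {
    n=1
    while [ "$n" -le 999 ]; do
        if [ ! -f "$1/$(page_name "$n")" ]; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Try it in a throwaway directory holding pages 1 and 2.
demo=$(mktemp -d)
touch "$demo/p001.txt" "$demo/p002.txt"
page_name 99          # prints p099.txt
next_page "$demo"     # prints 3
rm -rf "$demo"
```

Note that the search stops at the first unused number, so a gap in the sequence would be filled before a new page is added at the end.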
After it's done with the scan and ocr, it concatenates all the pages
into one big file. It also beeps when it finishes: once after an
odd-numbered page and twice after an even-numbered one. This is to
remind me to turn the page. Otherwise I sometimes forget if I've
already done both sides.
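The beep count and the concatenation step can be sketched on their own, without a scanner attached. The helper names beep_count and combine_pages are hypothetical, and unlike the script this sketch starts the combined file empty rather than with a blank line.

```shell
# Hypothetical helpers for illustration; the names are not from the script.

# Print how many beeps follow page $1: 1 for odd, 2 for even.
beep_count() {
    echo $((2 - $1 % 2))
}

# Join DIR/p001.txt .. DIR/p999.txt into OUT, each page preceded by a
# "Page N" line and followed by a form feed.
combine_pages() {
    : > "$2"
    n=1
    while [ "$n" -le 999 ]; do
        f=$(printf '%s/p%03d.txt' "$1" "$n")
        if [ -f "$f" ]; then
            { echo "Page $n"; cat "$f"; printf '\f\n'; } >> "$2"
        fi
        n=$((n + 1))
    done
}

demo=$(mktemp -d)
echo "first page" > "$demo/p001.txt"
echo "second page" > "$demo/p002.txt"
combine_pages "$demo" "$demo/all.txt"
grep -c '^Page ' "$demo/all.txt"   # prints 2
beep_count 3                        # prints 1
beep_count 4                        # prints 2
rm -rf "$demo"
```

Rebuilding the combined file from scratch on every run keeps it in step with the page files, at the cost of rereading all of them each time.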
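#!/bin/bash
# Usage: <script> [page-number] [tesseract-psm]

# Work out which page file to write.
IDX=$1
if [ -z "$IDX" ]; then
    # No page number given: use the first number with no file yet.
    for IDX in {1..999}; do
        TEXT=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
        test ! -f "${TEXT}" && break
    done
fi
TEXT=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
test -n "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "

# Page segmentation mode for tesseract; defaults to 3.
PSM="$2"
test -z "$PSM" && PSM=3
RESOLUTION=600

# Scan the page to a temporary TIFF file.
SCAN=/tmp/page.tif
scanimage --format=tiff --mode Lineart --resolution "$RESOLUTION" > "$SCAN"

# OCR the scan; tesseract writes its result to ${PAGE}.txt.
# The redirections discard both stdout and stderr.
# (Tesseract 4 and later spell the option --psm.)
PAGE=/tmp/page
tesseract -psm "$PSM" "$SCAN" "$PAGE" >/dev/null 2>&1
cat "${PAGE}.txt" >> "$TEXT"

# Beep once after an odd page, twice after an even one.
/usr/bin/beep -r $((2 - IDX % 2))
test -n "$VERBOSE" && file "${TEXT}"

# Rebuild the combined file from every page file that exists.
OUTFILE="/home/john/phb5/PHB5.txt"
echo "" > "$OUTFILE"
for IDX in {1..999}; do
    TEXTFILE=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
    if [ -f "$TEXTFILE" ]; then
        echo "Page $IDX" >> "$OUTFILE"
        cat "$TEXTFILE" >> "$OUTFILE"
        echo -e "\f" >> "$OUTFILE"
    fi
done
# EOF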
On 11/02/2015 04:39 PM, Cheryl Homiak wrote:
> Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
> I suspect the error was mine so I won't give up on it yet.
>
> Thanks.
>
--
John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim,
sip:jheim at sip.linphone.org