command line scanned pdf to text
Cheryl Homiak
cah4110 at icloud.com
Tue Nov 3 11:19:24 EST 2015
Thanks. I did try another file and it worked in both cuneiform and tesseract, so I think the two files I tried were an anomaly or it was a rotation issue. I haven't compared to see which package did the best job, but it doesn't hurt to have both of them.
--
Cheryl
May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)
> On Nov 3, 2015, at 8:30 AM, John G Heim <jheim at math.wisc.edu> wrote:
>
> Here is the complete script. Sorry I forgot to post it last night. I turned the machine on as I left this morning and sshed into it from work. There's some junk in here you may or may not be interested in. You can pass the script 2 parameters. #1 is the page number. It uses this number to make the output text file name. Page 99 would be named p099.txt. If you don't pass it a page number, it looks for files matching the same pattern and takes the next highest number. So if there already is a p099.txt, it would create a p100.txt. The second parameter is the tesseract psm flag. The tesseract man page explains these. The default is 3.
>
> After it's done with the scan and ocr, it concatenates all the pages into one big file. It also beeps if the new page it just scanned is an even numbered page. This is to remind me to turn the page. Otherwise I sometimes forget if I've already done both sides.
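The naming, numbering, beep-count and concatenation logic described above can be pulled out into small functions that are easy to try without a scanner attached. This is only a sketch, not part of the posted script: the function names page_file, next_page, beep_count and combine_pages are made up for illustration, and the directory is passed as an argument instead of being hard-coded.

```shell
# Sketch only: helper names are hypothetical and not from the posted script.
# Written in plain POSIX sh, so it also runs under bash.

# Print the page file name for a directory and page number, e.g. 99 -> p099.txt.
page_file() {
    printf '%s/p%03d.txt' "$1" "$2"
}

# Print the first page number in 1..999 whose file does not exist yet.
next_page() {
    n=1
    while [ "$n" -le 999 ]; do
        if [ ! -f "$(page_file "$1" "$n")" ]; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

# Print the repeat count for beep: 1 for an odd page, 2 for an even page.
beep_count() {
    echo $((2 - $1 % 2))
}

# Concatenate every existing page in directory $1 into the file $2,
# writing a "Page N" heading before each page and a form feed after it.
combine_pages() {
    : > "$2"
    n=1
    while [ "$n" -le 999 ]; do
        f=$(page_file "$1" "$n")
        if [ -f "$f" ]; then
            { echo "Page $n"; cat "$f"; printf '\f\n'; } >> "$2"
        fi
        n=$((n + 1))
    done
}
```

For example, page_file /home/john/phb5 99 prints /home/john/phb5/p099.txt, and beep_count 100 prints 2. Note that next_page returns the first gap in the numbering, which equals the next highest number only when no earlier page is missing.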
>
>
> #!/bin/bash
>
> IDX=$1
> if [ ! -z "$IDX" ]; then
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> else
> for IDX in {1..999}; do
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> test ! -f "${TEXT}" && break
> done
> fi
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> test ! -z "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "
>
> PSM="$2"
> test -z "$PSM" && PSM=3
>
> RESOLUTION=600
> SCAN=/tmp/page.tif
> scanimage --format=tiff --mode Lineart --resolution $RESOLUTION > $SCAN
>
> PAGE=/tmp/page
> tesseract -psm "$PSM" $SCAN $PAGE 2>&1 >/dev/null
> cat "${PAGE}.txt" >> "$TEXT"
> /usr/bin/beep -r $((2 - IDX % 2))
> test ! -z "$VERBOSE" && file "${TEXT}"
> OUTFILE="/home/john/phb5/PHB5.txt"
> echo "" > "$OUTFILE"
> for IDX in {1..999}; do
> TEXTFILE=`printf "/home/john/phb5/p%03d.txt" $IDX`
> if [ -f "$TEXTFILE" ]; then
> echo "Page $IDX" >> "$OUTFILE"
> cat "$TEXTFILE">> "$OUTFILE"
> echo -e "\f" >> "$OUTFILE"
> fi
> done
> # EOF
>
> IDX=$1
> if [ ! -z "$IDX" ]; then
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> else
> for IDX in {1..999}; do
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> test ! -f "${TEXT}" && break
> done
> fi
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> test ! -z "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "
>
> PSM="$2"
> test -z "$PSM" && PSM=3
>
> RESOLUTION=600
> SCAN=/tmp/page.tif
> scanimage --format=tiff --mode Lineart --resolution $RESOLUTION > $SCAN
>
> PAGE=/tmp/page
> tesseract -psm "$PSM" $SCAN $PAGE 2>&1 >/dev/null
> cat "${PAGE}.txt" | cleantext >> "$TEXT"
> /usr/bin/beep -r $((2 - IDX % 2))
> test ! -z "$VERBOSE" && file "${TEXT}"
> OUTFILE="/home/john/phb5/PHB5.txt"
> echo "" > "$OUTFILE"
> for IDX in {1..999}; do
> TEXTFILE=`printf "/home/john/phb5/p%03d.txt" $IDX`
> if [ -f "$TEXTFILE" ]; then
> echo "Page $IDX" >> "$OUTFILE"
> cat "$TEXTFILE">> "$OUTFILE"
> echo -e "\f" >> "$OUTFILE"
> fi
> done
> # EOF
>
> On 11/02/2015 04:39 PM, Cheryl Homiak wrote:
>> Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
>> I suspect the error was mine so I won't give up on it yet.
>>
>> Thanks.
>>
>
> --
> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, sip:jheim at sip.linphone.org
> _______________________________________________
> Speakup mailing list
> Speakup at linux-speakup.org
> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup