command line scanned pdf to text

John G Heim jheim at math.wisc.edu
Tue Nov 3 09:30:31 EST 2015


Here is the complete script.  Sorry I forgot to post it last night. I 
turned the machine on as I left this morning and sshed into it from 
work. There's some junk in here you may or may not be interested in. You 
can pass the script 2 parameters. #1 is the page number. It uses this 
number to make the output text file name. Page 99 would be named 
p099.txt. If you don't pass it a page number, it looks for files 
matching the same pattern and takes the next highest number. So if there 
already is a p099.txt, it would create a p100.txt. The second parameter 
is the tesseract psm (page segmentation mode) flag.  The tesseract man 
page explains these. The default is 3.
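
The page numbering rule above can be sketched on its own, roughly like 
this. The function name next_page_file and the temporary directory are 
made up for illustration and are not part of the posted script. Note 
that the loop picks the first unused number, which is the same as the 
next highest number as long as pages are scanned in order.

```shell
#!/bin/bash
# Sketch of the page-numbering rule. next_page_file and DIR are
# illustrative names, not taken from the posted script.

# Print the page file name for an explicit page number ($2), or for the
# first unused page number in directory $1 when no number is given.
next_page_file() {
    local dir=$1 idx=$2 file
    if [ -n "$idx" ]; then
        printf '%s/p%03d.txt\n' "$dir" "$idx"
        return
    fi
    for idx in {1..999}; do
        file=$(printf '%s/p%03d.txt' "$dir" "$idx")
        if [ ! -f "$file" ]; then
            printf '%s\n' "$file"
            return
        fi
    done
    return 1    # all 999 page files already exist
}

# Demo in a scratch directory that already holds pages 1 and 2.
DIR=$(mktemp -d)
touch "$DIR/p001.txt" "$DIR/p002.txt"
next_page_file "$DIR" 99    # explicit page number: path ends in p099.txt
next_page_file "$DIR"       # no page number: path ends in p003.txt
rm -r "$DIR"
```

One thing to watch: bash's printf reads a number with a leading zero as 
octal, so pass the page as 8 rather than 08.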

After it's done with the scan and ocr, it concatenates all the pages 
into one big file. It also beeps when it finishes: twice if the page it 
just scanned is an even numbered page, once if it is odd. This is to 
remind me to turn the page. Otherwise I sometimes forget if I've already 
done both sides.
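
For anyone who wants to try the beep count and the concatenation step 
without a scanner attached, here is a minimal sketch. The helper names 
beep_count and combine_pages are hypothetical, and the sketch starts the 
combined file empty where the posted script starts it with a blank line.

```shell
#!/bin/bash
# Sketch only: beep_count and combine_pages are illustrative names.

# Number of beeps for a page: 1 for an odd page, 2 for an even page.
beep_count() {
    echo $((2 - $1 % 2))
}

# Concatenate every existing pNNN.txt in directory $1 into file $2,
# with a "Page N" heading before each page and a form feed after it.
combine_pages() {
    local dir=$1 out=$2 idx file
    : > "$out"                      # start with an empty output file
    for idx in {1..999}; do
        file=$(printf '%s/p%03d.txt' "$dir" "$idx")
        [ -f "$file" ] || continue  # skip page numbers not scanned yet
        {
            echo "Page $idx"
            cat "$file"
            printf '\f\n'
        } >> "$out"
    done
}

# Demo with two fake pages in a scratch directory.
DIR=$(mktemp -d)
echo "text of page one" > "$DIR/p001.txt"
echo "text of page two" > "$DIR/p002.txt"
combine_pages "$DIR" "$DIR/combined.txt"
cat "$DIR/combined.txt"
for page in 1 2; do
    echo "page $page: $(beep_count "$page") beep(s)"
done
rm -r "$DIR"
```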


#!/bin/bash

# Page number: use $1 if given, otherwise the first unused page number.
IDX=$1
if [ -n "$IDX" ]; then
	TEXT=`printf "/home/john/phb5/p%03d.txt" "$IDX"`
else
	for IDX in {1..999}; do
		TEXT=`printf "/home/john/phb5/p%03d.txt" "$IDX"`
		test ! -f "${TEXT}" && break
	done
fi
# Set VERBOSE in the environment to get progress messages.
test ! -z "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "

PSM="$2"
test -z "$PSM" && PSM=3

RESOLUTION=600
SCAN=/tmp/page.tif
scanimage --format=tiff --mode Lineart --resolution $RESOLUTION > $SCAN

PAGE=/tmp/page
# Silence tesseract; the redirections must be in this order to catch stderr too.
tesseract -psm "$PSM" $SCAN $PAGE >/dev/null 2>&1
# (One copy of this script filters the text through a local "cleantext" command here.)
cat "${PAGE}.txt" >> "$TEXT"
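# One beep after an odd page, two after an even page.
/usr/bin/beep -r $((2 - IDX % 2))
test ! -z "$VERBOSE" && file "${TEXT}"
# Rebuild the combined file from all pages scanned so far.
OUTFILE="/home/john/phb5/PHB5.txt"
echo "" > "$OUTFILE"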
for IDX in {1..999}; do
	TEXTFILE=`printf "/home/john/phb5/p%03d.txt" $IDX`
	if [ -f "$TEXTFILE" ]; then
		echo "Page $IDX" >>  "$OUTFILE"
		cat "$TEXTFILE">> "$OUTFILE"
		echo -e "\f" >> "$OUTFILE"
	fi
done
# EOF


On 11/02/2015 04:39 PM, Cheryl Homiak wrote:
> Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
> I suspect the error was mine so I won't give up on it yet.
>
> Thanks.
>

-- 
John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, 
sip:jheim at sip.linphone.org

