command line scanned pdf to text

John G Heim jheim at math.wisc.edu
Mon Nov 2 17:06:09 EST 2015


Huh, it strikes me as strange that tesseract didn't work for you. I used 
tesseract last week to read a page in a pdf document that was stored as 
an image. I used pdftohtml to extract the image and then tesseract to 
convert it to text. I also pretty routinely use tesseract to read screen 
capture images. It's not very accurate there but it's usually good 
enough to make sense of.
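
A rough sketch of that kind of PDF workflow follows. It is not the exact procedure described above: it assumes poppler-utils is installed and renders the page with pdftoppm instead of extracting the image with pdftohtml. The function name ocr_pdf_page and the 300 dpi setting are made up for illustration.

```shell
#!/bin/sh
# Sketch: OCR one page of an image-only PDF and print the text.
# Assumptions: pdftoppm (poppler-utils) and tesseract are installed;
# ocr_pdf_page is a hypothetical helper name, not part of either tool.
ocr_pdf_page() {
    pdf=${1:-}
    page=${2:-1}
    [ -r "$pdf" ] || { echo "cannot read: $pdf" >&2; return 1; }
    command -v pdftoppm >/dev/null 2>&1 || { echo "pdftoppm not found" >&2; return 1; }
    command -v tesseract >/dev/null 2>&1 || { echo "tesseract not found" >&2; return 1; }
    tmp=$(mktemp -d) || return 1
    # Render only the requested page to a PNG (300 dpi is a guess at a
    # reasonable OCR resolution), then send the recognized text to stdout.
    pdftoppm -r 300 -png -f "$page" -l "$page" -singlefile "$pdf" "$tmp/page" &&
        tesseract "$tmp/page.png" stdout
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Something like "ocr_pdf_page book.pdf 3 > page3.txt" would then save the text of page 3, where book.pdf is a placeholder name.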

Just "tesseract <infile> <outfile>" should work. The infile can be the 
string "stdin", in which case it reads from standard input. The outfile 
can be "stdout", in which case it writes the text to stdout. Offhand, 
I do not have the command line I used to scan the D&D book. It's on 
a computer at home that is turned off at the moment. But I can post the 
whole thing tonight. Here are some lines from a backup version of the 
script:

scanimage --format=tiff --mode Lineart --resolution 600 > /tmp/page.tiff
tesseract /tmp/page.tiff stdout
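
Since tesseract accepts "stdin" and "stdout", the two steps could probably be joined with a pipe so no temporary file is needed. This is only a sketch: it assumes a default SANE scanner is configured and that the installed tesseract build supports "stdin" as described above. The function name scan_to_text is made up.

```shell
#!/bin/sh
# Sketch: scan one page and OCR it without a temporary file.
# Assumptions: a default SANE scanner is configured; this tesseract
# build accepts "stdin" and "stdout"; scan_to_text is a hypothetical name.
scan_to_text() {
    out=${1:-}
    [ -n "$out" ] || { echo "usage: scan_to_text OUTFILE" >&2; return 2; }
    command -v scanimage >/dev/null 2>&1 || { echo "scanimage not found" >&2; return 1; }
    command -v tesseract >/dev/null 2>&1 || { echo "tesseract not found" >&2; return 1; }
    # Same scanner settings as in the script lines above, joined by a pipe.
    scanimage --format=tiff --mode Lineart --resolution 600 |
        tesseract stdin stdout > "$out"
}
```

Running "scan_to_text page1.txt" would write the recognized text to page1.txt. The return status is that of tesseract, so a scanner failure may only show up as empty or garbled output.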


On 11/02/2015 02:53 PM, Cheryl Homiak wrote:
> Would you mind enlarging on this if you can and have time? What kind of file did you use and what did you put in your command-line? I am asking this because I have tried to use tesseract a couple of times with tiff files and have gotten mostly gibberish so obviously I am doing something wrong. I am running debian testing if that makes a difference.
>
> Thanks.
>

-- 
John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, 
sip:jheim at sip.linphone.org

