command line scanned pdf to text

John G Heim jheim at math.wisc.edu
Wed Nov 4 10:11:22 EST 2015


On Ubuntu it's tesseract-ocr-eng.
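
In case it helps, here is a rough sketch of installing only the English data and then checking that tesseract can see it. The package names are the Debian/Ubuntu ones (note the three-letter code, eng). The helper name has_tess_lang is made up for illustration and is not part of tesseract.

```shell
# Install the engine plus only the English data (Debian/Ubuntu package names):
#   sudo apt-get install tesseract-ocr tesseract-ocr-eng

# has_tess_lang is a hypothetical helper: it succeeds if the given language
# code appears on a line of its own in the output of "tesseract --list-langs".
# Some tesseract versions print that list on stderr, hence the 2>&1.
has_tess_lang() {
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Usage: has_tess_lang eng && echo "English data is installed"
```

If the check fails after installing, the data file is probably not where your tesseract build looks for it, but that is a guess on my part.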


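For the scanned-PDF case discussed in the quoted messages below, a minimal sketch of the whole pipeline might look like this. It assumes pdftoppm from poppler-utils is installed and uses it to render pages, which is a swap for the pdftohtml image extraction I mentioned earlier, not the same method. The function name pdf_ocr and the 300 dpi setting are my own choices for illustration.

```shell
# pdf_ocr is a hypothetical helper: render each page of a scanned PDF to a
# PNG with pdftoppm, then run tesseract on each page image in turn.
pdf_ocr() {
    pdf=$1
    tmpdir=$(mktemp -d) || return 1
    # -r 300 renders at 300 dpi; pdftoppm writes one PNG per page,
    # named page-<number>.png inside the temporary directory.
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page" || { rm -rf "$tmpdir"; return 1; }
    for img in "$tmpdir"/page-*.png; do
        [ -e "$img" ] || continue
        # "stdout" as the output name makes tesseract write the text to stdout.
        tesseract "$img" stdout -l eng
    done
    rm -rf "$tmpdir"
}

# Usage: pdf_ocr scanned.pdf > scanned.txt
```

Higher resolution tends to help with small print at the cost of speed, so the 300 is only a starting point.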

On 11/04/2015 09:01 AM, Jude DaShiell wrote:
> What data pack for tesseract has the english language in it?  I'm being
> prompted to download a data pack and I figure best get what language I
> understand rather than the whole data set since both memory and disk
> space over here are not unlimited.
>
> On Mon, 2 Nov 2015, Cheryl Homiak wrote:
>
>> Date: Mon, 2 Nov 2015 17:39:38
>> From: Cheryl Homiak <cah4110 at icloud.com>
>> Reply-To: Speakup is a screen review system for Linux.
>>     <speakup at linux-speakup.org>
>> To: Speakup is a screen review system for Linux.
>> <speakup at linux-speakup.org>
>> Subject: Re: command line scanned pdf to text
>>
>> Thanks much. No, the way to get into a turned-off computer far away
>> hasn't been invented yet, unless you can turn it on by remote control
>> somehow - :-)
>> I suspect the error was mine so I won't give up on it yet.
>>
>> Thanks.
>>
>> --
>> Cheryl
>>
>> May the words of my mouth
>> and the meditation of my heart
>> be acceptable to You, Lord,
>> my rock and my Redeemer.
>> (Psalm 19:14 HCSB)
>>
>>> On Nov 2, 2015, at 4:06 PM, John G Heim <jheim at math.wisc.edu> wrote:
>>>
>>> Huh, it strikes me as strange that tesseract didn't work for you. I
>>> used tesseract last week to read a page in a pdf document that was
>>> stored as an image. I used pdftohtml to extract the image and then
>>> tesseract to convert it to text. I also pretty routinely use
>>> tesseract to read screen capture images. It's not very accurate there
>>> but it's usually good enough to make sense of.
>>>
>>> Just "tesseract <infile> <outfile>" should work. The infile can be
>>> the string "stdin", in which case it reads from standard input. The
>>> outfile can be "stdout", in which case it writes the text to stdout.
>>> Offhand, I do not have the command line I use to scan the D&D
>>> book. It's on a computer at home that is turned off at the moment.
>>> But I can post the whole thing tonight. Here are some lines from a
>>> backup version of the script:
>>>
>>> scanimage --format=tiff --mode Lineart --resolution 600 > /tmp/page.tiff
>>> tesseract /tmp/page.tiff stdout
>>>
>>>
>>> On 11/02/2015 02:53 PM, Cheryl Homiak wrote:
>>>> Would you mind enlarging on this if you can and have time? What kind
>>>> of file did you use and what did you put in your command-line? I am
>>>> asking this because I have tried to use tesseract a couple of times
>>>> with tiff files and have gotten mostly gibberish so obviously I am
>>>> doing something wrong. I am running debian testing if that makes a
>>>> difference.
>>>>
>>>> Thanks.
>>>>
>>>
>>> --
>>> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim,
>>> sip:jheim at sip.linphone.org
>>> _______________________________________________
>>> Speakup mailing list
>>> Speakup at linux-speakup.org
>>> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
>>
>

-- 
John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, 
sip:jheim at sip.linphone.org

