command line scanned pdf to text

Cheryl Homiak cah4110 at icloud.com
Wed Nov 4 10:49:03 EST 2015


On Debian the package is tesseract-ocr-eng. It may or may not be installed along with the main tesseract-ocr package; I don't remember having to install it separately, but I have it.
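
A quick way to see whether the English data is already there, as a rough sketch only. It assumes a tesseract new enough to support --list-langs; the function names lang_in_list and check_eng are made up for illustration.

```shell
#!/bin/sh
# Succeeds if the language code given as $1 appears as a whole line on
# standard input (one code per line, as "tesseract --list-langs" prints).
lang_in_list() {
    grep -qx "$1"
}

# Report whether tesseract can see the English pack.
# Some releases print the language list on stderr, hence the 2>&1.
check_eng() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not installed"
        return 1
    fi
    if tesseract --list-langs 2>&1 | lang_in_list eng; then
        echo "eng pack present"
    else
        echo "eng pack missing; try: apt-get install tesseract-ocr-eng"
        return 1
    fi
}
```

Running check_eng after sourcing the file prints one line saying which case applies. Installing only tesseract-ocr-eng, rather than every language pack, keeps the download and disk use small.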

-- 
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)





> On Nov 4, 2015, at 9:11 AM, John G Heim <jheim at math.wisc.edu> wrote:
> 
> On Ubuntu it's tesseract-ocr-eng.
> 
> 
> 
> On 11/04/2015 09:01 AM, Jude DaShiell wrote:
>> Which data pack for tesseract contains the English language?  I'm being
>> prompted to download a data pack, and I figure it's best to get just the
>> language I understand rather than the whole data set, since both memory
>> and disk space over here are limited.
>> 
>> On Mon, 2 Nov 2015, Cheryl Homiak wrote:
>> 
>>> Date: Mon, 2 Nov 2015 17:39:38
>>> From: Cheryl Homiak <cah4110 at icloud.com>
>>> Reply-To: Speakup is a screen review system for Linux.
>>>    <speakup at linux-speakup.org>
>>> To: Speakup is a screen review system for Linux.
>>> <speakup at linux-speakup.org>
>>> Subject: Re: command line scanned pdf to text
>>> 
>>> Thanks much. No, the way to get into a turned-off computer far away
>>> hasn't been invented yet, unless you can turn it on by remote control
>>> somehow - :-)
>>> I suspect the error was mine so I won't give up on it yet.
>>> 
>>> Thanks.
>>> 
>>> --
>>> Cheryl
>>> 
>>> May the words of my mouth
>>> and the meditation of my heart
>>> be acceptable to You, Lord,
>>> my rock and my Redeemer.
>>> (Psalm 19:14 HCSB)
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On Nov 2, 2015, at 4:06 PM, John G Heim <jheim at math.wisc.edu> wrote:
>>>> 
>>>> Huh, it strikes me as strange that tesseract didn't work for you. I
>>>> used tesseract last week to read a page in a pdf document that was
>>>> stored as an image. I used pdftohtml to extract the image and then
>>>> tesseract to convert it to text. I also pretty routinely use
>>>> tesseract to read screen capture images. It's not very accurate there
>>>> but it's usually good enough to make sense of.
>>>> 
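
For a whole scanned PDF, the extract-then-OCR idea described above might be sketched like this. Note the swap: this uses pdftoppm (from poppler-utils) to render each page to an image rather than pdftohtml to extract it, and the function name pdf_ocr and the 300 dpi setting are assumptions, not anything from the message above.

```shell
#!/bin/sh
# pdf_ocr is a made-up name. Usage: pdf_ocr scanned.pdf > scanned.txt
pdf_ocr() {
    pdf=$1
    workdir=$(mktemp -d) || return 1

    # Render every page of the PDF to its own PNG under $workdir.
    # 300 dpi is a guess at a reasonable OCR resolution.
    if ! pdftoppm -r 300 -png "$pdf" "$workdir/page"; then
        rm -rf "$workdir"
        return 1
    fi

    # OCR each page image in glob order and write the text to stdout.
    # tesseract's status messages on stderr are discarded.
    for img in "$workdir"/page-*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" stdout -l eng 2>/dev/null
    done

    rm -rf "$workdir"
}
```

The text comes out in the order the shell sorts the image names, so for long documents the page order depends on pdftoppm zero-padding the page numbers in the file names.
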
>>>> Just "tesseract <infile> <outfile>" should work. The infile can be
>>>> the string "stdin", in which case it reads from standard input. The
>>>> outfile can be "stdout", in which case it writes the text to stdout.
>>>> Offhand, I do not have the command line I use to scan the D&D
>>>> book. It's on a computer at home that is turned off at the moment.
>>>> But I can post the whole thing tonight. Here are some lines from a
>>>> backup version of the script:
>>>> 
>>>> scanimage --format=tiff --mode Lineart --resolution 600 > /tmp/page.tiff
>>>> tesseract /tmp/page.tiff stdout
>>>> 
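
The two script lines quoted above could also be joined into one pipeline using the "stdin" and "stdout" names, so no temporary file is needed. This is only a sketch: scan_page is a made-up name, the defaults are the values from those two lines, and which --mode values are accepted depends on the scanner backend.

```shell
#!/bin/sh
# scan_page is a made-up name. Usage: scan_page [mode] [dpi] > page.txt
scan_page() {
    mode=${1:-Lineart}   # default taken from the quoted script
    dpi=${2:-600}        # default taken from the quoted script

    # Scan one page as TIFF and hand it straight to tesseract on stdin;
    # the recognized text is written to stdout.
    scanimage --format=tiff --mode "$mode" --resolution "$dpi" |
        tesseract stdin stdout -l eng
}
```

For example, "scan_page Gray 300 > page.txt" would try a grayscale scan at 300 dpi, which may be worth comparing if Lineart output comes back as gibberish. When a real file name is given instead of "stdout", tesseract adds .txt to it.
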
>>>> 
>>>> On 11/02/2015 02:53 PM, Cheryl Homiak wrote:
>>>>> Would you mind enlarging on this if you can and have time? What kind
>>>>> of file did you use and what did you put in your command-line? I am
>>>>> asking this because I have tried to use tesseract a couple of times
>>>>> with tiff files and have gotten mostly gibberish so obviously I am
>>>>> doing something wrong. I am running debian testing if that makes a
>>>>> difference.
>>>>> 
>>>>> Thanks.
>>>>> 
>>>> 
>>>> --
>>>> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim,
>>>> sip:jheim at sip.linphone.org
>>>> _______________________________________________
>>>> Speakup mailing list
>>>> Speakup at linux-speakup.org
>>>> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
>>> 
>> 
> 
> -- 
> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, sip:jheim at sip.linphone.org


