command line scanned pdf to text

Jude DaShiell jdashiel at panix.com
Wed Nov 4 10:01:11 EST 2015


Which data pack for tesseract has the English language in it?  I'm being 
prompted to download a data pack, and I figure it's best to get only the 
language I understand rather than the whole data set, since both memory 
and disk space over here are limited.
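
For reference, a minimal sketch of how the English-only setup could look. It assumes a Debian-style system, where the English data is packaged as tesseract-ocr-eng and installs the file eng.traineddata; other distributions may name the package differently. The helper name ocr_eng is made up for illustration and is not part of tesseract.

```shell
#!/bin/sh
# Sketch only. Assumption: Debian packaging, where English data is the
# package tesseract-ocr-eng (it provides eng.traineddata). As root:
#   apt-get install tesseract-ocr tesseract-ocr-eng

# ocr_eng is a hypothetical helper name, not part of tesseract.
# Usage: ocr_eng IMAGE   (prints the recognized text on standard output)
ocr_eng() {
    # Stop early if the tesseract binary is not on the PATH.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed" >&2
        return 127
    fi
    # --list-langs prints the language codes tesseract can find;
    # some versions write the list to stderr, hence 2>&1.
    if ! tesseract --list-langs 2>&1 | grep -qx 'eng'; then
        echo "English data (eng) not found" >&2
        return 1
    fi
    # -l eng selects English; "stdout" as the output name prints the text.
    tesseract "$1" stdout -l eng
}
```

Something like "ocr_eng /tmp/page.tiff > page.txt" would then write the text to a file. Checking the language list first gives a plain error message when the data pack is missing.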

On Mon, 2 Nov 2015, Cheryl Homiak wrote:

> Date: Mon, 2 Nov 2015 17:39:38
> From: Cheryl Homiak <cah4110 at icloud.com>
> Reply-To: Speakup is a screen review system for Linux.
>     <speakup at linux-speakup.org>
> To: Speakup is a screen review system for Linux. <speakup at linux-speakup.org>
> Subject: Re: command line scanned pdf to text
> 
> Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
> I suspect the error was mine so I won't give up on it yet.
>
> Thanks.
>
> -- 
> Cheryl
>
> May the words of my mouth
> and the meditation of my heart
> be acceptable to You, Lord,
> my rock and my Redeemer.
> (Psalm 19:14 HCSB)
>
>> On Nov 2, 2015, at 4:06 PM, John G Heim <jheim at math.wisc.edu> wrote:
>> 
>> Huh, it strikes me as strange that tesseract didn't work for you. I used tesseract last week to read a page in a pdf document that was stored as an image. I used pdftohtml to extract the image and then tesseract to convert it to text. I also pretty routinely use tesseract to read screen capture images. It's not very accurate there but it's usually good enough to make sense of.
>> 
>> Just "tesseract <infile> <outfile>" should work. The infile can be the string "stdin", in which case it reads from standard input. The outfile can be "stdout", in which case it writes the text to stdout. Offhand, I do not have the command line I use to scan the D&D book. It's on a computer at home that is turned off at the moment.  But I can post the whole thing tonight. Here are some lines from a backup version of the script:
>> 
>> scanimage --format=tiff --mode Lineart --resolution 600 > /tmp/page.tiff
>> tesseract /tmp/page.tiff stdout
>> 
>> 
>> On 11/02/2015 02:53 PM, Cheryl Homiak wrote:
>>> Would you mind enlarging on this if you can and have time? What kind of file did you use and what did you put in your command-line? I am asking this because I have tried to use tesseract a couple of times with tiff files and have gotten mostly gibberish so obviously I am doing something wrong. I am running debian testing if that makes a difference.
>>> 
>>> Thanks.
>>> 
>> 
>> -- 
>> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, sip:jheim at sip.linphone.org
>> _______________________________________________
>> Speakup mailing list
>> Speakup at linux-speakup.org
>> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
>
