command line scanned pdf to text

Cheryl Homiak cah4110 at icloud.com
Tue Nov 3 02:25:31 EST 2015


I've tried with both cuneiform and tesseract with the same results. I wonder if it's a rotation problem.
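
One way to test the rotation guess, as a rough sketch: tesseract has an orientation and script detection mode (page segmentation mode 0) that reports how the page is turned without doing full OCR. The helper names rotation_from_osd and check_rotation are made up for illustration. The sketch assumes a tesseract release that accepts --psm (older 3.x builds spell it -psm), has the osd traineddata installed, and prints a line of the form "Rotate: 90".

```shell
# Hypothetical helpers; only the tesseract call itself is a real command.

# Read orientation-detection text on stdin and print the number that
# follows "Rotate:". Prints nothing if no such line is present.
rotation_from_osd() {
    sed -n 's/^Rotate:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Run orientation and script detection only (--psm 0) on one image and
# print the rotation it reports. Assumes the report goes to stdout.
check_rotation() {
    tesseract "$1" stdout --psm 0 2>/dev/null | rotation_from_osd
}
```

Running check_rotation on a scanned page image should print a number of degrees. Anything other than 0 would suggest the scan needs rotating before OCR.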

-- 
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)





> On Nov 2, 2015, at 10:15 PM, Tom Fowle <wa6ivgtf at fastmail.fm> wrote:
> 
> Cheryl,
> I arbitrarily chose to convert the pdf to jpeg as tesseract doesn't do
> pdf.
> 
> Then I just did
> tesseract filename.jpg  outfile
> produces
> outfile.txt
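
The two steps described here (render the pdf to images, then run tesseract on each image) can be sketched as one helper. This is only a sketch under assumptions: pdf_to_text is a made-up name, and PNG at 300 dpi is my choice rather than the jpeg used above, picked because low-resolution or heavily compressed images are a common cause of gibberish output.

```shell
# Hypothetical helper: OCR every page of a scanned PDF.
# Requires pdftocairo (poppler) and tesseract on the PATH.
pdf_to_text() {
    pdf="$1"
    base="${pdf%.pdf}"

    # Render each page to a 300 dpi PNG named base-<page>.png
    # (the page number may be zero-padded for longer documents).
    pdftocairo -png -r 300 "$pdf" "$base" || return 1

    # OCR each rendered page; tesseract appends .txt to the output base,
    # so base-1.png produces base-1.txt.
    for img in "$base"-*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.png}"
    done
}
```

Calling pdf_to_text scan.pdf should leave one .txt file per page next to the rendered images.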
> 
> Sorry, I haven't tried .tif and I couldn't find a list of supported file types.
> 
> tom fowle
> 
> On Mon, Nov 02, 2015 at 02:53:45PM -0600, Cheryl Homiak wrote:
>> Would you mind enlarging on this if you can and have time? What kind of file did you use and what did you put in your command-line? I am asking this because I have tried to use tesseract a couple of times with tiff files and have gotten mostly gibberish so obviously I am doing something wrong. I am running debian testing if that makes a difference.
>> 
>> Thanks.
>> 
>> -- 
>> Cheryl
>> 
>>> On Nov 2, 2015, at 2:13 PM, John G Heim <jheim at math.wisc.edu> wrote:
>>> 
>>> 
>>> I've been scanning in the D&D 5th Edition Player's Handbook. I tried every open source OCR program I could find and tesseract was easily the best. On pages that are just prose, it probably gets about 99% accuracy. Even on pages that are 2 columns of prose, it does really well if you tell it to look for that. Somebody sent me a pdf of the same book done with a professional OCR program for Windows. The results are approximately equal. Tesseract may lack the bells & whistles of commercial products but for accuracy, it's pretty good.
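
The message does not say which option "tell it to look for that" refers to. My guess is tesseract's page segmentation mode, so here is a sketch along those lines. The helper name ocr_page and the layout keywords are made up for illustration, and the flag is spelled -psm on older 3.x releases.

```shell
# Hypothetical helper: run tesseract with a page segmentation mode
# chosen by a layout keyword (auto, column, or block).
ocr_page() {
    image="$1"
    layout="${2:-auto}"

    case "$layout" in
        auto)   psm=3 ;;  # fully automatic page segmentation (the default)
        column) psm=4 ;;  # assume a single column of variable-size text
        block)  psm=6 ;;  # assume a single uniform block of text
        *) echo "unknown layout: $layout" >&2; return 2 ;;
    esac

    # Output base is the image name without its extension;
    # tesseract adds .txt itself.
    tesseract "$image" "${image%.*}" --psm "$psm"
}
```

For a page that comes out scrambled in the default mode, trying ocr_page page.png column or ocr_page page.png block on a cropped column would be a reasonable experiment.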
>>> 
>>> 
>>> 
>>> On 11/01/2015 11:24 PM, Tom Fowle wrote:
>>>> Am I the last to find this?
>>>> command line ocr tesseract
>>>> won't directly support .pdf but
>>>> pdftocairo
>>>> produces .jpg among others which tesseract will read.
>>>> 
>>>> May not do well with columns but not too bad.
>>>> 
>>>> Is there anything better?
>>>> 
>>>> Thanks
>>>> tom Fowle
>>>> _______________________________________________
>>>> Speakup mailing list
>>>> Speakup at linux-speakup.org
>>>> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
>>>> 
>>> 
>>> -- 
>>> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, sip:jheim at sip.linphone.org


