command line scanned pdf to text

John G Heim jheim at math.wisc.edu
Mon Nov 2 15:13:04 EST 2015


I've been scanning in the D&D 5th Edition Player's Handbook. I tried
every open source OCR program I could find, and tesseract was easily
the best. On pages that are just prose, it is probably about 99%
accurate. Even on pages with 2 columns of prose, it does really well
if you tell it to look for columns. Somebody sent me a pdf of the same
book done with a professional OCR program for Windows, and the results
are approximately equal. Tesseract may lack the bells & whistles of
commercial products, but for accuracy it's pretty good.
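
For anyone who wants to script the pdftocairo-then-tesseract route, here
is a rough sketch in Python. It is not necessarily the exact procedure
used for the handbook. The function names, the 300 dpi resolution, the
"page" file prefix, and the default --psm value are all assumptions
chosen for illustration. The --psm option selects tesseract's page
segmentation mode; older 3.x releases spell it -psm, so adjust it to
match your installed version.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf, prefix, dpi=300):
    """Build the pdftocairo argv: one PNG per page, named <prefix>-<page>.png.

    The 300 dpi default is an assumption, not a value from the thread.
    """
    return ["pdftocairo", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def ocr_command(image, psm=3):
    """Build the tesseract argv; output goes to <image without suffix>.txt.

    psm is the page segmentation mode. The default of 3 (automatic layout
    analysis) is an assumption; try other modes on multi-column pages.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")),
            "--psm", str(psm)]


def page_number(image):
    """Extract the page number pdftocairo appended to the file name."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf, workdir, psm=3):
    """Rasterize every page of pdf into workdir, OCR each page, join the text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf, workdir / "page"), check=True)

    chunks = []
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        subprocess.run(ocr_command(image, psm), check=True)
        chunks.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(chunks)


# Example (needs pdftocairo and tesseract installed; file names are made up):
# print(pdf_to_text("handbook.pdf", "pages", psm=1))
```

The two command builders are kept separate from the code that runs them
so you can print the commands and try them by hand on a single page
before processing a whole book.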



On 11/01/2015 11:24 PM, Tom Fowle wrote:
> Am I the last to find this?
>   command line ocr tesseract
> won't directly support .pdf but
> pdftocairo
> produces .jpg among others which tesseract will read.
>
> May not do well with columns but not too bad.
>
> Is there anything better?
>
> Thanks
> tom Fowle

-- 
John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, 
sip:jheim at sip.linphone.org

