command line scanned pdf to text
Willem van der Walt
wvdwalt at csir.co.za
Tue Nov 3 00:14:13 EST 2015
Cuneiform is, IMHO, a better OCR engine than tesseract.
It is available as a package under Ubuntu.
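
A minimal sketch of a cuneiform run, assuming the packaged build can read the image format in question (some builds only accept BMP, so that is worth checking locally). The helper name ocr_cuneiform and the DRY_RUN switch are made up for illustration; -l, -f and -o are cuneiform's language, output-format and output-file options.

```shell
# Sketch: run cuneiform on one image and write plain text.
# ocr_cuneiform and DRY_RUN are hypothetical names used only here.
ocr_cuneiform() {
    img=$1
    # Default output name: same as the image, with a .txt extension.
    out=${2:-${img%.*}.txt}
    set -- cuneiform -l eng -f text -o "$out" "$img"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show the command that would be run.
        echo "$*"
    else
        "$@"
    fi
}
```

For example, "ocr_cuneiform page.tif" would write page.txt; with DRY_RUN=1 set first, it only prints the command.
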
Regards, Willem
On Mon, 2 Nov 2015, Cheryl Homiak wrote:
> I am sure TIFF is supported. It is really strange: I get what look like words, and the output is identical every time I scan the same image, but the words are nonsense. I even tried adding the language designation for English, thinking it somehow wasn't using English, but got the same results. I know the image file is okay because it comes out fine using ABBYY FineReader Express on my Mac.
>
> --
> Cheryl
>
> May the words of my mouth
> and the meditation of my heart
> be acceptable to You, Lord,
> my rock and my Redeemer.
> (Psalm 19:14 HCSB)
>
>> On Nov 2, 2015, at 10:15 PM, Tom Fowle <wa6ivgtf at fastmail.fm> wrote:
>>
>> Cheryl,
>> I arbitrarily chose to convert the pdf to jpeg, as tesseract doesn't read
>> pdf.
>>
>> Then I just did
>> tesseract filename.jpg outfile
>> produces
>> outfile.txt
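
The two steps quoted above (PDF to JPEG, then tesseract) can be sketched as one shell function. This is a sketch under assumptions, not Tom's exact commands: the names ocr_pdf and run, the 300 dpi resolution, the "-l eng" language choice and the DRY_RUN switch are illustrative. pdftocairo's -jpeg and -r options and tesseract's -l option are real, but the exact page file names pdftocairo produces should be checked locally.

```shell
# Sketch: OCR every page of a scanned PDF with pdftocairo + tesseract.
# ocr_pdf, run and DRY_RUN are hypothetical names used only for this example.
run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}
    # Render each page as a JPEG at 300 dpi: base-1.jpg, base-2.jpg, ...
    run pdftocairo -jpeg -r 300 "$pdf" "$base" || return 1
    for img in "$base"-*.jpg; do
        # tesseract appends .txt to the output base name itself.
        run tesseract "$img" "${img%.jpg}" -l eng
    done
}
```

For example, "ocr_pdf scan.pdf" should leave one .txt file per page next to the page images; setting DRY_RUN=1 first only prints the commands.
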
>>
>> Sorry, I haven't tried .tif, and I couldn't find a list of supported file types.
>>
>> tom fowle
>>
>> On Mon, Nov 02, 2015 at 02:53:45PM -0600, Cheryl Homiak wrote:
>>> Would you mind enlarging on this if you can and have time? What kind of file did you use and what did you put in your command-line? I am asking this because I have tried to use tesseract a couple of times with tiff files and have gotten mostly gibberish so obviously I am doing something wrong. I am running debian testing if that makes a difference.
>>>
>>> Thanks.
>>>
>>>> On Nov 2, 2015, at 2:13 PM, John G Heim <jheim at math.wisc.edu> wrote:
>>>>
>>>>
>>>> I've been scanning in the D&D 5th Edition Player's Handbook. I tried every open source OCR program I could find and tesseract was easily the best. On pages that are just prose, it probably reaches about 99% accuracy. Even on pages that are 2 columns of prose, it does really well if you tell it to look for that. Somebody sent me a pdf of the same book done with a professional OCR program for Windows. The results are approximately equal. Tesseract may lack the bells & whistles of commercial products, but for accuracy, it's pretty good.
>>>>
>>>>
>>>>
>>>> On 11/01/2015 11:24 PM, Tom Fowle wrote:
>>>>> Am I the last to find this?
>>>>> The command-line OCR program tesseract
>>>>> won't directly support .pdf, but
>>>>> pdftocairo
>>>>> produces .jpg, among other formats, which tesseract will read.
>>>>>
>>>>> It may not do well with columns, but it's not too bad.
>>>>>
>>>>> Is there anything better?
>>>>>
>>>>> Thanks
>>>>> tom Fowle
>>>>> _______________________________________________
>>>>> Speakup mailing list
>>>>> Speakup at linux-speakup.org
>>>>> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
>>>>>
>>>>
>>>> --
>>>> John Heim, jheim at math.wisc.edu, 608-263-4189, skype:john.g.heim, sip:jheim at sip.linphone.org