I found a public-domain Latin ↔ Portuguese dictionary in PDF format, which I would like to convert to plain text, parse, and use as the database for a program. However, after some tests, I'm a little skeptical. Take a look at the original file and the resulting gocr text. Is there any hope of achieving 99%+ accuracy with some method? I was thinking about using the reCAPTCHA database, but I think that is Google's property, right?
Thanks!