We have millions of plain-text documents containing various data structures extracted from PDFs. The text is dumped line by line, so all formatting is lost (when we tried tools that are supposed to preserve the layout, they just garbled it). We need to extract the fields and their values from these text files, but the structure varies from file to file: an extra line break here and there, and noise on some pages, so some words are misspelled.
My idea is to create a template for each document type that records the expected coordinates of each keyword and its value (line number, word position, number of words), and then use that information to locate and collect the values, applying fuzzy-matching algorithms to compensate for the inconsistent formatting.
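To make the idea concrete, here is a rough sketch of what I have in mind, not a finished design. Everything in it is an assumption for illustration: the `TEMPLATE` layout, the field names, the `extract` and `similar` helpers, and the `window` and `threshold` parameters are all hypothetical, and Python's standard `difflib` is only a stand-in for whatever fuzzy matcher would actually be used.

```python
from difflib import SequenceMatcher

# Hypothetical template: field -> (keyword, expected line index, number of value words)
TEMPLATE = {
    "invoice_no": ("Invoice", 0, 1),
    "total": ("Total", 2, 1),
}

def similar(a, b):
    """Similarity in [0, 1] between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def extract(lines, template, window=2, threshold=0.8):
    """Look for each keyword near its expected line and take the words after it."""
    result = {}
    for field, (keyword, expected_line, n_words) in template.items():
        best = None  # (score, line index, word index) of the best keyword match
        lo = max(0, expected_line - window)
        hi = min(len(lines), expected_line + window + 1)
        for i in range(lo, hi):
            for j, word in enumerate(lines[i].split()):
                score = similar(word.strip(":"), keyword)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            result[field] = None  # keyword not found within the window
            continue
        _, i, j = best
        value_words = lines[i].split()[j + 1 : j + 1 + n_words]
        # Assumption: if nothing follows the keyword, the value wrapped to the next line.
        if not value_words and i + 1 < len(lines):
            value_words = lines[i + 1].split()[:n_words]
        result[field] = " ".join(value_words) or None
    return result

# Noisy input: a stray blank line, misspelled keywords, and a wrapped value.
sample = ["", "Invoce: 12345", "Customer: ACME", "Totl:", "99.50"]
print(extract(sample, TEMPLATE))
```

The search window absorbs extra or missing line breaks, and the similarity threshold absorbs misspelled keywords; both values would need tuning on real documents.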
Is there a standard way to do this? Are there any links that might help, or any other ideas?