Extract information from millions of simple but inconsistent text files

We have millions of plain .txt documents containing various data structures extracted from PDFs. The text is printed line by line, so all formatting is lost (the tools we tried that are meant to preserve the format just mangled it). We need to extract the fields and their values from these documents, but the structure varies from file to file: an extra newline here and there, and noise on some pages, so some words are misspelled.

My idea is to create a kind of template that records the coordinates (line, word / number of words) of the keywords and their values, and to use it to locate the keywords and collect their values, with different algorithms compensating for the inconsistent formatting.
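A minimal sketch of that template idea, assuming each field is described by a keyword, the line where it is expected, and a regular expression for its value. The `TEMPLATE` contents, the field names, and the `extract_fields` helper are all hypothetical, made up for illustration. The search looks a few lines either side of the expected position and also checks the following lines, so a stray newline does not lose the value.

```python
import re

# Hypothetical template: field -> (keyword, expected line index, value pattern)
TEMPLATE = {
    "invoice_no": ("Invoice No", 2, r"\d+"),
    "total": ("Total", 5, r"\d+\.\d{2}"),
}

def extract_fields(lines, template, window=2):
    """Find each keyword within `window` lines of its expected position and
    take the value from the rest of that line or from the lines after it."""
    result = {}
    for field, (keyword, expected, pattern) in template.items():
        lo = max(0, expected - window)
        hi = min(len(lines), expected + window + 1)
        for i in range(lo, hi):
            pos = lines[i].lower().find(keyword.lower())
            if pos == -1:
                continue
            # Candidates: text after the keyword, then the next few lines
            rest = lines[i][pos + len(keyword):]
            candidates = [rest] + lines[i + 1:i + 1 + window]
            for text in candidates:
                match = re.search(pattern, text)
                if match:
                    result[field] = match.group(0)
                    break
            if field in result:
                break
    return result

sample = ["ACME Corp", "", "", "Invoice No: 12345", "Date 2020", "", "Total", "99.50"]
fields = extract_fields(sample, TEMPLATE)
```

Here the invoice number sits one line lower than the template expects and the total has been pushed onto its own line; both are still found. Exact keyword matching like this does not cope with misspelled keywords.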

Is there a standard way to do this? Any links that might help, or any other ideas?

+3
5 answers

Noise can be corrected or ignored with fuzzy text search tools such as agrep: http://www.tgries.de/agrep/. However, the problem of the extra newlines will remain.
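As a rough illustration of fuzzy matching that also tolerates a value pushed onto a later line, here is a sketch using Python's standard `difflib` module as a stand-in for agrep (it is not agrep and uses a different similarity measure). The helper names `fuzzy_find` and `extract_value`, the 0.8 cutoff, and the sample lines are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def fuzzy_find(keyword, line, cutoff=0.8):
    """Slide a keyword-sized window over the line and return the end offset
    of the most similar window, or None if nothing reaches the cutoff."""
    key = keyword.lower()
    text = line.lower()
    best_ratio, best_end = 0.0, None
    for start in range(max(1, len(text) - len(key) + 1)):
        window = text[start:start + len(key)]
        ratio = SequenceMatcher(None, key, window).ratio()
        if ratio > best_ratio:
            best_ratio, best_end = ratio, start + len(window)
    return best_end if best_ratio >= cutoff else None

def extract_value(keyword, lines, cutoff=0.8):
    """Return the text after an approximately matched keyword; if the line
    ends there, fall back to the next non-empty line."""
    for i, line in enumerate(lines):
        end = fuzzy_find(keyword, line, cutoff)
        if end is None:
            continue
        value = line[end:].strip(" :\t")
        if value:
            return value
        for following in lines[i + 1:]:
            if following.strip():
                return following.strip()
    return None

noisy = ["Custmer Name:", "", "John Smith", "Invoce Number: 4711"]
name = extract_value("Customer Name", noisy)
number = extract_value("Invoice Number", noisy)
```

The cutoff trades false matches against missed keywords and would need tuning on real files. Short keywords are especially prone to accidental matches.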


+1

Perl [...]

Sed, Perl [...]

+1

Unix, Perl, [...] Google Refine.

+1

recoomnd accpetion. , ..

+1

Talend [...] Java [...]

I have used it and found it very useful for low-budget, highly complex data integration projects. See their website: Talend

Good luck.

+1