Extract information from millions of simple but inconsistent text files

Question

We have millions of simple text files containing various data structures extracted from PDFs. The text is printed line by line, so all formatting is lost (when we tried tools that are supposed to preserve the formatting, they just messed it up). We need to extract fields and values from these text files, but their structure varies: an extra line break here and there, and noise on some pages that leaves words misspelled.

My idea was to create some kind of template structure that records the coordinates (line, word position / number of words) of the keywords and values, and then use this information to find and collect the keyword values, applying various algorithms to compensate for the inconsistent formatting.

Is there a standard way of doing this? Any links that might help? Any other ideas?
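
To make the template idea above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the field names, the keywords, the expected line numbers, the search window and the similarity threshold are hypothetical, and `difflib` from the standard library stands in for whatever fuzzy-matching algorithm would actually be chosen. It looks for each keyword near its expected line, tolerates misspellings, and falls back to the next non-empty line when a stray line break has pushed the value down.

```python
import difflib

# Hypothetical template: field name -> (keyword as printed, expected line index).
TEMPLATE = {
    "invoice_no": ("Invoice Number", 0),
    "total": ("Total Amount", 3),
}


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def next_nonempty(lines, start):
    """Return the first non-blank line at or after index start, or ''."""
    for line in lines[start:]:
        if line.strip():
            return line.strip()
    return ""


def extract_fields(lines, template, window=2, threshold=0.8):
    """Find each keyword within `window` lines of its expected position."""
    result = {}
    for field, (keyword, expected) in template.items():
        n_words = len(keyword.split())
        best_score, best_value = 0.0, None
        lo = max(0, expected - window)
        hi = min(len(lines), expected + window + 1)
        for i in range(lo, hi):
            words = lines[i].split()
            # Compare the first n_words of the line against the keyword.
            candidate = " ".join(words[:n_words]).rstrip(":")
            score = similarity(candidate, keyword)
            if score > best_score:
                # Value is the rest of the line, or the next non-empty line
                # if an extra line break separated keyword and value.
                value = " ".join(words[n_words:]) or next_nonempty(lines, i + 1)
                best_score, best_value = score, value
        if best_score >= threshold:
            result[field] = best_value
    return result
```

For example, calling `extract_fields(text.splitlines(), TEMPLATE)` on a file whose keywords are misspelled ("Invoce Number", "Totl Amount") or shifted by a line or two should still pick them up, while fields whose best match falls below the threshold are simply left out so they can be reviewed separately.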

+3

information-extraction data-mining data-modeling

zode64 May 06 '11 at 20:53

5 answers

Perl - . , .

Sed , Perl - .

+1

Nicholas Carey 06 '11 21:11

Unix Perl, , , Google Refine. .

+1

Yuval F 07 '11 4:58

recoomnd accpetion. , ..

+1

yura 07 '11 10:53

Talend. (.. !). Java, , Java.

I have used it and found it very useful for low-budget, very complex data integration projects. Here is a link to their website: Talend

Good luck.

+1

mevdiven May 09 '11 at 12:53

Philip Derbeko · Accepted Answer · 2011-05-06T21:11:11+0000

Noise can be corrected or ignored using fuzzy text search tools such as agrep: http://www.tgries.de/agrep/. However, the problem of the extra line breaks will remain.
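
As a rough illustration of what a fuzzy search allowing a limited number of errors does, here is a small Python sketch of approximate substring matching based on edit distance. It is not agrep and does not use agrep's algorithm; the function name `fuzzy_find` and its parameters are made up for this example.

```python
def fuzzy_find(pattern, text, max_errors=1):
    """Find the best approximate occurrence of pattern inside text.

    Returns (end_index, errors), where text[:end_index] ends with the
    match, or None if no substring is within max_errors edits
    (insertions, deletions, substitutions) of pattern.
    """
    m = len(pattern)
    # prev[i] = cost of matching pattern[:i] against an empty substring.
    prev = list(range(m + 1))
    best = None
    for j, ch in enumerate(text, start=1):
        # A match may start anywhere in text, so the empty pattern costs 0.
        cur = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitute
                           prev[i] + 1,         # extra character in text
                           cur[i - 1] + 1))     # missing character in text
        # Keep the first position with the fewest errors.
        if cur[m] <= max_errors and (best is None or cur[m] < best[1]):
            best = (j, cur[m])
        prev = cur
    return best
```

With `max_errors=1`, searching for `"invoice"` in a noisy line such as `"the invoce total"` finds the misspelled word with one error, while a line that contains nothing similar returns `None`. The cost is proportional to the pattern length times the text length, so for millions of files a dedicated tool is likely the better choice.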

, , - . , , . , . , , , . , .
