How to parse .pdf files in Perl?

Question

How can I parse .pdf files in Perl? Is Perl efficient for this, or should I use another language?

+3

perl pdf

Mandar pande May 12, '11 at 12:28

4 answers

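To extract text from a PDF, use pdftohtml (from Poppler) with the -xml option. It produces an XML file that you can parse with XML::Twig (or any other XML parser, just not XML::Simple).

The XML format is straightforward. There is a <page> element for each page of the PDF, containing <fontspec> elements that describe the fonts and <text> elements that hold the text. A <text> element may contain <b> and <i> tags (which is why XML::Simple cannot handle it).

The top and left attributes of each <text> element give its position on the page, with 0,0 at the upper left corner. Units are PostScript points (72 per inch).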
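A minimal sketch of the XML::Twig step described above. The sample XML is hand-written to imitate what `pdftohtml -xml file.pdf` writes to file.xml (an assumption, not captured from a real run), and the helper name `extract_lines` is hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hand-written sample imitating pdftohtml -xml output (an assumption).
my $sample = <<'XML';
<pdf2xml>
  <page number="1" top="0" left="0" height="792" width="612">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="140" left="72" width="90" height="14" font="0">Second line</text>
    <text top="100" left="72" width="120" height="14" font="0"><b>Hello</b> world</text>
  </page>
</pdf2xml>
XML

# Return one hash per <text> element, ordered by page, then top, then left.
sub extract_lines {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);    # for a file on disk, use $twig->parsefile($path)
    my @lines;
    for my $page ($twig->root->children('page')) {
        for my $text ($page->children('text')) {
            push @lines, {
                page => $page->att('number'),
                top  => $text->att('top'),
                left => $text->att('left'),
                text => $text->text,    # flattens nested <b>/<i> markup
            };
        }
    }
    return sort {
             $a->{page} <=> $b->{page}
          || $a->{top}  <=> $b->{top}
          || $a->{left} <=> $b->{left}
    } @lines;
}

# Print each line with its position, in reading order.
printf "page %d (%d,%d): %s\n", @{$_}{qw(page left top text)}
    for extract_lines($sample);
```

Sorting by top and then left relies on 0,0 being the upper left corner, so smaller top values come first on the page. Multi-column layouts would need more than this simple ordering.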

+9

cjm May 12, '11 at 18:50

CPAN has a few modules that may help, depending on what you want to do. If you need OCR, for example, see PDF::OCR2.

+5

ADW May 12, '11 at 12:33

I don't know of any module that parses them, if by parsing you mean extracting text from them. There are several modules that let you manipulate them, though. Try PDF::API2.

+4

shawnhcorey May 12, '11 at 12:42

weismat · Accepted Answer · 2011-05-12T13:09:07+0000

I personally use CAM::PDF.

    use CAM::PDF;
    my $doc = CAM::PDF->new($fileName) || die "$CAM::PDF::errStr\n";
    # $pdfString holds text already extracted from the document
    CAM::PDF->asciify(\$pdfString);

PDFs are not meant to be parsed but to be displayed and printed, so parsing them is always a matter of trial and error, and it may be impossible if everything in the file is graphics.
A good indicator is whether you can copy and paste the contents of the PDF into an editor. If that works, you are in business.
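A fuller sketch of the CAM::PDF approach, assuming that `getPageText` is the step that produces the string passed to `asciify`. The helper name `page_texts` and the script name in the usage comment are hypothetical. Pages that yield no text are flagged, which is a rough programmatic version of the copy-and-paste check.

```perl
use strict;
use warnings;
use CAM::PDF;

# Return the extracted text of every page, one list entry per page.
sub page_texts {
    my ($file) = @_;
    my $doc = CAM::PDF->new($file) or die "$CAM::PDF::errStr\n";
    my @texts;
    for my $n (1 .. $doc->numPages()) {
        my $text = $doc->getPageText($n);
        $text = '' unless defined $text;    # a page may yield no text at all
        CAM::PDF->asciify(\$text);          # cleans up some non-ASCII characters in place
        push @texts, $text;
    }
    return @texts;
}

# Usage: perl pdftext.pl some.pdf   (hypothetical script name)
if (@ARGV) {
    my $n = 0;
    for my $text (page_texts($ARGV[0])) {
        $n++;
        print "--- page $n ---\n",
            ($text =~ /\S/ ? $text : "[no extractable text]\n");
    }
}
```

If every page comes back empty, the document is probably scanned images, and an OCR route is the more likely way forward.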
