Perl PDF line-by-line parser?

Question

I have a PDF that consists only of text, with no special characters, images, etc. Is there a Perl module (I looked on CPAN to no avail) that can help me parse each page line by line? (Converting the PDF to text gives poor results and unusable data.)

Thanks,

+2

perl pdf pdf-parsing

snoofkin Feb 16 '11 at 20:27

1 answer

cjm · Accepted Answer · 2011-02-16T22:39:40+0000

When I want to extract text from a PDF, I feed it to pdftohtml (part of Poppler) with the -xml option. This produces an XML file that I parse with XML::Twig (or any other XML parser you like, except XML::Simple).

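The XML format is fairly simple. There is a <page> element for each page of the PDF, containing <fontspec> elements that describe the fonts used and a <text> element for each line of text. The <text> elements may contain <b> and <i> tags for bold and italic text (which is why XML::Simple cannot parse it properly).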
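As a rough illustration of reading that kind of file, here is a minimal sketch using XML::Twig. The sample XML is hand-written to resemble the structure described in this answer; the <pdf2xml> root element, the attribute values, and the hash field names are assumptions made for the example, not captured pdftohtml output.

```perl
use strict;
use warnings;
use XML::Twig;

# Hand-written sample (an assumption), shaped like the format described:
# one <page> per page, <fontspec> for fonts, one <text> per line of text.
my $xml = <<'XML';
<pdf2xml>
  <page number="1" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <text top="120" left="54" width="200" height="14" font="0">Second line</text>
    <text top="100" left="54" width="200" height="14" font="0">First <b>bold</b> line</text>
  </page>
</pdf2xml>
XML

my $twig = XML::Twig->new;
$twig->parse($xml);

# Collect one record per <text> element, keeping its position on the page.
my @lines;
for my $page ($twig->root->children('page')) {
    for my $text ($page->children('text')) {
        push @lines, {
            page => $page->att('number'),
            top  => $text->att('top'),
            left => $text->att('left'),
            # ->text concatenates the text of nested <b>/<i> elements too
            text => $text->text,
        };
    }
}

printf "page %s top %s left %s: %s\n", @{$_}{qw(page top left text)} for @lines;
```

Because `->text` flattens the inline <b> and <i> markup, the mixed content that trips up XML::Simple is not a problem here. For a real document you would call `parsefile` on the XML file written by pdftohtml instead of parsing a string.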

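You need to use the top and left attributes of the <text> elements to put them in the right order, because they are not necessarily emitted from top to bottom. The coordinate system has 0,0 at the top-left corner of the page, with down and right being positive. Dimensions are in PostScript points (72 per inch).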
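The ordering step can be sketched in plain Perl. The records below are hypothetical stand-ins for data gathered from the <text> elements, and treating fragments with an identical top value as one line is a simplifying assumption; real files may need a small tolerance.

```perl
use strict;
use warnings;

# Hypothetical records, deliberately out of order.
my @fragments = (
    { page => 1, top => 120, left => 54,  text => 'second line' },
    { page => 1, top => 100, left => 300, text => 'right half' },
    { page => 1, top => 100, left => 54,  text => 'left half' },
    { page => 2, top => 90,  left => 54,  text => 'next page' },
);

# Reading order: by page, then top to bottom, then left to right.
my @sorted = sort {
       $a->{page} <=> $b->{page}
    || $a->{top}  <=> $b->{top}
    || $a->{left} <=> $b->{left}
} @fragments;

# Join fragments that share a page and a top value into one output line
# (assumption: exact equality of "top" is good enough for this sketch).
my @lines;
my $prev_key = '';
for my $f (@sorted) {
    my $key = "$f->{page}:$f->{top}";
    if ($key eq $prev_key) {
        $lines[-1] .= ' ' . $f->{text};
    }
    else {
        push @lines, $f->{text};
        $prev_key = $key;
    }
}

# Convert a dimension in PostScript points to inches (72 points per inch).
sub points_to_inches { return $_[0] / 72 }

print "$_\n" for @lines;
```

Sorting on the page number first keeps lines from different pages apart even when their top values coincide.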
