Script to search for text in a PDF

Problem

On a Mac OS X platform, I would like to write a script, either in Python or in Tcl, to search for text in a PDF file and extract the relevant parts. I appreciate any help.

Background

I am writing scripts that look at a PDF to determine whether it is a bill, which company it is from, and what period it covers. Based on this information, I will rename the PDF and move it to the appropriate directory. For example, a file such as Statement_03948293929384.pdf can become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
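The rename-and-move step described above could be sketched in Python roughly as follows, assuming the PDF's text has already been extracted by some other means. The keyword rules, the MM/DD/YYYY date format, and the names plan_rename and file_pdf are all hypothetical and only illustrate the idea.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical rules: (keyword to look for, label for the file name, destination folder)
RULES = [
    ("water", "Water Bill", "Utilities"),
    ("electric", "Electric Bill", "Utilities"),
]

# Assumption: statements print their date as MM/DD/YYYY
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def plan_rename(text):
    """Return (folder, new_name) for the extracted text, or None if unrecognized."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, year = (int(group) for group in match.groups())
    try:
        date = datetime(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a real date
    lowered = text.lower()
    for keyword, label, folder in RULES:
        if keyword in lowered:
            return folder, f"{date:%Y-%m-%d} {label}.pdf"
    return None


def file_pdf(pdf_path, text, root):
    """Move pdf_path under root/<folder>/<new name>; return the new path or None."""
    plan = plan_rename(text)
    if plan is None:
        return None
    folder, new_name = plan
    target_dir = Path(root) / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    shutil.move(str(pdf_path), str(target))
    return target
```

Keeping the decision (plan_rename) separate from the file move (file_pdf) makes it possible to check the rules against sample text before any file is touched.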

What have I done so far?

  • I searched for PDF-to-plain-text tools but didn't find anything
  • I looked at the Tcl wiki and found an example, but could not get it to work (I searched for text that is in the PDF, but it was not found).
  • I looked at pdf-parser.py from Didier Stevens
  • I heard about a Python package called pyPdf and will look at it further.

Update

I found a command-line tool called pdftotext, written by Glyph and Cog, LLC, and built and packaged by Carsten Bluem. This tool is straightforward and it solves my problem. I am still looking for tools that can search a PDF directly, without having to convert it to a text file first.
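One way to use pdftotext from a script without leaving a text file on disk is to have it write to standard output and search the result in memory. The sketch below assumes pdftotext is on the PATH and accepts "-" as the output file name and the -enc option, as the xpdf and Poppler builds do. The helper names pdf_to_text and find_lines are made up for illustration.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    # "-" as the output file asks pdftotext to write to stdout
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_lines(text, pattern):
    """Return the stripped lines of text that match pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line.strip() for line in text.splitlines() if regex.search(line)]
```

A call such as find_lines(pdf_to_text("Statement_03948293929384.pdf"), r"amount due") would then return the matching lines, if any. This still converts the PDF to text internally, so it avoids the intermediate file rather than the conversion itself.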

1 answer

PyODConverter / PDF ( Java). , PDF , . , iText , .
