Problem
On a Mac OS X platform, I would like to write a script, either in Python or in Tcl, to search for text in a PDF file and extract the relevant parts. I appreciate any help.
Background
I write scripts that look at a PDF to determine whether it is a bill, from which company, and for what period. Based on this information, I rename the PDF and move it to the appropriate directory. For example, a file such as Statement_03948293929384.pdf can become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
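To make the workflow concrete, here is a minimal Python sketch of the rename-and-move step. It assumes the text of the PDF is already available as a string. The keyword table `RULES`, the MM/DD/YYYY date format, and the function names `plan_rename` and `file_pdf` are all hypothetical, chosen only for illustration.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical rules: keyword found in the text -> (label, destination folder).
RULES = {
    "water": ("Water Bill", "Utilities"),
    "electric": ("Electric Bill", "Utilities"),
}

# Assumption: the statement date appears as MM/DD/YYYY somewhere in the text.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def plan_rename(text):
    """Return (new_filename, folder), or None if the text cannot be classified."""
    lowered = text.lower()
    for keyword, (label, folder) in RULES.items():
        if keyword not in lowered:
            continue
        match = DATE_RE.search(text)
        if match is None:
            return None
        month, day, year = (int(group) for group in match.groups())
        try:
            date = datetime(year, month, day).strftime("%Y-%m-%d")
        except ValueError:
            # The digits matched the pattern but are not a real calendar date.
            return None
        return f"{date} {label}.pdf", folder
    return None


def file_pdf(pdf_path, text, root):
    """Rename the PDF and move it under root/<folder>; return the new path or None."""
    plan = plan_rename(text)
    if plan is None:
        return None
    name, folder = plan
    dest_dir = Path(root) / folder
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / name
    shutil.move(str(pdf_path), str(dest))
    return dest
```

With these rules, text containing "Water" and the date 07/15/2012 would yield the name 2012-07-15 Water Bill.pdf in the Utilities folder. Real statements would likely need per-company patterns rather than a single keyword and date format.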
What have I done so far?
- I searched for PDF-to-plain-text tools but didn't find anything
- I looked at the Tcl wiki and found an example, but could not get it to work (I searched for text in the PDF, but it could not find it).
- I looked at pdf-parser.py from Didier Stevens
- I heard about a Python package called pyPdf and will look at it further.
Update
I found a command-line tool called pdftotext, written by Glyph and Cog, LLC and built and packaged by Carsten Bluem. This tool is straightforward and it solves my problem. I am still looking for tools that can search a PDF directly, without having to convert it to a text file first.
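For anyone wanting to drive pdftotext from Python, here is a minimal sketch. It assumes pdftotext is installed and on the PATH; passing `-` as the output file sends the text to standard output, so no intermediate text file is written. The function names `pdf_to_text` and `find_lines` are my own, not part of any library.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    result = subprocess.run(
        # -layout keeps the page layout; -enc UTF-8 fixes the output encoding;
        # the trailing "-" tells pdftotext to write to stdout.
        ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_lines(text, pattern):
    """Return the stripped lines of text that match pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line.strip() for line in text.splitlines() if regex.search(line)]


# Example usage (requires pdftotext and a real PDF file):
#   text = pdf_to_text("Statement_03948293929384.pdf")
#   for line in find_lines(text, r"water"):
#       print(line)
```

This still converts the PDF to text, but only in memory, which may be close enough to searching the PDF directly for a filing script.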