How to determine programmatically if a PDF is searchable?

I have a CSV with a list of URLs with PDF files:

  • Some of these PDF files are searchable.
  • Some of these PDF files are not searchable.

I want to determine which PDF files are searchable from my list of PDF files. Is there an easy way to do this?

1 answer

On the command line, I use pdffonts to determine which fonts a file uses. It works pretty fast...

Example 1: PDF containing text

pdffonts bash-manpage.pdf 

  name                            type          encoding        emb sub uni object ID
  ------------------------------- ------------- --------------- --- --- --- ---------
  Times-Roman                     Type 1        Custom          no  no  no       8  0
  Times-Bold                      Type 1        Standard        no  no  no       9  0
  Helvetica                       Type 1        Custom          no  no  no      11  0
  Helvetica-Bold                  Type 1        Standard        no  no  no      30  0

Example 2: PDF containing only images (scanned pages)

pdffonts scanned-book.pdf

  name                            type           encoding       emb sub uni object ID
  ------------------------------- -------------- -------------- --- --- --- ---------

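  • In case 1, fonts are listed, so the PDF IS searchable.

  • In case 2, no fonts are listed, so the PDF is not searchable as it stands (it would need OCR first).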
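
The font check can be scripted. The following is a minimal Python sketch, not a definitive implementation: it assumes pdffonts is installed and on the PATH, that each PDF has already been downloaded to a local file, and that "at least one font listed" is a good enough signal for "searchable". The function names count_fonts and looks_searchable are made up for this illustration.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows in the text printed by pdffonts.

    The output starts with a column header and a dashed separator line;
    every non-blank line after the separator describes one font.
    """
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        # The separator is the first line that starts with dashes.
        if line.strip().startswith("---"):
            return sum(1 for row in lines[i + 1:] if row.strip())
    # No separator found: treat the output as listing no fonts.
    return 0


def looks_searchable(pdf_path: str) -> bool:
    """Heuristic (an assumption): a PDF with at least one font has text."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,  # requires Python 3.7+
        text=True,
        check=True,           # raise if pdffonts exits with an error
    )
    return count_fonts(result.stdout) > 0
```

For the CSV of URLs, one option is to download each file to a temporary path and call looks_searchable on it. Keep in mind this only checks for fonts: it does not prove that the text layer is complete or correctly encoded.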

., , , , . , , - CID Type "" . stackoverflow , PDF...
