Parsing the PDF file format and extracting text and images

I need to extract “articles” from this journal, which contains both text and images. The images should be saved separately, and the text should be extracted (as far as possible) and saved separately as well.

How should I go about it? Is there a commercial product or API service that already does this? The only input to the program or service will be a file.

Input Example: http://edition.pagesuite-professional.co.uk/pdfspool/rQBvRbttuPUWUoJlU6dBVSRnIlE=.pdf

(the actual file will be a regular PDF file, not a spooled one)

+3
4 answers

The Docotic.Pdf library can extract images and text from PDF files for you.
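To give a rough idea of what extraction involves at the file level, here is a minimal Python sketch that uses only the standard library. It is not the Docotic.Pdf API. The helper names (`iter_streams`, `extract_jpegs`, `extract_text`) are made up for illustration, and it assumes an unencrypted PDF whose images are plain `/DCTDecode` (JPEG) streams and whose text uses literal strings in `/FlateDecode` content streams with a simple single-byte encoding. Real journal PDFs often break these assumptions (subset fonts, hex strings, other filters), which is why a dedicated library is the practical choice.

```python
import re
import zlib

# Body of each stream object: everything between "stream" EOL and "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# Text objects in a content stream are delimited by the BT / ET operators.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# Literal strings such as (Hello); nested unescaped parentheses are not handled.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield (header, raw_body) for every stream found by a plain byte scan."""
    for m in STREAM_RE.finditer(pdf_bytes):
        # The stream dictionary sits between the nearest preceding "obj" and "stream".
        start = pdf_bytes.rfind(b"obj", 0, m.start())
        yield pdf_bytes[max(start, 0):m.start()], m.group(1)


def extract_jpegs(pdf_bytes):
    """Return the raw bytes of streams stored with the DCTDecode filter only."""
    images = []
    for header, body in iter_streams(pdf_bytes):
        if b"/DCTDecode" in header and b"/FlateDecode" not in header:
            end = body.rfind(b"\xff\xd9")  # JPEG end-of-image marker
            images.append(body[:end + 2] if end != -1 else body)
    return images


def extract_text(pdf_bytes):
    """Return literal strings found inside BT ... ET blocks of content streams."""
    chunks = []
    for header, body in iter_streams(pdf_bytes):
        if b"/Image" in header:
            continue  # image data, not page content
        if b"/FlateDecode" in header:
            try:
                # decompressobj tolerates the trailing EOL left after the data
                body = zlib.decompressobj().decompress(body)
            except zlib.error:
                continue
        elif b"/Filter" in header:
            continue  # other filters are out of scope for this sketch
        for block in TEXT_BLOCK_RE.findall(body):
            for raw in STRING_RE.findall(block):
                # Undo only the \( \) \\ escapes; octal escapes are left as-is.
                text = re.sub(rb"\\([()\\])", rb"\1", raw)
                chunks.append(text.decode("latin-1"))
    return chunks
```

With the file read into memory as `data = open("journal.pdf", "rb").read()` (the file name is a placeholder), each item returned by `extract_jpegs(data)` can be written straight to a `.jpg` file, and `extract_text(data)` gives the text fragments in the order they appear in the file, which is not necessarily reading order.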

The library can save extracted images as JPEG or TIFF.

Disclaimer: I work for Bit Miracle, the vendor of the library.

+1

Amyuni PDF Creator is another option for reading the contents of a PDF file (text, images, etc.).

0

You can use Aspose.Pdf.Kit to extract text and images separately from a PDF file. The API is pretty simple, and you can find samples, tutorials, and support on the Aspose website.

Note: I work as a developer evangelist at Aspose.

0