Parsing the PDF file format and extracting text and images

Question


I need to extract "articles" from this magazine, which contains both text and images. The images should be extracted and stored separately, and the text should be extracted (as far as possible) and stored separately.

How should I go about it? Is there a commercial service or API that already does this? The only input to the program or service will be a file.

Input Example: http://edition.pagesuite-professional.co.uk/pdfspool/rQBvRbttuPUWUoJlU6dBVSRnIlE=.pdf

(The actual file will be a regular PDF file, not a spooled one.)


pdf text-extraction

siliconpi May 04 '11 at 5:54


4 answers

Bobrovsky · Answer 1 · 2011-09-01T17:41:34+0000

The Docotic.Pdf library can extract images and text from PDF files for you.

Images can be saved as JPEG or TIFF.

Disclaimer: Bit Miracle is the vendor of the library.

user438959 · Answer 2 · 2011-05-04T07:23:00+0000

See this Syncfusion text extraction sample:

http://asp.syncfusion.com/sfaspnetsamplebrowser/9.1.0.20/Web/Pdf.Web/samples/4.0/Importing/TextExtraction/CS/Default.aspx?args=7

yms · Answer 3 · 2011-05-04T21:09:19+0000

Another commercial option is Amyuni PDF Creator, which can work with the objects in a PDF file (text, images, etc.).

Shahzad latif · Answer 4 · 2011-05-05T12:27:01+0000

You can use Aspose.Pdf.Kit to extract text and images separately from a PDF file. The API is pretty simple. You can also find samples, tutorials, and support on the Aspose website.

Disclosure: I work as a developer evangelist at Aspose.
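For anyone curious what image extraction involves at the file-format level, here is a minimal sketch in Python using only the standard library. It is not the API of any library mentioned in this thread. It relies on one fact about the PDF format: an image stream whose filter is /DCTDecode holds ordinary JPEG data, so the bytes can be copied out as they are. The function names `extract_jpegs` and `save_jpegs` are made up for this illustration. The sketch assumes an unencrypted file with image objects stored directly in the file body (not inside compressed object streams), and it ignores images stored with other filters such as /FlateDecode. Text extraction is considerably harder (compressed content streams, font encodings) and is better left to a real PDF library.

```python
import re
from pathlib import Path

# A dictionary followed by its stream data. This is a simplification:
# a real parser would tokenize objects and honor the /Length entry.
STREAM_RE = re.compile(
    rb"<<(.*?)>>\s*stream\r?\n(.*?)\s*endstream", re.DOTALL
)

JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker


def extract_jpegs(pdf_bytes: bytes) -> list:
    """Return the raw bytes of every /DCTDecode (JPEG) stream found."""
    images = []
    for match in STREAM_RE.finditer(pdf_bytes):
        dictionary, data = match.group(1), match.group(2)
        # Keep only streams declared as JPEG that also start like one.
        if b"/DCTDecode" in dictionary and data.startswith(JPEG_START):
            images.append(data)
    return images


def save_jpegs(pdf_path: str, out_dir: str) -> int:
    """Write each extracted JPEG to out_dir; return how many were saved."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    images = extract_jpegs(Path(pdf_path).read_bytes())
    for index, data in enumerate(images, start=1):
        (target / f"image_{index}.jpg").write_bytes(data)
    return len(images)
```

Calling `save_jpegs("magazine.pdf", "images")` (both paths are placeholders) would write any JPEGs it finds to the `images` directory. For production use, one of the libraries above will handle the many cases this sketch skips.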
