I need to extract "articles" from this journal, which contains both text and images. The images should be saved separately, and the text should be extracted (as completely as possible) and saved separately as well.
How should I do this? Is there a commercial product or API service that already does this? The only input to the program or service will be a file.
Input Example: http://edition.pagesuite-professional.co.uk/pdfspool/rQBvRbttuPUWUoJlU6dBVSRnIlE=.pdf
(the actual file will be a regular pdf file, not a memorized one)