johnnydozenredroses

johnnydozenredroses t1_j2njvbx wrote

The PDF standard optimizes for portability above everything else, so it throws away a lot of semantic information that later has to be reconstructed (like "this piece of text is a page number", "this other piece is a caption attached to this image", etc.).
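To make that concrete, here's a rough sketch of what "reconstructing" even one of those labels looks like. It assumes you've already pulled positioned text fragments off a page; the `TextSpan` type, the `guess_page_number` function, the US Letter page height, and the one-inch margin are all made up for illustration and don't come from any particular PDF library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextSpan:
    """Hypothetical positioned text fragment, roughly what a PDF page gives you."""
    text: str
    x: float  # distance from the left edge, in points
    y: float  # distance from the bottom edge, in points


def guess_page_number(
    spans: List[TextSpan],
    page_height: float = 792.0,  # assumption: US Letter, 72 points per inch
    margin: float = 72.0,        # assumption: page numbers sit within one inch of an edge
) -> Optional[int]:
    """Guess which span is the page number, using position and content only."""
    candidates = [
        s for s in spans
        if s.text.strip().isdecimal()
        and (s.y < margin or s.y > page_height - margin)
    ]
    if not candidates:
        return None
    # Prefer the candidate closest to the top or bottom edge of the page.
    best = min(candidates, key=lambda s: min(s.y, page_height - s.y))
    return int(best.text.strip())


spans = [
    TextSpan("Results", 72.0, 700.0),
    TextSpan("We tested 12 models.", 72.0, 680.0),
    TextSpan("12", 300.0, 400.0),  # a bare number in the body, not a page number
    TextSpan("7", 306.0, 36.0),    # a bare number in the bottom margin
]
print(guess_page_number(spans))  # prints 7
```

A heuristic like this falls over as soon as a document uses roman numerals, "Page 7 of 20", or a number in a running header, which is the point: nothing in the file says which text is the page number.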

To complicate matters, many different programs can generate PDFs, and humans have ingenious ways of fitting things onto a Word document or PowerPoint slide before converting it to PDF.

So it's a hard AI problem.
