Borrowedshorts t1_j2lnqf6 wrote on January 2, 2023 at 6:07 AM

2023 and we still can't automate working with PDF documents. Sad.

johnnydozenredroses t1_j2njvbx wrote on January 2, 2023 at 5:36 PM

The PDF standard optimizes for portability over everything else, so it throws away a lot of semantic information that later has to be reconstructed (like "this piece of text is a page number", "this other piece is a caption attached to this image", etc.).
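To make that concrete, here's a rough sketch of what "reconstructing" looks like once all you have is text plus coordinates. Everything here is made up for illustration: the `Fragment` type, the `guess_role` helper, the 72 pt margin and the 792 pt (US Letter) page height are my assumptions, not anything from a real PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text run, roughly what a PDF extractor gives you."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points, measured from the bottom of the page


def guess_role(frag: Fragment, page_height: float = 792.0, margin: float = 72.0) -> str:
    """Crude guess at the semantic role the PDF never recorded.

    Assumed heuristics: a bare number in the top or bottom margin is a page
    number; text starting with "Figure", "Fig." or "Table" is a caption.
    """
    in_margin = frag.y < margin or frag.y > page_height - margin
    if in_margin and frag.text.strip().isdigit():
        return "page_number"
    if frag.text.lstrip().lower().startswith(("figure", "fig.", "table")):
        return "caption"
    return "body"


fragments = [
    Fragment("Figure 2: Loss over time", 90.0, 310.0),
    Fragment("17", 300.0, 36.0),
    Fragment("In 2023 we measured", 90.0, 500.0),
]
roles = [guess_role(f) for f in fragments]
```

Rules like these break as soon as a document uses roman numerals, puts page numbers in the side margin, or starts a normal sentence with "Table", which is the point: every guess is a heuristic.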

To complicate matters, there are too many software tools that can generate PDFs, and humans have ingenious ways of fitting things onto a Word document or PowerPoint slide before converting to PDF.

So it's a hard AI problem.