johnnydozenredroses t1_j2njvbx wrote on January 2, 2023 at 5:36 PM

Reply to comment by Borrowedshorts in [D] Data cleaning techniques for PDF documents with semantically meaningful parts by cm_34978

The PDF standard optimizes for portability above everything else, so it throws away a lot of semantic information that later has to be reconstructed (like "this piece of text is a page number", "this other piece is a caption attached to this image", etc.).
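
To make "reconstructed" concrete, here is a rough sketch of the kind of heuristic you end up writing for the page-number case. Everything in it is an assumption for illustration: `TextBox` is a hypothetical container for whatever your extractor gives you (text plus a bounding box in PDF coordinates, origin at the bottom-left), and the 8% margin and the regex are arbitrary guesses, not values from any library or standard.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical extractor output: a text run and its bounding box."""
    text: str
    x0: float
    y0: float  # bottom edge (PDF coordinates, origin at bottom-left)
    x1: float
    y1: float  # top edge


def looks_like_page_number(box: TextBox, page_height: float,
                           margin_frac: float = 0.08) -> bool:
    """Guess whether a box is a page number: short numeric text in a margin."""
    # Assumed threshold: the top or bottom 8% of the page counts as margin.
    in_bottom_margin = box.y1 <= page_height * margin_frac
    in_top_margin = box.y0 >= page_height * (1 - margin_frac)
    # Matches "17" or "Page 17"; anything longer is treated as body text.
    is_short_number = re.fullmatch(
        r"\s*(page\s+)?\d{1,4}\s*", box.text, flags=re.IGNORECASE
    ) is not None
    return (in_bottom_margin or in_top_margin) and is_short_number


# Made-up boxes on a US Letter page (792 points tall).
boxes = [
    TextBox("Figure 2: Loss over time", 72, 400, 300, 412),  # caption
    TextBox("17", 300, 30, 312, 42),                         # bottom margin
    TextBox("42", 100, 500, 112, 512),                       # number in body
]
labels = [looks_like_page_number(b, page_height=792) for b in boxes]
print(labels)  # [False, True, False]
```

A rule like this breaks as soon as a document puts page numbers somewhere else or uses roman numerals, which is why the hand-written approach doesn't scale.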

To complicate matters, many different programs can generate PDFs, and people have ingenious ways of fitting things onto a Word document or PowerPoint slide before converting it to PDF.

So it's a hard AI problem.