Submitted by cm_34978 t3_100rbhp in MachineLearning
Borrowedshorts t1_j2lnqf6 wrote
2023 and we still can't automate working with PDF documents. Sad.
johnnydozenredroses t1_j2njvbx wrote
The PDF standard optimizes for portability above everything else, so it throws away a lot of semantic information that later has to be reconstructed (like "this piece of text is a page number", "this other piece is a caption attached to this image", etc.).
To complicate matters, many different programs can generate PDFs, and humans have ingenious ways of fitting things onto a Word document or PowerPoint slide before converting it to PDF.
So it's a hard AI problem.
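To make the "reconstruction" point concrete, here is a rough sketch of the kind of heuristic you end up writing once all you have is text fragments and coordinates. Everything in it is an assumption for illustration: the `Fragment` type, the `guess_page_numbers` function, the US Letter page height of 792 points, and the 72-point margin are made up, not part of any PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position on the page (hypothetical type)."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points measured from the bottom of the page


def guess_page_numbers(fragments, page_height=792.0, margin=72.0):
    """Return fragments that look like page numbers.

    Heuristic (an assumption, not a standard): short, digits-only text
    sitting in the top or bottom margin of the page.
    """
    guesses = []
    for frag in fragments:
        text = frag.text.strip()
        in_margin = frag.y < margin or frag.y > page_height - margin
        if in_margin and text.isdigit() and len(text) <= 4:
            guesses.append(frag)
    return guesses


fragments = [
    Fragment("Figure 3: Training loss", 90.0, 410.0),  # caption in the body
    Fragment("2023", 120.0, 500.0),                    # a number in the body
    Fragment("17", 300.0, 40.0),                       # digits in the bottom margin
]

for frag in guess_page_numbers(fragments):
    print(frag.text)
```

Even this toy version shows the problem: a footnote marker or a table cell that happens to sit near the bottom of the page would be tagged as a page number too, and every layout needs its own exceptions.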