Submitted by cm_34978 t3_100rbhp in MachineLearning
I am seeking insights and best practices for data preprocessing and cleaning in PDF documents. I am interested in extracting only the body text content from a PDF and discarding everything else, such as page numbers, footnotes, headers, and footers (see attached image for an example of semantically meaningful sections).
I have noticed that in Microsoft Word, a user can simply drag in a PDF and Word seems to automatically understand which parts are headers, footnotes, etc. I am speculating that Word may be using machine learning to analyze the layout and formatting of the PDF and classify the different sections accordingly. Alternatively, it may rely on pre-defined rules or patterns to identify common elements such as headers and footnotes. I know of related techniques for extracting layout information from receipts and the like (LayoutLM, Xu et al., https://arxiv.org/abs/1912.13318) and from tabular data (TableNet, Paliwal et al., https://ieeexplore.ieee.org/document/8978013), but nothing that solves layout extraction in this particular domain.
I am curious to know if there are any techniques or algorithms that can replicate Word's behavior. Any suggestions or recommendations for data cleaning in PDF documents would be greatly appreciated.
lopnax t1_j2jt94g wrote
Did you try using PyMuPDF? Maybe you could discard the unwanted parts using some regex. https://pymupdf.readthedocs.io/en/latest/index.html
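Something like this, as a rough sketch (untested; the margin fraction, the page-number regex, and the filename are assumptions you would tune per document). PyMuPDF gives you each block's coordinates, so position-based filtering tends to be more robust for headers and footers than regex alone:

```python
import re
import fitz  # PyMuPDF

PAGE_NUMBER_RE = re.compile(r"^\s*\d+\s*$")  # a bare page number on its own
MARGIN_FRACTION = 0.08  # guess: headers/footers live in the outer 8% of the page

def extract_body_text(path):
    body = []
    with fitz.open(path) as doc:
        for page in doc:
            top = page.rect.height * MARGIN_FRACTION
            bottom = page.rect.height * (1 - MARGIN_FRACTION)
            # "blocks" yields tuples (x0, y0, x1, y1, text, block_no, block_type)
            for x0, y0, x1, y1, text, _, block_type in page.get_text("blocks"):
                if block_type != 0:          # 1 = image block, skip
                    continue
                if y1 < top or y0 > bottom:  # block sits entirely in a margin
                    continue
                if PAGE_NUMBER_RE.match(text):
                    continue
                body.append(text.strip())
    return "\n".join(body)

print(extract_body_text("paper.pdf"))
```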
Also, you can crop the page to keep only a certain rectangle with PyPDF2 and then extract the text. https://pypdf2.readthedocs.io/en/stable/user/cropping-and-transforming.html
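Following the linked docs, the cropping part would look roughly like this (a sketch; the 8% margin and the filenames are placeholders, and note that some text extractors ignore the media box, so verify that extraction on the cropped file actually drops the trimmed regions):

```python
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    box = page.mediabox
    left, bottom = float(box.left), float(box.bottom)
    right, top = float(box.right), float(box.top)
    margin = (top - bottom) * 0.08  # trim ~8% off the top and bottom
    page.mediabox.lower_left = (left, bottom + margin)
    page.mediabox.upper_right = (right, top - margin)
    writer.add_page(page)

with open("cropped.pdf", "wb") as out:
    writer.write(out)
```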