Submitted by cm_34978 t3_100rbhp in MachineLearning

I am seeking insights and best practices for data preprocessing and cleaning in PDF documents. I am interested in extracting only the body text content from a PDF and discarding everything else, such as page numbers, footnotes, headers, and footers (see attached image for an example of semantically meaningful sections).

I have noticed that in Microsoft Word, a user can simply drag in a PDF and Word seems to automatically understand which parts are headers, footnotes, etc. I am speculating that Word may be utilizing machine learning techniques to analyze the layout and formatting of the PDF and classify different sections accordingly. Alternatively, Word may be utilizing pre-defined rules or patterns to identify common elements such as headers and footnotes. I know of related techniques, for example for extracting layout information from receipts and the like (LayoutLM, Xu et al., https://arxiv.org/abs/1912.13318) and from tabular data (TableNet, Paliwal et al., https://ieeexplore.ieee.org/document/8978013), but nothing that solves layout extraction in this particular domain.

I am curious to know if there are any techniques or algorithms that can replicate this behavior in Word. Any suggestions or recommendations for data cleaning in PDF documents would be greatly appreciated.
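One simple rule-based baseline along the lines speculated above (a sketch only, not a full solution — real layout extractors also use font size, position, and styling) is to treat any line that repeats on most pages as a running header/footer, and bare numbers as page numbers, and strip both:

```python
import re
from collections import Counter

def strip_repeated_furniture(pages, min_frac=0.6):
    """pages: list of per-page text strings. Drop lines that recur on at
    least min_frac of the pages (running headers/footers) and lines that
    are bare page numbers. Returns the cleaned per-page texts."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        for line in set(l.strip() for l in page.splitlines() if l.strip()):
            counts[line] += 1
    threshold = max(2, int(min_frac * len(pages)))
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines()
                if l.strip()
                and counts[l.strip()] < threshold          # not a repeated header/footer
                and not re.fullmatch(r"\d{1,4}", l.strip())]  # not a bare page number
        cleaned.append("\n".join(kept))
    return cleaned
```

This obviously misses footnotes and single-page artifacts; it is only meant to illustrate the pre-defined-rules approach the post mentions.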

Image of PDF with semantically meaningful sections

125

Comments


low_effort_shit-post t1_j2ks3qv wrote

I'm a data engineer by trade; usually we say no. PDF isn't a data type or a storage format, it's a print format. Go to the source and ask for the data. Once the $$$ is discussed and there's an understanding of how much harder PDFs are to work with and to maintain a process around, it only makes sense to grab the data from wherever the PDF gets it.

11

30katz t1_j2lilwg wrote

Our company is stuck with PDFs, but it's actually not too hard to work with them using Amazon's Textract or Adobe's Extract API. Then again, maybe the fact that the technology is owned by two of the biggest tech giants in the space is a sign that it is hard.

5

VacuousWaffle t1_j2m3shr wrote

I remember at a hospital job I was asked to mine text from PDFs that were generated internally after another team spent a few months trying to build a solution themselves. The source data they were trying to mine was already in the data warehouse in a surprisingly well-formatted table.

2

low_effort_shit-post t1_j2mpzz5 wrote

We get PDF feeds all the time, with promises (backed by financial implications) that a proper data feed is coming. Usually we kick the can down the road, and when the real feed arrives we just pull it in and it moves through our usual ETL process within a day. PDFs are to be ignored.

3

Borrowedshorts t1_j2lnqf6 wrote

2023 and we still can't automate working with PDF documents. Sad.

9

johnnydozenredroses t1_j2njvbx wrote

The PDF standard optimizes portability over everything else, so it throws away a lot of semantic information that later needs to be reconstructed (like "this piece of text is a page number", "this other piece is a caption that is attached to this image", etc).

To complicate matters, there are too many software packages that can generate PDFs, and humans have ingenious ways of fitting things onto a Word document or PowerPoint page before converting to PDF.

So it's a hard AI problem.

2

SupplyChainNext t1_j2lik4g wrote

As someone who's done this extensively: unless it was made in Word, the PDF can be complete gibberish or have massive paragraph/sentence errors. Heck, the OCR can outright misread words, or take a three-letter word and turn it into two with 15 random ASCII characters in between.

It’s a crap shoot.

God speed.

8

cm_34978 OP t1_j2n0cym wrote

Update for the interested - after trying a few different packages suggested in the comments, I settled on the inelegant yet functional solution of automating Microsoft Word: importing each PDF, saving it as a Word file, then using a library to extract only the body text from the Word file.

Definitely not ideal since this will not work on Linux and will only run as fast as Microsoft Word can open, convert, and save them. But it works.
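For anyone curious, the pipeline above can be sketched roughly as follows. This is a sketch, not the exact script: the function names are made up, it assumes pywin32 and python-docx are installed alongside a local copy of Word, and the COM part runs only on Windows.

```python
from pathlib import Path

def docx_target(pdf_path: str) -> str:
    # Where the converted .docx will be written (same folder, same stem).
    return str(Path(pdf_path).with_suffix(".docx"))

def pdf_to_docx_via_word(pdf_path: str) -> str:
    """Windows-only: drive Word over COM to open a PDF and re-save it as .docx.
    Word performs its own layout reconstruction during the import."""
    import win32com.client  # pywin32; requires a local Word install
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(str(Path(pdf_path).resolve()))
    out = docx_target(pdf_path)
    doc.SaveAs2(str(Path(out).resolve()), FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close(False)
    word.Quit()
    return out

def body_text(docx_path: str) -> str:
    # python-docx iterates body paragraphs only, skipping headers/footers.
    import docx
    return "\n".join(p.text for p in docx.Document(docx_path).paragraphs)
```

Usage would be `body_text(pdf_to_docx_via_word("report.pdf"))`; batching over a folder amortizes the (slow) Word startup if you reuse one `word` instance.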

3

ypanagis t1_j2nkyk0 wrote

I was about to propose the same. For those who are interested, this also seems to work on macOS, but Windows is definitely the way to go. A VBA script can also come in handy for batch work: have Word open several PDFs and save each as TXT.

3

cm_34978 OP t1_j2nsi8g wrote

Definitely. With Windows, you get the advantage of the win32com library, whereas on macOS you need to play with AppleScript, which (in my hands) can be brittle and finicky.

2

ai-lover t1_j2m5u8s wrote

You can use a PDF parsing library: there are several available that can help you extract text and data from PDF documents. Some popular ones include PDFMiner and PyPDF2.
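For reference, pdfminer.six (the maintained fork of the original PDFMiner) exposes a high-level API that dumps all text from a PDF in a couple of lines. Note that, as discussed above, this returns everything — headers, footers, and page numbers included — so it doesn't by itself solve the body-text-only problem:

```python
def extract_all_text(pdf_path: str) -> str:
    """Return the full text of a PDF using pdfminer.six
    (pip install pdfminer.six; the import path is still `pdfminer`)."""
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)
```

You would still need a post-processing step (rules or a layout model) to separate body text from page furniture.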

2

niszozz t1_j2nf8j2 wrote

I believe you can open PDFs through Google Docs and save them as a doc file

1

Terrible-List-1653 t1_j2lhtzw wrote

Hoorah! I’ve been testing this across a few sectors that I have zero experience in. My “experiment” is to see how well AI can answer questions in a group chat with professionals in their field. As an artist/director/entrepreneur, I and the people I work with are being affected in real time… wondering what the effect is in other sectors.

−3

30katz t1_j2n9hpj wrote

Dude, stop. No one needs more garbage information. We can use ChatGPT and Google without your help. You’re not being an AI entrepreneur by spamming ChatGPT responses.

3

[deleted] t1_j2lbha5 wrote

[removed]

−12

avatarOfIndifference t1_j2lhflc wrote

This is definitely a ChatGPT response

12

Disastrous_Elk_6375 t1_j2m49ih wrote

Oh yeah. "There are several", "some of the most ... include", "it may be necessary", "overall the best approach" — those are 100% markers of ChatGPT that I've seen in most answers I've gotten from it.

1