Submitted by cm_34978 t3_100rbhp in MachineLearning

I am seeking insights and best practices for data preprocessing and cleaning in PDF documents. I am interested in extracting only the body text content from a PDF and discarding everything else, such as page numbers, footnotes, headers, and footers (see attached image for an example of semantically meaningful sections).

I have noticed that in Microsoft Word, a user can simply drag in a PDF and Word seems to automatically understand which parts are headers, footnotes, etc. I am speculating that Word may be utilizing machine learning techniques to analyze the layout and formatting of the PDF and classify different sections accordingly. Alternatively, Word may be utilizing pre-defined rules or patterns to identify common elements such as headers and footnotes. I know of related techniques, for example for extracting layout information from receipts and the like (LayoutLM, Xu et al., https://arxiv.org/abs/1912.13318) and from tabular data (TableNet, Paliwal et al., https://ieeexplore.ieee.org/document/8978013), but nothing that solves layout extraction in this particular domain.

I am curious to know if there are any techniques or algorithms that can replicate this behavior in Word. Any suggestions or recommendations for data cleaning in PDF documents would be greatly appreciated.
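One simple rule-based baseline along the lines speculated above (a sketch only, not a full solution — real layout extractors also use font size, position, and styling) is to treat any line that repeats on most pages as a running header/footer, and bare numbers as page numbers, and strip both:

```python
import re
from collections import Counter

def strip_repeated_furniture(pages, min_frac=0.6):
    """pages: list of per-page text strings. Drop lines that recur on at
    least min_frac of the pages (running headers/footers) and lines that
    are bare page numbers. Returns the cleaned per-page texts."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        for line in set(l.strip() for l in page.splitlines() if l.strip()):
            counts[line] += 1
    threshold = max(2, int(min_frac * len(pages)))
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines()
                if l.strip()
                and counts[l.strip()] < threshold          # not a repeated header/footer
                and not re.fullmatch(r"\d{1,4}", l.strip())]  # not a bare page number
        cleaned.append("\n".join(kept))
    return cleaned
```

This obviously misses footnotes and single-page artifacts; it is only meant to illustrate the pre-defined-rules approach the post mentions.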

Image of PDF with semantically meaningful sections

125

Comments


low_effort_shit-post t1_j2ks3qv wrote

I'm a data engineer by trade; usually we say no. PDF isn't a data type or a storage format, it's a print format. Go to the source and ask for the data. Once the $$$ is discussed and there's an understanding of how much harder PDFs are to work with and to maintain a process around, it only makes sense to grab the data from wherever the PDF gets it.

11

30katz t1_j2lilwg wrote

Our company is stuck with PDFs, but it's actually not too hard to work with them using Amazon's Textract or Adobe's Extract API. Then again, maybe the fact that the technology is owned by two of the biggest tech giants in the space is a sign that it is hard.

5

VacuousWaffle t1_j2m3shr wrote

I remember at a hospital job I was asked to mine text from PDFs that were generated internally after another team spent a few months trying to build a solution themselves. The source data they were trying to mine was already in the data warehouse in a surprisingly well-formatted table.

2

low_effort_shit-post t1_j2mpzz5 wrote

We get PDF feeds all the time, with promises (backed by financial implications) that a proper data feed is coming. Usually we kick the can down the road, and when the real feed arrives we just pull it in and it moves through our usual ETL process within a day. PDFs are to be ignored.

3

Borrowedshorts t1_j2lnqf6 wrote

2023 and we still can't automate working with PDF documents. Sad.

9

johnnydozenredroses t1_j2njvbx wrote

The PDF standard optimizes portability over everything else, so it throws away a lot of semantic information that later needs to be reconstructed (like "this piece of text is a page number", "this other piece is a caption that is attached to this image", etc).

To complicate matters, there are too many software packages that can generate PDFs, and humans have ingenious ways of fitting things onto a Word document or PowerPoint page before converting to PDF.

So it's a hard AI problem.

2

SupplyChainNext t1_j2lik4g wrote

As someone who's done this extensively: unless it was made in Word, the PDF can be complete gibberish or have massive paragraph/sentence errors. Heck, the OCR can outright misread words, or take a three-letter word and turn it into two with 15 random ASCII characters in between.

It’s a crap shoot.

God speed.

8

cm_34978 OP t1_j2n0cym wrote

Update for the interested - after trying a few different packages suggested in the comments, I settled on the inelegant yet functional solution of automating Microsoft Word: importing each PDF, saving it as a Word file, then using a library to extract only the body text from the Word file.

Definitely not ideal since this will not work on Linux and will only run as fast as Microsoft Word can open, convert, and save them. But it works.
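For anyone curious, the pipeline above can be sketched roughly as follows. This is a sketch, not the exact script: the function names are made up, it assumes pywin32 and python-docx are installed alongside a local copy of Word, and the COM part runs only on Windows.

```python
from pathlib import Path

def docx_target(pdf_path: str) -> str:
    # Where the converted .docx will be written (same folder, same stem).
    return str(Path(pdf_path).with_suffix(".docx"))

def pdf_to_docx_via_word(pdf_path: str) -> str:
    """Windows-only: drive Word over COM to open a PDF and re-save it as .docx.
    Word performs its own layout reconstruction during the import."""
    import win32com.client  # pywin32; requires a local Word install
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(str(Path(pdf_path).resolve()))
    out = docx_target(pdf_path)
    doc.SaveAs2(str(Path(out).resolve()), FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close(False)
    word.Quit()
    return out

def body_text(docx_path: str) -> str:
    # python-docx iterates body paragraphs only, skipping headers/footers.
    import docx
    return "\n".join(p.text for p in docx.Document(docx_path).paragraphs)
```

Usage would be `body_text(pdf_to_docx_via_word("report.pdf"))`; batching over a folder amortizes the (slow) Word startup if you reuse one `word` instance.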

3

ypanagis t1_j2nkyk0 wrote

I was about to propose the same. For those who are interested, this also seems to work on macOS, but Windows is definitely the way to go. A VBA script can also come in handy for batch work: have Word open several PDFs and save each as TXT.

3

cm_34978 OP t1_j2nsi8g wrote

Definitely. With Windows, you get the advantage of the win32com library, whereas on macOS you need to play with AppleScript, which (in my hands) can be brittle and finicky.

2

ai-lover t1_j2m5u8s wrote

You can use a PDF parsing library: there are several available that can help you extract text and data from PDF documents. Some popular ones include PDFMiner and PyPDF2.
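For reference, pdfminer.six (the maintained fork of the original PDFMiner) exposes a high-level API that dumps all text from a PDF in a couple of lines. Note that, as discussed above, this returns everything — headers, footers, and page numbers included — so it doesn't by itself solve the body-text-only problem:

```python
def extract_all_text(pdf_path: str) -> str:
    """Return the full text of a PDF using pdfminer.six
    (pip install pdfminer.six; the import path is still `pdfminer`)."""
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)
```

You would still need a post-processing step (rules or a layout model) to separate body text from page furniture.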

2

niszozz t1_j2nf8j2 wrote

I believe you can open PDFs through Google Docs and save them as a doc file

1

Terrible-List-1653 t1_j2lhtzw wrote

Hoorah! I’ve been testing this across a few sectors that I have zero experience in. My “experiment” is to see how well AI can answer questions in a group chat with professionals in their field. As an artist/director/entrepreneur, I and the people I work with are being affected in real time… wondering what the effect is in other sectors.

−3

30katz t1_j2n9hpj wrote

Dude, stop. No one needs more garbage information. We can use ChatGPT and Google without your help. You’re not being an AI entrepreneur by spamming ChatGPT responses.

3

[deleted] t1_j2lbha5 wrote

[removed]

−12

avatarOfIndifference t1_j2lhflc wrote

This is definitely a ChatGPT response

12

Disastrous_Elk_6375 t1_j2m49ih wrote

Oh yeah. "There are several", "some of the most ... include", "it may be necessary", "overall the best approach" — those are 100% markers of ChatGPT that I've seen in most answers I've gotten from it.

1