VacuousWaffle t1_j2m3shr wrote on January 2, 2023 at 9:29 AM

Reply to comment by low_effort_shit-post in [D] Data cleaning techniques for PDF documents with semantically meaningful parts by cm_34978

I remember at a hospital job, after another team had spent a few months trying to build a solution themselves, I was asked to mine text from PDFs that were generated internally. The source data they were trying to mine was already in the data warehouse in a surprisingly well-formatted table.

low_effort_shit-post t1_j2mpzz5 wrote on January 2, 2023 at 1:58 PM

We get PDF feeds all the time, along with promises, backed by financial implications, that we'll get a proper data feed. Usually we kick the can down the road, and when we get the feed we just pull it in and it moves through our usual ETL process within a day. PDFs are to be ignored.