Submitted by cm_34978 t3_100rbhp in MachineLearning
VacuousWaffle t1_j2m3shr wrote
Reply to comment by low_effort_shit-post in [D] Data cleaning techniques for PDF documents with semantically meaningful parts by cm_34978
I remember at a hospital job I was asked to mine text from internally generated PDFs, after another team had spent a few months trying to build a solution themselves. The source data they were trying to mine was already in the data warehouse, in a surprisingly well-formatted table.
low_effort_shit-post t1_j2mpzz5 wrote
We get PDF feeds all the time, along with promises, which carry financial implications, that a proper data feed will follow. Usually we kick the can down the road, and when we get the feed we just pull it in and it moves through our usual ETL process within a day. PDFs are to be ignored.