I'm thinking about building an open source library to generate structured ML datasets from sources across the internet.

I know that lots of projects use crawlers to get decent datasets, and you might still need to build your own for specific use cases. Still, I'm wondering whether it'd be useful to have an open source library that lets you launch crawlers with predefined schemas for popular sources like LinkedIn, YouTube (I know YT also has an API), Shopify stores, Twitter, Reddit, news sites and more.

Kind of like a unified interface with extendable starter templates.

The lib would dump JSON objects into a location you specify, like your local machine, MongoDB, or S3.

Something like:

{ "title": "some video", "source": "https://youtube.com/jfg78", "views": 245676, "comments": {} }
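
To make the idea a bit more concrete, here is a minimal sketch of what a predefined schema plus a pluggable output location could look like. Everything in it (`VideoRecord`, `Sink`, `LocalJsonlSink`, `dump`) is a hypothetical name made up for illustration, not an existing library; a MongoDB or S3 target would just be another class with the same `write` method.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Iterable, Protocol


@dataclass
class VideoRecord:
    """Hypothetical predefined schema for a video source."""
    title: str
    source: str
    views: int
    comments: dict = field(default_factory=dict)


class Sink(Protocol):
    """Anything that can accept one record at a time (local file, DB, bucket)."""
    def write(self, record: dict) -> None: ...


class LocalJsonlSink:
    """Appends one JSON object per line to a file on the local machine."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def write(self, record: dict) -> None:
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")


def dump(records: Iterable[VideoRecord], sink: Sink) -> int:
    """Convert each record to a plain dict, hand it to the sink, return the count."""
    count = 0
    for record in records:
        sink.write(asdict(record))
        count += 1
    return count
```

Usage would be roughly `dump(crawled_records, LocalJsonlSink("videos.jsonl"))`, where `crawled_records` is whatever a source template yields.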

The goal would be to make it easier and faster to get datasets from sources that don't natively have an API.

This might be a useless idea, but would love to hear your thoughts.

Comments

FluffyVista t1_j9ou9b2 wrote on February 23, 2023 at 2:54 PM

That's useful, thanks

KPTN25 t1_j9p8zgp wrote on February 23, 2023 at 4:31 PM

Good luck crawling LinkedIn. Not saying it's impossible, but you'll definitely be making your life difficult if you try to publish a tool that scrapes from LI.

dmart89 OP t1_j9po770 wrote on February 23, 2023 at 6:04 PM

Just a library, not a commercial tool. Anyone using it would be scraping themselves, not via a 3rd party service or something.

Sal-Hardin t1_j9oqwmt wrote on February 23, 2023 at 2:29 PM

How do you envision searching?

dmart89 OP t1_j9oxk6s wrote on February 23, 2023 at 3:17 PM

Probably keeping it simple to start with and just use filters during the crawl.
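
A minimal sketch of what filtering during the crawl could mean, assuming records are plain dicts shaped like the example in the post. The helper names (`min_views`, `title_contains`, `apply_filters`) are hypothetical, made up for illustration.

```python
from typing import Callable, Iterable, Iterator, Sequence

Record = dict
Filter = Callable[[Record], bool]


def min_views(threshold: int) -> Filter:
    """Keep records with at least `threshold` views (missing views count as 0)."""
    return lambda record: record.get("views", 0) >= threshold


def title_contains(keyword: str) -> Filter:
    """Keep records whose title contains `keyword`, case-insensitively."""
    needle = keyword.lower()
    return lambda record: needle in record.get("title", "").lower()


def apply_filters(records: Iterable[Record], filters: Sequence[Filter]) -> Iterator[Record]:
    """Lazily yield only the records that pass every filter."""
    for record in records:
        if all(check(record) for check in filters):
            yield record
```

Because `apply_filters` is a generator, it could sit between the crawler and the output step without holding the whole crawl in memory.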

muwnd t1_j9qvxwf wrote on February 23, 2023 at 10:35 PM

Better to save yourself all the crawling trouble and use data from Common Crawl, so you can focus on the extraction part.

noxiousmomentum t1_j9nil84 wrote on February 23, 2023 at 6:05 AM

useless. what can easily be done needs no automation and what is hard to do isn't helped by this approach

dmart89 OP t1_j9nkm2u wrote on February 23, 2023 at 6:28 AM

Fair. Thanks for your thoughts. I personally find constructing scrapers and parsing data annoyingly tedious, but it's probably just me (:

ch9ki7 t1_j9nw6hu wrote on February 23, 2023 at 8:56 AM

building and maintaining scrapers is tedious! I would also like some better solution. the idea is not bad, just maybe difficult to solve.

dmart89 OP t1_j9olr3r wrote on February 23, 2023 at 1:50 PM

Possibly, yes, I would need to check. I recently built parsing services for TikTok, and it was super annoying to deal with.

ch9ki7 t1_j9oqe44 wrote on February 23, 2023 at 2:25 PM

maybe something like scraperapi, but with some kind of DSL you could send as the POST payload.

but another problem is that you often need a scraped result as input for another request
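
A rough sketch of how such a DSL could handle the chaining problem, assuming each step is a dict with a name, a URL template, and a regex with one capture group. The step format and `run_pipeline` are hypothetical, and `fetch` is passed in so any HTTP client (or a fake one) can be used.

```python
import re
from typing import Callable, Dict, List


def run_pipeline(steps: List[dict], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Run steps in order; each extracted value becomes available to later URLs."""
    context: Dict[str, str] = {}
    for step in steps:
        # Fill {placeholders} in the URL with values scraped by earlier steps.
        url = step["url"].format(**context)
        body = fetch(url)
        match = re.search(step["extract"], body)
        if match is None:
            raise ValueError(f"step {step['name']!r}: pattern not found at {url}")
        context[step["name"]] = match.group(1)
    return context


# Example payload: the second step's URL depends on the first step's result.
steps = [
    {"name": "item_id", "url": "https://example.com/search?q=ml", "extract": r"/item/(\d+)"},
    {"name": "title", "url": "https://example.com/item/{item_id}", "extract": r"<h1>(.*?)</h1>"},
]
```

The `steps` list is plain data, so it could be serialized as JSON and sent as a POST body; a real version would likely need something sturdier than regexes for extraction.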

step21 t1_j9nwh4u wrote on February 23, 2023 at 9:00 AM

Also, some of it might get you into legal trouble if you, for example, make a public crawler for LinkedIn.

dmart89 OP t1_j9olf7e wrote on February 23, 2023 at 1:48 PM

There was a court ruling a year or two ago that concluded that scraping public LinkedIn profiles is legal :) LinkedIn obviously still doesn't want you to scrape their data, so building scrapers for it is extra tedious because you need to navigate their blocking.

KPTN25 t1_j9qy2xi wrote on February 23, 2023 at 10:49 PM

> court ruling a year or two ago that concluded that scraping public linkedin profiles is legal

Forgot about this. I may be dating myself with problems of the past.

Still imagine they're doing their best to make it really hard to do, though.