Submitted by dmart89 t3_119o54q in MachineLearning
I'm thinking about building an open source library to generate structured ML datasets from sources across the internet.
I know that lots of projects use crawlers to get decent datasets. While you might still need to create your own for specific use cases, I'm wondering whether it'd be useful to have an open source library that lets you launch crawlers with predefined schemas for popular sources like LinkedIn, YouTube (I know YouTube also has an API), Shopify stores, Twitter, Reddit, news sites and more.
Kind of like a unified interface with extendable starter templates.
The lib would dump JSON objects into a location you specify, like your local machine, MongoDB, or S3.
Something like:
{ "title": "some video", "source": "https://youtube.com/jfg78", "views": 245676, "comments": {} }
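To make the idea a bit more concrete, here is a minimal sketch of how a predefined schema, a starter template, and a local output location could fit together. Nothing here is an existing library: `VideoRecord`, `LocalJsonSink`, `CrawlerTemplate`, and `VideoTemplate` are all hypothetical names, the raw input keys (`url`, `views`, ...) are assumptions, and the actual fetching step is left out, so the template only maps already-fetched items onto the schema.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, Iterable


@dataclass
class VideoRecord:
    """Hypothetical predefined schema, mirroring the example object above."""
    title: str
    source: str
    views: int
    comments: Dict[str, Any] = field(default_factory=dict)


class LocalJsonSink:
    """Hypothetical sink: appends one JSON object per line to a local file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def write(self, records: Iterable[VideoRecord]) -> int:
        count = 0
        with self.path.open("a", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(asdict(record)) + "\n")
                count += 1
        return count  # number of records written


class CrawlerTemplate(ABC):
    """Hypothetical starter template: subclasses map raw items to the schema."""

    @abstractmethod
    def parse(self, raw: Dict[str, Any]) -> VideoRecord:
        ...

    def run(self, raw_items: Iterable[Dict[str, Any]], sink: LocalJsonSink) -> int:
        # Fetching is out of scope here; raw_items are assumed to be fetched already.
        return sink.write(self.parse(item) for item in raw_items)


class VideoTemplate(CrawlerTemplate):
    """Example template; the raw keys used below are assumptions."""

    def parse(self, raw: Dict[str, Any]) -> VideoRecord:
        return VideoRecord(
            title=raw["title"],
            source=raw["url"],
            views=int(raw["views"]),
            comments=raw.get("comments", {}),
        )
```

Usage would look something like `VideoTemplate().run(raw_items, LocalJsonSink("videos.jsonl"))`. A MongoDB or S3 sink could expose the same `write` method, so templates would not need to know where the data ends up.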
The goal would be to make it easier and faster to get datasets from sources that don't natively have an API.
This might be a useless idea, but I'd love to hear your thoughts.
FluffyVista t1_j9ou9b2 wrote
That's useful, thanks