dmart89
dmart89 OP t1_j9oxk6s wrote
Reply to comment by Sal-Hardin in [D] Python library to collect structured datasets across the internet by dmart89
I'd probably keep it simple to start with and just use filters during the crawl.
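For anyone wondering what "filters during the crawl" could look like, here's a rough sketch. It is not the library's actual API: `LINKS`, `allow`, and `crawl` are all made-up names, and the in-memory link graph is a stand-in for real HTTP fetches so the snippet runs on its own.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real page fetches (assumption).
LINKS = {
    "https://example.com/": [
        "https://example.com/blog/a",
        "https://example.com/login",
        "https://other.org/",
    ],
    "https://example.com/blog/a": [
        "https://example.com/blog/b",
        "https://example.com/",
    ],
    "https://example.com/blog/b": [],
}


def allow(url, host="example.com", blocked_prefixes=("/login",)):
    """Filter: keep only URLs on the given host whose path is not blocked."""
    parts = urlparse(url)
    return parts.netloc == host and not parts.path.startswith(blocked_prefixes)


def crawl(start, get_links, url_filter):
    """Breadth-first crawl that applies url_filter before queueing a link."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Rejected links are never queued, so they are never fetched.
            if link not in seen and url_filter(link):
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("https://example.com/", lambda u: LINKS.get(u, []), allow)
```

The point of filtering at queue time rather than after the fact is that off-topic or blocked pages never get requested at all, which keeps the crawl small.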
dmart89 OP t1_j9olr3r wrote
Reply to comment by ch9ki7 in [D] Python library to collect structured datasets across the internet by dmart89
Possibly, yes, I would need to check. I recently built parsing services for TikTok, and it was super annoying to deal with.
dmart89 OP t1_j9olf7e wrote
Reply to comment by step21 in [D] Python library to collect structured datasets across the internet by dmart89
There was a court ruling a year or two ago that concluded that scraping public LinkedIn profiles is legal :) LinkedIn obviously still doesn't want you to scrape their data, so building scrapers for it is extra tedious because you need to navigate their blocking.
dmart89 OP t1_j9nkm2u wrote
Reply to comment by noxiousmomentum in [D] Python library to collect structured datasets across the internet by dmart89
Fair. Thanks for your thoughts. I personally find constructing scrapers and parsing data annoyingly tedious, but it's probably just me (:
Submitted by dmart89 t3_119o54q in MachineLearning
dmart89 t1_j98eltr wrote
I think what you're asking is how to implement ML instead of building something from the ground up. I don't know your industry, but there are lots of suppliers and startups that would happily partner with you to help you adopt these capabilities without you needing to hire a team to build your own infrastructure. Many other industries already do!
dmart89 t1_j4nio9p wrote
Reply to comment by lumin0va in [D] Can ChatGPT flag it's own writings? by MrSpotgold
Idk, I guess the point is that if text is 100% GPT-written and not reviewed by a human, then there is a risk that GPT learns from bad GPT examples. If you review and modify it to remove the watermark, then it is effectively human-reviewed/labelled content and OK for re-ingestion in future iterations.
But tbh the guys at OpenAI are pretty capable, I'm sure they'll think of something. I don't know anything more than the headline I read.
dmart89 t1_j4mkxyd wrote
Reply to comment by EmbarrassedHelp in [D] Can ChatGPT flag it's own writings? by MrSpotgold
I guess we don't know how they'll do it yet, but from what I understand, the purpose is to prevent future GPT versions from training on GPT-generated text, since GPT trains on text from the Internet.
dmart89 t1_j4l4vyz wrote
Reply to [D] Can ChatGPT flag it's own writings? by MrSpotgold
Right now, no. They're working on a digital watermark for model outputs to distinguish whether GPT or a human wrote something.
dmart89 t1_j27at4r wrote
This is cool.
dmart89 t1_iy6k204 wrote
No need to reinvent the wheel. Code is a tool to get to an outcome. If you can do it with packages, great.
dmart89 OP t1_j9po770 wrote
Reply to comment by KPTN25 in [D] Python library to collect structured datasets across the internet by dmart89
Just a library, not a commercial tool. Anyone using it would be scraping themselves, not via a 3rd-party service or something.