Arcosim t1_j4wm8ap wrote

>Stable Diffusion, Midjourney, etc. are just using datasets from LAION

LAION uses Common Crawl to crawl the net, and Common Crawl obeys the robots.txt rules of any site it crawls. Getty Images has no case here: if they didn't want their content crawled, they should have said so in their robots.txt file.

Furthermore, Getty is one of the scummiest companies out there. They pretended to hold the copyright to tens of millions of images in the Library of Congress, and they also take photos that photographers published under a CC license and then try to shake those photographers down for money.

10

leroy_hoffenfeffer t1_j4wmjso wrote

I know how they obtained URLs using CommonCrawl. CommonCrawl isn't the issue.

CommonCrawl only returns URLs. LAION had to take those URLs and download the content from the webpages they point to.

1

Arcosim t1_j4wqzbl wrote

The point is, if they didn't want that content scraped, they should have put a rule disallowing it in their robots.txt.
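
For anyone unsure what such a rule looks like, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, and the `is_allowed` helper are made up for illustration; `CCBot` is the user agent Common Crawl's crawler identifies itself with. It only shows how a well-behaved crawler would read the rule, not what LAION or anyone else actually ran.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block Common Crawl's crawler
# (user agent "CCBot") from the whole site, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/photos/1234.jpg"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", url))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

Keep in mind that robots.txt is only a request: it works when the crawler chooses to check it, as in the sketch above.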

1

leroy_hoffenfeffer t1_j4wvkoo wrote

A few issues with this thought process:

  1. Even if folks were to retroactively add or edit robots.txt files to disallow scraping, that does nothing to address the content already scraped and downloaded. So the aspect of LAION downloading potentially copyrighted works is still in play.

  2. I think it's an extremely flaky argument to say "Well, those artists should have edited their robots.txt files to disallow the thing they didn't know was happening". It's a very real possibility that the artists in question didn't even know this kind of thing was happening, let alone that there was something they could do about it. I'm not sure a court would view that argument as sound.

  3. I think LAION is a European company. That's relevant because of their FAQ:

> If you found your name only on the ALT text data, and the corresponding picture does NOT contain your image, this is not considered personal data under GDPR terms. Your name associated with other identifiable data is. If the URL or the picture has your image, you may request a takedown of the dataset entry in the GDPR page. As per GDPR, we provide a takedown form you can use.

So, LAION is beholden to GDPR terms. I think the potential exists for someone to ask "Well... If my picture and data are considered personal data, why isn't the content I produce also considered personal data?" That's how current GDPR guidelines work, but I think we may end up seeing edits or rewrites of those guidelines given cases like this.

It's neither reasonable nor sound to say "Artists should have taken this very technical detail into account in order to protect their work."

1