dojoteef t1_j1uy04f wrote
Very interesting idea. It could easily be applied to images since digital watermarks already exist. Not sure how feasible it is for AI generated text.
Tbh, I imagine it behooves companies to do this so they are less likely to train on media (text, images, audio, etc) produced from a model. The more ubiquitous AI generation becomes, the bigger an issue this poses. Currently that problem is likely quite minimal and probably just injects a small bit of noise into training (and the knowledge distillation effect could even slightly improve training efficiency).
Though I guess a new data cleaning step could be to run a classifier that flags whether each piece of media is likely AI generated, though that would likely be less efficient than checking a hash produced at the time of generation.
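To illustrate the hash-at-generation idea, here's a minimal sketch (my own toy example, not anything a vendor actually ships): the provider records a hash of every output it generates, and a data-cleaning pipeline later checks candidate training text against that registry. The function names and the in-memory set are assumptions for illustration.

```python
import hashlib

# Hypothetical registry populated at generation time.
# In practice this would be a shared database or API, not an in-memory set.
generated_hashes = set()

def record_generation(text: str) -> None:
    """Called by the provider when a model emits an output."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    generated_hashes.add(digest)

def is_likely_generated(text: str) -> bool:
    """Data-cleaning check: exact-match lookup against the registry."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest in generated_hashes

record_generation("Sample model output.")
print(is_likely_generated("Sample model output."))   # True
print(is_likely_generated("Human-written article."))  # False
```

The obvious weakness is that an exact hash breaks on any edit to the text, which is why the classifier (or a proper watermark that survives paraphrasing) comes up as the alternative.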