Submitted by Desi___Gigachad t3_126rgih in MachineLearning
There is a research post by Neel Nanda stating that Othello-GPT has a linear emergent world representation. What does this mean (I'm mostly a novice), and what do you all think about it?
Link: https://www.neelnanda.io/mechanistic-interpretability/othello
Jadien t1_jeastxv wrote
I've only skimmed the link (and its sub-links), but the basic idea is this:
If you've trained a model to predict the next move in an Othello game, given the board state as input, you cannot necessarily conclude that the model can also perform related tasks, like "determine whether a given move is legal" or "determine what the board state will be after executing a move". Those abilities might help a model predict the next move, but they aren't required.
However:
> Context: A recent paper trained a model to play legal moves in Othello by predicting the next move, and found that it had spontaneously learned to compute the full board state - an emergent world representation.
In the process of optimizing the model's ability to predict moves, the model did also develop the ability to compute the next board state, given the ~~initial state~~ previous moves and predicted move (Thank you /u/ditchfieldcaleb).

The author's contribution:
> I find that actually, there's a linear representation of the board state!

> This is evidence for the linear representation hypothesis: that models, in general, compute features and represent them linearly, as directions in space! (If they don't, mechanistic interpretability would be way harder)
Which is to say that the model's internal prediction of the next board state is fairly interpretable by humans: there's some square-ish set of activations in the model that corresponds to the square-ish Othello board. That's another property of the model that is a reasonable outcome but isn't a foregone conclusion.
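To make "linear representation" a bit more concrete: a linear probe is just a linear classifier trained to read a feature (here, the state of one board square) straight out of the model's activations. Here's a minimal sketch, assuming you've already cached activations and per-square labels; the file names, layer, and label scheme are made up for illustration and aren't taken from the post.

```python
# Minimal linear-probe sketch (illustrative; file names, layer, shapes, and
# label scheme are assumptions, not the actual setup from the post/paper).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume we've cached residual-stream activations from Othello-GPT at some layer:
#   acts:   (n_examples, d_model)  - one activation vector per move position
#   labels: (n_examples,)          - state of one fixed board square at that position,
#                                    e.g. 0 = empty, 1 = black, 2 = white
acts = np.load("othello_gpt_layer6_acts.npy")   # hypothetical cached activations
labels = np.load("square_d3_states.npy")        # hypothetical ground-truth labels

# A purely linear classifier on the raw activations: no hidden layers, no nonlinearity.
# High held-out accuracy would mean that square's state is (approximately)
# linearly decodable, i.e. represented as a direction in activation space.
split = int(0.8 * len(acts))
probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:split], labels[:split])
print("held-out probe accuracy:", probe.score(acts[split:], labels[split:]))
```

The point is that the probe itself can't compute anything nonlinear, so if it reads the board state out accurately, that state must already be laid out linearly in the activations.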