Submitted by xutw21 in MachineLearning
Paper: https://arxiv.org/abs/2211.00241
Project Page: goattack.alignmentfund.org
>We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo -- in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. See the project page linked above for example games.
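The core idea described in the abstract -- updating only the adversary while the victim's weights stay frozen -- can be sketched with a toy REINFORCE loop. This is a hypothetical illustration, not the paper's actual pipeline: the `Policy` network, the stand-in "predict the frozen victim's move" game, and all hyperparameters are invented, and the real attack trains against KataGo in full games of Go.

```python
# Hypothetical sketch: adversarial-policy training against a frozen victim.
# The toy objective (guess the frozen victim's next move) stands in for
# exploiting a fixed opponent; it is not the paper's Go setup.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_actions, 32), nn.ReLU(),
                                 nn.Linear(32, n_actions))

    def forward(self, x):
        return torch.distributions.Categorical(logits=self.net(x))

victim = Policy()                      # frozen victim: parameters never updated
for p in victim.parameters():
    p.requires_grad_(False)

adversary = Policy()                   # adversarial policy: the only thing trained
opt = torch.optim.Adam(adversary.parameters(), lr=1e-2)

for step in range(2000):
    state = torch.zeros(4)             # shared (toy) observation
    with torch.no_grad():
        victim_move = victim(state).sample()        # frozen victim's predictable habit
    dist = adversary(state)
    adv_move = dist.sample()
    reward = 1.0 if adv_move == victim_move else -1.0  # "win" by exploiting the victim
    loss = -reward * dist.log_prob(adv_move)        # REINFORCE: update adversary only
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point the sketch tries to convey is the asymmetry of the setup: because the victim never adapts, the adversary only needs to find and exploit one fixed blind spot, not to play well in general.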
dojoteef wrote
It seems most commenters are pointing out reasons why the proposed setup seems deficient in one way or another.
But the point of the research is to highlight potential blind spots in seemingly "superhuman" models, even if the failure modes are weird edge cases that are not broadly applicable.
By first identifying these gaps, researchers can devise mitigation strategies that make training more robust. In that sense, the research is quite useful even if a knowledgeable Go player might not be impressed by the demonstrations highlighted in the paper.