
ummicantthinkof1 t1_ivjy33b wrote

On the contrary, I would not expect "search + distillation" to inherently create a policy network that is robust without search. It seems reasonable to imagine that during training KataGo has hypothesized Tromp-Taylor "just drop a stone in their territory" attacks, read out refutations through search, and discarded that line of play. The refutation would not get distilled into the policy, because it answers a line that is never chosen. But it's never chosen because, in its expected environment of full playouts, KataGo is already capable of refuting it. In a no-search environment, hypothesizing the attack would directly create pressure to counter it.
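To make that concrete, here is a toy sketch of the mechanism I have in mind. It is plain minimax on a hand-made three-position tree, not KataGo's actual MCTS or training pipeline, and every name in it (`TREE`, `LEAF`, `self_play`, the position labels) is made up for illustration. The only assumption carried over is that the policy gets training targets solely for positions that self-play actually reaches.

```python
# Toy game tree (hypothetical): position -> {move: next position}.
TREE = {
    "start": {"normal": "midgame", "attack": "after_attack"},
    "after_attack": {"refute": "attack_fails", "ignore": "attack_works"},
    "midgame": {"continue": "even_end"},
}
# Terminal values from the point of view of the player who moves at "start".
LEAF = {"attack_fails": -1.0, "attack_works": 1.0, "even_end": 0.0}


def search_value(pos, maximizing):
    """Full-depth minimax value of a position (stand-in for search)."""
    if pos in LEAF:
        return LEAF[pos]
    values = [search_value(child, not maximizing) for child in TREE[pos].values()]
    return max(values) if maximizing else min(values)


def best_move(pos, maximizing):
    """Move that search prefers for the side to play at pos."""
    choose = max if maximizing else min
    return choose(TREE[pos], key=lambda m: search_value(TREE[pos][m], not maximizing))


def self_play():
    """Play one game with search and record (position -> chosen move) targets."""
    pos, maximizing, targets = "start", True, {}
    while pos in TREE:
        move = best_move(pos, maximizing)
        targets[pos] = move  # the only positions the policy is ever trained on
        pos = TREE[pos][move]
        maximizing = not maximizing
    return targets


targets = self_play()
print(targets)
print("after_attack" in targets)
print(best_move("after_attack", False))
```

Search knows the refutation (`best_move("after_attack", False)` picks `refute`), which is exactly why the attack is never played and `after_attack` never shows up in the distillation targets. A policy used without search has therefore never been told what to do there.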

We have certainly seen odd behavior when Go-playing AIs are well ahead, to the extent of idly filling in territory to lose points or ignoring the death of groups. But at a certain point the game becomes close again, we return to in-distribution play, and the AI wins easily. So a ruleset that can move directly from well outside the distribution to scoring seems like a likely weakness. But if this attack isn't successful at higher playout counts, then KataGo may very well already be robust against that weakness, and it isn't necessarily true that there are others (again, since most "leave distribution by playing poorly" attacks seem to pass back through a realistic distribution on their way to victory).

I'm very sympathetic to the unreasonable cost of doing research in this domain, but "trained on playouts of 600 or 800 or whatever and then defeated at 64" seems like it has an Occam's Razor explanation: "Using a policy network in an environment unlike the one it was trained on doesn't work."
