
ARGleave t1_ius7nup wrote

I'm pretty sympathetic to this perspective. The concerning thing is that scaling up neural networks like GPT-3 is getting a lot more attention (and resources) than neurosymbolic approaches or other search-like algorithms that might solve this problem. Pure neural-net scaling does seem to be enough to get good average-case performance on-distribution for many tasks. So it's tempting to also believe that with enough scale, once you hit human-level average-case performance, you'll get human-level robustness for free as the network learns the right representation. This isn't universally believed, but I've spoken to many scaling adherents who hold some version of this view. Part of the motivation of the paper was to show this is false: even highly capable networks are quite vulnerable by themselves, and something else (whether search or a different training technique) is needed to get robustness.

3

[deleted] t1_iuslara wrote

[removed]

2

ARGleave t1_iusoxjq wrote

I'm talking about policy networks because in many systems that's all there is. OpenAI Five and AlphaStar both played without search, and adding search to those systems is a research problem in its own right. If a policy network cannot be robust without search, then I'd argue we need to put more effort as a community into developing methods like MuZero that might let us apply search to a broader range of settings, and less into just scaling up policies.

But granted, KataGo itself was designed with search, so (as your other comment also hinted at) might the policy network be vulnerable because it was not trained to win without search? The training is designed to distill the search process into the policy, so I don't think the policy should be uniquely vulnerable without search -- to the extent this distillation succeeds, the policy network without search should be comparable to an earlier checkpoint with search. However, I do think our attack faltering at 128 visits and beyond on the latest networks is a weakness, and one we're looking to address.
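
To make concrete what I mean by distilling search into the policy, here's a minimal sketch of the AlphaZero-style target (illustrative Python; the function names are mine, not KataGo's actual training code):

```python
import numpy as np

def distillation_target(visit_counts, temperature=1.0):
    """Turn MCTS root visit counts into a policy training target.

    This is the sense in which search gets "distilled" into the policy:
    the policy head is pushed toward the move distribution the search
    actually settled on.
    """
    counts = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / temperature)
    return counts / counts.sum()

def distillation_loss(policy_logits, visit_counts):
    """Cross-entropy between the network's move distribution and the search target."""
    target = distillation_target(visit_counts)
    logits = np.asarray(policy_logits, dtype=np.float64)
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return -(target * log_probs).sum()
```

To the extent this loss is driven to zero, the raw policy reproduces what the search would have played, which is why I'd compare it to an earlier checkpoint with search.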

2

[deleted] t1_iuswbpm wrote

[removed]

2

ARGleave t1_iut3tue wrote

>AI-Five & AlphaStar are continuous systems; their policy networks are basically driving the whole show and has fewer redundancies/failsafes built in. We should expect greater robustness there!

I'm confused by how you're using "continuous". My understanding is that both Dota and StarCraft have discrete action spaces. The observation space is technically discrete too (it comes from a video game), though it may be large enough that it's better modeled as continuous in some cases. Why do you expect greater robustness? It seems more challenging to be robust in a high-dimensional space, and if I remember correctly some human players even figured out ways to exploit OpenAI Five.

>The hope -- the whole point of the method! -- is that the policy & value become sufficiently general that it can do useful search in parts of the state space that are out-of-distribution.

This is a good point, and I'm excited to scale the attack to victims with more search, to test whether the method as a whole is robust at sufficient levels of search. My intuition is that if the policy and value networks are deeply flawed, then search will only reduce the severity of the problem, not eliminate it: you can't search to the end of the game most of the time, so you have to rely on the value network to judge the leaf nodes. But ultimately this is still an open empirical question.
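
To illustrate the intuition about leaf nodes, here's a toy depth-limited search sketch (all helpers are hypothetical stand-ins, not KataGo's API; values are from the perspective of the player to move):

```python
def evaluate(state, depth, value_net, legal_moves, apply_move, is_terminal, terminal_value):
    """Toy negamax search whose leaves are scored by the value net.

    Whatever the search does above the frontier, the numbers it backs up
    ultimately come from value_net at depth 0, so a value net that is
    systematically wrong in off-distribution positions can mislead the
    whole search.  (All arguments here are hypothetical placeholders.)
    """
    if is_terminal(state):
        return terminal_value(state)
    if depth == 0:
        return value_net(state)  # a flawed estimate here propagates upward
    return max(
        -evaluate(apply_move(state, m), depth - 1, value_net,
                  legal_moves, apply_move, is_terminal, terminal_value)
        for m in legal_moves(state)
    )
```

Real MCTS averages rather than maximizes and mixes in the policy prior, but the dependence on the value net at the frontier is the same.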

>It's plausible that "policy without search is comparable to an earlier checkpoint with search", but showing that policy-only needs more training does not show anything -- you need to show me that the future-policy-only would not be able to have learned your adversarial example. If you showed that the bad-policy with search produced data that still produced bad-policy, that would be really interesting!

I'm not sure I fully understand this. We train our adversarial policy for about 0.5% of the training time of the victim. Do you think 0.5% additional self-play training would solve this problem? I think the issue is that self-play gets stuck in a narrow region of state space and stops exploring.

Now, you could absolutely train KataGo against our adversary, repeat the attack against this hardened version of KataGo, train KataGo against the new adversary, and so on. This is no longer self-play in the conventional sense, though -- it's closer to something like policy-space response oracles (PSRO). That's an interesting direction to explore in future work, and we're considering it, but it has its own challenges: doing iterated best response is much more computationally demanding than the approximate best response of conventional self-play.
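
Roughly the outer loop I have in mind, to contrast it with ordinary self-play (a sketch only; `train_adversary` and `harden` are hypothetical placeholders, and each call is itself a full, expensive training run):

```python
def iterated_best_response(victim, num_rounds, train_adversary, harden):
    """Victim-hardening loop sketched above.

    train_adversary(victim): train an approximate best response to the frozen victim.
    harden(victim, adversaries): retrain the victim against the adversaries found so far.
    This is not PSRO proper (no population meta-solver), but it shows why the
    cost scales with the number of rounds of explicit best-response training.
    """
    adversaries = []
    for _ in range(num_rounds):
        adversaries.append(train_adversary(victim))   # exploit the current victim
        victim = harden(victim, adversaries)          # patch the exploits just found
    return victim, adversaries
```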

1

ummicantthinkof1 t1_ivjy33b wrote

On the contrary, I would not expect "search + distillation" to inherently create a policy network that is robust without search. It seems reasonable to imagine that during training KataGo has hypothesized Tromp-Taylor "just drop a stone in their territory" attacks, read out refutations through search, and discarded that line of play. The refutation would not get distilled into the policy, because it's a line that is never chosen. But it's never chosen precisely because, in its expected environment of full playouts, KataGo is already capable of refuting it. In a no-search environment, hypothesizing the attack would directly create pressure to learn a counter to it.

We have certainly seen odd behavior when Go-playing AIs are well ahead, to the extent of idly filling in territory to lose points or ignoring the death of groups. But at a certain point the game becomes close again, we return to in-distribution positions, and it wins easily. So it seems like a ruleset that can jump directly from well outside the distribution to scoring would be a likely weakness -- but if this attack isn't successful at higher playout rates, then KataGo may very well already be robust against that weakness, and it isn't necessarily true that there are others (again, since most "leave distribution by playing poorly" attacks seem to pass back through a realistic distribution on their way to victory).

I'm very sympathetic to the unreasonable cost of doing research in this domain, but "trained on playouts of 600 or 800 or whatever and then defeated at 64" seems like it has an Occam's razor explanation: "using a policy network in an environment unlike the one it was trained on doesn't work."

1