Viewing a single comment thread. View all comments

ARGleave t1_iuseu7k wrote

Our adversary is a forked version of KataGo, and we've not changed the scoring rules at all in our fork, so I believe the scoring is the same as KataGo used during training. When our adversary wins, I believe the victims' territory is not pass-alive -- the game ends well before that. Note pass-alive here is a pretty rigorous condition: there has to be no sequence of legal moves of the opposing color that result in emptying the territory. This is a much more stringent condition than what human players would usually mean by a territory being dead or alive.

If we look at https://goattack.alignmentfund.org/adversarial-policy-katago?row=0#no_search-board then the white territory in the bottom-left is not pass-alive. There are a sequence of moves by black that would capture all the white stones, if white played sufficiently poorly (e.g. playing next to its groups and letting black surround it). Of course, white can easily win -- and if we simply modify KataGo to prevent it from passing prematurely, it does win against this adversary.

> But it is most definitely not a small perturbation of inputs within the training distribution.

Agreed, and I don't think we ever claimed it was. This is building on the adversarial policies threat model we introduced a couple of years ago. The norm-bounded perturbation threat model is an interesting lens, but we think it's pretty limited: Gilmer et al (2018) had an interesting exploration of alternative threat models for supervised learning, and we view our work as similar in spirit to unrestricted adversarial examples.

2

KellinPelrine t1_iusv4mq wrote

I see, that's definitely meaningful that you're using KataGo fork with no scoring changes. I think I did not fully understand pass-alive - I indeed took it in a more human sense that there is no single move that capture or break it. However, if I understand now what you're saying is that there has to be no sequence of moves of arbitrary length where one side continually passes and the other continually plays moves trying to destroy their territory? If that is the definition though it seems black also has no territory in the example you linked. If white has unlimited moves with black passing every time, white can capture every black stone in the upper right (and the rest of the board). So then it would seem to me that neither side has anything on the board, formally, in which case white (KataGo) should win by komi?

1

ARGleave t1_iusxvdj wrote

I agree the top-right black territory is also not pass-alive. However, it gets counted as territory for black because there are no white stones in that region. If white had even a single stone there (even if it was dead as far as humans are concerned) then that wouldn't be counted as territory for black, and white would win by komi.

The scoring rules used are described in https://lightvector.github.io/KataGo/rules.html -- check "Tromp-Taylor rules" and then enable "SelfPlayOpts". Specifically, the scoring rules are:

>(if ScoringRule is Area)
The game ends and is scored as follows:
(if SelfPlayOpts is Enabled): Before scoring, for each color, empty all points of that color within pass-alive-territory of the opposing color.
(if TaxRule is None): A player's score is the sum of:
+1 for every point of their color.
+1 for every point in empty regions bordered by their color and not by the opposing color.
If the player is White, Komi.
The player with the higher score wins, or the game is a draw if equal score.

So, first pass-alive regions are "emptied" of opponent stones, and then each player gets points for stones of their color and in empty regions bordered by their color.

Pass-alive is defined as:

>A black or white region R is a pass-alive-group if there does not exist any sequence of consecutive pseudolegal moves of the opposing color that results in emptying R.[2]
A {maximal-non-black, maximal-non-white} region R is pass-alive-territory for {Black, White} if all {black, white} regions bordering it are pass-alive-groups, and all or all but one point in R is adjacent to a {black, white} pass-alive-group, respectively.[3]

It can be computed by Benson's algorithm.

2

KellinPelrine t1_iutis26 wrote

That makes sense. I think this gives a lot of evidence then that there's something more than just an exploit against the rules going on. It looks like it can't evaluate pass-alive properly, even though that seems to be part of the training. I saw in the games some cases (even in the "professional level" version) where even two moves in a row is enough to capture something and change the human-judgment status of a group, and not particularly unusual local situations either, definitely things that could come up in a real game. I would be curious if it ever passes "early" in a way that changes the score (even if not the outcome) in its self-play games (after being trained). Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need it to still win, and that leads to the lack of robustness here where it throws away territory it really does need.

1

ARGleave t1_iutmvdj wrote

>Or if its estimated value is off from what it should be. Perhaps for some reason it learns to play on the edge, so to speak, by throwing parts of its territory away when it doesn't need it to still win, and that leads to the lack of robustness here where it throws away territory it really does need.

That's quite possible -- although it learns to predict the score as an auxiliary head, the value function being optimized is the predicted win rate, so if it thinks it's very ahead on score it would be happy to sacrifice some points to get what it thinks is a surer win. Notably the victim's value function (predicted win rate) is usually >99.9% even on the penultimate move where it passes and has effectively thrown the game.

1