
velcher t1_j934snb wrote

You might be interested in V-information, which specifically looks at information from a computational efficiency point of view.

For example, classical mutual information says that an encrypted message and the original plaintext have high MI, but in practice it is computationally hard to recover the message from the ciphertext without the key. The V-information between the two is therefore low.
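
If it helps, here is the rough shape of the definition (sketched from memory; see Xu et al., "A Theory of Usable Information Under Computational Constraints", for the exact conditions on the predictive family $\mathcal{V}$):

```latex
% V-entropy and V-information (sketch). \mathcal{V} is a family of allowed
% predictors (e.g. a fixed neural-network class); f[x] is a distribution over Y
% given side information x, and f[\varnothing] uses no side information.
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y}\left[-\log f[x](y)\right], \qquad
H_{\mathcal{V}}(Y \mid \varnothing) = \inf_{f \in \mathcal{V}} \mathbb{E}_{y}\left[-\log f[\varnothing](y)\right],
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X).
```

So if no predictor in $\mathcal{V}$ can decrypt, conditioning on the ciphertext barely lowers the $\mathcal{V}$-entropy of the message, and $I_{\mathcal{V}}$ stays near zero even though the Shannon MI is maximal.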


velcher t1_j8xfdd7 wrote

In general, yes, being a middle author on papers with > 3 authors is not great. It's better than having nothing, though.

The best outcome you can get as a 2nd author is to be the 2nd author on a 3-author paper (PhD student, undergrad, professor), contribute seriously to the project, and get a strong letter of recommendation from the professor saying as much.


velcher t1_j4ts9n0 wrote

Disclaimer: I'm a deep RL person, so I'm speaking from a pure RL viewpoint. I have never trained an LLM with RLHF (yet ;) ).

You can think of rewards as a way of expressing preferences to the model. Then you can reason about what types of rewards to use.

  • Binary: the output is either good or bad. There is no preference among the good outputs (they are all 1) or among the bad ones (they are all 0).
  • Scale of 1-5: there are 5 preference levels in increasing order. In particular, the rank-1 choice is worth exactly 1 real value more than the rank-2 choice (see the aside below for what that real value does), so adjacent levels are implicitly assumed to be equally spaced. A small sketch of these mappings follows after this list.
  • Ranking 4 different model outputs: not sure what you mean here.
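
A minimal sketch (my own hypothetical mapping, not from any RLHF codebase) of how the first two feedback formats could be turned into scalar rewards:

```python
# Hypothetical reward mappings for the feedback formats discussed above.

def binary_reward(is_good: bool) -> float:
    # All good outputs get the same reward; all bad outputs get the same reward.
    return 1.0 if is_good else 0.0

def likert_reward(rating: int) -> float:
    # 1-5 rating mapped onto a centered scale in [-1, 1]; adjacent ratings are
    # implicitly assumed to be equally far apart in preference.
    assert 1 <= rating <= 5
    return (rating - 3) / 2.0

print(binary_reward(True), likert_reward(5), likert_reward(2))  # 1.0 1.0 -0.5
```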

Aside: Reward scale can affect the RL process. RL policies are commonly trained with the policy gradient, which weights the policy update by the magnitude of the return (the sum of rewards). So the larger your reward scale, the larger this gradient. Rewards that are too large can blow up the gradient and destabilize the policy; rewards that are too small give tiny gradients and slow convergence. Reward scale can be counteracted by the learning rate or by reward normalization, but all of this needs to be tuned for the specific task.
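
To make the scaling point concrete, here is a toy REINFORCE-style sketch (my own illustration, not code from any actual RLHF pipeline): the policy-gradient norm tracks the return scale unless the returns are normalized.

```python
# Toy REINFORCE-style update showing how return scale drives gradient magnitude.
import torch

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)   # tiny "policy" over 4 discrete actions
optimizer = torch.optim.SGD([logits], lr=1e-2)

def policy_gradient_step(return_scale: float, normalize: bool) -> float:
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((256,))                     # a batch of sampled actions
    returns = return_scale * torch.rand(256)          # fake returns at a chosen scale
    if normalize:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # REINFORCE loss: -E[log pi(a) * return]; its gradient scales with the returns.
    loss = -(dist.log_prob(actions) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    grad_norm = logits.grad.norm().item()
    optimizer.step()
    return grad_norm

print("grad norm, returns ~ 1   :", policy_gradient_step(1.0, normalize=False))
print("grad norm, returns ~ 1000:", policy_gradient_step(1000.0, normalize=False))
print("grad norm, normalized    :", policy_gradient_step(1000.0, normalize=True))
```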

Reward scaling can also interact with your RL algorithm, particularly if it uses an entropy bonus for exploration (e.g. SAC, PPO), since the entropy coefficient has to be balanced against the scale of the rewards.


velcher t1_j2jxium wrote

> Stripping away the neural network and running the underlying algorithm could be useful, since classical algorithms tend to run much faster and with less memory.

I would disagree with this. A lot of research is focused on distilling classical algorithms (e.g. search) into neural networks because a forward pass on a GPU is much faster than running the algorithm itself.
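
As a concrete (entirely toy) illustration of what such distillation looks like, here is a sketch where a small network is trained to imitate a brute-force nearest-centroid search, standing in for a much more expensive routine like tree search:

```python
# Toy distillation: supervise a small network on the outputs of a classical routine,
# so inference becomes one batched forward pass instead of running the routine.
import torch
import torch.nn as nn

torch.manual_seed(0)
centroids = torch.randn(32, 8)                       # 32 reference points in 8-D

def classical_search(x: torch.Tensor) -> torch.Tensor:
    # Brute-force argmin over pairwise distances (the "slow" teacher algorithm).
    return torch.cdist(x, centroids).argmin(dim=-1)

student = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 32))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(256, 8)                          # sample random queries
    labels = classical_search(x)                     # teacher targets from the algorithm
    loss = nn.functional.cross_entropy(student(x), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

x_test = torch.randn(1024, 8)
agree = (student(x_test).argmax(dim=-1) == classical_search(x_test)).float().mean()
print(f"student matches the search routine on {agree:.0%} of test queries")
```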


velcher t1_ixj98ee wrote

Great results! Some feedback:

  • I'm somewhat unsatisfied with the amount of human engineering and annotation pipelines that went into the agent, most importantly the "intention" mechanisms, which seem to be a key part of making the dialogue -> planning step tractable.
  • This annoyance extends to the "message filtering mechanisms" used to prevent nonsensical, incoherent messages, which seem more of a hack. Ideally, the agent should learn to converse from the objective of being an optimal player (amongst other humans): if it starts speaking gibberish, other human players can tell it is an AI, which would most likely be a bad outcome for the agent (unless the humans are blue-pilled).
  • From what I gather, it seems the agent is only trained on the "truthful" subset of the dialogue data, which means it cannot lie. Deceit seems pretty important for winning Diplomacy.
  • The sections on planning are not easy to understand concretely, specifically "Dialogue-conditional planning" and "Self-play reinforcement learning for improved value estimation". The authors paraphrase the math and logic in words and omit equations to keep things high level, but this just makes everything vaguer. Luckily, the supplementary material seems to have the details.
  • Thanks for publishing the code. This is very important for the research community. I hope FAIR continues to do this.

Also, the PDF from science.org is terrible. I can't even highlight lines with my Mac's Preview app. Please fix that if you get a chance!


velcher t1_ir5und6 wrote

Thanks for the post!

In the section "Marginals of the time-changed OU-process":

> The empirical measure $\hat\mu = \frac{1}{J} \sum_{j=1}^{J} \delta_{x^j}$

  • What is $\delta_{x^j}$? It doesn't seem to be mentioned before. Is this just the difference between the $J$ samples $x^j$?