velcher t1_j8xfdd7 wrote
Reply to [D] Coauthor Paper? by [deleted]
In general, yes, being a middle author on papers with more than 3 authors is not great. It's better than having nothing, though.
The best outcome you can get as a 2nd author is to be 2nd author on a 3-author paper (PhD student, undergrad, professor), contribute seriously to the project, and get a strong letter of recommendation from the professor confirming that contribution.
velcher t1_j8glba7 wrote
Reply to comment by dojoteef in [D] Quality of posts in this sub going down by MurlocXYZ
Could ML or simple rule-based filters help us out here?
velcher t1_j7ccfbs wrote
Yes, it is useful. The breakdown between PhD types would depend on the specific needs of the hiring organization.
velcher t1_j57upp2 wrote
Reply to comment by ukamal6 in [D] ICLR 2023 results. by East-Beginning9987
congrats! spotlight here as well. I guess we'll see each other in Kigali :)
velcher t1_j4ts9n0 wrote
Reply to [D] RLHF - What type of rewards to use? by JClub
Disclaimer: I'm a deep RL person, so I'm speaking from a pure RL viewpoint. I have never trained an LLM with RLHF (yet ;) ).
You can think of rewards as a way of expressing preferences to the model. Then you can reason about what types of rewards to use.
- Binary: the output is either good or bad. There is no preference among the good outputs (they are all 1) or among the bad outputs (they are all 0).
- Scale of 1-5: there are 5 preference levels of increasing order. In particular, the rank-1 choice is worth exactly 1 more in reward than the rank-2 choice (see the aside below for what that real value does).
- Ranking 4 different model outputs: not sure what you mean here.
Aside: reward scale can affect the RL process. RL policies are commonly trained with the "policy gradient", which weights the policy update by the scale of the return (the sum of rewards), so the larger your reward scale, the larger this gradient. Rewards that are too large can blow up the gradient and destabilize the policy; rewards that are too small produce tiny gradients and slow convergence. Reward scale can be counteracted by the learning rate or by reward normalization, but all of this needs to be tuned for the specific task.
Reward scaling also interacts with your RL algorithm, particularly if it uses an entropy term for exploration (e.g., SAC's entropy regularization, or the entropy bonus in PPO/TRPO), since the entropy coefficient is implicitly relative to the reward scale.
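As a rough sketch of the reward-scale point (a minimal REINFORCE-style loss with made-up variable names, not anything specific to RLHF), this is where the return scale enters the gradient and how simple return normalization counteracts it:

```python
import torch

def policy_gradient_loss(log_probs, returns, normalize=True):
    # log_probs: (T,) log pi(a_t | s_t) for the sampled actions
    # returns:   (T,) sum of (discounted) rewards from each step onward
    if normalize:
        # Normalizing returns removes the dependence on the raw reward scale,
        # which otherwise has to be absorbed by the learning rate.
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Policy gradient: each log-prob is weighted by the (normalized) return.
    return -(log_probs * returns.detach()).mean()
```

With normalize=False, scaling all rewards by 10 scales the gradient by roughly 10; with normalization on, the update is invariant to that choice.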
velcher t1_j2jxium wrote
Reply to [D] Is there any research into using neural networks to discover classical algorithms? by currentscurrents
> Stripping away the neural network and running the underlying algorithm could be useful, since classical algorithms tend to run much faster and with less memory.
I would disagree with this. A lot of research is focused on distilling classical algorithms (e.g. search) into neural networks because a forward pass on a GPU is much faster than running the algorithm itself.
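To make that concrete, here is a hedged sketch of such a distillation setup (all names are hypothetical; `classical_search` stands in for whatever expensive classical routine you want to amortize into a single forward pass):

```python
import torch
import torch.nn as nn

def distill(classical_search, states, num_actions, epochs=10):
    # classical_search(state) -> action index, an expensive classical algorithm
    # (e.g. tree search). We fit a small network to imitate it so that inference
    # is one GPU forward pass instead of running the search.
    net = nn.Sequential(nn.Linear(states.shape[1], 256), nn.ReLU(),
                        nn.Linear(256, num_actions))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    targets = torch.tensor([classical_search(s) for s in states])  # teacher labels
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(net(states), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net  # net(state) now approximates the search result in one forward pass
```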
velcher t1_j217txd wrote
Reply to [D] Has any research been done to counteract the fact that each training datapoint "pulls the model in a different direction", partly undoing learning until shared features emerge? by derpderp3200
https://arxiv.org/abs/2001.06782 Gradient Surgery for Multi-Task Learning
This is some related work from multi-task RL. My impression, though, was that it only moderately helps multi-task RL.
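For context, the core operation in that paper (PCGrad) is easy to state: whenever two task gradients conflict (negative dot product), project one away from the other. A rough paraphrase of the idea, not the authors' implementation:

```python
import torch

def pcgrad(task_grads):
    # task_grads: list of flat gradient tensors, one per task.
    # Returns the summed, surgery-adjusted gradient.
    adjusted = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        for j, g_j in enumerate(task_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            if dot < 0:  # conflicting gradients
                g = g - dot / g_j.norm() ** 2 * g_j  # remove the conflicting component
        adjusted.append(g)
    return torch.stack(adjusted).sum(dim=0)
```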
velcher t1_iydz6bc wrote
Reply to [D] CPU - which one to choose? by krzaki_
I built my servers with AMD Ryzens. No issues here. AMD Ryzen gives you more bang for your buck as well.
velcher t1_ixj98ee wrote
Reply to [R] Human-level play in the game of Diplomacy by combining language models with strategic reasoning — Meta AI by hughbzhang
Great results! Some feedback:
- I'm somewhat unsatisfied with the amount of human engineering / annotation pipelines that went into the agent, most importantly the "intention" mechanisms, which seem to be a key part of making the dialogue -> planning step tractable.
- This annoyance extends somewhat to the "message filtering mechanisms" used to prevent nonsensical, incoherent messages, which feel more like a hack. Ideally, the agent should learn to converse from the objective of being an optimal player (among humans): if it starts speaking gibberish, the other human players can tell it is an AI, which would most likely be a bad outcome for the agent (unless the humans are blue-pilled).
- From what I gather, it seems like it is only trained on the "truthful" subset of the dialogue data, which means the agent cannot lie. Deceit seems pretty important for winning Diplomacy.
- The sections on planning are not easy to understand concretely, specifically "Dialogue-conditional planning" and "Self-play reinforcement learning for improved value estimation". The authors seem to paraphrase the math and logic in words and omit equations to keep it high level, but this just makes everything more vague. Luckily, the supplemental seems to have the details.
- Thanks for publishing the code. This is very important for the research community. I hope FAIR continues to do this.
Also, the PDF from science.org is terrible. I can't even highlight lines with my Mac's Preview app. Please fix that if you get a chance!
velcher t1_ir5und6 wrote
Thanks for the post!
In the section: Marginals of the time-changed OU-process
> The empirical measure \hat\mu = 1/J \sum_{j=1}^J \delta_{x_i}
- What is \delta_{x_i}? It doesn't seem to be mentioned before. Is this just the difference between the J samples x^j?
velcher t1_j934snb wrote
Reply to [D] Formalising information flow in NN by bjergerk1ng
You might be interested in V-information, which specifically looks at information from a computational efficiency point of view.
For example, classical mutual information says that an encrypted message and the original message have high MI, but in practice it is hard to recover the message from the ciphertext, so the V-information is low in this case.
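For reference, the definition is from Xu et al., "A Theory of Usable Information Under Computational Constraints" (paraphrased from memory, so take the notation loosely): for a predictive family \mathcal{V},

H_\mathcal{V}(Y \mid X) = \inf_{f \in \mathcal{V}} \mathbb{E}_{x,y}\left[ -\log f[x](y) \right], \qquad I_\mathcal{V}(X \to Y) = H_\mathcal{V}(Y \mid \varnothing) - H_\mathcal{V}(Y \mid X).

Roughly, if \mathcal{V} contains all functions this recovers Shannon mutual information, while for a restricted family (say, small neural networks) an encrypted message carries essentially no usable information about the plaintext, matching the intuition above.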