Competitive-Rub-1958 t1_izil2ps wrote

I feel this paper could've been written significantly more clearly and fairly. While I understand that the authors wanted a punchy title declaring "poor" 0-shot performance, it reads a bit like LLMs can't understand context or reason very well (this is just my impression and opinion, though).

From Section 4.2: the average human gets 86.2% correct; the best LLM gets 80.6% with a natural-language prompt and 81.7% with a structured prompt, both few-shot.

My main gripe is that disambiguating implicature is fundamentally a reasoning task. Because of the inherent ambiguity, you have to generate multiple hypotheses and test each against the context to see which fits best. With enough context, the task becomes simpler.

So they should've evaluated with chain-of-thought (CoT) prompting. They even mention in the paper that they tried other prompt templates as alternatives - but don't test with CoT? This is a very recent paper with some well-known authors. We've seen CoT help on almost all tasks - including turning inverse scaling into U-shaped scaling. I don't see why this task gets a pass.
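
For concreteness, here's a minimal sketch of the kind of CoT prompt I mean. The dialogue format, the worked example, and its reasoning text are my own illustration, not the paper's actual templates:

```python
# Minimal sketch of a chain-of-thought prompt for the binary implicature
# task. The worked example and its reasoning are illustrative, not the
# paper's prompt templates.

COT_EXAMPLE = (
    "Esther asked 'Can you come to my party on Friday?' and Juan "
    "responded 'I have to work.'\n"
    "Reasoning: Juan doesn't answer yes or no directly. Having to work "
    "on Friday conflicts with attending a party on Friday, so the most "
    "plausible reading of his reply is a refusal.\n"
    "Answer: no\n"
)

def build_cot_prompt(question: str, response: str) -> str:
    """Prepend one worked example, then ask the model to reason
    step by step before committing to a yes/no answer."""
    return (
        COT_EXAMPLE
        + "\n"
        + f"Esther asked '{question}' and Juan responded '{response}.'\n"
        + "Reasoning: Let's think step by step."
    )

print(build_cot_prompt(
    "Are you coming to the gym today?",
    "My knee is still sore",
))
```

The point is that the reasoning step forces the model to surface and test hypotheses about the speaker's intent before answering, which is exactly where implicature resolution should benefit.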

If someone tests this against ChatGPT (to further probe the RLHF hypothesis) and with CoT, I'll be satisfied that understanding implicature 0-shot is indeed hard for LLMs.

88

RomanRiesen t1_izkj431 wrote

I was about to write "neither the title nor the abstract manages to 1-shot communicate their ideas or research to me", but it felt mean so I didn't. Also, I haven't read the paper yet.

14

egrefen t1_iznuuxy wrote

While your quip is as witty as it is potentially mean-spirited, I’d love to understand what about the title and abstract you actually found unclear.

2

leliner t1_iznom12 wrote

We did test against ChatGPT. We can't fully replicate the paper's experimental setup or the human comparison (certainly not as comprehensively as 9 prompt templates on 600 examples). Preliminary results show there's still a gap with humans, especially on particularised examples (see the last paragraph of Section 4.1 in the paper). Feel free to try CoT - it's definitely something we've thought about; for a response to that, I'll refer to Ed's comment: https://www.reddit.com/r/MachineLearning/comments/zgr7nr/comment/iznhuqz/?context=1.
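
For anyone who wants to poke at this themselves, here's a toy sketch of that kind of comparison - scoring a model across several prompt templates on a set of implicature examples. `query_model` is a hypothetical stand-in for whatever API you call; nothing here is the paper's actual evaluation code:

```python
# Toy sketch: per-template accuracy over a set of implicature examples,
# which also makes prompt sensitivity visible. `query_model` is a
# hypothetical callable, not an API from the paper.

from typing import Callable

def accuracy_per_template(
    query_model: Callable[[str], str],
    templates: list[str],   # prompt wordings with {utterance}/{response} slots
    examples: list[dict],   # each: {"utterance": ..., "response": ..., "label": "yes" or "no"}
) -> dict[str, float]:
    """Return accuracy for each prompt template."""
    scores: dict[str, float] = {}
    for template in templates:
        correct = 0
        for ex in examples:
            prompt = template.format(
                utterance=ex["utterance"], response=ex["response"]
            )
            answer = query_model(prompt).strip().lower()
            correct += int(answer.startswith(ex["label"]))
        scores[template] = correct / len(examples)
    return scores
```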

3