Submitted by mrx-ai t3_zgr7nr in MachineLearning
Paper: Large language models are not zero-shot communicators (arXiv)
Abstract:
Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. Models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.
Authors: Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, Edward Grefenstette
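For context, the evaluation described in the abstract reduces each dialogue to a binary yes/no judgement about what the response implies. A minimal sketch of how one such item could be framed as a zero-shot prompt (the prompt wording and helper names are illustrative assumptions, not the authors' actual template):

```python
# Hedged sketch of a zero-shot implicature item as a binary yes/no prompt.
# The model call is stubbed out; any completion API could be substituted.

def build_zero_shot_prompt(question: str, response: str) -> str:
    """Frame a question/response pair as the binary implicature judgement."""
    return (
        "The following is a short dialogue.\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Does the response mean yes or no? Answer with a single word."
    )


def score_answer(model_output: str, label: str) -> bool:
    """Count the item correct if the model's first word matches the gold label."""
    return model_output.strip().lower().startswith(label.lower())


# Example item from the abstract: "I wore gloves" implicates "no".
prompt = build_zero_shot_prompt("Did you leave fingerprints?", "I wore gloves.")
print(prompt)
# A real run would send `prompt` to an LLM and call
# score_answer(completion, "no") on the returned text.
```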
Competitive-Rub-1958 t1_izil2ps wrote
I feel this paper could've been written significantly more clearly and fairly. I understand the authors wanted a punchy title declaring "poor" 0-shot performance, but it reads a bit as though LLMs can't understand context or reason very well (this is just my impression and opinion, though).
From Section 4.2: the average human gets 86.2% correct, while the best LLM gets 80.6% with a natural-language prompt and 81.7% with a structured prompt, both few-shot.
My main gripe is that resolving an implicature is fundamentally a reasoning task. Because of the inherent ambiguity, you have to generate multiple hypotheses and test which one fits the context best. With enough context, that task becomes simpler.
So they should've evaluated with chain-of-thought (CoT) prompting. They even mention in the paper that they tried other prompt templates as alternatives, yet they don't test with CoT? This is a very recent paper with some well-known authors. We've seen CoT help on almost all reasoning tasks, and even turn inverse scaling into U-shaped scaling. I don't see why this task gets a pass.
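To make the suggestion concrete, here is a rough sketch of what a CoT variant of the same item could look like (the template wording and answer parsing are my own assumptions, not anything from the paper):

```python
# Hedged sketch of a chain-of-thought version of the implicature item:
# the model is asked to reason before committing to yes/no.
from typing import Optional


def build_cot_prompt(question: str, response: str) -> str:
    """Same item as before, but elicit reasoning before the verdict."""
    return (
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Let's think step by step about what the response implies, then finish "
        "with 'Final answer: yes' or 'Final answer: no'."
    )


def parse_final_answer(model_output: str) -> Optional[str]:
    """Extract the yes/no verdict from the free-form reasoning, if present."""
    text = model_output.lower()
    if "final answer: yes" in text:
        return "yes"
    if "final answer: no" in text:
        return "no"
    return None  # unparseable outputs would be scored as incorrect


print(build_cot_prompt("Did you leave fingerprints?", "I wore gloves."))
```

The fixed "Final answer:" marker is just there so the free-form reasoning can still be scored automatically against the binary label.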
If someone runs this against ChatGPT to further probe the RLHF hypothesis, and with CoT prompting, I'll be satisfied that understanding implicature zero-shot is indeed hard for LLMs.