LLMs Struggle with Real Conversations – MultiChallenge Exposes Their Biggest Weaknesses
LLMs seem great in chat, but what happens when a conversation gets complex? A new benchmark, MultiChallenge, reveals that even top AI models fail to maintain consistency, recall past details, and follow instructions over multiple turns. The best model? o1 (December 2024) – at just 44.93% accuracy. Let’s break down what this means for AI’s future.


Posted by
Daniel Welsh
Fri, 07 Feb 2025
Why AI Still Struggles with Real Conversations
Ever had an AI assistant forget key details in a long chat? Or change its stance when asked the same question twice? Turns out, even the best AI models today have serious issues handling multi-turn conversations. Enter MultiChallenge, a new benchmark from Scale AI that tests how well Large Language Models (LLMs) manage real-world dialogue.
And the results? Not great.
Even the top-performing model, o1 (December 2024), scored just 44.93% accuracy – proving that LLMs still have a long way to go before they can truly hold complex, natural conversations.
The 4 Major AI Failures in Multi-Turn Conversations
MultiChallenge isn’t just another chatbot test. It pushes LLMs into four tough conversation challenges that mimic real user interactions:
1️⃣ Instruction Retention – Can the AI follow an instruction given at the start of the chat, even 10 messages later?
2️⃣ Inference Memory – Can the AI recall and connect details from earlier in the conversation when responding to new questions?
3️⃣ Reliable Versioned Editing – Can the AI track multiple revisions in a conversation without losing details or contradicting itself?
4️⃣ Self-Coherence – Can the AI remain logically consistent, or does it agree with the user even when that contradicts its own previous response?
Spoiler: All frontier models failed.
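To make the first category concrete, here is a minimal, hypothetical sketch of what an "Instruction Retention" test case might look like. The data structure, field names, and keyword-based checker below are illustrative assumptions for this post, not MultiChallenge's actual format or scoring (the benchmark grades responses with model-based judges, not string matching):

```python
from dataclasses import dataclass

@dataclass
class MultiTurnTestCase:
    # Conversation turns alternating user/assistant roles, ending on a user turn.
    turns: list
    # Constraint stated in turn 1 that the final response must still honor.
    retained_instruction: str
    # Crude surface checks standing in for a model-based judge.
    must_contain: str = ""
    must_not_contain: str = ""

def grade_response(case: MultiTurnTestCase, final_response: str) -> bool:
    """Pass only if the final response still obeys the turn-1 instruction."""
    text = final_response.lower()
    if case.must_contain and case.must_contain.lower() not in text:
        return False
    if case.must_not_contain and case.must_not_contain.lower() in text:
        return False
    return True

# Example: an instruction given at the start must survive later turns.
case = MultiTurnTestCase(
    turns=[
        {"role": "user", "content": "From now on, end every answer with 'Done.'"},
        {"role": "assistant", "content": "Understood. Done."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    retained_instruction="End every answer with 'Done.'",
    must_contain="Done.",
)

print(grade_response(case, "The capital of France is Paris. Done."))  # True
print(grade_response(case, "The capital of France is Paris."))        # False
```

The point of the sketch is the shape of the task: the constraint lives many turns before the response being graded, so the model must carry it forward rather than answer the latest message in isolation.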
How AI Models Performed (And Why It Matters)
Despite performing well on older benchmarks, today’s leading LLMs struggled badly on MultiChallenge:
🥇 o1 (December 2024) – 44.93% accuracy (best overall)
🥈 Claude 3.5 Sonnet (October 2024) – 43.20% accuracy
🥉 Gemini 2.0 Pro Experimental (February 2025) – 40.67% accuracy
4️⃣ o3-mini (medium) – 40.09% accuracy
5️⃣ Gemini 2.0 Flash Thinking Experimental (January 2025) – 37.78% accuracy
6️⃣ o1-preview – 37.28% accuracy
7️⃣ Gemini 2.0 Flash (February 2025) – 36.88% accuracy
8️⃣ o1-mini – 34.49% accuracy
Why does this matter? If LLMs can't reliably track details, edit without introducing mistakes, or hold consistent positions across a conversation, they can't be trusted with professional tasks like legal research, medical assistance, or customer service.
The Future: Can AI Improve Multi-Turn Conversations?
The MultiChallenge results expose a critical gap in LLM development. Instead of just improving single-turn question-answering, AI researchers now need to focus on true conversational intelligence – ensuring models:
✅ Remember and apply long-term instructions
✅ Connect past details with current responses
✅ Stay consistent across multi-turn interactions
This is the next big challenge for AI – and benchmarks like MultiChallenge will help track progress.
📄 Paper: http://arxiv.org/abs/2501.17399
🏆 Leaderboard: https://scale.com/leaderboard/multichallenge