Introduction to Meta’s Llama 4 Models
Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash “across a broad range of widely reported benchmarks.” This release positioned Meta’s open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google.
Maverick’s Impressive Performance
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta’s press release, the company highlighted Maverick’s Elo score of 1417, which placed it above OpenAI’s GPT-4o and just under Gemini 2.5 Pro. A higher Elo score means the model wins more often in the arena when going head-to-head with competitors.
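For context, LMArena’s leaderboard uses the standard Elo rating scheme, in which a rating gap maps directly to an expected head-to-head win rate. The short Python sketch below illustrates that relationship; the second rating is a hypothetical competitor chosen for illustration, not an actual LMArena entry.

```python
# Minimal sketch of the standard Elo expected-score formula, showing how a
# rating gap translates into an expected win rate in head-to-head matchups.
# The ratings below are illustrative; only 1417 comes from Meta's announcement.

def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

if __name__ == "__main__":
    maverick = 1417   # score Meta cited for its LMArena submission
    rival = 1380      # hypothetical lower-rated competitor
    print(f"Expected win rate: {expected_win_rate(maverick, rival):.1%}")
    # A gap of about 37 points corresponds to roughly a 55% expected win rate.
```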
The Discovery of an Optimized Model
However, AI researchers digging through Meta’s documentation discovered something unusual. In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn’t the same as what’s available to the public. According to Meta’s own materials, it deployed an “experimental chat version” of Maverick to LMArena that was specifically “optimized for conversationality.” This raised concerns about the fairness of the benchmarking process.
LMArena’s Response
LMArena posted on X two days after the model’s release, stating that “Meta’s interpretation of our policy did not match what we expect from model providers.” The site updated its leaderboard policies to reinforce its commitment to fair, reproducible evaluations and prevent similar confusion in the future. A spokesperson for Meta, Ashley Gabriel, said in an emailed statement that “we experiment with all types of custom variants.”
The Concerns of AI Researchers
Independent AI researcher Simon Willison tells The Verge, “It’s the most widely respected general benchmark because all of the other ones suck.” Willison was impressed by Llama 4’s performance but noted that its second-place arena finish, just behind Gemini 2.5 Pro, was misleading, since that ranking was earned by the optimized variant rather than the model the public can actually download. A rumor also began circulating in the AI community that Meta had trained its Llama 4 models to perform better on benchmarks while hiding their real limitations.
Meta’s Response to Accusations
Ahmad Al-Dahle, Meta’s VP of generative AI, addressed the accusations in a post on X, stating that “We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that.” According to Al-Dahle, the variable quality people are seeing comes down to implementations that still need to be stabilized. The company’s path to releasing Llama 4 wasn’t exactly smooth: the model failed to meet internal expectations, and its launch was repeatedly pushed back.
Conclusion
The release of Llama 4 has highlighted the importance of fair and transparent benchmarking in the AI community. Submitting an optimized variant to LMArena puts developers in a difficult position: they rely on benchmarks to decide which models to build on, and a score earned by a version they can’t actually use tells them little. As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds, with companies like Meta eager to be seen as leaders even if it means gaming the system, a practice that ultimately undermines the integrity of the evaluation and leads to misleading results.