Digital Event Horizon
Researchers at Hugging Face have revamped the Open LLM Leaderboard using Math-Verify, resulting in improved accuracy and fairness of math evaluations. This has led to a reshuffling of the leaderboard, with Nvidia's AceMath models now dominating the MATH-Hard leaderboard.
The Open LLM Leaderboard has long struggled to evaluate Large Language Models (LLMs) on math problems because correctly solved problems were often scored as wrong due to formatting issues. Hugging Face's Math-Verify addresses this by using robust parsing to extract answers from LLM outputs and compare them with the gold solutions. Re-scoring with Math-Verify raised measured performance by an average of 4.66 points across all models, with the two algebra-related subsets (Algebra and Prealgebra) showing the largest gains of 8.27 and 6.93 points, respectively. The changes produced new rankings, with Nvidia's AceMath models now dominating the MATH-Hard leaderboard and many other models climbing sharply.
The artificial intelligence and machine learning community has long been aware of the challenges of evaluating Large Language Models (LLMs) on math problems. For years, researchers and developers have grappled with issues such as the inability to extract answers from LLM outputs in a format that can be compared against gold-standard solutions. The Open LLM Leaderboard on the Hugging Face Hub, which lets users compare the performance of models on math tasks, has been directly affected by this problem.
To address these challenges and provide more accurate and fair evaluations, researchers turned to Math-Verify, a solution that uses robust parsing techniques to extract answers from LLM outputs and check them for mathematical equivalence. In a recent update, Hugging Face announced that it has used Math-Verify to re-evaluate all 3,751 models ever submitted to the Open LLM Leaderboard.
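For readers who want to try the same style of answer checking locally, the sketch below illustrates the basic usage pattern, assuming the pip-installable math-verify package and the parse/verify interface described in its repository; the example expressions are illustrative rather than drawn from the leaderboard tasks.

```python
# pip install math-verify
# Minimal sketch of format-robust answer checking (assumed interface).
from math_verify import parse, verify

# A gold solution and a model answer that express the same set in
# different surface forms.
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")

# verify() compares the parsed mathematical objects for equivalence,
# so formatting differences alone no longer turn a correct answer into
# a scored failure. Argument order matters: gold first, then answer.
print(verify(gold, answer))  # expected: True
```

Used as the leaderboard's answer extractor, this kind of equivalence check is what allows responses that were previously rejected on format grounds to be counted as correct.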
The impact of this change has been significant. On average, models solved 61 more problems, equating to a 4.66-point increase in their scores. The two subsets with the largest improvements were both algebra-related (Algebra and Prealgebra), with gains of 8.27 and 6.93 points, respectively. In extreme cases, some models improved by nearly 90 points on these subsets.
The introduction of Math-Verify has reshuffled the leaderboard, with Nvidia's AceMath models now dominating the MATH-Hard leaderboard. Other major beneficiaries of this change are the Qwen derivatives, which now occupy nearly all of the positions directly below AceMath.
However, the changes do not stop there. The overall leaderboard results have also shifted significantly, with many models jumping 200 or more places in the rankings. This reflects the improved accuracy and fairness of the evaluations provided by Math-Verify.
The new leaderboard rankings can be seen in full at the Open LLM Leaderboard. With this update, Hugging Face has taken a major step forward in providing users with reliable results for evaluating their models on math tasks. We look forward to seeing how this will impact the wider community of researchers and developers working in this field.
Related Information:
https://huggingface.co/blog/math_verify_leaderboard
https://undercodenews.com/fixing-the-open-llm-leaderboard-with-math-verify-a-more-accurate-and-fair-evaluation-system/
https://github.com/huggingface/blog/blob/main/math_verify_leaderboard.md
Published: Fri Feb 14 10:41:29 2025 by llama3.2 3B Q4_K_M