Yuvraj Khanna, Raghav Rastogi, Dhruv Kumar, Peter Relan, Hung Tran
Introduction
When math has to deal with language, we have found a fascinating way to encode words as numbers. But when language has to deal with math, how good is it? At MathGPT.ai, our R&D team has spent over two years working with LLMs, exploring how language models handle mathematical reasoning.
Ready for a shake-up in how you view AI math skills? Let’s look beyond the leaderboard hype – which AI math model stands firm when the words change but the numbers don’t? Our carefully developed MathGPT.ai stress tests show that Microsoft’s Phi 4 reasoning model crushed Apple’s advanced GSM SYM p2 math reasoning benchmark at 91.1% – a clear win, right? Not so fast. At MathGPT.ai, when we merely rephrased the problems (keeping all the math and logic identical), this champion and Alibaba’s Qwen 3 8B both saw their accuracy plunge by an identical 13.33%. Meanwhile, Nvidia’s OpenMath Nemo 7B (83.33% on our variation stress test) and Microsoft’s own ultra-stable Phi 4 reasoning plus (a mere -1.11% drop vs. -13.33% for Phi 4 reasoning without the plus) reveal what true mathematical resilience looks like. This deep dive explores which open-source AI models truly are robust reasoners, and which fall short when similar but out-of-distribution problems are presented.
Before diving into our findings, it’s crucial to delineate the boundaries of this MathGPT.ai evaluation: to ensure transparency and maintain our specific research focus, we are exclusively examining open-source models, all of which fall under the 15-billion-parameter mark and are therefore categorized as Small Language Models (SLMs). This allows us to specifically investigate the mathematical reasoning and robustness of models broadly available to the community without requiring enormous compute. Therefore, prominent and typically much larger closed-source models, including potential May 2025 frontrunners like Claude 4, Gemini 2.5 Pro, or new OpenAI “o” series models (e.g., o3, o4-mini), were not part of this evaluation, as they fall outside the intended scope of this particular experiment.
Understanding the Benchmarks
To fairly evaluate these language models, a diverse set of mathematical problem-solving datasets was used. Each dataset tests different aspects of mathematical reasoning and stability:
- GSM1k (Grade School Math 8k test set): A 1,000-problem test set drawn from GSM8k, which was developed by OpenAI and released in 2021. It focuses on assessing how well models can solve the multi-step mathematical problems typically found in grade school curricula. (Further details: https://arxiv.org/abs/2110.14168)
- GSM SYM (Symbolic Stability for Grade School Math): Introduced by Apple in 2024, this suite of test sets (main, p1, and p2) probes the mathematical stability of models. The datasets vary mathematical numbers and concepts to ensure a thorough evaluation. (Further details: https://machinelearning.apple.com/research/gsm-symbolic). Building on problems from GSM8k, it tests:
- GSM SYM main: Performance when problem wording and numbers are varied while keeping the mathematical solution logic the same.
- GSM SYM p1: Performance when one additional reasoning step is added.
- GSM SYM p2: Performance when two additional reasoning steps are added.
- MATH500: Drawn from the MATH benchmark created by Dan Hendrycks and released in 2021, this dataset comprises 500 problems sampled from the much larger MATH test set. It is designed to evaluate models on more challenging high school and introductory college-level mathematics, with problems often sourced from competitions like the AIME (American Invitational Mathematics Examination) and AMC (American Mathematics Competitions). (Further details: https://github.com/hendrycks/math)
- MathGPT Variation Testing: Released by MathGPT.ai in 2024, this custom test set specifically measures how a model’s accuracy is affected when only the non-mathematical elements (wording, context) of a word problem are altered, while the underlying mathematical numbers, concepts, and solution path remain identical. The original problems are derived from the MATH test set. A significant drop in accuracy on these variations highlights a model’s difficulty in consistently applying the same mathematical logic to differently phrased, yet fundamentally identical, problems. (Further details: https://www.mathgpt.ai/blog/)
- Original 90: Performance on 90 hand-picked problems from the MATH test set.
- Variant 4 90: Performance on the Variation 4 rewrites of the Original 90 problems. The question wording is fully reformulated, and the variable labels and names mentioned are changed, while every number and mathematical concept is left untouched. This means the solution does not change (up to the change in language semantics).
- Accuracy Drop: The difference between Variant 4 and Original 90 accuracy (see the sketch after this list). A pronounced drop here is particularly telling because the underlying mathematical numbers, concepts, and solution remain identical to the original problems, unlike GSM SYM main, which also varies the numbers relative to GSM8k. A large drop therefore highlights a dependence on pattern familiarity rather than logical reasoning.
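To make the Accuracy Drop metric concrete, here is a minimal Python sketch of how such a comparison could be scored. The problem texts, gold answers, predictions, and function names are hypothetical illustrations for this post, not items or code from the actual MathGPT.ai test sets or harness.

```python
# Minimal sketch of the Accuracy Drop metric described above.
# All problem texts, gold answers, and predictions below are hypothetical
# illustrations; they are not items from the actual Original 90 / Variant 4 90 sets.

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of problems whose extracted final answer matches the gold answer."""
    return sum(p.strip() == g.strip() for p, g in zip(predicted, gold)) / len(gold)

# A hypothetical "Variant 4"-style rewrite: wording, names, and variable labels change,
# but every number, concept, and the solution path stay identical.
original_problem = "A train travels 60 miles in 1.5 hours. What is its average speed?"
variant_problem = "Priya cycles 60 miles over the course of 1.5 hours. What was her average speed?"
gold_answers = ["40"]  # 60 / 1.5 = 40; the same gold answer for both phrasings

# Toy predictions standing in for a model's extracted final answers.
predictions_original = ["40"]  # solved when phrased the familiar way
predictions_variant = ["90"]   # fails once only the surface wording changes

# Accuracy Drop = Variant 4 accuracy - Original 90 accuracy (negative means a drop).
drop = accuracy(predictions_variant, gold_answers) - accuracy(predictions_original, gold_answers)
print(f"Accuracy drop: {drop:+.2%}")  # -100.00% on this one-problem toy example
```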
Why Apple’s GSM SYM p2 is a superior evaluation benchmark
Apple’s GSM SYM p2 benchmark stands out as a particularly insightful tool for evaluating mathematical reasoning in language models due to its targeted design. By systematically adding two extra reasoning steps to problems derived from the well-established GSM8k dataset, GSM SYM p2 directly probes a model’s capacity for deeper, sequential logical deduction—a known weak point for many current AI systems. This controlled increase in complexity allows for a clearer assessment of how a model’s reasoning abilities scale (or falter) when faced with more intricate problems, rather than just variations in wording or numerical values alone. Its focus on problems requiring more extensive thought processes makes it a more challenging and thus more discriminative measure of a model’s true mathematical problem-solving skills, moving beyond surface-level pattern matching to test for more robust inferential capabilities.
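For intuition, here is a hypothetical illustration in the spirit of this design (not an actual GSM SYM item): a base problem might state that a bakery sells 24 muffins a day at $2 each and ask for the daily revenue (24 × 2 = $48). A p1-style variant adds one reasoning step, say a 25% weekend discount (48 × 0.75 = $36), and a p2-style variant layers on a second step, say subtracting $10 of daily ingredient costs (36 − 10 = $26). Each added clause forces one more chained inference before the final answer, which is exactly the kind of depth GSM SYM p2 stresses.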
Why AIME 24 or AIME 25 are not included in this blog
Mathematical proficiency can be viewed across a spectrum of increasing difficulty, from foundational grade school math (like GSM8k), to more complex high school and early college-level problems (covered by benchmarks such as MATH/MATH500), and finally to highly advanced Olympiad-level mathematics (which includes problems similar to those in AIME).
At MathGPT.ai, our primary mission is to maximize the positive impact of AI in education and enhance its utility for everyday mathematical problem-solving and student success. For this reason, our current evaluations, especially for Small Language Models (SLMs), concentrate on the first two tiers—grade school math with enhanced reasoning complexity (like GSM SYM p2) and challenging high school/college math. We believe mastery in these areas is crucial for models intended to assist with common educational and real-world quantitative tasks. While Olympiad-level problems like those from AIME 24/25 are excellent for pushing the boundaries of AI reasoning at the highest levels, an overemphasis on such elite competition math might not provide the most relevant picture of an SLM’s practical utility for a broader audience. As AI capabilities continue to evolve, we plan to rigorously include Olympiad-level challenges in our evaluations for research purposes.
Standout models
Our comprehensive evaluations have pinpointed several models that demonstrate remarkable capabilities or intriguing performance characteristics, especially when considering the balance between raw problem-solving power and robustness to linguistic variations. The following tables summarize the performance of all evaluated open-source SLMs under 15B parameters, first on Grade School Math (GSM) benchmarks, and then on advanced math (MATH 500) and our linguistic variation tests.
Overall Performance on Grade School Math Benchmarks (GSM Series)
This table highlights model performance across various GSM datasets, which primarily test reasoning on grade-school level mathematics with increasing complexity.
Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
---|---|---|---|---|---|
Qwen3 0.6B | 0.75 | 72.06% | 58.24% | 33.39% | 74.98% |
Nvidia OpenMath Nemo 1.5B | 1.54 | 78.53% | 66.05% | 48.39% | 78.76% |
Qwen2.5 1.5B | 1.54 | 28.52% | 26.06% | 13.37% | 33.99% |
Qwen2.5 Math 1.5B | 1.54 | 79.36% | 67.28% | 50.48% | 84.29% |
DeepScaler 1.5B | 1.78 | 84.61% | 76.99% | 60.90% | 86.65% |
Qwen 3 1.7B | 2.03 | 87.50% | 82.05% | 71.11% | 89.84% |
Qwen2.5 3B | 3.09 | 75.91% | 67.60% | 44.08% | 82.56% |
Phi 4 mini | 3.84 | 86.46% | 79.17% | 63.54% | 89.16% |
Phi 4 mini reasoning | 3.84 | 91.52% | 90.09% | 85.31% | 88.55% |
Qwen 3 4B | 4.02 | 91.34% | 88.46% | 83.10% | 95.00% |
Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 78.14% | 68.53% | 50.76% | 83.32% |
Nvidia AceMath RL Nemo 7B | 7.62 | 89.00% | 86.59% | 84.23% | 92.64% |
Nvidia OpenMath Nemo 7B | 7.62 | 87.21% | 81.78% | 76.52% | 88.86% |
Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 80.02% | 80.04% | 78.71% | 79.98% |
DeepSeek R1 0528 Qwen 8B | 8.19 | 91.65% | 86.86% | 83.44% | 93.40% |
Qwen 3 8B | 8.19 | 91.51% | 90.58% | 87.90% | 94.69% |
Phi 4 | 14.7 | 94.54% | 91.60% | 88.32% | 95.07% |
Phi 4 reasoning | 14.7 | 90.31% | 91.20% | 91.10% | 90.70% |
Phi 4 reasoning plus | 14.7 | 82.57% | 82.87% | 84.68% | 83.36% |
DeepSeek R1 Qwen 14B | 14.8 | 93.34% | 92.16% | 90.56% | 95.38% |
Qwen 3 14B | 14.8 | 93.94% | 92.94% | 89.36% | 96.29% |
Overall Performance on Advanced Math (MATH 500) & Linguistic Robustness (MathGPT.ai Variation Testing)
This table shifts focus to performance on more challenging high school/college-level math and, crucially, how models withstand linguistic variations in problems where the underlying math remains identical.
Model | Params (B) | MATH 500 | Original 90 | Variant 4 90 | Accuracy Drop |
---|---|---|---|---|---|
Qwen3 0.6B | 0.75 | 69.74% | 61.11% | 62.22% | +1.11% |
Nvidia OpenMath Nemo 1.5B | 1.54 | 86.80% | 77.78% | 71.11% | -6.67% |
Qwen2.5 1.5B | 1.54 | 40.60% | 28.89% | 30.00% | +1.11% |
Qwen2.5 Math 1.5B | 1.54 | 72.89% | 67.78% | 64.44% | -3.33% |
DeepScaler 1.5B | 1.78 | 85.60% | 84.44% | 72.22% | -12.22% |
Qwen 3 1.7B | 2.03 | 84.57% | 77.78% | 71.11% | -6.67% |
Qwen2.5 3B | 3.09 | 60.52% | 53.33% | 50.00% | -3.33% |
Phi 4 mini | 3.84 | 58.32% | 56.67% | 47.78% | -8.89% |
Phi 4 mini reasoning | 3.84 | 88.60% | 88.89% | 76.67% | -12.22% |
Qwen 3 4B | 4.02 | 91.37% | 84.44% | 75.56% | -8.89% |
Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 85.37% | 81.11% | 71.11% | -10.00% |
Nvidia AceMath RL Nemo 7B | 7.62 | 93.40% | 90.00% | 81.11% | -8.89% |
Nvidia OpenMath Nemo 7B | 7.62 | 92.00% | 90.00% | 83.33% | -6.67% |
Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 89.20% | 86.67% | 78.89% | -7.78% |
DeepSeek R1 0528 Qwen 8B | 8.19 | 87.40% | 85.56% | 76.67% | -8.89% |
Qwen 3 8B | 8.19 | 88.80% | 87.78% | 74.44% | -13.33% |
Phi 4 | 14.7 | 80.40% | 77.78% | 75.56% | -2.22% |
Phi 4 reasoning | 14.7 | 89.96% | 92.22% | 78.89% | -13.33% |
Phi 4 reasoning plus | 14.7 | 82.96% | 83.33% | 82.22% | -1.11% |
DeepSeek R1 Qwen 14B | 14.8 | 93.90% | 89.56% | 82.45% | -7.11% |
Qwen 3 14B | 14.8 | 92.59% | 83.33% | 77.78% | -5.55% |
Note: All evaluations are conducted using hybrid reasoning generation. Accuracy is therefore lower than with all-thinking or forced-thinking generation (outputs beginning with a filled <think> … </think> block) but higher than with forced non-thinking generation (outputs beginning with an empty <think></think> block).
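For readers who want to probe these modes themselves, the following is a minimal sketch (not our actual evaluation harness) of how hybrid, forced-thinking, and forced non-thinking prompts could be constructed with Hugging Face transformers for a hybrid-reasoning chat model. The model name and the specific prefill strings are assumptions for illustration; some chat templates also expose an enable_thinking flag that achieves a similar effect.

```python
# Sketch of the three generation modes referenced in the note above, assuming a
# hybrid-reasoning chat model that emits <think>...</think> blocks (e.g. a Qwen3-style model).
# The model name and prefill strings are illustrative assumptions, not our exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # hypothetical choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "A train travels 60 miles in 1.5 hours. What is its average speed?"}]

# Hybrid generation: render the default chat template and let the model decide
# how much reasoning (if any) to emit before the final answer.
hybrid_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Forced thinking: prefill an opening <think> tag so the completion must begin
# inside a reasoning block.
forced_thinking_prompt = hybrid_prompt + "<think>\n"

# Forced non-thinking: prefill an empty <think></think> pair so the model skips
# the reasoning block and answers directly.
forced_non_thinking_prompt = hybrid_prompt + "<think>\n\n</think>\n\n"

# Generate with the hybrid prompt (the setting used for the numbers in this post).
inputs = tokenizer(hybrid_prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```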
Evaluation Performance Analysis
For 1B to 2B parameter models
Models in this range (0.75B to 2.03B parameters based on the available data) show a developing ability to tackle mathematical problems, with some clear standouts for various tasks.
Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
---|---|---|---|---|---|
Qwen3 0.6B | 0.75 | 72.06% | 58.24% | 33.39% | 74.98% |
Nvidia OpenMath Nemo 1.5B | 1.54 | 78.53% | 66.05% | 48.39% | 78.76% |
Qwen2.5 1.5B | 1.54 | 28.52% | 26.06% | 13.37% | 33.99% |
Qwen2.5 Math 1.5B | 1.54 | 79.36% | 67.28% | 50.48% | 84.29% |
DeepScaler 1.5B | 1.78 | 84.61% | 76.99% | 60.90% | 86.65% |
Qwen 3 1.7B | 2.03 | 87.50% | 82.05% | 71.11% | 89.84% |
GSM SYM p2
- Alibaba’s Qwen 3 1.7B (2.03B) is the leader in this group for complex grade school reasoning, scoring an impressive 71.11%.
- DeepScaler 1.5B (1.78B) follows with a strong 60.90%.
- Other models like Alibaba’s Qwen2.5 Math 1.5B (50.48%) and Nvidia’s OpenMath Nemo 1.5B (48.39%) show moderate performance, while Alibaba’s Qwen3 0.6B (33.39%) shows foundational capabilities.
Model | Params (B) | MATH 500 | Original 90 | Variant 4 90 | Accuracy Drop |
---|---|---|---|---|---|
Qwen3 0.6B | 0.75 | 69.74% | 61.11% | 62.22% | +1.11% |
Nvidia OpenMath Nemo 1.5B | 1.54 | 86.80% | 77.78% | 71.11% | -6.67% |
Qwen2.5 1.5B | 1.54 | 40.60% | 28.89% | 30.00% | +1.11% |
Qwen2.5 Math 1.5B | 1.54 | 72.89% | 67.78% | 64.44% | -3.33% |
DeepScaler 1.5B | 1.78 | 85.60% | 84.44% | 72.22% | -12.22% |
Qwen 3 1.7B | 2.03 | 84.57% | 77.78% | 71.11% | -6.67% |
MATH 500
- Nvidia’s OpenMath Nemo 1.5B (1.54B) excels here with a leading score of 86.80% on advanced math problems.
- DeepScaler 1.5B (1.78B) is a strong second at 85.60%.
- Alibaba’s Qwen 3 1.7B (2.03B) also performs well with 84.57%.
MathGPT.ai Variation testing
- Models showing a good balance of high accuracy on varied problems and low accuracy drop include Alibaba’s Qwen 3 1.7B (Variant 4: 71.11% / Drop: -6.67%) and Nvidia’s OpenMath Nemo 1.5B (Variant 4: 71.11% / Drop: -6.67%). DeepScaler 1.5B also achieves a high Variant 4 score of 72.22% but with a larger drop of -12.22%.
- Alibaba’s Qwen3 0.6B (0.75B) is exceptionally robust, showing a +1.11% increase in accuracy on variations (Variant 4 at 62.22%). Similarly, Alibaba’s Qwen2.5 1.5B also showed a +1.11% increase, indicating excellent stability for problems these models can solve, though their overall GSM SYM p2 and MATH 500 scores are lower.
For 3B to 5B parameter models
This category (3.09B to 4.51B parameters based on available data) demonstrates a significant step up in mathematical reasoning capabilities.
Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
---|---|---|---|---|---|
Qwen2.5 3B | 3.09 | 75.91% | 67.60% | 44.08% | 82.56% |
Phi 4 mini | 3.84 | 86.46% | 79.17% | 63.54% | 89.16% |
Phi 4 mini reasoning | 3.84 | 91.52% | 90.09% | 85.31% | 88.55% |
Qwen 3 4B | 4.02 | 91.34% | 88.46% | 83.10% | 95.00% |
Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 78.14% | 68.53% | 50.76% | 83.32% |
GSM SYM p2
- Microsoft’s Phi 4 mini reasoning (3.84B) leads impressively in handling complex grade-school problems, scoring 85.31%.
- Alibaba’s Qwen 3 4B (4.02B) is very competitive at 83.10%.
- Nvidia’s Llama 3.1 Nemo Nano 4B (4.51B) scores 50.76% on this challenging benchmark.
Model | Params (B) | MATH 500 | Original 90 | Variant 4 90 | Accuracy Drop |
---|---|---|---|---|---|
Qwen2.5 3B | 3.09 | 60.52% | 53.33% | 50.00% | -3.33% |
Phi 4 mini | 3.84 | 58.32% | 56.67% | 47.78% | -8.89% |
Phi 4 mini reasoning | 3.84 | 88.60% | 88.89% | 76.67% | -12.22% |
Qwen 3 4B | 4.02 | 91.37% | 84.44% | 75.56% | -8.89% |
Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 85.37% | 81.11% | 71.11% | -10.00% |
MATH 500
- Alibaba’s Qwen 3 4B (4.02B) takes the top spot in advanced mathematics with a strong 91.37%.
- Microsoft’s Phi 4 mini reasoning (3.84B) follows closely with an excellent 88.60%.
- Nvidia’s Llama 3.1 Nemo Nano 4B (4.51B) also performs well with 85.37%.
MathGPT.ai Variation testing
- Microsoft’s Phi 4 mini reasoning achieves 76.67% on Variant 4, though with a notable drop of -12.22%.
- Alibaba’s Qwen 3 4B scores 75.56% on Variant 4 with a drop of -8.89%.
- Nvidia’s Llama 3.1 Nemo Nano 4B scores 71.11% on Variant 4 with a -10.00% drop.
- Alibaba’s Qwen2.5 3B shows good stability with a -3.33% drop, while its Variant 4 accuracy is 50.00%.
For 7B to 9B parameter models
Models in this range (7.62B to 8.19B parameters) demonstrate advanced capabilities, representing a significant step up in complexity and potential, though performance and robustness can vary.
Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
---|---|---|---|---|---|
Nvidia AceMath RL Nemo 7B | 7.62 | 89.00% | 86.59% | 84.23% | 92.64% |
Nvidia OpenMath Nemo 7B | 7.62 | 87.21% | 81.78% | 76.52% | 88.86% |
Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 80.02% | 80.04% | 78.71% | 79.98% |
DeepSeek R1 0528 Qwen 8B | 8.19 | 91.65% | 86.86% | 83.44% | 93.40% |
Qwen 3 8B | 8.19 | 91.51% | 90.58% | 87.90% | 94.69% |
GSM SYM p2
- Alibaba’s Qwen 3 8B (8.19B) is the strongest performer in this category on complex multi-step grade school problems, scoring 87.90%.
- Nvidia’s AceMath RL Nemo 7B (7.62B) also performs very well at 84.23%.
- DeepSeek’s R1 0528 Qwen 8B (8.19B) is close behind with 83.44%.
Model | Params (B) | MATH 500 | Original 90 | Variant 4 90 | Accuracy Drop |
---|---|---|---|---|---|
Nvidia AceMath RL Nemo 7B | 7.62 | 93.40% | 90.00% | 81.11% | -8.89% |
Nvidia OpenMath Nemo 7B | 7.62 | 92.00% | 90.00% | 83.33% | -6.67% |
Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 89.20% | 86.67% | 78.89% | -7.78% |
DeepSeek R1 0528 Qwen 8B | 8.19 | 87.40% | 85.56% | 76.67% | -8.89% |
Qwen 3 8B | 8.19 | 88.80% | 87.78% | 74.44% | -13.33% |
MATH 500
- Nvidia’s AceMath RL Nemo 7B (7.62B) leads this category on advanced math problems with 93.40%.
- Nvidia’s OpenMath Nemo 7B (7.62B) also shows strong performance at 92.00%.
- DeepSeek’s R1 0528 Qwen 8B (8.19B) scores 87.40%, while Alibaba’s Qwen 3 8B achieves 88.80%.
MathGPT.ai Variation testing
- Nvidia’s OpenMath Nemo 7B shows an excellent balance of high Variant 4 accuracy (83.33%) and a relatively low accuracy drop (-6.67%), making it a standout for robustness in this category.
- Other Nvidia models, AceMath RL Nemo 7B (Variant 4: 81.11% / Drop: -8.89%) and Llama 3.1 Nemo Nano 8B (Variant 4: 78.89% / Drop: -7.78%), also offer good robustness. DeepSeek’s R1 0528 Qwen 8B also shows a manageable drop of -8.89% with a Variant 4 score of 76.67%.
- Alibaba’s Qwen 3 8B, while a strong performer on GSM SYM p2, shows higher sensitivity to linguistic variations with a drop of -13.33% (Variant 4 at 74.44%).
For 14B to 15B parameter models
This higher parameter range (14.7B to 14.8B based on available data) includes some of the overall top-performing and most robust models in our evaluation.
Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
---|---|---|---|---|---|
Phi 4 | 14.7 | 94.54% | 91.60% | 88.32% | 95.07% |
Phi 4 reasoning | 14.7 | 90.31% | 91.20% | 91.10% | 90.70% |
Phi 4 reasoning plus | 14.7 | 82.57% | 82.87% | 84.68% | 83.36% |
DeepSeek R1 Qwen 14B | 14.8 | 93.34% | 92.16% | 90.56% | 95.38% |
Qwen 3 14B | 14.8 | 93.94% | 92.94% | 89.36% | 96.29% |
GSM SYM p2
- Microsoft’s Phi 4 reasoning (14.7B) is the leader in complex grade school reasoning with an impressive 91.10%, the highest in this entire evaluation on this metric.
- DeepSeek’s R1 Qwen 14B (14.8B) is very strong at 90.56%.
- Alibaba’s Qwen 3 14B (14.8B) scores 89.36%, and the base Microsoft Phi 4 (14.7B) scores a solid 88.32%.
Model | Params (B) | MATH 500 | Original 90 | Variant 4 90 | Accuracy Drop |
---|---|---|---|---|---|
Phi 4 | 14.7 | 80.40% | 77.78% | 75.56% | -2.22% |
Phi 4 reasoning | 14.7 | 89.96% | 92.22% | 78.89% | -13.33% |
Phi 4 reasoning plus | 14.7 | 82.96% | 83.33% | 82.22% | -1.11% |
DeepSeek R1 Qwen 14B | 14.8 | 93.90% | 89.56% | 82.45% | -7.11% |
Qwen 3 14B | 14.8 | 92.59% | 83.33% | 77.78% | -5.55% |
MATH 500
- DeepSeek’s R1 Qwen 14B (14.8B) achieves the highest score in advanced mathematics with an outstanding 93.90%.
- Alibaba’s Qwen 3 14B (14.8B) is also very strong at 92.59%.
- Microsoft’s Phi 4 reasoning (14.7B) follows with 89.96%.
MathGPT.ai Variation testing
- Microsoft’s Phi 4 reasoning plus (14.7B) stands out as exceptionally robust, achieving 82.22% on Variant 4 with an incredibly low accuracy drop of only -1.11%, the best stability shown by any model in this evaluation.
- The base Microsoft Phi 4 (14.7B) is also very robust with a -2.22% drop (Variant 4 at 75.56%).
- DeepSeek’s R1 Qwen 14B shows good robustness with a Variant 4 score of 82.45% and a drop of -7.11%.
- Alibaba’s Qwen 3 14B (14.8B) also demonstrates good robustness for its size with a -5.55% drop (Variant 4 at 77.78%).
- Microsoft’s Phi 4 reasoning, despite its high peak accuracy on GSM SYM p2, shows significant sensitivity to linguistic variations with a large drop of -13.33%.
For the Phi 4 family of models
The Phi 4 family, encompassing models from 3.84B (“mini”) to 14.7B parameters, presents a study in how architectural variations, specialized tuning (“reasoning,” “reasoning plus”), and scale interact to influence mathematical prowess.
Scaling and Specialization
- Moving from Phi 4 mini (3.84B) to the larger Phi 4 (14.7B) generally shows an increase in capability. For instance, on GSM SYM p2, Phi 4 mini scores 63.54%, while Phi 4 scores 88.32%.
- The “reasoning” variants significantly boost performance on complex tasks. Phi 4 mini reasoning (3.84B) achieves an impressive 85.31% on GSM SYM p2, far surpassing its non-reasoning counterpart and even outperforming some larger, more general models. This trend continues with Phi 4 reasoning (14.7B), which hits 91.10% on GSM SYM p2, the highest in this specific benchmark among the Phi family.
Performance on Key Benchmarks
- Complex Grade School Math (GSM SYM p2): The “reasoning” models shine here: Phi 4 mini reasoning (85.31%) and Phi 4 reasoning (91.10%) are standouts. Phi 4 reasoning plus (14.7B) scores a solid 84.68%.
- Advanced Math (MATH 500): Again, reasoning-focused models lead within their size class. Phi 4 mini reasoning (88.60%) is very strong for its size. Phi 4 reasoning (89.96%) leads the 14.7B Phi models, followed by Phi 4 reasoning plus (82.96%) and the base Phi 4 (80.40%).
Robustness (MathGPT Variation Testing)
- Phi 4 reasoning plus (14.7B) is the star for robustness. It achieves a Variant 4 accuracy of 82.22% with an exceptionally low accuracy drop of just -1.11%, the best in the entire table. This suggests a very consistent and reliable reasoning ability that is not easily perturbed by changes in problem phrasing.
- The base Phi 4 (14.7B) model also demonstrates excellent robustness with a Variant 4 score of 75.56% and a minimal drop of -2.22%.
- In contrast, the higher-performing “reasoning” variants show more sensitivity to variations: Phi 4 mini reasoning has a drop of -12.22% (Variant 4 at 76.67%), and Phi 4 reasoning has a drop of -13.33% (Variant 4 at 78.89%).
The Phi 4 family highlights a clear trade-off. Specialized “reasoning” fine-tuning can significantly elevate performance on complex math problems. However, this can sometimes come at the cost of robustness to linguistic variations. The “reasoning plus” variant seems to strike a different balance, achieving slightly lower peak accuracies than the “reasoning” version but demonstrating remarkable consistency and stability, implying a potentially deeper and more generalized understanding. The “mini” versions, particularly Phi 4 mini reasoning, show that smaller models can achieve remarkable mathematical competence when properly specialized.
For the Qwen 3 family of models
The Qwen 3 series, spanning from 0.6B to 14.8B parameters in this evaluation, provides a clear view of how mathematical capabilities can evolve with scale and targeted design. Each member of the family carves out a niche, showcasing different strengths.
Overall Family Strengths & Scaling Trends
- The Qwen 3 family generally demonstrates strong performance on grade-school mathematics (GSM1k, GSM SYM main) and shows consistent improvement in handling more complex reasoning steps (GSM SYM p1, p2) as model size increases.
- The scaling is particularly evident on GSM1k, where the 14B model posts the highest score in our evaluation (96.29%), and on GSM SYM p2, a challenging multi-step reasoning benchmark.
- There’s a consistent upward trend in capabilities, affirming that increased parameter count, when part of a coherent model family, generally leads to better mathematical reasoning skills.
Performance on Key Benchmarks
- GSM SYM p2
- Qwen3 0.6B: 33.39% (foundational, but limited on complex steps)
- Qwen3 1.7B: 71.11% (strong leader in <2B class, good multi-step reasoning)
- Qwen3 4B: 83.10% (excellent, competes with larger models)
- Qwen 3 8B: 87.90% (highest in 7-8B class, very strong reasoning)
- Qwen 3 14B: 89.36% (highest in Qwen 3 family, excellent complex reasoning)
- MATH 500
- Qwen3 0.6B: 69.74% (shows some capability on harder math)
- Qwen3 1.7B: 84.57% (very competent for its size)
- Qwen3 4B: 91.37% (top performer in 3-4B class, strong on advanced math)
- Qwen 3 8B: 88.80% (good, but notably lower than its 4B sibling on this specific test, indicating performance is not always monotonic with size on all benchmarks)
- Qwen 3 14B: 92.59% (highest in Qwen 3 family, strong advanced math capability)
Robustness on MathGPT.ai Variation Testing
- Qwen3 0.6B: Exceptional robustness with a +1.11% accuracy increase on Variant 4 (62.22% accuracy). This suggests that for the problems it can solve, its understanding is quite stable against rephrasing. However, its overall accuracy on harder benchmarks is low.
- Qwen3 1.7B: Good robustness with a Variant 4 accuracy of 71.11% and a drop of -6.67%.
- Qwen3 4B: Solid Variant 4 accuracy at 75.56%, with a drop of -8.89%.
- Qwen 3 8B: Shows more sensitivity, with a Variant 4 accuracy of 74.44% and a larger drop of -13.33%. This is the least robust model in the Qwen 3 family based on this metric.
- Qwen 3 14B: Achieves good robustness for its size, with a Variant 4 accuracy of 77.78% and a relatively low drop of -5.55%. This indicates that the largest model in the family also maintains good stability.
In summary, the Qwen 3 family offers a compelling range of models, with smaller versions like the 1.7B and 4B providing excellent performance-to-size ratios, and the larger 14B model delivering top-tier results with good robustness. The 0.6B model’s stability is a noteworthy characteristic for a model of its size.
The curious case of 7B to 9B: a dead zone?
The 7B to 9B parameter models in our evaluation (Nvidia AceMath RL Nemo 7B, Nvidia OpenMath Nemo 7B, Nvidia Llama 3.1 Nemo Nano 8B, DeepSeek R1 0528 Qwen 8B, and Qwen 3 8B) present a nuanced performance landscape. While these models often show improvements over the 3-5B class, the gains aren’t always proportional to the increase in parameters, and they don’t consistently match the robustness or specialized peaks of the 14B+ models.
Alibaba’s Qwen 3 8B leads this pack on the complex GSM SYM p2 benchmark at 87.90%. However, its MATH 500 score (88.80%) is actually lower than Alibaba’s smaller Qwen3 4B (91.37%). Similarly, DeepSeek R1 0528 Qwen 8B scores 83.44% on GSM SYM p2 and 87.40% on MATH 500. Nvidia’s offerings in this range, like AceMath RL Nemo 7B (93.40% on MATH 500) and OpenMath Nemo 7B (92.00% on MATH 500), demonstrate strong capabilities on advanced mathematics.
A key concern in this range is robustness. Nvidia’s OpenMath Nemo 7B is a standout for stability, achieving 83.33% on our MathGPT Variant 4 test with a relatively low accuracy drop of -6.67%. In contrast, Alibaba’s Qwen 3 8B shows significant fragility with a -13.33% drop. DeepSeek R1 0528 Qwen 8B is more moderate with a -8.89% drop.
This mixed picture suggests a few possibilities:
- Optimization Challenges: The 7-9B range might be an “in-between” scale where achieving optimal math-specific tuning is particularly complex, or where current architectures don’t scale as predictably for math reasoning compared to smaller or much larger sizes.
- Developer Strategy: Some developers might prioritize very efficient smaller models or push directly to larger scales (14B+) for flagship capabilities, leading to less focus on perfecting this mid-tier for specialized tasks. For instance, Microsoft’s Phi series jumps from ~4B to ~14.7B.
- Cost-Benefit Trade-offs: The resource cost to train and fine-tune 7-9B models might lead to expectations of performance leaps that are only consistently met by even larger models, making this tier a difficult value proposition for some applications if a highly performant 4B model or an ultra-robust 14B+ model is available.
While capable models exist in the 7-9B range, particularly from Nvidia regarding robustness and advanced math, users seeking a significant all-around jump from the 4B class, especially in combined peak performance and linguistic stability, might find the 14B+ models like Microsoft’s Phi 4 reasoning plus (with its minimal -1.11% drop) or DeepSeek R1 Qwen 14B (93.90% MATH 500, -7.11% drop) to be more compelling alternatives.
Standout models
We now offer summary thoughts on the standout models from the tables above.
Qwen3 1.7B: Leading Small Model for Grade School Reasoning
The Qwen3 1.7B model shines as a top performer among models under 2 billion parameters. Its most notable achievement is its 71.11% score on the GSM SYM p2 benchmark, which tests complex, multi-step grade school math problems. This performance is the best in its size class and even surpasses some larger models, indicating a strong reasoning capability for its parameter count. Its proficiency is further supported by high scores on other GSM benchmarks (e.g., 89.84% on GSM1k). In terms of robustness, it shows a -6.67% accuracy drop on the MathGPT variation test (with 71.11% on Variant 4), which is respectable, though it indicates some sensitivity to how problems are phrased.
Qwen3 4B: A Powerful All-Rounder Bridging Performance Gaps
Qwen3 4B demonstrates a significant advancement in mathematical abilities within the 3-4 billion parameter category. It achieves excellent scores across grade school math evaluations, including over 90% on GSM SYM main (91.34%) and GSM1k (95.00%). Its 83.10% on GSM SYM p2 is particularly impressive, making it competitive with, and sometimes comparable to, models nearly twice its size in handling intricate reasoning. A substantial improvement is also observed on the MATH 500 dataset, where it scores 91.37%, indicating strong capabilities in advanced high school and college-level mathematics. While a robust performer, it does experience an accuracy drop of -8.89% on the MathGPT variation test (Variant 4 at 75.56%), highlighting a common challenge among high-performing models when faced with linguistic variations.
Phi 4 reasoning plus: The Benchmark for Robust Mathematical Reasoning
The Phi 4 reasoning plus (14.7B) model carves out a unique position due to its exceptional robustness and consistent performance. While it may not always top the charts for raw accuracy on every single benchmark, its standout feature is the minimal accuracy drop of just -1.11% on the MathGPT variation test, alongside a strong 82.22% score on the Variant 4 problems themselves. This remarkable stability suggests a deeper, more generalizable mathematical understanding, allowing it to interpret and solve problems reliably even when the wording changes. It maintains solid accuracies generally in the low to mid-80s across challenging evaluations like GSM SYM p2 (84.68%) and MATH 500 (82.96%). This combination of consistent high performance and unparalleled stability implies that Phi 4 reasoning plus may rely more on genuine, methodical reasoning rather than just recognizing patterns from its training data, making it an excellent candidate for applications where reliable mathematical interpretation is paramount.
Future work
We are exploring new reasoning strategies, experimenting with innovative data and variations of math problems, and expect to share breakthrough benchmarks in our next blog. Stay tuned!!