Yuvraj Khanna, Raghav Rastogi, Dhruv Kumar, Peter Relan, Hung Tran

Introduction

When math has to deal with language, we have found fascinating ways to encode words as numbers. But when language has to deal with math, how good is it? At MathGPT.ai, our R&D team has spent over two years working with LLMs, exploring how these models handle mathematical reasoning.

Ready for a shake-up in how you view AI math skills? Let’s look beyond the leaderboard hype: which AI math model stands firm when the words change but the numbers don’t? Our carefully developed MathGPT.ai evaluations show Microsoft’s Phi 4 reasoning model topping Apple’s challenging GSM SYM p2 math reasoning benchmark at 91.1%. A clear win, right? Not so fast. When we merely rephrased the problems (keeping all the math and logic identical), this champion and Alibaba’s Qwen 3 8B both saw their accuracies plunge by an identical 13.33%. Meanwhile, Nvidia’s OpenMath Nemo 7B (83.33% on the variation stress test) and Microsoft’s ultra-stable Phi 4 reasoning plus (a mere -1.11% drop versus -13.33% for Phi 4 reasoning without the plus) reveal what true mathematical resilience looks like. This deep dive explores which open-source AI models are truly robust reasoners, and which falter when similar but out-of-distribution problems are presented.

Before diving into our findings, it’s important to delineate the boundaries of this MathGPT.ai evaluation: we exclusively examine open-source Small Language Models (SLMs), all under the 15 billion parameter mark, to ensure transparency and maintain our specific research focus. This lets us investigate the mathematical reasoning and robustness of models broadly available to the community without requiring enormous compute. Prominent and typically much larger closed-source models, including potential May 2025 frontrunners like Claude 4, Gemini 2.5 Pro, or the newer OpenAI “o” series models (e.g., o3, o4-mini), were therefore not part of this evaluation, as they fall outside the intended scope of this experiment.

Understanding the Benchmarks

To fairly evaluate these language models, a diverse set of mathematical problem-solving datasets was used. Each dataset tests different aspects of mathematical reasoning and stability:

GSM1K: Why Apple’s GSM SYM p2 is a superior evaluation benchmark

Apple’s GSM SYM p2 benchmark stands out as a particularly insightful tool for evaluating mathematical reasoning in language models due to its targeted design. By systematically adding two extra reasoning steps to problems derived from the well-established GSM8k dataset, GSM SYM p2 directly probes a model’s capacity for deeper, sequential logical deduction—a known weak point for many current AI systems. This controlled increase in complexity allows for a clearer assessment of how a model’s reasoning abilities scale (or falter) when faced with more intricate problems, rather than just variations in wording or numerical values alone. Its focus on problems requiring more extensive thought processes makes it a more challenging and thus more discriminative measure of a model’s true mathematical problem-solving skills, moving beyond surface-level pattern matching to test for more robust inferential capabilities.
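To make this concrete, here is a rough, hypothetical sketch (not Apple’s actual GSM-Symbolic pipeline) of how a GSM8k-style templated problem can be extended with two extra clauses, each adding one more arithmetic step to the required solution, which is the kind of controlled depth increase that GSM SYM p2 applies:

```python
# Hypothetical illustration of a "p2"-style problem: a base template plus two
# extra clauses, each forcing one more arithmetic step in the solution chain.

STEM = "{name} picks {a} apples on Monday and {b} apples on Tuesday."
EXTRA = (" {name} then gives {c} apples to a neighbor."
         " A friend then doubles whatever {name} has left.")
QUESTION = " How many apples does {name} have now?"

values = dict(name="Maya", a=12, b=7, c=4)

original_problem = (STEM + QUESTION).format(**values)          # answer: 12 + 7 = 19
p2_style_problem = (STEM + EXTRA + QUESTION).format(**values)  # answer: (12 + 7 - 4) * 2 = 30

print(original_problem)
print(p2_style_problem)
```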

Why AIME 24 or AIME 25 are not included in this blog

Mathematical proficiency can be viewed across a spectrum of increasing difficulty, from foundational grade school math (like GSM8k), to more complex high school and early college-level problems (covered by benchmarks such as MATH/MATH500), and finally to highly advanced Olympiad-level mathematics (which includes problems similar to those in AIME).

At MathGPT.ai, our primary mission is to maximize the positive impact of AI in education and enhance its utility for everyday mathematical problem-solving and student success. For this reason, our current evaluations, especially for Small Language Models (SLMs), concentrate on the first two tiers—grade school math with enhanced reasoning complexity (like GSM SYM p2) and challenging high school/college math. We believe mastery in these areas is crucial for models intended to assist with common educational and real-world quantitative tasks. While Olympiad-level problems like those from AIME 24/25 are excellent for pushing the boundaries of AI reasoning at the highest levels, an overemphasis on such elite competition math might not provide the most relevant picture of an SLM’s practical utility for a broader audience. As AI capabilities continue to evolve, we plan to rigorously include Olympiad-level challenges in our evaluations for research purposes.

Overall results


Our comprehensive evaluations have pinpointed several models that demonstrate remarkable capabilities or intriguing performance characteristics, especially when considering the balance between raw problem-solving power and robustness to linguistic variations. The following tables summarize the performance of all evaluated open-source SLMs under 15B parameters, first on Grade School Math (GSM) benchmarks, and then on advanced math (MATH 500) and our linguistic variation tests.
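To make the variation-testing columns in the second table below concrete, here is a minimal, made-up example of the kind of rewrite involved (not one of our actual 90 test items): the surface wording changes, but every number, operation, and the final answer stay the same.

```python
# Hypothetical original problem and a MathGPT.ai-style linguistic variant:
# identical numbers and logic, different surface wording.

original = ("A bakery sells 24 muffins in the morning and 18 in the afternoon. "
            "Each muffin costs $3. How much money does the bakery make that day?")

variant = ("Over the course of one day, a bakery moves 24 muffins before noon and "
           "another 18 after noon, charging $3 apiece. What are its muffin "
           "earnings for the day?")

# Both phrasings reduce to the same computation; a robust model should get both right.
answer = (24 + 18) * 3
assert answer == 126
```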

Overall Performance on Grade School Math Benchmarks (GSM Series)

This table highlights model performance across various GSM datasets, which primarily test reasoning on grade-school level mathematics with increasing complexity.

| Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
|---|---|---|---|---|---|
| Qwen3 0.6B | 0.75 | 72.06% | 58.24% | 33.39% | 74.98% |
| Nvidia OpenMath Nemo 1.5B | 1.54 | 78.53% | 66.05% | 48.39% | 78.76% |
| Qwen2.5 1.5B | 1.54 | 28.52% | 26.06% | 13.37% | 33.99% |
| Qwen2.5 Math 1.5B | 1.54 | 79.36% | 67.28% | 50.48% | 84.29% |
| DeepScaler 1.5B | 1.78 | 84.61% | 76.99% | 60.90% | 86.65% |
| Qwen 3 1.7B | 2.03 | 87.50% | 82.05% | 71.11% | 89.84% |
| Qwen2.5 3B | 3.09 | 75.91% | 67.60% | 44.08% | 82.56% |
| Phi 4 mini | 3.84 | 86.46% | 79.17% | 63.54% | 89.16% |
| Phi 4 mini reasoning | 3.84 | 91.52% | 90.09% | 85.31% | 88.55% |
| Qwen 3 4B | 4.02 | 91.34% | 88.46% | 83.10% | 95.00% |
| Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 78.14% | 68.53% | 50.76% | 83.32% |
| Nvidia AceMath RL Nemo 7B | 7.62 | 89.00% | 86.59% | 84.23% | 92.64% |
| Nvidia OpenMath Nemo 7B | 7.62 | 87.21% | 81.78% | 76.52% | 88.86% |
| Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 80.02% | 80.04% | 78.71% | 79.98% |
| DeepSeek R1 0528 Qwen 8B | 8.19 | 91.65% | 86.86% | 83.44% | 93.40% |
| Qwen 3 8B | 8.19 | 91.51% | 90.58% | 87.90% | 94.69% |
| Phi 4 | 14.7 | 94.54% | 91.60% | 88.32% | 95.07% |
| Phi 4 reasoning | 14.7 | 90.31% | 91.20% | 91.10% | 90.70% |
| Phi 4 reasoning plus | 14.7 | 82.57% | 82.87% | 84.68% | 83.36% |
| DeepSeek R1 Qwen 14B | 14.8 | 93.34% | 92.16% | 90.56% | 95.38% |
| Qwen 3 14B | 14.8 | 93.94% | 92.94% | 89.36% | 96.29% |

Overall Performance on Advanced Math (MATH 500) & Linguistic Robustness (MathGPT.ai Variation Testing)

This table shifts focus to performance on more challenging high school/college-level math and, crucially, how models withstand linguistic variations in problems where the underlying math remains identical.

| Model | Params (B) | MATH 500 | Original (90 problems) | Variant 4 (90 problems) | Accuracy Drop |
|---|---|---|---|---|---|
| Qwen3 0.6B | 0.75 | 69.74% | 61.11% | 62.22% | +1.11% |
| Nvidia OpenMath Nemo 1.5B | 1.54 | 86.80% | 77.78% | 71.11% | -6.67% |
| Qwen2.5 1.5B | 1.54 | 40.60% | 28.89% | 30.00% | +1.11% |
| Qwen2.5 Math 1.5B | 1.54 | 72.89% | 67.78% | 64.44% | -3.33% |
| DeepScaler 1.5B | 1.78 | 85.60% | 84.44% | 72.22% | -12.22% |
| Qwen 3 1.7B | 2.03 | 84.57% | 77.78% | 71.11% | -6.67% |
| Qwen2.5 3B | 3.09 | 60.52% | 53.33% | 50.00% | -3.33% |
| Phi 4 mini | 3.84 | 58.32% | 56.67% | 47.78% | -8.89% |
| Phi 4 mini reasoning | 3.84 | 88.60% | 88.89% | 76.67% | -12.22% |
| Qwen 3 4B | 4.02 | 91.37% | 84.44% | 75.56% | -8.89% |
| Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 85.37% | 81.11% | 71.11% | -10.00% |
| Nvidia AceMath RL Nemo 7B | 7.62 | 93.40% | 90.00% | 81.11% | -8.89% |
| Nvidia OpenMath Nemo 7B | 7.62 | 92.00% | 90.00% | 83.33% | -6.67% |
| Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 89.20% | 86.67% | 78.89% | -7.78% |
| DeepSeek R1 0528 Qwen 8B | 8.19 | 87.40% | 85.56% | 76.67% | -8.89% |
| Qwen 3 8B | 8.19 | 88.80% | 87.78% | 74.44% | -13.33% |
| Phi 4 | 14.7 | 80.40% | 77.78% | 75.56% | -2.22% |
| Phi 4 reasoning | 14.7 | 89.96% | 92.22% | 78.89% | -13.33% |
| Phi 4 reasoning plus | 14.7 | 82.96% | 83.33% | 82.22% | -1.11% |
| DeepSeek R1 Qwen 14B | 14.8 | 93.90% | 89.56% | 82.45% | -7.11% |
| Qwen 3 14B | 14.8 | 92.59% | 83.33% | 77.78% | -5.55% |

Note: All evaluations are conducted using hybrid reasoning generation. Accuracies will therefore be lower than with all-thinking or forced-thinking generation (outputs beginning with a populated <think> … </think> block) but higher than with forced non-thinking generation (outputs beginning with an empty <think></think>).
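As a simplified sketch of how such runs can be scored (an illustration under our own assumptions, not our exact evaluation harness), the snippet below strips any <think> … </think> block from a hybrid-mode generation, takes the last number in the visible text as the model’s answer, and computes the accuracy drop between the original and variant sets.

```python
import re

def extract_answer(generation: str) -> str | None:
    """Drop any chain-of-thought block, then take the last number as the answer."""
    visible = re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", visible.replace(",", ""))
    return numbers[-1] if numbers else None

def accuracy(generations: list[str], gold_answers: list[str]) -> float:
    correct = sum(extract_answer(g) == a for g, a in zip(generations, gold_answers))
    return 100.0 * correct / len(gold_answers)

# Hypothetical usage: original_gens, variant_gens, and gold come from the eval run.
# drop = accuracy(variant_gens, gold) - accuracy(original_gens, gold)
# e.g. for Phi 4 reasoning in the table above: 78.89% - 92.22% = -13.33%.
```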

Evaluation Performance Analysis

For 1B to 2B parameter models

Models in this range (0.75B to 2.03B parameters based on the available data) show a developing ability to tackle mathematical problems, with some clear standouts for various tasks.

| Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
|---|---|---|---|---|---|
| Qwen3 0.6B | 0.75 | 72.06% | 58.24% | 33.39% | 74.98% |
| Nvidia OpenMath Nemo 1.5B | 1.54 | 78.53% | 66.05% | 48.39% | 78.76% |
| Qwen2.5 1.5B | 1.54 | 28.52% | 26.06% | 13.37% | 33.99% |
| Qwen2.5 Math 1.5B | 1.54 | 79.36% | 67.28% | 50.48% | 84.29% |
| DeepScaler 1.5B | 1.78 | 84.61% | 76.99% | 60.90% | 86.65% |
| Qwen 3 1.7B | 2.03 | 87.50% | 82.05% | 71.11% | 89.84% |


| Model | Params (B) | MATH 500 | Original (90 problems) | Variant 4 (90 problems) | Accuracy Drop |
|---|---|---|---|---|---|
| Qwen3 0.6B | 0.75 | 69.74% | 61.11% | 62.22% | +1.11% |
| Nvidia OpenMath Nemo 1.5B | 1.54 | 86.80% | 77.78% | 71.11% | -6.67% |
| Qwen2.5 1.5B | 1.54 | 40.60% | 28.89% | 30.00% | +1.11% |
| Qwen2.5 Math 1.5B | 1.54 | 72.89% | 67.78% | 64.44% | -3.33% |
| DeepScaler 1.5B | 1.78 | 85.60% | 84.44% | 72.22% | -12.22% |
| Qwen 3 1.7B | 2.03 | 84.57% | 77.78% | 71.11% | -6.67% |


For 3B to 5B parameter models

This category (3.09B to 4.51B parameters based on available data) demonstrates a significant step up in mathematical reasoning capabilities.

| Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
|---|---|---|---|---|---|
| Qwen2.5 3B | 3.09 | 75.91% | 67.60% | 44.08% | 82.56% |
| Phi 4 mini | 3.84 | 86.46% | 79.17% | 63.54% | 89.16% |
| Phi 4 mini reasoning | 3.84 | 91.52% | 90.09% | 85.31% | 88.55% |
| Qwen 3 4B | 4.02 | 91.34% | 88.46% | 83.10% | 95.00% |
| Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 78.14% | 68.53% | 50.76% | 83.32% |



| Model | Params (B) | MATH 500 | Original (90 problems) | Variant 4 (90 problems) | Accuracy Drop |
|---|---|---|---|---|---|
| Qwen2.5 3B | 3.09 | 60.52% | 53.33% | 50.00% | -3.33% |
| Phi 4 mini | 3.84 | 58.32% | 56.67% | 47.78% | -8.89% |
| Phi 4 mini reasoning | 3.84 | 88.60% | 88.89% | 76.67% | -12.22% |
| Qwen 3 4B | 4.02 | 91.37% | 84.44% | 75.56% | -8.89% |
| Nvidia Llama 3.1 Nemo Nano 4B | 4.51 | 85.37% | 81.11% | 71.11% | -10.00% |


For 7B to 9B parameter models

Models in this range (7.62B to 8.19B parameters) demonstrate advanced capabilities, representing a significant step up in complexity and potential, though performance and robustness can vary.

| Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
|---|---|---|---|---|---|
| Nvidia AceMath RL Nemo 7B | 7.62 | 89.00% | 86.59% | 84.23% | 92.64% |
| Nvidia OpenMath Nemo 7B | 7.62 | 87.21% | 81.78% | 76.52% | 88.86% |
| Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 80.02% | 80.04% | 78.71% | 79.98% |
| DeepSeek R1 0528 Qwen 8B | 8.19 | 91.65% | 86.86% | 83.44% | 93.40% |
| Qwen 3 8B | 8.19 | 91.51% | 90.58% | 87.90% | 94.69% |


| Model | Params (B) | MATH 500 | Original (90 problems) | Variant 4 (90 problems) | Accuracy Drop |
|---|---|---|---|---|---|
| Nvidia AceMath RL Nemo 7B | 7.62 | 93.40% | 90.00% | 81.11% | -8.89% |
| Nvidia OpenMath Nemo 7B | 7.62 | 92.00% | 90.00% | 83.33% | -6.67% |
| Nvidia Llama 3.1 Nemo Nano 8B | 8.03 | 89.20% | 86.67% | 78.89% | -7.78% |
| DeepSeek R1 0528 Qwen 8B | 8.19 | 87.40% | 85.56% | 76.67% | -8.89% |
| Qwen 3 8B | 8.19 | 88.80% | 87.78% | 74.44% | -13.33% |


For 14B to 15B parameter models

This higher parameter range (14.7B to 14.8B based on available data) includes some of the overall top-performing and most robust models in our evaluation.

| Model | Params (B) | GSM SYM main | GSM SYM p1 | GSM SYM p2 | GSM1k |
|---|---|---|---|---|---|
| Phi 4 | 14.7 | 94.54% | 91.60% | 88.32% | 95.07% |
| Phi 4 reasoning | 14.7 | 90.31% | 91.20% | 91.10% | 90.70% |
| Phi 4 reasoning plus | 14.7 | 82.57% | 82.87% | 84.68% | 83.36% |
| DeepSeek R1 Qwen 14B | 14.8 | 93.34% | 92.16% | 90.56% | 95.38% |
| Qwen 3 14B | 14.8 | 93.94% | 92.94% | 89.36% | 96.29% |


| Model | Params (B) | MATH 500 | Original (90 problems) | Variant 4 (90 problems) | Accuracy Drop |
|---|---|---|---|---|---|
| Phi 4 | 14.7 | 80.40% | 77.78% | 75.56% | -2.22% |
| Phi 4 reasoning | 14.7 | 89.96% | 92.22% | 78.89% | -13.33% |
| Phi 4 reasoning plus | 14.7 | 82.96% | 83.33% | 82.22% | -1.11% |
| DeepSeek R1 Qwen 14B | 14.8 | 93.90% | 89.56% | 82.45% | -7.11% |
| Qwen 3 14B | 14.8 | 92.59% | 83.33% | 77.78% | -5.55% |


For the Phi 4 family of models

The Phi 4 family, encompassing models from 3.84B (“mini”) to 14.7B parameters, presents a study in how architectural variations, specialized tuning (“reasoning,” “reasoning plus”), and scale interact to influence mathematical prowess.


The Phi 4 family highlights a clear trade-off. Specialized “reasoning” fine-tuning can significantly elevate performance on complex math problems. However, this can sometimes come at the cost of robustness to linguistic variations. The “reasoning plus” variant seems to strike a different balance, achieving slightly lower peak accuracies than the “reasoning” version but demonstrating remarkable consistency and stability, implying a potentially deeper and more generalized understanding. The “mini” versions, particularly Phi 4 mini reasoning, show that smaller models can achieve remarkable mathematical competence when properly specialized.

For the Qwen 3 family of models

The Qwen 3 series, spanning from 0.6B to 14.8B parameters in this evaluation, provides a clear view of how mathematical capabilities can evolve with scale and targeted design. Each member of the family carves out a niche, showcasing different strengths.


In summary, the Qwen 3 family offers a compelling range of models, with smaller versions like the 1.7B and 4B providing excellent performance-to-size ratios, and the larger 14B model delivering top-tier results with good robustness. The 0.6B model’s stability is a noteworthy characteristic for a model of its size.

The curious case of 7B to 9B: a dead zone?

The 7B to 9B parameter models in our evaluation (Nvidia AceMath RL Nemo 7B, Nvidia OpenMath Nemo 7B, Nvidia Llama 3.1 Nemo Nano 8B, DeepSeek R1 0528 Qwen 8B, and Qwen 3 8B) present a nuanced performance landscape. While these models often show improvements over the 3-5B class, the gains aren’t always proportional to the increase in parameters, and they don’t consistently match the robustness or specialized peaks of the 14B+ models.

Alibaba’s Qwen 3 8B leads this pack on the complex GSM SYM p2 benchmark at 87.90%. However, its MATH 500 score (88.80%) is actually lower than Alibaba’s smaller Qwen3 4B (91.37%). Similarly, DeepSeek R1 0528 Qwen 8B scores 83.44% on GSM SYM p2 and 87.40% on MATH 500. Nvidia’s offerings in this range, like AceMath RL Nemo 7B (93.40% on MATH 500) and OpenMath Nemo 7B (92.00% on MATH 500), demonstrate strong capabilities on advanced mathematics.

A key concern in this range is robustness. Nvidia’s OpenMath Nemo 7B is a standout for stability, achieving 83.33% on our MathGPT Variant 4 test with a relatively low accuracy drop of -6.67%. In contrast, Alibaba’s Qwen 3 8B shows significant fragility with a -13.33% drop. DeepSeek R1 0528 Qwen 8B is more moderate with a -8.89% drop.

This mixed picture points to a practical takeaway:

While capable models exist in the 7-9B range, particularly Nvidia's entries for robustness and advanced math, users seeking a significant all-around jump from the 4B class, especially in combined peak performance and linguistic stability, may find the 14B+ models, such as Microsoft's Phi 4 reasoning plus (with its minimal -1.11% drop) or DeepSeek R1 Qwen 14B (93.90% MATH 500, -7.11% drop), to be more compelling alternatives.

Standout models

In reviewing the standout models below, we now provide some summary thoughts.

Qwen3 1.7B: Leading Small Model for Grade School Reasoning

The Qwen3 1.7B model shines as a top performer among models under 2 billion parameters. Its most notable achievement is its 71.11% score on the GSM SYM p2 benchmark, which tests complex, multi-step grade school math problems. This performance is the best in its size class and even surpasses some larger models, indicating a strong reasoning capability for its parameter count. Its proficiency is further supported by high scores on other GSM benchmarks (e.g., 89.84% on GSM1k). In terms of robustness, it shows a -6.67% accuracy drop on the MathGPT variation test (with 71.11% on Variant 4), which is respectable, though it indicates some sensitivity to how problems are phrased.

Qwen3 4B: A Powerful All-Rounder Bridging Performance Gaps

Qwen3 4B demonstrates a significant advancement in mathematical abilities within the 3-4 billion parameter category. It achieves excellent scores across grade school math evaluations, including over 90% on GSM SYM main (91.34%) and GSM1k (95.00%). Its 83.10% on GSM SYM p2 is particularly impressive, making it competitive with models nearly twice its size in handling intricate reasoning. A substantial improvement is also observed on the MATH 500 dataset, where it scores 91.37%, indicating strong capabilities in advanced high school and college-level mathematics. While a robust performer, it does experience an accuracy drop of -8.89% on the MathGPT variation test (Variant 4 at 75.56%), highlighting a common challenge among high-performing models when faced with linguistic variations.

Phi 4 reasoning plus: The Benchmark for Robust Mathematical Reasoning

The Phi 4 reasoning plus (14.7B) model carves out a unique position due to its exceptional robustness and consistent performance. While it may not always top the charts for raw accuracy on every single benchmark, its standout feature is the minimal accuracy drop of just -1.11% on the MathGPT variation test, alongside a strong 82.22% score on the Variant 4 problems themselves. This remarkable stability suggests a deeper, more generalizable mathematical understanding, allowing it to interpret and solve problems reliably even when the wording changes. It maintains solid accuracies generally in the low to mid-80s across challenging evaluations like GSM SYM p2 (84.68%) and MATH 500 (82.96%). This combination of consistent high performance and unparalleled stability implies that Phi 4 reasoning plus may rely more on genuine, methodical reasoning rather than just recognizing patterns from its training data, making it an excellent candidate for applications where reliable mathematical interpretation is paramount.

Future work

We are exploring new reasoning strategies, experimenting with innovative data and variations of math problems, and expect to share breakthrough benchmarks in our next blog. Stay tuned!