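Accuracy (%) on GSM8K (1,319 samples) and GSM-Plus (10,552 samples). GSM-Plus covers eight perturbation types, and the remaining columns report accuracy on each type: numerical substitution (Num. Sub.), digit expansion (Digit Exp.), integer-decimal-fraction conversion (IDC Conv.), adding operation (Add. Op.), reversing operation (Rev. Op.), problem understanding (Prob. Underst.), distractor insertion (Dist. Ins.), and critical thinking (Crit. Thinking).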
Model | GSM8K | GSM-Plus | Num. Sub. | Digit Exp. | IDC Conv. | Add. Op. | Rev. Op. | Prob. Underst. | Dist. Ins. | Crit. Thinking |
Human | 96.8 | 98.8 | 92.9 | 100.0 | 100.0 | 87.5 | 100.0 | 100.0 | 100.0 | 100.0 |
GPT-4 | 93.3 | 85.6 | 89.8 | 90.5 | 89.0 | 79.5 | 83.7 | 93.9 | 90.8 | 67.5 |
GPT-3.5-Turbo | 73.6 | 61.2 | 69.5 | 70.4 | 62.3 | 48.5 | 55.2 | 74.2 | 62.2 | 47.3 |
Mistral-7B | 39.6 | 26.2 | 35.2 | 35.9 | 29.9 | 14.4 | 21.8 | 38.7 | 28.1 | 5.5 |
LLaMA-2-7B | 13.42 | 8.12 | 13.0 | 10.0 | 10.4 | 3.0 | 7.0 | 13.9 | 7.6 | 0.0 |
CodeLlama-7B | 25.3 | 15.1 | 22.3 | 23.8 | 19.1 | 8.6 | 9.3 | 25.9 | 11.4 | 1.4 |
LLaMA-2-13B | 25.4 | 16.6 | 25.9 | 22.9 | 17.7 | 9.5 | 13.4 | 27.4 | 15.4 | 0.3 |
CodeLlama-13B | 35.9 | 21.1 | 29.6 | 29.9 | 28.2 | 14.6 | 15.4 | 34.6 | 16.7 | 4.4 |
LLaMA-2-70B | 56.7 | 40.0 | 53.3 | 53.1 | 42.1 | 31.5 | 36.6 | 56.6 | 46.9 | 0.3 |
MetaMath-Mistral-7B | 77.8 | 56.3 | 71.0 | 70.0 | 61.9 | 45.1 | 58.1 | 77.5 | 55.6 | 10.6 |
MetaMath-7B | 66.7 | 44.4 | 59.1 | 58.8 | 49.7 | 30.9 | 49.7 | 64.9 | 36.7 | 5.2 |
Abel-7B | 59.5 | 37.1 | 56.1 | 51.0 | 38.6 | 24.6 | 33.5 | 58.7 | 33.3 | 1.0 |
ToRA-7B | 67.5 | 43.6 | 62.0 | 64.8 | 54.1 | 32.2 | 41.4 | 68.2 | 26.0 | 0.0 |
MAmmoTH-7B | 52.8 | 32.1 | 45.2 | 49.3 | 38.2 | 21.5 | 26.8 | 51.8 | 24.4 | 0.0 |
MAmmoTH-Coder-7B | 59.9 | 38.7 | 54.8 | 56.6 | 45.8 | 29.0 | 31.5 | 58.5 | 33.7 | 0.0 |
SEGO-7B | 68.7 | 44.7 | 60.4 | 64.3 | 51.7 | 35.9 | 41.0 | 67.2 | 37.2 | 0.0 |
MetaMath-13B | 70.8 | 48.6 | 61.5 | 64.3 | 53.1 | 36.3 | 53.9 | 71.7 | 42.9 | 4.9 |
Abel-13B | 66.7 | 45.4 | 62.4 | 59.7 | 50.0 | 34.8 | 41.6 | 67.3 | 45.5 | 1.8 |
ToRA-13B | 71.8 | 47.9 | 65.3 | 67.8 | 59.7 | 39.9 | 45.8 | 72.7 | 34.7 | 0.0 |
MAmmoTH-13B | 62.4 | 40.8 | 54.9 | 58.5 | 48.7 | 31.4 | 34.1 | 61.9 | 37.1 | 0.0 |
MAmmoTH-Coder-13B | 64.9 | 44.0 | 59.4 | 62.0 | 50.3 | 36.9 | 36.2 | 63.8 | 43.2 | 0.0 |
SEGO-13B | 72.5 | 49.3 | 65.5 | 68.5 | 58.6 | 43.6 | 45.9 | 71.6 | 40.6 | 0.0 |
MetaMath-70B | 82.1 | 59.4 | 74.9 | 74.5 | 65.0 | 51.0 | 58.0 | 79.6 | 61.9 | 9.9 |
Abel-70B | 83.9 | 60.0 | 76.7 | 76.9 | 63.6 | 53.1 | 60.0 | 81.4 | 64.8 | 3.0 |
MAmmoTH-70B | 75.9 | 53.4 | 67.4 | 71.7 | 59.2 | 47.8 | 49.1 | 75.5 | 56.6 | 0.0 |
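
One way to read the table is to compare each model's GSM8K accuracy with its GSM-Plus accuracy. The sketch below is illustrative only: it copies three rows from the table and computes the absolute and relative drop between the two columns. The relative-drop formula and the names `ROWS` and `summarize` are assumptions introduced here, not definitions taken from this table. The sketch also assumes that the eight perturbation types contain equally many samples (10,552 = 8 × 1,319), in which case the GSM-Plus score should be close to the unweighted mean of the eight per-type scores.

```python
# Rows copied from the table above:
# model -> (GSM8K accuracy, GSM-Plus accuracy, eight per-type accuracies)
ROWS = {
    "GPT-4": (93.3, 85.6, [89.8, 90.5, 89.0, 79.5, 83.7, 93.9, 90.8, 67.5]),
    "Mistral-7B": (39.6, 26.2, [35.2, 35.9, 29.9, 14.4, 21.8, 38.7, 28.1, 5.5]),
    "MetaMath-70B": (82.1, 59.4, [74.9, 74.5, 65.0, 51.0, 58.0, 79.6, 61.9, 9.9]),
}


def summarize(gsm8k, gsm_plus, per_type):
    """Return (absolute drop, relative drop in %, mean of per-type scores).

    The relative drop is an assumed illustrative metric:
    100 * (GSM8K - GSM-Plus) / GSM8K.
    """
    absolute_drop = gsm8k - gsm_plus
    relative_drop = 100.0 * absolute_drop / gsm8k
    # Unweighted mean; valid as an overall score only if every
    # perturbation type has the same number of samples (assumption).
    mean_per_type = sum(per_type) / len(per_type)
    return absolute_drop, relative_drop, mean_per_type


if __name__ == "__main__":
    for model, (gsm8k, gsm_plus, per_type) in ROWS.items():
        abs_drop, rel_drop, mean_pt = summarize(gsm8k, gsm_plus, per_type)
        print(
            f"{model}: drop {abs_drop:.1f} points "
            f"({rel_drop:.1f}% relative), per-type mean {mean_pt:.2f} "
            f"vs reported GSM-Plus {gsm_plus:.1f}"
        )
```

For these three rows the per-type mean agrees with the reported GSM-Plus score to within rounding, which is consistent with the equal-sample assumption. The Human row does not follow this pattern, so the check should not be applied to it without knowing how the human scores were obtained.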