GSM-Plus

A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

1The University of Hong Kong
2Tencent AI Lab

Introduction

Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there is growing debate about whether these models truly understand and apply mathematical knowledge, or merely rely on shortcuts for mathematical reasoning. One frequently observed piece of evidence is that LLMs can behave incorrectly when a math question is only slightly changed. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.

In this work, we perturb the widely used GSM8K dataset to produce GSM-Plus, an adversarial dataset for grade-school math. Motivated by the capability taxonomy for solving math problems described in Polya's principles, we identify five perspectives to guide the construction of GSM-Plus:

  1. Numerical Variation refers to altering the numerical data or its types, including three subcategories: Numerical Substitution, Digit Expansion, and Integer-Decimal-Fraction Conversion.
  2. Arithmetic Variation refers to reversing or introducing additional operations (e.g., addition, subtraction, multiplication, and division) into math problems, including two subcategories: Adding Operation and Reversing Operation.
  3. Problem Understanding refers to rephrasing the text description of the math problems.
  4. Distractor Insertion refers to inserting topic-related but useless sentences into the problems.
  5. Critical Thinking focuses on the ability to question or doubt a problem when it lacks necessary statements.
Based on the 1,319 test questions from GSM8K, we create eight variations for each question; the resulting GSM-Plus comprises 10,552 question variations.
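
To make the taxonomy concrete, here is a minimal Python sketch (not code from the GSM-Plus release) that maps the five perspectives to the eight perturbation types, together with an invented toy question showing what a Numerical Substitution variation looks like:

```python
# Illustrative sketch only: maps the five perspectives described above to the
# eight perturbation types used in GSM-Plus. The toy question is hypothetical
# and not taken from GSM8K.

PERSPECTIVE_TO_TYPES = {
    "Numerical Variation": [
        "numerical substitution",
        "digit expansion",
        "integer-decimal-fraction conversion",
    ],
    "Arithmetic Variation": ["adding operation", "reversing operation"],
    "Problem Understanding": ["problem understanding"],
    "Distractor Insertion": ["distractor insertion"],
    "Critical Thinking": ["critical thinking"],
}

# Sanity check: five perspectives yield eight perturbation types in total.
assert sum(len(v) for v in PERSPECTIVE_TO_TYPES.values()) == 8

# Toy illustration of a Numerical Substitution variation (invented example).
seed_question = "Lily has 3 apples and buys 5 more. How many apples does she have now?"
variation = "Lily has 13 apples and buys 25 more. How many apples does she have now?"
```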

We use GSM-Plus to evaluate the robustness of 25 LLMs spanning different model scales and task-specific fine-tuning, along with 4 popular prompting techniques, to obtain the LLMs' math reasoning results. Overall, we find that LLMs can accurately solve the GSM8K questions while struggling to answer the variations in GSM-Plus. Our benchmark reveals a gap of up to 20% between a model's reported accuracy on GSM8K and its accuracy in our setting, while human performance remains unaffected because the perturbations do not change the inherent difficulty of the questions.
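
For reference, accuracy on GSM-style benchmarks is typically computed by extracting the final number from a model's generated solution and comparing it to the gold answer. The sketch below is a generic illustration of that scoring step, not the exact evaluation code used for GSM-Plus:

```python
import re

def extract_final_number(generation: str):
    """Return the last number mentioned in a model-generated solution, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return float(matches[-1]) if matches else None

def accuracy(generations, gold_answers):
    """Fraction of generations whose final number matches the gold answer."""
    correct = 0
    for generation, gold in zip(generations, gold_answers):
        prediction = extract_final_number(generation)
        if prediction is not None and abs(prediction - float(gold)) < 1e-6:
            correct += 1
    return correct / len(gold_answers)

# Example: one correct and one incorrect generation -> accuracy 0.5
print(accuracy(["3 + 5 = 8. The answer is 8.", "The answer is 7."], ["8", "8"]))
```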

You can download the dataset from 🤗 Hugging Face Datasets.
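
For instance, the dataset can be loaded with the 🤗 datasets library as below; the dataset ID and split name here are assumptions, so please check the Hugging Face page for the exact values:

```python
from datasets import load_dataset

# Both the dataset ID and the split name are assumptions; verify them on the
# Hugging Face dataset page before use.
gsm_plus = load_dataset("qintongli/GSM-Plus", split="test")

print(len(gsm_plus))             # expected: 10,552 question variations
print(gsm_plus[0]["question"])   # the adversarial question of the first record
```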

GSM-Plus Dataset

All the data examples contain the following attributes:

  • question: the adversarial question
  • solution: the solution chain for the adversarial question
  • answer: the gold answer of the adversarial question
  • perturbation_type: the perturbation type
  • seed_question: the seed question used to craft the adversarial question
  • seed_solution: the solution chain for the seed question
  • seed_answer: the gold answer of the seed question

Examples
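
Below is a hypothetical record that illustrates the schema above; the question, solution, and numbers are invented for illustration and are not taken from the dataset:

```python
# Invented example record; it only illustrates the fields listed above.
example = {
    "question": "Lily has 13 apples and buys 25 more. How many apples does she have now?",
    "solution": "Lily has 13 + 25 = 38 apples. The answer is 38.",
    "answer": "38",
    "perturbation_type": "numerical substitution",
    "seed_question": "Lily has 3 apples and buys 5 more. How many apples does she have now?",
    "seed_solution": "Lily has 3 + 5 = 8 apples. The answer is 8.",
    "seed_answer": "8",
}
```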

Results on Existing Models

Accuracy scores (%) on GSM8K (1,319 samples) and GSM-Plus (10,552 samples); GSM-Plus includes samples from eight different perturbation types.

Model | GSM8K | GSM-Plus | Num. Sub. | Digit Exp. | IDC Conv. | Add. Op. | Rev. Op. | Prob. Underst. | Dist. Ins. | Crit. Thinking
Human | 96.8 | 98.8 | 92.9 | 100.0 | 100.0 | 87.5 | 100.0 | 100.0 | 100.0 | 100.0
GPT-4 | 93.3 | 85.6 | 89.8 | 90.5 | 89.0 | 79.5 | 83.7 | 93.9 | 90.8 | 67.5
GPT-3.5-Turbo | 73.6 | 61.2 | 69.5 | 70.4 | 62.3 | 48.5 | 55.2 | 74.2 | 62.2 | 47.3
Mistral-7B | 39.6 | 26.2 | 35.2 | 35.9 | 29.9 | 14.4 | 21.8 | 38.7 | 28.1 | 5.5
LLaMA-2-7B | 13.4 | 8.1 | 13.0 | 10.0 | 10.4 | 3.0 | 7.0 | 13.9 | 7.6 | 0.0
CodeLlama-7B | 25.3 | 15.1 | 22.3 | 23.8 | 19.1 | 8.6 | 9.3 | 25.9 | 11.4 | 1.4
LLaMA-2-13B | 25.4 | 16.6 | 25.9 | 22.9 | 17.7 | 9.5 | 13.4 | 27.4 | 15.4 | 0.3
CodeLlama-13B | 35.9 | 21.1 | 29.6 | 29.9 | 28.2 | 14.6 | 15.4 | 34.6 | 16.7 | 4.4
LLaMA-2-70B | 56.7 | 40.0 | 53.3 | 53.1 | 42.1 | 31.5 | 36.6 | 56.6 | 46.9 | 0.3
MetaMath-Mistral-7B | 77.8 | 56.3 | 71.0 | 70.0 | 61.9 | 45.1 | 58.1 | 77.5 | 55.6 | 10.6
MetaMath-7B | 66.7 | 44.4 | 59.1 | 58.8 | 49.7 | 30.9 | 49.7 | 64.9 | 36.7 | 5.2
Abel-7B | 59.5 | 37.1 | 56.1 | 51.0 | 38.6 | 24.6 | 33.5 | 58.7 | 33.3 | 1.0
ToRA-7B | 67.5 | 43.6 | 62.0 | 64.8 | 54.1 | 32.2 | 41.4 | 68.2 | 26.0 | 0.0
MAmmoTH-7B | 52.8 | 32.1 | 45.2 | 49.3 | 38.2 | 21.5 | 26.8 | 51.8 | 24.4 | 0.0
MAmmoTH-Coder-7B | 59.9 | 38.7 | 54.8 | 56.6 | 45.8 | 29.0 | 31.5 | 58.5 | 33.7 | 0.0
SEGO-7B | 68.7 | 44.7 | 60.4 | 64.3 | 51.7 | 35.9 | 41.0 | 67.2 | 37.2 | 0.0
MetaMath-13B | 70.8 | 48.6 | 61.5 | 64.3 | 53.1 | 36.3 | 53.9 | 71.7 | 42.9 | 4.9
Abel-13B | 66.7 | 45.4 | 62.4 | 59.7 | 50.0 | 34.8 | 41.6 | 67.3 | 45.5 | 1.8
ToRA-13B | 71.8 | 47.9 | 65.3 | 67.8 | 59.7 | 39.9 | 45.8 | 72.7 | 34.7 | 0.0
MAmmoTH-13B | 62.4 | 40.8 | 54.9 | 58.5 | 48.7 | 31.4 | 34.1 | 61.9 | 37.1 | 0.0
MAmmoTH-Coder-13B | 64.9 | 44.0 | 59.4 | 62.0 | 50.3 | 36.9 | 36.2 | 63.8 | 43.2 | 0.0
SEGO-13B | 72.5 | 49.3 | 65.5 | 68.5 | 58.6 | 43.6 | 45.9 | 71.6 | 40.6 | 0.0
MetaMath-70B | 82.1 | 59.4 | 74.9 | 74.5 | 65.0 | 51.0 | 58.0 | 79.6 | 61.9 | 9.9
Abel-70B | 83.9 | 60.0 | 76.7 | 76.9 | 63.6 | 53.1 | 60.0 | 81.4 | 64.8 | 3.0
MAmmoTH-70B | 75.9 | 53.4 | 67.4 | 71.7 | 59.2 | 47.8 | 49.1 | 75.5 | 56.6 | 0.0
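
One way to read the table is in terms of the GSM8K-to-GSM-Plus accuracy drop per model. The snippet below computes this drop for a few rows copied from the table:

```python
# (GSM8K accuracy, GSM-Plus accuracy) pairs copied from the table above.
results = {
    "GPT-4": (93.3, 85.6),
    "GPT-3.5-Turbo": (73.6, 61.2),
    "LLaMA-2-70B": (56.7, 40.0),
    "MetaMath-Mistral-7B": (77.8, 56.3),
    "Abel-70B": (83.9, 60.0),
}

for model, (gsm8k, gsm_plus) in results.items():
    drop = gsm8k - gsm_plus
    print(f"{model}: {drop:.1f}-point drop ({gsm8k} -> {gsm_plus})")
```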


Other Results

BibTeX

@misc{li2024gsmplus,
  author={Qintong Li and Leyang Cui and Xueliang Zhao and Lingpeng Kong and Wei Bi},
  title={GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers},
  year={2024},
  eprint={2401.13178},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Contact Us

Have any questions about GSM-Plus? Please contact us at qtli@cs.hku.hk or create an issue on GitHub.