GSM-Plus

A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

1The University of Hong Kong
2Tencent AI Lab

Introduction

Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. However, there is growing debate about whether these models truly understand and apply mathematical knowledge, or merely rely on shortcuts for mathematical reasoning. One frequently observed piece of evidence is that LLMs can behave incorrectly when a math question is only slightly changed. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.

In this work, we perturb the widely used GSM8K dataset to produce GSM-Plus, an adversarial dataset for grade-school math. Motivated by the capability taxonomy for solving math problems described in Polya's principles, we identify five perspectives to guide the construction of GSM-Plus:

  1. Numerical Variation refers to altering the numerical data or its types, including three subcategories: Numerical Substitution, Digit Expansion, and Integer-Decimal-Fraction Conversion.
  2. Arithmetic Variation refers to reversing or introducing additional operations (e.g., addition, subtraction, multiplication, and division) into math problems, including two subcategories: Adding Operation and Reversing Operation.
  3. Problem Understanding refers to rephrasing the text description of the math problems.
  4. Distractor Insertion refers to inserting topic-related but useless sentences into the problems.
  5. Critical Thinking focuses on the ability to question or doubt a problem when it lacks necessary statements.
Based on the 1,319 test questions from GSM8K, we create eight variations for each question; the resulting GSM-Plus comprises 10,552 question variations.
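
To make the taxonomy concrete, here is a minimal Python sketch (not code from the GSM-Plus release) that maps the five perspectives to the eight perturbation types, together with an invented toy question showing what a Numerical Substitution variation looks like:

```python
# Illustrative sketch only: maps the five perspectives described above to the
# eight perturbation types used in GSM-Plus. The toy question is hypothetical
# and not taken from GSM8K.

PERSPECTIVE_TO_TYPES = {
    "Numerical Variation": [
        "numerical substitution",
        "digit expansion",
        "integer-decimal-fraction conversion",
    ],
    "Arithmetic Variation": ["adding operation", "reversing operation"],
    "Problem Understanding": ["problem understanding"],
    "Distractor Insertion": ["distractor insertion"],
    "Critical Thinking": ["critical thinking"],
}

# Sanity check: five perspectives yield eight perturbation types in total.
assert sum(len(v) for v in PERSPECTIVE_TO_TYPES.values()) == 8

# Toy illustration of a Numerical Substitution variation (invented example).
seed_question = "Lily has 3 apples and buys 5 more. How many apples does she have now?"
variation = "Lily has 13 apples and buys 25 more. How many apples does she have now?"
```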

We use GSM-Plus to evaluate the robustness of 25 LLMs spanning different model scales and task-specific fine-tuning, along with 4 popular prompting techniques, to obtain the LLMs' math reasoning results. Overall, we find that LLMs can accurately solve the GSM8K questions while struggling to answer the variations in GSM-Plus. Our benchmark reveals a gap of up to 20% between a model's reported accuracy on GSM8K and its accuracy in our setting, while human performance remains unaffected because the perturbations do not change the inherent difficulty of the questions.
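
For reference, accuracy on GSM-style benchmarks is typically computed by extracting the final number from a model's generated solution and comparing it to the gold answer. The sketch below is a generic illustration of that scoring step, not the exact evaluation code used for GSM-Plus:

```python
import re

def extract_final_number(generation: str):
    """Return the last number mentioned in a model-generated solution, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return float(matches[-1]) if matches else None

def accuracy(generations, gold_answers):
    """Fraction of generations whose final number matches the gold answer."""
    correct = 0
    for generation, gold in zip(generations, gold_answers):
        prediction = extract_final_number(generation)
        if prediction is not None and abs(prediction - float(gold)) < 1e-6:
            correct += 1
    return correct / len(gold_answers)

# Example: one correct and one incorrect generation -> accuracy 0.5
print(accuracy(["3 + 5 = 8. The answer is 8.", "The answer is 7."], ["8", "8"]))
```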

You can download the dataset from 🤗 Hugging Face Datasets.
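
For instance, the dataset can be loaded with the 🤗 datasets library as below; the dataset ID and split name here are assumptions, so please check the Hugging Face page for the exact values:

```python
from datasets import load_dataset

# Both the dataset ID and the split name are assumptions; verify them on the
# Hugging Face dataset page before use.
gsm_plus = load_dataset("qintongli/GSM-Plus", split="test")

print(len(gsm_plus))             # expected: 10,552 question variations
print(gsm_plus[0]["question"])   # the adversarial question of the first record
```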

GSM-Plus Dataset

All the data examples contain the following attributes:

  • question: the adversarial question
  • solution: the solution chain for the adversarial question
  • answer: the gold answer of the adversarial question
  • perturbation_type: the perturbation type
  • seed_question: the seed question used to craft the adversarial question
  • seed_solution: the solution chain for the seed question
  • seed_answer: the gold answer of the seed question

Examples
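
Below is a hypothetical record that illustrates the schema above; the question, solution, and numbers are invented for illustration and are not taken from the dataset:

```python
# Invented example record; it only illustrates the fields listed above.
example = {
    "question": "Lily has 13 apples and buys 25 more. How many apples does she have now?",
    "solution": "Lily has 13 + 25 = 38 apples. The answer is 38.",
    "answer": "38",
    "perturbation_type": "numerical substitution",
    "seed_question": "Lily has 3 apples and buys 5 more. How many apples does she have now?",
    "seed_solution": "Lily has 3 + 5 = 8 apples. The answer is 8.",
    "seed_answer": "8",
}
```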

Results on Existing Models

Accuracy scores (%) on GSM8K (1,319 samples) and GSM-Plus (10,552 samples); GSM-Plus includes samples from eight different perturbation types.

Model | GSM8K | GSM-Plus | Num. Sub. | Digit Exp. | IDC Conv. | Add. Op. | Rev. Op. | Prob. Underst. | Dist. Ins. | Crit. Thinking
Human | 96.8 | 98.8 | 92.9 | 100.0 | 100.0 | 87.5 | 100.0 | 100.0 | 100.0 | 100.0
GPT-4 | 93.3 | 85.6 | 89.8 | 90.5 | 89.0 | 79.5 | 83.7 | 93.9 | 90.8 | 67.5
GPT-3.5-Turbo | 73.6 | 61.2 | 69.5 | 70.4 | 62.3 | 48.5 | 55.2 | 74.2 | 62.2 | 47.3
Mistral-7B | 39.6 | 26.2 | 35.2 | 35.9 | 29.9 | 14.4 | 21.8 | 38.7 | 28.1 | 5.5
LLaMA-2-7B | 13.4 | 8.1 | 13.0 | 10.0 | 10.4 | 3.0 | 7.0 | 13.9 | 7.6 | 0.0
CodeLlama-7B | 25.3 | 15.1 | 22.3 | 23.8 | 19.1 | 8.6 | 9.3 | 25.9 | 11.4 | 1.4
LLaMA-2-13B | 25.4 | 16.6 | 25.9 | 22.9 | 17.7 | 9.5 | 13.4 | 27.4 | 15.4 | 0.3
CodeLlama-13B | 35.9 | 21.1 | 29.6 | 29.9 | 28.2 | 14.6 | 15.4 | 34.6 | 16.7 | 4.4
LLaMA-2-70B | 56.7 | 40.0 | 53.3 | 53.1 | 42.1 | 31.5 | 36.6 | 56.6 | 46.9 | 0.3
MetaMath-Mistral-7B | 77.8 | 56.3 | 71.0 | 70.0 | 61.9 | 45.1 | 58.1 | 77.5 | 55.6 | 10.6
MetaMath-7B | 66.7 | 44.4 | 59.1 | 58.8 | 49.7 | 30.9 | 49.7 | 64.9 | 36.7 | 5.2
Abel-7B | 59.5 | 37.1 | 56.1 | 51.0 | 38.6 | 24.6 | 33.5 | 58.7 | 33.3 | 1.0
ToRA-7B | 67.5 | 43.6 | 62.0 | 64.8 | 54.1 | 32.2 | 41.4 | 68.2 | 26.0 | 0.0
MAmmoTH-7B | 52.8 | 32.1 | 45.2 | 49.3 | 38.2 | 21.5 | 26.8 | 51.8 | 24.4 | 0.0
MAmmoTH-Coder-7B | 59.9 | 38.7 | 54.8 | 56.6 | 45.8 | 29.0 | 31.5 | 58.5 | 33.7 | 0.0
SEGO-7B | 68.7 | 44.7 | 60.4 | 64.3 | 51.7 | 35.9 | 41.0 | 67.2 | 37.2 | 0.0
MetaMath-13B | 70.8 | 48.6 | 61.5 | 64.3 | 53.1 | 36.3 | 53.9 | 71.7 | 42.9 | 4.9
Abel-13B | 66.7 | 45.4 | 62.4 | 59.7 | 50.0 | 34.8 | 41.6 | 67.3 | 45.5 | 1.8
ToRA-13B | 71.8 | 47.9 | 65.3 | 67.8 | 59.7 | 39.9 | 45.8 | 72.7 | 34.7 | 0.0
MAmmoTH-13B | 62.4 | 40.8 | 54.9 | 58.5 | 48.7 | 31.4 | 34.1 | 61.9 | 37.1 | 0.0
MAmmoTH-Coder-13B | 64.9 | 44.0 | 59.4 | 62.0 | 50.3 | 36.9 | 36.2 | 63.8 | 43.2 | 0.0
SEGO-13B | 72.5 | 49.3 | 65.5 | 68.5 | 58.6 | 43.6 | 45.9 | 71.6 | 40.6 | 0.0
MetaMath-70B | 82.1 | 59.4 | 74.9 | 74.5 | 65.0 | 51.0 | 58.0 | 79.6 | 61.9 | 9.9
Abel-70B | 83.9 | 60.0 | 76.7 | 76.9 | 63.6 | 53.1 | 60.0 | 81.4 | 64.8 | 3.0
MAmmoTH-70B | 75.9 | 53.4 | 67.4 | 71.7 | 59.2 | 47.8 | 49.1 | 75.5 | 56.6 | 0.0
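
One way to read the table is in terms of the GSM8K-to-GSM-Plus accuracy drop per model. The snippet below computes this drop for a few rows copied from the table:

```python
# (GSM8K accuracy, GSM-Plus accuracy) pairs copied from the table above.
results = {
    "GPT-4": (93.3, 85.6),
    "GPT-3.5-Turbo": (73.6, 61.2),
    "LLaMA-2-70B": (56.7, 40.0),
    "MetaMath-Mistral-7B": (77.8, 56.3),
    "Abel-70B": (83.9, 60.0),
}

for model, (gsm8k, gsm_plus) in results.items():
    drop = gsm8k - gsm_plus
    print(f"{model}: {drop:.1f}-point drop ({gsm8k} -> {gsm_plus})")
```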


Other Results

BibTeX

@misc{li2024gsmplus,
  author={Qintong Li and Leyang Cui and Xueliang Zhao and Lingpeng Kong and Wei Bi},
  title={GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers},
  year={2024},
  eprint={2401.13178},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Contact Us

Have any questions about GSM-Plus? Please contact us at qtli@cs.hku.hk or create an issue on GitHub.