Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks

Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi

November 2024


Abstract

Previous work adopts large language models (LLMs) as evaluators for natural language processing (NLP) tasks. However, certain shortcomings, e.g., in fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts; and 2) LLM evaluators excel at general criteria, such as fluency, but face challenges with complex criteria, such as numerical reasoning. We also find that LLM pre-drafting before human evaluation can help reduce the impact of human subjectivity and minimize annotation outliers in pure human evaluation, leading to more objective evaluation.

Type

Preprint

Publication

COLING 2025

Qintong Li

PhD student

My research interests are in building machine learning models for open-ended text generation and commonsense reasoning.

