llm-as-judge

Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks (COLING 2025)

Our analysis shows that (1) LLM evaluators may generate unnecessary criteria or omit crucial ones, causing their assessments to deviate slightly from those of human experts, and (2) LLM evaluators handle general criteria such as fluency well, but struggle with complex criteria such as numerical reasoning.
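For context, the customized-evaluator setup can be approximated as a two-step prompt: the LLM first drafts task-specific criteria, then scores a candidate output against them. The sketch below is a hypothetical illustration rather than the paper's actual pipeline; the model name, prompt wording, and the OpenAI-compatible client are all assumptions.

```python
# Minimal sketch of a criteria-based LLM evaluator (hypothetical; not the
# paper's exact prompts or models). Assumes an OpenAI-compatible client
# with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name


def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def generate_criteria(task_description: str) -> str:
    """Step 1: ask the LLM to propose task-specific evaluation criteria.
    The paper finds this step may add unnecessary criteria or omit crucial ones."""
    return ask(
        "List the key criteria (one per line) for evaluating outputs of this task:\n"
        f"{task_description}"
    )


def score_output(task_description: str, criteria: str, output: str) -> str:
    """Step 2: ask the LLM to score a candidate output against each criterion."""
    return ask(
        f"Task: {task_description}\n"
        f"Evaluation criteria:\n{criteria}\n\n"
        f"Candidate output:\n{output}\n\n"
        "For each criterion, give a 1-5 score and a one-sentence justification."
    )


if __name__ == "__main__":
    task = "Summarize a news article in three sentences."
    criteria = generate_criteria(task)
    print(score_output(task, criteria, "The city council approved the new budget..."))
```

Under this setup, criteria like fluency fall out of the first step almost for free, whereas criteria requiring numerical reasoning depend on the judge model's own reasoning ability at scoring time, which is where the deviations from expert judgments tend to appear.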