A recent research article presents an innovative method, Para-Ref, which uses large language models (LLMs) to paraphrase a single reference into multiple high-quality, diverse expressions, improving the correlation of several automatic evaluation metrics with human evaluation. The researchers behind this paper are affiliated with Renmin University of China, The Chinese University of Hong Kong, ETH Zürich, and Microsoft Research Asia.

Introduction: The Problem with Limited References

In the field of natural language generation (NLG), evaluation plays a crucial role in determining the quality of text generated for tasks such as machine translation, text summarization, and image captioning. Evaluation has traditionally relied on either human judgment or automatic metrics. While human evaluation provides reliable assessments, it is time-consuming and expensive, prompting researchers to develop automatic evaluation metrics, which are far cheaper and faster to run.

However, a key drawback of automatic evaluation metrics lies in the limited number of references provided by evaluation benchmarks. A single reference, or even a small handful, cannot cover the many acceptable ways of expressing the same content, so a valid hypothesis that is merely phrased differently may be penalized, leading to poor correlation with human evaluation.

Para-Ref Method: Enhancing Benchmarks

To address this limitation, the researchers propose the Para-Ref method. This approach leverages the paraphrasing capabilities of large language models (LLMs) to generate multiple high-quality, diverse expressions of a single ground-truth reference. By increasing the number of references, the method aims to improve the correlation between automatic evaluation metrics and human evaluation, thus providing a better reflection of an NLG model’s quality.
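
To make the idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes sacrebleu's sentence-level BLEU as the underlying metric and max as the aggregation function, which is only one plausible choice (the ablation discussed below compares aggregation functions explicitly).

```python
import sacrebleu  # pip install sacrebleu


def para_ref_score(hypothesis, references, aggregate=max):
    """Score a hypothesis against an expanded set of paraphrased references.

    Each (hypothesis, reference) pair is scored independently with a
    single-reference metric, and the per-reference scores are combined
    with an aggregation function (max here, as one plausible choice).
    """
    scores = [
        sacrebleu.sentence_bleu(hypothesis, [ref]).score
        for ref in references
    ]
    return aggregate(scores)


# Original ground-truth reference plus hypothetical LLM paraphrases.
references = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "On the mat, the cat sat.",
]
print(para_ref_score("The cat is sitting on the mat.", references))
```

The same pattern applies to any single-reference metric: score the hypothesis against each paraphrased reference independently, then aggregate the per-reference scores.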

Applying LLM Paraphrasing

The research article explores the use of LLM paraphrasing prompts to generate diverse expressions while preserving the original semantics of the ground-truth reference. The authors first use a basic prompt to paraphrase ten sentences from the WMT22 Metrics Shared Task and observe only limited changes to sentence structure. To enhance diversity, they employ prompts inspired by heuristic rules covering changes in wording, word order, sentence structure, voice, and style. These diverse prompts yield markedly higher-quality, more varied paraphrases and better correlation with human evaluations.
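
As an illustration of how such prompting could be wired up, the sketch below contrasts a basic instruction with a diversity-oriented one. The `call_llm` helper and the prompt wording are assumptions made for illustration; they are not the paper's actual prompts or code.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-style LLM API; plug in your
    # preferred client here.
    raise NotImplementedError


def paraphrase(reference: str, n: int = 5, diverse: bool = True) -> list[str]:
    """Ask an LLM for n paraphrases of a ground-truth reference."""
    if diverse:
        # Diversity-oriented instruction inspired by the heuristic rules
        # described above (wording, order, structure, voice, style).
        instruction = (
            f"Paraphrase the sentence below {n} times, one per line. "
            "Vary the wording, word order, sentence structure, voice, and "
            "style, but keep the meaning exactly the same."
        )
    else:
        # Basic instruction with no explicit diversity constraints.
        instruction = f"Paraphrase the sentence below {n} times, one per line."
    prompt = f"{instruction}\n\nSentence: {reference}"
    return call_llm(prompt).splitlines()
```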

Experimental Results: Improved Correlation with Human Evaluation

The researchers conducted extensive experiments on benchmarks from multiple NLG tasks, covering a range of commonly used automatic evaluation metrics, to demonstrate the effectiveness of the Para-Ref approach. The results showed a significant improvement in the consistency between automatic metrics and human evaluation across tasks, with an average improvement of +7.82% in the correlation ratio, demonstrating the potential of Para-Ref to strengthen evaluation benchmarks for NLG.
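
Consistency of this kind is typically quantified with a correlation coefficient between metric scores and human ratings. The snippet below is a generic illustration of that measurement using Kendall's tau from scipy, with made-up per-segment scores; it does not reproduce the paper's evaluation protocol.

```python
from scipy.stats import kendalltau

# Hypothetical per-segment scores, for illustration only: human ratings,
# metric scores computed with the single original reference, and metric
# scores computed with a Para-Ref-style expanded reference set.
human = [4.0, 2.5, 3.5, 1.0, 4.5]
single_ref = [0.42, 0.30, 0.33, 0.28, 0.47]
expanded_ref = [0.55, 0.31, 0.46, 0.25, 0.60]

tau_single, _ = kendalltau(human, single_ref)
tau_expanded, _ = kendalltau(human, expanded_ref)
print(f"Kendall tau with the single reference: {tau_single:.3f}")
print(f"Kendall tau with expanded references:  {tau_expanded:.3f}")
```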

Ablation Analysis: Impact of Various Factors

The study also examined how several factors influence the performance of the Para-Ref method: the choice of paraphrasing model, the instruction prompts, the aggregation function, and the number of paraphrased references. LLM paraphrasing with diverse prompts proved especially beneficial for lexicon-based metrics, and generating 10-20 references offered the best trade-off between cost and effectiveness for translation tasks.
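
To make the last two factors concrete, the sketch below sweeps the number of references and compares max against mean aggregation. It is illustrative only: it assumes sentence-level BLEU from sacrebleu as the metric and uses hypothetical paraphrases rather than the paper's data.

```python
from statistics import mean

import sacrebleu  # pip install sacrebleu

hypothesis = "The cat is sitting on the mat."
references = [  # original reference followed by hypothetical paraphrases
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "On the mat, the cat sat.",
    "There was a cat sitting on the mat.",
]

for k in (1, 2, 4):  # sweep the size of the reference set
    scores = [
        sacrebleu.sentence_bleu(hypothesis, [ref]).score
        for ref in references[:k]
    ]
    print(f"{k} refs: max={max(scores):6.2f}  mean={mean(scores):6.2f}")
```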

Conclusion: A Step Forward for AI Evaluation

In conclusion, the Para-Ref method leverages LLM paraphrasing to enhance evaluation benchmarks, generating diverse, high-quality references that significantly improve the correlation of automatic evaluation metrics with human evaluation. This work represents a meaningful advance for NLG evaluation and could extend beyond text to the evaluation of other modalities such as speech and images. By narrowing the gap between automatic metrics and human judgment, Para-Ref makes the assessment of natural language generation systems more trustworthy.

Original Paper