Advancements in Large Language Models (LLMs) have led to impressive performance on various reasoning tasks. To probe how far these abilities really go, researchers Daman Arora and Himanshu Gaurav Singh have introduced JEEBench, a new and considerably more challenging benchmark designed to test the problem-solving abilities of LLMs. Their evaluation shows promising results for GPT-4 but also highlights clear areas for improvement.

JEEBench: A New Standard for LLM Evaluation

JEEBench is a dataset of 450 pre-engineering mathematics, physics, and chemistry problems curated from the IIT JEE-Advanced exam. Compared to existing benchmarks, JEEBench is significantly harder, demanding long-horizon reasoning and deep in-domain knowledge to solve its problems.

Existing benchmarks such as GSM-8K, MATH, and MMLU focus on grade-school arithmetic, high-school mathematics and science, or broad factual knowledge, and can often be solved with short chains of reasoning. In contrast, JEEBench comprises more complex and diverse problems that demand the correct application of high-level, domain-specific concepts.

GPT-4’s Performance on JEEBench

For this study, the authors evaluate the GPT series of language models on the JEEBench dataset. They specifically focus on how GPT-4 fares against its predecessors, GPT-3 and GPT-3.5. Additionally, they test the efficacy of methods such as Chain-of-Thought prompting and Self-Consistency, aimed at improving the reasoning capabilities of LLMs.

The results show a notable performance improvement from GPT-3 to GPT-4. While GPT-3 performed near randomly, GPT-3.5 and GPT-4 did significantly better, with GPT-4 clearly ahead of both. Even so, GPT-4's problem-solving skills leave considerable room for improvement.

Chain-of-Thought prompting led to a noticeable increase in GPT-4’s performance. However, the Self-Consistency method did not yield substantial improvements.
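To make these two techniques concrete, here is a minimal sketch of Chain-of-Thought prompting combined with Self-Consistency (majority voting over sampled reasoning chains). The `query_llm` helper, the prompt template, and the answer-extraction convention are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: Chain-of-Thought prompting with Self-Consistency voting.
# `query_llm` is a hypothetical stand-in for any chat-completion API; it is
# assumed to return the model's text response for a prompt and temperature.
from collections import Counter


def query_llm(prompt: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("Replace with a call to your LLM provider.")


COT_TEMPLATE = (
    "Solve the following problem step by step, then state the final answer "
    "on a new line prefixed with 'Answer:'.\n\nProblem: {problem}"
)


def extract_answer(response: str) -> str:
    # Take whatever follows the last 'Answer:' marker as the final answer.
    return response.rsplit("Answer:", 1)[-1].strip()


def solve_with_self_consistency(problem: str, num_samples: int = 8) -> str:
    prompt = COT_TEMPLATE.format(problem=problem)
    # Sample several independent reasoning chains at non-zero temperature...
    answers = [
        extract_answer(query_llm(prompt, temperature=0.7))
        for _ in range(num_samples)
    ]
    # ...and return the most frequent final answer (majority vote).
    return Counter(answers).most_common(1)[0][0]
```

With Chain-of-Thought alone, one would simply take a single greedy (temperature 0) completion instead of voting over several samples.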

Error Analysis: Where GPT-4 Falters

The researchers conducted a manual annotation of GPT-4's errors on a subset of 100 problems. They classified the errors into three categories:

  1. Conceptual errors: Inability to retrieve the required concepts or facts for solving the problem
  2. Grounding errors: Incorrect grounding of retrieved concepts in terms of equations or constraints
  3. Computation errors: Incorrect algebraic manipulation and arithmetic

Conceptual and computation errors dominated, together accounting for over 80% of the total. Although GPT-4 solved some questions correctly with human-like logical and mathematical reasoning, it occasionally made severe mistakes in seemingly trivial steps, indicating a need for further improvement.

What Does It Mean for the Future of AI?

The researchers highlight several possible directions for enhancing LLMs. One interesting idea is to integrate a large language model with a "blackbox" scientific calculator, in the spirit of Toolformer. This would let the LLM offload arithmetic to the calculator rather than computing it in-context.
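As a rough illustration of this idea, the sketch below has the model emit `CALC(<expression>)` markers, which are then evaluated externally and substituted back into its solution. The marker protocol, the prompt, and the `query_llm` helper are assumptions for illustration, not the paper's or Toolformer's actual method.

```python
# Sketch: offloading arithmetic to a "blackbox" calculator.
import re

import sympy


def query_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM provider.")


CALC_PATTERN = re.compile(r"CALC\(([^)]*)\)")


def run_calculator(expression: str) -> str:
    # Parse a plain arithmetic expression and evaluate it numerically.
    return str(sympy.sympify(expression).evalf())


def answer_with_calculator(problem: str) -> str:
    draft = query_llm(
        "Solve the problem. Whenever you need arithmetic, write "
        "CALC(<expression>) instead of computing it yourself.\n\n"
        f"Problem: {problem}"
    )
    # Replace every CALC(...) marker with the calculator's result, keeping the
    # model's reasoning but doing its arithmetic externally.
    return CALC_PATTERN.sub(lambda m: run_calculator(m.group(1)), draft)
```

This directly targets the computation errors identified in the error analysis, since the model no longer has to carry out arithmetic itself.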

Another suggestion is the development of a “self-refinement” system to verify and provide feedback on the correctness of mathematical reasoning in natural language. This could significantly augment the reasoning capabilities of LLMs.
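One simple way to picture such a system is a verify-and-revise loop: a verifier prompt critiques the solution, and the solver revises until the verifier finds no issues. The loop below is a minimal sketch under that assumption; the prompts, the stopping convention, and the `query_llm` helper are all hypothetical.

```python
# Sketch: a self-refinement loop with an LLM-based verifier.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM provider.")


def self_refine(problem: str, max_rounds: int = 3) -> str:
    solution = query_llm(f"Solve step by step:\n\n{problem}")
    for _ in range(max_rounds):
        feedback = query_llm(
            "Check the following solution for conceptual, grounding, and "
            "computation errors. Reply 'CORRECT' if it is sound, otherwise "
            f"describe the mistakes.\n\nProblem: {problem}\n\nSolution: {solution}"
        )
        if feedback.strip().upper().startswith("CORRECT"):
            break  # The verifier found no issues; stop refining.
        # Otherwise, ask the solver to revise using the verifier's feedback.
        solution = query_llm(
            f"Revise the solution below to fix these issues:\n{feedback}\n\n"
            f"Problem: {problem}\n\nSolution: {solution}"
        )
    return solution
```

Whether an LLM can reliably verify its own mathematical reasoning is, of course, exactly the open question the authors raise.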

Ultimately, the benchmark provided by JEEBench and the analysis presented by the authors shine a light on opportunities for future research. As advancements in AI and language models continue, studying and addressing their limitations will play a crucial role in improving their problem-solving and reasoning capabilities, making AI more useful and versatile for a wide range of applications.

Original Paper