In a recent research article, a team of researchers from the University of Edinburgh found that Large Language Models (LLMs) fail to recognize identifier swaps in Python code generation tasks. Worse, as model size increases, the models become more confident in their incorrect predictions. This surprising finding runs counter to the commonly observed trend of prediction quality improving with model size, and it raises questions about LLMs’ true understanding of the content they process and their applicability to tasks that deviate from their training data.

Novel Inverse Scaling Task and Experiment Design

The authors devised a novel inverse scaling task involving the generation of Python code fragments in which default identifiers (Python’s built-in functions) are redefined. The task has practical relevance, since redefining default identifiers is employed as a metaprogramming technique in popular libraries. Moreover, it exposes a fundamental failure of LLMs to reason about the deep, abstract semantic structure of programming languages.
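To make the setup concrete, here is a minimal, hypothetical illustration (not taken from the paper’s dataset) of the kind of built-in swap the task is built around:

len, print = print, len   # from here on, the two built-ins have exchanged meanings

def count_items(items):
    """Return the number of items and display it."""
    n = print(items)   # `print` is now the original len, so this computes the length
    len(n)             # `len` is now the original print, so this writes n to stdout
    return n

count_items(["a", "b", "c"])   # prints 3 and returns 3

A model that genuinely tracks the swap must use print wherever it would normally use len, and vice versa, for the rest of the module.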

To investigate this issue, the research team built a dataset by crawling Python repositories from GitHub and generating a prompt, a “good” output, and a “bad” output for each function. The prompt comprises a statement that swaps built-in functions, followed by a function declaration and its docstring. A good output uses the swapped built-ins according to their new, exchanged meanings, while a bad output ignores the swap statement and treats the built-ins as if they still had their usual meanings. A model’s success was assessed by comparing the likelihood it assigns to each of the two outputs.
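The actual examples were mined from GitHub, so the triple below is only a hypothetical sketch of what a prompt and its two candidate outputs might look like:

# Prompt: a swap statement, a function declaration, and its docstring.
prompt = (
    "print, len = len, print\n"
    "\n"
    "def shortest(strings):\n"
    '    """Return the shortest string in strings."""\n'
)

# "Good" output: respects the swap (print is now the original len).
good = "    return min(strings, key=print)\n"

# "Bad" output: ignores the swap and keeps len's usual meaning.
bad = "    return min(strings, key=len)\n"

A model that has internalised the swap should assign a higher probability to the good continuation than to the bad one.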

How LLMs Fare on the Task and Their Inability to Comprehend Program Semantics

The experiments were performed on LLMs from OpenAI, Salesforce, Meta AI, and Google. The evaluated models exhibited strong inverse scaling, with larger models significantly preferring the incorrect output. These results imply that LLMs rely primarily on weak, unstable, mostly lexical correlations in their training data rather than on an understanding of the deep semantics involved.
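The paper’s exact evaluation pipeline is not reproduced here, but the likelihood comparison described above can be sketched roughly as follows, assuming a Hugging Face causal language model (the model name is a placeholder; the study evaluated far larger models than gpt2):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_logprob(prompt, continuation):
    """Sum of the log-probabilities the model assigns to `continuation` given `prompt`."""
    # Assumes tokenising the prompt alone yields a prefix of tokenising
    # prompt + continuation, which usually holds when the continuation starts
    # on a new line.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given the tokens before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions belonging to the continuation.
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()

def prefers_bad(prompt, good, bad):
    """True if the model assigns a higher likelihood to the swap-ignoring output."""
    return continuation_logprob(prompt, bad) > continuation_logprob(prompt, good)

Applied to triples like the one sketched above, a check like prefers_bad makes it possible to track how often a model favours the incorrect output as model size grows, which is the pattern the authors report.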

In addition to the main task, the team conducted a series of qualitative experiments with the OpenAI GPT-3.5, ChatGPT-3.5, and GPT-4 models. Even when given correct examples in the prompt, or after multiple rounds of dialogue, these models stubbornly failed to produce the correct continuations.
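The exact prompts used in these qualitative experiments are not reproduced here, but the kind of few-shot prompt being described can be illustrated with a hypothetical worked example of a correct swapped completion placed before a new query:

# One demonstration with the swap handled correctly, followed by a new query.
few_shot_prompt = (
    "print, len = len, print\n"
    "\n"
    "def size(xs):\n"
    '    """Return the number of elements in xs."""\n'
    "    return print(xs)   # correct: print now computes lengths\n"
    "\n"
    "print, len = len, print\n"
    "\n"
    "def longest(words):\n"
    '    """Return the longest word in words."""\n'
)
# Per the behaviour described above, models tend to complete the query with the
# unswapped built-in (e.g. max(words, key=len)) rather than the correct
# max(words, key=print), even after seeing the demonstration.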

Implications of the Study

The research findings are disconcerting for those who rely heavily on LLMs, including software developers and users of automatic code generation tools like GitHub Copilot. The results indicate that LLMs lack a deep understanding of the precise and well-defined semantic structure of programming languages. Moreover, increasing the size of the model does not appear to alleviate the problem.

While several practical applications, such as code generation and completion, benefit from LLMs, this study raises a significant concern about LLMs’ real comprehension of the content they manipulate. Current LLMs may be excellent at handling typical tasks, but the results cast doubt on their suitability for tasks that deviate even slightly from their training data.

Future Research

The study opens the door for future research focused on more effective ways to improve LLMs’ understanding of programming language semantics. As newer and larger LLMs are developed, it will be crucial to explore scaling effects on related tasks and on other programming languages. More importantly, the research highlights the pressing need to address the limitations of current LLMs in order to build trustworthy, dependable AI tools across domains, including code generation.

The takeaway from the study is clear: significant improvements are needed before LLMs can truly comprehend and excel at programming language tasks that deviate from their training data. These findings urge the research community to address the shortcomings of LLMs and to work toward AI systems with a deeper understanding and more reliable capabilities across diverse tasks.

Original Paper