Leveraging LLMs for Efficient Unit Test Generation

Unlocking the Potential of Artificial Intelligence in Software Development

As software engineers, we're constantly seeking ways to optimize our workflow and reduce manual labour. One promising area of innovation is the application of Large Language Models (LLMs) to coding tasks, with the potential to significantly enhance developer productivity. Tools such as GitHub Copilot have already demonstrated the value of AI-assisted code generation, providing developers with intelligent suggestions and automating routine coding tasks.

Our goal is to go a step further: leveraging LLMs to automate entire areas of software development, such as unit testing, with greater independence from human programmers. Unit testing is a natural starting point because it offers a well-defined problem domain. We aim to identify the LLMs best suited to this task and unlock their potential to streamline unit test generation.

The LLM Ecosystem: A Brief Overview

The landscape of LLMs is diverse, comprising both open-source (OSS) and commercial options. OSS models offer the possibility of self-hosting, enabling secure processing of confidential data – an approach that we at TNG utilize with various models. Commercial LLMs, in contrast, have demonstrated superior output quality across various benchmarks [1][2][3][4].

Evaluating LLMs: Beyond Code Coverage

Evaluating the quality of generated tests is crucial to understanding how much we can benefit from the model that generated them. Code coverage alone is insufficient: it only measures the percentage of code executed during testing, without requiring any assertions or validation of the results. To address this limitation, we use mutation testing, in which the code under test is intentionally modified to introduce small errors ("mutants"), and the unit tests are run to see whether they detect these changes.
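
To make this concrete, consider the following minimal JUnit 5 sketch (the class and values are invented for illustration; for Java, tools such as PIT generate and run such mutants automatically). Both tests achieve full line coverage, but only the second one detects, or "kills", the mutant:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Hypothetical class under test.
class PriceCalculator {
    double applyDiscount(double price, double rate) {
        return price * (1 - rate); // a mutation tool may flip '-' to '+'
    }
}

class PriceCalculatorTest {
    private final PriceCalculator calculator = new PriceCalculator();

    @Test
    void coversTheCodeButAssertsNothing() {
        // Executes every line (full coverage), yet the '+' mutant survives.
        calculator.applyDiscount(100.0, 0.2);
    }

    @Test
    void killsTheMutant() {
        // The mutated version returns 120.0 instead of 80.0, so this
        // assertion fails and the mutant is killed.
        assertEquals(80.0, calculator.applyDiscount(100.0, 0.2), 1e-9);
    }
}
```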

We also investigate the potential of using LLMs as judges to analyze tests and provide feedback on their quality. Specifically, we employ LLMs to score metrics such as simplicity and readability, the match between a test's name and its content, and test independence on a scale from 0 to 4, allowing for a more nuanced evaluation of test quality and effectiveness.
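
A rubric prompt for such a judge might look like the sketch below (the wording and JSON keys are hypothetical, not our exact production prompt):

```java
// Hypothetical rubric prompt for an LLM judge; the criteria mirror the
// metrics above, while the prompt wording and JSON keys are illustrative.
class JudgePrompt {
    static String build(String testSource) {
        return """
            You are reviewing a Java unit test. Score each criterion from 0 (poor) to 4 (excellent):
            1. Simplicity and readability
            2. Match between the test name and what the test actually verifies
            3. Independence from other tests (no ordering or shared-state assumptions)
            Respond with JSON only, e.g.
            {"simplicity_readability": 3, "name_matches_content": 4, "independence": 2}

            Test under review:
            """ + testSource;
    }
}
```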

By combining mutation testing with LLM-based evaluation, we can gain a more comprehensive understanding of our tests' strengths and weaknesses. 

Strategies for Effective LLM Utilization

To effectively leverage LLMs in unit testing, we employ a range of strategies, including:

  • Divide and Conquer: Breaking down complex tasks into smaller, manageable components until each can be addressed with a single LLM call.
  • One-Shot Prompting: Providing the model with a single worked example of the desired output (see the sketch below).
  • Explanation Requests: Asking the model to explain its output to gain deeper insight into its decision-making process.
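
As an illustration of one-shot prompting in this setting, a prompt might be assembled as follows (the example method, test, and helper names are invented):

```java
// Hypothetical one-shot prompt assembly: a single worked example (method
// plus matching test) is prepended before the method we want tested.
class OneShotPrompt {
    private static final String EXAMPLE = """
        // Example method:
        int add(int a, int b) { return a + b; }

        // Example test:
        @Test
        void addReturnsSumOfOperands() {
            assertEquals(5, new Calculator().add(2, 3));
        }
        """;

    static String build(String methodUnderTest) {
        return "Write a JUnit 5 test for the Java method below, "
             + "following the style of this example:\n\n" + EXAMPLE
             + "\nMethod under test:\n" + methodUnderTest;
    }
}
```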

By combining these strategies, we can get the most out of LLMs when generating unit tests in our experiments.

Our Experience with LLMs in Coding

In our exploration of LLMs for coding tasks, we've found it beneficial to split the work into a planning phase and a coding phase when working with models that have distinct strengths. This separation allows for more flexible task allocation and lets each model play to its strengths.
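
A minimal sketch of such a two-phase pipeline is shown below (the LlmClient interface and the prompts are assumptions for illustration, not a specific API):

```java
// Hypothetical two-phase pipeline: a "planner" model drafts test cases,
// and a "coder" model turns that plan into JUnit code.
interface LlmClient {
    String complete(String prompt);
}

class TwoPhaseTestGenerator {
    private final LlmClient planner; // e.g. a reasoning-strong model
    private final LlmClient coder;   // e.g. a code-strong model

    TwoPhaseTestGenerator(LlmClient planner, LlmClient coder) {
        this.planner = planner;
        this.coder = coder;
    }

    String generateTests(String classSource) {
        String plan = planner.complete(
            "List test cases (name, scenario, expected result) for this Java class:\n"
            + classSource);
        return coder.complete(
            "Implement these test cases as a JUnit 5 test class:\n" + plan
            + "\nClass under test:\n" + classSource);
    }
}
```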

Our experience, which currently focuses on generating unit tests for Java, has led us to favour particular combinations of planning and coding models, compared in Figure 1.

On the open-source front, DeepSeek-R1 has emerged as a promising contender, demonstrating notable potential in reasoning capabilities for both planning and coding tasks. However, a significant gap in output quality remains between commercial and open-source models, as illustrated in Figure 1:

Figure 1: Output quality for unit test generation across different LLM combinations, averaged over different tested classes. Code and mutation coverage percentages are scaled to a score between 0 and 4.

Notably, while metrics evaluated using an LLM as a judge (GPT-4o) show minimal variation between models, substantial differences are observed in code and mutation coverage, highlighting the existing disparities in model performance.

Key Findings

The key takeaway from our research is that LLMs can be incredibly powerful tools when used properly, but their potential often needs to be unlocked by getting them to "think" about the task at hand rather than just generate code. By creatively coaxing LLMs to understand and reflect on their output, we can tap into their full potential for automated testing, paving the way for more efficient, reliable, and innovative coding practices.

[1] https://towardsdatascience.com/llms-for-coding-in-2024-performance-pricing-and-the-battle-for-the-best-fba9a38597b6/

[2] https://livebench.ai/#/

[3] https://www.vellum.ai/llm-leaderboard

[4] https://www.keywordsai.co/blog/top-benchmarks-for-the-best-open-source-coding-llms
