week 58 - Evaluating LLMs for Automated Unit Test Maintenance, Specification-Driven Development and Test-Driven Code Generation with LLMs

Marabesi Matheus 🚀

Published Feb 2, 2026

A Socio-Technical Grounded Theory on the Effect of Cognitive Dysfunctions in the Performance of Software Developers with ADHD and Autism

The concept of neurodiversity, encompassing conditions such as Autism Spectrum Disorder (ASD), Attention-Deficit/Hyperactivity Disorder (ADHD), dyslexia, and dyspraxia, challenges traditional views of these neurodevelopmental variations as disorders and instead frames them as natural cognitive differences that contribute to unique ways of thinking and problem-solving. Within the software development industry, known for its emphasis on innovation, there is growing recognition of the value neurodivergent individuals bring to technical teams. Despite this, research on the contributions of neurodivergent individuals in Software Engineering (SE) remains limited. This interdisciplinary Socio-Technical Grounded Theory study addresses this gap by exploring the experiences of neurodivergent software engineers with ASD and ADHD, examining the cognitive and emotional challenges they face in software teams. Based on interviews and a survey with 25 neurodivergent and 5 neurotypical individuals, our theory describes how neurodivergent cognitive dysfunctions affect SE performance, and how the individuals' individual journey and various accommodations can regulate this effect. We conclude our paper with a list of inclusive Agile practices, allowing organizations to better support neurodivergent employees and fully leverage their capabilities.

See full paper - Last Accessed (Jan 30, 2026)

TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing.

See full paper at arxiv.org - Last Accessed (Jan 30, 2026)

An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models

We consider the task of generating functionally correct code using large language models (LLMs). The correctness of generated code is influenced by the prompt used to query the given base LLM. We formulate the problem of finding the appropriate prompt as combinatorial search process and propose a Bayesian optimization (BO) approach referred to as {\em BO for Code GENeration (BODE-GEN)}. BODE-GEN performs an adaptive data-driven search over prompts guided by training data in the form of prompts tried and the functional accuracy of the generated code over a set of given test cases. The key insight is to perform BO in continuous embedding space by using an auxiliary LLM to bridge the gap between discrete prompt space and continuous embedding space. We leverage two synergistic ideas, namely, random projections and dimensionality scaled priors, to build effective Gaussian process based surrogate models over the high-dimensional embedding space. Our experiments on the HumanEval+ benchmark using multiple base LLMs show that BODE-GEN can improve performance in terms of code generation accuracy compared to fixed prompts and manual prompt engineering. Additionally, we demonstrate that BODE-GEN is sample-efficient, requiring relatively few iterations of BO to demonstrate improvements in code accuracy.

week 58 - Evaluating LLMs for Automated Unit Test Maintenance, Specification-Driven Development and Test-Driven Code Generation with LLMs

Marabesi Matheus 🚀

A Socio-Technical Grounded Theory on the Effect of Cognitive Dysfunctions in the Performance of Software Developers with ADHD and Autism

TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models

Recommended by LinkedIn

Recommending Move Method Refactoring Opportunities Based on Feature Fusion and DeepLearning

Agile Specification-Driven Development

Papers of the week

807 followers

More articles by Marabesi Matheus 🚀

Others also viewed

Why Spec Coding Matters Now

Vibe Coding, Human Design, and Data Management

Will "vibe coding" make developers obsolete?

AI Coding Agents: The Future of Software Development

ChatGPT for Developers 💻

Conversational Coding: Using AI to Translate Developer Intent into Cloud-Ready Code

Watsonx Code Assistant: Generative AI for Smarter Coding

What is Vibe Coding? AI-Driven Approach to Software Development

AI coding: Control over vibe

Why AI Coding Is Killing Your Job Title (And Why I'm Relieved)

Solving Coding Challenges With LLM Tools

Using LLMs as Microservices in Application Development

Advanced Thinking Strategies for Software Testers

How LLMs Improve Developer Collaboration and Code Reviews

Explore content categories

A Socio-Technical Grounded Theory on the Effect of Cognitive Dysfunctions in the Performance of Software Developers with ADHD and Autism

TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models

Recommended by LinkedIn

Recommending Move Method Refactoring Opportunities Based on Feature Fusion and DeepLearning

Agile Specification-Driven Development

Papers of the week

807 followers

More articles by Marabesi Matheus 🚀

week 60 - Monorepos, LLMs for automated unit test generation and Object-oriented design

week 59 - Mock Objects, Test Behaviors, Not Methods and How the practice of TDD influences class design

week 57 - TDD reporting experiment, Generating unit tests with descriptive names and Quantifying the effect of test-driven development on quality

week 56 - Iterative and Incremental Development, the power of LLMs for test case generation and Is LLM code better?

week 55 - Unit Test Smells in Javascript, GAMe For LEarning and Smells in Test Code Generated by LLM

week 54 - TDD + LLMs , Improving TDD with Automated Test Case Generation and Log Smells taxonomy

week 53 - Special edition 2024

week 52 - Test Code Refactoring Unveiled, An Improvement to TDD Efficiency and Large Language Models in Detecting Test Smells

week 51 - why developers implement OS-specific tests, does Treatment Adherence Impact in TDD and A framework for compliance rules for TDD

week 50 - Agile vs Waterfall, Matching Production and Test Files and Test smells in LLM-Generated Unit Tests

Others also viewed

Why Spec Coding Matters Now

Vibe Coding, Human Design, and Data Management

Will "vibe coding" make developers obsolete?

AI Coding Agents: The Future of Software Development

ChatGPT for Developers 💻

Conversational Coding: Using AI to Translate Developer Intent into Cloud-Ready Code

Watsonx Code Assistant: Generative AI for Smarter Coding

What is Vibe Coding? AI-Driven Approach to Software Development

AI coding: Control over vibe

Why AI Coding Is Killing Your Job Title (And Why I'm Relieved)

Similar topics

Solving Coding Challenges With LLM Tools

Using LLMs as Microservices in Application Development

Advanced Thinking Strategies for Software Testers

How LLMs Improve Developer Collaboration and Code Reviews

Explore content categories