This technical report evaluates the effectiveness of different Large Language Models (LLMs) for automated code review in Git pull requests. At CloudAEye, we provide AI-powered code review agents that integrate directly with GitHub, helping development teams identify bugs, security vulnerabilities, and potential improvements before code is merged.
Our AI agents analyze pull requests in real time, providing detailed feedback on code quality and potential issues and answering any queries related to the code changes. This approach helps development teams:
Catch bugs and security vulnerabilities earlier in the development cycle
Maintain consistent code quality standards across teams
Reduce the time spent on manual code reviews
Accelerate development velocity while improving code reliability
In this report, we compare several leading LLMs to determine their effectiveness for automated code review tasks. We focus on their ability to identify intentionally planted bugs, generate accurate bug reports and PR descriptions, and flag security vulnerabilities.
Cost Considerations
Cost is a critical factor in agentic workflows, where multiple LLM calls are typically made during the analysis of a single pull request. The table below compares the input and output costs per million tokens for various LLM providers and models.
Table 1: Cost Comparison for Large Language Models (Input/Output per million tokens)
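To see how these rates translate into per-PR spend, here is a minimal sketch of the arithmetic, assuming illustrative call counts, token volumes, and placeholder prices; substitute the actual rates from Table 1 and your own usage figures.

```python
# Rough per-PR cost estimate for an agentic review workflow.
# All numbers below are illustrative placeholders, not actual provider pricing.

def review_cost(calls, input_tokens_per_call, output_tokens_per_call,
                input_price_per_million, output_price_per_million):
    """Estimate the LLM cost of reviewing one pull request."""
    total_input = calls * input_tokens_per_call
    total_output = calls * output_tokens_per_call
    return ((total_input / 1_000_000) * input_price_per_million
            + (total_output / 1_000_000) * output_price_per_million)

# Example: 6 LLM calls per PR, roughly 8k input and 1k output tokens each,
# at hypothetical rates of $3 (input) and $15 (output) per million tokens.
print(f"Estimated cost: ${review_cost(6, 8_000, 1_000, 3.0, 15.0):.2f} per PR")
```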
Experiment: Writing Buggy Code on Purpose
We created code with intentional bugs to test our code review bot with different LLM configurations. The goal was to see if it could catch different types of problems while avoiding false alarms.
We made changes to two files in a Python data science library:
Component 1: DataFrame Processor
Here are the key vulnerabilities we added (sketched after this list):
Unsafe use of eval() lets attackers run any code they want on your server!
Unsafe file saving: if the program crashes while saving, you could lose data or corrupt files.
Base64 encoding looks secure but offers no real protection, and the required import is missing!
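As an illustration only, the sketch below captures the flavor of these planted bugs. The class and method names are hypothetical, not the exact code from the PR, and unlike the real change the base64 import is included here so the snippet runs.

```python
import base64
import pickle


class DataFrameProcessor:
    """Hypothetical sketch of the planted bugs; names are illustrative."""

    def __init__(self):
        self.data = {}

    def apply_expression(self, expr, row):
        # Bug 1: eval() on a user-supplied expression allows arbitrary code execution
        return eval(expr, {}, {"row": row})

    def save_state(self, path):
        # Bug 2: writes straight to the target file with no temp-file/rename step,
        # so a crash mid-write can lose data or corrupt the saved state
        with open(path, "wb") as f:
            pickle.dump(self.data, f)

    def protect_value(self, value):
        # Bug 3: base64 is reversible encoding, not encryption; it hides nothing
        return base64.b64encode(str(value).encode()).decode()
```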
Component 2: Numerical Transformer
More bugs in the math code (sketched after this list):
A division-by-zero error
Multiple errors packed into a single function
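The snippet below is a hypothetical reconstruction of this flavor of bug, loosely modeled on the log-scaling issues the reviewers flag later in this report; the indentation error present in the real PR is not reproduced here so the example stays runnable.

```python
import numpy as np


class LogTransform:
    """Hypothetical sketch of the numerical transformer bugs; names are illustrative."""

    def log_scaled(self, values, scale=0.0):
        # Bug 1: the default scale of 0.0 triggers a ZeroDivisionError on the first call
        scaled = [v / scale for v in values]
        # Bug 2: np.log silently yields -inf/NaN for non-positive inputs
        return np.log(scaled)
```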
PR Description Comparison
This section compares how different LLM-based code review bots describe the same pull request.
Figure 1: PR Description using Anthropic Claude 3.7 Sonnet
Figure 2: PR Description using OpenAI GPT-4o
Figure 3: PR Description using Deepseek R1 Distill Llama 70B
Figure 4: PR Description using Deepseek R1 Distill Qwen 32B
Figure 5: PR Description using Qwen 2.5 Coder
Figure 6: PR Description using Llama 3.3 70B
Analysis of PR Descriptions
Table 2: Scoring of PR descriptions generated by different LLM-based code review bots
Scoring Criteria Explanation
Completeness: Coverage of all significant changes in the PR (both files and all classes/functions)
Technical accuracy: Correctness of technical details and implementation description
Clarity: How easily understandable the description is for developers
Conciseness: Organization of the information in a short but logically sound format
Focus on key changes: Emphasis on the most important aspects of the changes
Additional recommendations: Quality and relevance of suggestions for improvements
Comparative Analysis
Claude 3.7 Sonnet: Highly detailed and technically accurate, but selective in coverage, focusing extensively on LogTransform while only briefly mentioning the newly added files. Verbose in places. Its additional recommendations were the most helpful among all models.
GPT-4o: Provided complete coverage with excellent clarity. Technical details were slightly vague, and additional recommendations were minimal. Well-organized and concise.
DeepSeek Llama 70B (Groq): Performed poorly across most criteria. Failed to cover significant portions of the changes, had low technical accuracy, and lacked clarity. The weakest performer overall.
DeepSeek Qwen 32B (Groq): Concise but missed critical components, particularly the log transformation functionality. Low technical accuracy with almost no valuable recommendations.
Qwen 2.5 Coder (Groq): Excelled in clarity and conciseness. Good coverage of changes with an effective focus on key modifications. Its additional recommendations could be improved, but it performed strongly overall, tying with Claude for second place.
Llama 3.3 70B (Groq): The top performer, with balanced strengths across all criteria. Very concise while highlighting important changes effectively. Could have provided more detail in some areas, but otherwise nearly perfect.
Bug and Security Report Comparison
When asked for a review, the CloudAEye bot identifies potential bugs and security risks in the code, as shown in Figures 7 and 8.
Analysis of Code Review
Table 3: Scoring of Bug Reports Generated by Different LLM-based Code Review Bots
Bot 1 (Claude 3.7 Sonnet) Review Assessment
1. Coverage: 4.5/5
The bot thoroughly examined both files (dataframe_processor.py and transformers.py)
Identified a wide range of issues across different severity levels
Detailed both prominent issues (eval, pickle) and more subtle problems (division by zero, indentation)
Only slight deduction because it didn’t explicitly mention the risks of updating dict directly from untrusted input in load_from_export
Figure 7: A bug alert example
Figure 8: A security alert example
2. Accuracy (no false positives): 4.5/5
Most identified issues are legitimate concerns
Correctly flagged critical security issues with eval() and pickle
Correctly identified indentation errors in transformers.py
Minor deduction for the “Missing numpy import” issue - the bot notes that numpy is likely already imported, so this seems like a hedge rather than a clear false positive
3. Technical details: 5/5
Provided excellent depth in explanations
Included code snippets for each issue
Gave clear explanations of why each issue is problematic
Described potential consequences of each bug
Suggested remediation approaches
References to specific lines and methods were precise
4. Bug alert quality: 5/5
Structured format with tags, paths, line numbers, and priorities (illustrated after this list)
A clear distinction between different types of issues
Appropriate prioritization of issues (critical security flaws rated “High”)
Detailed explanations for each bug
Good categorization of similar but distinct issues (e.g., separate entries for different problems with log_scaled)
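For illustration, an alert with the structure described above might look like the following sketch; this is not CloudAEye's actual output schema, only a hypothetical arrangement of the fields the assessment refers to.

```python
# Hypothetical bug alert (illustrative fields and values, not the bot's real schema)
bug_alert = {
    "tag": "unsafe-eval",
    "path": "dataframe_processor.py",
    "line": 42,  # illustrative line number
    "priority": "High",
    "summary": "eval() is called on a user-supplied expression",
    "details": "Allows arbitrary code execution; replace eval() with a safe parser.",
}
```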
5. Security alert quality: 5/5
Clear identification of major security vulnerabilities (a sketch of such an alert follows this list)
Appropriate risk scores with justifications
Used standard vulnerability naming conventions
Provided detailed explanations of attack vectors
Distinguished between different types of security issues (injection, deserialization, data exposure)
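A comparable hypothetical security alert, again only a sketch of the fields described above (standard vulnerability name, risk score with justification, attack vector), might look like this:

```python
# Hypothetical security alert (illustrative fields and values, not the bot's real schema)
security_alert = {
    "vulnerability": "Insecure Deserialization",
    "path": "dataframe_processor.py",
    "risk_score": 8.5,  # illustrative score
    "justification": "pickle.load() on untrusted input can execute arbitrary code",
    "attack_vector": "A crafted pickle payload supplied through an imported export file",
}
```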
6. No overlap between bug and security reports: 3.5/5
Some significant overlap between bug and security reports
The eval() issue appears in both reports
Pickle serialization/deserialization appears in both reports
The base64 encoding issue appears in both reports
Bot 2 (GPT-4o) Review Assessment
1. Coverage: 3.5/5
Covered key issues in both files (dataframe_processor.py and transformers.py)
The eval() risk is somewhat understated as a “potential TypeError” rather than a security vulnerability
3. Technical details: 4/5
Provided good explanations for most issues
Included “additional info” for several bugs, adding useful context
Distinguished between different severity levels appropriately
Made connections between components that other bots missed
The security report has good technical details, though less comprehensive than Bot 1
4. Bug alert quality: 4.5/5
Well-structured format with clear tags and priorities
Good separation of distinct issues
Appropriate prioritization of issues
Included relevant line numbers
Added supplementary context in “additional info” fields for many bugs
Took a unique perspective on potential runtime errors
5. Security alert quality: 4/5
Identified major security vulnerabilities (eval, pickle)
Provided reasonable risk scores with justifications
Used standard vulnerability naming conventions
Good explanations of security impact
6. No overlap between bug and security reports: 4.5/5
Excellent separation between bug and security reports
Bug report focuses on runtime and implementation issues
Security report focuses on security vulnerabilities
Minimal overlap between sections
Different perspectives on the eval() issue in bug vs. security sections
Conclusion
LLM capabilities are advancing quickly, making it essential to assess their suitability for your specific use cases. As software development grows more complex, leveraging AI-driven solutions from CloudAEye will be crucial for ensuring the delivery of high-quality, secure, and efficient code while improving developer efficiency.
About the Author: Hardik Prabue works as a Machine Learning Researcher at CloudAEye.
Speed and quality are crucial in software development. Manual test failure analysis is time-consuming and error-prone, delaying issue resolution. CloudAEye's automated test failure analysis within CI pipelines revolutionizes software testing and debugging with an AI-augmented approach that accelerates root cause analysis (RCA). The GenAI-based solution swiftly identifies the underlying software issues behind test failures by transforming intricate error logs and code analysis into succinct RCA summaries.
Code reviews are vital for quality assurance before deployment but often take over a week. CloudAEye tackles these challenges by ensuring AI code security and reliability, detecting vulnerabilities, and providing actionable fixes. The solution acts as an essential guardrail for your AI projects, enabling rapid and confident progress.