Technical Report on LLM Performance for Code Review

Introduction

This technical report evaluates the effectiveness of different Large Language Models (LLMs) for automated code review in Git pull requests. At CloudAEye, we provide AI-powered code review agents that integrate directly with GitHub, helping development teams identify bugs, security vulnerabilities, and potential improvements before code is merged.

Our AI agents analyze pull requests in real time, providing detailed feedback on code quality and potential issues and answering queries related to the code changes. This approach helps development teams:

  • Catch bugs and security vulnerabilities earlier in the development cycle
  • Maintain consistent code quality standards across teams
  • Reduce the time spent on manual code reviews
  • Accelerate development velocity while improving code reliability

In this report, we compare several leading LLMs to determine their effectiveness for automated code review tasks. We focus on their ability to identify intentionally planted bugs and security vulnerabilities, and to produce accurate bug reports and PR descriptions.

Cost Considerations

Cost is a critical factor in agentic workflows, where multiple LLM calls are typically made during the analysis of a single pull request. The table below compares the input and output costs per million tokens for various LLM providers and models.

Table 1: Cost Comparison for Large Language Models (Input/Output per million tokens)
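
Because an agentic review typically chains several LLM calls per pull request, the per-PR cost follows directly from the token volume of each call and the per-million-token prices in Table 1. The sketch below is a minimal illustration of that arithmetic, using hypothetical prices and token counts rather than the exact figures from the table.

    # Rough per-PR cost estimate for an agentic review workflow.
    # Prices and token counts are illustrative assumptions, not the
    # exact figures from Table 1.

    def call_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
        """Cost in USD of a single LLM call, given per-million-token prices."""
        return ((input_tokens / 1_000_000) * in_price_per_m
                + (output_tokens / 1_000_000) * out_price_per_m)

    # A typical review of one pull request chains several calls,
    # e.g. PR description, bug report, and security report.
    calls = [
        {"in": 12_000, "out": 1_500},   # PR description
        {"in": 20_000, "out": 3_000},   # bug report
        {"in": 20_000, "out": 2_500},   # security report
    ]

    IN_PRICE, OUT_PRICE = 3.00, 15.00   # hypothetical $ per 1M tokens

    total = sum(call_cost(c["in"], c["out"], IN_PRICE, OUT_PRICE) for c in calls)
    print(f"Estimated cost for this PR: ${total:.3f}")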

Experiment: Writing Buggy Code on Purpose

We created code with intentional bugs to test our code review bot with different LLM configurations. The goal was to see if it could catch different types of problems while avoiding false alarms.

You can see the full pull request here: github.com/CloudAEye/sklearn-pandas/pull/8

Planted Bugs and Security Issues

We made changes to two files in a Python data science library:

Component 1: DataFrame Processor

Here are the key vulnerabilities we added (sketched in simplified form after the list):

  • Arbitrary code execution via eval(): this allows attackers to run any code they want on your server!
  • Unsafe pickle-based save with no error handling: if your program crashes while saving, you could lose data or corrupt files.
  • Base64 "encryption" of sensitive data: Base64 encoding looks secure but isn't, plus the import is missing!
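
The sketch below is an illustrative reconstruction of these planted issues, not the exact diff from the pull request; the class and method names are hypothetical.

    import pickle

    class DataFrameProcessor:
        def apply_expression(self, df, expression):
            # Planted vulnerability: eval() on a user-supplied string lets an
            # attacker execute arbitrary code on the server.
            return eval(expression, {"df": df})

        def export_state(self, path):
            # Planted bug: no error handling around the write, so a crash
            # mid-save can lose data or leave a corrupt file; pickle is also
            # unsafe to load back from untrusted sources.
            with open(path, "wb") as f:
                pickle.dump(self.__dict__, f)

        def protect_column(self, df, column):
            # Planted issue: base64 is encoding, not encryption, so the data
            # is not actually protected -- and the base64 import is missing
            # on purpose, so this method fails at runtime.
            df[column] = df[column].apply(
                lambda v: base64.b64encode(str(v).encode()).decode()
            )
            return df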

Component 2: Numerical Transformer

More bugs in the math code (again sketched in simplified form after the list):

  • A division-by-zero error in the log_scaled transformation
  • Multiple errors packed into a single function
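
As before, this is an illustrative reconstruction with hypothetical names, not the exact code from the PR.

    import numpy as np

    class NumericalTransformer:
        def log_scaled(self, values):
            # Planted bug: dividing by the standard deviation without a zero
            # check breaks on constant columns, and np.log is invalid for
            # non-positive inputs.
            values = np.asarray(values, dtype=float)
            scaled = values / values.std()   # std() may be 0
            return np.log(scaled)            # log of values <= 0 is invalid

        def min_max(self, values):
            # Planted bugs: several problems packed into one function -- the
            # range may be zero, and the result is never returned.
            values = np.asarray(values, dtype=float)
            span = values.max() - values.min()           # may be 0
            normalized = (values - values.min()) / span
            # missing "return normalized"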

PR Description Comparison

This section compares how different LLM-based code review bots describe the same pull request.

Figure 1: PR Description using Anthropic Claude 3.7 Sonnet
Figure 2: PR Description using OpenAI GPT-4o
Figure 3: PR Description using DeepSeek R1 Distill Llama 70B
Figure 4: PR Description using DeepSeek R1 Distill Qwen 32B
Figure 5: PR Description using Qwen 2.5 Coder
Figure 6: PR Description using Llama 3.3 70B

Analysis of PR Descriptions

Table 2: Scoring of PR descriptions generated by different LLM-based code review bots

Scoring Criteria Explanation

  • Completeness: Coverage of all significant changes in the PR (both files and all classes/functions)
  • Technical accuracy: Correctness of technical details and implementation description
  • Clarity: How easily understandable the description is for developers
  • Conciseness: Organization of the information in a short but logically sound format
  • Focus on key changes: Emphasis on the most important aspects of the changes
  • Additional recommendations: Quality and relevance of suggestions for improvements

Comparative Analysis

  • Claude 3.7 Sonnet: Highly detailed and technically accurate, but selective in coverage, focusing extensively on LogTransform while only briefly mentioning the newly added files. Thorough yet verbose. Its additional recommendations were the most helpful among all models.
  • GPT-4o: Provided complete coverage with excellent clarity. Technical details were slightly vague, and additional recommendations were minimal. Well-organized and concise.
  • DeepSeek Llama 70B (Groq): Performed poorly across most criteria. Failed to cover significant portions of the changes, had low technical accuracy, and lacked clarity. The weakest performer overall.
  • DeepSeek Qwen 32B (Groq): Concise but missed critical components, particularly the log transformation functionality. Low technical accuracy with almost no valuable recommendations.
  • Qwen 2.5 Coder (Groq): Excelled in clarity and conciseness. Good coverage of changes with an effective focus on key modifications. Its additional recommendations could be improved, but it performed strongly overall, tying with Claude for second place.
  • Llama 3.3 70B (Groq): The top performer with balanced strengths across all criteria. Very concise while highlighting important changes effectively. Could have been more verbose in some areas, but otherwise nearly perfect.

Bug and Security Report Comparison

The CloudAEye bot identifies potential bugs and security risks in the code when asked for a review, as shown in Figures 7 and 8.

Analysis of Code Review

Table 3: Scoring of Bug Reports Generated by Different LLM-based Code Review Bots

Bot 1 (Claude 3.7 Sonnet) Review Assessment

1. Coverage: 4.5/5

  • The bot thoroughly examined both files (dataframe_processor.py and transformers.py)
  • Identified a wide range of issues across different severity levels
  • Covered security vulnerabilities, syntax errors, potential runtime issues, and design flaws
  • Detailed both prominent issues (eval, pickle) and more subtle problems (division by zero, indentation)
  • Only a slight deduction because it didn't explicitly mention the risks of updating __dict__ directly from untrusted input in load_from_export

Figure 7: A bug alert example

Figure 8: A security alert example

2. Accuracy (no false positives): 4.5/5

  • Most identified issues are legitimate concerns
  • Correctly flagged critical security issues with eval() and pickle
  • Correctly identified indentation errors in transformers.py
  • Minor deduction for the “Missing numpy import” issue - the bot notes that numpy is likely already imported, so this seems like a hedge rather than a clear false positive

3. Technical details: 5/5

  • Provided excellent depth in explanations
  • Included code snippets for each issue
  • Gave clear explanations of why each issue is problematic
  • Described potential consequences of each bug
  • Suggested remediation approaches
  • References to specific lines and methods were precise

4. Bug alert quality: 5/5

  • Structured format with tags, paths, line numbers, and priorities (see the illustrative sketch after this list)
  • A clear distinction between different types of issues
  • Appropriate prioritization of issues (critical security flaws rated “High”)
  • Detailed explanations for each bug
  • Good categorization of similar but distinct issues (e.g., separate entries for different problems with log_scaled)
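
To make that format concrete, a hypothetical bug-alert entry might look like the following. The field names and values here are assumptions for illustration, not CloudAEye's exact output schema.

    # Hypothetical bug-alert entry; field names and values are assumptions,
    # not CloudAEye's exact schema.
    bug_alert = {
        "tag": "security",
        "path": "sklearn_pandas/dataframe_processor.py",   # hypothetical path
        "line": 42,                                          # hypothetical line
        "priority": "High",
        "title": "Unsafe eval() of user-supplied expression",
        "description": (
            "eval() is called on an untrusted string, allowing arbitrary "
            "code execution on the server."
        ),
        "suggested_fix": (
            "Replace eval() with a restricted evaluator such as "
            "ast.literal_eval or an explicit whitelist of operations."
        ),
    }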

5. Security alert quality: 5/5

  • Clear identification of major security vulnerabilities
  • Appropriate risk scores with justifications
  • Used standard vulnerability naming conventions
  • Provided detailed explanations of attack vectors
  • Distinguished between different types of security issues (injection, deserialization, data exposure)

6. No overlap between bug and security reports: 3.5/5

  • Some significant overlap between bug and security reports
  • The eval() issue appears in both reports
  • Pickle serialization/deserialization appears in both reports
  • The base64 encoding issue appears in both reports

Bot 2 (GPT-4O) Review Assessment

1. Coverage: 3.5/5

  • Covered key issues in both files (dataframe_processor.py and transformers.py)
  • Identified critical security vulnerabilities (eval, pickle)
  • Missed several important issues that Bot 1 caught:
      - Missing base64 import
      - Exposure of the entire internal state via __dict__
      - No error handling in load_from_export
      - No validation of export_dir existence
      - Potential negative values in logarithm input
  • Generally focused on fewer issues but with decent depth

2. Accuracy (no false positives): 5/5

  • All identified issues are legitimate concerns
  • No false positives were detected in the report
  • Appropriate classification of issues by severity
  • Correctly distinguished between functional bugs and security vulnerabilities

3. Technical details: 4/5

  • Provided good explanations for each issue
  • Included code snippets for context
  • Clear descriptions of potential consequences
  • Detailed security risk scores with justifications
  • Slightly less comprehensive explanations than Bot 1, particularly for security issues

4. Bug alert quality: 4/5

  • Well-structured format with tags and priorities
  • Clear descriptions of each issue
  • Good separation of distinct problems
  • Appropriate prioritization (eval and division by zero as “High”)
  • Missing line numbers in most bug reports, which reduces traceability

5. Security alert quality: 4.5/5

  • Clear identification of major security vulnerabilities
  • Provided appropriate risk scores with justifications
  • Used standard vulnerability naming conventions
  • Good distinction between different types of security issues
  • Included detailed vulnerability descriptions
  • Slightly less comprehensive than Claude in describing attack vectors

6. No overlap between bug and security reports: 4/5

  • Limited overlap between bug and security reports
  • The eval() issue appears in both reports (primary overlap)
  • The pickle vulnerability appears only in the security report, while division by zero appears in both (but categorized differently)
  • Better separation of concerns than Claude, with distinct issues in each section

Bot 3 (DeepSeek Llama Distill 70B) Review Assessment

1. Coverage: 3/5

  • Covered some key issues in both files but missed several important problems
  • Identified critical security vulnerabilities (eval, pickle, base64 encoding)
  • Missed important issues that the other bots caught:
      - Indentation errors in transformers.py methods
      - No error handling in load_from_export
      - No validation of export_dir existence
      - Complete serialization of internal state via __dict__
      - Adding functionality to a deprecated class
  • Much less comprehensive than both previous bots
  • Bug report contains fewer entries than the other bots’ reports

2. Accuracy (no false positives): 5/5

  • All identified issues are legitimate concerns
  • No false positives detected
  • Appropriate classification of issues by severity
  • Correctly distinguished between bugs and security vulnerabilities

3. Technical details: 3.5/5

  • Provided decent explanations for security issues
  • Security issues have better detail than bug issues
  • Bug report lacks depth in explaining consequences and suggested fixes
  • Most bug descriptions are quite brief compared to the other bots
  • Missing details on how issues could affect users/systems

4. Bug alert quality: 3/5

  • Basic structure with tags, priorities, and locations
  • Very concise descriptions of each issue (sometimes too brief)
  • Appropriate prioritization of issues
  • Some issues lack sufficient explanation (e.g., “Potential incomplete state loading”)
  • Redundancy with two separate entries for scaling issues in transformers.py

5. Security alert quality: 4/5

  • Clear identification of major security vulnerabilities
  • Good risk scores with justifications
  • Standard vulnerability naming conventions used
  • Included longer code snippets for context
  • Reasonable explanations of the security impact
  • Not as detailed as Bot 1’s security analysis

6. No overlap between bug and security reports: 4.5/5

  • Very minimal overlap between bug and security reports
  • The bug report focuses mostly on implementation issues
  • The security report focuses on security vulnerabilities
  • Great separation of concerns (best of the three bots in this category)
  • Bug report mentions missing base64 import while the security report covers the incorrect use of base64 for security

Bot 4 (DeepSeek R1 Qwen 32B) Review Assessment

1. Coverage: 3/5

  • Covered some key issues in both files but missed several important problems
  • Identified critical security vulnerabilities (eval, pickle, base64 encoding)
  • Missed important issues that the other bots caught:
      - Indentation errors in transformers.py methods
      - No error handling in load_from_export
      - No validation of export_dir existence
      - Complete serialization of internal state via __dict__
      - Adding functionality to a deprecated class
  • Much less comprehensive than Bots 1 and 2
  • Bug report contains fewer entries than the other bots’ reports

2. Accuracy (no false positives): 5/5

  • All identified issues are legitimate concerns
  • No false positives detected
  • Appropriate classification of issues by severity
  • Correctly distinguished between bugs and security vulnerabilities

3. Technical details: 3.5/5

  • Provided decent explanations for security issues
  • Security issues have better detail than bug issues
  • Bug report lacks depth in explaining consequences and suggested fixes
  • Most bug descriptions are quite brief compared to the other bots
  • Missing details on how issues could affect users/systems

4. Bug alert quality: 3/5

  • Basic structure with tags, priorities, and locations
  • Very concise descriptions of each issue (sometimes too brief)
  • Appropriate prioritization of issues
  • Some issues lack sufficient explanation (e.g., “Potential incomplete state loading”)
  • Redundancy with two separate entries for scaling issues in transformers.py

5. Security alert quality: 4/5

  • Clear identification of major security vulnerabilities
  • Good risk scores with justifications
  • Standard vulnerability naming conventions used
  • Included longer code snippets for context
  • Reasonable explanations of the security impact
  • Not as detailed as Bot 1’s security analysis

6. No overlap between bug and security reports: 4.5/5

  • Very minimal overlap between bug and security reports
  • The bug report focuses mostly on implementation issues
  • The security report focuses on security vulnerabilities
  • Great separation of concerns (matching Bot 3 as the best so far in this category)
  • Bug report mentions missing base64 import while the security report covers the incorrect use of base64 for security

Bot 5 (Qwen Coder 2.5) Review Assessment

1. Coverage: 2/5

  • The bug report has a critical flaw: it duplicates all dataframe_processor.py issues in transformers.py
  • This is a serious error that significantly reduces the credibility of the review
  • Identified some key issues (eval, pickle, base64 encoding)
  • Missing critical issues in transformers.py like indentation errors and division by zero
  • No meaningful analysis of the actual transformers.py file issues

2. Accuracy (no false positives): 2/5

  • Major accuracy issue: incorrectly attributed dataframe_processor.py bugs to transformers.py
  • All transformers.py bugs are false positives (they're copy-pastes from dataframe_processor.py)
  • Some bugs appear reasonable for dataframe_processor.py but are completely irrelevant for transformers.py
  • Few genuine issues were identified correctly (eval, pickle, base64)

3. Technical details: 2.5/5

  • The security report has decent technical details
  • The bug report is very brief with minimal explanations
  • Lacks depth in explaining consequences or providing remediation suggestions
  • The duplication severely undermines any technical value in the report
  • Missing important technical details about division by zero and indentation issues

4. Bug alert quality: 2.5/5

  • Major quality issue with the duplicated bugs across files
  • Brief descriptions with minimal context
  • Some bugs are noted without adequate explanation
  • The duplication shows poor quality control

5. Security alert quality: 3.5/5

  • Security report is better than the bug report
  • Identifies major security issues (eval, pickle, base64)
  • Provides reasonable risk scores with justifications
  • Good explanation of the security impact
  • The structure is clear with vulnerability names and details

6. No overlap between bug and security reports: 3/5

  • Some separation between bug and security reports
  • Eval() and pickle vulnerabilities appear in both sections
  • Security report focuses more on the attack vector
  • Bug report is more focused on implementation issues
  • The massive duplication issue overshadows any positive aspects here

Bot 6 (Llama 3.3 70B) Review Assessment

1. Coverage: 3.5/5

  • Covered key issues in both files (dataframe_processor.py and transformers.py)
  • Identified critical security vulnerabilities (eval, pickle)
  • Took a different approach by focusing on potential runtime errors and edge cases
  • Identified division by zero risk in log_scaled transformation
  • Made good connections between changes in transformers.py and potential impacts in dataframe_processor.py
  • Missed some important issues like indentation errors in transformers.py methods

2. Accuracy (no false positives): 4.5/5

  • Most identified issues are legitimate concerns
  • No clear false positives detected
  • Focused on potential runtime errors that are plausible
  • Made reasonable connections between the transformers.py changes and dataframe_processor.py
  • The eval() risk is somewhat understated as a “potential TypeError” rather than a security vulnerability

3. Technical details: 4/5

  • Provided good explanations for most issues
  • Included “additional info” for several bugs, adding useful context
  • Distinguished between different severity levels appropriately
  • Made connections between components that other bots missed
  • The security report has good technical details, though less comprehensive than Bot 1

4. Bug alert quality: 4.5/5

  • Well-structured format with clear tags and priorities
  • Good separation of distinct issues
  • Appropriate prioritization of issues
  • Included relevant line numbers
  • Added supplementary context in “additional info” fields for many bugs
  • Took a unique perspective on potential runtime errors

5. Security alert quality: 4/5

  • Identified major security vulnerabilities (eval, pickle)
  • Provided reasonable risk scores with justifications
  • Used standard vulnerability naming conventions
  • Good explanations of security impact

6. No overlap between bug and security reports: 4.5/5

  • Excellent separation between bug and security reports
  • Bug report focuses on runtime and implementation issues
  • Security report focuses on security vulnerabilities
  • Minimal overlap between sections
  • Different perspectives on the eval() issue in bug vs. security sections

Conclusion

LLM capabilities are advancing quickly, making it essential to assess their suitability for your specific use cases. As software development grows more complex, leveraging AI-driven solutions from CloudAEye will be crucial for ensuring the delivery of high-quality, secure, and efficient code while improving developer efficiency.


About the Author: Hardik Prabue works as a Machine Learning Researcher at CloudAEye.


About CloudAEye

CloudAEye offers two SaaS services, Test Failure Analysis in CI and Code Review, which can save developers up to 14 hours per week.

Speed and quality are crucial in software development. Manual test failure analysis is time-consuming and error-prone, delaying issue resolution. CloudAEye's automated test failure analysis within CI pipelines revolutionizes software testing and debugging with our AI-augmented approach to accelerate root cause analysis (RCA). The GenAI-based solution swiftly identifies the underlying software issues behind test failures by transforming intricate error logs and code analysis into succinct RCA summaries.

Code reviews are vital for quality assurance before deployment but often take over a week. CloudAEye tackles these challenges by ensuring AI code security and reliability, detecting vulnerabilities, and providing actionable fixes. The solution acts as an essential guardrail for your AI projects, enabling rapid and confident progress.

Enjoy complimentary access at www.CloudAEye.com.
