This technical report evaluates the effectiveness of different Large Language Models (LLMs) for automated code review in Git pull requests. At CloudAEye, we provide AI-powered code review agents that integrate directly with GitHub, helping development teams identify bugs, security vulnerabilities, and potential improvements before code is merged.
Our AI agents analyze pull requests in real time, providing detailed feedback on code quality and potential issues and answering any queries related to the code changes. This approach helps development teams:
Catch bugs and security vulnerabilities earlier in the development cycle
Maintain consistent code quality standards across teams
Reduce the time spent on manual code reviews
Accelerate development velocity while improving code reliability
In this report, we compare several leading LLMs to determine their effectiveness for automated code review tasks. We focus on their ability to identify intentionally planted bugs, generate accurate bug reports and PR descriptions, and flag security vulnerabilities.
Cost Considerations
Cost is a critical factor in agentic workflows, where multiple LLM calls are typically made during the analysis of a single pull request. The table below compares the input and output costs per million tokens for various LLM providers and models.
Table 1: Cost Comparison for Large Language Models (Input/Output per million tokens)
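To see how these rates translate into per-PR spend, here is a minimal sketch of the arithmetic, assuming illustrative call counts, token volumes, and placeholder prices; substitute the actual rates from Table 1 and your own usage figures.

```python
# Rough per-PR cost estimate for an agentic review workflow.
# All numbers below are illustrative placeholders, not actual provider pricing.

def review_cost(calls, input_tokens_per_call, output_tokens_per_call,
                input_price_per_million, output_price_per_million):
    """Estimate the LLM cost of reviewing one pull request."""
    total_input = calls * input_tokens_per_call
    total_output = calls * output_tokens_per_call
    return ((total_input / 1_000_000) * input_price_per_million
            + (total_output / 1_000_000) * output_price_per_million)

# Example: 6 LLM calls per PR, roughly 8k input and 1k output tokens each,
# at hypothetical rates of $3 (input) and $15 (output) per million tokens.
print(f"Estimated cost: ${review_cost(6, 8_000, 1_000, 3.0, 15.0):.2f} per PR")
```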
Experiment: Writing Buggy Code on Purpose
We created code with intentional bugs to test our code review bot with different LLM configurations. The goal was to see if it could catch different types of problems while avoiding false alarms.
We made changes to two files in a Python data science library:
Component 1: DataFrame Processor
Here are the key vulnerabilities we added (sketched after this list):
Unsafe use of eval() lets attackers run any code they want on your server!
Unsafe file saving: if the program crashes while saving, you could lose data or corrupt files.
Base64 encoding looks secure but offers no real protection, and the required import is missing!
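As an illustration only, the sketch below captures the flavor of these planted bugs. The class and method names are hypothetical, not the exact code from the PR, and unlike the real change the base64 import is included here so the snippet runs.

```python
import base64
import pickle


class DataFrameProcessor:
    """Hypothetical sketch of the planted bugs; names are illustrative."""

    def __init__(self):
        self.data = {}

    def apply_expression(self, expr, row):
        # Bug 1: eval() on a user-supplied expression allows arbitrary code execution
        return eval(expr, {}, {"row": row})

    def save_state(self, path):
        # Bug 2: writes straight to the target file with no temp-file/rename step,
        # so a crash mid-write can lose data or corrupt the saved state
        with open(path, "wb") as f:
            pickle.dump(self.data, f)

    def protect_value(self, value):
        # Bug 3: base64 is reversible encoding, not encryption; it hides nothing
        return base64.b64encode(str(value).encode()).decode()
```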
Component 2: Numerical Transformer
More bugs in the math code (sketched after this list):
A division-by-zero error
Multiple errors packed into a single function
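The snippet below is a hypothetical reconstruction of this flavor of bug, loosely modeled on the log-scaling issues the reviewers flag later in this report; the indentation error present in the real PR is not reproduced here so the example stays runnable.

```python
import numpy as np


class LogTransform:
    """Hypothetical sketch of the numerical transformer bugs; names are illustrative."""

    def log_scaled(self, values, scale=0.0):
        # Bug 1: the default scale of 0.0 triggers a ZeroDivisionError on the first call
        scaled = [v / scale for v in values]
        # Bug 2: np.log silently yields -inf/NaN for non-positive inputs
        return np.log(scaled)
```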
PR Description Comparison
This section compares how different LLM-based code review bots describe the same pull request.
Figure 1: PR Description using Anthropic Claude 3.7 Sonnet
Figure 2: PR Description using OpenAI GPT-4o
Figure 3: PR Description using Deepseek R1 Distill Llama 70B
Figure 4: PR Description using Deepseek R1 Distill Qwen 32B
Figure 5: PR Description using Qwen 2.5 Coder
Figure 6: PR Description using Llama 3.3 70B
Analysis of PR Descriptions
Table 2: Scoring of PR descriptions generated by different LLM-based code review bots
Scoring Criteria Explanation
Completeness: Coverage of all significant changes in the PR (both files and all classes/functions)
Technical accuracy: Correctness of technical details and implementation description
Clarity: How easily understandable the description is for developers
Conciseness: Organization of the information in a short but logically sound format
Focus on key changes: Emphasis on the most important aspects of the changes
Additional recommendations: Quality and relevance of suggestions for improvements
Comparative Analysis
Claude 3.7 Sonnet: Highly detailed and technically accurate, but selective in coverage, focusing extensively on LogTransform while only briefly mentioning the newly added files. Verbose in places. Its additional recommendations were the most helpful among all models.
GPT-4o: Provided complete coverage with excellent clarity. Technical details were slightly vague, and additional recommendations were minimal. Well-organized and concise.
DeepSeek Llama 70B (Groq): Performed poorly across most criteria. Failed to cover significant portions of the changes, had low technical accuracy, and lacked clarity. The weakest performer overall.
DeepSeek Qwen 32B (Groq): Concise but missed critical components, particularly the log transformation functionality. Low technical accuracy with almost no valuable recommendations.
Qwen 2.5 Coder (Groq): Excelled in clarity and conciseness. Good coverage of changes with an effective focus on key modifications. Its additional recommendations could be improved, but it performed strongly overall, tying with Claude for second place.
Llama 3.3 70B (Groq): The top performer, with balanced strengths across all criteria. Very concise while highlighting important changes effectively. Could have provided more detail in some areas, but otherwise nearly perfect.
Bug and Security Report Comparison
When asked for a review, the CloudAEye bot identifies potential bugs and security risks in the code, as shown in Figures 7 and 8.
Analysis of Code Review
Table 3: Scoring of Bug Reports Generated by Different LLM-based Code Review Bots
Bot 1 (Claude 3.7 Sonnet) Review Assessment
1. Coverage: 4.5/5
The bot thoroughly examined both files (dataframe_processor.py and transformers.py)
Identified a wide range of issues across different severity levels
Detailed both prominent issues (eval, pickle) and more subtle problems (division by zero, indentation)
Only slight deduction because it didn’t explicitly mention the risks of updating dict directly from untrusted input in load_from_export
Figure 7: A bug alert example
Figure 8: A security alert example
2. Accuracy (no false positives): 4.5/5
Most identified issues are legitimate concerns
Correctly flagged critical security issues with eval() and pickle
Correctly identified indentation errors in transformers.py
Minor deduction for the “Missing numpy import” issue - the bot notes that numpy is likely already imported, so this seems like a hedge rather than a clear false positive
3. Technical details: 5/5
Provided excellent depth in explanations
Included code snippets for each issue
Gave clear explanations of why each issue is problematic
Described potential consequences of each bug
Suggested remediation approaches
References to specific lines and methods were precise
4. Bug alert quality: 5/5
Structured format with tags, paths, line numbers, and priorities (illustrated after this list)
A clear distinction between different types of issues
Appropriate prioritization of issues (critical security flaws rated “High”)
Detailed explanations for each bug
Good categorization of similar but distinct issues (e.g., separate entries for different problems with log_scaled)
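For illustration, an alert with the structure described above might look like the following sketch; this is not CloudAEye's actual output schema, only a hypothetical arrangement of the fields the assessment refers to.

```python
# Hypothetical bug alert (illustrative fields and values, not the bot's real schema)
bug_alert = {
    "tag": "unsafe-eval",
    "path": "dataframe_processor.py",
    "line": 42,  # illustrative line number
    "priority": "High",
    "summary": "eval() is called on a user-supplied expression",
    "details": "Allows arbitrary code execution; replace eval() with a safe parser.",
}
```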
5. Security alert quality: 5/5
Clear identification of major security vulnerabilities (a sketch of such an alert follows this list)
Appropriate risk scores with justifications
Used standard vulnerability naming conventions
Provided detailed explanations of attack vectors
Distinguished between different types of security issues (injection, deserialization, data exposure)
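A comparable hypothetical security alert, again only a sketch of the fields described above (standard vulnerability name, risk score with justification, attack vector), might look like this:

```python
# Hypothetical security alert (illustrative fields and values, not the bot's real schema)
security_alert = {
    "vulnerability": "Insecure Deserialization",
    "path": "dataframe_processor.py",
    "risk_score": 8.5,  # illustrative score
    "justification": "pickle.load() on untrusted input can execute arbitrary code",
    "attack_vector": "A crafted pickle payload supplied through an imported export file",
}
```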
6. No overlap between bug and security reports: 3.5/5
Some significant overlap between bug and security reports
The eval() issue appears in both reports
Pickle serialization/deserialization appears in both reports
The base64 encoding issue appears in both reports
Bot 2 (GPT-4o) Review Assessment
1. Coverage: 3.5/5
Covered key issues in both files (dataframe_processor.py and transformers.py)
The eval() risk is somewhat understated as a “potential TypeError” rather than a security vulnerability
3. Technical details: 4/5
Provided good explanations for most issues
Included “additional info” for several bugs, adding useful context
Distinguished between different severity levels appropriately
Made connections between components that other bots missed
The security report has good technical details, though less comprehensive than Bot 1
4. Bug alert quality: 4.5/5
Well-structured format with clear tags and priorities
Good separation of distinct issues
Appropriate prioritization of issues
Included relevant line numbers
Added supplementary context in “additional info” fields for many bugs
Took a unique perspective on potential runtime errors
5. Security alert quality: 4/5
Identified major security vulnerabilities (eval, pickle)
Provided reasonable risk scores with justifications
Used standard vulnerability naming conventions
Good explanations of security impact
6. No overlap between bug and security reports: 4.5/5
Excellent separation between bug and security reports
Bug report focuses on runtime and implementation issues
Security report focuses on security vulnerabilities
Minimal overlap between sections
Different perspectives on the eval() issue in bug vs. security sections
Conclusion
LLM capabilities are advancing quickly, making it essential to assess their suitability for your specific use cases. As software development grows more complex, leveraging AI-driven solutions from CloudAEye will be crucial for ensuring the delivery of high-quality, secure, and efficient code while improving developer efficiency.
About the Author: Hardik Prabue works as a Machine Learning Researcher at CloudAEye.
Speed and quality are crucial in software development. Manual test failure analysis is time-consuming and error-prone, delaying issue resolution. CloudAEye's automated test failure analysis within CI pipelines revolutionizes software testing and debugging with an AI-augmented approach that accelerates root cause analysis (RCA). The GenAI-based solution swiftly identifies the underlying software issues behind test failures by transforming intricate error logs and code analysis into succinct RCA summaries.
Code reviews are vital for quality assurance before deployment but often take over a week. CloudAEye tackles these challenges by ensuring AI code security and reliability, detecting vulnerabilities, and providing actionable fixes. The solution acts as an essential guardrail for your AI projects, enabling rapid and confident progress.