Automating Pull Requests Using Generative AI
Introduction
First off, happy holidays and new year to all who celebrate! I had a reasonably restful two weeks off, full of visiting family and eating far too much gifted chocolate.
Recently (including over the holidays), I’ve had family and friends ask about the impact of GenAI on my role, and on roles within IT more broadly; one such question was whether programmers could be replaced by it outright. My opinion (which is far from unique) is that generative AI will be a tool used by programmers to accelerate certain coding and research activities, rather than a replacement for the programmers themselves. Interestingly, soon after answering that question, I saw multiple posts over the holidays on platforms such as Reddit where folks were asking that very same question, or worse: articles prophesying that AI will replace programming roles.
With so much focus on this question, I decided to explore it further and build a PoC that attempts to replace programming tasks in a simple GitHub project.
The workflow that I envisioned for this PoC is as follows: a user opens a GitHub issue describing a change; a webhook triggers a Cloud Function; the function clones the repository, uses generative AI to update the code and its unit tests on a new branch, pushes that branch, and opens a pull request linked to the original issue.
The Example Project
The main application within the repo is the age-old “hello world”:
From main.py:
def say_hello():
    return "hello world"

if __name__ == "__main__":
    print(say_hello())
I’ve also added a unit test for the sake of completeness, which is run automatically via GitHub Actions:
From tests/test_main.py
import unittest
from main import say_hello

class TestMain(unittest.TestCase):
    def test_say_hello(self):
        expected = "hello world"
        actual = say_hello()
        self.assertEqual(expected, actual)
Based on the above, if I were to request a change to the say_hello method in a GitHub issue, the application connected to the webhook would need to make changes to the function itself, as well as the unit test so that the GitHub Action check doesn’t fail.
Let’s evaluate a practical example of how the system works.
Example / Walkthrough
The Start: Creating an Issue in GitHub
Suppose I wanted to update the say_hello method to say “hello LinkedIn” in place of “hello world”; instead of cloning the repository myself, creating a new branch, making the change, updating the unit test, pushing the code, and creating the PR, all I need to do is create an issue about it:
Once I hit “Submit new issue”, the Cloud Function immediately picks it up and runs with it. The Cloud Function leverages an “Autocoder” class I’ve written, which has methods that leverage pygit2 for cloning and acting on the repository, and Vertex AI for the generative AI calls.
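The article doesn’t show the Cloud Function itself, so here is a minimal, hypothetical sketch of how the webhook entry point might parse an incoming issue event; all names here are illustrative assumptions, not the actual Autocoder interface.

```python
# Hypothetical sketch of the webhook's payload handling; the real
# Autocoder interface isn't shown in the article, so names are illustrative.

def extract_desired_change(payload: dict):
    """Return the requested change text for a newly opened issue, else None."""
    # Only act on newly opened issues; ignore other webhook events.
    if payload.get("action") != "opened" or "issue" not in payload:
        return None
    issue = payload["issue"]
    # Combine the issue title and body into a single change request.
    return f"{issue['title']}\n\n{issue.get('body') or ''}".strip()


# Wired into the Functions Framework, it might look like:
#
# import functions_framework
#
# @functions_framework.http
# def handle_github_webhook(request):
#     change = extract_desired_change(request.get_json(silent=True) or {})
#     if change is None:
#         return ("ignored", 200)
#     Autocoder(repo_url=..., token=...).apply_change(change)  # hypothetical
#     return ("ok", 200)
```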
There are a couple of cool things that happen in this codebase (if I do say so myself):
1. The commit message is generated by the LLM, taking any conventions from CONTRIBUTING.md into account. The prompt and surrounding code to pull this off look like this (from autocoder.py):
if not commit_message:
    commit_msg_prompt = f"Please provide a commit message outlining the change between the old and new code. Provide just the commit message -- no need to title the message as 'commit message' or anything. Old code:\n{existing_code}\n\nNew code:\n{replacement_code}"
    if contributing:
        commit_msg_prompt += f"\n\nTake any commit structure instructions/examples into account from the following:\n{contributing}"
    response = self._llm.predict(commit_msg_prompt)
    commit_message = response.text.strip()
As you can see, some prompt engineering was required here to get a consistent output (the model would often try to format its response), but overall, I was quite pleased with the result.
2. The webhook application automatically updates the unit tests so that when the PR is created automatically, the workflow will run the updated unit test. This is cool for a couple of reasons: The project still successfully builds (yay!), and the human reviewer of the PR has more confidence in the work put forward by the LLM.
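The article doesn’t show how the unit-test update is prompted, so here is a hedged sketch of how such a prompt might be assembled; the function name and wording are assumptions, not the actual autocoder.py code.

```python
def build_test_update_prompt(new_code: str, existing_tests: str) -> str:
    """Assemble a prompt asking the LLM to update unit tests for changed code.

    Illustrative only -- the article's actual test-update prompt isn't shown.
    """
    return (
        "Update the following unit tests so they pass against the new code. "
        "Return only the test file contents -- do NOT format with markdown.\n\n"
        f"New code:\n{new_code}\n\n"
        f"Existing tests:\n{existing_tests}"
    )
```

The response would then be stripped and written over the existing test file before committing, in the same way the commit-message response is handled above.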
The Middle: The Application Pushes a New Branch
Once the application has cloned the repository and leveraged generative AI to make a couple of changes in a new branch, it pushes that branch back up to the remote.
Again, the application uses generative AI to pick up on naming conventions. In my CONTRIBUTING.md, I did not specify a naming convention, so it was left to its own devices to determine a branch name befitting the changes it performed. From autocoder.py:
if not branch_name:
    prompt = f"Create an appropriate git feature branch name based on the requested code change -- do NOT format with markdown:\n{desired_change}"
    if contributing:
        prompt += f"\n\nFollow any branch naming convention within this guide, if any are stipulated:\n{contributing}"
    resp = self._llm.predict(prompt)
    branch_name = resp.text.strip()
self._branch = self._local_repo.branches.local.create(branch_name, commit)
Again, the LLM tried to style its output here, but some firm phrasing in the prompt thankfully ruled that out.
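The push itself isn’t shown in the article; a sketch of that step with pygit2 might look like the following, assuming an HTTPS remote and a GitHub token with push access (the helper name and token handling are illustrative).

```python
# Hypothetical sketch of the branch-push step; the article doesn't show
# this code, so the helper name and credentials handling are assumptions.

def refspec_for(branch_name: str) -> str:
    """Build the refspec pygit2 expects when pushing a local branch."""
    return f"refs/heads/{branch_name}"


# With pygit2, the push itself might look like:
#
# import pygit2
# callbacks = pygit2.RemoteCallbacks(
#     # GitHub accepts a token over HTTPS with this username convention.
#     credentials=pygit2.UserPass("x-access-token", github_token)
# )
# repo.remotes["origin"].push([refspec_for(branch_name)], callbacks=callbacks)
```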
If we click into the branch, we can see that the webhook application pushed up a single commit that not only made the desired adjustment, but it also updated the unit test:
The End: The Application Creates a Pull Request
Once the branch has been created and pushed, my webhook application makes a separate call to GitHub’s API to open a PR and relate it to the original issue.
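The GitHub REST endpoint for this is POST /repos/{owner}/{repo}/pulls; a minimal sketch of such a call follows, using the standard library. The payload builder and token handling are illustrative assumptions, though the "Closes #N" keyword in the body is what links (and auto-closes) the original issue.

```python
# Hypothetical sketch of the PR-creation call; the REST endpoint is real,
# but the surrounding names and token handling are illustrative.
import json
import urllib.request


def build_pr_payload(branch: str, issue_number: int, base: str = "main") -> dict:
    """Build the request body for GitHub's create-pull-request endpoint."""
    return {
        "title": f"Automated change for issue #{issue_number}",
        "head": branch,
        "base": base,
        # "Closes #N" relates the PR to the issue and closes it on merge.
        "body": f"Closes #{issue_number}",
    }


def create_pull_request(owner: str, repo: str, token: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```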
The act of creating this PR does a couple of things:
Conclusion
Overall, this was a fun PoC to build out and an interesting problem (both technical and philosophical) to explore.
Has my success with building this PoC swayed my opinion on generative AI’s risk (or lack thereof) to developers? In a word: no.
There’s a good reason I chose a simple Hello World application to test this with – it’s self-contained within a single function and a single module. When contributing to a project as a software developer, you need to be aware of all classes, methods, SDKs, and configurations available to you to arrive at an optimal solution. It’s often the case that you will not get all of that right when you first start contributing to a project, and there will be some back and forth on the PR before it’s mergeable.
For an LLM to replicate this type of behaviour, the entire project would need to be available in the prompt’s context (either stuffed into the prompt wholly or chunked up using other means). The LLM would then need to determine which mechanisms are already in place within the project and use them where possible (even if they exist in a currently sub-optimal way). Along the same lines, if your organization develops its application in terms of microservices, an LLM would need to be aware of how it slots its work into that architecture and how it can leverage other microservices. To further complicate matters, many projects include dependencies that are outside of their project (for this example, I used pygit2, Vertex AI, GitHub, and the Functions Framework). An LLM would need to evaluate my requirements.txt and understand which objects are available for the version I’ve pinned and which of those functions might be marked for deprecation, are internal, etc.
In the same vein, there are external services to consider. Perhaps your application needs to interact with a new data source (either an external REST service or maybe an internal Google Drive), or maybe your organization stores its API credentials in Hashicorp Vault. The LLM would need to be aware of how to interact with those services and the nuances that might come into play (rate limits, API key rotations, etc). An LLM can parrot back many of the factors that need to be accounted for, but to implement them in an integrated way that fits into the bigger picture of the project and its ecosystem is another matter entirely.
While LLMs keep improving with each version and foundational model release, the bigger-picture considerations above are complicated even for seasoned software developers, and their many nuances currently put them out of reach for an LLM.
With the "bad" out of the way, it's worth re-stating my original position:
Generative AI will be a tool used by programmers to accelerate certain coding and research activities, rather than replace the programmers themselves
Right now, generative AI is an excellent tool to have in your toolkit as a software developer when used safely (i.e. using an enterprise-grade application of GenAI found on Vertex AI or its competitors to prevent exposure of your proprietary code, not blindly copying/pasting code with critical security implications, etc.). Through advanced prompt engineering, you can build and refine regions of your code, summarize regions of code, and synthesize test data or massage data into other formats, giving you back some time to tackle the system design and integration elements that LLMs need help with.