My Claude Code Agent Teams Experiment

Developing working software with my agent team…

OK, I have a technical programming background. My career swung over to the management side relatively early, and software engineering folks generally enjoyed working with me (apologies to the few exceptions who might raise their hands!!) because I understood coding firsthand.

Not to bore you, but over the course of my career I have used Basic, Cobol, C, C++, Smalltalk, Objective C, Java, JavaScript, and Python. I have looked at R and Rust. However, I no longer write code and have no interest in reviewing it either. Haven’t for 15+ years unless I got an itch and wanted to learn something new – like when I dabbled in understanding Pytorch and Tensorflow, which made me dive into Python.

Why am I telling you this? To set the stage so you understand what I might know or not. I am not an expert programmer but I have a really good understanding of programming languages.

With all the crazy excitement about AI coding agents hitting the market over the last 18 months, I have been following the developments with interest and curiosity.

From my viewpoint, the easy parts are getting easier, and the hard parts are getting more accessible, because you can get into the hard parts without much prior knowledge (see below about my experience with AWS and Terraform).

In other words, you can more easily get into deep water!

We have seen a fundamental change over the last 5 years:

  • 2020: “AI can’t understand context”
  • 2022: “AI can’t write production code”
  • 2024: “AI can’t handle complex systems”
  • 2026: “AI can’t replace senior developers”
  • 2026+: “AI can’t replace Architects”
  • 2027: “???”

I have a guess where this is going, but instead of speculating I decided to run a real-world experiment to demonstrate the point.

Follow along to see what I did, blow by blow, and the conclusions I came to:

Start time: Wednesday, February 11th, 2026, approximately 8:00 PM

1.        I installed Claude Code “out of the box”, no customization, no skills configuration, no special setup of any kind. I am subscribed to the Claude Max plan.

2.        I tried to find a moderately complex application to write, one complex enough to satisfy critics but simple enough that I wouldn’t spend an entire week on it. Not a toy, but not a full-blown system. I also wanted to create something “enterprise grade”, i.e. follow an SDLC and a tech stack / tooling that is common in enterprise environments. I settled on the idea of creating an Automated Teller Machine (ATM) simulator. This entailed an application with front-end, back-end, database, local Docker development, and final Vercel or AWS deployment. And a management console.

3.        I prompted Claude (web) to help me define a CLAUDE.md file that laid out the desired functionality, rules of the road, tech stack, roles, etc. Here is what I asked:

I want you to help me create a markdown file that I can use to provide guidance to Claude Agent Teams. I want to create an application, written in Python, that simulates a bank ATM (Automated Teller Machine).

The application is supposed to represent real world functionality of an ATM, such as being able to make deposits, withdrawals, money transfers between accounts, account inquiries, printing accounts statements (to PDF files), etc.

I want to use Claude Agent Teams to have agents work on this application development problem themselves. Roles that I can imagine agents need to fill are UX designer, software architect, software engineer, software engineer in test, technical writer, cloud specialist / SRE, etc. I am open to hear about additional agent suggestions.

I want to start small with the ATM application running on a local Docker image. Once it is fully functional, I want to host the application on a preferred hosting service, such as Vercel or AWS.

How do I get started? Help!

4.        Claude and I bantered back and forth. For example, it suggested I did not need a technical writer, as it would document things as we went along. It also suggested that Vercel was not a good hosting option: Vercel is oriented toward frontend apps, while AWS is better suited to backends that use PostgreSQL, background jobs, and persistent state (ECS Fargate or App Runner). Check.

5.        I tweaked up the unit test coverage percentages from the suggested 80% to 100% on business logic and security code, 95%+ overall, with any exclusions explicitly documented and justified. Claude had already defined the initial 42 end-to-end test cases. Check.

6.        It asked me what license I want (MIT), what Python version (3.12), and provided me with a zip file that contained the GitHub project scaffolding, including the initial CLAUDE.md file. Check.

7.        I took the zip file and unzipped its contents to the GitHub repository. Ready to start.

8.        I asked Claude (web) to show me how to set up Claude Code (terminal). It provided easy-to-follow instructions. Claude Code was ready to go. Check.

9.        I started the agent team by entering the following prompt on the terminal command line (see the CLAUDE.md file):

I need you to act as Team Lead for building a Python ATM simulator application.

Read CLAUDE.md thoroughly — it contains the full project specification, team structure, and development phases.

Create an agent team with these teammates:

1. Architect — system design, data models, API contracts

2. Backend Engineer — core Python/FastAPI implementation 

3. UX Designer — Textual terminal UI

4. SDET — testing strategy and test implementation

5. Security Engineer — auth, encryption, audit, threat model

6. DevOps Engineer — Docker, CI/CD, deployment config

Start with Phase 1, Sprint 1. Use delegate mode — coordinate only, do not implement.

Require plan approval from all agents before they begin implementation.

Assign clear file ownership per CLAUDE.md to avoid conflicts.

10.  The agent team started chugging along, prompting me every couple of minutes to review commands, artifacts, or make decisions. In between I watched Netflix, cooked, wondered where this was going, went to bed around midnight when Claude had finished Sprint 2 (if I remember correctly).

11.  Back at it the next day around 8:00 AM. I mostly followed Claude’s guidance, sometimes asked clarifying questions, and once asked it to make sure we would choose the most cost-effective AWS hosting options. Initially I performed all the git push commands up to GitHub myself, but other than that, the Claude agent team coordinated amongst themselves, committed changes, resolved conflicts, implemented tests (unit and E2E), found bugs, and fixed bugs. After a while I let Claude Code push to GitHub itself, just asking me for permission.

12.  Claude progressively worked through the phases / sprints as defined in the CLAUDE.md file. Sometimes, when a new sprint gate finalized the sprint deliverable, it would ask me if I wanted to rerun all the tests, which I confirmed. This frequently found additional bugs, which Claude then proceeded to fix.

13.  The heaviest lift for me personally was configuring AWS, not because it’s difficult, but because it involved a lot of tedious setup work, access tokens, etc. Stuff I really hate doing! This was towards the end of the process. Terraform and AWS configuration took the most time and iterations. Not sure if that was because I fat-fingered some things, though - so I am not blaming Claude for this.

14.  No HTTPS – I made the call to skip the HTTPS setup intentionally to save ~$16/month on an Application Load Balancer (ALB), which is the standard way to do HTTPS on ECS Fargate. AWS doesn’t let you attach SSL certificates directly to Fargate tasks. So that’s on me, not a Claude Code shortcoming.

15.  After the backend was implemented (v1.0), I directed Claude to plan the add-on of a modern web front end (v2.0). I provided three images from the web that showed different ATM machines; at this time, I let Claude run and do the plan, while I took off for the day.

16.  Next morning, the plan was done. I reviewed and provided feedback, and had an extended discussion about what kind of front-end technology to use (Jinja, React, HTMX, etc.), why, the drawbacks of one over the other, etc. This was educational for me, as I am not a front-end guy. We finally agreed on React + Framer Motion because I wanted a fancy user interface – after all, it’s an experiment, might as well go all in. Desktop and tablet support only, no mobile support. Because of this, we added a Frontend/UX Engineer agent to the team. In retrospect, I am not sure we lived up to the fancy aspirations.

17.  After final plan discussions, I suggested some safety measures, such as an explicit reminder to compact after each sprint so we wouldn’t run out of context mid-development. After that, off we went to work on the front-end GUI. Did I just refer to us with the collective “we”?

18.  … working through the plan… one sprint at a time… develop… build… test… iterating… Claude occasionally asking questions, asking for permission to proceed, me reviewing, sometimes hands-on testing… giving feedback… asking for fixes… chugging along

19.  How about DevSecOps? Oops, I didn’t think about that upfront, LOL… asked Claude to incorporate free security scanners into the CI job (added Bandit, npm audit, Trivy, Gitleaks, and GitHub Dependabot runs). Found and fixed several security vulnerabilities.

20.  There were some AWS security issues that were found, but they were based on architectural decisions I made to save money (no NAT gateway, public subnet design for cost). The proper fixes (VPC endpoints, private subnets + NAT, CMK encryption) are beyond the current scope. If this app were to go to production, we’d have to fix them:

  • AWS-0104 (CRITICAL): HTTPS egress to 0.0.0.0/0 is still flagged (needed for ECR/CloudWatch/Secrets Manager access)       
  • AWS-0031 (HIGH): ECR mutable tags                                                                                        
  • AWS-0164 (HIGH): Public subnet IP association
  • AWS-0180 (HIGH): RDS public access                                                                                       
  • AWS-0132 (HIGH): S3 not using customer managed key

21.  Boom!! Done! I personally did not write a single line of code. I only directed Claude Code. All code, scripts, CI configuration, etc. – everything in the repo was produced by Claude (find the link in the Addendum below).

22.  I started testing the application manually, doing my personal User Acceptance Test, on my local Docker image. Didn’t like some of the screen layouts, colors, spacing… more feedback… iterating… fixing manual bugs, building, committing, retesting, pushing… iterating.

23.  Turned the repo public on GitHub, which enabled Dependabot, which flagged 8 PRs for us to process. Fixed / pushed all those. Check.

24.  We originally implemented the admin console using Jinja, but decided to move it to React + Framer Motion. Claude created a plan, I reviewed and approved, … … working through the plan… one sprint at a time… develop… build… test… iterating… Claude occasionally asking questions, asking for permission to proceed, me reviewing, sometimes hands-on testing… giving feedback… asking for fixes… chugging along.
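For readers wondering what the CLAUDE.md file referenced throughout these steps might contain: the skeleton below is a plausible reconstruction of its structure, not the actual file (that lives in the repo).

```markdown
# ATM Simulator — Project Guidance (CLAUDE.md)

## Goal
Python ATM simulator: deposits, withdrawals, transfers, account
inquiries, PDF statements. Local Docker first, then AWS deployment.

## Team Roles
Team Lead (delegate mode, coordinates only), Architect, Backend
Engineer, UX Designer, SDET, Security Engineer, DevOps Engineer.

## Rules of the Road
- Plan approval required from all agents before implementation.
- Clear file ownership per agent to avoid merge conflicts.
- Compact context after each sprint.

## Quality Gates
- Coverage: 100% on business logic and security code, 95%+ overall;
  any exclusions must be documented and justified.
- All 42 E2E test cases pass before a sprint gate closes.

## Phases
1. Core backend (FastAPI, SQLAlchemy, PostgreSQL)
2. Web frontend (React + Framer Motion)
3. AWS deployment (Terraform, ECS Fargate)
```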

So, there it is. A 100% working sample of a real application, running in your local Docker environment or hosted on AWS. 100% coded by Claude Code. Still some things we could improve on, but functional at this point.

End time: Friday, February 13th, 2026, approximately 9:30 PM

Total work time: approximately 14 hours (I watched Netflix and did other things in-between). And I did a lot of reading while Claude was working.

On Saturday, I decided to give it another swing for two more hours. We fixed a couple more defects, made sure the local Docker images matched what was on AWS (when I looked, they didn’t), documented things, and tore down the Terraform dev infrastructure on AWS to avoid cost.

How much did this cost?

I am on the Claude Max $100/month plan and never hit the limit during my three days working on this. So I guess about $10 - $20 worth of subscription, about $1 for AWS, plus my time.

Things to Note / Lessons Learned

  • Looking at some of the bugs I found by testing things manually, one lesson is clear: don’t send your manual test folks home yet – or make sure your Product Owner understands that manual testing is on his / her plate.
  • Unless you are super clear what to do on a micro level, like “wire the front-end elements to back-end actions” or “make sure labels are visible”, many, many things can go wrong, or from Claude’s perspective, they hadn’t been sufficiently specified. What is implied for a human, might not be so clear for your team of agents. But that's not Claude's problem. You have to define what you want.
  • Working with a team of agents is weirdly similar to working with a team of remote engineers via Slack or email. At times I couldn’t tell the difference.
  • Although I have prior coding experience in Python, I had no hands-on experience with several of the other technologies, such as SQLAlchemy, Alembic, Terraform, React + Framer Motion. I know what they are and what they are for, but I never wrestled with the details myself. There are also some tech pieces I never heard of that were used, and I have no clue how they work. I learned about them and nodded my head, assuming Claude would guide me safely.
  • I had never implemented authentication / security functionality myself.
  • Claude Code is an exceptional coding / learning resource, offering broad and deep expertise that's hard to match – I never had to go outside of Claude Code to get any information I needed. And I learned more about AWS and Terraform than I ever wanted to know.
  • Sometimes we forget about this: Claude Code can also be used for non-coding tasks. As such, Claude Code Agent Teams can be used as a general agent coordinator – it doesn’t have to be coding related, it entirely depends on what you define in your CLAUDE.md file.
  • There is significant overlap with Claude Cowork, which is a full Mac application vs. the Claude Code command line interface. Claude Cowork (in research preview / beta) is currently not as capable as Claude Code but worth looking at. It depends on what you feel comfortable with (terminal command line vs. Mac application).

Question I have

  • Can you trust the code? You be the judge – find the GitHub repo link in the Addendum below.
  • Is the architecture sound? Based on what I can see, the architecture looks good to me, but I am not an enterprise architect responsible for ensuring architectural integrity across the entire enterprise portfolio. In an enterprise environment, I would get this architecture approved by the right folks. In an enterprise that uses Claude Code you will find architecture .md files that are already defined and ready to be used.
  • Is it secure? Well, I’m calling on my security expert friends to tell me. I only know that I had never personally implemented authentication / security functionality before, so what Claude produced looks good to me. Having said that, in a real-world production application, I would never release anything without having the security folks review and approve it. But we did use lots of open-source security scanners as part of our CI / GitHub process (see Addendum). Again, in an enterprise you would most likely find .md files that lay out minimally acceptable security standards.
  • Was it worth the $$$ spent? Compared to having a team of people working on this? From a cost perspective, there is no comparison. Not even close.
  • What about the lack of social interaction on a team? Do I miss working with people one-on-one? Yes. But nobody will care, considering the cost differential.
  • What was harder than working on a real team (the ones with humans)? You miss the social interaction, that’s a given. But you also miss the collective brain. With agents, you are on your own. Whatever you know, whatever you research, whatever you decide, your Agent Team will do. What is in your head is the boundary of what you can do and guide on. That’s good and bad. I have had many professional experiences where teammates were simply better at certain things than me - they were smarter, better educated, more experienced, had better intuition, or simply looked at problems from a different perspective - and that made the team better. With AI / Claude Code, you gain the technical knowledge of the Agent Team, but you lose your human teammates’ knowledge and wisdom, including their pain-in-the-ass attitudes, common lunches, happy hours, and all that comes with human beings. You trade human connection and happy hours for complete control. It’s easy to get full of yourself. You basically become the project dictator. Hmmm, did I just say that?
  • What was easier than working on a real team? You don’t hurt anybody’s feelings. You don’t have to worry about team politics (“I can’t give this to Sally because Bob has been working on this forever”). You don’t have resource constraints, no more “I need one additional engineer, but we don’t have the budget until next fiscal year”. You don’t need to spend 3 months onboarding. You don’t feel sad when the project is over and all your Agent Team members get shut down or redistributed to new teams. Team management has essentially become frictionless.
  • What about the creativity of teams? War rooms? Bull pens? Teams working through tough problem together, face to face? I miss working with experts… the human kind. That will never change. But...

Conclusion

I am impressed. Doing this project by myself would have taken me at least 5 weeks, and probably sleepless nights – because I am rusty at programming, and I used technology I am not familiar with. My guess is that an experienced programmer, familiar with all the technologies, could have cranked this out in a week or less.

I am not sure if I am overestimating myself or underestimating the experienced programmer. But the tech stack is pretty deep, so the cognitive overload problem is real. Considering the cost differential, and the fact that as you gain experience running agent teams you will be able to work on 3 to 5 projects at a time, even against the most skilled engineer, it’s not even close.

Check out the GitHub repo, I am curious about your feedback. How long do you think this would take?

Am I shocked? No, not based on the exponential improvements we have seen over the last 18 months. This was inevitable. What is interesting is to ponder what the next 18 months will offer up. Humans are bad at understanding exponential improvements / exponential growth, but if you look at the past couple of years as an indicator, it’s clear that we are on an incredible rocket ship, an exponential curve of improved functionality, capability, productivity that software engineering has never seen before.

Software development has been changed forever. The skill has moved from mechanically assembling code to managing a set of agents that assemble code for you.

Management, specification, decisioning, code review, deployment – the skill sets of entire teams have been collapsed into a set of agents that are controlled by one or very few humans.

But the adage still applies: Junk in, junk out.

AI will produce amazing results if you guide it correctly. Guide it poorly, or not at all, and it will run itself off the tracks. Again, considering improvement patterns of the last 18 months, this is likely to get much better over time.

The economics of software development have changed so fundamentally that we all have a hard time wrapping our head around it. In the old days, it took many years of study and work experience to gain the knowledge which in turn justified the $200,000+ salaries. The cost of software development projects was mainly driven by labor cost for qualified engineers, which took a long time to educate and train. This fact drove the offshoring boom of the early 2000s. Now an agent costing $1,000 can do the same. International labor arbitrage has now turned into a simple tradeoff between labor cost and agent cost.

The SaaS per seat pricing model assumed humans in the seats – will that model survive when you have fewer humans and hundreds of agents? Will SaaS per seat pricing die and be replaced by a pure consumption-based model, so SaaS vendors can continue their revenue stream while mostly agents use it? And if SaaS products are mostly used by agents, do we really need fancy SaaS user interfaces? Why have an interface that mimics human, paper-based, workflows?

Finally, how will young engineers gain the experience to guide teams of agents? Have universities even begun to address the new educational requirements for Computer Science / Computer Engineering?

I sincerely hope this write-up helps people understand where we are and where we are heading.

So many questions, so little time, so few answers.

Addendum

Repo

You can find the repository, which contains the CLAUDE.md file and all source code produced by Claude, below.

GitHub Repository: https://github.com/brianw1130/atm-simulator

Tech Stack Used

Backend

  • Language: Python 3.12
  • Web Framework: FastAPI
  • ORM: SQLAlchemy 2.0 + Alembic migrations
  • Database: PostgreSQL 16 (production), SQLite (testing)
  • Auth: PIN-based with bcrypt + application-level pepper
  • Sessions: In-memory with secrets.token_urlsafe
  • Task Queue: Celery + Redis
  • PDF Generation: ReportLab
  • S3 Client: boto3
  • Config: pydantic-settings
  • Logging: structlog (JSON)
  • Rate Limiting: slowapi

Frontend (ATM UI)

  • Framework: React 18 + TypeScript (strict)
  • Build Tool: Vite
  • Animations: Framer Motion 11
  • HTTP Client: Axios
  • State Management: React Context + useReducer (state machine)
  • Styling: Plain CSS with custom properties

Frontend (Admin Dashboard)

  • Framework: React 18 + TypeScript (strict)
  • Build Tool: Vite
  • HTTP Client: Axios (cookie auth)
  • Styling: Plain CSS

Infrastructure

  • IaC: Terraform (VPC, ECS, RDS, S3, Secrets Manager)
  • Compute: AWS ECS Fargate
  • Database: AWS RDS PostgreSQL
  • Storage: AWS S3 (statements + snapshots)
  • Secrets: AWS Secrets Manager
  • Container Registry: AWS ECR
  • Containerization: Docker (multi-stage build)
  • CI/CD: GitHub Actions → ECR → ECS
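The multi-stage Docker build mentioned above typically follows this shape; a plausible sketch, not the repo's actual Dockerfile (paths and the uvicorn entrypoint are assumptions).

```dockerfile
# Stage 1: install dependencies into an isolated prefix.
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: slim runtime image, no build tooling or pip cache.
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY src/ ./src/
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

The payoff is a smaller final image for ECR/Fargate: build-time tooling stays in the discarded builder stage.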

Testing

  • pytest + pytest-asyncio: Python unit/integration/E2E (715 tests)
  • Vitest + React Testing Library: Frontend unit/component (264 + 197 tests)
  • mypy (strict): Python type checking
  • Ruff: Python linting + formatting
  • ESLint + TypeScript strict: Frontend linting + type checking
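To give a feel for what one of those 715 pytest tests might look like, here is a hypothetical pytest-asyncio-style unit test for withdrawal logic. The AtmAccount class is invented for illustration, not the repo's actual model.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class AtmAccount:
    balance_cents: int

    async def withdraw(self, amount_cents: int) -> int:
        # Business rules: reject non-positive amounts and overdrafts.
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if amount_cents > self.balance_cents:
            raise ValueError("insufficient funds")
        self.balance_cents -= amount_cents
        return self.balance_cents

async def test_withdraw_happy_path():
    account = AtmAccount(balance_cents=10_000)
    assert await account.withdraw(2_500) == 7_500

async def test_withdraw_rejects_overdraft():
    account = AtmAccount(balance_cents=1_000)
    try:
        await account.withdraw(5_000)
        assert False, "expected ValueError"
    except ValueError:
        pass

# Under pytest-asyncio these coroutines run on the plugin's event loop;
# standalone, asyncio.run drives them the same way.
asyncio.run(test_withdraw_happy_path())
asyncio.run(test_withdraw_rejects_overdraft())
```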

Security

  • Bandit: Python SAST
  • pip-audit: Python dependency vulnerabilities
  • npm audit: Frontend dependency vulnerabilities
  • Trivy: Docker image + Terraform scanning
  • Gitleaks: Secret detection in git history
  • CodeQL: Deep SAST (Python + TypeScript)
  • Dependabot: Automated dependency update PRs
