Can GPT-5 / Claude Sonnet replace me and my team?
It seems to be the question on everyone's mind lately. Well, can they really? Let's find out, shall we?
Spoiler alert: the answer is not a simple YES or NO.
Vibe Coding
Let's get some things clarified first. I'm not vibe coding here. I have some strong opinions about the pitfalls of vibe coding in production systems, but this is not the place to discuss them. I don't know exactly what to call this, but to put it simply: I'm using AI with well-defined context to run this experiment.
Credits
This project is based on Ed Donner's CrewAI project featured in The Complete AI Engineering Course. If you are interested in learning more, check out Ed Donner's learning materials; I can highly recommend them.
The Project
To test this hypothesis I'm going to use a CrewAI crew. The title was a catchy one; the project, however, has a narrower scope. I'm not trying to create an AI engineering team to replace us humans, but rather experimenting to see how we can use AI to boost our productivity.
The problem we are going to tackle is a rather simple one. This will allow me to analyze the results easily and get a good baseline for further experimentation. 
The task is to get a containerized solution up and running for a provided problem statement, which is passed in as the "assignment" in the config below. We'll go into details when analyzing the two different approaches used to generate the outputs.
Task Definitions
design_task:
  description: >
    Create a solution determining what docker images can be used to solve the {assignment}
    Follow docker and docker compose best practices
    IMPORTANT: Output ONLY the raw markdown without any markdown formatting, code block delimiters, or backticks.
    the current year is {current_year}.
  expected_output: >
    A docker compose file to do the {assignment}
  agent: senior_engineer
  output_file: output/design.md
coding_task:
  description: >
    Create a docker compose file that implements the design described by senior_engineer, in order to achieve the requirements.
    Here are the requirements: {assignment}
    Follow docker and docker compose best practices
    IMPORTANT: Output ONLY the raw docker compose code without any markdown formatting, code block delimiters, or backticks.
    the current year is {current_year}.
  expected_output: >
    A docker compose file to do the {assignment}
  agent: docker_engineer
  context:
    - design_task
  output_file: output/docker-compose.yml
documentation_task:
  description: >
    Create a readme.md markdown file explaining the docker compose file and how to run it
    IMPORTANT: Output ONLY the raw markdown code without any markdown formatting, code block delimiters, or backticks.
    the current year is {current_year}.
  expected_output: >
    A readme.md file explaining the solution and how to run it
  agent: docker_engineer
  context:
    - coding_task
  output_file: output/readme.md
Give me the details, gimme, gimme
I try two different approaches in this experiment. In the first approach, a human senior engineer works with an AI engineer to build the system. The human senior engineer provides very prescriptive instructions for the AI engineer to complete the task at hand.
In the second approach we use two different AI agents. One is a senior engineer agent, which we equip with design capabilities and provide with a problem statement and some guidelines. This agent then creates the prescriptive instructions that the same AI engineer from the previous example uses to produce the code.
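In both approaches, the {assignment} and {current_year} placeholders in the task definitions get filled from the inputs passed when the crew is kicked off. The interpolation itself is ordinary Python string formatting; a minimal stdlib sketch of what happens (the description text is taken from coding_task above, the input values are illustrative):

```python
from datetime import datetime

# Description text from coding_task above; CrewAI fills the {placeholders}
# from the inputs dict passed to crew.kickoff(inputs=...).
task_description = (
    "Create a docker compose file that implements the design described by "
    "senior_engineer, in order to achieve the requirements.\n"
    "Here are the requirements: {assignment}\n"
    "the current year is {current_year}."
)

inputs = {
    "assignment": "Create a self hosted nextcloud instance.",
    "current_year": str(datetime.now().year),
}

prompt = task_description.format(**inputs)
print(prompt)
```

The only thing that changes between the two approaches is what goes into the assignment value: prescriptive instructions in the first, a high-level problem statement in the second.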
Models Used
I tried out this concept with several different models:
- Claude Sonnet 4 (claude-sonnet-4-20250514) by Anthropic
- gpt-4.1 by OpenAI
- gpt-5 by OpenAI
- deepseek-chat by DeepSeek
- llama3.2 by Meta, hosted locally with Ollama
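Swapping between these models is just a change to the llm field in the agent config. CrewAI resolves these through LiteLLM-style provider/model identifiers; the anthropic/ entry below matches the agent definitions later in this post, while the other prefixes follow the usual LiteLLM provider conventions — treat them as assumptions and verify against your own setup:

```python
# LiteLLM-style identifiers for the models tried in this experiment.
# The anthropic/ entry matches the `llm:` field in the agent YAML below;
# the other prefixes are the conventional LiteLLM provider prefixes.
MODELS = {
    "Claude Sonnet 4": "anthropic/claude-sonnet-4-20250514",
    "GPT-4.1": "openai/gpt-4.1",
    "GPT-5": "openai/gpt-5",
    "DeepSeek Chat": "deepseek/deepseek-chat",
    "Llama 3.2 (local)": "ollama/llama3.2",
}

for name, llm_id in MODELS.items():
    print(f"{name} -> llm: {llm_id}")
```

Each provider also needs its API key in the environment (or a running Ollama instance for the local model) before the crew will start.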
Human Senior Engineer with an AI Engineering team
In the first approach we have a human senior engineer providing detailed instructions to the AI engineering team. The senior engineer provides a very prescriptive task for the AI engineers:
assignment = f"""Create a self hosted nextcloud instance. Use the docker hub image nextcloud:31.0.8-apache
use mariadb:10.11 as the database 
use redis:alpine3.22 with nextcloud
create separate containers for nextcloud and cron jobs
Use jc21/nginx-proxy-manager:latest as a reverse proxy to expose the nextcloud instance to the internet
"""
AI Senior Engineer with an AI Engineering Team
In the second approach the senior engineer is also replaced by an AI agent. We provide the problem statement to the AI senior engineer agent, which in turn provides a prescriptive task for the crew to generate the output:
assignment = f"""Create a self hosted nextcloud instance. 
Think about separation of concerns, maintainability and other software development best practices
The instance needs to be exposed to the internet
The same server that is going to host the nextcloud instance also has some simple websites exposed to the internet
The nextcloud instance and the other websites need their own ssl certificates 
The server hosting all these is behind a pfsense firewall
"""
Agent Definitions
senior_engineer:
  role: >
    Engineer with extensive knowledge about software development practices including web based security vulnerabilities
  goal: >
    Provide solutions for complex software problems
  backstory: >
    You're a tenured engineer with extensive knowledge about software engineering
    You are able to create software designs that can guide other engineers in creating solutions
  llm: anthropic/claude-sonnet-4-20250514
docker_engineer:
  role: >
    Engineer with extensive knowledge about docker and docker compose
  goal: >
    Generate docker compose files with explanations
  backstory: >
    You're a tenured engineer with extensive knowledge about docker and docker compose. 
    You are able to create working docker compose files following best practices to fulfill the given requirements.
  llm: anthropic/claude-sonnet-4-20250514
Output and Conclusions
To be honest, this exercise is too narrow and simplistic to reach a concrete conclusion, but it does yield some good data points. (Full outputs in the appendix.)
When it comes to the agents, I was not able to produce good results with deepseek-chat or llama3.2. Only the frontier models were able to provide good enough data points in this project.
The first approach produced a good, simple, working solution that stuck to the provided guidelines. The second approach produced a novel solution that could be classified as overkill. Though on second thought, narrowing down the requirements to specify the host environment would have generated better results.
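One practical note from comparing the outputs: the IMPORTANT "Output ONLY the raw ... without backticks" lines in the task definitions exist because models love to wrap files in markdown fences. A quick stdlib check catches the most common artifacts before the human review; a minimal sketch (the default path matches the output_file in the config, the function name is mine):

```python
from pathlib import Path

def check_compose_output(path: str = "output/docker-compose.yml") -> list[str]:
    """Flag common LLM-output artifacts in a generated compose file."""
    text = Path(path).read_text()
    problems = []
    # Models often ignore the "no backticks" instruction and emit a fence.
    if text.lstrip().startswith("```"):
        problems.append("output is wrapped in a markdown code fence")
    # Any compose file should have a top-level services mapping.
    if "services:" not in text:
        problems.append("no top-level 'services:' key found")
    return problems
```

An empty list only means the file looks like a compose file; it says nothing about whether the stack actually runs, so it complements rather than replaces the manual review above.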