AI Writes Code Fast. But Debugging Eats All The Gains.

Cursor published something honest in December: "Some bugs consistently stump" their agents, and you need human verification because "the agent can't make that call on its own."

CodeRabbit analyzed thousands of pull requests last quarter. AI-generated code has 1.7x more issues than human code. 75% more logic errors. 3x more readability problems. Up to 2.74x more security vulnerabilities.

Stack Overflow asked 49,000 developers what frustrates them most. 66% said the same thing: "AI solutions that are almost right, but not quite." Trust in AI accuracy dropped from 40% to 29% in a year.

Developers think AI makes them 20% faster. METR measured what actually happens: 19% slower. All the time saved writing code gets spent debugging it.

I've been saying AI can't debug its own code for months. The data's finally catching up.

Why AI Can't Debug What It Writes

Cursor explained it in their Debug Mode launch: when you ask AI to debug, it produces "hundreds of lines of speculative code" instead of finding the actual problem. Training data is mostly working code. Bugs—the patterns AI needs to recognize—barely exist in what it learned from.

Amjad Masad, Replit's CEO, shared what a public company CEO told him: "Whatever time saved in generating the code is lost back in debugging, reverting bugs, and security audits." 25-50% of their code is AI-generated. Engineering velocity? "Negligible impact."

GitHub's documentation is blunt: "Without context, you get generic answers that don't understand your codebase." Google's Gemini has a permanent warning: "Validate all output."

Everyone shipping AI coding tools reached the same conclusion in Q4: AI debugging isn't autonomous. It's collaboration.

What Actually Makes Code Debuggable By AI

I've built with these tools every day for months. Some patterns let AI debug. Others send it into loops.

Explicit error messages matter more than you think:

// This breaks and AI has no idea why
async function fetchUser(id: string) {
  const data = await fetch(`/api/users/${id}`).then(r => r.json());
  return data.user;
}

// This breaks and AI knows exactly what failed
async function fetchUser(id: string) {
  const response = await fetch(`/api/users/${id}`);
  
  if (!response.ok) {
    throw new Error(
      `User fetch failed: ${response.status} ${response.statusText} for user ${id}`
    );
  }
  
  const data = await response.json();
  
  if (!data.user) {
    throw new Error(
      `API returned success but missing user object. Got keys: ${Object.keys(data)}`
    );
  }
  
  return data.user;
}        

Second version breaks? AI reads the error and knows if it's a 404, a missing field, or wrong data shape. First version? It's guessing.
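
Here's a rough sketch of the caller side (the loadProfile wrapper and the minimal logger shape are made up for illustration). Because the throw site encodes the status and the user id, the log line alone tells you, or the AI, which failure mode you hit:

// Hypothetical caller: the descriptive error lands in the log verbatim,
// so a 404 vs. a missing field is distinguishable without a repro.
async function loadProfile(id: string, logger: { error: (msg: string) => void }) {
  try {
    return await fetchUser(id);
  } catch (err) {
    // e.g. "User fetch failed: 404 Not Found for user 42"
    logger.error(err instanceof Error ? err.message : String(err));
    throw err; // rethrow so callers still see the failure
  }
}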

Types stop AI from hallucinating APIs:

// AI will invent properties that don't exist
function processPayment(order, user, payment) {
  const amount = order.total * payment.multiplier; // made up .multiplier
  return charge(user.paymentMethod, amount);       // made up .paymentMethod
}

// AI knows exactly what's available
interface Order {
  id: string;
  items: Array<{ price: number; quantity: number }>;
  total: number;
}

interface User {
  id: string;
  email: string;
  stripeCustomerId: string;
}

interface PaymentIntent {
  amount: number;
  currency: 'usd' | 'eur';
  customerId: string;
}

function processPayment(
  order: Order, 
  user: User
): PaymentIntent {
  return {
    amount: order.total,
    currency: 'usd',
    customerId: user.stripeCustomerId
  };
}        

Augment Code found hallucinated APIs are the #1 failure mode—45% of AI code references functions or fields that don't exist. TypeScript kills this entirely.
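
A quick illustration of what that looks like, using the Order interface above and a made-up applySurcharge helper: a hallucinated field like .multiplier no longer survives compilation:

function applySurcharge(order: Order): number {
  // A hallucinated property is now a compile error, not a runtime surprise:
  //   return order.total * order.multiplier;
  //   error TS2339: Property 'multiplier' does not exist on type 'Order'.
  return order.total * 1.05; // only fields the interface actually declares
}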

Tests tell AI what it can't break:

Kent Beck said in December that TDD is "a superpower when working with AI agents" because tests catch regressions. He also warned to watch for AI disabling tests to make them pass.

from decimal import Decimal

# AI breaks this trying to fix anything
def calculate_shipping(weight_kg, distance_km, is_express):
    base = weight_kg * 0.5 + distance_km * 0.1
    return base * 2 if is_express else base

# AI knows what must stay true
def calculate_shipping(
    weight_kg: float,
    distance_km: float,
    is_express: bool = False
) -> Decimal:
    """
    Calculate shipping cost from weight and distance.

    Test cases define the contract:
    >>> calculate_shipping(10, 100, False)   # base rate
    Decimal('15.00')

    >>> calculate_shipping(10, 100, True)    # express doubles it
    Decimal('30.00')

    >>> calculate_shipping(50, 500, False)   # scales linearly
    Decimal('75.00')
    """
    base_cost = (
        Decimal(str(weight_kg)) * Decimal('0.5')
        + Decimal(str(distance_km)) * Decimal('0.1')
    ).quantize(Decimal('0.01'))
    return base_cost * 2 if is_express else base_cost

Doctests lock down the pricing model. AI can't "fix" a bug by quietly changing the base rate or the express multiplier.

Architecture determines whether AI can isolate failures:

Research from Q4 shows microservices help AI because each service is small and explicit. Monoliths hurt because everything's coupled beyond AI's context window.

// AI drowns in 500 lines of mixed concerns
class OrderSystem {
  async processOrder(cart, user, payment) {
    // payment + inventory + email + analytics all mixed
    // breaks? AI has no idea which part failed
  }
}

// AI sees exactly where it broke
interface InventoryService {
  reserveItems(items: OrderItem[]): Promise<ReservationId>;
  confirmReservation(id: ReservationId): Promise<void>;
  releaseReservation(id: ReservationId): Promise<void>;
}

interface PaymentService {
  createIntent(amount: number, customerId: string): Promise<PaymentIntent>;
  capturePayment(intentId: string): Promise<PaymentConfirmation>;
}

interface NotificationService {
  sendConfirmation(email: string, orderId: string): Promise<void>;
}

class OrderOrchestrator {
  constructor(
    private inventory: InventoryService,
    private payments: PaymentService,
    private notifications: NotificationService
  ) {}

  async processOrder(order: Order, user: User): Promise<OrderResult> {
    const reservation = await this.inventory.reserveItems(order.items);
    try {
      const intent = await this.payments.createIntent(order.total, user.id);
      await this.payments.capturePayment(intent.id);
      await this.inventory.confirmReservation(reservation);
      await this.notifications.sendConfirmation(user.email, order.id);
      return { success: true, orderId: order.id };
    } catch (error) {
      await this.inventory.releaseReservation(reservation);
      throw new OrderProcessingError(`Order ${order.id} failed`, { cause: error });
    }
  }
}

Payment fails? AI sees it failed at this.payments.capturePayment(). Not somewhere in a god class.

How To Actually Prompt AI For Debugging

Simon Willison wrote in December: "Your job is to deliver code you have proven to work." Not code that looks right.

Here's what works when you need AI to debug:

When debugging:
1. Read the entire error message before suggesting anything
2. Figure out which assumption broke (network? wrong data? type mismatch? auth?)
3. Add logging at the exact failure point to confirm it
4. Fix only what's broken, not everything around it
5. Run tests before and after

Don't:
- Rewrite working code to fix a small bug
- Delete error handling to make tests pass
- Suggest fixes without reading the error
- Assume data shapes without checking types first        

Kent Beck calls this "augmented coding"—you care about quality, tests, coverage. The opposite is "vibe coding"—paste errors back to AI and hope it guesses right.

Steve Yegge spends $80K/year on AI coding. His advice: "Give them the tiniest task you possibly can. Track what they're doing. Own every line of code they write."

The Better Pattern: Design For Debugging From The Start

Here's what I've learned: don't ask AI to debug broken code. Ask it to write debuggable code from the beginning.

Four rules that make code self-debugging:

1. Dependencies you can swap = failures you can isolate

// AI writes this, you're screwed when it breaks
class OrderProcessor {
  async process(orderId: string) {
    const db = new Database('prod-connection-string');
    const stripe = new StripeClient(process.env.STRIPE_KEY);
    const email = new SendGridClient(process.env.SENDGRID_KEY);
    
    // which one failed? good luck finding out
    const order = await db.getOrder(orderId);
    const charge = await stripe.charge(order.total);
    await email.send(order.email, 'receipt', { charge });
  }
}

// AI writes this, you can debug it
interface Database {
  getOrder(id: string): Promise<Order>;
}

interface PaymentProvider {
  charge(amount: number, customerId: string): Promise<ChargeResult>;
}

interface EmailService {
  send(to: string, template: string, data: object): Promise<void>;
}

class OrderProcessor {
  constructor(
    private db: Database,
    private payments: PaymentProvider,
    private email: EmailService,
    private logger: Logger
  ) {}
  
  async process(orderId: string) {
    this.logger.info('Processing order', { orderId });
    
    const order = await this.db.getOrder(orderId);
    this.logger.debug('Order fetched', { order });
    
    const charge = await this.payments.charge(order.total, order.customerId);
    this.logger.debug('Payment charged', { chargeId: charge.id });
    
    await this.email.send(order.email, 'receipt', { charge });
    this.logger.info('Order processed', { orderId, chargeId: charge.id });
  }
}        

Second version breaks? Logs tell you which dependency failed. You can swap in a mock to test each piece. First version? Everything's tangled.
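
Here's a minimal sketch of that swap with hand-rolled fakes (no mocking library needed). The fake order's fields and the ChargeResult stand-in are assumptions to keep it short; the point is that each collaborator can be observed and replaced on its own:

// Hand-rolled fakes for the interfaces above. The fake order's fields and the
// minimal ChargeResult stand-in are assumptions for this sketch.
const charges: number[] = [];
const sentTo: string[] = [];

const fakeDb: Database = {
  getOrder: async () =>
    ({ id: 'o1', total: 49.99, customerId: 'c1', email: 'a@b.co' }) as any,
};
const fakePayments: PaymentProvider = {
  charge: async (amount) => {
    charges.push(amount);
    return { id: 'ch_1' } as any;
  },
};
const fakeEmail: EmailService = {
  send: async (to) => { sentTo.push(to); },
};
const quietLogger = { info() {}, debug() {} } as any;

async function testProcessChargesAndEmails() {
  const processor = new OrderProcessor(fakeDb, fakePayments, fakeEmail, quietLogger);
  await processor.process('o1');

  // If either assertion fails, the arrays show which collaborator was never reached.
  console.assert(charges[0] === 49.99, 'should charge the order total');
  console.assert(sentTo[0] === 'a@b.co', 'should email the receipt to the order address');
}

testProcessChargesAndEmails().catch(console.error);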

2. Logging as you write, not after it breaks

# AI generates this
def sync_inventory(warehouse_id: str, items: list):
    inventory = fetch_current_stock(warehouse_id)
    updates = calculate_differences(inventory, items)
    apply_updates(warehouse_id, updates)
    return len(updates)

# AI should generate this
def sync_inventory(
    warehouse_id: str, 
    items: list[InventoryItem],
    logger: Logger
) -> int:
    logger.info(f"Starting inventory sync for warehouse {warehouse_id}")
    logger.debug(f"Syncing {len(items)} items")
    
    inventory = fetch_current_stock(warehouse_id)
    logger.debug(f"Current stock: {len(inventory)} items")
    
    updates = calculate_differences(inventory, items)
    logger.info(f"Calculated {len(updates)} updates", extra={
        'additions': sum(1 for u in updates if u.type == 'add'),
        'removals': sum(1 for u in updates if u.type == 'remove'),
        'modifications': sum(1 for u in updates if u.type == 'modify')
    })
    
    if not updates:
        logger.info("No updates needed, skipping apply")
        return 0
    
    apply_updates(warehouse_id, updates)
    logger.info(f"Applied {len(updates)} updates successfully")
    
    return len(updates)        

Production breaks? Second version's logs tell you exactly where. First version gives you nothing.

3. Tests first = spec is locked before code exists

// AI writes code then maybe tests
function calculateDiscount(cart: Cart, user: User): number {
  // 50 lines of discount logic
  // breaks in production
  // now you're guessing what it should do
}

// AI writes tests then code
describe('calculateDiscount', () => {
  it('applies 10% for orders over $100', () => {
    const cart = { total: 150, items: [...] };
    const user = { tier: 'standard' };
    expect(calculateDiscount(cart, user)).toBe(15);
  });
  
  it('applies 20% for premium users', () => {
    const cart = { total: 100, items: [...] };
    const user = { tier: 'premium' };
    expect(calculateDiscount(cart, user)).toBe(20);
  });
  
  it('never discounts below $50 threshold', () => {
    const cart = { total: 40, items: [...] };
    const user = { tier: 'premium' };
    expect(calculateDiscount(cart, user)).toBe(0);
  });
  
  it('stacks user tier with bulk discounts', () => {
    const cart = { total: 500, items: Array(20).fill({...}) };
    const user = { tier: 'premium' };
    // 20% user + 5% bulk = 25% total
    expect(calculateDiscount(cart, user)).toBe(125);
  });
});

function calculateDiscount(cart: Cart, user: User): number {
  if (cart.total < 50) return 0;
  
  let discount = 0;
  
  // User tier discount
  if (user.tier === 'premium') discount += 0.20;
  else if (cart.total > 100) discount += 0.10;
  
  // Bulk discount
  if (cart.items.length >= 20) discount += 0.05;
  
  return cart.total * discount;
}        

Tests lock down behavior before code exists. AI can't accidentally change the discount rules while fixing a bug.

4. Types + assertions = assumptions you can verify

// Implicit assumptions AI will break
async function processWebhook(payload: any) {
  const orderId = payload.data.order.id;
  const amount = payload.data.order.total;
  await updateOrder(orderId, amount);
}

// Explicit assumptions AI can't violate
interface WebhookPayload {
  event: 'order.created' | 'order.updated' | 'order.cancelled';
  data: {
    order: {
      id: string;
      total: number;
      currency: 'usd' | 'eur';
      status: 'pending' | 'paid' | 'failed';
    };
  };
  timestamp: number;
}

async function processWebhook(
  payload: WebhookPayload,
  logger: Logger
): Promise<void> {
  // Runtime assertions for things types can't catch
  if (payload.timestamp < Date.now() - 300000) {
    throw new Error(`Webhook too old: ${Date.now() - payload.timestamp}ms`);
  }
  
  if (payload.data.order.total <= 0) {
    throw new Error(`Invalid order total: ${payload.data.order.total}`);
  }
  
  logger.info('Processing webhook', {
    event: payload.event,
    orderId: payload.data.order.id,
    amount: payload.data.order.total
  });
  
  await updateOrder(
    payload.data.order.id,
    payload.data.order.total
  );
}        

Types catch wrong shapes at compile time. Assertions catch wrong values at runtime. AI can't silently break assumptions.
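
One gap worth closing: real webhook bodies arrive as untyped JSON, so nothing guarantees the payload actually matches WebhookPayload. A hand-rolled assertion function, sketched below (a schema library like zod would do the same job), upgrades the runtime check into the compile-time type before processWebhook runs:

// Narrows unknown JSON to WebhookPayload, or throws an error explaining why not.
function assertWebhookPayload(body: unknown): asserts body is WebhookPayload {
  const p = body as Partial<WebhookPayload> | null;
  if (!p || typeof p !== 'object') {
    throw new Error(`Webhook body is not an object, got: ${typeof body}`);
  }
  if (typeof p.timestamp !== 'number') {
    throw new Error(`Webhook missing numeric timestamp. Got keys: ${Object.keys(p)}`);
  }
  const order = p.data?.order;
  if (!order || typeof order.id !== 'string' || typeof order.total !== 'number') {
    throw new Error(`Webhook order malformed: ${JSON.stringify(p.data ?? null)}`);
  }
}

// Usage at the boundary: parse, assert, then hand a typed payload onward.
async function handleRawWebhook(rawBody: string, logger: Logger): Promise<void> {
  const body: unknown = JSON.parse(rawBody);
  assertWebhookPayload(body); // throws a descriptive error on bad shapes
  await processWebhook(body, logger); // body is now a WebhookPayload
}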

The prompt that makes AI write debuggable code:

Write this with debugging in mind:
1. Start with tests that define exact behavior
2. Inject all dependencies (db, apis, services) - no globals
3. Add logging at every decision point (before/after each operation)
4. Type everything - inputs, outputs, intermediates
5. Assert assumptions that types can't enforce (positive amounts, recent timestamps, etc)

The code should tell me what failed and why without me adding prints.        

This is the difference between "AI writes fast code that breaks mysteriously" and "AI writes debuggable code that tells you what failed."

What Shipped Last Quarter

Cursor launched Debug Mode on December 10. It instruments your app with runtime logs, generates hypotheses about what broke, then calls you back to verify. First tool to admit AI can't debug alone.

GitHub shipped Copilot Code Review in October. Combines LLM detection with CodeQL's deterministic analysis. They know AI needs real static analysis for security.

OpenAI released GPT-5.2-Codex on December 18. "Project-scale debugging sessions" with better context. Hit 56.4% on SWE-Bench Pro. Found actual React CVEs during testing.

Anthropic's Claude Sonnet 4.5 went GA in October. 0% error rate on their internal editing benchmarks (vs 9% before). Can run autonomously for 30+ hours.

Amazon Q Developer got debugging agents at re:Invent. Analyzes CloudWatch logs, finds Lambda errors and VPC routing problems.

Notice the pattern? Everyone's adding human checkpoints and runtime instrumentation. Nobody claims autonomous debugging works.

What This Actually Means

Addy Osmani from Google put it best in December: AI gives you "maybe 70% of the code—the scaffolding, the obvious patterns." But "the remaining 30%—edge cases, security, production integration, debugging—can be just as time consuming as it ever was."

Martin Fowler used an engineering metaphor: "We can't skate too close to the edge because otherwise we're going to have some bridges collapsing." He's talking about safety margins when working with non-deterministic systems.

Charity Majors: "The hardest problem is not usually debugging the code. It's finding out where in the system is the code that you need to debug." Observability matters more with AI code, not less.

The numbers back this up. Median PR size grew 33% from March to November. Lines per developer went from 4,450 to 7,839. We're writing more code. CodeRabbit found 1.7x more bugs per PR.

More code doesn't mean more productivity. It means more surface area for bugs.

I'm a solo founder. This makes or breaks my entire day.

When AI-generated code breaks at 2am and I'm the only one who can fix it, that's it. No team to delegate debugging to. No senior engineer to review what went wrong. I'm doing product, code, compliance, marketing—all of it.

A mysterious failure in AI-generated code doesn't slow me down. It stops me completely. That debugging session that should take 20 minutes? It becomes 4 hours because the code has no instrumentation, no clear boundaries, no logging. I'm context-switching between frontend, backend, infrastructure, trying to figure out which AI hallucination broke and why.

One hard-to-debug failure kills my entire week of shipping.


Waiting for GPT-6 won't fix this. The problem is architectural and methodological, not a matter of model capability.

I see two types of developers.

One group writes explicit error messages AI can parse, uses types to stop hallucinated APIs, keeps tests so AI can't break contracts, designs bounded services AI can reason about, adds logging and dependency injection from the start, and treats AI output as a first draft that needs instrumentation.

The other group lets AI write loose code with no types, accepts "almost right" without checking, skips tests and dependency injection, and feeds errors back hoping AI eventually guesses right.

The skill that'll matter in 2026 isn't "can you use Cursor." It's "can you write code that tells you what broke and why—without spending hours adding debug statements after the fact."

When you're solo, you can't afford mysterious failures. Every hour debugging AI code is an hour not shipping, not talking to users, not keeping the company alive.

Most developers optimize for how fast AI writes code. Smart developers optimize for whether they can actually fix it when it inevitably breaks.
