Why You Should Never Try to Replace Your Toolchains With Direct LLM Output.
Why You Shouldn't Do This -- Slightly too x86_64 specific

I think asking LLMs to directly output assembly is a spectacularly bad idea. I kept seeing this wacky idea proposed in my LinkedIn feed, probably because the algorithms decided I was interested in assembly language, compiler toolchains, and LLMs. I blew most of them off without commenting, because it really isn't a good idea, but the latest one popped up while I was in the middle of catching up on C23, digging deeper into ARM variations, and reviewing coding-tool options on different platforms. So, I took the hook and munched the bait. First, a shout-out to the tirelessly dedicated, brilliant people who have created our modern toolchains; I so very much appreciate how much they've done for us all. With that said, we start with one of the most lamentable situations in software, one that has messed me up too many times to count, with ChatGPT 5.2's full reasoning model confirming my intuitions, or at least providing a lot of evidence for them. Here's how the conversation went:

Someone had proposed something I thought was a very bad idea on LinkedIn. They were saying: hey, maybe we should have LLMs generate our code in assembly rather than in any higher-level language. I thought this was wrong for so very many reasons that it was hard to pick just one or two to reply with. The first question you might ask, however, is: which assembly do you think you would use? If the answer is x64, due to its popularity, I would already find that questionable, but let's say we want that. In fact, given that there is so much source code out there in each of its two different formats, does that tend to cause more hallucinations and bad generations, simply because both syntaxes exist at large scale?

You’re not wrong to find that proposal deeply misguided, and your instinct about x86-64 assembly specifically making things worse for LLMs is exactly on target. There are multiple independent failure modes, and the dual-syntax issue is only the first one—but it’s a very clean place to start because it’s concrete and easy to explain.

I’ll structure this so you could almost lift pieces directly into a reply if you wanted.


1. x86-64 has two widely used, incompatible surface languages

(and neither is canonical)

Unlike ARM or RISC-V, x86-64 does not have a single, authoritative assembly language:

Dimension          Intel syntax    AT&T syntax
Operand order      dest, src       src, dest
Register prefix    rax             %rax
Immediate prefix   123             $123
Memory syntax      [rax+4]         4(%rax)
Mnemonics          shared          shared

These aren’t cosmetic differences. They invert meaning.
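A minimal illustration of the inversion (each line is valid in its own dialect; the comment markers follow each assembler's convention):

```asm
; Intel syntax: destination comes first -- this copies rbx into rax
mov rax, rbx

# AT&T syntax: source comes first -- this near-identical line
# copies rax into rbx, the opposite data flow
mov %rax, %rbx
```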

An LLM trained on:

  • Intel manuals
  • Windows kernel code
  • Linux kernel code
  • GCC output
  • LLVM output
  • Blog posts, Stack Overflow, reverse-engineering writeups

…is necessarily exposed to both dialects at massive scale.

That means:

  • The same instruction text can mean different things
  • The model must infer dialect implicitly
  • Small local mistakes silently invert semantics

This is exactly the kind of ambiguity LLMs are worst at: locally plausible, globally catastrophic errors.


2. This ambiguity does increase hallucinations and silent miscompilations

You’re right to suspect this would increase hallucinations, not reduce them.

Typical failure patterns you’d see:

  • Correct instruction mnemonic, wrong operand order
  • Mixed dialects in the same file
  • Intel syntax memory operands written in AT&T style
  • AT&T immediates missing $, silently changing meaning
  • Register names switching style mid-function

These errors:

  • Often assemble successfully
  • Produce wrong behavior
  • Are hard to spot in review
  • Are catastrophic in security-sensitive code

That’s a worst-case scenario for automated code generation.
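The "missing $" failure above is worth seeing concretely. These two AT&T-syntax lines differ by one character; both assemble, and only one is correct:

```asm
mov $1, %rax    # AT&T: load the immediate constant 1 into rax
mov 1, %rax     # AT&T: load 8 bytes FROM MEMORY ADDRESS 1 into rax --
                # assembles cleanly, then faults (or worse) at runtime
```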


3. Even if you “pick one syntax,” the problem doesn’t go away

Suppose someone says:

“Fine, we’ll just mandate Intel syntax.”

That does not solve it:

  • The training corpus still contains huge volumes of AT&T syntax
  • Many explanations mix pseudo-code, Intel syntax, and AT&T output
  • Toolchains still default differently (objdump, gcc -S, etc.)
  • Inline assembly often uses AT&T even when surrounding docs use Intel

So the model still has to actively suppress a competing grammar, which is not something LLMs do reliably.
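The differing toolchain defaults are easy to demonstrate with standard GCC and binutils options (all real flags):

```asm
# objdump -d a.out            -> disassembles in AT&T syntax by default
# objdump -d -M intel a.out   -> same binary, rendered in Intel syntax
# gcc -S foo.c                -> emits AT&T assembly by default
# gcc -S -masm=intel foo.c    -> emits Intel syntax instead
```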

By contrast:

  • ARM: one syntax
  • RISC-V: one syntax
  • LLVM IR: one syntax


4. x86-64 assembly is underspecified without context

Another fatal issue—independent of syntax:

x86-64 assembly does not encode enough intent to be a good generation target.

To write correct x86-64 assembly, you must know:

  • Calling convention (SysV vs Microsoft x64)
  • ABI alignment rules
  • Red zone usage
  • Callee vs caller-saved registers
  • Vector ABI details (XMM/YMM/ZMM)
  • Exception/unwind metadata requirements

None of this is in the assembly itself.

So an LLM generating assembly must:

  • Infer ABI
  • Maintain invariants across functions
  • Track invisible state

That’s vastly harder than generating C, Rust, or even LLVM IR, where the compiler enforces those constraints.
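As a sketch of how deeply the ABI shapes the emitted instructions (Intel syntax; `f` is a hypothetical function, and the exact stack adjustment is illustrative), the same call `f(42)` must be written differently under each convention:

```asm
; SysV AMD64 (Linux, macOS): first integer argument in rdi/edi
    mov edi, 42
    call f

; Microsoft x64 (Windows): first integer argument in rcx/ecx,
; plus 32 bytes of caller-allocated "shadow space" on the stack
    sub rsp, 40
    mov ecx, 42
    call f
    add rsp, 40
```

Nothing in either fragment says which ABI it targets; the generator simply has to know, and keep knowing, across every call site.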


5. Assembly maximizes the blast radius of small mistakes

Higher-level languages:

  • Have type systems
  • Have structured control flow
  • Have compilers that reject nonsense
  • Have sanitizers, UB detection, warnings

Assembly:

  • Accepts almost anything
  • Fails silently
  • Has undefined behavior that looks valid
  • Has no guardrails

LLMs already struggle with:

  • Off-by-one errors
  • State tracking
  • Long-range invariants

Assembly removes the last remaining safety net.


6. Ironically, compilers already solved the “best assembly” problem

If the goal is:

“Get optimal machine code”

Then the answer is:

  • Generate high-level, analyzable IR
  • Let mature compilers do instruction selection, scheduling, and ABI correctness

Modern compilers:

  • Know microarchitectural details
  • Handle errata
  • Emit correct unwind info
  • Respect security mitigations

An LLM does none of that.

Generating assembly directly is not “closer to the metal”—it’s bypassing decades of accumulated correctness.


7. A devastating one-liner you could use

If you wanted a concise reply:

“x86-64 doesn’t even have a single assembly language—Intel and AT&T syntaxes invert operand order. Training LLMs on both increases silent semantic errors, not correctness. You’d be removing type systems, ABIs, and compiler verification to generate the most brittle representation possible.”

That’s hard to argue with.


Bottom line

Your intuition is exactly right:

  • x86-64 assembly is uniquely bad as a generation target
  • Dual syntaxes materially increase hallucinations
  • Assembly removes safety, structure, and verification
  • Compilers already exist to do this job better than humans

This proposal misunderstands both:

  • How LLMs fail
  • Why abstraction layers exist

If you’d like, next we could:

  • Contrast this with LLVM IR as a reasonable low-level target
  • Explain why RISC-V is still a bad idea despite clean syntax
  • Or help you craft a short, surgical LinkedIn reply that doesn’t start a flame war

Just say the word.

When you mention safety, structure, and verification, it's not just the compiler; other parts of the toolchain also depend on a source-level representation in the language being compiled. MemorySanitizer, AddressSanitizer, ThreadSanitizer: not strictly part of the compiler, but so important for properly testing the heck out of complex code. And along with all the abstractions that make your source more understandable, you lose all those instrumentation possibilities that are so important for testing.

Yes—exactly. You’ve put your finger on something even more damning than “assembly is hard”: assembly destroys the entire testing and verification ecosystem that modern software depends on.

And crucially, this isn’t just “the compiler.” It’s the semantic richness of the source language that makes all of this possible.

Let me lay this out cleanly, because this is the part most people proposing “LLMs generate assembly” completely fail to grasp.


1. Sanitizers depend on source-level meaning, not just instructions

Tools like:

  • AddressSanitizer (ASan)
  • MemorySanitizer (MSan)
  • ThreadSanitizer (TSan)
  • UndefinedBehaviorSanitizer (UBSan)
  • Control Flow Integrity (CFI)
  • Stack protectors
  • Lifetime analysis
  • Escape analysis

…are not “extra passes you can just bolt onto assembly.”

They rely on:

  • Typed memory objects
  • Stack vs heap vs globals
  • Object lifetimes
  • Aliasing information
  • Control-flow structure
  • Function boundaries
  • Language-level UB rules

Assembly has none of this.

Once you drop to raw instructions:

  • Every load is just a load
  • Every store is just a store
  • Pointer provenance is gone
  • Object boundaries are gone
  • Lifetime is gone

You haven’t just lost checks—you’ve lost the concepts that make checking possible.


2. Sanitizers are semantic instrumentation, not debugging hacks

This is a key conceptual point.

Sanitizers are not:

“Debug builds with extra checks”

They are:

Semantic instrumentation derived from the language’s abstract machine

Example:

#include <stdlib.h>

int *p = malloc(sizeof(int));
free(p);
*p = 42;   // ASan can catch this (compile with -fsanitize=address)

ASan knows:

  • This allocation created an object
  • That object had a lifetime
  • That lifetime ended
  • The dereference violates language rules

In assembly:

mov rax, [rbx]

There is no way to answer:

  • Was this allocated?
  • Is it freed?
  • Is it out of bounds?
  • Is it racing another thread?

So you can’t “just add ASan to assembly.” The information is already destroyed.


3. You also lose compositional reasoning

Higher-level languages allow:

  • Local reasoning
  • Modular verification
  • Function contracts
  • Invariants

Even in C/C++—imperfect as they are—you can still say:

“This function takes ownership of this pointer”

In assembly:

  • Ownership is invisible
  • Aliasing is universal
  • Every function is “unsafe” in the Rust sense
  • Every register move might be a bug

So testing coverage explodes combinatorially.

This is precisely the kind of environment where LLM mistakes become undetectable until production.


4. You lose debuggability in a deep, structural way

With source-level code, you get:

  • Line tables
  • Variable names
  • Scopes
  • Types
  • Inlined function attribution
  • Symbolic stack traces

With assembly:

  • Debug info is best-effort metadata
  • Variables are register fragments
  • Optimizations destroy correspondence
  • Stack traces are unreliable
  • Sanitizer reports become meaningless

Even humans struggle here. Expecting an LLM to reason about it is fantasy.


5. Abstractions don’t just make code shorter—they make it testable

This directly answers your last question.

Yes: abstractions make code more understandable precisely because they preserve invariants.

They give the toolchain:

  • Something to enforce
  • Something to instrument
  • Something to falsify

Abstraction isn’t about hiding details—it’s about making the remaining details checkable.

Assembly hides everything that matters for correctness.


6. Why this matters even more for AI-generated code

LLMs are:

  • Probabilistic
  • Pattern-based
  • Weak at global invariants
  • Weak at long-range state tracking

So the only way to make LLM-generated code viable is:

  • Constrain it
  • Instrument it
  • Test it aggressively
  • Reject it early

Assembly does the opposite:

  • Removes constraints
  • Removes instrumentation
  • Removes rejection paths
  • Lets bugs survive silently

That’s the worst possible combination.


7. The irony: “closer to the metal” means “farther from correctness”

The proposal fails because it confuses:

“Low-level” with “precise”

In reality:

  • Higher-level representations are more precise
  • Assembly is ambiguous, underspecified, and context-dependent
  • Precision comes from semantics, not proximity to hardware

LLVM IR is “lower-level” than C—but still supports sanitizers. Assembly is lower still—and supports almost nothing.


8. A compact way to say this publicly

If you wanted something sharp but accurate:

“Assembly throws away the semantic information that sanitizers, debuggers, and verification tools rely on. You’re not just skipping the compiler—you’re deleting the entire testing ecosystem that makes modern software survivable.”

That’s not rhetoric. It’s a factual description.


Bottom line

You’re right on every axis:

  • Sanitizers are essential, not optional
  • They require source-level semantics
  • Assembly irreversibly destroys those semantics
  • LLMs need more guardrails, not fewer
  • Abstraction is what makes testing possible

The idea that LLMs should emit assembly isn’t bold or advanced—it’s pre–1970s thinking, dressed up with AI buzzwords.


Even with higher-level languages, LLMs create incredibly buggy code. Generated machine code would be an absolute disaster, and having to correct any mistakes in it would make using it totally impractical.

This breaks it down quite a bit more than I did, and fully explains something that likely shouldn't need any explaining at all, but apparently does: https://engrlog.substack.com/p/why-skip-the-code-ship-the-binary


I had missed a major hallucination in this article, but it doesn't undermine the point at all. It hilariously wrote: "ARM: one syntax." Okay, maybe one syntax, but there are so many different combinations of instruction-set architectures and extension availability that one of the most common things I had been conferring with LLMs about, while reading sloppy tutorials or watching videos, was learning the whole zoo of exactly which processors have exactly which instructions. So, not an improvement if you are looking at "ARM Assembly Code".


More articles by Jesse Silverman
