Why You Should Never Try to Replace Your Toolchains With Direct LLM Output.
Why You Shouldn't Do This -- Slightly too x86_64 specific

I think asking LLMs to directly output assembly is a spectacularly bad idea. I kept seeing this wacky idea proposed in my LinkedIn feed, probably because the algorithms decided I was interested in assembly language, compiler toolchains, and LLMs. I blew most of them off without commenting, because it really isn't a good idea, but the latest one popped up while I was in the middle of catching up on C23, digging deeper into ARM variations, and reviewing coding-tool options on different platforms. So, I took the hook and munched the bait. First, a shout-out to the tirelessly dedicated, brilliant people who have created our modern toolchains; I so very much appreciate how much they've done for us all. With that said, we start with one of the most lamentable situations in software, one that has messed me up too many times to count, with ChatGPT 5.2's full reasoning model confirming my intuitions, or at least providing a lot of evidence for them. Here's how the conversation went:

Someone had proposed something I thought was a very bad idea on LinkedIn. They were saying: hey, maybe we should have LLMs generate our code in assembly rather than in any higher-level language. I thought this was wrong for so very many reasons that it was hard to pick just one or two to reply with. The first question you might ask, however, is: which assembly do you think you would use? If the answer is x64, due to its popularity, I would already find that questionable, but let's say we want that. In fact, given that there is so much source code out there in each of its two different formats, does that tend to cause more hallucinations and bad generations, simply because both syntaxes exist at large scale?

You’re not wrong to find that proposal deeply misguided, and your instinct about x86-64 assembly specifically making things worse for LLMs is exactly on target. There are multiple independent failure modes, and the dual-syntax issue is only the first one—but it’s a very clean place to start because it’s concrete and easy to explain.

I’ll structure this so you could almost lift pieces directly into a reply if you wanted.


1. x86-64 has two widely used, incompatible surface languages

(and neither is canonical)

Unlike ARM or RISC-V, x86-64 does not have a single, authoritative assembly language:

Dimension          Intel syntax    AT&T syntax
Operand order      dest, src       src, dest
Register prefix    rax             %rax
Immediate prefix   123             $123
Memory syntax      [rax+4]         4(%rax)
Mnemonics          shared          shared

These aren’t cosmetic differences. They invert meaning.
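A minimal illustration of the inversion (each line is valid in its own dialect; the comment markers follow each assembler's convention):

```asm
; Intel syntax: destination comes first -- this copies rbx into rax
mov rax, rbx

# AT&T syntax: source comes first -- this near-identical line
# copies rax into rbx, the opposite data flow
mov %rax, %rbx
```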

An LLM trained on:

  • Intel manuals
  • Windows kernel code
  • Linux kernel code
  • GCC output
  • LLVM output
  • Blog posts, Stack Overflow, reverse-engineering writeups

…is necessarily exposed to both dialects at massive scale.

That means:

  • The same instruction text can mean different things
  • The model must infer dialect implicitly
  • Small local mistakes silently invert semantics

This is exactly the kind of ambiguity LLMs are worst at: locally plausible, globally catastrophic errors.


2. This ambiguity does increase hallucinations and silent miscompilations

You’re right to suspect this would increase hallucinations, not reduce them.

Typical failure patterns you’d see:

  • Correct instruction mnemonic, wrong operand order
  • Mixed dialects in the same file
  • Intel syntax memory operands written in AT&T style
  • AT&T immediates missing $, silently changing meaning
  • Register names switching style mid-function

These errors:

  • Often assemble successfully
  • Produce wrong behavior
  • Are hard to spot in review
  • Are catastrophic in security-sensitive code

That’s a worst-case scenario for automated code generation.
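The "missing $" failure above is worth seeing concretely. These two AT&T-syntax lines differ by one character; both assemble, and only one is correct:

```asm
mov $1, %rax    # AT&T: load the immediate constant 1 into rax
mov 1, %rax     # AT&T: load 8 bytes FROM MEMORY ADDRESS 1 into rax --
                # assembles cleanly, then faults (or worse) at runtime
```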


3. Even if you “pick one syntax,” the problem doesn’t go away

Suppose someone says:

“Fine, we’ll just mandate Intel syntax.”

That does not solve it:

  • The training corpus still contains huge volumes of AT&T syntax
  • Many explanations mix pseudo-code, Intel syntax, and AT&T output
  • Toolchains still default differently (objdump, gcc -S, etc.)
  • Inline assembly often uses AT&T even when surrounding docs use Intel

So the model still has to actively suppress a competing grammar, which is not something LLMs do reliably.
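The differing toolchain defaults are easy to demonstrate with standard GCC and binutils options (all real flags):

```asm
# objdump -d a.out            -> disassembles in AT&T syntax by default
# objdump -d -M intel a.out   -> same binary, rendered in Intel syntax
# gcc -S foo.c                -> emits AT&T assembly by default
# gcc -S -masm=intel foo.c    -> emits Intel syntax instead
```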

By contrast:

  • ARM: one syntax
  • RISC-V: one syntax
  • LLVM IR: one syntax


4. x86-64 assembly is underspecified without context

Another fatal issue—independent of syntax:

x86-64 assembly does not encode enough intent to be a good generation target.

To write correct x86-64 assembly, you must know:

  • Calling convention (SysV vs Microsoft x64)
  • ABI alignment rules
  • Red zone usage
  • Callee vs caller-saved registers
  • Vector ABI details (XMM/YMM/ZMM)
  • Exception/unwind metadata requirements

None of this is in the assembly itself.

So an LLM generating assembly must:

  • Infer ABI
  • Maintain invariants across functions
  • Track invisible state

That’s vastly harder than generating C, Rust, or even LLVM IR, where the compiler enforces those constraints.
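As a sketch of how deeply the ABI shapes the emitted instructions (Intel syntax; `f` is a hypothetical function, and the exact stack adjustment is illustrative), the same call `f(42)` must be written differently under each convention:

```asm
; SysV AMD64 (Linux, macOS): first integer argument in rdi/edi
    mov edi, 42
    call f

; Microsoft x64 (Windows): first integer argument in rcx/ecx,
; plus 32 bytes of caller-allocated "shadow space" on the stack
    sub rsp, 40
    mov ecx, 42
    call f
    add rsp, 40
```

Nothing in either fragment says which ABI it targets; the generator simply has to know, and keep knowing, across every call site.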


5. Assembly maximizes the blast radius of small mistakes

Higher-level languages:

  • Have type systems
  • Have structured control flow
  • Have compilers that reject nonsense
  • Have sanitizers, UB detection, warnings

Assembly:

  • Accepts almost anything
  • Fails silently
  • Has undefined behavior that looks valid
  • Has no guardrails

LLMs already struggle with:

  • Off-by-one errors
  • State tracking
  • Long-range invariants

Assembly removes the last remaining safety net.


6. Ironically, compilers already solved the “best assembly” problem

If the goal is:

“Get optimal machine code”

Then the answer is:

  • Generate high-level, analyzable IR
  • Let mature compilers do instruction selection, scheduling, and ABI correctness

Modern compilers:

  • Know microarchitectural details
  • Handle errata
  • Emit correct unwind info
  • Respect security mitigations

An LLM does none of that.

Generating assembly directly is not “closer to the metal”—it’s bypassing decades of accumulated correctness.


7. A devastating one-liner you could use

If you wanted a concise reply:

“x86-64 doesn’t even have a single assembly language—Intel and AT&T syntaxes invert operand order. Training LLMs on both increases silent semantic errors, not correctness. You’d be removing type systems, ABIs, and compiler verification to generate the most brittle representation possible.”

That’s hard to argue with.


Bottom line

Your intuition is exactly right:

  • x86-64 assembly is uniquely bad as a generation target
  • Dual syntaxes materially increase hallucinations
  • Assembly removes safety, structure, and verification
  • Compilers already exist to do this job better than humans

This proposal misunderstands both:

  • How LLMs fail
  • Why abstraction layers exist

If you’d like, next we could:

  • Contrast this with LLVM IR as a reasonable low-level target
  • Explain why RISC-V is still a bad idea despite clean syntax
  • Or help you craft a short, surgical LinkedIn reply that doesn’t start a flame war

Just say the word.

When you mention safety, structure, and verification, it's not just the compiler; other parts of the toolchain also depend on a source-level representation in the language being compiled. MemorySanitizer, AddressSanitizer, ThreadSanitizer: not strictly part of the compiler, but so important for properly testing the heck out of complex code. And along with all the abstractions that make your source more understandable, you lose all those instrumentation possibilities that are so important for testing.

Yes—exactly. You’ve put your finger on something even more damning than “assembly is hard”: assembly destroys the entire testing and verification ecosystem that modern software depends on.

And crucially, this isn’t just “the compiler.” It’s the semantic richness of the source language that makes all of this possible.

Let me lay this out cleanly, because this is the part most people proposing “LLMs generate assembly” completely fail to grasp.


1. Sanitizers depend on source-level meaning, not just instructions

Tools like:

  • AddressSanitizer (ASan)
  • MemorySanitizer (MSan)
  • ThreadSanitizer (TSan)
  • UndefinedBehaviorSanitizer (UBSan)
  • Control Flow Integrity (CFI)
  • Stack protectors
  • Lifetime analysis
  • Escape analysis

…are not “extra passes you can just bolt onto assembly.”

They rely on:

  • Typed memory objects
  • Stack vs heap vs globals
  • Object lifetimes
  • Aliasing information
  • Control-flow structure
  • Function boundaries
  • Language-level UB rules

Assembly has none of this.

Once you drop to raw instructions:

  • Every load is just a load
  • Every store is just a store
  • Pointer provenance is gone
  • Object boundaries are gone
  • Lifetime is gone

You haven’t just lost checks—you’ve lost the concepts that make checking possible.


2. Sanitizers are semantic instrumentation, not debugging hacks

This is a key conceptual point.

Sanitizers are not:

“Debug builds with extra checks”

They are:

Semantic instrumentation derived from the language’s abstract machine

Example:

#include <stdlib.h>

int *p = malloc(sizeof(int));
free(p);
*p = 42;   // ASan can catch this (compile with -fsanitize=address)

ASan knows:

  • This allocation created an object
  • That object had a lifetime
  • That lifetime ended
  • The dereference violates language rules

In assembly:

mov rax, [rbx]

There is no way to answer:

  • Was this allocated?
  • Is it freed?
  • Is it out of bounds?
  • Is it racing another thread?

So you can’t “just add ASan to assembly.” The information is already destroyed.


3. You also lose compositional reasoning

Higher-level languages allow:

  • Local reasoning
  • Modular verification
  • Function contracts
  • Invariants

Even in C/C++—imperfect as they are—you can still say:

“This function takes ownership of this pointer”

In assembly:

  • Ownership is invisible
  • Aliasing is universal
  • Every function is “unsafe” in the Rust sense
  • Every register move might be a bug

So testing coverage explodes combinatorially.

This is precisely the kind of environment where LLM mistakes become undetectable until production.


4. You lose debuggability in a deep, structural way

With source-level code, you get:

  • Line tables
  • Variable names
  • Scopes
  • Types
  • Inlined function attribution
  • Symbolic stack traces

With assembly:

  • Debug info is best-effort metadata
  • Variables are register fragments
  • Optimizations destroy correspondence
  • Stack traces are unreliable
  • Sanitizer reports become meaningless

Even humans struggle here. Expecting an LLM to reason about it is fantasy.


5. Abstractions don’t just make code shorter—they make it testable

This directly answers your last question.

Yes: abstractions make code more understandable precisely because they preserve invariants.

They give the toolchain:

  • Something to enforce
  • Something to instrument
  • Something to falsify

Abstraction isn’t about hiding details—it’s about making the remaining details checkable.

Assembly hides everything that matters for correctness.


6. Why this matters even more for AI-generated code

LLMs are:

  • Probabilistic
  • Pattern-based
  • Weak at global invariants
  • Weak at long-range state tracking

So the only way to make LLM-generated code viable is:

  • Constrain it
  • Instrument it
  • Test it aggressively
  • Reject it early

Assembly does the opposite:

  • Removes constraints
  • Removes instrumentation
  • Removes rejection paths
  • Lets bugs survive silently

That’s the worst possible combination.


7. The irony: “closer to the metal” means “farther from correctness”

The proposal fails because it confuses:

“Low-level” with “precise”

In reality:

  • Higher-level representations are more precise
  • Assembly is ambiguous, underspecified, and context-dependent
  • Precision comes from semantics, not proximity to hardware

LLVM IR is “lower-level” than C—but still supports sanitizers. Assembly is lower still—and supports almost nothing.


8. A compact way to say this publicly

If you wanted something sharp but accurate:

“Assembly throws away the semantic information that sanitizers, debuggers, and verification tools rely on. You’re not just skipping the compiler—you’re deleting the entire testing ecosystem that makes modern software survivable.”

That’s not rhetoric. It’s a factual description.


Bottom line

You’re right on every axis:

  • Sanitizers are essential, not optional
  • They require source-level semantics
  • Assembly irreversibly destroys those semantics
  • LLMs need more guardrails, not fewer
  • Abstraction is what makes testing possible

The idea that LLMs should emit assembly isn’t bold or advanced—it’s pre–1970s thinking, dressed up with AI buzzwords.


Even with higher-level languages, LLMs create incredibly buggy code. Generated machine code would be an absolute disaster, and having to correct any mistakes in it would make using it totally impractical.

This breaks it down quite a bit more than I did, and fully explains something that likely shouldn't need any explaining at all, but apparently does: https://engrlog.substack.com/p/why-skip-the-code-ship-the-binary


I had missed a major hallucination in this article, but it doesn't undermine the point at all. It hilariously wrote: "ARM: one syntax." Okay, maybe one syntax, but there are so many different combinations of instruction-set architectures and extension availability that one of the most common things I had been conferring with LLMs about, while reading sloppy tutorials or watching videos, was learning the whole zoo of exactly which processors have exactly which instructions. So, not an improvement if you are looking at "ARM Assembly Code".


More articles by Jesse Silverman
