The Sum of All Inference
Credit: Reality PC YouTube Channel

Generative AI [1] is a breakthrough of our time -- or maybe it's hype, depending on your perspective and expectations. But let's assume the former. Currently the largest AI models have more than 50 layers and over 200 billion parameters [2], needing hundreds of GPUs and days or even weeks to train. That's kinda OK, as language and source code aren't changing that fast. After all, humans take years to train, and at what expense, lord only knows!
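
To get a feel for why hundreds of GPUs get involved, here is a rough back-of-envelope sketch in Python. The byte counts, overhead multiplier, and 80 GB per-GPU figure are illustrative assumptions, not vendor specifications.

    # Back-of-envelope: memory footprint of a 200-billion-parameter model.
    # Assumptions (illustrative only): FP16 weights at 2 bytes each, roughly
    # 4x extra for gradients and optimizer state during training, and 80 GB
    # of memory per high-end training GPU.

    params = 200e9          # 200 billion parameters
    bytes_per_param = 2     # FP16
    train_overhead = 4      # weights + gradients + optimizer state (rough multiplier)
    gpu_mem_gb = 80         # assumed memory per high-end GPU

    weights_gb = params * bytes_per_param / 1e9
    train_gb = weights_gb * train_overhead

    print(f"weights alone:        {weights_gb:,.0f} GB")
    print(f"training state:       {train_gb:,.0f} GB")
    print(f"GPUs just to hold it: {train_gb / gpu_mem_gb:.0f}+")

And that is memory alone; sustaining training throughput across that many parameters is what pushes the count toward the hundreds mentioned above.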

But training is only half of an AI model's life. Like any computer program, an AI model has to run - this is called inference. To give some perspective on large model inference, consider the following:

  • when you spend 5 minutes chatting with a large model, you pour 1 L of water on the ground [3]. That's right, not re-used, but gone - and not salt water either, but fresh water. This is because literally dozens of high-end CPU and GPU cores, consuming thousands of watts of power, are needed to calculate inference for your conversation. As you can imagine, these "inference cores" get extremely hot. To provide the non-corrosive water cooling needed to avoid a meltdown [4], inference cores operate in "server farms" located near cold water, for example Google's site on the Columbia River near The Dalles, Oregon
  • the need for inference is increasing exponentially, driven by an endless quest for more layers and more parameters. Where does this lead? To server farms the size of Rhode Island? Located on the floor of Lake Superior?
  • currently only three semiconductor manufacturers on the planet make off-the-shelf inference cores: Nvidia, AMD, and Intel. In addition, major cloud and social media providers (Google, Amazon, Microsoft, Apple, Facebook, etc.) make their own inference cores, using ASIC [5] technology to avoid paying markup to the semiconductor Big 3

Clearly, something is wrong with this picture. The human brain takes 40 watts, not thousands of watts, and 1 L of water lasts it a day, not 5 minutes. On our current path, even if our best AI data scientists keep making breakthroughs for the next 10 years, they won't approach even a fraction of human intelligence.
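
To put that gap in rough numbers, here is a minimal Python sketch using the figures above; the 2,000 W value is an assumed stand-in for "thousands of watts" and the other numbers come straight from the text.

    # Ratios implied by the figures above. 2000 W is an assumed stand-in for
    # "thousands of watts"; the other values are the article's own numbers.

    datacenter_watts = 2000       # assumed per-conversation draw
    brain_watts = 40              # human brain
    water_per_chat_l = 1.0        # liters per 5-minute chat
    water_per_day_l = 1.0         # liters per day for the brain
    chat_minutes = 5
    minutes_per_day = 24 * 60

    power_ratio = datacenter_watts / brain_watts
    water_ratio = (water_per_chat_l / chat_minutes) / (water_per_day_l / minutes_per_day)

    print(f"power gap: ~{power_ratio:.0f}x")   # ~50x
    print(f"water gap: ~{water_ratio:.0f}x")   # ~288x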

So how will we satisfy society's need for massive inference? Here are some predictions. Mark it down -- you heard these here first:

  1. First and foremost, bigger is not better. Neither is "hive intelligence", where all inference is handled by cloud providers. Bigger is not efficient, and centralization leads to weakness at scale. Evolution demonstrates that efficiency is the ultimate goal. Our current direction in CPU + GPU + software architecture is woefully inefficient
  2. Water usage of large AI models is unsustainable. Already the current generation of models makes Bitcoin mining look like kindergarten. Moreover, the cooling profile at the typical big tech server farm is increasingly skewing hotter, as servers run more inference workloads in addition to social media and web workloads [6]. Low-power Arm-based servers can't handle large model inference, so more high-end GPU and Xeon x86 servers are needed. Soon climate activists will figure this out and oppose construction of new server farms (inference data centers) with the same intensity they oppose hydroelectric dams and nuclear reactors
  3. The real breakthrough will be semiconductor neural memory:

     - 1000x more capacity than what we have today
     - based on variations of "content addressable" addressing (CAM [7]) instead of a math-based address space, with huge address ranges, on the order of petabytes or higher (a toy lookup sketch follows below)
     - slow access time, on the order of milliseconds -- orders of magnitude slower than the memory we use today
     - no EDAC [8] circuitry, which will be replaced by computational neural net connections; i.e. smart connections instead of "dumb weights". Errors will be considered unimportant and possibly even beneficial
     - extremely low cost, with no exotic materials

Does this start to sound vaguely familiar? Yes, something like the billions of neurons and trillions of synapses in the human brain.
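
For readers who haven't run into content-addressable lookup, here is a toy Python sketch of the difference: RAM returns whatever sits at a numeric address you must already know, while a CAM returns the stored entry whose content best matches a query, even an imperfect one -- which is also why per-bit error correction matters less. This is purely illustrative; a real CAM is a hardware structure, not a Python dictionary.

    # Address-based lookup (RAM) vs content-based lookup (CAM), as a toy model.

    ram = {0x1000: "cat", 0x1004: "dog", 0x1008: "car"}
    print(ram[0x1004])                  # RAM: you must already know the address

    def hamming(a, b):
        # Number of differing bits between two equal-length bit strings.
        return sum(x != y for x, y in zip(a, b))

    cam = {                             # stored content (bit pattern) -> associated value
        "110010": "cat",
        "011101": "dog",
        "101011": "car",
    }

    def cam_lookup(query):
        # Return the value whose stored pattern is closest to the query,
        # so a few flipped bits still retrieve the right entry.
        best = min(cam, key=lambda pattern: hamming(pattern, query))
        return cam[best]

    print(cam_lookup("110110"))         # noisy query, still finds "cat"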

  4. Training will not require "gradient descent", "simulated annealing", or other mathematically complex techniques. Nothing like this is happening in the brain - indeed, the brain has nowhere near the required level of power and error-free calculation. Instead, training will be based on structural data relationships - organization, proximity, pathways - along with persistence (repetition and forgetfulness) that support CAM methods. Unfortunately for Nvidia, complex math calculations will not be needed
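
As one illustration of what "repetition and forgetfulness" without gradient descent can look like, here is a toy Hebbian-style sketch in Python. This is not the method the prediction describes -- just a familiar, loss-free example in which connections strengthen when repeatedly co-activated and decay when idle.

    # Toy Hebbian-style update: repetition strengthens a connection, idle
    # connections decay ("forgetfulness"). No gradients, no global loss.
    # Illustrative only; not a claim about how neural memory would train.

    import itertools

    strength = {}            # (unit_a, unit_b) -> connection strength
    LEARN, DECAY = 0.2, 0.95

    def observe(active_units):
        # Decay every existing connection, then strengthen each co-active pair.
        for pair in strength:
            strength[pair] *= DECAY
        for a, b in itertools.combinations(sorted(active_units), 2):
            strength[(a, b)] = strength.get((a, b), 0.0) + LEARN

    for _ in range(10):
        observe({"smoke", "fire"})     # seen together repeatedly -> strong link
    observe({"smoke", "mirrors"})      # seen once -> weak, and already fading

    print(sorted(strength.items(), key=lambda kv: -kv[1]))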

  5. Established semiconductor companies other than the Big 3 will be incentivized by CHIPS Act funding (and future government legislation) to develop combined CPU and neural memory devices. For example, Texas Instruments possesses archived technology that still outperforms Nvidia and Intel in "processing density" (the ratio of performance to chip package size and power consumption), even 8 years after they shelved it. In addition to funding, the government could deny TI waivers to sell analog and discrete devices to China unless they partner with a memory manufacturer and re-enter the inference device market. Other candidates for government intervention include Qualcomm and Analog Devices
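
"Processing density" as defined above is just a ratio, and a few lines of Python make the comparison concrete. Every number below is a hypothetical placeholder, not a TI, Nvidia, or Intel specification.

    # Processing density = performance / (package size x power), per the
    # definition above. All figures are hypothetical placeholders.

    def processing_density(tops, package_mm2, watts):
        # TOPS per (mm^2 * W); higher means more work per unit of area and power.
        return tops / (package_mm2 * watts)

    device_a = processing_density(tops=100, package_mm2=800, watts=300)  # big GPU-class part
    device_b = processing_density(tops=20, package_mm2=100, watts=10)    # small embedded part

    print(f"device A: {device_a:.4f} TOPS/(mm^2*W)")
    print(f"device B: {device_b:.4f} TOPS/(mm^2*W)")   # smaller part wins on density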

We can't say when and we can't say who, but what we can say is that the semiconductor entity - startup, established player, or government-sponsored consortium - that figures out neural memory will be the first "big AI", this century's success story.

[1] Generative Pre-Trained Transformer (GPT) Large Language Models (LLMs) and Large Programming Models (LPMs)

[2] In AI, extremely high numbers of layers and parameters underlie the term "deep", as in deep learning or deep neural networks (DNNs)

[3] https://gizmodo.com/chatgpt-ai-water-185000-gallons-training-nuclear-1850324249

[4] Meltdown - literally, to avoid melting the solder attaching CPU and GPU cores to their circuit board

[5] ASIC - application specific integrated circuit

[6] https://www.wsj.com/articles/rising-data-center-costs-linked-to-ai-demands-fc6adc0e

[7] Generally known as CAM - content addressable memory

[8] EDAC - error detection and correction
