The Sum of All Inference
Credit: Reality PC YouTube Channel

Generative AI [1] is a breakthrough of our time -- or maybe it's hype, depending on your perspective and expectations. But let's assume the former. Currently the largest AI models have more than 50 layers and over 200 billion parameters [2], needing hundreds of GPUs and days or even weeks to train. That's kinda OK, as language and source code aren't changing that fast. After all, humans take years to train, and at what expense, lord only knows!
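
To get a feel for why hundreds of GPUs get involved, here is a rough back-of-envelope sketch in Python. The byte counts, overhead multiplier, and 80 GB per-GPU figure are illustrative assumptions, not vendor specifications.

    # Back-of-envelope: memory footprint of a 200-billion-parameter model.
    # Assumptions (illustrative only): FP16 weights at 2 bytes each, roughly
    # 4x extra for gradients and optimizer state during training, and 80 GB
    # of memory per high-end training GPU.

    params = 200e9          # 200 billion parameters
    bytes_per_param = 2     # FP16
    train_overhead = 4      # weights + gradients + optimizer state (rough multiplier)
    gpu_mem_gb = 80         # assumed memory per high-end GPU

    weights_gb = params * bytes_per_param / 1e9
    train_gb = weights_gb * train_overhead

    print(f"weights alone:        {weights_gb:,.0f} GB")
    print(f"training state:       {train_gb:,.0f} GB")
    print(f"GPUs just to hold it: {train_gb / gpu_mem_gb:.0f}+")

And that is memory alone; sustaining training throughput across that many parameters is what pushes the count toward the hundreds mentioned above.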

But training is only half of an AI model's life. Like any computer program, an AI model has to run - this is called inference. To give some perspective on large model inference, consider the following:

  • when you spend 5 minutes chatting with a large model, you pour 1 L of water on the ground [3]. That's right, not re-used, but gone - and not salt water either, but fresh water. This is because literally dozens of high-end CPU and GPU cores, consuming thousands of watts of power, are needed to calculate inference for your conversation. As you can imagine, these "inference cores" get extremely hot. To provide the non-corrosive water cooling needed to avoid a meltdown [4], inference cores operate in "server farms" located near cold water, for example Google's site on the Columbia River near The Dalles, Oregon
  • the need for inference is increasing exponentially, driven by an endless quest for more layers and more parameters. Where does this lead? To server farms the size of Rhode Island? Located on the floor of Lake Superior?
  • currently only three semiconductor manufacturers on the planet make off-the-shelf inference cores: Nvidia, AMD, and Intel. In addition, major cloud and social media providers (Google, Amazon, Microsoft, Apple, Facebook, etc.) make their own inference cores, using ASIC [5] technology to avoid paying markup to the semiconductor Big 3

Clearly, something is wrong with this picture. The human brain takes 40 watts, not thousands of watts, and 1 L of water lasts it a day, not 5 minutes. On our current path, even if our best AI data scientists keep making breakthroughs for the next 10 years, they won't approach even a fraction of human intelligence.
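
To put that gap in rough numbers, here is a minimal Python sketch using the figures above; the 2,000 W value is an assumed stand-in for "thousands of watts" and the other numbers come straight from the text.

    # Ratios implied by the figures above. 2000 W is an assumed stand-in for
    # "thousands of watts"; the other values are the article's own numbers.

    datacenter_watts = 2000       # assumed per-conversation draw
    brain_watts = 40              # human brain
    water_per_chat_l = 1.0        # liters per 5-minute chat
    water_per_day_l = 1.0         # liters per day for the brain
    chat_minutes = 5
    minutes_per_day = 24 * 60

    power_ratio = datacenter_watts / brain_watts
    water_ratio = (water_per_chat_l / chat_minutes) / (water_per_day_l / minutes_per_day)

    print(f"power gap: ~{power_ratio:.0f}x")   # ~50x
    print(f"water gap: ~{water_ratio:.0f}x")   # ~288x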

So how will we satisfy society's need for massive inference? Here are some predictions. Mark it down -- you heard these here first:

  1. First and foremost, bigger is not better. Neither is "hive intelligence", where all inference is handled by cloud providers. Bigger is not efficient, and centralization leads to weakness at scale. Evolution demonstrates that efficiency is the ultimate goal. Our current direction in CPU + GPU + software architecture is woefully inefficient
  2. Water usage of large AI models is unsustainable. Already the current generation of models makes Bitcoin mining look like kindergarten. Moreover, the cooling profile at the typical big tech server farm is increasingly skewing hotter, as servers run more inference workloads in addition to social media and web workloads [6]. Low-power Arm-based servers can't handle large model inference, so more high-end GPU and Xeon x86 servers are needed. Soon climate activists will figure this out and oppose construction of new server farms (inference data centers) with the same intensity they oppose hydroelectric dams and nuclear reactors
  3. The real breakthrough will be semiconductor neural memory:

     - 1000x more capacity than what we have today
     - based on variations of "content addressable" addressing (CAM [7]) instead of a math-based address space, with huge address ranges, on the order of petabytes or higher (a toy lookup sketch follows below)
     - slow access time, on the order of milliseconds -- orders of magnitude slower than the memory we use today
     - no EDAC [8] circuitry, which will be replaced by computational neural net connections; i.e. smart connections instead of "dumb weights". Errors will be considered unimportant and possibly even beneficial
     - extremely low cost, with no exotic materials

Does this start to sound vaguely familiar? Yes, something like the billions of neurons and trillions of synapses in the human brain.
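
For readers who haven't run into content-addressable lookup, here is a toy Python sketch of the difference: RAM returns whatever sits at a numeric address you must already know, while a CAM returns the stored entry whose content best matches a query, even an imperfect one -- which is also why per-bit error correction matters less. This is purely illustrative; a real CAM is a hardware structure, not a Python dictionary.

    # Address-based lookup (RAM) vs content-based lookup (CAM), as a toy model.

    ram = {0x1000: "cat", 0x1004: "dog", 0x1008: "car"}
    print(ram[0x1004])                  # RAM: you must already know the address

    def hamming(a, b):
        # Number of differing bits between two equal-length bit strings.
        return sum(x != y for x, y in zip(a, b))

    cam = {                             # stored content (bit pattern) -> associated value
        "110010": "cat",
        "011101": "dog",
        "101011": "car",
    }

    def cam_lookup(query):
        # Return the value whose stored pattern is closest to the query,
        # so a few flipped bits still retrieve the right entry.
        best = min(cam, key=lambda pattern: hamming(pattern, query))
        return cam[best]

    print(cam_lookup("110110"))         # noisy query, still finds "cat"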

  4. Training will not require "gradient descent", "simulated annealing", or other mathematically complex techniques. Nothing like this is happening in the brain - indeed, the brain has nowhere near the required level of power and error-free calculation. Instead, training will be based on structural data relationships - organization, proximity, pathways - along with persistence (repetition and forgetfulness) that support CAM methods. Unfortunately for Nvidia, complex math calculations will not be needed
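
As one illustration of what "repetition and forgetfulness" without gradient descent can look like, here is a toy Hebbian-style sketch in Python. This is not the method the prediction describes -- just a familiar, loss-free example in which connections strengthen when repeatedly co-activated and decay when idle.

    # Toy Hebbian-style update: repetition strengthens a connection, idle
    # connections decay ("forgetfulness"). No gradients, no global loss.
    # Illustrative only; not a claim about how neural memory would train.

    import itertools

    strength = {}            # (unit_a, unit_b) -> connection strength
    LEARN, DECAY = 0.2, 0.95

    def observe(active_units):
        # Decay every existing connection, then strengthen each co-active pair.
        for pair in strength:
            strength[pair] *= DECAY
        for a, b in itertools.combinations(sorted(active_units), 2):
            strength[(a, b)] = strength.get((a, b), 0.0) + LEARN

    for _ in range(10):
        observe({"smoke", "fire"})     # seen together repeatedly -> strong link
    observe({"smoke", "mirrors"})      # seen once -> weak, and already fading

    print(sorted(strength.items(), key=lambda kv: -kv[1]))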

  5. Established semiconductor companies other than the Big 3 will be incentivized by CHIPS Act funding (and future government legislation) to develop combined CPU and neural memory devices. For example, Texas Instruments possesses archived technology that still outperforms Nvidia and Intel in "processing density" (the ratio of performance to chip package size and power consumption), even 8 years after they shelved it. In addition to funding, the government could deny TI waivers to sell analog and discrete devices to China unless they partner with a memory manufacturer and re-enter the inference device market. Other candidates for government intervention include Qualcomm and Analog Devices
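
"Processing density" as defined above is just a ratio, and a few lines of Python make the comparison concrete. Every number below is a hypothetical placeholder, not a TI, Nvidia, or Intel specification.

    # Processing density = performance / (package size x power), per the
    # definition above. All figures are hypothetical placeholders.

    def processing_density(tops, package_mm2, watts):
        # TOPS per (mm^2 * W); higher means more work per unit of area and power.
        return tops / (package_mm2 * watts)

    device_a = processing_density(tops=100, package_mm2=800, watts=300)  # big GPU-class part
    device_b = processing_density(tops=20, package_mm2=100, watts=10)    # small embedded part

    print(f"device A: {device_a:.4f} TOPS/(mm^2*W)")
    print(f"device B: {device_b:.4f} TOPS/(mm^2*W)")   # smaller part wins on density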

We can't say when and we can't say who, but what we can say is that the semiconductor entity - startup, established player, or government-sponsored consortium - that figures out neural memory will be the first "big AI", this century's success story.

[1] Generative Pre-Trained Transformer (GPT) Large Language Models (LLMs) and Large Programming Models (LPMs)

[2] In AI, extremely high numbers of layers and parameters underlie the term "deep", as in deep learning or deep neural networks (DNNs)

[3] https://gizmodo.com/chatgpt-ai-water-185000-gallons-training-nuclear-1850324249

[4] Meltdown - literally, to avoid melting the solder attaching CPU and GPU cores to their circuit board

[5] ASIC - application specific integrated circuit

[6] https://www.wsj.com/articles/rising-data-center-costs-linked-to-ai-demands-fc6adc0e

[7] Generally known as CAM - content addressable memory

[8] EDAC - error detection and correction
