Machine Learning Hardware

“If you were plowing a field, which would you rather use: two strong oxen or 1,024 chickens?” Seymour Cray's question still holds good in a world where no single hardware architecture dominates all types of workloads. At times the two strong oxen are winning, and at other times the 1,024 chickens are. The same is true of Machine Learning (ML) / Deep Learning (DL) workloads: no single hardware architecture has been able to dominate this field.

The computational requirements of ML / DL are high, since we are trying to identify patterns in a dataset that could be effectively unbounded. The data undergoes transformation prior to pattern identification, followed by running the learned model against data not seen during training. In ML terminology these are, at a high level, preprocessing, model building, and inference. The computational requirements are highest for model building, followed by preprocessing and then inference. Model building is an iterative process and requires many iterations before arriving at a generalized model that performs well on unseen scenarios. The entire ML process may have more steps than the three mentioned, but from the perspective of computational requirements, these three constitute the major chunk.
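The three stages can be sketched with a toy linear model; the data, variable names, and the gradient-descent loop here are purely illustrative, not any particular framework's API.

```python
import numpy as np

# Toy data: y is (roughly) 3 * x plus noise.
rng = np.random.default_rng(42)
X_raw = rng.uniform(0, 100, size=(200, 1))
y = 3.0 * X_raw[:, 0] + rng.normal(0, 1, 200)

# 1. Preprocessing: scale the feature to zero mean / unit variance.
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X = (X_raw - mu) / sigma

# 2. Model building: the iterative, compute-heavy stage.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = X[:, 0] * w + b - y
    w -= lr * (err @ X[:, 0]) / len(y)
    b -= lr * err.mean()

# 3. Inference: apply the learned model to an unseen input,
#    reusing the training-set scaling statistics.
x_new = (np.array([[50.0]]) - mu) / sigma
y_hat = x_new[:, 0] * w + b
print(float(y_hat[0]))  # close to the true value 3 * 50 = 150
```

Even on this toy example, stage 2 dominates the compute: it touches all 200 samples on every one of its 500 iterations, while preprocessing and inference are each a single pass.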

There are multiple ways to increase the throughput of an ML pipeline. Software optimization is the first step, and many software frameworks are being adapted to perform better on heterogeneous hardware. Better utilization of cores in a multi-core CPU (Xeon Skylake or Xeon Cascade Lake) is one such example. Another technique is mixed-precision compute: FP32 or FP64 is not required all the time, and we can use lower precision to achieve more throughput in some scenarios. The OpenVINO toolkit is an example of a highly optimized framework for inference on CPUs, integrated graphics, and FPGAs. Hardware acceleration is the other alternative, for example specialized hardware for vector processing such as a GPU or TPU.

The data used for ML can be represented as scalar, vector, matrix, or spatial data, where a matrix can be thought of as a collection of row or column vectors. It is difficult to find a unified hardware architecture that can process all these forms of data equally well. CPUs are strong at scalar processing, with support for vectors through Advanced Vector Extensions (AVX). GPUs are better suited to vector processing than CPUs, which is why they are used as accelerators in some ML / DL pipelines. Spatial data is better processed using FPGAs. The hardware options we have today are:

  • CPU
  • GPU (Integrated Graphics & Discrete Graphics)
  • FPGA
  • ASIC / ASSP / SoC
  • TPU
  • VPU
  • OPU
  • IPU
  • ......
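The scalar-versus-vector distinction above can be shown in a few lines of NumPy: the same dot product written scalar-at-a-time and as a single vectorized call, the latter being the pattern that AVX lanes and GPUs accelerate (the array size is arbitrary).

```python
import numpy as np

# The same dot product expressed scalar-at-a-time and vectorized.
# Vector hardware processes many lanes per instruction in the second
# form; the first form executes one multiply-add at a time.
x = np.arange(1000, dtype=np.float32)
y = np.ones_like(x)

# Scalar view: one multiply-add per loop iteration.
acc = 0.0
for i in range(len(x)):
    acc += float(x[i]) * float(y[i])

# Vector view: the whole reduction in a single call.
dot = float(x @ y)

print(acc, dot)  # both 499500.0 (sum of 0..999, exact at FP32)
```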

The hardware options listed above are mostly used in conjunction with a CPU (with a few exceptions) and act as accelerators that speed up computation. The CPU is the most general purpose among them, and the most common scenario is the CPU offloading certain types of workload to the accelerator. Most software frameworks will use the most appropriate hardware in your system with only minor code changes. It is clear that heterogeneous computing is here to stay, and a better understanding of it is needed to get the best out of the hardware. Some of the hardware listed above is still under research and development.

Model building is the most computationally intensive step, with a high level of iteration involved. Depending on the size of the data and the computational requirements, we can use accelerators on a single system, scale vertically (more accelerators per system), or distribute the workload across multiple systems. There are many examples of HPC / distributed architectures used for ML / DL training, with varied success across different workloads. There are many criteria for choosing the right hardware for a machine learning pipeline. A short list of criteria follows.

  1. Which stage of ML are we dealing with: preprocessing, model building, or inference? Hardware ideal for model building might not be ideal for inference.
  2. How much data are you processing? What happens if the data doesn't fit into memory?
  3. What type of data are you processing? Scalar, vector, matrix, or spatial.
  4. Which ML / DL framework are you using? Some frameworks are more optimized for certain hardware architectures.
  5. Is the solution deployed at the edge or in the cloud? Power usage and latency are great concerns at the edge.
  6. Are you concerned with power consumption, low latency, or heat generated during inference?
  7. Are you willing to compromise accuracy for speed using techniques like mixed- or low-precision compute?
  8. Are you a researcher who wants to experiment rapidly with alternative algorithms, or a user who wants to apply an existing one?
  9. What type of algorithms do you use? Some algorithms perform better on certain hardware.
  10. What are your throughput requirements? For example, 30 frames per second or similar measures.
  11. Is your team knowledgeable enough to work with frameworks like CUDA / OpenCL / ROCm to take further advantage of the hardware if needed?
  12. Do you expect workloads other than ML / DL to run on the same hardware?
  13. How many concurrent workloads are you planning to run on the same hardware?
  14. Are you looking for on-premise computing, or are you fine with the cloud? Some options are available only in the cloud.
  15. What is the cost of processing (time vs. power usage vs. speed)?
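Returning to the idea of distributing model building across systems, the following sketch simulates data parallelism: each of four hypothetical "workers" computes a gradient on its own shard of the data, and the shards' gradients are averaged before each update, the role an all-reduce plays in real distributed training. All names and sizes are illustrative.

```python
import numpy as np

# Simulated data parallelism: 4 "workers", each holding one shard of
# the data, compute local gradients that are then averaged.
rng = np.random.default_rng(1)
X = rng.standard_normal((400, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                  # noiseless toy targets

def local_gradient(X_shard, y_shard, w):
    """Gradient of the mean-squared error on one worker's shard."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

w = np.zeros(3)
shards = np.array_split(np.arange(len(y)), 4)   # 4 equal shards
for _ in range(300):
    grads = [local_gradient(X[s], y[s], w) for s in shards]
    w -= 0.1 * np.mean(grads, axis=0)           # averaged ("all-reduced") update

print(np.round(w, 3))  # recovers true_w
```

Because the shards are equally sized, the averaged gradient equals the full-batch gradient here; in practice, communication cost of the averaging step is what limits how well this scales across machines.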

Due to the high computational requirements of machine learning / deep learning, a lot of research is going into the development of new hardware and the optimization of software to exploit existing hardware. This hardware research can be classified as evolutionary or revolutionary. The use of photons instead of electrons in an OPU is an example of bridging from evolutionary to revolutionary computing. We are in the early stages of revolutionary computing, with options like neuromorphic computing and quantum computing in the future. A hardware revolution is happening which might allow us to solve problems that were out of bounds for evolutionary computing. It is unclear how the computing landscape will change in the next couple of decades, but we can hope for the better. It is estimated that "Data centers of the world will consume 1/5 of Earth’s power by 2025". It is a must that we produce less power-hungry machines going forward, as compute requirements are constantly increasing.

More articles by Rajeev M A

  • AI: Between Innovation, Hype, and Economic Reality

    We are living through one of the most exciting — and uncertain — phases in the history of technology. AI has gone from…

    8 Comments
  • Application of AI in Media Sector

    Deep Learning has increasingly found its place in the creative side of media, driving innovation across storytelling…

  • Bridging the Gap: Industry and Academia in AI/ML

    1. Intent Over the past decade, I have been actively engaged with academic institutions in a part-time capacity…

    11 Comments
  • Vibe Coding: Where the AI Magic Fizzles Out

    Introduction: There’s a lot of noise lately about how AI will soon write most of our code, leaving developers to merely…

    6 Comments
  • The 7R Model of AI Evolution: From Retrieval to Retroponitic

    Artificial Intelligence (AI) has been on an extraordinary journey of growth, evolving through distinct stages of…

    7 Comments
  • There is No Innovation Without an Invoice

    Introduction Success of technology depends on the value it adds to the business (consumer and enterprise). Some…

    5 Comments
  • Generative AI

    Generative AI, often referred to as GenAI, is a specialized subset of artificial intelligence dedicated to creating…

  • Applications of Artificial Intelligence in the Power Sector

    I had the privilege of speaking at the National Symposium on Emerging Technologies for Green Energy, an event organized…

    6 Comments
  • Stochasticity in Business Process

    Normally business processes are deterministic in nature. Rule based systems are fundamental part of any business…

    4 Comments
  • MLOps

    Why do many Machine Learning (ML) projects fail? Another way to look at it is, why many software projects fail? Can it…

Others also viewed

Explore content categories