The Move From Cloud To Real Time Embedded Machine Learning
Cloud To Local Embedded System Processing
Poor connectivity, large data volumes, real-time decision-making requirements and security concerns mean that, for many customers, cloud computing is not an option. However, moving expensive, power-hungry processing hardware to the location of the application is not an option either.
What is typically required is an embedded system, perhaps similar in size to a standard smartphone, offering low power consumption, a reasonable level of performance and near real-time processing.
Our Lessons
In the move from Cloud Machine Learning to Embedded Solutions we have encountered four key elements that have proved critical to our products:
- The need for focused applications, with narrow clear objectives
- The ability to update Algorithms and Parameters allowing a continuously evolving, improving system
- Automatic Embedded Code Generation System to speed up the development process
- Well designed customized SOC/Board design For Real Time Processing
Focused Application
Generally the requirements for Machine Learning applications are quite focused, with narrow, application-specific aims, varying from heart-problem detection to the prediction of hardware failures. The key is to define your requirements clearly and produce an algorithm that performs well against them.
This alone goes a long way towards making such applications feasible on embedded systems. Even with this reduction in complexity, however, they often remain beyond most standard SOC designs.
Common solutions are:-
- GPU Accelerated Systems
- Multi-core DSP Solutions With Instruction Customization
- Custom FPGA Designs
- Custom LSI
Training And Application
Training of Machine Learning algorithms remains an offline task: embedded systems simply cannot match the processing power of cloud servers with GPU or FPGA co-processors. The embedded system therefore simply implements the trained network.
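As a concrete illustration of this hand-off, a trained layer's float32 weights can be reduced to int8 values plus a single scale factor before shipping to the device, leaving only small fixed-point tables and a trivial integer kernel on the embedded side. This is a minimal sketch of symmetric quantization, not any specific framework's exporter:

```python
# Illustrative sketch: quantize offline-trained float weights to int8 + scale,
# so the embedded target never needs the full training-time representation.
def quantize(weights):
    """Symmetric per-tensor int8 quantization (hypothetical, for illustration)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """What the embedded inference kernel effectively computes with."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]          # example trained parameters
q, s = quantize(weights)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The round trip loses at most half a quantization step per weight, which is usually acceptable for inference-only deployment.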
System Updates
Updating the system can take two forms:-
- Replacing the entire firmware (entire algorithm)
- Updating the Machine Learning Algorithm Parameters
The ability to update the system as training data grows in quantity and quality is vital: it opens the possibility of better-tuned parameters, or even a change of algorithm.
With such possibilities in mind, care must be taken when designing custom SOCs. There is a risk that the SOC is so highly customized that future upgrades can no longer take advantage of the custom hardware.
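One way to make the lighter, parameter-only form of update safe is to wrap the new weights in a small versioned header, so the device can reject a blob its current firmware (and custom hardware) cannot execute, without a full reflash. A minimal sketch with an invented format: the `MLUP` magic and the supported-version set are assumptions, not a real protocol:

```python
import struct
import zlib

MAGIC = b"MLUP"                        # hypothetical update-file magic
SUPPORTED_ALGO_VERSIONS = {1, 2}       # versions this firmware build can run

def pack_update(algo_version, params):
    """Prefix raw parameter bytes with magic, algorithm version and CRC32."""
    header = struct.pack("<4sII", MAGIC, algo_version, zlib.crc32(params))
    return header + params

def unpack_update(blob):
    """Validate an update blob; raise ValueError rather than brick the device."""
    magic, version, crc = struct.unpack("<4sII", blob[:12])
    if magic != MAGIC or version not in SUPPORTED_ALGO_VERSIONS:
        raise ValueError("update rejected: unknown format or algorithm version")
    params = blob[12:]
    if zlib.crc32(params) != crc:
        raise ValueError("update rejected: corrupt parameter data")
    return params

params = bytes(range(16))              # stand-in for serialized weights
blob = pack_update(2, params)
```

A full firmware replacement would still be needed when the algorithm itself changes beyond what the version check permits.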
Automatic Embedded Source Code Generation
The starting point for many Machine Learning algorithm development projects is Open Source frameworks such as Chainer, TensorFlow and Caffe. These frameworks are written in Python, which makes them highly flexible and relatively easy to use. The underlying libraries are written in C or C++ to ensure good performance.
A significant number of companies have now developed software that automatically converts code from these frameworks to RTL, C or whatever language is appropriate for their specific platform.
However, with standard embedded systems, performance can be poor and power requirements excessive.
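The core of such a converter is conceptually straightforward: walk the trained model and emit source code for the target platform. A toy sketch that emits a C dense-layer kernel from Python-side weights; the function and variable names are illustrative, not any vendor's actual tool:

```python
# Hypothetical code-generation step: trained parameters become a self-contained
# C source fragment (constant tables + a small inference kernel).
def emit_dense_layer_c(name, weights, bias):
    rows, cols = len(weights), len(weights[0])
    w_flat = ", ".join(f"{v}f" for row in weights for v in row)
    b_flat = ", ".join(f"{v}f" for v in bias)
    return (
        f"static const float {name}_w[{rows * cols}] = {{{w_flat}}};\n"
        f"static const float {name}_b[{rows}] = {{{b_flat}}};\n"
        f"void {name}(const float *x, float *y) {{\n"
        f"    for (int i = 0; i < {rows}; ++i) {{\n"
        f"        y[i] = {name}_b[i];\n"
        f"        for (int j = 0; j < {cols}; ++j)\n"
        f"            y[i] += {name}_w[i * {cols} + j] * x[j];\n"
        f"    }}\n"
        f"}}\n"
    )

c_src = emit_dense_layer_c("dense1", [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2])
```

Real generators additionally handle operator fusion, fixed-point conversion and the target's intrinsic instructions, which is where most of the engineering effort lies.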
Customized Embedded Board/SOC Designs
There are currently four main solutions:-
- GPU Accelerated SOCs
- Multi-core DSP Solutions With Instruction Customization
- Custom FPGA Designs
- Custom LSI
We are currently focused on the first three solutions.
GPU Solution
GPU solutions are known for their high performance, but at the expense of high power consumption. Companies such as NVIDIA are making significant progress in this respect, producing embedded SOCs with significant Deep Learning capabilities.
For systems which have a reliable power supply, or large batteries with regular recharging (a mobile robot, for example), and where decisions are needed within a fraction of a second, they are potentially practical.
Multi-Core DSP Solutions With Instruction Customization
A common solution is the use of SOCs with multiple DSP cores. We currently focus on the use of Tensilica Cores, because they are highly configurable, with the potential for high performance and low power. Examples of configurability include:-
- Configurable bus widths
- Configurable VLIW instructions (32->128 bit), enabling the execution of multiple operations in a single cycle
- Custom Instructions Including SIMD and MIMD
- High-speed buses allowing direct connectivity between cores and memory
- DSP core options
- Special DMA technologies such as Supergather
The use of multicore solutions allows us to achieve good performance, and when combined with a general-purpose processor core such as an ARM9, application development can be relatively straightforward.
Current applications are focused on Machine Learning algorithms and RNN systems.
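A typical way to exploit the multiple DSP cores is to give each core a contiguous slice of a layer's output neurons and concatenate the results. A minimal sketch of such a partitioning scheme; core counts and layer sizes here are purely illustrative:

```python
# Hypothetical work-splitting helper: divide n_outputs rows of a layer as
# evenly as possible across n_cores DSP cores.
def partition(n_outputs, n_cores):
    """Return (start, end) index ranges, one per core, covering all outputs."""
    base, extra = divmod(n_outputs, n_cores)
    slices, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)   # early cores take the remainder
        slices.append((start, start + size))
        start += size
    return slices

slices = partition(10, 4)   # e.g. 10 output neurons over 4 cores
```

In practice the split must also respect each core's local memory and the inter-core bus bandwidth, which the high-speed direct core-to-memory connectivity mentioned above helps with.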
FPGA Solutions
Another solution is the FPGA. FPGAs are relatively low power, and it is possible to achieve good performance relative to that power. Building an FPGA design from scratch can be problematic, but many companies now use tools to auto-generate C or RTL from the various Python toolsets, which has greatly simplified the process.
Many practical applications require preprocessing of the input data (for example filters, FFTs or Wavelet transforms), so being able to integrate the various IP blocks relatively simply is a further advantage.
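As a behavioural sketch of such a preprocessing front end, here is a moving-average filter followed by a DFT magnitude stage. On an FPGA each stage would be a separate IP block; this pure-Python version only illustrates the data flow, and the frame size is an arbitrary choice:

```python
import cmath
import math

def moving_average(x, n=4):
    """Simple FIR smoothing stage (window length n, shortened at the start)."""
    return [sum(x[max(0, i - n + 1): i + 1]) / min(i + 1, n) for i in range(len(x))]

def dft_magnitude(x):
    """Naive DFT magnitude spectrum; hardware would use an FFT IP core."""
    N = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / N) for t in range(N)))
            for k in range(N)]

# A pure tone at bin 2 of an 8-sample frame: the spectrum should peak there.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = dft_magnitude(signal)
```

The network then consumes the spectral features rather than raw samples, which is often what makes a small embedded model sufficient.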