The Move From Cloud To Real Time Embedded Machine Learning
Cloud To Local Embedded System Processing
Poor connectivity, large data volumes, real-time decision-making requirements and security concerns mean that, for many customers, cloud computing is not an option. However, moving expensive, power-hungry processing hardware to the location of the application is not an option either.
What is typically required is an embedded system, perhaps similar in size to a standard smartphone, offering low power consumption, a reasonable level of performance and near real-time processing.
Our Lessons
In the move from Cloud Machine Learning to Embedded Solutions we have encountered four key elements that have proved critical to our products:
- The need for focused applications, with narrow clear objectives
- The ability to update Algorithms and Parameters allowing a continuously evolving, improving system
- Automatic Embedded Code Generation System to speed up the development process
- Well designed customized SOC/Board design For Real Time Processing
Focused Application
Generally the requirements for Machine Learning applications are quite focused, with narrow, application-specific aims, varying from heart-problem detection to the prediction of hardware failures. The key is to define your requirements clearly and produce an algorithm that performs well against them.
This alone goes a long way towards making such applications feasible on embedded systems. Even with this reduction in complexity, however, they often remain beyond most standard SOC designs.
Common solutions are:-
- GPU Accelerated Systems
- Multi-core DSP Solutions With Instruction Customization
- Custom FPGA Designs
- Custom LSI
Training And Application
Training of Machine Learning algorithms remains an offline task: embedded systems simply cannot match the processing power of cloud servers with GPU or FPGA co-processors. The embedded system therefore simply implements the trained network.
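As a concrete illustration of this hand-off, a trained layer's float32 weights can be reduced to int8 values plus a single scale factor before shipping to the device, leaving only small fixed-point tables and a trivial integer kernel on the embedded side. This is a minimal sketch of symmetric quantization, not any specific framework's exporter:

```python
# Illustrative sketch: quantize offline-trained float weights to int8 + scale,
# so the embedded target never needs the full training-time representation.
def quantize(weights):
    """Symmetric per-tensor int8 quantization (hypothetical, for illustration)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """What the embedded inference kernel effectively computes with."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]          # example trained parameters
q, s = quantize(weights)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The round trip loses at most half a quantization step per weight, which is usually acceptable for inference-only deployment.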
System Updates
Updating the system can take two forms:-
- Replacing the entire firmware (entire algorithm)
- Updating the Machine Learning Algorithm Parameters
The ability to update the system as training data grows in quantity and quality is vital: it opens the possibility of better-tuned parameters, or even a change of algorithm.
With such possibilities in mind, care must be taken when designing custom SOCs. There is a risk that the SOC is so highly customized that future upgrades can no longer take advantage of the custom hardware.
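One way to make the lighter, parameter-only form of update safe is to wrap the new weights in a small versioned header, so the device can reject a blob its current firmware (and custom hardware) cannot execute, without a full reflash. A minimal sketch with an invented format: the `MLUP` magic and the supported-version set are assumptions, not a real protocol:

```python
import struct
import zlib

MAGIC = b"MLUP"                        # hypothetical update-file magic
SUPPORTED_ALGO_VERSIONS = {1, 2}       # versions this firmware build can run

def pack_update(algo_version, params):
    """Prefix raw parameter bytes with magic, algorithm version and CRC32."""
    header = struct.pack("<4sII", MAGIC, algo_version, zlib.crc32(params))
    return header + params

def unpack_update(blob):
    """Validate an update blob; raise ValueError rather than brick the device."""
    magic, version, crc = struct.unpack("<4sII", blob[:12])
    if magic != MAGIC or version not in SUPPORTED_ALGO_VERSIONS:
        raise ValueError("update rejected: unknown format or algorithm version")
    params = blob[12:]
    if zlib.crc32(params) != crc:
        raise ValueError("update rejected: corrupt parameter data")
    return params

params = bytes(range(16))              # stand-in for serialized weights
blob = pack_update(2, params)
```

A full firmware replacement would still be needed when the algorithm itself changes beyond what the version check permits.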
Automatic Embedded Source Code Generation
The starting point for many Machine Learning algorithm development projects is Open Source frameworks such as Chainer, TensorFlow and Caffe. These frameworks are written in Python, which makes them highly flexible and relatively easy to use. The underlying libraries are written in C or C++ to ensure good performance.
A significant number of companies have now developed software that automatically converts code from these frameworks to RTL, C or whatever language is appropriate for their specific platform.
However, with standard embedded systems, performance can be poor and power requirements excessive.
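The core of such a converter is conceptually straightforward: walk the trained model and emit source code for the target platform. A toy sketch that emits a C dense-layer kernel from Python-side weights; the function and variable names are illustrative, not any vendor's actual tool:

```python
# Hypothetical code-generation step: trained parameters become a self-contained
# C source fragment (constant tables + a small inference kernel).
def emit_dense_layer_c(name, weights, bias):
    rows, cols = len(weights), len(weights[0])
    w_flat = ", ".join(f"{v}f" for row in weights for v in row)
    b_flat = ", ".join(f"{v}f" for v in bias)
    return (
        f"static const float {name}_w[{rows * cols}] = {{{w_flat}}};\n"
        f"static const float {name}_b[{rows}] = {{{b_flat}}};\n"
        f"void {name}(const float *x, float *y) {{\n"
        f"    for (int i = 0; i < {rows}; ++i) {{\n"
        f"        y[i] = {name}_b[i];\n"
        f"        for (int j = 0; j < {cols}; ++j)\n"
        f"            y[i] += {name}_w[i * {cols} + j] * x[j];\n"
        f"    }}\n"
        f"}}\n"
    )

c_src = emit_dense_layer_c("dense1", [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2])
```

Real generators additionally handle operator fusion, fixed-point conversion and the target's intrinsic instructions, which is where most of the engineering effort lies.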
Customized Embedded Board/SOC Designs
There are currently four main solutions:-
- GPU Accelerated SOCs
- Multi-core DSP Solutions With Instruction Customization
- Custom FPGA Designs
- Custom LSI
We are currently focused on the first three solutions.
GPU Solution
GPU solutions are known for their high performance, but at the expense of high power consumption. Companies such as NVIDIA are making significant progress in this respect, producing embedded SOCs with significant Deep Learning capabilities.
For systems which have a reliable power supply, or large batteries with regular recharging (a mobile robot, for example), and where decisions are needed within a fraction of a second, they are potentially practical.
Multi-Core DSP Solutions With Instruction Customization
A common solution is the use of SOCs with multiple DSP cores. We currently focus on the use of Tensilica Cores, because they are highly configurable, with the potential for high performance and low power. Examples of configurability include:-
- Configurable bus widths
- Configurable VLIW instructions (32->128 bit), enabling the execution of multiple operations in a single cycle
- Custom Instructions Including SIMD and MIMD
- High-speed buses allowing direct connectivity between cores and memory
- DSP core options
- Special DMA technologies such as Supergather
The use of multicore solutions allows us to achieve good performance, and when combined with a general-purpose processor core such as an ARM9, application development can be relatively straightforward.
Current applications are focused on Machine Learning algorithms and RNN systems.
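A typical way to exploit the multiple DSP cores is to give each core a contiguous slice of a layer's output neurons and concatenate the results. A minimal sketch of such a partitioning scheme; core counts and layer sizes here are purely illustrative:

```python
# Hypothetical work-splitting helper: divide n_outputs rows of a layer as
# evenly as possible across n_cores DSP cores.
def partition(n_outputs, n_cores):
    """Return (start, end) index ranges, one per core, covering all outputs."""
    base, extra = divmod(n_outputs, n_cores)
    slices, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)   # early cores take the remainder
        slices.append((start, start + size))
        start += size
    return slices

slices = partition(10, 4)   # e.g. 10 output neurons over 4 cores
```

In practice the split must also respect each core's local memory and the inter-core bus bandwidth, which the high-speed direct core-to-memory connectivity mentioned above helps with.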
FPGA Solutions
Another solution is the FPGA. FPGAs are relatively low power, and it is possible to achieve good performance relative to that power. Building an FPGA design from scratch can be problematic, but many companies now use tools to auto-generate C or RTL from the various Python toolsets, which has greatly simplified the process.
Many practical applications require preprocessing of the input data (for example filters, FFTs or Wavelet transforms), so being able to integrate the various IP blocks relatively simply is a further advantage.
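As a behavioural sketch of such a preprocessing front end, here is a moving-average filter followed by a DFT magnitude stage. On an FPGA each stage would be a separate IP block; this pure-Python version only illustrates the data flow, and the frame size is an arbitrary choice:

```python
import cmath
import math

def moving_average(x, n=4):
    """Simple FIR smoothing stage (window length n, shortened at the start)."""
    return [sum(x[max(0, i - n + 1): i + 1]) / min(i + 1, n) for i in range(len(x))]

def dft_magnitude(x):
    """Naive DFT magnitude spectrum; hardware would use an FFT IP core."""
    N = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / N) for t in range(N)))
            for k in range(N)]

# A pure tone at bin 2 of an 8-sample frame: the spectrum should peak there.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = dft_magnitude(signal)
```

The network then consumes the spectral features rather than raw samples, which is often what makes a small embedded model sufficient.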