Model optimization for Fast Inference and Quantization

Recently I have been working on a very interesting project with my sponsoring company beanTech, and we were trying to speed up our final trained model. So, we tried a couple of options…!!

Before going further, let me confess that I am an avid PyTorch user and like to do most of my work in PyTorch.

Hence, we tried some pre-built methods available in the PyTorch framework to speed up inference, such as:

  1. Automatic mixed-precision inference (see the sketch after this list)
  2. Deploying the whole script using C++ instead of Python (because Python is slowed down by its Global Interpreter Lock).
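
Here is a minimal sketch of both options in PyTorch; model and batch are hypothetical placeholders for your own trained network and a preprocessed input tensor:

import torch

# Assumption: model is your trained torch.nn.Module, batch a preprocessed input tensor
model.eval().cuda()

# Option 1: automatic mixed-precision inference
with torch.no_grad(), torch.cuda.amp.autocast():
    output = model(batch.cuda())

# Option 2: export a TorchScript module that a C++ (libtorch) application can load,
# taking the Python interpreter (and its GIL) out of the serving path
traced = torch.jit.trace(model, batch.cuda())
traced.save("model_traced.pt")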

In this quest, I came across two different frameworks that are used to speed up inference. These are:

TensorRT: NVIDIA TensorRT is an SDK for high-performance deep learning inference.
OpenVINO: a comprehensive toolkit from Intel to optimize your models for faster inference.

Well, the choice was obvious for me (as we are using Nvidia GPUs). Hence, I opted for TensorRT for model optimization and speed-up.

Well, hold on..! Why am I telling you this? Because it is not as easy as the lines above make it sound.
So, why take the pain? Because the end results are sweeter.

Enough of metaphors; let's see what TensorRT is (most of the content below is taken from the Nvidia TensorRT documentation, so you can either visit that directly or have a happy read here..!!)

Wait wait…!! If everything below is from the website, then why are you writing this article?
Because there are many practical aspects of using TensorRT, some of which are discussed at the end (so the end is the classic part; do check it out, don't miss the hit of the show..!!)


NVIDIA TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.

What Is TensorRT?

The core of NVIDIA® TensorRT™ is a C++ library that facilitates high-performance inference on NVIDIA graphics processing units (GPUs). It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already-trained network quickly and efficiently on a GPU for the purpose of generating a result (a process that is referred to in various places as scoring, detecting, regression, or inference).

Some training frameworks such as TensorFlow have integrated TensorRT so that it can be used to accelerate inference within the framework. Alternatively, TensorRT can be used as a library within a user application. It includes parsers for importing existing models from Caffe, ONNX, or TensorFlow, and C++ and Python APIs for building models programmatically.

TensorRT provides APIs via C++ and Python that help to express deep learning models via the Network Definition API or load a pre-defined model via the parsers that allow TensorRT to optimize and run them on an NVIDIA GPU. TensorRT applies graph optimizations and layer fusion, among other optimizations, while also finding the fastest implementation of that model leveraging a diverse collection of highly optimized kernels. TensorRT also supplies a runtime that you can use to execute this network on all of NVIDIA's GPUs from the Kepler generation onwards.

Source: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
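
To make the parser path concrete, here is a minimal sketch of building an engine from an ONNX file with the TensorRT Python API (assuming TensorRT 7/8; the file names are illustrative, not from this article):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# Parse the ONNX model (assumed to exist as "model.onnx")
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30    # 1 GiB of scratch space for tactic selection
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where the GPU supports them

engine = builder.build_engine(network, config)
with open("model.plan", "wb") as f:
    f.write(engine.serialize())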

Benefits Of TensorRT

TensorRT combines layers, optimizes kernel selection, and also performs normalization and conversion to optimized matrix math depending on the specified precision (FP32, FP16, or INT8) for improved latency, throughput, and efficiency.

For deep learning inference, there are 5 critical factors that are used to measure software:

Throughput

The volume of output within a given period. Often measured in inferences/second or samples/second, per-server throughput is critical to cost-effective scaling in data centers.

Efficiency

Amount of throughput delivered per unit-power often expressed as performance/watt. Efficiency is another key factor to cost-effective data center scaling, since servers, server racks, and entire data centers must operate within fixed power budgets.

Latency

Time to execute an inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.

Accuracy

A trained neural network’s ability to deliver the correct answer. For image classification based usages, the critical metric is expressed as a top-5 or top-1 percentage.

Memory usage

The host and device memory that need to be reserved to do inference on a network depend on the algorithms used. This constrains what networks and what combinations of networks can run on a given inference platform. This is particularly important for systems where multiple networks are needed and memory resources are limited - such as cascading multi-class detection networks used in intelligent video analytics and multi-camera, multi-network autonomous driving systems.

Alternatives to using TensorRT include:

  • Using the training framework itself to perform inference.
  • Writing a custom application that is designed specifically to execute the network using low-level libraries and math operations.

Where Does TensorRT Fit?

Generally, the workflow for developing and deploying a deep learning model goes through three phases.

  • Phase 1 is training
  • Phase 2 is developing a deployment solution, and
  • Phase 3 is the deployment of that solution

TensorRT provides a fast, modular, compact, robust, reliable inference engine that can support the inference needs within the deployment architecture.

The TensorRT library will be linked to the deployment application which will call into the library when it wants an inference result.

To initialize the inference engine, the application will first deserialize the model from the plan file into an inference engine. TensorRT is usually used asynchronously; therefore, when the input data arrives, the program calls an enqueue function with the input buffer and the buffer in which TensorRT should put the result.
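
A minimal sketch of that flow with the TensorRT Python API and PyCUDA (assuming a serialized "model.plan" engine with a single static-shape input and output; input_array is a hypothetical preprocessed NumPy array):

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context as a side effect

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the plan file into an inference engine
with open("model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
stream = cuda.Stream()

# Allocate pinned host buffers and device buffers for every binding
host_bufs, dev_bufs, bindings = [], [], []
for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding))
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    h = cuda.pagelocked_empty(size, dtype)
    d = cuda.mem_alloc(h.nbytes)
    host_bufs.append(h)
    dev_bufs.append(d)
    bindings.append(int(d))

# Asynchronous inference: copy input in, enqueue, copy output out, synchronize
np.copyto(host_bufs[0], input_array.ravel())
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_bufs[1], dev_bufs[1], stream)
stream.synchronize()
output = host_bufs[1]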

Some Practical aspects of using TensorRT:

1-   TensorRT comes in the following combinations, so don’t waste time as I did:

  • TensorRT supports the C++ and Python APIs on Linux
  • On Windows, you only have the C++ API

2-   Problems with installation: This will be part of the next article in this series

3-   Model export to ONNX and problems with ONNX models and PyTorch (a minimal export sketch follows this list)

4-   Best way to use TensorRT: Docker container (tips and tricks)

5-   Why the TensorRT ONNX parser fails while parsing an ONNX model: tips and tricks to win

6-   A practical example in PyTorch: are we really reaping the benefit?

The above six episodes in this series will cover most of the common problems, errors, and mistakes, and will try to suggest a way through. It is going to be very interesting and application-driven. So stay tuned…!!
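
As a small taste of point 3, here is a minimal sketch of exporting a PyTorch model to ONNX (model is a hypothetical trained network; the input shape and file name are illustrative):

import torch

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)  # example input shape (assumption)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=11,                 # pick an opset that the TensorRT ONNX parser supports
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)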

WAIT WAIT…..!

What is this “Quantization”?

Quantization refers to techniques for performing computations and storing tensors at lower bit widths than floating-point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms.
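
For a first feel, here is a minimal sketch of post-training dynamic quantization in PyTorch (model is a hypothetical trained float32 network; only the Linear layers are quantized here):

import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,                # the trained float32 model
    {torch.nn.Linear},    # layer types whose weights are converted to int8
    dtype=torch.qint8,
)
# quantized_model is used for inference exactly like the original model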

Well, quantization will also be a part of this article series and more updates about this will come in the coming articles.

Till then, enjoy oranges (mandarins)…!!

Thanks :)
