DRIVING THE FUTURE OF MACHINE-LEARNING ACCELERATION WITH CUSTOM CHIPS

At Annapurna we are dedicated to building innovation in silicon and software for our AWS customers. We are at the forefront of developing technology, combining a global presence with very talented engineers working together in a start-up culture. We are looking for exceptional candidates in all disciplines across engineering (silicon/software/compilers), product management, and customer solutions to join us on this incredible journey. Specifically for my team, we are keen to quickly hire superstars who are passionate about customers and about enabling cutting-edge machine-learning technology at cloud scale:

https://www.amazon.jobs/en/jobs/1330879/solutions-architect

https://www.amazon.jobs/en/jobs/1298203/senior-technical-product-manager-aws-inferentia

For our broader team, we are looking for strong ML software and SoC architects and developers. If you believe you are the one, please go ahead and apply or reach out to me directly; if not, I’d appreciate it if you could forward this to your network. This is truly a unique opportunity to be part of one of the fastest-growing and most innovative IT organizations on the planet. All positions are available here: https://www.amazon.jobs/en/landing_pages/annapurna%20labs

Here is a bit more on who we are and what we have been busy with lately. We started back in 2011 as an innovative semiconductor startup, founded with a dream to accelerate cloud infrastructure via innovations in silicon, software, and system integration. In 2015 we joined Amazon, and since then we have been at the forefront of cloud infrastructure with solutions like Graviton, Nitro, and custom ML chips like Inferentia driving AWS machine-learning acceleration.

Back in early 2018 we decided to open a product line focused on machine-learning acceleration. The premise was simple: improve performance and reduce cost for our customers who are running significant ML workloads in the cloud. Improving performance dramatically while providing a significant reduction in cost is not an easy task. We had to innovate across the stack to pull it off: new silicon, new server hardware, and new software, including Neuron, a complete machine-learning software stack with a compiler written from scratch and natively integrated into popular ML frameworks like TensorFlow and PyTorch. And all of that needed to work at cloud scale from day one.

Our focus, as with every Amazon team, was to work backwards from our customers. Two key tenets guided our journey: (a) Prioritization: what should we focus on first? We had to make a hard choice between training and inference. We eventually decided to focus on inference first, because many of our customers spend more compute on production inference than on training, and so we started the journey to design the Inferentia silicon and system. And (b) ease of use: we had to ensure that our customers would need to make minimal changes to their existing ML apps. We designed our SDK, Neuron, to integrate with PyTorch, TensorFlow, and other popular frameworks, and to keep the porting of ML applications to AWS ML chips low-touch. The seamless integration of Neuron with ML frameworks proved to be a key factor in our success.
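To give a sense of what low-touch looks like in practice, here is a minimal sketch of compiling a stock PyTorch model with the torch-neuron package; the model choice, input shape, and file name are illustrative, not taken from any specific customer workload.

# Minimal sketch: compiling a PyTorch model for Inferentia with torch-neuron.
import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

# Load a stock, pre-trained model -- no model changes are required.
model = models.resnet50(pretrained=True)
model.eval()

# Compile the model ahead of time by tracing it with an example input.
example = torch.zeros(1, 3, 224, 224)
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Save the compiled artifact; it can be loaded like any TorchScript model.
model_neuron.save("resnet50_neuron.pt")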

At re:Invent 2019, less than two years after we started the journey, we launched the Inf1 instances, enabling our customers to accelerate machine-learning inference applications and lower their inference cost. Inf1 is ideal for customers who run production machine-learning workloads with large amounts of image, text, or speech data. The result is a dramatic reduction in cost versus standard GPUs and a significant improvement in performance, with minimal code changes.
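And once a model is compiled, running it on an Inf1 instance is just as familiar; here is a minimal sketch that assumes the "resnet50_neuron.pt" artifact from the example above.

# Minimal sketch: running the compiled model on an Inf1 instance.
import torch
import torch_neuron  # imported so the Neuron runtime ops are registered

# Load the compiled TorchScript artifact produced earlier.
model_neuron = torch.jit.load("resnet50_neuron.pt")

# Inference looks like ordinary PyTorch; execution happens on the Inferentia chip.
batch = torch.zeros(1, 3, 224, 224)  # placeholder input
with torch.no_grad():
    scores = model_neuron(batch)
print(scores.shape)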

I am very proud of the team’s work, continuous focus on our customer needs, and regular cadence of usability and performance improvements. Customers from a wide range of industry segments, like #snapchat, #autodesk, #anthem, Amazon Alexa, and many others, deploy ML applications on the Inferentia chips, pushing us to continue improving performance and rapidly add more features. In most cases, Inferentia allows our customers to deploy models as-is, while increasing performance compared to running on the most optimized GPUs. For our customers, besides the obvious improvement to the bottom line, cutting cost is meaningful because it allows development teams to invest time in improving their applications for their customers, rather than trying to manually optimize models with techniques like quantization, or compromising on accuracy with less computationally heavy models.

We have a busy and exciting roadmap ahead. At re:Invent 2020 we announced the upcoming Trainium, our second-generation ML chip, offering the highest performance and the best price-performance for training machine-learning models in the cloud. We are hard at work to enable our customers with the most cost-effective ML training in the cloud later this year. Join us!
