The "Cold Start" problem is the silent killer of GenAI ROI. When a traffic spike hits your AI application, waiting 10+ minutes for a new GPU node to pull a 100GB+ model is the difference between a seamless user experience and a timeout error. It also means expensive GPUs are sitting idle, burning budget while waiting for data.

Fortunately, a powerful new infrastructure "trifecta" on Google #Kubernetes Engine (#GKE) is solving this, slashing load times from minutes to mere seconds. By combining NVIDIA's Run:ai streamer, Google Cloud's storage caching, and GKE's native streaming, we can finally maximize GPU utilization and achieve true, fast autoscaling.

Here is the stack that is changing the game:

🚀 1. GKE Image Streaming
Before you load the model, you need to load the container. GKE Image Streaming lets pods enter the Running state almost instantly by fetching only the startup data they need on demand, rather than waiting for the full image download.

⚡ 2. Google Cloud Anywhere Cache
Even with fast streaming, latency matters. Anywhere Cache provides an SSD-backed read cache zonally co-located with your GKE nodes. This means sub-millisecond access to weights and massive throughput, ensuring new nodes don't bottleneck at the storage layer.

🧠 3. NVIDIA Run:ai Model Streamer
This is the biggest shift. Traditionally, you download a model to disk, then load it into RAM, then transfer it to the GPU. The Run:ai streamer (integrated with vLLM) flips the script: it streams model weights directly from Cloud Storage into GPU memory, bypassing the local disk entirely.

The Result: Your GPUs spend their time computing, not waiting. For MLOps teams, this means Horizontal Pod Autoscaling (HPA) actually works in real time - scaling out for spikes and back down quickly, significantly reducing TCO.

Have you implemented model streaming in your infrastructure yet? #GenAI #Kubernetes #GKE #NVIDIA #MLOps #CloudComputing #GPU #AIInfrastructure #GoogleCloud
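As a rough sketch of how two of these pieces are wired up (the cluster name, bucket, and model path are hypothetical; flags are from the gcloud and vLLM documentation and worth re-checking against your versions - in particular, whether your streamer build reads GCS natively or via the S3-compatible API):

```shell
# Enable Image Streaming on an existing cluster
# (images must be hosted in Artifact Registry)
gcloud container clusters update my-cluster \
    --location=us-central1 \
    --enable-image-streaming

# Serve a model with vLLM, streaming weights straight from object
# storage into GPU memory via the Run:ai Model Streamer (no disk copy)
vllm serve s3://my-model-bucket/llama-3-70b \
    --load-format runai_streamer \
    --model-loader-extra-config '{"concurrency": 16}'
```

The `concurrency` knob controls how many parallel streams read the weight files; higher values trade CPU and network for faster cold starts.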
Latest Google Kubernetes Engine Feature Releases
Summary
Google Kubernetes Engine (GKE) is a managed platform for running containerized applications, and its latest feature releases are focused on improving AI workload performance, security, and cluster management. These updates make it easier for organizations to deploy, scale, and secure their applications on Google Cloud while supporting advanced AI and GPU usage.
- Streamline model loading: Use GKE's image streaming and NVIDIA model streamer integration to dramatically reduce AI application startup times and keep GPUs working on computations instead of waiting for data.
- Automate GPU management: Enable automatic GPU driver installation and multi-process scheduling to simplify setup and boost the efficiency of GPU resources for demanding workloads.
- Simplify access and security: Switch to DNS-based endpoints and IAM policy controls to make cluster access more flexible and secure, removing the need for complex networking setups.
During the past few months, we've made a lot of improvements for customers who want to run AI workloads on Google Kubernetes Engine from Google Cloud. Since the AI space is moving really quickly with a lot of new innovations, I want to enumerate those that are relevant to AI on GKE. There are so many amazing improvements that LinkedIn won't let me write a post long enough to list all of them :) so this is the second of 5 posts. Today I'm focused on GPU-related improvements. Yesterday, I discussed TPU-related improvements. In the coming days, I'll share improvements related to training, inference, obtaining capacity, and more.

* **Google-managed GPU driver can now be automatically installed:** For newly created GPU node pools, you have the option to have GKE automatically install the NVIDIA GPU driver. This alleviates the need to install the driver DaemonSet yourself and keep your GPU driver up to date over time. https://lnkd.in/ejCYGXA3
* **NVIDIA Multi-Process Service is GA:** Improve the utilization of your GPUs using NVIDIA Multi-Process Service (MPS) to schedule multiple containers on the same GPU. MPS provides concurrency without the context-switching overhead of traditional GPU time-sharing. To maximize your GPU utilization, MPS can be combined with multi-instance GPUs. https://lnkd.in/ekmar2JY
* **Maximize GPU network bandwidth with GPUDirect-TCPX:** By reducing the overhead required to transfer packet payloads to and from GPUs, GPUDirect-TCPX maximizes network bandwidth and significantly improves throughput at scale for high-performance GPU workloads on GKE (such as training or large-model inference). https://lnkd.in/e83tGs4A
* **GPUs (H100, A100, L4 & T4) are now supported with GKE Sandbox:** GKE Sandbox provides an extra layer of security to prevent untrusted code from affecting the host kernel on your cluster nodes. For GPUs, while GKE Sandbox doesn't mitigate all NVIDIA driver vulnerabilities, it helps protect against Linux kernel vulnerabilities. https://lnkd.in/epV26EmK
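The first two items above are node-pool options. A sketch of what enabling them looks like (cluster, zone, and pool names are hypothetical; the `gpu-driver-version`, `gpu-sharing-strategy`, and `max-shared-clients-per-gpu` keys are from the GKE docs and should be verified against your gcloud version):

```shell
# Node pool with the Google-managed NVIDIA driver installed
# automatically (no driver DaemonSet to maintain)
gcloud container node-pools create gpu-pool \
    --cluster=my-cluster --zone=us-central1-a \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1,gpu-driver-version=default

# Node pool that shares each GPU across two containers via NVIDIA MPS
gcloud container node-pools create mps-pool \
    --cluster=my-cluster --zone=us-central1-a \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1,gpu-sharing-strategy=mps,max-shared-clients-per-gpu=2
```

Pods then request a GPU share as usual through the `nvidia.com/gpu` resource, and GKE handles the MPS control daemon on the node.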
-
GKE just released Custom Compute Classes, which I think is a killer and unique feature that only #GoogleCloud has for now 🎉

Custom Compute Classes are a mechanism in #GoogleKubernetesEngine that lets you define the set of node configurations you want your workloads to run on, along with the order in which those configurations should be provisioned: a fallback order, scale-out configs, and defaults [1].

Let's take an example. Say I want my Pods to run on Spot virtual machines unless they are unavailable, in which case I want to fall back to standard VMs. I also want to move back to Spot when it becomes available again, and if neither Spot nor my preferred config is available, I want a default node config. Typically in #Kubernetes you would have to build custom logic with labels, tolerations, selectors, balloon pods... With Custom Compute Classes you define a custom compute class (as a CRD) like the image below. This tells GKE to:
- Provision an N2 with a minimum of 64 cores as a Spot VM.
- If unavailable, fall back to an N2 with any number of cores as a Spot VM.
- If unavailable, fall back to a standard (non-Spot) N2.

You define all of this as a custom compute class object and label the namespace [2]. This works for both Standard and Autopilot clusters and is supported by the cluster autoscaler.

[1] https://lnkd.in/eQwrTn6t
[2] https://lnkd.in/ea7bqAsE
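A manifest along the lines of the example above might look like this (the class name is hypothetical; field names follow the GKE ComputeClass reference, so double-check them against the current API):

```yaml
# Custom compute class: prefer big Spot N2s, then any Spot N2,
# then on-demand N2 as the last resort.
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: spot-first
spec:
  priorities:
  - machineFamily: n2
    spot: true
    minCores: 64
  - machineFamily: n2
    spot: true
  - machineFamily: n2
    spot: false
  # Migrate workloads back to a higher-priority (Spot) config
  # when capacity becomes available again.
  activeMigration:
    optimizeRulePriority: true
  nodePoolAutoCreation:
    enabled: true
```

Workloads opt in by selecting the class, e.g. a Pod `nodeSelector` of `cloud.google.com/compute-class: spot-first`.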
-
Today, at #GoogleCloudNext, we're announcing significant improvements to Google Kubernetes Engine (GKE) to help platform teams succeed with AI:

* Cluster Director for GKE, now generally available, lets you deploy and manage large clusters of accelerated VMs with compute, storage, and networking - all operating as a single unit.
* GKE Inference Quickstart, now in public preview, simplifies the selection of infrastructure and deployment of AI models, while delivering benchmarked performance characteristics.
* GKE Inference Gateway, now in public preview, provides intelligent routing and load balancing for AI inference on GKE.
* A new container-optimized compute platform is rolling out on GKE Autopilot today, and in Q3, Autopilot's compute platform will be made available to standard GKE clusters.
* Gemini Cloud Assist Investigations, now in private preview, helps with GKE troubleshooting, decreasing the time it takes to understand the root cause and resolve issues.
* Through a partnership with Anyscale, RayTurbo on GKE will launch later this year to deliver superior GPU/TPU performance, rapid cluster startup, and robust autoscaling.

More details in the blog post below... https://lnkd.in/gxuzMJaJ
-
So much exciting GKE news from KubeCon this week, but this launch is special to me: we announced a new flexible DNS-based endpoint for accessing the GKE control plane. Blog from Ninad Desai and Chris Gonterman: https://lnkd.in/gxzQGf6e Customers were asking for it, and it is here now!

🔒 Secure Access from Anywhere: The DNS-based endpoint eliminates the need for proxies or bastion hosts. Bastion hosts are notoriously hard to set up - no more toil and IP headaches! Authorized users can connect from home, on-prem, or other clouds seamlessly.

🔧 Dynamic and Simplified Security: IAM policies handle user authentication without relying on static network IPs. To revoke access, just update permissions - no more reconfiguring firewalls.

🛡️ Multi-Layer Protection: Combine the ease of IAM policies with VPC Service Controls for robust, context-aware access from approved origins - now you have two layers of protection.

🚀 Effortless Setup: Enable DNS-based access for any cluster with a single `gcloud` command and upgrade in minutes.

With this feature, we're making GKE cluster access more flexible, secure, and user-friendly. #GKE #Kubernetes #CloudSecurity
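For reference, the "single `gcloud` command" looks roughly like this (cluster name and location are hypothetical; the `--enable-dns-access` flag is from the GKE docs and worth confirming for your gcloud version):

```shell
# Turn on the DNS-based control-plane endpoint for an existing cluster
gcloud container clusters update my-cluster \
    --location=us-central1 \
    --enable-dns-access
```

Callers then reach the control plane through the cluster's DNS name, with access governed by IAM (e.g. granting `roles/container.developer` to the users or service accounts that should connect) rather than by source IP.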