How LinkedIn Reduced Latency by Replacing JSON with Protocol Buffers. This case study is a goldmine for anyone working on large-scale systems.

The Challenge:
- 900M+ members across 200 countries
- 1000s of backend services, 10,000s of API endpoints
- JSON serialization causing high bandwidth usage and latency

Initial Attempt: Tried compression algorithms like gzip. Result: reduced payload size, but worsened serialization/deserialization latency.

The Solution: After evaluating multiple alternatives, LinkedIn chose Protocol Buffers (Protobuf). Here's why:

Protobuf Benefits:
• Smaller payloads (e.g., 2 bytes vs 10 bytes for simple messages)
• Faster serialization/deserialization
• Strong typing with predefined schemas
• Wide language support
• Backward compatibility for schema updates

Alternatives Considered:
• FlatBuffers: offers "zero-copy" deserialization
• MessagePack: no predefined schema required
• CBOR: inspired by MessagePack, with additional features

Results:
1. Up to 60% latency improvement for large payloads
2. Increased throughput for both request and response payloads
3. No significant regressions compared to JSON

LinkedIn's Rest.li Framework: They've open-sourced Rest.li, a Java framework for RESTful services that handles:
1. Service Discovery (e.g., translating d2:// to https://lnkd.in/g9euZv8g)
2. Type Safety
3. Load Balancing
4. Common Request Patterns (e.g., Scatter-Gather)

Key Takeaways:
1. Data serialization format significantly impacts system performance at scale
2. Binary formats like Protobuf can offer substantial benefits over text-based formats like JSON
3. Consider the trade-offs between human readability, flexibility, and performance
4. Custom frameworks like Rest.li can simplify microservices architecture management

This case study shows how seemingly small optimizations in system design can lead to massive performance gains at scale. What's your experience with data serialization in large-scale systems? Have you used Protobuf or similar binary formats? Let me know in the comments! #SystemDesign #TechOptimization #Protobuf #REST
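To make the "2 bytes vs 10 bytes" claim concrete, here is a minimal Java sketch that hand-encodes the Protobuf wire format for a single int32 field and compares it with the equivalent JSON bytes. The `id` field and the value 42 are made-up examples; a real service would use protoc-generated message classes rather than hand-written bytes.

```java
// Hand-encode the Protobuf wire format for `int32 id = 1;` set to 42 and compare
// the byte count with an equivalent JSON document. Illustration only.
import java.nio.charset.StandardCharsets;

public class WireSizeDemo {
    public static void main(String[] args) {
        // Protobuf wire format: tag byte (field 1, wire type 0 = varint) + varint value.
        // tag = (field_number << 3) | wire_type = (1 << 3) | 0 = 0x08; 42 fits in one varint byte.
        byte[] proto = { 0x08, 42 };

        // The same logical payload as JSON text.
        byte[] json = "{\"id\": 42}".getBytes(StandardCharsets.UTF_8);

        System.out.println("Protobuf: " + proto.length + " bytes"); // 2 bytes
        System.out.println("JSON:     " + json.length + " bytes");  // 10 bytes
    }
}
```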
Network Latency Reduction
Explore top LinkedIn content from expert professionals.
Summary
Network latency reduction is all about minimizing the time it takes for data to travel between systems, which directly impacts how quickly users receive responses from applications. Achieving lower latency often requires looking beyond hardware and focusing on efficient design, smart caching, and streamlined data flow throughout your network and backend systems.
- Streamline data paths: Re-examine how information travels within your architecture, minimizing unnecessary hops and redundant processing to cut down response time.
- Adopt better serialization: Switch from text-heavy formats like JSON to compact binary formats such as Protocol Buffers to speed up data encoding and decoding.
- Use smart caching: Implement shared caches and in-memory solutions to serve frequently needed data quickly instead of repeatedly hitting the database.
-
Everyone talks about scalability. Very few talk about where the latency is hiding.

I once worked on a system where a single API call took ~450ms. The team kept trying to "scale the service" by adding more replicas. Pods were multiplied. Autoscaling was tuned. Dashboards were made fancier. But the request still took ~450ms.

Because the problem was never about scale. It was this:
- 180ms spent waiting on a downstream service.
- 120ms on a database round-trip over a noisy network hop.
- 80ms wasted in JSON -> DTO -> Internal Model conversions.
- 40ms in logging + metrics I/O.
- The actual business logic: ~15ms.

We were scaling the symptom, not the cause. Optimizing that request had nothing to do with distributed-systems wizardry. It was mostly about treating latency as a budget, not as a consequence.

Here's the framework we used that changed everything:
- Latency Budget = Time Allowed for Request
- Breakdown = Where That Time Is Actually Spent
- Gap = Budget - Breakdown

And then we asked just one question: "What is the single biggest chunk of time we can remove without changing the system's behavior?"

This is what we ended up doing:
- Moved DB calls to a closer subnet (dropped ~60ms)
- Cached the downstream call response intelligently (saved ~150ms)
- Switched internal models to Protobuf (saved ~40ms)
- Batched our metrics (saved ~20ms)

The API dropped to ~120ms. Without more servers. Without more Kubernetes magic. Just engineering clarity. 🚀

Scalability isn't just about adding compute. It's about understanding where the time goes. Most "slow" systems aren't slow. They're just unobserved.
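A minimal sketch of that latency-budget framework, assuming a hypothetical 200ms budget and simulated phase durations; in a real service each measured phase would wrap the actual downstream, database, and mapping calls:

```java
// Record per-phase timings for one request, then compare the total against the budget.
import java.util.LinkedHashMap;
import java.util.Map;

public class LatencyBudget {
    private final long budgetMillis;
    private final Map<String, Long> breakdown = new LinkedHashMap<>();

    LatencyBudget(long budgetMillis) { this.budgetMillis = budgetMillis; }

    // Measure one named phase of the request.
    <T> T measure(String phase, java.util.function.Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            breakdown.merge(phase, (System.nanoTime() - start) / 1_000_000, Long::sum);
        }
    }

    void report() {
        long spent = breakdown.values().stream().mapToLong(Long::longValue).sum();
        breakdown.forEach((phase, ms) -> System.out.printf("%-12s %4d ms%n", phase, ms));
        System.out.printf("total %d ms, budget %d ms, gap %d ms%n",
                spent, budgetMillis, budgetMillis - spent);
    }

    public static void main(String[] args) {
        LatencyBudget budget = new LatencyBudget(200);
        budget.measure("downstream", () -> sleep(180)); // simulated downstream call
        budget.measure("database",   () -> sleep(120)); // simulated DB round trip
        budget.measure("mapping",    () -> sleep(80));  // simulated model conversion
        budget.report(); // shows the biggest chunk to attack first
    }

    private static Object sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return null;
    }
}
```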
-
🚀 Reduced API Latency by ~40% — Here's What Actually Works

While going through performance optimization techniques for Spring Boot APIs, I came across a really practical PDF that shows how API latency was reduced from 800ms → 480ms using real-world backend strategies. Thought this was worth sharing 👇

📘 What this guide covers:

⚡ 1. Query Optimization (Biggest Bottleneck)
• Fixed N+1 issues using proper joins (JOIN FETCH)
• Added indexes → ~60% faster lookups
• Selected only required fields instead of full entities
• Query time improved

🧠 2. Redis Caching
• Cache-aside pattern (DB hit only on cache miss)
• TTL + cache invalidation strategy
• Cache warming for hot data
• Result → ~70% fewer DB calls

🔌 3. Connection Pooling (HikariCP)
• Reused DB connections instead of creating new ones
• Tuned pool size & timeouts
• Result → ~25% faster DB operations

📄 4. Smart Pagination
• Avoid fetching massive datasets
• Reduced response size from 500KB → 15KB (~97% less)
• Used Spring Pageable for clean implementation

⚙️ 5. Async Processing (@Async)
• Offloaded heavy tasks (emails, PDFs, external APIs)
• Faster user response → backend continues work in background

📊 6. Monitoring & Observability
• Logging + Actuator + slow query tracking
• Faster debugging and performance insights

💡 Final Outcome: All combined → ~40% faster APIs in a real-world setup

📎 I'm sharing this PDF in the post for anyone building high-performance backend systems. If you're working with Java, Spring Boot, Microservices, or System Design, these are the kind of optimizations that actually matter in production.

#SpringBoot #Java #BackendDevelopment #SystemDesign #Microservices #Performance #Redis #APIDesign #SoftwareEngineering
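As an illustration of point 2, here is a minimal cache-aside sketch with a TTL. A ConcurrentHashMap stands in for Redis so the example stays self-contained, and loadFromDatabase() is a hypothetical placeholder for the real repository call:

```java
// Cache-aside pattern: read from cache first, hit the DB only on a miss, and
// invalidate explicitly on writes. A local map stands in for Redis here.
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

public class CacheAsideDemo {
    private record Entry(String value, Instant expiresAt) {}

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl = Duration.ofMinutes(5);

    String getProduct(String id) {
        Entry hit = cache.get(id);
        if (hit != null && hit.expiresAt().isAfter(Instant.now())) {
            return hit.value();                       // cache hit: no DB round trip
        }
        String fresh = loadFromDatabase(id);          // cache miss: go to the DB once
        cache.put(id, new Entry(fresh, Instant.now().plus(ttl)));
        return fresh;
    }

    void invalidate(String id) { cache.remove(id); }  // call on writes/updates

    private String loadFromDatabase(String id) {
        return "product-" + id;                       // placeholder for the real query
    }

    public static void main(String[] args) {
        CacheAsideDemo demo = new CacheAsideDemo();
        System.out.println(demo.getProduct("42"));    // miss -> loads from "DB"
        System.out.println(demo.getProduct("42"));    // hit  -> served from cache
    }
}
```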
-
🚀 Latency Is One of the First Problems You Notice in Production Systems

While working with distributed systems and backend services, one thing becomes very clear: latency rarely comes from one place. It builds up across multiple layers such as database queries, network calls, serialization, external APIs, and service-to-service communication. A few milliseconds at each layer can quickly turn into hundreds of milliseconds for the end user.

Over time, I've noticed that improving system performance usually comes down to a set of practical latency-reduction techniques used across the stack. Here are some that consistently make a difference:

🔹 In-Memory Caching: Serving frequently accessed data directly from memory avoids repeated database calls.
🔹 Database Indexing: Proper indexing often turns slow queries into fast ones by eliminating full table scans.
🔹 Connection Pooling: Reusing connections avoids the overhead of repeatedly creating new ones.
🔹 Payload Compression: Compressing responses using Gzip or Brotli reduces network transfer time.
🔹 CDN Distribution: Static assets served closer to users significantly improve response time globally.
🔹 HTTP/2 Multiplexing: Sending multiple requests over a single connection reduces network overhead.
🔹 Request Batching: Combining smaller requests can reduce unnecessary network round trips.
🔹 Async Message Queues: Offloading heavy tasks to background workers improves response time for user-facing services.
🔹 Load Balancing: Distributing traffic across instances helps prevent single-service bottlenecks.
🔹 Reducing External Dependencies: Third-party APIs can introduce unpredictable latency.
🔹 Edge Computing: Processing data closer to the user can significantly reduce response time.
🔹 Efficient Serialization: Formats like Protobuf or Avro can reduce encoding/decoding overhead compared to larger text-based payloads.
🔹 Vertical Scaling: Sometimes increasing compute resources for latency-critical services is the simplest improvement.
🔹 Lazy Loading: Deferring non-critical resources can improve perceived application speed.
🔹 Client-Side Rendering: Offloading rendering work to the browser can reduce backend load.
🔹 Prefetching Critical Resources: Loading data ahead of time helps reduce waiting time for users.

What I've learned is that low latency rarely comes from a single optimization. It usually comes from small improvements across multiple layers of the architecture. That's why performance engineering becomes an important part of designing scalable systems.

💬 Curious to hear: which optimization has given you the biggest latency improvement in production systems?

#SystemDesign #BackendEngineering #MicroservicesArchitecture #DistributedSystems #JavaDeveloper #PerformanceEngineering #ScalableSystems #CloudArchitecture #C2C #SpringBoot #SoftwareEngineering #DevOps #LatencyOptimization #CloudNative
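To make one of these techniques concrete, here is a minimal request-batching sketch: instead of a network round trip per id, ids are buffered and fetched in one bulk call. The batch size of 5 and fetchBulk() are hypothetical stand-ins for a real bulk endpoint or a single IN (...) query:

```java
// Micro-batching: buffer ids and resolve them in one bulk fetch instead of one call each.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequestBatcher {
    private static final int BATCH_SIZE = 5;
    private final List<String> pending = new ArrayList<>();

    // Queue an id; a real implementation would also flush on a small time window.
    void request(String id) {
        pending.add(id);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) return;
        Map<String, String> results = fetchBulk(List.copyOf(pending)); // one round trip
        results.forEach((id, value) -> System.out.println(id + " -> " + value));
        pending.clear();
    }

    // Stand-in for a bulk API call or a single SELECT ... WHERE id IN (...) query.
    private Map<String, String> fetchBulk(List<String> ids) {
        Map<String, String> out = new HashMap<>();
        for (String id : ids) out.put(id, "value-" + id);
        return out;
    }

    public static void main(String[] args) {
        RequestBatcher batcher = new RequestBatcher();
        for (int i = 1; i <= 7; i++) batcher.request("id-" + i); // first 5 flush automatically
        batcher.flush();                                          // flush the remaining 2
    }
}
```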
-
A junior reached out to me last week. One of our APIs was collapsing under 150 requests per second. Yes — only 150.

He had tried everything:
* Added an in-memory cache
* Scaled the K8s pods
* Increased CPU and memory

Nothing worked. The API still couldn't scale beyond 150 RPS. Latency? Upwards of 1 minute. 🤯 Brain = Blown.

So I rolled up my sleeves and started digging; studied the code, the query patterns, and the call graphs. Turns out, the problem wasn't hardware. It was design.

It was a bulk API processing 70 requests per call. For every request:
1. Making multiple synchronous downstream calls
2. Hitting the DB repeatedly for the same data for every request
3. Using local caches (different for each of 15 pods!)

So instead of adding more pods, we redesigned the flow:
1. Reduced 350 DB calls → 5 DB calls
2. Built a common context object shared across all requests
3. Shifted reads to dedicated read replicas
4. Moved from in-memory to Redis cache (shared across pods)

Results:
1. 20× higher throughput — 3K QPS
2. 60× lower latency (~60s → 0.8s)
3. 50% lower infra cost (fewer pods, better design)

The insight?
1. Most scalability issues aren't infrastructure limits; they're architectural inefficiencies disguised as capacity problems.
2. Scaling isn't about throwing hardware at the problem. It's about tightening data paths, minimizing redundancy, and respecting latency budgets.

Before you spin up the next node, ask yourself: Is my architecture optimized enough to earn that node?
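A minimal sketch of the "common context object" idea from that redesign: collect the ids once per bulk request, fetch them in a single bulk call, and let every item read from the shared map. The Item record, loadUsersBulk(), and the field names are hypothetical stand-ins:

```java
// One bulk lookup replaces N identical per-item lookups within a single bulk request.
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class BulkRequestContext {
    record Item(String itemId, String userId) {}

    // Stand-in for a real bulk DB query or downstream bulk endpoint.
    static Map<String, String> loadUsersBulk(Set<String> userIds) {
        Map<String, String> users = new HashMap<>();
        for (String id : userIds) users.put(id, "user-profile-" + id);
        return users;
    }

    static void processBulk(List<Item> items) {
        // Build the shared context once per bulk request.
        Set<String> userIds = items.stream().map(Item::userId).collect(Collectors.toSet());
        Map<String, String> userContext = loadUsersBulk(userIds);   // 1 call, not items.size() calls

        for (Item item : items) {
            String profile = userContext.get(item.userId());        // no further I/O per item
            System.out.println(item.itemId() + " processed with " + profile);
        }
    }

    public static void main(String[] args) {
        processBulk(List.of(
                new Item("a", "u1"), new Item("b", "u1"), new Item("c", "u2")));
    }
}
```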
-
We just made Next.js 93% faster in Kubernetes. Median latency dropped from 182ms to 11.6ms, and success rates jumped from 91.9% to 99.8%. The solution was surprisingly simple: stop fighting the Linux kernel and start working with it.

If you run Node.js at scale, you know the pain. Traffic spikes cause some pods to max out at 100% CPU while others idle at 30%. You overprovision to compensate, your cloud bill explodes, but the problem persists.

Traditional approaches are broken. PM2 adds 30% IPC overhead for worker coordination. Single-CPU pods create isolated queues where one pod drowns while another sits idle.

We solved this with Watt, the Node.js application server, leveraging SO_REUSEPORT, a kernel feature introduced in 2013 that almost nobody uses properly. Instead of master-worker coordination, the kernel distributes connections directly. Zero overhead, pure efficiency.

The AWS EKS benchmarks under 1000 req/s load tell the story. With identical 6 CPU resources, single-CPU pods hit 155ms median latency, PM2 reached 182ms, while Watt delivered 11.6ms. At P95, Watt stays at 235ms versus PM2's 1260ms. That's not marginal improvement, that's transformative.

In e-commerce, the difference between 182ms and 11.6ms is the difference between a sale and an abandoned cart. Every 100ms of latency measurably impacts conversion rates.

Implementation is trivial. From PM2, remove ecosystem files and set worker count. From single-CPU pods, reduce pod count and increase CPU per pod. No code changes, just better architecture.

This works for any CPU-bound Node.js workload. GraphQL servers, API gateways, SSR frameworks. If you're running Node in Kubernetes, you're leaving performance on the table.

Watt is open source, production-ready, and already delivering these results at scale. 93.6% faster latency, 99.8% reliability, 9.6% more throughput with the same resources. Full technical deep dive at our blog, code at https://lnkd.in/dsmneTBt
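The kernel feature at the center of this post, SO_REUSEPORT, can be sketched independently of Watt or Node.js. Below is a minimal Java NIO listener that opts into it; port 8080 is arbitrary, support depends on the OS (Linux 3.9+), and this only illustrates the mechanism, not how Watt implements it:

```java
// Several worker processes bind the same port with SO_REUSEPORT; the kernel spreads
// incoming connections across them with no userspace master/worker coordination.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ReusePortWorker {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        if (!server.supportedOptions().contains(StandardSocketOptions.SO_REUSEPORT)) {
            System.err.println("SO_REUSEPORT not supported on this platform");
            return;
        }
        server.setOption(StandardSocketOptions.SO_REUSEPORT, true);
        server.bind(new InetSocketAddress(8080)); // every worker process binds the same port

        while (true) {
            try (SocketChannel client = server.accept()) {   // kernel picks which listener gets it
                System.out.println("handled by pid " + ProcessHandle.current().pid());
            }
        }
    }
}
```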
-
Recently, our AI voice bot was giving 2–3 seconds of latency locally, perfectly fine. But in production? 5–6 seconds!! For something that's supposed to feel "real-time," that's a disaster.

My first instinct: it's a deployment issue. But the real problem was finding where the issue was, because the deployment stack had too many moving parts to just "guess":
- The agent responsiveness and how quickly it returns audio
- The telephony layer (Twilio) — maybe there was some lag or processing overhead there
- The EC2 instance size and networking capacity
- The reverse proxy (nginx) behavior in front of the app

Here's how it went down:

Step 1: Agent tuning. I started by tweaking the agent's internal responsiveness, thinking maybe it was buffering too much before streaming back. Tested multiple configs, no noticeable difference.

Step 2: Telephony tweaks. Jumped into Twilio settings. Checked if media streaming or jitter buffers were adding delay. Reduced buffer sizes, changed audio settings; still stuck at 5–6 seconds.

Step 3: Server horsepower. Maybe the EC2 was choking. Upgraded from a smaller t3 to a larger t3, then to a c5 for better network performance. CPU and memory were healthy, but latency didn't drop.

Step 4: The reverse proxy suspicion. I had nginx running as a reverse proxy, forwarding all HTTP and WebSocket traffic from 80/443 to a custom port for the bot. AI suggested the issue might be nginx's caching and buffering behavior, which could add milliseconds that, in real-time streaming, feel huge.

Step 5: The real fix. Swapped nginx entirely for an AWS ALB (Application Load Balancer), keeping the same SSL termination and routing rules. Immediately saw latency drop to 1–2 seconds consistently. The reason: nginx was buffering and caching data, which broke the real-time interaction and caused the larger delays.

Key learnings:
- Throwing more compute at the problem doesn't guarantee speed; you have to hunt for bottlenecks.
- AI is fantastic for hinting at possible angles, but you need your own architectural intuition to validate and act.
- In complex systems, the real enemy is often a small, overlooked layer, not the obvious big pieces.
-
99% of latency issues are caused by the user app logic. App logic here includes the libraries and frameworks used by the app. Sometimes, however, the 1% could be the kernel. Here is one example where a TCP configuration can improve backend and frontend network latency.

When an app writes to a socket connection, whether it's the frontend sending a request or the backend sending a response, the raw bytes are copied from the user-space application process into kernel memory, where TCP/IP takes place. Each connection has a receive buffer, where data from the other party arrives, and a send buffer, where data from the application goes before it is sent to the network. Both send and receive buffers live in kernel space.

Data written by the app to the kernel is not sent immediately but instead buffered in the kernel. The kernel hopes to collect enough data to fill a full TCP segment, up to the MSS, or maximum segment size, which on a standard 1500-byte MTU is around 1460 bytes. The reason for the buffering is segment overhead: each segment carries roughly 40 bytes of headers, and the overhead of sending a few bytes under such a large header leads to inefficient use of network bandwidth. Thus the buffering: classic computer science stuff.

So by default the kernel delays sending segments out through the network in the hope of receiving more data from the application to fill a full MSS. The algorithm that specifies when to delay and by how much is called Nagle's algorithm. You can disable this delay by setting the TCP_NODELAY socket option when creating the connection. This causes the kernel to send whatever it has in the send buffer even if it's only a few bytes. This is great because sometimes a few bytes is all we have. It essentially favors low latency over network efficiency.

For the backend, applications can benefit from enabling this option (disabling the delay), especially when writing responses back to the client. This is because responses are sent through the send buffer, and delaying segments just because they are not full can slow down response writes. NodeJS has enabled this option in this PR.

For the frontend, apps can benefit from this option too. In 2016 the creator of cURL, Daniel, spent hours debugging a TLS latency issue only to find out that the kernel was sitting on a few bytes of TLS content, waiting for a full MSS. That caused the cURL project to set TCP_NODELAY by default.
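For illustration, here is a minimal Java sketch of disabling Nagle's algorithm from application code; the Node.js and cURL changes mentioned above use their own equivalents, and example.com:443 is just a placeholder endpoint:

```java
// Enable TCP_NODELAY so small writes are sent immediately instead of waiting to fill an MSS.
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NoDelayClient {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // TCP_NODELAY: favor low latency over batching small writes into full segments.
            socket.setTcpNoDelay(true);
            socket.connect(new InetSocketAddress("example.com", 443), 5_000);

            OutputStream out = socket.getOutputStream();
            out.write("ping".getBytes());   // these few bytes go out right away
            out.flush();
            System.out.println("TCP_NODELAY enabled: " + socket.getTcpNoDelay());
        }
    }
}
```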
-
Cloudflare recently redesigned its cache purge system, reducing global purge times from 1.5 seconds to 150 milliseconds (P50). Here's how they did it, step by step:

► The Problem with the Old System

1. Centralized Core-Based Purge:
- All purge requests went through a centralized database, Quicksilver.
- High latency: a request from, say, Australia had to travel to the US core server before propagating back globally.

2. Lazy Purging:
- Cached content wasn't actively deleted.
- Instead, purge requests were stored and checked only when users accessed the content.
- Result: large purge histories, slower queries, and wasted disk space.

3. Scalability Issues:
- Purge requests grew rapidly, but Quicksilver wasn't built for high-frequency operations.
- Disk space meant for cached content was consumed by purge metadata.

► The New System

1. Peer-to-Peer Purging:
- Cloudflare ditched the centralized approach for a decentralized, peer-to-peer model.
- Each data center communicates directly with its neighbors, propagating purge requests across the network.
- Result: 150ms global purge latency (P50).

2. Active Purging with RocksDB:
- Each server now uses RocksDB to maintain a local index of cached content.
- Purge requests are immediately processed, actively deleting expired content.
- Indexing ensures efficient lookups for flexible purge requests, like deleting all JSON files.

3. Efficiency Gains:
- 10x storage savings: metadata is efficiently managed using RocksDB.
- Faster responses: purge requests now process instantly, without waiting for user access.

► Key Takeaways
- Cloudflare's switch to peer-to-peer distribution eliminated the bottleneck of a centralized system.
- Using RocksDB and active purging reduced latency while saving storage space.
- P50 purge times are now 150ms globally, with further optimizations underway.

This redesign is a masterclass in scaling distributed systems for speed and efficiency.
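A minimal sketch of the "active purge against a local index" idea, with a TreeMap standing in for RocksDB; the cache keys and the /assets/ prefix purge are hypothetical examples, not Cloudflare's actual data model:

```java
// Cached object keys live in a sorted local index, so a purge request can find and delete
// all matching entries immediately instead of lazily checking on each user access.
import java.util.SortedMap;
import java.util.TreeMap;

public class ActivePurgeIndex {
    private final TreeMap<String, byte[]> index = new TreeMap<>();

    void put(String key, byte[] body) { index.put(key, body); }

    // Purge everything under a key prefix in one pass over a contiguous range of the index.
    int purgePrefix(String prefix) {
        SortedMap<String, byte[]> range =
                index.subMap(prefix, prefix + Character.MAX_VALUE);
        int purged = range.size();
        range.clear();               // actively delete now, not on the next read
        return purged;
    }

    public static void main(String[] args) {
        ActivePurgeIndex cache = new ActivePurgeIndex();
        cache.put("/assets/app.js",    new byte[0]);
        cache.put("/assets/data.json", new byte[0]);
        cache.put("/index.html",       new byte[0]);

        System.out.println("purged " + cache.purgePrefix("/assets/") + " objects"); // purged 2 objects
        System.out.println("remaining: " + cache.index.keySet());                   // [/index.html]
    }
}
```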
-
Fast GPUs don't guarantee fast training. A weak network can turn a $10M AI cluster into idle silicon. Because at scale, AI isn't limited by compute… it's limited by how fast GPUs can talk to each other.

Training workloads generate brutal traffic patterns: gradient sync, collective ops, parameter exchange, microbursts… And if your network can't handle that pressure, you get slowdowns, packet drops, retries, unstable throughput, and wasted GPUs. That's why AI data centers follow very specific networking patterns.

Here are 15 networking patterns that power modern AI workloads:

✅ Low-Latency Transport: Kernel Bypass, RDMA, RoCEv2 → reduce CPU overhead and make GPU communication ultra-fast.
✅ Congestion Control: ECN, DCTCP, DCQCN → prevent congestion from destroying bandwidth and collapsing distributed training.
✅ Lossless Ethernet: PFC, Flow Control, Packet Loss Recovery → avoid drops during RDMA traffic and stabilize throughput under heavy load.
✅ Routing + Load Efficiency: ECMP Load Balancing → spreads traffic across equal-cost paths to prevent hot links and bottlenecks.
✅ Traffic Behavior Problems: Incast, Microbursts, Bufferbloat → the silent killers that create sudden queue spikes and unpredictable latency.

The big takeaway? AI networking isn't "traditional networking with more bandwidth." It's a different game, where timing, queue control, and predictable delivery decide whether training flies… or fails. This is why the future of AI infrastructure isn't just GPU upgrades… it's fabric engineering.

If you're building AI clusters or training platforms, save this post - these 15 patterns are the backbone. Follow for more #AI_Infrastructure_Media
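As a small illustration of one of these patterns, ECMP load balancing, here is a sketch of 5-tuple flow hashing: the same flow always maps to the same path (preserving packet order) while different flows spread across the equal-cost links. The addresses, ports, and 4-path fabric are made up; real switches do this in hardware:

```java
// Hash a flow's 5-tuple to pick one of several equal-cost paths deterministically.
import java.util.Objects;

public class EcmpHashing {
    record FiveTuple(String srcIp, int srcPort, String dstIp, int dstPort, String proto) {}

    // Deterministically map a flow to one of `pathCount` equal-cost paths.
    static int pickPath(FiveTuple flow, int pathCount) {
        int hash = Objects.hash(flow.srcIp(), flow.srcPort(), flow.dstIp(), flow.dstPort(), flow.proto());
        return Math.floorMod(hash, pathCount);
    }

    public static void main(String[] args) {
        int paths = 4; // e.g., four uplinks between leaf and spine
        FiveTuple gradientSync = new FiveTuple("10.0.0.1", 50000, "10.0.1.7", 4791, "UDP");
        FiveTuple paramFetch   = new FiveTuple("10.0.0.2", 50001, "10.0.1.9", 4791, "UDP");

        // Same flow always hashes to the same path; different flows spread out.
        System.out.println("gradientSync -> path " + pickPath(gradientSync, paths));
        System.out.println("paramFetch   -> path " + pickPath(paramFetch, paths));
    }
}
```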