Zero-Copy Data Transfer in HPC

A common technique for loading data in high-performance applications is called "zero copy" because, well, it doesn't require a copy. But what does that mean, and why is it useful?

As I harp on in many of my posts, data movement is typically one of the largest bottlenecks and biggest challenges in high-performance computing today. If we think about a 405B-parameter LLM, we are moving at least 405 GB of weights through memory (at one byte per parameter). And that is virtually nothing compared to the petabytes of data required to train that model. Traditional data transfer methods involve multiple copies of data between user space and kernel space, leading to increased CPU usage and reduced throughput. Let's dive deeper.

Problems with traditional data transfer:
In a conventional data transfer operation, say from disk to a network interface, the data typically goes through multiple stages:
- Read from disk into a kernel buffer
- Copy from the kernel buffer into user space
- Transform, then copy back into a kernel buffer before the network send
- Transmit from the kernel buffer to the network interface

Each stage involves a copy, burning CPU cycles and memory bandwidth; for large transfers, the copying itself becomes the rate limiter.

How zero copy works:
Zero copy eliminates redundant data copies by using system-level techniques that allow data to be transferred directly between kernel space and the target destination without intermediary copies. Several zero-copy techniques are implemented in modern operating systems:
- Memory mapping (mmap): mmap maps a file directly into a process's address space, so the file contents can be accessed as if they were in memory, eliminating the copy between kernel and user space.
- sendfile(): In networked applications, the sendfile() system call sends data directly from one file descriptor (such as a file on disk) to a socket, bypassing user space entirely (see the sketch below).
- Direct I/O: Direct I/O bypasses the kernel's page cache, allowing data to be read or written directly between user buffers and disk.
- DMA (direct memory access): a hardware-level technique where data is transferred directly between memory and a device without CPU intervention.

Ultimately, zero copy provides reduced CPU utilization, lower-latency access, increased throughput, and more efficient memory usage. Several technologies leverage zero-copy architecture directly, such as NVIDIA's GPUDirect Storage, RDMA over Converged Ethernet, and even network filesystems. Understanding this will help you move data more efficiently in your HPC applications.

If you like my content, feel free to follow or connect! #softwareengineering #hpc
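To make the sendfile() path concrete, here is a minimal Linux-specific sketch that streams a file to an already-connected socket without the data ever entering user space. The socket setup is omitted; connfd is assumed to be a connected TCP socket, and the file path is whatever the caller supplies.

/*
 * Minimal sketch: zero-copy file-to-socket transfer with sendfile()
 * on Linux. Assumes connfd is an already-connected TCP socket.
 */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int send_file_zero_copy(int connfd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel moves pages straight from the page cache to the
         * socket buffers; the data never crosses into user space. */
        ssize_t sent = sendfile(connfd, fd, &offset, st.st_size - offset);
        if (sent <= 0) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}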
Optimizing Data Transfer
Summary
Optimizing data transfer means making the movement of information between systems as fast and efficient as possible, which is crucial for handling large files, streaming live events, or improving application performance. By using smarter protocols, adjusting settings, and reducing unnecessary steps, organizations can speed up transfers, reduce delays, and use less computing power.
- Adjust protocol settings: Tweak window sizes and buffer configurations to match your network’s speed and distance, so transfers don’t slow down due to waiting for acknowledgments.
- Use smart system calls: Implement techniques like zero-copy or scatter-gather I/O to cut down on redundant copying, freeing up memory and CPU for other tasks.
- Balance real-time needs: Carefully manage buffering and error recovery when streaming live events to avoid delays and keep the experience smooth for viewers.
Why Your 1Gbps Link Only Delivers 10Mbps for SFTP, and How to Fix It

You upgraded the circuit. You verified the bandwidth. Then your 50GB SFTP transfer runs at 8-12 Mbps. Sound familiar? This isn't a bandwidth problem. It's TCP physics.

The Real Issue: Bandwidth-Delay Product
SFTP runs over TCP, and TCP performance over long distances is governed by the bandwidth-delay product: bandwidth × round-trip time (RTT). If you have:
- a 1 Gbps link
- 150 ms latency (typical intercontinental)

then you need ~19 MB of data "in flight" to fully utilize the link (1 Gbps × 0.15 s ÷ 8 ≈ 18.75 MB). If your TCP window is smaller than that, the sender pauses constantly, waiting for acknowledgments. Result? Your 1 Gbps link behaves like 10 Mbps.

I've Seen This Before
Years ago, when I worked as a Unix systems administrator, I used to manually tune:
- tcp_sendspace
- tcp_recvspace
- window scaling
- kernel buffer sizes

We calculated the bandwidth-delay product per route and tuned Solaris and AIX systems just to make transcontinental transfers usable. Most organizations don't want to tweak kernel parameters on production MFT servers anymore.

Modern Fix #1: TCP Optimization Inside the Application
Modern MFT platforms have evolved. TDXchange supports TCP tuning directly within the application for both SFTP server and client connections, without requiring OS-level changes. This allows you to:
- optimize socket buffers
- improve window utilization
- increase throughput on high-latency routes
- avoid modifying cloud or container kernel settings

For moderate-latency links, this can improve performance 3-5x. But TCP still has limits.

The Hard Ceiling of TCP
Even perfectly tuned TCP:
- slows aggressively on minor packet loss
- remains tied to latency
- never fully eliminates ACK overhead

On 150-200 ms links, TCP often caps at 10-20% utilization. That's math, not misconfiguration.

Modern Fix #2: UDP-Based Acceleration
This is where acceleration changes everything. bTrade's AFTP (Accelerated File Transfer Protocol) uses UDP with custom congestion control and selective retransmission. Instead of waiting for acknowledgments, it keeps the pipe full. Real-world results:
- SFTP: 45 Mbps on a 1 Gbps link
- AFTP: 890 Mbps on the same link

Same circuit. Same distance. Different protocol behavior.

When to Use What
Use TCP tuning when:
- compliance mandates SFTP
- latency is moderate
- files are smaller

Use UDP acceleration when:
- transfers exceed 10 GB
- latency exceeds 100 ms
- batch windows are tight
- WAN utilization is under 20%

Many organizations use both.

Final Takeaway
If your 1 Gbps link only delivers 10 Mbps:
- It's not your ISP.
- It's not your firewall.
- It's not your storage.

It's TCP window physics. I used to solve this by tuning Unix kernels manually (a sketch of that style of tuning follows below). The physics haven't changed. The tooling has.
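As an illustration of that kind of tuning, here is a minimal sketch in C that sizes a socket's send and receive buffers to the bandwidth-delay product before connecting. The 1 Gbps / 150 ms figures are the assumptions from the example above; note that on Linux the requested values are doubled internally and capped by net.core.wmem_max / net.core.rmem_max, so verify the result with getsockopt().

/*
 * Minimal sketch: size TCP socket buffers to the bandwidth-delay
 * product. Bandwidth and RTT figures are example assumptions.
 */
#include <sys/socket.h>

int tune_socket_for_bdp(int sockfd, double gbps, double rtt_ms)
{
    /* BDP in bytes: (bits per second * seconds in flight) / 8 */
    int bdp = (int)(gbps * 1e9 * (rtt_ms / 1000.0) / 8.0);

    /* Ask for send and receive buffers big enough to keep the pipe
     * full; the kernel may clamp these to its configured maximums. */
    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    return 0;
}

/* 1 Gbps at 150 ms RTT -> ~18.75 MB, matching the ~19 MB figure above:
 *     tune_socket_for_bdp(sockfd, 1.0, 150.0);
 */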
Scatter-Gather I/O in C for Improved Performance 🚀

We often talk about "zero-copy" in high-performance networking, but the mechanics of how we achieve it at the syscall level are worth understanding.

If you are building a server framework or working with high-throughput network protocols, you inevitably run into the "header + body" problem: you have a protocol header in one buffer and a data payload in another. The simple approach is to allocate a new, larger buffer, memcpy the header, memcpy the body, and write() the larger buffer to the socket. This burns CPU cycles on redundant copying and puts unnecessary pressure on your allocator.

Let's be sympathetic to the hardware with readv and writev. These two system calls allow you to pass a vector of buffers (iovec structs) directly to the kernel. You hand the OS a list of pointers and lengths, and it gathers the data during transmission. This means fewer system calls, since multiple writes merge into one, and less memory bandwidth, since the intermediate memcpy disappears. The data moves directly from your original buffers to the kernel's network stack. You also get the added benefit of a fully atomic write from n buffers (see the sketch below).

If you're going to go in C, go fast. #softwareengineering
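Here is a minimal sketch of the header + body case with writev(); the 8-byte header format and its magic value are made up purely for illustration.

/*
 * Minimal sketch: send a protocol header and a payload from two
 * separate buffers with one writev() call, no intermediate copy.
 * The 8-byte header layout here is hypothetical.
 */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

ssize_t send_header_and_body(int sockfd, const void *body, size_t body_len)
{
    uint8_t header[8];
    uint32_t magic = 0xCAFEF00D;    /* made-up protocol magic number */
    uint32_t len = (uint32_t)body_len;
    memcpy(header, &magic, 4);
    memcpy(header + 4, &len, 4);

    struct iovec iov[2] = {
        { .iov_base = header,       .iov_len = sizeof(header) },
        { .iov_base = (void *)body, .iov_len = body_len       },
    };

    /* The kernel gathers both buffers into the socket in a single
     * system call; a short write can still occur on a stream socket,
     * so production code should loop on the return value. */
    return writev(sockfd, iov, 2);
}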
The Real Challenge Behind Live Event Feed Contribution

Transporting live event or match feeds from a venue to a central broadcast facility sounds straightforward. In reality, it remains one of the most complex problems in live sports production.

Today, we are expected to deliver high-quality, low-latency video for both linear TV (Star Sports) and OTT platforms (JioHotstar): securely, reliably, and at scale, while simultaneously reducing operational costs. That combination is fundamentally difficult.

Live sports broadcasting has always been about precision. However, that precision is no longer defined only by camera placement, OB vans, or on-ground logistics. Increasingly, it is defined by how intelligently we move live video across IP networks.

At the heart of the challenge lies a simple but unavoidable fact: security, reliability, and latency are inherently competing forces. Public IP networks, especially Internet connections, are unstable by nature. Packet loss is inevitable, particularly at scale. When packets drop between the encoder and decoder, recovery mechanisms must activate: retransmission, forward error correction (FEC), or a combination of both. Recovery takes time, and time directly introduces latency. This is where many live contribution strategies either succeed or fail.

Buffers play a critical role in this process. They absorb jitter, enable packet-loss detection, and allow streams to recover gracefully. Poorly managed buffer configurations increase end-to-end delay. The real optimisation challenge is not eliminating buffering, but making it adaptive (a small sketch follows below):
- Detect packet loss early
- Recover only what is necessary
- Continuously monitor and control latency drift
- Keep glass-to-glass delay stable

This is why modern contribution workflows must be evaluated at the technical system level. Bandwidth alone does not solve the problem. What matters is how intelligently the transport protocol, codec behaviour, and buffering logic operate together as a single, cohesive system, balancing stability, security, and latency in real time.

As live production continues to scale across linear and digital platforms, control over this balance will increasingly define who can deliver consistently high-quality live sports experiences, and who cannot.
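As a rough illustration of what "adaptive" can mean here, the sketch below feeds per-packet timing into the RFC 3550 interarrival-jitter estimator and derives a buffer target from it. The 3x headroom factor and the clamp values are my own illustrative assumptions, not parameters of any particular contribution product.

/*
 * Minimal sketch of adaptive jitter buffering: estimate interarrival
 * jitter with the RFC 3550 smoothed estimator and size the buffer
 * from it. Headroom factor and clamps are illustrative assumptions.
 */
#include <math.h>

typedef struct {
    double jitter_ms;       /* smoothed interarrival jitter estimate */
    double prev_transit_ms; /* previous (arrival - send) transit time */
} JitterState;

/* Call once per received packet with its send and arrival timestamps. */
double update_buffer_target_ms(JitterState *s,
                               double send_ms, double arrival_ms)
{
    double transit = arrival_ms - send_ms;
    double d = fabs(transit - s->prev_transit_ms);
    s->prev_transit_ms = transit;

    /* RFC 3550: J += (|D| - J) / 16, a gentle exponential smoother. */
    s->jitter_ms += (d - s->jitter_ms) / 16.0;

    /* Size the buffer to ~3x the jitter estimate, clamped so the
     * glass-to-glass delay stays stable instead of drifting. */
    double target = 3.0 * s->jitter_ms;
    if (target < 20.0)  target = 20.0;   /* floor: absorb small bursts  */
    if (target > 500.0) target = 500.0;  /* ceiling: cap added latency  */
    return target;
}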
How to Improve API Performance

Improving API performance can significantly enhance the user experience and overall efficiency of your application.

1. Optimize Data Transfer
- Reduce payload size: Use techniques like data compression (e.g., gzip) and minimize the data sent in responses by removing unnecessary fields (see the sketch at the end of this post).
- Pagination: Implement pagination for large datasets to avoid overwhelming the client with data.
- Filtering and sorting: Allow clients to request only the data they need (e.g., specific fields, filtered results).

2. Improve Caching
- HTTP caching: Use appropriate cache headers (e.g., Cache-Control, ETag, Last-Modified) so clients and intermediaries can cache responses.
- Server-side caching: Implement caching strategies on the server side (e.g., in-memory caches like Redis or Memcached) to store frequently accessed data.

3. Optimize Database Queries
- Indexing: Ensure your queries are backed by proper indexes, which can significantly reduce query execution time.
- Query optimization: Analyze and optimize slow queries, using tools like query analyzers to find bottlenecks.
- Connection pooling: Maintain a pool of database connections to reduce the overhead of establishing new ones.

4. Leverage Asynchronous Processing
- Background processing: For long-running tasks, use background jobs (via tools like RabbitMQ, Celery, or AWS Lambda) to avoid blocking the API response.
- WebSockets or server-sent events: For real-time updates, use WebSockets instead of polling the API repeatedly.

5. Scale Infrastructure
- Load balancing: Use load balancers to distribute traffic across multiple servers so no single server becomes a bottleneck.
- Horizontal scaling: Add more servers to handle increased load rather than relying solely on vertical scaling (upgrading existing servers).

6. Reduce Latency
- Content delivery network (CDN): Use a CDN to cache responses closer to users, reducing latency for static assets.
- Geographic distribution: Deploy your API servers in multiple regions to reduce latency for global users.

7. Use API Gateways
- API gateway: Implement an API gateway to handle tasks like rate limiting, authentication, and logging, offloading these responsibilities from your main application.

8. Monitor and Profile Performance
- Logging and monitoring: Use tools like New Relic, Datadog, or Prometheus to monitor API performance and identify bottlenecks.
- Profiling: Regularly profile your API to understand which parts of your code are slow and need optimization.

Want to know more? Follow me or connect! #Csharp #EFCore #dotnet #dotnetCore
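To make the payload-compression point concrete, here is a minimal sketch using zlib's one-shot compress() on a response body. The JSON payload is made up for illustration, and compress() emits the zlib format rather than gzip proper; in practice most web frameworks negotiate Content-Encoding: gzip for you, so this only illustrates the size win. Build with -lz.

/*
 * Minimal sketch: shrink a response payload with zlib's one-shot
 * compress(). The JSON body is a made-up example.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *payload =
        "{\"users\":[{\"id\":1,\"name\":\"Ada\"},{\"id\":2,\"name\":\"Alan\"}]}";
    uLong src_len = (uLong)strlen(payload);

    /* compressBound() gives the worst-case output size; a fixed
     * buffer is fine for this small example. */
    Bytef out[256];
    uLongf out_len = sizeof(out);

    if (compress(out, &out_len, (const Bytef *)payload, src_len) != Z_OK)
        return 1;

    printf("raw: %lu bytes, compressed: %lu bytes\n",
           (unsigned long)src_len, (unsigned long)out_len);
    return 0;
}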