Efficiently Distributing Large-Scale ML Programs: Layering Packages
A critical challenge in developing distributed machine learning systems is the efficient packaging and distribution of the application across all nodes in the training cluster. "Packaging" means bundling the training program and all its dependencies (large frameworks, utility libraries, etc.) into a single file. "Distribution" involves copying this package to every node so the application can launch and execute the coordinated training task.
Due to the complexity of distributed ML programs, which rely on massive frameworks like PyTorch and numerous toolkits, the final package size often reaches several gigabytes. Re-packaging and re-distributing this monolithic application package after every small code change becomes a major bottleneck, severely hindering developer productivity and iteration speed.
The Need for Incremental Updates
The ideal solution for minimizing the overhead of re-packaging and re-distribution is to support partial updates, in which only the modified components are transmitted. The key to this efficiency is decomposing the package into multiple layers. By rebuilding and redistributing only the layers that have changed, we can update the entire application at minimal cost.
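To make the idea concrete, here is a minimal Python sketch of change detection between builds. It is not our actual tooling, and the build_layer and upload calls are hypothetical placeholders: the point is simply to fingerprint each layer's input files and rebuild only the layers whose fingerprint changed since the last run.

```python
import hashlib
import json
from pathlib import Path

def layer_fingerprint(src_dir: str) -> str:
    """Hash every file path and its contents so any change in the layer is detected."""
    h = hashlib.sha256()
    for f in sorted(Path(src_dir).rglob("*")):
        if f.is_file():
            h.update(str(f).encode())
            h.update(f.read_bytes())
    return h.hexdigest()

def build_changed_layers(layers: dict, state_file: str = "fingerprints.json") -> None:
    """Rebuild and upload only the layers whose inputs changed since the last run."""
    state_path = Path(state_file)
    seen = json.loads(state_path.read_text()) if state_path.exists() else {}
    for name, src_dir in layers.items():
        fp = layer_fingerprint(src_dir)
        if seen.get(name) == fp:
            print(f"{name}: unchanged, skipping")
            continue
        print(f"{name}: changed, rebuilding and uploading")
        # build_layer(name, src_dir)  # hypothetical: package this layer
        # upload(name, fp)            # hypothetical: push it to package storage
        seen[name] = fp
    state_path.write_text(json.dumps(seen, indent=2))

build_changed_layers({"base": "deps/", "app": "src/"})
```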
In the open-source world, Docker is the prime example of this layering philosophy. A Docker image is essentially a layered filesystem: each instruction in the build adds a new layer, and, more importantly, each layer can be independently cached and reused when building or pulling the image. This mechanism ensures that only the layers that change are rebuilt and transferred, drastically reducing overhead.
Overcoming Proprietary Constraints
However, not all organizations can adopt Docker. At my previous workplace, Meta, historical reasons and unique requirements for underlying performance and security led us to use a proprietary binary package format called fbpkg. An fbpkg is a tarball-like binary that does not natively support layering. This forced us to endure an inefficient packaging and distribution workflow for distributed training programs.
To emulate the effect of Docker layering without modifying the core fbpkg format, we devised a simplified approach based on decoupled builds and tiered distribution. The central idea was to split the application into two independent fbpkg files, build and distribute them separately, and dynamically merge them at runtime.
We defined two logical layers:
- Base Layer: the heavy, slow-changing dependencies, such as PyTorch and the other large frameworks and toolkits.
- App Layer: our own training system code, which is small and changes frequently during development.
During development, any change to the training system code requires rebuilding and uploading only the small App Layer. Each node in the training cluster retrieves the specified version of both the App Layer and the Base Layer. Because the Base Layer rarely changes, it is typically served instantly from the node's local cache, drastically reducing download time.
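Sketched below is what the node-side retrieval can look like under these assumptions; the cache directory and the download_package call are hypothetical. Packages are cached locally by name and version, so an unchanged Base Layer never touches the network.

```python
from pathlib import Path

CACHE_DIR = Path("/var/cache/pkg")  # assumed node-local cache location

def fetch_layer(name: str, version: str) -> Path:
    """Return a local copy of name:version, downloading only on a cache miss."""
    cached = CACHE_DIR / f"{name}-{version}.sqfs"
    if cached.exists():
        print(f"{name}:{version} served from local cache")
        return cached
    print(f"{name}:{version} not cached, downloading")
    # download_package(name, version, dest=cached)  # hypothetical transfer call
    return cached

# The Base Layer version almost never changes, so it is almost always a cache hit;
# a typical code change transfers only the small App Layer.
app_image = fetch_layer("trainer-app", "v42")
base_image = fetch_layer("trainer-base", "v3")
```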
Achieving Transparent Layer Merging with OverlayFS
Simply having two independent package files is insufficient to launch the training program. To ensure the application runs as if it were accessing a single, complete package, we needed to transparently merge the filesystems of both layers. To achieve this, we drew inspiration from the Overlay Filesystem concept used in container technologies like Docker.
First, we packed the contents of each fbpkg into a read-only SquashFS image. Since SquashFS is read-only, we could not simply copy the contents of the App Layer over the Base Layer. Instead, we turned to OverlayFS, a union filesystem built into the Linux kernel, to handle the dynamic stacking.
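As an illustration, producing the two read-only images with the standard mksquashfs tool might look like this; the directory paths are hypothetical, and our real pipeline operated on fbpkg contents rather than plain directories.

```python
import subprocess

def make_squashfs(src_dir: str, image_path: str) -> None:
    """Pack a directory tree into a compressed, read-only SquashFS image."""
    subprocess.run(
        ["mksquashfs", src_dir, image_path, "-noappend"],
        check=True,
    )

make_squashfs("unpacked/base", "base.sqfs")  # large, rarely rebuilt
make_squashfs("unpacked/app", "app.sqfs")    # small, rebuilt on every code change
```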
OverlayFS lets us mount multiple read-only filesystems as lower layers and combine them into a unified virtual view. When the application accesses a file, the lookup is resolved in the topmost layer first (here, the App Layer) before falling back to the layers beneath it. By superimposing the App Layer onto the Base Layer, we create a logically complete filesystem containing files from both layers.
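Here is a minimal sketch of the runtime merge, assuming the two images from the previous step and root privileges. In an OverlayFS mount, the first directory listed in lowerdir is searched first, so putting the App Layer ahead of the Base Layer gives its files priority.

```python
import os
import subprocess

def mount_merged_view(app_img: str, base_img: str, mnt_root: str = "/mnt/pkg") -> str:
    """Loop-mount both read-only images, then overlay the App Layer on top."""
    app_dir, base_dir, merged = (os.path.join(mnt_root, d) for d in ("app", "base", "merged"))
    for d in (app_dir, base_dir, merged):
        os.makedirs(d, exist_ok=True)
    subprocess.run(["mount", "-t", "squashfs", "-o", "loop", app_img, app_dir], check=True)
    subprocess.run(["mount", "-t", "squashfs", "-o", "loop", base_img, base_dir], check=True)
    # With no upperdir, the overlay itself is read-only; lookups hit app_dir first,
    # then fall back to base_dir, which is exactly the merge semantics we want.
    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={app_dir}:{base_dir}", merged],
        check=True,
    )
    return merged  # launch the training program with this directory as its package root

mount_merged_view("app.sqfs", "base.sqfs")
```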
This technique allowed the program to launch and run normally within the merged virtual filesystem without requiring the Base Layer to be re-packaged or entirely re-transmitted.
Conclusion
By combining a layered build and packaging strategy with distribution caching at the node level, we transformed the overhead of building and transferring multi-gigabyte packages into an infrequent baseline setup plus frequent, low-cost incremental updates. This significantly accelerated our iteration speed and improved engineering efficiency when developing distributed machine learning systems.