Efficiently Distributing Large-Scale ML Programs: Layering Packages
A critical challenge in developing distributed machine learning systems is the efficient packaging and distribution of the application across all nodes in the training cluster. "Packaging" means bundling the training program and all its dependencies (large frameworks, utility libraries, etc.) into a single file. "Distribution" involves copying this package to every node so the application can launch and execute the coordinated training task.
Due to the complexity of distributed ML programs, which rely on massive frameworks like PyTorch and numerous toolkits, the final package size often reaches several gigabytes. Re-packaging and re-distributing this monolithic application package after every small code change becomes a major bottleneck, severely hindering developer productivity and iteration speed.
The Need for Incremental Updates
The ideal solution for minimizing the overhead of re-packaging and re-distribution is to support partial updates, in which only the modified components are transmitted. The key to this efficiency is decomposing the package into multiple layers. By rebuilding and redistributing only the layers that have changed, we can update the entire application at minimal cost.
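To make the idea concrete, here is a minimal Python sketch of change detection between builds. It is not our actual tooling, and the build_layer and upload calls are hypothetical placeholders: the point is simply to fingerprint each layer's input files and rebuild only the layers whose fingerprint changed since the last run.

```python
import hashlib
import json
from pathlib import Path

def layer_fingerprint(src_dir: str) -> str:
    """Hash every file path and its contents so any change in the layer is detected."""
    h = hashlib.sha256()
    for f in sorted(Path(src_dir).rglob("*")):
        if f.is_file():
            h.update(str(f).encode())
            h.update(f.read_bytes())
    return h.hexdigest()

def build_changed_layers(layers: dict, state_file: str = "fingerprints.json") -> None:
    """Rebuild and upload only the layers whose inputs changed since the last run."""
    state_path = Path(state_file)
    seen = json.loads(state_path.read_text()) if state_path.exists() else {}
    for name, src_dir in layers.items():
        fp = layer_fingerprint(src_dir)
        if seen.get(name) == fp:
            print(f"{name}: unchanged, skipping")
            continue
        print(f"{name}: changed, rebuilding and uploading")
        # build_layer(name, src_dir)  # hypothetical: package this layer
        # upload(name, fp)            # hypothetical: push it to package storage
        seen[name] = fp
    state_path.write_text(json.dumps(seen, indent=2))

build_changed_layers({"base": "deps/", "app": "src/"})
```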
In the open-source world, Docker is the prime example of this layering philosophy. A Docker image is essentially a layered filesystem: each instruction in the build adds a new layer, and, more importantly, each layer can be independently cached and reused when building or pulling the image. This mechanism ensures that only the layers that change are rebuilt and transferred, drastically reducing overhead.
Overcoming Proprietary Constraints
However, not all organizations can adopt Docker. At my previous workplace, Meta, historical reasons and unique requirements for underlying performance and security led us to use a proprietary binary package format called fbpkg. An fbpkg is a tarball-like binary that does not natively support layering. This forced us to endure an inefficient packaging and distribution workflow for distributed training programs.
To emulate the effect of Docker layering without modifying the core fbpkg format, we devised a simplified approach based on decoupled builds and tiered distribution. The central idea was to split the application into two independent fbpkg files, build and distribute them separately, and dynamically merge them at runtime.
We defined two logical layers:
- Base Layer: the heavy, slow-changing dependencies, such as PyTorch and the other large frameworks and toolkits.
- App Layer: our own training system code, which is small and changes frequently during development.
During development, any change to the training system code requires rebuilding and uploading only the small App Layer. Each node in the training cluster retrieves the specified version of both the App Layer and the Base Layer. Because the Base Layer rarely changes, it is typically served instantly from the node's local cache, drastically reducing download time.
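Sketched below is what the node-side retrieval can look like under these assumptions; the cache directory and the download_package call are hypothetical. Packages are cached locally by name and version, so an unchanged Base Layer never touches the network.

```python
from pathlib import Path

CACHE_DIR = Path("/var/cache/pkg")  # assumed node-local cache location

def fetch_layer(name: str, version: str) -> Path:
    """Return a local copy of name:version, downloading only on a cache miss."""
    cached = CACHE_DIR / f"{name}-{version}.sqfs"
    if cached.exists():
        print(f"{name}:{version} served from local cache")
        return cached
    print(f"{name}:{version} not cached, downloading")
    # download_package(name, version, dest=cached)  # hypothetical transfer call
    return cached

# The Base Layer version almost never changes, so it is almost always a cache hit;
# a typical code change transfers only the small App Layer.
app_image = fetch_layer("trainer-app", "v42")
base_image = fetch_layer("trainer-base", "v3")
```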
Achieving Transparent Layer Merging with OverlayFS
Simply having two independent package files is insufficient to launch the training program. To ensure the application runs as if it were accessing a single, complete package, we needed to transparently merge the filesystems of both layers. To achieve this, we drew inspiration from the Overlay Filesystem concept used in container technologies like Docker.
First, we packed the contents of each fbpkg into a read-only SquashFS image. Since SquashFS is read-only, we could not simply copy the contents of the App Layer over the Base Layer. Instead, we turned to OverlayFS, a union filesystem built into the Linux kernel, to handle the dynamic stacking.
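As an illustration, producing the two read-only images with the standard mksquashfs tool might look like this; the directory paths are hypothetical, and our real pipeline operated on fbpkg contents rather than plain directories.

```python
import subprocess

def make_squashfs(src_dir: str, image_path: str) -> None:
    """Pack a directory tree into a compressed, read-only SquashFS image."""
    subprocess.run(
        ["mksquashfs", src_dir, image_path, "-noappend"],
        check=True,
    )

make_squashfs("unpacked/base", "base.sqfs")  # large, rarely rebuilt
make_squashfs("unpacked/app", "app.sqfs")    # small, rebuilt on every code change
```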
OverlayFS lets us mount multiple read-only filesystems as lower layers and combine them into a unified virtual view. When the application accesses a file, the lookup is resolved in the topmost layer first (here, the App Layer) before falling back to the layers beneath it. By superimposing the App Layer onto the Base Layer, we create a logically complete filesystem containing files from both layers.
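Here is a minimal sketch of the runtime merge, assuming the two images from the previous step and root privileges. In an OverlayFS mount, the first directory listed in lowerdir is searched first, so putting the App Layer ahead of the Base Layer gives its files priority.

```python
import os
import subprocess

def mount_merged_view(app_img: str, base_img: str, mnt_root: str = "/mnt/pkg") -> str:
    """Loop-mount both read-only images, then overlay the App Layer on top."""
    app_dir, base_dir, merged = (os.path.join(mnt_root, d) for d in ("app", "base", "merged"))
    for d in (app_dir, base_dir, merged):
        os.makedirs(d, exist_ok=True)
    subprocess.run(["mount", "-t", "squashfs", "-o", "loop", app_img, app_dir], check=True)
    subprocess.run(["mount", "-t", "squashfs", "-o", "loop", base_img, base_dir], check=True)
    # With no upperdir, the overlay itself is read-only; lookups hit app_dir first,
    # then fall back to base_dir, which is exactly the merge semantics we want.
    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={app_dir}:{base_dir}", merged],
        check=True,
    )
    return merged  # launch the training program with this directory as its package root

mount_merged_view("app.sqfs", "base.sqfs")
```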
This technique allowed the program to launch and run normally within the merged virtual filesystem without requiring the Base Layer to be re-packaged or entirely re-transmitted.
Conclusion
By combining a layered build and packaging strategy with distribution caching at the node level, we transformed the overhead of building and transferring multi-gigabyte packages into an infrequent baseline setup plus frequent, low-cost incremental updates. This significantly accelerated our iteration speed and improved engineering efficiency when developing distributed machine learning systems.