Federated Learning - Build Models without Sharing Data
Some of the popular high-quality AI applications we use today, such as Google's Gemini or OpenAI's ChatGPT, are proprietary AI models. We do not have access to the source code; these companies maintain ownership and control over access, and many of their advanced features are locked behind a paywall. Closed-source code, restrictions on use, and paywalls are not inherently evil or immoral. Companies need to make money, and one way they do this is by protecting their IP and charging for their services. The algorithm behind the AI model is but one part of the puzzle. Unlike other procedural or logic-based code, AI models rely heavily on the data they are trained on, and the energy and infrastructure costs of maintaining these models make it prohibitive for companies to offer all their services for free, even if they wanted to.
Ever since I entered the world of Ubuntu Linux, I have been a fan of open-source software. In the past, I had to pay high prices or was locked out of software like Microsoft Office, Adobe Photoshop, Adobe Illustrator, MathWorks' MATLAB, and more, until I learned that Linux had open-source equivalents such as LibreOffice, Krita, GIMP, Inkscape, and Octave that satisfied my needs. This software was free: free in price and free in what we are allowed to do with it. Users like me could use this software without worrying about company usage limitations. We could write code to extend and expand the software. We could use add-ons from other users unrelated to the original software creators. And often, this software was available on multiple operating systems. The beauty of open-source software is that it is usually patent-free and available to all. It is software democratized.
But as in many democracies, people always want benefits but rarely want to pay for them. Open-source software is often built and maintained by enthusiasts and receives little to no funding support. This makes long-term development, maintenance, and updates of open-source software difficult. The challenge is even starker when the software developed by the open-source community requires running and maintaining resources, such as cloud infrastructure.
AI models are not simply software algorithms. They require data to be collected, stored, and analyzed, and models to be trained, fine-tuned, and deployed regularly. High-tech companies spend a lot of money on servers, energy, and engineers to review and fine-tune their models, which is difficult for a standard open-source project to sustain.
Enter Federated Learning
One way for open-source communities to build AI models is federated learning. Federated learning, or collaborative learning, is a technique in which multiple clients collaborate to train a model while keeping their data decentralized. This methodology was primarily designed to minimize data sharing and enhance data privacy. Unlike a centralized approach, where all training data is fed to a single model, federated learning runs several local models on different machines, each training only on the data available to that machine. Once a local model has been trained on its machine's data, only the model parameters, such as the weights and biases, are shared. The raw data never leaves the local machine, minimizing data sharing and thus the risk of data breaches over networks.
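The idea can be sketched in a few lines of plain NumPy with a toy linear model (a simplification for illustration, not any particular framework): each client runs gradient descent on its own private data and emits only the resulting weights.

```python
import numpy as np

def train_local(weights, X, y, lr=0.1, epochs=50):
    """Gradient-descent update of a linear model on data that never leaves this client."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w  # only these parameters are shared, never X or y

# Two clients with private data drawn from the same underlying relationship y = 3x
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 1)), rng.normal(size=(20, 1))
y1, y2 = 3 * X1[:, 0], 3 * X2[:, 0]

global_w = np.zeros(1)
w1 = train_local(global_w, X1, y1)
w2 = train_local(global_w, X2, y2)
print(w1, w2)  # each client recovers a weight close to 3.0 from its own data alone
```

Both clients arrive at similar parameters because their private datasets reflect the same underlying pattern; what gets transmitted is a one-element weight vector, not the twenty training samples.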
There are several ways to design a system to accomplish federated learning, and I will discuss two here.
Centralized Federated Learning
In centralized federated learning, there is a central machine (called the centralized server) and several local machines (called clients), which can be spread out across the web. The server and the clients start from the same AI model, but as each client trains its copy on the data in its local database, the models diverge. Each client sends only its updated model parameters to the server, which aggregates them to update the global model. It never sends the data itself.
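The server-side aggregation step can be sketched as a weighted average of the clients' parameter vectors, in the style of the well-known FedAvg algorithm (the client sample counts and update values below are hypothetical):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server-side FedAvg: average client parameter vectors, weighted by dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)  # shape: (num_clients, num_parameters)
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()

# Hypothetical parameter updates from three clients holding 100, 50, and 10 samples
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
new_global = fedavg(updates, [100, 50, 10])
print(new_global)  # → [1.875 2.875]
```

Weighting by dataset size keeps a client with ten samples from pulling the global model as hard as a client with a hundred; the server then broadcasts the new global parameters back to the clients for the next round.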
Decentralized Federated Learning
In decentralized federated learning, there is no global machine or centralized server. Instead, only client machines communicate with each other peer-to-peer. Client machines can join or leave the network anytime, and there is no single point of failure. This makes the decentralized system more flexible and resilient. The local machines (clients) train their local AI models based on data in their local database and only share the model parameters with other client machines.
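One common way to realize this peer-to-peer exchange is gossip averaging, where each client repeatedly averages its parameters with those of its neighbors; the ring topology and starting values below are illustrative assumptions.

```python
import numpy as np

def gossip_round(params, neighbors):
    """One peer-to-peer round: each client averages its parameters with its neighbors'."""
    return [np.mean([params[j] for j in [i] + neighbors[i]], axis=0)
            for i in range(len(params))]

# Four peers in a ring, each starting from parameters learned on its own local data
params = [np.array([0.0]), np.array([2.0]), np.array([4.0]), np.array([10.0])]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(10):
    params = gossip_round(params, ring)
print(params)  # all peers converge toward the global mean of 4.0
```

No server ever sees the parameters, and because each peer only talks to its neighbors, any single machine can drop out without taking the network down, which is exactly the resilience property described above.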
How is Federated Learning helpful to the Open-source community?
Federated learning is an excellent method for the open-source community to train AI models. It reduces the expense of running an AI project by distributing the cost of training across client machines, and it eases the burden of data collection, storage, and cleaning by keeping data on users' local machines.
A Real-World Example
Researchers from Arizona State University developed a machine learning model called Ark+. This model uses federated learning to evaluate images of chest radiography and diagnose diseases.
Chest radiography is used to diagnose lung diseases. Given chest radiographic data, machine learning models can be trained to detect and diagnose various lung diseases. However, a challenge researchers face is the lack of data to train these models. Healthcare data is protected in the United States, and sharing images of patient radiography carries risks. If the data is leaked, it can expose hospitals and medical institutions to legal risks and can harm the patients themselves. Therefore, institutions generally guard this data and are reluctant to share medical or personally identifiable information.
This reluctance to share medical data poses a problem for machine learning researchers, as it is hard to train a model without sufficient data. Machine learning models before Ark+ faced issues of generalizability, adaptability, robustness, and extensibility.
Ark+’s federated learning methodology solves the problem of sharing sensitive healthcare data. The initial Ark+ model is shared with other institutions, who can use the model to train on their local data. Once the model is trained on the local data, only the model parameters are shared with the central research hub. This way, the sensitive data is never shared over networks and remains in the local/client machines. Over time, the central model updates and improves based on the model parameters received from remote clients.
References
Ma, D., Pang, J., Gotway, M. B., & Liang, J. (2025). A fully open AI foundation model applied to chest radiography. Nature, 1–11. https://doi.org/10.1038/s41586-025-09079-8
Kim, N. (2025). An open AI model could help medical experts to interpret chest X-rays. Nature. https://doi.org/10.1038/d41586-025-01525-x