Federated Learning - Build Models without Sharing Data

Some of the most popular high-quality AI applications we use today, such as Google's Gemini and PaLM or OpenAI's ChatGPT, are proprietary AI models. We do not have access to the source code; the companies behind them maintain ownership and control over access, and many advanced features are locked behind a paywall. Closed-source code, usage restrictions, and paywalls are not inherently evil or immoral. Companies need to make money, and one way they do this is by protecting their IP and charging for their services. The algorithm behind an AI model is only one part of the puzzle. Unlike purely procedural or logic-based code, AI models rely heavily on the data they are trained on, and the energy and infrastructure costs of operating them are so high that these companies could not offer all their services for free even if they wanted to.

Ever since I entered the world of Ubuntu Linux, I have been a fan of open-source software. In the past, I had to pay high prices for, or was locked out of, software like Microsoft Office, Adobe Photoshop, Adobe Illustrator, MathWorks' MATLAB, and more, until I learned that Linux had open-source equivalents such as LibreOffice, Krita, GIMP, Inkscape, and Octave that satisfied my needs. This software was free: free in price and in what we were allowed to do with it. Users like me could use it without worrying about company usage limitations. We could write code to extend and expand the software. We could use add-ons from other users unrelated to the original creators. And often, this software was available on multiple operating systems. The beauty of open-source software is that it is usually patent-free and available to all. It is software democratized.

But like in many democracies, people always want benefits, but rarely want to pay for them. Open-source software is often built and maintained by enthusiasts and receives little to no funding support. This makes long-term development, maintenance, and updates of open-source software difficult. The challenge is even more stark when the software developed by the open-source community requires running and maintaining resources, such as in the cloud.

AI models are not simply software algorithms. They require data to be collected, stored, analyzed, trained, fine-tuned, and deployed regularly. High-tech companies spend a lot of money on servers, energy, and engineers to review and fine-tune their models, which is difficult for a standard open-source project to sustain.

Enter Federated Learning

One way for open-source communities to build AI models is federated learning. Federated learning, or collaborative learning, is a technique in which multiple clients collaborate to train a model while keeping their data decentralized. The methodology was designed primarily to minimize data sharing and enhance data privacy. Instead of feeding all training data to a single central model, several local copies of the model run on different machines, each training only on the data available on that machine. Once a local model has been trained, only its model parameters, such as the weights and biases, are shared. The original data never leaves the local machine, minimizing data sharing and thus the risk of data breaches over networks.
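The core of the idea fits in a few lines of Python. This is a toy sketch using NumPy; the dataset and the `update_message` structure are illustrative inventions, not any real federated-learning protocol. A client fits a model on data that stays local and transmits only the fitted parameters:

```python
import numpy as np

# Hypothetical local dataset -- this never leaves the client machine.
rng = np.random.default_rng(0)
X_local = rng.normal(size=(100, 3))
y_local = X_local @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Train a simple linear model locally (ordinary least squares).
weights, *_ = np.linalg.lstsq(X_local, y_local, rcond=None)

# Only the learned parameters are transmitted, not X_local or y_local.
update_message = {"weights": weights.tolist()}
print(update_message)
```

The message sent over the network contains three floating-point numbers, while the raw data (400 values here, and gigabytes in a real deployment) stays put.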

Figure 1: Federated learning using a centralized, orchestrated setup.

There are several ways to design a system to accomplish federated learning, and I will discuss two here.

Centralized Federated Learning

In centralized federated learning, there is a central machine and several local machines, which can be spread out across the web. The AI models are initially identical on the global machine (called the centralized server) and the local machines (called clients), but as each client trains its copy on the data in its local database, the models diverge. Each local machine sends only its updated model parameters to the central server, which aggregates them to update the global model. It never sends the data itself.
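A common way for the central server to combine client updates is federated averaging (often called FedAvg): the server takes a mean of the clients' parameter vectors, weighted by how much data each client trained on. A minimal sketch, assuming NumPy and made-up client values:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: weighted mean of client parameter
    vectors, weighted by the number of local training examples."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients report updated parameters after local training.
client_weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
client_sizes = [100, 100, 200]  # clients with more data get more influence

global_weights = fedavg(client_weights, client_sizes)
print(global_weights)  # weighted average: [3.5 4.5]
```

The server then pushes `global_weights` back to the clients as the starting point for the next round, and the cycle repeats.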


Figure 2: Centralized federated learning. Clients train on local data and send only updated model parameters to the central server.

Decentralized Federated Learning

In decentralized federated learning, there is no global machine or centralized server. Instead, only client machines communicate with each other peer-to-peer. Client machines can join or leave the network anytime, and there is no single point of failure. This makes the decentralized system more flexible and resilient. The local machines (clients) train their local AI models based on data in their local database and only share the model parameters with other client machines.
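One simple peer-to-peer scheme for this is gossip averaging: pairs of peers repeatedly average their parameters with each other, so every copy drifts toward the network-wide mean without any server. The sketch below is a toy illustration (NumPy, one-dimensional "parameters" for clarity; real systems exchange full weight tensors):

```python
import numpy as np

def gossip_round(params, pairs):
    """One gossip round: each listed pair of peers averages its
    parameter vectors with the other; no central server involved."""
    new = {peer: vec.copy() for peer, vec in params.items()}
    for a, b in pairs:
        avg = (params[a] + params[b]) / 2.0
        new[a], new[b] = avg, avg.copy()
    return new

# Each peer holds its own locally trained parameters.
params = {"A": np.array([0.0]), "B": np.array([4.0]), "C": np.array([8.0])}

# Peers exchange parameters pairwise over several rounds.
for pairs in [[("A", "B")], [("B", "C")], [("A", "C")]]:
    params = gossip_round(params, pairs)

print(params)  # all values have moved toward the global mean (4.0)
```

Because any pair of reachable peers can average at any time, peers joining or leaving mid-training do not stall the system, which is the resilience property described above.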

Figure 3: Decentralized Federated Learning. There is no central global AI. Instead, only client machines train the model based on local data and then share the updated model parameters with other client machines.

How is Federated Learning helpful to the Open-source community?

Federated learning is an excellent method that the open-source community should use for training AI models. Consider the benefits:

  1. Collaboration and Data Privacy: This allows developers to collaborate on training models without sharing sensitive data, which improves data privacy and collaboration.
  2. Resource Efficiency: Whether an open-source project chooses centralized or decentralized federated learning, no single server has to ingest and store all the training data, which would drive up cost and consume significant bandwidth. Instead, smaller volumes of data are trained on separate local machines, and only the model parameters are shared. Resource usage is distributed across multiple machines, and bandwidth usage drops because only the model parameters travel over the network, not the data itself.
  3. Diverse Data Utilization: Since any client machine can run the local AI model, the data on which the AI model is trained benefits from greater diversity. No central authority or organization decides which data to use.
  4. Continuous Learning: Anyone can update open-source software or fork it to create their own variant. With an open-source AI algorithm and a federated learning system, the AI model can be maintained and improved by parties other than the original creators of the AI software. The models can be continuously improved, and in the case of centralized federated learning, if the central server is taken down, a new one can be set up in its place.

This should reduce the expenses needed to run an AI project in the open-source community by distributing the cost of training across client machines. It also distributes the burden of data collection, consumption, and cleaning across the users' local machines.

A Real World Example

Researchers from Arizona State University developed a machine learning model called Ark+. This model uses federated learning to evaluate images of chest radiography and diagnose diseases.

Chest radiography is used to diagnose lung diseases. If given chest radiographic data, machine learning models can be trained to detect and diagnose various lung diseases. However, a challenge researchers face is the lack of data to train these models. Healthcare data is protected in the United States, and sharing images of patient radiography carries risks. If the data is leaked, it can expose hospitals and medical institutions to legal risks and can impact the patients themselves. Therefore, institutions generally guard this data and are reticent to share medical or personally identifiable information.

Figure 4: This diagram shows how the Ark+ model was trained using data from distinct sources with heterogeneous expert labels using federated learning (Ma et al., 2025).

For researchers, this reluctance to share medical data poses a problem, as it is hard to train a machine learning model without sufficient data. Machine learning models before Ark+ faced issues of generalizability, adaptability, robustness, and extensibility.

Ark+’s federated learning methodology solves the problem of sharing sensitive healthcare data. The initial Ark+ model is shared with other institutions, who can use the model to train on their local data. Once the model is trained on the local data, only the model parameters are shared with the central research hub. This way, the sensitive data is never shared over networks and remains in the local/client machines. Over time, the central model updates and improves based on the model parameters received from remote clients.

References

Ma, D., Pang, J., Gotway, M. B., & Liang, J. (2025). A fully open AI foundation model applied to chest radiography. Nature, 1–11. https://doi.org/10.1038/s41586-025-09079-8

Kim, N. (2025). An open AI model could help medical experts to interpret chest X-rays. Nature. https://doi.org/10.1038/d41586-025-01525-x
