Model-Centric AI vs. Data-Centric AI
(This article is an explanation of https://arxiv.org/abs/2212.11854v4)
Machine Learning Research Evolution
For many years, machine learning (ML) research has primarily concentrated on optimizing the models used for prediction and classification tasks. Key areas of focus include the choice of model type, the design of the architecture, and the tuning of hyperparameters.
To measure progress and compare different approaches, the ML community frequently uses benchmark datasets. These publicly available datasets, used in both academic research and practical competitions (e.g., Kaggle), provide a standardized way to evaluate different models. Benchmarking enables reproducibility, encourages innovation, and has driven significant improvements in model performance. Over time, this has led to the maturation of model types, architectures, and hyperparameter optimization.
Model-Centric AI
The model-centric approach to AI emphasizes optimizing the model itself. The aim is to identify the combination of model type, architecture, and hyperparameters that achieves peak performance. This strategy has led to significant advances in ML, particularly when working with predefined datasets. In recent years, however, the returns from this approach have begun to diminish: increasing model complexity no longer consistently yields large performance gains.
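The model-centric workflow can be illustrated with a minimal sketch. It is not taken from the paper: the toy dataset, the k-nearest-neighbours classifier, and the helper names (`knn_predict`, `accuracy`, `grid_search`) are all assumptions chosen for illustration. The point is only that the data stays fixed while the model configuration is varied.

```python
from collections import Counter

# Hypothetical toy data: (feature, label) pairs. In the model-centric setting
# these lists are never modified; only the model configuration changes.
TRAIN = [(0, 0), (5, 0), (10, 0), (15, 0), (19, 1),
         (27, 1), (30, 1), (35, 1), (40, 1)]
VALID = [(2, 0), (12, 0), (18, 0), (28, 1), (32, 1), (38, 1)]


def knn_predict(train, x, k):
    """Return the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of points in data that a k-NN model fit on train gets right."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)


def grid_search(train, valid, candidate_ks):
    """Score every candidate hyperparameter on the same fixed validation set."""
    scores = {k: accuracy(train, valid, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    best_k, scores = grid_search(TRAIN, VALID, [1, 3, 5])
    print("validation accuracy per k:", scores)
    print("selected k:", best_k)
```

On this toy data the sweep selects k = 3. Every gain comes from changing the model, which is also why the approach runs out of room once the remaining errors are caused by the data rather than by the configuration.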
For instance, while deep learning models, with their complex architectures and numerous layers, may outperform simpler models on certain tasks, they do not always perform better on real-world datasets, which may be noisy, sparse, or poorly labeled. Moreover, many real-world problems face challenges such as the lack of accessible datasets or pre-trained models, making the model-centric approach less effective in these cases.
Data-Centric AI
In response to the limitations of the model-centric approach, both researchers and practitioners have increasingly shifted their focus towards the data used to train models. This change reflects the growing understanding that improving the quality and quantity of data can have a more profound impact on model performance than solely optimizing the model itself. This shift has led to the emergence of data-centric AI.
Data-centric AI emphasizes improving the data pipeline rather than exclusively refining the model. Its central concern is the quality and quantity of the data used to train the model.
In this framework, improvements in model performance are viewed as a result of better data. The impact of changes in the data is reflected in the model's performance metrics (e.g., accuracy, precision, recall).
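A minimal sketch of this idea follows. It is an illustration under stated assumptions, not the paper's method: the toy data, the nearest-centroid classifier, the `CORRECTIONS` mapping (standing in for labels fixed by a hypothetical manual review), and all function names are invented for the example. The model code is identical in both runs; only the training labels differ.

```python
# Hypothetical training set in which the points at 30 and 35 carry a wrong
# label (0 instead of 1).
NOISY_TRAIN = [(0, 0), (5, 0), (10, 0), (15, 0),
               (30, 0), (35, 0), (40, 1), (45, 1)]
# Assumed output of a label review: index in NOISY_TRAIN -> corrected label.
CORRECTIONS = {4: 1, 5: 1}
TEST = [(2, 0), (8, 0), (12, 0), (18, 0), (24, 1), (27, 1), (33, 1), (42, 1)]


def apply_corrections(data, corrections):
    """Return a copy of data with the reviewed labels replaced."""
    return [(x, corrections.get(i, y)) for i, (x, y) in enumerate(data)]


def fit_centroids(data):
    """Fixed model: one centroid (mean feature value) per class."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def evaluate(train, test):
    """Fit the fixed model on train; report metrics for class 1 on test."""
    centroids = fit_centroids(train)
    pairs = [(predict(centroids, x), y) for x, y in test]
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    correct = sum(p == y for p, y in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


if __name__ == "__main__":
    print("noisy labels:  ", evaluate(NOISY_TRAIN, TEST))
    cleaned = apply_corrections(NOISY_TRAIN, CORRECTIONS)
    print("cleaned labels:", evaluate(cleaned, TEST))
```

With the noisy labels, recall on class 1 is 0.5 because the mislabeled points drag the class-0 centroid toward class 1. After the two corrections, the same model classifies every test point correctly. The metrics serve only as the measuring instrument for a change made to the data.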
Complementary Nature of Model-Centric and Data-Centric AI
Although model-centric and data-centric AI may seem distinct, they are inherently complementary. Both paradigms are critical for developing effective AI systems. In practice, improvements to both the model (via refined architectures or hyperparameters) and the data (by improving quality and relevance) are necessary for optimal results.
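To clarify this relationship: model-centric AI refines the algorithm, typically on a predefined dataset, while data-centric AI improves the dataset that the algorithm learns from.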
As AI systems evolve, it is increasingly evident that integrating both paradigms is essential. Improving data quality can help models generalize better, while refining model architectures can make more effective use of high-quality data.
Data-Centric AI in Relation to Related Concepts
Data-Centric AI shares similarities with several concepts in the Business Information Systems Engineering (BISE) community, such as Big Data, MLOps, and data-driven methods, but it also differs from each of these.
Conclusion
In conclusion, the focus of AI research is shifting from a model-centric approach (optimizing the model) to a data-centric approach (improving the data). The two paradigms complement each other: model-centric AI refines the algorithm, data-centric AI improves the dataset, and both are essential for building high-performance AI systems.
Reference:
Jakubik, J., Vössing, M., Kühl, N., Walk, J., & Satzger, G. (2024). Data-centric artificial intelligence (arXiv:2212.11854v4 [cs.AI]). https://arxiv.org/abs/2212.11854v4