Model-Centric AI vs. Data-Centric AI

(This article is an explanation of https://arxiv.org/abs/2212.11854v4)

Machine Learning Research Evolution

For many years, machine learning (ML) research has primarily concentrated on optimizing the models used for prediction and classification tasks. Key areas of focus include:

  • Model Development: Creating and testing various types of models, such as neural networks, decision trees, etc.
  • Architectural Experimentation: Exploring different model architectures, like the number of layers or neurons in a neural network.
  • Hyperparameter Tuning: Adjusting hyperparameters, such as learning rates and regularization strength, which control the training process.

To measure progress and compare different approaches, the ML community frequently uses benchmark datasets. These publicly available datasets, used in both academic research and practical competitions (e.g., Kaggle), provide a standardized way to evaluate different models. Benchmarking enables reproducibility, encourages innovation, and has driven significant improvements in model performance. Over time, this has led to the maturation of model types, architectures, and hyperparameter optimization.

Model-Centric AI

The model-centric approach to AI places the emphasis on optimizing the model itself. The aim is to identify the best combination of model type, architecture, and hyperparameters to achieve peak performance on a given dataset. This strategy has led to significant advances in ML, particularly when working with predefined datasets. In recent years, however, the returns from this approach have begun to diminish: increasing model complexity no longer consistently yields large performance gains.

For instance, while deep learning models, with their complex architectures and numerous layers, may outperform simpler models on certain tasks, they do not always perform better on real-world datasets, which may be noisy, sparse, or poorly labeled. Moreover, many real-world problems face challenges such as the lack of accessible datasets or pre-trained models, making the model-centric approach less effective in these cases.
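The model-centric loop described in this section can be illustrated with a minimal sketch: the dataset stays fixed and only a hyperparameter changes. Everything in it is an assumption made for illustration, not taken from the paper: the toy numbers, the one-dimensional ridge regression, and the helper names `fit_ridge_1d` and `grid_search`.

```python
# Model-centric sketch: the data is held fixed, only the hyperparameter varies.
# All data values and helper names here are hypothetical.

def fit_ridge_1d(xs, ys, reg_strength):
    """Closed-form 1-D ridge regression without intercept:
    w = sum(x*y) / (sum(x*x) + reg_strength)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg_strength)


def mean_squared_error(w, xs, ys):
    """Average squared difference between predictions w*x and targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, candidates):
    """Fit one model per candidate value and score each on the validation set."""
    scores = {}
    for reg_strength in candidates:
        w = fit_ridge_1d(train[0], train[1], reg_strength)
        scores[reg_strength] = mean_squared_error(w, valid[0], valid[1])
    best = min(scores, key=scores.get)
    return best, scores


# Fixed toy dataset (inputs, targets); it never changes during the search.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
valid = ([5.0, 6.0], [10.1, 11.9])

best_reg, scores = grid_search(train, valid, [0.0, 0.1, 1.0, 10.0])
print("best regularization strength:", best_reg)
```

Every iteration of the loop touches only the model configuration. If the training data were noisy or mislabeled, no value in the candidate list would repair that, which is the limitation this section describes.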

Data-Centric AI

In response to the limitations of the model-centric approach, both researchers and practitioners have increasingly shifted their focus towards the data used to train models. This change reflects the growing understanding that improving the quality and quantity of data can have a more profound impact on model performance than solely optimizing the model itself. This shift has led to the emergence of data-centric AI.

Data-centric AI emphasizes the importance of improving the data pipeline, rather than exclusively refining the model. The key principles of this approach include:

  • Focus on Data: In data-centric AI, the model remains largely fixed, and improvements are made by refining the data. This can involve cleaning the data, augmenting it, or expanding the dataset to improve its relevance, consistency, and comprehensiveness.
  • Domain-Specific Data Work: This paradigm requires deep integration of domain-specific knowledge to better understand and enhance the data. This includes correcting mislabeled data, generating new relevant data, or augmenting existing datasets with new contexts.
  • Improved Data Quality: Data-centric AI treats the quality of the data, not just its quantity, as the main lever for improving performance.

In this framework, improvements in model performance are viewed as a result of better data. The impact of changes in the data is reflected in the model's performance metrics (e.g., accuracy, precision, recall).
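A minimal sketch of this idea, under stated assumptions: the model (a nearest-centroid classifier) is held fixed, one mislabeled training example is corrected, and the effect shows up in accuracy, precision, and recall. The toy data, the helper names, and the `corrections` mapping (standing in for a domain expert's review) are all hypothetical.

```python
# Data-centric sketch: the model is fixed, only the training labels change.
# All data values and helper names here are hypothetical.

def fit_centroids(points, labels):
    """Fixed 'model': store one centroid (mean value) per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def evaluate(centroids, points, labels, positive=1):
    """Compute accuracy, precision, and recall for the positive class."""
    pairs = [(predict(centroids, x), t) for x, t in zip(points, labels)]
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    correct = sum(1 for p, t in pairs if p == t)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def apply_corrections(labels, corrections):
    """Return a copy of labels with reviewed corrections {index: label} applied."""
    fixed = list(labels)
    for index, new_label in corrections.items():
        fixed[index] = new_label
    return fixed


train_x = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
noisy_y = [0, 0, 1, 1, 1, 1]   # the example at index 2 is mislabeled
test_x = [2.5, 4.0, 5.0, 7.0, 9.5]
test_y = [0, 0, 0, 1, 1]

before = evaluate(fit_centroids(train_x, noisy_y), test_x, test_y)
clean_y = apply_corrections(noisy_y, {2: 0})   # hypothetical expert review
after = evaluate(fit_centroids(train_x, clean_y), test_x, test_y)

print("before cleaning:", before)
print("after cleaning: ", after)
```

The classifier code is identical in both runs. Any difference between `before` and `after` is therefore attributable to the data change alone, which is how a data-centric workflow reads its metrics.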

Complementary Nature of Model-Centric and Data-Centric AI

Although model-centric and data-centric AI may seem distinct, they are inherently complementary. Both paradigms are critical for developing effective AI systems. In practice, improvements to both the model (via refined architectures or hyperparameters) and the data (by improving quality and relevance) are necessary for optimal results.

To clarify this relationship:

  • Model-Centric AI enhances performance by continuously refining the model’s structure and parameters.
  • Data-Centric AI improves performance by enhancing the data used to train the model, ensuring it more accurately reflects the real-world problem the model is meant to solve.

As AI systems evolve, it is increasingly evident that integrating both paradigms is essential. Improving data quality can help models generalize better, while refining model architectures can make more effective use of high-quality data.

Data-Centric AI in Relation to Related Concepts

Data-Centric AI shares similarities with several concepts in the Business Information Systems Engineering (BISE) community, such as Big Data, MLOps, and data-driven methods, but it also differs from each of these.

  • Big Data vs. Data-Centric AI: Both paradigms rely on data to improve analytics and predictions. However, Big Data focuses primarily on the collection, storage, and processing of vast amounts of data, often without considering its quality. The assumption is that more data is always beneficial. In contrast, Data-Centric AI aims to improve performance by acquiring better-quality data rather than simply more of it, and by removing irrelevant or poor-quality data. This is particularly important in specialized domains where collecting large amounts of data may not be feasible.
  • MLOps vs. Data-Centric AI: MLOps is focused on the operationalization of AI projects, ensuring they are deployed successfully by addressing challenges like continuous development, monitoring, and reproducibility. While MLOps acknowledges the importance of data, its primary focus is on engineering principles, such as deployment pipelines, versioning, and orchestration. Data-Centric AI, on the other hand, emphasizes the iterative refinement of the data throughout the lifecycle of the AI project, including tracking and versioning datasets to assess their impact on model performance.
  • Data-Driven Methods vs. Data-Centric AI: Data-driven methods are concerned with processing data to generate actionable information for decision-making, whereas model-driven methods focus on using mathematical models (e.g., optimization or simulation). ML models typically combine both data-driven and model-driven approaches. The distinction between model-centric and data-centric AI is at a more granular level, as it refers specifically to the development of the AI model itself. Model-centric AI focuses on optimizing the model, while data-centric AI focuses on improving the data used to train it.
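The dataset tracking and versioning mentioned in the MLOps comparison above can be sketched with the standard library alone. This is one possible approach under stated assumptions, not a method prescribed by the paper: each dataset version gets a content hash, and every evaluation is logged against that hash. The records, the helper names `dataset_fingerprint` and `log_run`, and the metric values are placeholders invented for illustration.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a short, order-independent hash that identifies a dataset version."""
    canonical_rows = sorted(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256("\n".join(canonical_rows).encode("utf-8"))
    return digest.hexdigest()[:12]


def log_run(registry, records, metric_name, metric_value, note):
    """Append one evaluation result, keyed by the dataset version it used."""
    entry = {
        "data_version": dataset_fingerprint(records),
        "n_records": len(records),
        metric_name: metric_value,
        "note": note,
    }
    registry.append(entry)
    return entry


# Hypothetical records: v2 differs from v1 only by one corrected label.
v1 = [{"text": "great product", "label": 1}, {"text": "terrible", "label": 1}]
v2 = [{"text": "great product", "label": 1}, {"text": "terrible", "label": 0}]

registry = []
# The metric values below are placeholders, not measured results.
log_run(registry, v1, "accuracy", 0.0, "raw labels")
log_run(registry, v2, "accuracy", 0.0, "one label corrected")

for entry in registry:
    print(entry)
```

Because the hash is computed over sorted, canonicalized rows, reordering the records does not create a new version, while any change to a record's content does. Dedicated data versioning tools offer far more, but the principle of tying each metric to an exact data version is the same.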

Conclusion

AI research is shifting from a model-centric approach (focused on optimizing the model) to a data-centric approach (focused on improving the data). The two paradigms complement each other: model-centric AI refines the algorithm, while data-centric AI improves the dataset, and both are needed to build high-performance AI systems.

Reference:

Jakubik, J., Vössing, M., Kühl, N., Walk, J., & Satzger, G. (2024). Data-centric artificial intelligence (arXiv:2212.11854v4 [cs.AI]). https://arxiv.org/abs/2212.11854v4


Author: Sandeep Kumar E., Ph.D.(Engg.), (MBA)
