Tuesday, May 28, 2024

Moving from model-centric to data-centric AI!


Table of Contents

  • Introduction:
  • Data-Centric AI:
  • Steps for a data-centric model development approach:
    • 1. Scoping and pre-model development:
    • 2. Labeling:
    • 3. Model Development and Error Analysis Driven Iteration:
    • 4. Data monitoring:
    • 5. Model monitoring and retraining:
  • Conclusion:


Introduction:

AI is a rapidly evolving field. Until a few years ago, different tasks called for different model architectures, such as CNNs for vision tasks and LSTM-like networks for language tasks. With the invention of the Transformer architecture and its subsequent two to three years of development, however, these tasks now seem to be handled very well by a single architecture with only a few tweaks for different use cases. New, larger models are released every few months, and their accuracy exceeds that of the previous state-of-the-art models by several percentage points.

When developing AI solutions in practice, it is easy to get bogged down in an aimless search for better architectures, larger pretrained models, or optimal sets of hyperparameters with the goal of achieving even slightly higher accuracy. This has been the case for the AI community since the mid-2000s, but in early 2021 a movement led by Professor Andrew Ng shifted the way practitioners think about model improvement. Research and development of models and architectures continues (for good reason), but a new approach to model development called data-centric AI has emerged.

Data-Centric AI:

How is data-centric AI different from model-centric AI? Model-centric AI development is what the industry has traditionally done: take a fixed dataset, train a model, evaluate that model on a holdout test set, tweak the hyperparameters or modify the model to get better results, and repeat until the desired metric is achieved.

A data-centric approach differs from model-centric development in one key concept: fix the model and hyperparameters, and apply error analysis driven data iteration to improve model performance. The error analysis driven data iteration is what matters most here. The idea is to analyze the errors on different classes and samples to guide further steps such as data acquisition, collection, and enrichment. This helps improve model performance strategically instead of randomly collecting more data or augmenting specific classes.

But why move away from the current model-centric approach that has worked so well? The answer is simple: in ML/AI, data is code. The data defines how the model behaves, and in doing so plays the role of code. What data you feed the model determines its future behavior. A model-centric approach eventually hits a performance ceiling, after which a more data-centric approach can significantly improve results. Remember the garbage-in, garbage-out principle. The following slides from Professor Andrew Ng’s course show how much more a data-centric approach can improve model performance than a model-centric one.

Remember, data is code.

(*Translation Note 1) Garbage In, Garbage Out (GIGO for short) is an aphorism in computer science meaning that meaningless input data produces meaningless output.
(*Translation Note 2) Coursera offers the specialization Machine Learning Engineering for Production (MLOps), taught by Professor Ng. The course explains that effectively deploying machine learning models requires “competencies more commonly found in technical fields such as DevOps” (i.e. MLOps skills), and that such skills mean “working with ever-evolving data in production systems, as opposed to standard machine learning modeling.” These skills correspond to the data-centric AI development approach discussed in this article.

(*Translation Note 3) The table above translates to:

Approach | Metal defect detection | Solar panel | Surface inspection
Baseline | 76.2% | 75.68% | 85.05%
Model-centric | +0% | +0.04% | +0.00%
Data-centric | +16.9% | +3.06% | +0.4%


Steps for a data-centric model development approach:

How do you get started with a data-centric approach to model development? The steps begin at the scoping stage and continue through the post-deployment stage. Each step is explained in detail below.

1. Scoping and pre-model development:

Make sure you get the right data: Data is code, so it’s important to collect the right data. It’s also important to make sure the data contains the signals we’re interested in. For example, if a scratch on a cell phone screen is visible to a human inspector on a manufacturing site, but not in an image taken by the same inspector with an image acquisition device, you may need to change the lighting used to acquire the image or the camera resolution. Good data coverage is also important: the training and test datasets should cover as many cases as possible.

Establish Human Level Performance (HLP): For unstructured data such as audio and images, establishing HLP sets a baseline for the error analysis-driven data iterations of the model. It also lets you direct further iterations toward the classes that can still be improved and would yield the greatest return.

2. Labeling:

Whether the approach is model-centric or data-centric, the labeling stage is very important. Labels must be consistent and clean. This matters even more when the dataset is small, because noisy data adversely affects model performance. Labelers need a rulebook for their labeling and must follow it to ensure consistency among labelers.

For example, in the image below, a labeler might label the kittens individually (left) or as a group (right). Both are technically correct, but you need to set the rules at the beginning of labeling and let labelers know which method is preferred. Tools like LandingLens can help streamline this process.

Multiple labeling methods for a single image. Image source: Unsplash

(*Translation Note 4) LandingLens is a platform for developing image recognition models using a data-centric AI approach, provided by Landing AI, a company founded by Professor Ng. Besides data labeling, it can also be used to train and evaluate models.
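The label consistency discussed above can be checked with a number rather than by eye. The following is a minimal sketch, assuming two labelers have annotated the same samples; it computes Cohen’s kappa, a standard agreement statistic that the original article does not itself name. The labeler lists and the "single"/"group" label names are hypothetical and only echo the kitten example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must label the same non-empty sample set")
    n = len(labels_a)
    # Fraction of samples on which the two labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both labeled at random with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: did the labeler box kittens singly or as a group?
labeler_1 = ["single", "single", "group", "group", "single", "group"]
labeler_2 = ["single", "group", "group", "group", "single", "single"]
kappa = cohens_kappa(labeler_1, labeler_2)
```

A kappa close to 1 suggests the rulebook is being applied consistently, while a low value is a hint that the labeling rules need to be clarified before more data is labeled.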

3. Model Development and Error Analysis Driven Iteration:

Model-centric development involves iteratively improving model performance by tuning hyperparameters and modifying models while maintaining a fixed dataset. For this reason, model development usually occurs after all necessary/collectable data has been collected.

Data-centric development requires a different approach. Start early and move fast. Instead of waiting months or years until enough data has been gathered, start with a relatively small dataset and train your model as soon as the data size is right for your architecture.


By doing so, it is possible to know whether the correct data is being collected or whether the data collection process needs to be changed. For example, if some kind of scratch on a car is not visible in the image because of factory lighting, the problem should be identified early rather than later (for example, after waiting a year for data collection).

Data versioning tools such as Comet ML and DVC can be used to track changes in datasets over time during the iterative process.
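To make the idea of dataset versioning concrete, here is a minimal sketch of a content fingerprint: a single hash that changes whenever any file or label in the dataset changes. It is a stand-in for illustration only and is not how Comet ML or DVC are used; the function name and the in-memory file mapping are assumptions.

```python
import hashlib


def dataset_fingerprint(files):
    """Hash a dataset given as a mapping of relative path -> bytes content.

    Paths are processed in sorted order so that the fingerprint does not
    depend on the order in which files were collected.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content hash
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


# Hypothetical dataset versions: the second one corrects a single label.
version_1 = {"img_001.png": b"pixels-a", "labels.csv": b"img_001.png,scratch\n"}
version_2 = {"img_001.png": b"pixels-a", "labels.csv": b"img_001.png,ok\n"}
fingerprint_1 = dataset_fingerprint(version_1)
fingerprint_2 = dataset_fingerprint(version_2)
```

Storing the fingerprint next to each trained model makes it possible to tell later which data iteration produced which result.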

After model training, error analysis is performed on the initial model. This process is where the data-centric approach is most interesting. Error analysis is used together with HLP (if available) and key indicators to guide future actions and data collection strategies.

For example, if a flaw detection model is not performing well for a 2cm or smaller scratch on a silver car door, error analysis can be used to drive future data collection strategies. Tools like Comet ML help identify cases where model performance is subpar.
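The kind of error analysis described above can be sketched as a per-slice error rate. This is an illustrative sketch, not the output of any particular tool; the record fields `color`, `scratch_size`, `label`, and `prediction` are hypothetical names chosen to match the car door example.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Return (slice, errors, total, error_rate) tuples, worst slice first."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [errors, total]
    for rec in records:
        key = (rec["color"], rec["scratch_size"])
        counts[key][1] += 1
        if rec["label"] != rec["prediction"]:
            counts[key][0] += 1
    rows = [(key, errs, total, errs / total) for key, (errs, total) in counts.items()]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical evaluation records for a scratch detection model.
records = [
    {"color": "silver", "scratch_size": "<=2cm", "label": "scratch", "prediction": "ok"},
    {"color": "silver", "scratch_size": "<=2cm", "label": "scratch", "prediction": "scratch"},
    {"color": "silver", "scratch_size": ">2cm", "label": "scratch", "prediction": "scratch"},
    {"color": "gray", "scratch_size": "<=2cm", "label": "scratch", "prediction": "scratch"},
    {"color": "gray", "scratch_size": "<=2cm", "label": "scratch", "prediction": "scratch"},
]
worst_first = error_rate_by_slice(records)
```

The slices at the top of the result are the candidates for targeted data collection in the next iteration.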

The number of samples (of data) can be increased by:

  • Collect more data using your current data collection pipeline.
  • Apply data augmentation.
  • Synthetic data generation with packages such as Unity Perception.
(*Translation Note 5) Unity Perception is a library for generating training data for image recognition, provided by Unity Technologies, the developer of the Unity game engine. The company offers its image recognition development services under the product name Unity Computer Vision.
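Of the options listed above, data augmentation is the easiest to sketch in a few lines. The example below is only an illustration, assuming a grayscale image stored as nested lists of pixel values from 0 to 255 and assuming that flipping and brightness changes do not alter the label (which holds for many scratch images but should be verified for each use case). The function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror each row of a grayscale image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, delta):
    """Shift every pixel by delta, clipping the result to the 0..255 range."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]


def augment(image):
    """Produce a few label-preserving variants of one training image."""
    return [
        horizontal_flip(image),
        adjust_brightness(image, 30),
        adjust_brightness(image, -30),
    ]


# A tiny hypothetical 1x3 image.
variants = augment([[0, 100, 250]])
```

In practice an image library would replace these helpers; the point is that each collected sample can yield several training samples for the class that error analysis has flagged.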

In most modern deep learning architectures, adding data rarely hurts, especially when large models are used. Rather, adding more data for one class usually also improves the performance on similar classes slightly. In other words, if it is difficult to collect many samples of scratches smaller than 2 cm on a silver door, collecting samples of similar-looking doors with scratches smaller than 2 cm (e.g. gray doors) is likely to improve the accuracy of the model for silver doors as well.

By prioritizing performance improvements for one class at a time, you can focus the available resources. The class to prioritize can be chosen in a variety of ways, such as:

  • Measuring the gap between the model’s performance on a particular class and the HLP for that class (if available).
  • Using the misclassification cost of a particular class (a confusion matrix that weights each kind of misclassification).
  • Determining the percentage of data belonging to a particular class that you are likely to encounter in production.
  • Combining information such as HLP gaps and data percentages (as shown below).

A table prioritizing improvements by class. Created by the author.

(*Translation Note 6) The table above translates to:

Class | Accuracy | HLP | Difference from HLP | Probability of occurrence of data | Maximum improvement value
Red car | 95% | 100% | 5% | 30% | 1.5%
Gray car | 85% | 90% | 5% | 50% | 2.5%
Silver car | 82% | 90% | 8% | 20% | 1.6%

“Maximum improvement value” in the rightmost column of the table is the product of “difference from HLP” and “probability of occurrence of data”. It is based on the idea that a class with a large difference from HLP and a high probability of occurrence offers the greatest improvement.

From the table above, we can see that focusing on improving the performance of gray cars provides more value to the business than other cars. Doing so should also improve the accuracy of other similar classes.
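The prioritization in the table above can be reproduced with a few lines of code. This is a minimal sketch using the figures from the table; the function name `max_improvement` and the dictionary layout are assumptions made for illustration.

```python
def max_improvement(accuracy, hlp, frequency):
    """Upper bound on overall gain: gap to HLP times the class's share of data."""
    return max(hlp - accuracy, 0.0) * frequency


# Figures taken from the table above, expressed as fractions.
classes = {
    "red car": {"accuracy": 0.95, "hlp": 1.00, "frequency": 0.30},
    "gray car": {"accuracy": 0.85, "hlp": 0.90, "frequency": 0.50},
    "silver car": {"accuracy": 0.82, "hlp": 0.90, "frequency": 0.20},
}

# Classes ordered from highest to lowest potential improvement.
ranked = sorted(classes, key=lambda name: max_improvement(**classes[name]), reverse=True)
```

The gray car class comes first because its gap of 5 points applies to half of the data, even though the silver car class has the larger gap.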

Another approach is to test the latest model iteration on the samples that the previous iteration misclassified and to analyze how much the latest iteration has improved. This also shows how well the model training and data collection strategies are working. In addition, it helps identify samples that are consistently mispredicted across model iterations, and such samples in turn provide direction for future data collection strategies.
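The comparison between iterations described above amounts to simple set arithmetic over the IDs of misclassified samples. The sketch below assumes each sample has a stable ID; the function name and the example IDs are hypothetical.

```python
def compare_iterations(prev_errors, curr_errors):
    """Split misclassified sample IDs into fixed, persistent and regressed."""
    prev_errors, curr_errors = set(prev_errors), set(curr_errors)
    return {
        "fixed": prev_errors - curr_errors,        # wrong before, right now
        "persistent": prev_errors & curr_errors,   # wrong in both iterations
        "regressed": curr_errors - prev_errors,    # right before, wrong now
    }


# Hypothetical IDs of samples misclassified by two successive model iterations.
report = compare_iterations(
    prev_errors=["img_03", "img_07", "img_11", "img_20"],
    curr_errors=["img_07", "img_20", "img_31"],
)
fix_rate = len(report["fixed"]) / 4  # share of earlier errors that were fixed
```

Samples in the persistent group are the ones worth inspecting by hand, since they may point to missing data, ambiguous labels, or a signal the images do not capture.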

4. Data monitoring:

After the model is in production, the incoming data should be continuously monitored. Because the real world is constantly evolving, so is the data fed into the models to make predictions. There are usually two types of drift that occur.

Data Drift: A change in the distribution of the input X. This drift can occur for various reasons, as follows.

  • Voice input changes because people of nationalities different from those in the data the NLP system was trained on start using it.
  • In a visual inspection system, the camera is knocked out of the desired field of view.
  • The website changes. For example, users can now enter negative values, whereas only positive values were possible when the model was trained.

Concept drift: A change in the mapping from X to y. For example, a model trained on a house price dataset from 2000 probably wouldn’t work today. Concept drift usually happens more slowly than data drift, because human emotions and behaviors rarely change abruptly. But events like COVID-19 can change the concept suddenly, as when more people began ordering their groceries online than before.

Both types of drift adversely affect model performance. Because drift can occur in many ways, how to deal with it depends on the case. That said, to know whether data drift is occurring, you first need to monitor the data coming in while the model is serving. Data monitoring can be done in a variety of ways, such as plotting a histogram of the pixel values of the input images and comparing it with a reference schema (which can be obtained during training), or calculating statistics such as the range.
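The histogram comparison mentioned above could look roughly like the following sketch. It compares a normalized pixel histogram of incoming images with one taken from training data, using the total variation distance. The choice of distance, the number of bins, and the 0.2 threshold are all assumptions made for illustration and would need tuning for a real system.

```python
def normalized_histogram(values, bins=8, lo=0, hi=256):
    """Histogram of pixel values in [lo, hi), scaled so the bins sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in values:
        if lo <= value < hi:
            counts[int((value - lo) / width)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values inside [lo, hi)")
    return [count / total for count in counts]


def total_variation(p, q):
    """Half the L1 distance between two distributions (0 = same, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def drift_detected(reference_pixels, incoming_pixels, threshold=0.2):
    """Flag drift when the incoming histogram moves too far from the reference."""
    reference = normalized_histogram(reference_pixels)
    incoming = normalized_histogram(incoming_pixels)
    return total_variation(reference, incoming) > threshold


# Hypothetical data: training pixels span the full range, new images are much darker.
training_pixels = list(range(256))
dark_pixels = [value // 4 for value in range(256)]
drifted = drift_detected(training_pixels, dark_pixels)
```

A check like this only sees what the histogram captures, so it should be treated as one signal among several rather than a complete drift detector.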

Tools like Comet MPM help monitor production models and detect concept drift. Chip Huyen’s blog post “Data Distribution Shifts and Monitoring” is a must-read article detailing how to detect drift in various cases.

Samples affected by data drift and concept drift sometimes slip past monitoring systems. For example, a slight blurring of the camera image due to factory vibrations can leave the statistical properties nearly identical, bypassing data drift detectors. Therefore, it is also important to track the performance of the model.

(*Translation Note 7) Heartbeat, the data science and machine learning publication that originally published this article, ran an article in February 2022 titled “How to deal with concept drift during production”. According to that article, there are four types of concept drift:

The four types of concept drift

  • Sudden drift: concept drift caused by unforeseen circumstances such as the COVID-19 pandemic. Characterized by abrupt changes.
  • Gradual drift: concept drift caused by seasonal changes and aging. Slow changes like these also need to be dealt with.
  • Incremental drift: predictable concept drift caused by technological innovation and similar factors. For example, the penetration rate of smartphones can be predicted to some extent from the speed at which users replace their mobile phones.
  • Recurring concepts: concept drift that occurs at specific times. For example, sales during the year-end sales season have different characteristics from those in other seasons. Recurring concepts can themselves change significantly because of sudden drift. For example, during the COVID-19 crisis, travel home for the holidays, one of the recurring concepts, decreased significantly.

For concept drift, see also the AINOW translation article “Why machine learning models deteriorate when commercialized”.

Computer scientist and author Chip Huyen has written a book on how to deal with drift, Designing Machine Learning Systems (O’Reilly, 2022). She has also published a draft of the book.

5. Model monitoring and retraining:

Model monitoring is the process of tracking the model’s predictive accuracy while it is serving in production.

Models can be monitored in a variety of ways, including human-in-the-loop feedback or automated feedback signals such as click-through rates in recommendation systems. Some feedback arrives sooner than others.

Model performance can be degraded by concept drift, data drift, or a combination of both. When the model’s performance on real-world inference falls below a set threshold (the threshold depends on the use case), retraining is required. There are various ways to retrain. One is to continue training the old model on new data (transfer learning), but if the concept has changed drastically, a completely new model needs to be trained on the new data.
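The threshold rule described above can be sketched as a rolling accuracy monitor. This is a minimal illustration, assuming ground-truth feedback eventually arrives for each prediction; the class name, window size, threshold, and minimum sample count are hypothetical and depend on the use case.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag retraining."""

    def __init__(self, window=100, threshold=0.9, min_samples=20):
        self.outcomes = deque(maxlen=window)  # True for each correct prediction
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, prediction, ground_truth):
        self.outcomes.append(prediction == ground_truth)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Avoid raising an alarm on only a handful of feedback samples.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.accuracy() < self.threshold


# Hypothetical feedback stream: ten correct predictions, then five wrong ones.
monitor = AccuracyMonitor(window=10, threshold=0.8, min_samples=5)
for _ in range(10):
    monitor.record("scratch", "scratch")
healthy = monitor.needs_retraining()
for _ in range(5):
    monitor.record("ok", "scratch")
degraded = monitor.needs_retraining()
```

Using a sliding window rather than a lifetime average means a recent drop in accuracy is not hidden by a long history of good predictions.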

It is a good idea to have a system in place that stores correctly and incorrectly classified samples separately in production, so that they can be used later for analysis and retraining.

Once the model has been retrained, its performance can be evaluated by deploying it with different deployment strategies such as shadow mode, canary deployment, and blue-green deployment.

The document Application Deployment and Testing Strategies, published by Google’s Cloud Architecture Center, describes various deployment (implementation) and testing strategies. Here is a quick rundown of the strategies mentioned in this article:

Various application deployment and testing strategies

  • Blue/Green deployment pattern: A deployment strategy in which the old and new systems are deployed side by side. The new system is tested while the old system keeps running, and once testing is complete, the new system takes over and the old system is shut down.
  • Canary test pattern: A deployment strategy in which the old and new systems are deployed side by side and part of the production traffic is routed to the new system, so that it is tested while both systems are running. If the evaluation of the new system reveals no problems, only the new system is used.
  • Shadow test pattern: A deployment strategy in which the old and new systems are deployed side by side, but the new system is evaluated in a test environment.

For more information on the above implementation strategies, please refer to the aforementioned document.
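As an illustration of the canary pattern described above, the sketch below routes a fixed share of requests to the retrained model. It is an assumption-laden example rather than anything taken from the Google document: the function name, the request ID format, and the 10% share are hypothetical. Hashing the request ID keeps routing deterministic, so the same request always reaches the same model.

```python
import hashlib


def route_request(request_id, canary_fraction=0.1):
    """Send roughly canary_fraction of requests to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new_model" if bucket < canary_fraction else "old_model"


# Share of 10,000 hypothetical requests that end up on the new model.
canary_share = sum(
    route_request(f"req-{i}") == "new_model" for i in range(10_000)
) / 10_000
```

If the new model performs well on its share of traffic, the fraction can be raised step by step until the old model is no longer needed.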


Conclusion:

As we have seen, taking a data-centric approach to a problem is an iterative, data-guided process.

Data-centric means not just using good-quality data to train the model, but also using the data to measure the performance of the model, performing error analysis-driven data iterations to improve the model, and monitoring the model’s performance against that data to maintain the model.

With the development of more developer-friendly tools for taking a data-centric approach, such as Comet ML and LandingLens, the data-centric approach will become a major driver of model development and improvement in the future. Think about how you might introduce a more data-centric approach into your current and future projects, and share your thoughts on this fairly new approach to model development.


