Data labeling pipelines for machine learning.

Welcome, Marcin Bera! Today, we’ll be talking about "Data Labeling Pipelines for Machine Learning." Thank you for joining us!

Marcin Bera:

Good morning, Olga. Thank you very much for the invitation.

To start, please tell our listeners why data labeling has become so important right now.

First of all, it’s crucial for machine learning and artificial intelligence. These fields essentially live on data. Traditionally, most applications of AI and machine learning are in supervised learning, and it's predicted to stay that way for at least several more years. And for supervised learning, we need labeled data. Labeling is the process of adding one or more meaningful labels to raw data to provide context that enables a machine learning model to learn.

Sometimes these labels are clear and can be extracted directly from systems. For example, if we want to predict whether a customer will leave within the next 30 days, we typically have transactional data, from which we can extract such information and verify whether the customer actually left in that period. This is a good case from the model-building perspective, as the data is certain. However, sometimes we have to create these labels ourselves. This is especially true for machine learning projects that involve data we typically associate with AI today: text, audio, images, or video. In such cases, labeling is effectively a form of domain knowledge application, which is essential for these models to work properly.

A label in this context represents the truth about a given record. Labeling is crucial not only for the technical feasibility of AI projects but also from an organizational perspective, as it’s not always clear who should own the labeling process. It’s also one of the bottlenecks in many machine learning projects.


You mentioned that labeling is key for machine learning. So how are these data labels created? You’ve already touched on some points, but I’m curious about the entire process. I’m sure our listeners are too.

There are three main approaches. The first is manual labeling. Each label is added manually—for example, if we have a thousand images of animals and we want to build a model that recognizes a wide range of animals and also shows where on the image the animal is, we have to label each animal manually on every image. The second approach is assisted or hybrid labeling, which generally follows a few steps: we label part of the data manually, build a machine learning model based on that subset, and then use the model to label the remaining data. We only verify and potentially correct the labels from the model. The third approach is automatic labeling, also known as programmatic labeling, where we label data in bulk using certain functions. For example, when labeling product reviews, if a review contains the word "good" or "excellent," we can immediately label all records containing those words as positive reviews. Of course, the functions can be more or less complex, and this is a simple example. However, process-wise, it’s an entirely separate category, and the most automated.
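To make the programmatic approach concrete, here is a minimal sketch in Python of the keyword rule Marcin describes. The label values, keyword lists, and function name are illustrative assumptions, not part of any particular tool.

```python
# Minimal sketch of programmatic (rule-based) labeling for product reviews.
# Label values and keyword lists are illustrative assumptions.

POSITIVE = 0
NEGATIVE = 1
ABSTAIN = -1  # the rule cannot decide; leave the record unlabeled

def label_by_keywords(review: str) -> int:
    """Label a review as positive or negative using simple keyword rules."""
    text = review.lower()
    if "good" in text or "excellent" in text:
        return POSITIVE
    if "bad" in text or "terrible" in text:
        return NEGATIVE
    return ABSTAIN

reviews = [
    "Excellent build quality, works as advertised.",
    "Terrible battery life.",
    "Arrived on time.",
]

labels = [label_by_keywords(r) for r in reviews]
print(labels)  # [0, 1, -1]: the third review needs another rule or manual review
```

Real programmatic-labeling frameworks such as Snorkel combine many noisy rules like this and resolve their conflicts statistically.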

Since the labeling process can be manual or automatic, let’s consider the pros and cons of these different approaches to data labeling.

Manual labeling is particularly useful for new or emerging areas. The manual labeling process provides value beyond just creating the labels themselves, as it helps us better understand our data. In some cases, manual labeling is the only viable option, especially when deep domain expertise is required. On the downside, the most obvious drawback is that it's slow, potentially costly, and not very scalable. Privacy and security issues can also arise when many people manually review the data. And, of course, the risk of human error is a major concern, which can make manual labeling less effective.

Hybrid labeling has the advantage that we only need a small amount of manually labeled data, making it more scalable and, in the medium to long term, cheaper. We can also assign more complex or edge cases to the manual labeling team, optimizing the overall workflow. The downside is that building even an intermediate model takes time, and there’s no guarantee the results will be sufficiently accurate or of the desired quality. Hybrid labeling assumes the model provides labels and the team reviews them, but there's a risk that reviewing pre-labeled data may be less thorough than labeling from scratch. This can affect the quality of the labels, so this approach is not always recommended when label quality is critical.

As for automatic labeling, the pros are that it's super fast, scalable, and repeatable. It also allows domain experts to apply their knowledge to building the functions used for labeling. It’s promising for very large datasets, and there's flexibility, as the set of rules used for programmatic labeling is constantly growing. However, automatic labeling is still a relatively new technology and requires investment. The rules may not always be precise enough for the task at hand, raising questions about how far this approach can truly go.
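As a rough illustration of the hybrid workflow described above (label a small subset by hand, train an intermediate model, pre-label the rest, and route uncertain cases back to people), here is a sketch using scikit-learn on synthetic data. The 0.9 confidence threshold and the choice of model are assumptions.

```python
# Sketch of model-assisted (hybrid) labeling with scikit-learn.
# Assumption: a 0.9 confidence threshold decides what humans must review.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for real labels

# Step 1: a small manually labeled subset.
X_manual, y_manual = X[:100], y_true[:100]

# Step 2: train an intermediate model on it.
model = LogisticRegression().fit(X_manual, y_manual)

# Step 3: pre-label the remaining data.
X_rest = X[100:]
proba = model.predict_proba(X_rest)
confidence = proba.max(axis=1)
pre_labels = proba.argmax(axis=1)

# Step 4: only low-confidence records go back to the manual team.
needs_review = confidence < 0.9
print(f"auto-accepted: {(~needs_review).sum()}, sent to humans: {needs_review.sum()}")
```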

Considering the advantages and disadvantages of different data labeling approaches, what key elements should be carefully managed during the data labeling process?

The labeling process requires active management and quality control. It’s important to have clarity on how the data will be used, which includes understanding the project's goals, expected metrics, model performance, and its anticipated impact on business metrics. A proper taxonomy and written instructions should be established for the labeling team, and the team must understand the process. If the project is long-term, ensure knowledge transfer within the team, which may change over time. This is critical because the quality and consistency of the labels are key to the project's success. It’s also important to choose the right labeling tools that meet both technical and UX/UI needs and provide the labeling team with access to the data. Where possible, it’s worth automating the pipelines involved.

You’ve mentioned how crucial consistency and label quality are. What distinguishes data labeling pipelines from other data pipelines?

As we've established, data labeling pipelines can often be, and to a large extent are, manual, which is a significant distinction. Another difference is that there’s no natural owner of the process. On one hand, data pipelines are typically owned by data engineers, but in this specific case, they serve to build machine learning models. Therefore, the data scientist, in my opinion, should take a leading role, as they are best equipped to understand the consequences of labeling errors and can design or adjust the process to ensure consistency, especially in edge cases. Another difference is the high degree of uncertainty typical of machine learning projects. We don’t usually have complete control over the data we’ll be processing, which makes it essential to pay attention to legal limitations, such as data privacy regulations like GDPR or the EU AI Act.

Since data labeling pipelines differ significantly from other data pipelines, what are the best practices for designing data labeling pipelines?

It’s important to automate the parts of the process that are used regularly. We already know that the entire process can’t be automated, but we can automate the parts leading up to labeling and the parts that follow it. This includes automating how data enters the labeling system and how the labeled results are processed afterward. It’s also worth considering whether we only need the latest version of the labels or the entire history. If we need history, versioning the databases and the labels themselves is a good practice. We could also implement automatic retraining of machine learning models when new, updated labels appear; after all, the idea behind labeling is to build increasingly better, up-to-date machine learning models. Finally, we could reduce the manual component of labeling over time. Consider programmatic labeling where it makes sense, and continuously evaluate the maturity of the labeling process, so at some point you can transition from manual to hybrid labeling. You could also continue manual labeling but reduce its scale by using techniques such as active learning.

Active learning is a machine learning technique that helps identify the most useful records for the target model. Instead of labeling randomly, we focus on labeling the most valuable data for our task.
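One common way to implement this is uncertainty sampling: ask people to label the records the current model is least confident about. A minimal sketch on synthetic data, with the batch size and model chosen only for illustration:

```python
# Sketch of uncertainty sampling, a simple active-learning strategy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_labeled = rng.normal(size=(50, 4))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(5000, 4))  # unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)

# Uncertainty = how close the top predicted probability is to a coin flip.
proba = model.predict_proba(X_pool)
uncertainty = 1.0 - proba.max(axis=1)

# Ask humans to label the 20 records the model is least confident about.
query_idx = np.argsort(uncertainty)[-20:]
print("indices to send to the labeling team:", query_idx)
```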

In my opinion, the key question here is how data labels impact the overall quality of machine learning models.

That’s a great question. A machine learning model performs well to the extent that it accurately reflects reality. In supervised learning, we have precise metrics to measure model performance. The model has several sources of error, the two most important being model error and data label error. Model error occurs because no model can perfectly predict 100% of what we want it to. Label error arises when the data label doesn’t match reality. For instance, let’s say we’re building a model to distinguish between cats and dogs based on images. Model error would occur if the model sometimes misclassifies a cat as a dog. Label error would happen if an image of a dog is labeled as a cat. This creates confusion and makes the model less confident in its predictions. The quantity of labels helps the model learn, but good, consistent labels help minimize this second type of error.
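The effect of label error can be isolated on synthetic data by training the same model twice, once on clean labels and once with a fraction of them flipped, while keeping the test labels clean. The data, the 20% noise rate, and the model below are assumptions for illustration.

```python
# Sketch: how label errors in the training set hurt a model, on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 10))
y = (X[:, :3].sum(axis=1) > 0).astype(int)  # stand-in for ground truth
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Simulate labeling mistakes: flip 20% of the training labels.
noisy = y_tr.copy()
flip = rng.random(len(noisy)) < 0.2
noisy[flip] = 1 - noisy[flip]

# A flexible model will partly fit the wrong labels; the test labels stay
# clean, so any accuracy gap is attributable to label error alone.
clean_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
noisy_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, noisy).score(X_te, y_te)
print(f"trained on clean labels: {clean_acc:.3f}, on 20% flipped labels: {noisy_acc:.3f}")
```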

You’ve emphasized that accurate data labeling is crucial. How can we manage data labeling taxonomy to maintain this accuracy?

A taxonomy in data labeling is a systematic way of organizing knowledge and classifying data into structures, usually hierarchical ones. In the context of data labeling, taxonomy defines sets of concepts and relationships between them to allow for consistent, understandable labeling. I would split the answer into two aspects: process-related and quality-related. Process-wise, it’s crucial to carefully think through the taxonomy, to avoid relabeling the same objects, especially when it comes to manual labeling. You should consider whether the taxonomy will need to be extended in the future, and if so, design the data model to be flexible—for example, using NoSQL databases. Additionally, you need to ensure a user-friendly and intuitive interface for the labeling team. This might not seem obvious at first, but it becomes crucial with even a bit of experience in these types of projects.
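To make the flexibility point concrete: a hierarchical taxonomy can be stored as a nested document, so new categories are added without schema migrations. The category names in this sketch are invented.

```python
# Sketch: a hierarchical labeling taxonomy as a nested document,
# the kind of structure a document (NoSQL) store handles naturally.
# Category names are invented for illustration.
taxonomy = {
    "animal": {
        "mammal": {"cat": {}, "dog": {}},
        "bird": {"parrot": {}, "eagle": {}},
    },
}

def paths(node: dict, prefix: tuple = ()) -> list:
    """Flatten the tree into full label paths, e.g. ('animal', 'mammal', 'cat')."""
    out = []
    for name, children in node.items():
        here = prefix + (name,)
        out.append(here)
        out.extend(paths(children, here))
    return out

# Extending the taxonomy later is just inserting a new key:
taxonomy["animal"]["mammal"]["horse"] = {}
print(paths(taxonomy))
```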

You want to minimize the cognitive load on the labeling team by ensuring interfaces are user-friendly and the time spent switching between objects is as short as possible, so the team can concentrate on the core task. It’s also essential to involve domain experts, especially in new areas, and to monitor whether external factors change. For instance, if we’re building a model to detect financial fraud and certain transaction patterns were labeled as potential fraud a year or two ago, those patterns may no longer be relevant today. It's important to account for this in the labeling process.

Marcin, we now know how data labeling affects the quality of machine learning models. Moving forward, how can we ensure the highest quality of data labels?

There are several specific techniques to improve labeling accuracy. First, as mentioned before, you need to establish clear guidelines and instructions for labeling teams. In more difficult cases, you can use consensus labeling, where multiple people label the same data to see if they interpret the instructions consistently. This helps minimize errors due to individual interpretation. You should also audit the labeling process to check the quality of the labels and update them if necessary. This includes identifying particularly challenging cases. These audits should be both qualitative and quantitative. Quantitative approaches, like consensus, help by monitoring how often there’s agreement between different people labeling the same data and checking if this changes over time. You should also maintain quick, regular communication between the labeling team and the development team to ensure difficult cases are addressed promptly and labeled consistently. Finally, active learning, which we discussed earlier, is another technique that can help.
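The agreement monitoring Marcin mentions can start as simply as raw percent agreement, optionally chance-corrected with Cohen's kappa. A sketch for two annotators, with invented labels:

```python
# Sketch: monitoring agreement between two annotators on the same records.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "pos", "pos", "neg", "pos"]
annotator_b = ["pos", "neg", "neg", "pos", "neg", "pos"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)  # corrects for chance agreement

print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
# Tracking these over time flags drift in how the instructions are interpreted.
```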

We’ve learned a lot about data labeling today, so let’s look to the future. Marcin, what new developments can we expect in data labeling tools in the near future?


The main bottleneck in the process right now is the manual component of labeling. I expect we’ll see more pre-annotated datasets available, which will help lower the cost of acquiring new data. I also think it will become easier to add multiple labels and quickly adjust original labels that come from pre-annotated data. The use of programmatic labeling will likely increase, and I believe there will be a rise in active learning as well. This field may even see new algorithms that better identify the data with the most potential for labeling, making our machine learning tasks more efficient.

Thank you for a very interesting conversation and for sharing your knowledge. We’ve learned how important data labeling is for machine learning and the quality of the models we build. As we all know, various types of machine learning models are not just our present but also our future.

Thanks again for the invitation. See you next time.

