From the course: Hands-On Data Annotation: Applied Machine Learning

Principles of data annotation

Let's now look at some of the core principles that enable high-quality output when you're annotating your data.

First, and perhaps most important, is accuracy. The goal of data annotation in machine learning is to create a gold standard for the training data. Create clear and easy-to-use guidelines, with labels or tags that are unambiguous. Recruit the right annotators for your task. Some annotation tasks can be done by almost anyone who can read and write, if trained, of course. Other tasks require specialized domain expertise. If you're dealing with medical data, you probably need someone comfortable with the information being annotated. It should also go without saying that all annotators should be trained to watch out for bias and fairness.

The second core principle is relevance. A dataset can be well annotated but might not be relevant to the task at hand. For example, an image can be perfectly annotated for object detection, but if you're doing image classification instead, the object detection annotation is useless. Similarly, the dataset being annotated should be representative and well sampled, so that it is as close as possible to the final use case. For example, if the use case is to make predictions on web data, the annotation should be done on similar web data.

Third, quality. To guarantee the integrity of annotation, it's important to implement a process for sampling and checking annotations. Think of a production line where samples are tested at predetermined intervals, sometimes in chunks small enough that you can eyeball each annotation for quality. Another way to assess quality is what we call inter-annotator agreement scoring. This is where multiple annotators label the same data, iteratively or concurrently, so you can compare how well their annotations agree.

The fourth principle is efficiency. Efficiency is about timeliness and the cost of engaging annotators. It's important that this is well planned and documented so that the costs are fair but do not exceed the value of the task itself. To drive efficiency, it's important that the annotation tool of choice is intuitive, easy to use, and efficient for the task. Inefficient tools can cause lag and make it harder to assign, verify, monitor, and submit annotation tasks.
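
To make the inter-annotator agreement idea from the quality principle more concrete, here is a minimal sketch in Python. The course does not name a specific metric, so this example assumes Cohen's kappa, one common choice for two annotators labeling the same items. The function name `cohen_kappa` and the sample labels are hypothetical and made up for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Assumption: this metric is our choice for illustration; the course
    only refers to "inter-annotator agreement scoring" in general.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same eight images.
annotator_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_b = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.50 for this sample
```

In this sample the annotators agree on 6 of 8 items (0.75), while chance agreement is 0.5, giving a kappa of 0.5. A value near 1 suggests strong agreement, and a value near 0 suggests agreement no better than chance, which may point to ambiguous guidelines or a need for more annotator training.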
