--- title: "Using human annotators" --- import bestAnnotationPractices from '../../assets/image/best_annotation_practices.png'; import Image from '../../../components/Image.astro'; import HtmlEmbed from "../../../components/HtmlEmbed.astro"; import Note from "../../../components/Note.astro"; import Sidenote from "../../../components/Sidenote.astro"; import Accordion from "../../../components/Accordion.astro"; #### Using human annotators I suggest reading Section 3 of this [review](https://aclanthology.org/2024.cl-3.1/) of good practices in data annotation quality. If you want production level quality and have the means to implement all of these methods, go ahead! However, important guidelines (no matter your project size) are the following, once you defined your task and scoring guidelines. - **Workforce selection, and if you can monetary incentive** You likely want the people working on your task to: 1) obey some demographics. Some examples: be native speakers of the target language, have a higher education level, be experts in a specific domain, be diverse in their geographical origins, etc. Your needs will vary depending on your task. 1) produce high quality work. It's notably important now to add a way to check if answers are LLM-generated, and you'll need to filter some annotators out of your pool. *Imo, unless you're counting on highly motivated crowdsourced annotators, it's always better to pay your annotators correctly.* Unless you have highly motivated crowdsourced annotators, always pay fairly. Underpaid annotators produce lower quality work, introduce more errors, and may use LLMs to complete tasks quickly. - **Guideline design** Make sure to spend a lot of time really brainstorming your guidelines! That's one of the points on which we spent the most time for the [GAIA](https://huggingface.co/gaia-benchmark) dataset. When creating the [GAIA benchmark](https://huggingface.co/gaia-benchmark), guideline design consumed more time than any other phase. Clear, unambiguous guidelines are worth the investment—they prevent costly re-annotation rounds. - **Iterative annotation** Be ready to try several rounds of annotations, as your annotators will misunderstand your guidelines (they are more ambiguous than you think)! Generating samples several times will allow your annotators to really converge on what you need. - **Quality estimation** and **Manual curation** You want to control answers (notably via inter-annotator agreement if you can get it) and do a final selection to keep only the highest quality/most relevant answers. Specialized tools to build annotated high quality datasets like [Argilla](https://argilla.io/) can also help you. - ⭐ [How to set up your own annotator platform in a couple minutes](https://huggingface.co/learn/cookbook/enterprise_cookbook_argilla), by Moritz Laurer. A good read to get some hands on experience using open source tools (like Argilla and Hugging Face), and understanding better the dos and don'ts of human annotation at scale. - ⭐ [A guide on annotation good practices](https://aclanthology.org/2024.cl-3.1/). It's a review of all papers about human annotation dating from 2023, and it is very complete. Slightly dense, but very understandable. - [Another guide on annotation good practices](https://scale.com/guides/data-labeling-annotation-guide), by ScaleAI, specialised in human evaluations. Its a more lightweigth complement to the above document. 
Here are a few practical tips you might want to consider when using human annotators to build an evaluation dataset.

**Designing the task**

- **Simple is better**: Annotation tasks can get unnecessarily complex, so keep yours as simple as possible. Keeping the cognitive load of the annotators to a minimum will help you ensure that they stay focused and produce higher quality annotations.
- **Check what you show**: Only show the information annotators need to complete the task, and make sure you don't include anything that could introduce extra bias.
- **Consider your annotators' time**: Where and how things are displayed can introduce extra work or cognitive load and therefore negatively impact the quality of results. For example, make sure that the texts and the task are visible together and avoid unnecessary scrolling. If you combine tasks and the result of one informs the other, you can display them sequentially. Think about how everything is displayed in your annotation tool and see if there's any way you can simplify it even more.
- **Test the setup**: Once you have your task designed and some guidelines in place, make sure you test it yourself on a few samples before involving the whole team, and iterate as needed.

**During the annotation**

- **Annotators should work independently**: It's better if annotators don't help each other or see each other's work during the task, as they can propagate their own biases and cause annotation drift. Alignment should always happen through comprehensive guidelines. You may want to train any new team members first on a separate dataset and/or use inter-annotator agreement metrics to make sure the team is aligned.
- **Consistency is key**: If you make important changes to your guidelines (e.g., changed a definition or instruction, or added/removed labels), consider whether you need to iterate over the already annotated data. At the very least, track the changes in your dataset through a metadata value like `guidelines-v1`.

**Hybrid human-machine annotation**

Sometimes teams face constraints on time and resources but don't want to sacrifice the benefits of human evaluation. In these cases, you may use the help of models to make the task more efficient.

- **Model-aided annotation**: You may use the predictions or generations of a model as pre-annotations, so that the annotation team doesn't need to start from scratch. Just note that this could introduce the model's biases into human annotations, and that if the model's accuracy is poor it may increase the annotators' workload.
- **Supervise model as a judge**: You can combine the model-as-a-judge methodology (see the section on "Model as a judge") with human supervisors who validate or discard the results. Note that the biases discussed in "Pros and cons of human evaluation" will apply here.
- **Identify edge cases**: For an even faster task, use a jury of models and then have your human supervisor(s) step in where models disagree or there's a tie to break (see the sketch below). Again, be aware of the biases discussed in "Pros and cons of human evaluation".
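As an illustration of that last point, here is a minimal sketch (in Python) of how routing between a model jury and human reviewers could look. The sample IDs, labels, and hardcoded votes are purely hypothetical; in practice the votes would come from your model-as-a-judge pipeline.

```python
# Minimal sketch: keep unanimous jury verdicts, escalate everything else to humans.
jury_votes = {
    "sample_001": ["pass", "pass", "pass"],
    "sample_002": ["pass", "fail", "fail"],
    "sample_003": ["pass", "fail", "pass"],
    "sample_004": ["pass", "fail", None],  # one judge abstained / failed to answer
}

auto_accepted, needs_human_review = {}, []

for sample_id, votes in jury_votes.items():
    valid_votes = [v for v in votes if v is not None]
    if len(set(valid_votes)) == 1 and len(valid_votes) == len(votes):
        # Unanimous verdict from all judges: keep it as-is.
        auto_accepted[sample_id] = valid_votes[0]
    else:
        # Disagreement, tie, or abstention: escalate to a human reviewer.
        needs_human_review.append(sample_id)

print("Auto-accepted:", auto_accepted)
print("Escalated to human review:", needs_human_review)
```

With three or more judges you can also relax the rule to a majority vote and only escalate ties and abstentions; the stricter the rule, the more samples your human supervisors will see.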