---
license: apache-2.0
language:
- ru
- en
metrics:
- accuracy
- f1
- seqeval
tags:
- ner
- active-learning
- k-fold-cross-validation
- uncertainty-estimation
- iterative-fine-tuning
- expert-annotation-integration
- preliminary-threshold
---
# Model Card: NER Active Learning Models

## Overview

This repository contains a series of NER models trained using an active learning framework. Our approach leverages a low-quality (cheap) dataset combined with high-quality expert annotations to iteratively improve entity recognition performance. The core idea is to begin with a model trained solely on the cheap dataset (**model_llm_pure**) and then incrementally fine-tune it by selecting the most uncertain expert examples based on an uncertainty estimation module.

Our baseline model, **model_llm_pure**, achieves limited performance, while the model **model_init_12**, fine-tuned on the cheap dataset plus an additional 12% of expert examples, demonstrates a significant improvement. The active learning loop further refines the model by iteratively adding the most informative examples and saving each intermediate model in a dedicated branch.

## 1. Entity-Level Evaluation Module

This module provides an improved metric that evaluates model performance at the entity level, which is crucial for NER: a prediction counts as correct only when the entire entity is recognized, with proper boundaries and the correct label.

**Key Steps:**

1. **Prediction Collection:**
   The evaluation function processes each batch from the evaluation DataLoader and, for each sentence, collects predicted and true labels in a list-of-lists format.

2. **Metric Calculation:**
   Using the `seqeval` library, we compute:
   - **Seqeval Accuracy:** Overall accuracy at the entity level.
   - **F1-Score:** The harmonic mean of precision and recall computed over complete entities.
   - **Classification Report:** Detailed precision, recall, and F1-score for each entity type.

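To make this concrete, below is a minimal sketch of the metric calculation with `seqeval`, assuming predictions and gold labels have already been collected in the list-of-lists format described above (the tag sequences here are illustrative):

```python
from seqeval.metrics import accuracy_score, classification_report, f1_score

# Illustrative gold and predicted tag sequences, one inner list per sentence,
# in the same format as collected from the evaluation DataLoader.
true_labels = [["B-PER", "I-PER", "O"], ["B-ORG", "O", "O"]]
pred_labels = [["B-PER", "I-PER", "O"], ["B-ORG", "B-ORG", "O"]]

print("Seqeval accuracy:", accuracy_score(true_labels, pred_labels))
print("Entity-level F1: ", f1_score(true_labels, pred_labels))
print(classification_report(true_labels, pred_labels))
```
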
## 2. Uncertainty Estimation Module

This module estimates the uncertainty of each sentence by computing the average entropy of its tokens. A high average entropy indicates that the model is less confident in its predictions for that sentence.

**Process:**

1. Pass each sentence (example) through the model in evaluation mode (with gradients disabled).
2. Retrieve the logits and apply softmax to obtain a probability distribution over labels for each token.
3. Compute the entropy for each valid token (i.e., where `ner_tag_mask == 1`):

   $$
   H(\text{token}) = - \sum_{y} P(y \mid \text{token}) \log P(y \mid \text{token})
   $$

4. The average entropy across valid tokens serves as the sentence's uncertainty measure.

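A minimal PyTorch sketch of this computation, assuming a Hugging Face-style token-classification model whose output exposes `.logits`, and a `ner_tag_mask` tensor marking valid tokens (both assumptions follow the description above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def average_token_entropy(model, input_ids, attention_mask, ner_tag_mask):
    """Return one uncertainty score per sentence: the mean entropy of its valid tokens."""
    model.eval()
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    probs = F.softmax(logits, dim=-1)                                # (batch, seq_len, num_labels)
    token_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)  # (batch, seq_len)
    mask = ner_tag_mask.float()                                      # 1 for valid tokens, 0 otherwise
    return (token_entropy * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
```
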
## 3. Preliminary Threshold Experiment with K-Fold Cross-Validation

Before initiating the active learning loop, we run a preliminary experiment using 5-fold cross-validation on the expert dataset. This experiment determines the minimal volume of expert examples that yields a significant improvement over the baseline model.

**Procedure:**

1. For each percentage value (e.g., 1%, 2%, 3%, 5%, 7%, 10%) of the cheap dataset size, the corresponding number of expert examples is determined.
2. The expert dataset is split into 5 folds.
3. For each fold, a subset of expert examples is selected, combined with the cheap dataset, and the model is fine-tuned for a few epochs.
4. Evaluation metrics (F1, seqeval accuracy, validation loss) are computed and averaged over all folds.
5. A graph of F1-score versus the number of added expert examples is then plotted to identify the point where improvements saturate (a sketch of this procedure follows the figure below).

![Threshold Experiment Plot](./threshold_plot.png)

### Model Comparison

Below is a comparison of the initial evaluation metrics for the two baseline models:

| Model | Validation Loss | Seqeval Accuracy | F1-Score |
|--------------------|-----------------|------------------|----------|
| **model_llm_pure** | 0.53443 | 0.85185 | 0.47493 |
| **model_init_12**  | 0.33402 | 0.93084 | 0.65344 |

**model_init_12** is obtained by fine-tuning the base model on the cheap dataset combined with an additional 12% of expert examples, which yields significantly improved performance.

## 4. Active Learning Loop

The core active learning loop starts from a pre-trained model (typically **model_init_12**) and iteratively:

- Computes uncertainty for the remaining expert examples.
- Selects the most uncertain examples (the batch size is controlled by `batch_to_add`).
- Fine-tunes the model on the combined dataset (cheap data + newly added expert examples).
- Saves the intermediate model in a separate branch on Hugging Face.
- Stops when the improvement in F1-score falls below a set threshold after a minimum number of iterations.

**Note:** Each intermediate model is saved in its own branch (e.g., `active_iter_1_added_20`, `active_iter_2_added_40`, etc.), which allows for easy comparison and retrieval later. A condensed sketch of the loop follows.

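Under the same assumptions as the preliminary experiment (hypothetical `fine_tune` and `evaluate_f1` helpers, list-style datasets), and with `uncertainty` standing in for a per-example wrapper around the entropy computation from Section 2, the loop looks roughly like this; the stopping constants are illustrative:

```python
batch_to_add = 20     # expert examples added per iteration (illustrative)
min_iterations = 3    # minimum iterations before early stopping applies
f1_threshold = 0.002  # stop once the F1 gain drops below this value

added, prev_f1, iteration = [], 0.0, 0
while expert_pool:
    iteration += 1
    # Rank the remaining expert pool by uncertainty, most uncertain first.
    expert_pool.sort(key=lambda ex: uncertainty(model, ex), reverse=True)
    added.extend(expert_pool[:batch_to_add])
    expert_pool = expert_pool[batch_to_add:]

    # Fine-tune on cheap data plus everything added so far, then checkpoint.
    model = fine_tune(model, cheap_dataset + added, epochs=3)
    save_model_to_branch(model, REPO_NAME, f"active_iter_{iteration}_added_{len(added)}")

    f1 = evaluate_f1(model, val_dataset)
    if iteration >= min_iterations and f1 - prev_f1 < f1_threshold:
        break  # improvement fell below the threshold
    prev_f1 = f1
```
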
---

## How to Use This Repository

### Saving Intermediate Models

Each time the model is fine-tuned in the active learning loop, it is saved to a dedicated branch on Hugging Face. For example, to save the current model to a branch:

```python
branch_name = "active_iter_1_added_20"  # Example branch name
save_model_to_branch(model, REPO_NAME, branch_name)
```

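`save_model_to_branch` is a project helper; a minimal version could look like the sketch below, assuming `huggingface_hub` is installed and a recent `transformers` release whose `push_to_hub` accepts a `revision` argument:

```python
from huggingface_hub import create_branch

def save_model_to_branch(model, repo_name, branch_name):
    """Create the branch if it does not exist yet, then push the model to it."""
    create_branch(repo_id=repo_name, branch=branch_name, exist_ok=True)
    model.push_to_hub(repo_name, revision=branch_name)
```
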
### Loading a Model

To load an intermediate model from a specific branch:

```python
loaded_model = load_model_from_branch(REPO_NAME, "active_iter_1_added_20")
```

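`load_model_from_branch` can likewise be approximated with `from_pretrained` and its `revision` argument; `AutoModelForTokenClassification` is an assumption about the underlying architecture:

```python
from transformers import AutoModelForTokenClassification

def load_model_from_branch(repo_name, branch_name):
    """Load the checkpoint stored on a specific branch of the repository."""
    return AutoModelForTokenClassification.from_pretrained(repo_name, revision=branch_name)
```
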
### Recommended Workflow

1. **Preliminary Experiment:**
   Run the preliminary threshold experiment (with k-fold cross-validation) to determine the optimal percentage of expert data to start with. For instance, if the analysis indicates that adding 7% of expert examples provides a stable improvement, use that as your baseline for active learning.

2. **Initialize with Expert Data:**
   Fine-tune the base model (**model_llm_pure**) on the cheap dataset plus the selected percentage of expert examples; adding 12% of expert data, for example, produces `model_init_12`. Save this model in a dedicated branch (e.g., `model_percentage_12`).

3. **Active Learning Loop:**
   Start the active learning loop from the pre-trained `model_init_12` (by setting `use_initial_training=False`) and iteratively add batches of expert examples selected by the uncertainty estimation module.

4. **Graph Analysis:**
   After the active learning loop completes, plot F1-score against the total number of added expert examples (a sketch follows this list). This graph illustrates the improvement (or saturation) of the model as more high-quality data is incorporated.

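A minimal plotting sketch, assuming the loop logged a `history` list of `(added_examples, f1)` pairs; the values below are placeholders, not measured results:

```python
import matplotlib.pyplot as plt

history = [(20, 0.66), (40, 0.68), (60, 0.69)]  # placeholder values for illustration

n_added, f1_scores = zip(*history)
plt.plot(n_added, f1_scores, marker="o")
plt.xlabel("Total added expert examples")
plt.ylabel("Entity-level F1")
plt.title("Active learning progress")
plt.show()
```
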
---

## Conclusion

This repository documents a complete active learning workflow for NER. Our approach includes:

- An entity-level evaluation module to accurately assess performance.
- An uncertainty estimation module based on average token entropy.
- A preliminary threshold experiment using k-fold cross-validation to robustly determine the minimal volume of expert data needed.
- An iterative active learning loop that fine-tunes the model and saves intermediate checkpoints in separate branches on Hugging Face.

By following this workflow, one can observe the improvement in model performance (primarily measured by entity-level F1-score) as additional expert data is added. The saved intermediate models allow for comprehensive analysis and comparison.