---
library_name: transformers
tags:
- ner
license: apache-2.0
language:
- ru
- en
base_model:
- google-bert/bert-base-cased
---

# NERC Extraction – Stage 2 Models

This repository contains two small neural models for Named Entity Recognition (NER), trained on different annotation sources:

- **model_llm_pure**: Trained solely on low-quality annotations generated by a Large Language Model (LLM).
- **primary_model**: Fine-tuned on the original, ground-truth annotations from the CoNLL2003 dataset.

Both models use a hybrid architecture that combines a pre-trained BERT encoder for contextualized word embeddings, a bidirectional LSTM layer to capture sequence dependencies, and a linear classifier to predict NER tags. The models are evaluated with an entity-level strategy that measures the correctness of entire entities (boundaries and labels) via the `seqeval` library.

---

## Model Architecture

**Core Components** (a code sketch follows the list):

1. **Pre-trained BERT Encoder:**
   Uses `bert-base-cased` to generate high-quality contextualized embeddings for input tokens.

2. **Bidirectional LSTM (BiLSTM):**
   Processes the sequence of BERT embeddings to capture sequential dependencies, ensuring that both left and right contexts are taken into account.

3. **Linear Classification Layer:**
   Maps the output of the BiLSTM to the set of NER tags defined in the project.
   The tag set includes standard BIO tags for Person, Organization, Location, Miscellaneous, and additional special tokens (`[CLS]`, `[SEP]`, `X`).

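The following PyTorch sketch shows how these three components fit together. It is illustrative only: the class name `BertBiLSTMTagger` and hyperparameters such as the LSTM hidden size are assumptions, not the repository's actual `NERSmall` implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertBiLSTMTagger(nn.Module):
    """Illustrative BERT + BiLSTM + linear tagger (names and sizes are assumptions)."""

    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        self.lstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,  # 768 for bert-base
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # The BiLSTM concatenates forward and backward states -> 2 * lstm_hidden.
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(embeddings)
        return self.classifier(lstm_out)  # (batch, seq_len, num_tags)
```
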
---

## Training Data & Annotation Sources

- **Low-Quality (LLM) Annotations:**
  The **model_llm_pure** was trained on a dataset generated using the best method from the first stage of the project. This dataset contains approximately 1,000 sentences with LLM-generated annotations.

- **Ground-Truth Annotations (CoNLL2003):**
  The **primary_model** was trained on the original expert annotations from the CoNLL2003 dataset (approximately 14,000 sentences; see the loading sketch below).
  As a result, **primary_model** exhibits significantly improved performance over **model_llm_pure**.

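For reference, the expert-annotated corpus is available on the Hugging Face Hub and can be loaded as below. This is a sketch; depending on your `datasets` version, loading this community dataset may additionally require `trust_remote_code=True`.

```python
from datasets import load_dataset

# CoNLL2003 with expert BIO annotations (~14,000 training sentences).
conll = load_dataset("conll2003")

print(conll["train"].num_rows)        # roughly 14k sentences
print(conll["train"][0]["tokens"])    # word-level tokens
print(conll["train"][0]["ner_tags"])  # integer-encoded BIO tags
```
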
---

## Evaluation Metrics

Our evaluation strategy is based on an entity-level approach:

1. **Entity-Level Evaluation Module** (see the seqeval sketch below):
   - **Prediction Collection:** For each sentence, predicted and true labels are collected in a list-of-lists format.
   - **Seqeval Accuracy:** Measures the overall accuracy at the entity level.
   - **F1-Score:** Calculated as the harmonic mean of precision and recall for entire entities. A prediction counts as correct only if the full entity (both boundaries and label) is identified.
   - **Classification Report:** Provides detailed precision, recall, and F1-scores for each entity type.

2. **Results Comparison:**

| Model | Validation Loss | Seqeval Accuracy | F1-Score |
|--------------------|---------|---------|---------|
| **model_llm_pure** | 0.53443 | 0.85185 | 0.47493 |
| **primary_model**  | 0.09430 | 0.97955 | 0.88959 |

These results demonstrate that **primary_model** (trained on ground-truth CoNLL2003 data) significantly outperforms **model_llm_pure**, underscoring the importance of high-quality annotations in NER.

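As a minimal illustration of this entity-level scoring, the `seqeval` metrics can be computed directly from the list-of-lists label format. The label sequences below are toy examples, not project data.

```python
from seqeval.metrics import accuracy_score, classification_report, f1_score

# Toy references and predictions: one inner list per sentence.
y_true = [["B-PER", "I-PER", "O", "B-ORG"], ["B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"], ["B-MISC", "O"]]

print(accuracy_score(y_true, y_pred))         # seqeval accuracy
print(f1_score(y_true, y_pred))               # entity-level F1
print(classification_report(y_true, y_pred))  # per-entity-type precision/recall/F1
```
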
---

## Usage

### Inference

You can load either model using the Hugging Face `from_pretrained` API. For example, to load the primary model:

```python
from transformers import BertTokenizer
from your_model_module import NERSmall  # NERSmall is the project's custom BERT + BiLSTM + linear model class

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model_primary = NERSmall.from_pretrained("estnafinema0/nerc-extraction", revision="model_primary").to("cuda")
```

Similarly, to load the LLM-based model:

```python
model_llm_pure = NERSmall.from_pretrained("estnafinema0/nerc-extraction", revision="main").to("cuda")
```

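Once loaded, tagging a sentence might look like the sketch below. It assumes that `NERSmall` takes `input_ids` and `attention_mask` and returns per-token logits, and the `id2tag` mapping shown is a hypothetical stand-in for the project's actual tag vocabulary.

```python
import torch

# Hypothetical id -> tag mapping; the real mapping ships with the project code.
id2tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC"}

text = "Angela Merkel visited Paris ."
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = model_primary(inputs["input_ids"], inputs["attention_mask"])

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, tag_id in zip(tokens, pred_ids):
    print(token, id2tag.get(tag_id, "?"))
```
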
### Fine-tuning & Active Learning

This repository also serves as the basis for further active learning experiments. Evaluation uses the same entity-level strategy, so only complete entities (with correct boundaries and labels) count as correct. Our active learning experiments (described in additional documentation) have demonstrated that adding high-quality expert examples significantly improves the F1-score.

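Purely as a schematic sketch of such an experiment: the pools, the 20% increment, and random selection below are all assumptions, standing in for the actual selection strategy described in the project documentation.

```python
import random

# Stand-in pools, roughly the sizes described above; strings stand in for examples.
llm_pool = [f"llm_sentence_{i}" for i in range(1000)]
expert_pool = [f"expert_sentence_{i}" for i in range(14000)]

random.seed(0)
train_set = list(llm_pool)
for iteration in (1, 2, 3):
    # Add a fixed slice of expert data each iteration; the 20%-of-the-LLM-pool
    # increment here mirrors branch names like `active_iter_1_added_20`.
    added = set(random.sample(expert_pool, k=len(llm_pool) // 5))
    expert_pool = [s for s in expert_pool if s not in added]
    train_set.extend(added)
    print(f"iteration {iteration}: training set size = {len(train_set)}")
    # ...fine-tune the model on train_set and re-measure entity-level F1 here
```
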
---

## Training & Evaluation

**Training Environment** (see the sketch below):

- **Optimizer:** Stochastic Gradient Descent (SGD) with learning rate 0.001 and momentum 0.9.
- **Batch Size:** 32
- **Epochs:** 5 for initial training, with further fine-tuning as part of the active learning experiments.

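In PyTorch terms, this configuration corresponds to the following sketch. The cross-entropy loss and the stand-in model are assumptions; the card only specifies the optimizer, batch size, and epochs.

```python
import torch
import torch.nn as nn

# Stand-in for the BERT + BiLSTM tagger; 12 tags = B-/I- for 4 entity
# types + O + the [CLS], [SEP], and X special tokens.
model = nn.Linear(768, 12)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()  # assumed token-level loss

BATCH_SIZE = 32
NUM_EPOCHS = 5
```
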
**Evaluation Function:**

Our evaluation function computes entity-level metrics (F1, seqeval accuracy, and validation loss) by processing batches and collecting predictions in a list-of-lists format to ensure that only correctly identified complete entities contribute to the final score.

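A sketch of such a function, assuming the model returns per-token logits and that positions to be ignored (padding and special tokens) carry the label id -100; both the batch layout and the `id2tag` argument are assumptions.

```python
import torch
from seqeval.metrics import accuracy_score, f1_score

def evaluate(model, dataloader, id2tag, device="cuda"):
    """Collect per-sentence label lists and score them with seqeval."""
    model.eval()
    all_true, all_pred, total_loss = [], [], 0.0
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            logits = model(input_ids, attention_mask)  # (B, T, num_tags)
            total_loss += loss_fn(logits.transpose(1, 2), labels).item()

            preds = logits.argmax(dim=-1)
            for pred_row, label_row in zip(preds, labels):
                keep = label_row != -100  # drop padding/special positions
                all_true.append([id2tag[i.item()] for i in label_row[keep]])
                all_pred.append([id2tag[i.item()] for i in pred_row[keep]])

    return {
        "loss": total_loss / len(dataloader),
        "seqeval_accuracy": accuracy_score(all_true, all_pred),
        "f1": f1_score(all_true, all_pred),
    }
```
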
---

## Additional Information

- **Repository:** All models and intermediate checkpoints are stored in separate branches of the repository. For instance, **primary_model** is available in the branch `model_primary`, while other models (from the active learning experiments) are stored in branches with names indicating the iteration and percentage of added expert data (e.g., `active_iter_1_added_20`).

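To enumerate the available branches programmatically, you can list the repository's refs with `huggingface_hub`:

```python
from huggingface_hub import list_repo_refs

refs = list_repo_refs("estnafinema0/nerc-extraction")
for branch in refs.branches:
    print(branch.name)  # e.g. main, model_primary, active_iter_1_added_20, ...
```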