---
library_name: transformers
tags:
- ner
license: apache-2.0
language:
- ru
- en
base_model:
- google-bert/bert-base-cased
---
# NERC Extraction – Stage 2 Models
This repository contains two small neural models for Named Entity Recognition (NER) that have been trained using different annotation sources:
- **model_llm_pure**: Trained solely on low-quality annotations generated by a Large Language Model (LLM).
- **primary_model**: Fine-tuned on the original, ground-truth annotations from the CoNLL2003 dataset.
Both models use a hybrid architecture combining a pre-trained BERT model for contextualized word embeddings, a bidirectional LSTM layer to capture sequence dependencies, and a linear classifier to predict NER tags. The models are evaluated using an entity-level evaluation strategy that measures the correctness of entire entities (including boundaries and labels) using the `seqeval` library.
---
## Model Architecture
**Core Components:**
1. **Pre-trained BERT Encoder:**
Uses `bert-base-cased` to generate high-quality contextualized embeddings for input tokens.
2. **Bidirectional LSTM (BiLSTM):**
Processes the sequence of BERT embeddings to capture sequential dependencies, ensuring that both left and right contexts are taken into account.
3. **Linear Classification Layer:**
Maps the output of the BiLSTM to the set of NER tags defined in the project.
The tag set includes standard BIO tags for Person, Organization, Location, Miscellaneous, and additional special tokens (`[CLS]`, `[SEP]`, `X`).
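As a sketch, the three components above can be wired together as follows. This is a hypothetical PyTorch module, not the repository's exact code: the hidden sizes and the 12-tag count (9 BIO tags for the four entity types plus `[CLS]`, `[SEP]`, `X`) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BertBiLSTMTagger(nn.Module):
    """Sketch of the BERT -> BiLSTM -> linear classifier pipeline described above."""

    def __init__(self, encoder, encoder_dim=768, lstm_hidden=256, num_tags=12):
        super().__init__()
        # encoder: e.g. transformers.BertModel.from_pretrained("bert-base-cased")
        self.encoder = encoder
        self.bilstm = nn.LSTM(encoder_dim, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextualized embeddings from BERT: (batch, seq_len, encoder_dim)
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # BiLSTM output concatenates both directions: (batch, seq_len, 2 * lstm_hidden)
        lstm_out, _ = self.bilstm(hidden)
        # Per-token logits over the tag set: (batch, seq_len, num_tags)
        return self.classifier(lstm_out)
```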
---
## Training Data & Annotation Sources
- **Low-Quality (LLM) Annotations:**
The **model_llm_pure** was trained on a dataset generated using the best method from the first stage of the project. This dataset contains approximately 1,000 sentences with LLM-generated annotations.
- **Ground-Truth Annotations (CoNLL2003):**
The **primary_model** was trained on the original expert annotations from the CoNLL2003 dataset (approximately 14,000 sentences).
As a result, **primary_model** exhibits significantly improved performance over **model_llm_pure**.
---
## Evaluation Metrics
Our evaluation strategy is based on an entity-level approach:
1. **Entity-Level Evaluation Module:**
- **Prediction Collection:** For each sentence, predicted and true labels are collected in a list-of-lists format.
   - **Seqeval Accuracy:** Overall label accuracy over tokens, as computed by `seqeval`'s `accuracy_score`.
- **F1-Score:** Calculated as the harmonic mean of precision and recall for entire entities. A correct prediction requires that the full entity (with correct boundaries and label) is identified.
- **Classification Report:** Provides detailed precision, recall, and F1-scores for each entity type.
2. **Results Comparison:**
| Model | Validation Loss | Seqeval Accuracy | F1-Score |
|-----------------|-----------------|------------------|----------|
| **model_llm_pure** | 0.53443 | 0.85185 | 0.47493 |
| **primary_model** | 0.09430 | 0.97955 | 0.88959 |
These results demonstrate that **primary_model** (trained on ground-truth CoNLL2003 data) achieves significantly better performance compared to **model_llm_pure**, reflecting the importance of high-quality annotations in NER.
---
## Usage
### Inference
You can load either model using the Hugging Face `from_pretrained` API. For example, to load the primary model:
```python
from transformers import BertTokenizer
from your_model_module import NERSmall  # the project's custom BERT+BiLSTM model class

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model_primary = NERSmall.from_pretrained("estnafinema0/nerc-extraction", revision="model_primary").to("cuda")
model_primary.eval()  # inference mode
```
Similarly, to load the LLM-based model:
```python
model_llm_pure = NERSmall.from_pretrained("estnafinema0/nerc-extraction", revision="main").to("cuda")
```
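With a loaded model, per-token logits can be decoded into BIO tags along these lines. This is a sketch: the `id2tag` mapping below is illustrative, while the real mapping is defined by the project's tag vocabulary.

```python
import torch

# Illustrative tag vocabulary; the actual mapping comes from the training code.
id2tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC", 4: "I-LOC"}

def decode_tags(logits, attention_mask):
    """Map per-token logits (batch, seq_len, num_tags) to BIO tag strings,
    skipping padding positions where attention_mask is 0."""
    pred_ids = logits.argmax(dim=-1)
    batch_tags = []
    for ids, mask in zip(pred_ids, attention_mask):
        batch_tags.append([id2tag[i.item()] for i, m in zip(ids, mask) if m == 1])
    return batch_tags
```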
### Fine-tuning & Active Learning
This repository also serves as the basis for further active learning experiments. Evaluation again follows the entity-level strategy, in which only complete entities (with correct boundaries and labels) count as correct. Our active learning experiments (described in additional documentation) show that adding high-quality expert examples significantly improves the F1-score.
---
## Training & Evaluation
**Training Environment:**
- **Optimizer:** Stochastic Gradient Descent (SGD) with learning rate 0.001 and momentum 0.9.
- **Batch Size:** 32
- **Epochs:** Models are trained for 5 epochs during initial training (with further fine-tuning as part of active learning experiments).
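A minimal training loop under these settings might look like the following. This is a sketch, not the repository's code: it assumes the model returns per-token logits and that padded/subword positions carry the label `-100`.

```python
import torch
import torch.nn as nn

def train(model, loader, device="cuda", epochs=5):
    """Training loop using the hyperparameters above: SGD with lr=0.001 and
    momentum=0.9; `loader` is assumed to yield batches of size 32."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss(ignore_index=-100)  # skip padding/subword labels
    model.train()
    for epoch in range(epochs):
        total = 0.0
        for input_ids, attention_mask, labels in loader:
            input_ids, attention_mask, labels = (
                t.to(device) for t in (input_ids, attention_mask, labels))
            optimizer.zero_grad()
            logits = model(input_ids, attention_mask)  # (batch, seq_len, num_tags)
            loss = criterion(logits.view(-1, logits.size(-1)), labels.view(-1))
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / len(loader):.4f}")
```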
**Evaluation Function:**
Our evaluation function computes validation loss together with entity-level metrics (`seqeval` accuracy and F1) by collecting predicted and gold labels per sentence in a list-of-lists format, so that an entity contributes to the F1-score only when its full span and label are predicted correctly.
---
## Additional Information
- **Repository:** All models and intermediate checkpoints are stored in separate branches of the repository. For instance, **primary_model** is available in the branch `model_primary`, while other models (from active learning experiments) are stored in branches with names indicating the iteration and percentage of added expert data (e.g., `active_iter_1_added_20`).