---
language: en
license: apache-2.0
tags:
- token-classification
- named-entity-recognition
- conll2003
- modernbert
datasets:
- lhoestq/conll2003
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
---
# Model Card for ModernBERT-large fine-tuned on CoNLL-2003 (NER)
A **Named Entity Recognition** model based on `answerdotai/ModernBERT-large`, fine-tuned on the English CoNLL-2003 dataset. It identifies and classifies entities into four types: **Person**, **Organization**, **Location**, and **Miscellaneous**.
## Model Details
- **Base model:** [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
- **Task:** Token classification (NER)
- **Dataset:** [lhoestq/conll2003](https://huggingface.co/datasets/lhoestq/conll2003) (CoNLL-2003 English)
- **Number of labels:** 9, in BIO format (see the label mapping table below)
- **Training procedure:** Fine-tuning with Optuna hyperparameter search (20 trials)
- **Evaluation metric:** `seqeval` (overall precision, recall, F1, accuracy)
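For reference, a seqeval-based `compute_metrics` function for this label set typically looks like the following. This is a minimal sketch assuming the `evaluate` library; the exact training script is not reproduced here.

```python
import evaluate
import numpy as np

seqeval = evaluate.load("seqeval")
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop special tokens (label -100) before scoring
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```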
### Label Mapping
| Label ID | Entity Tag |
|----------|-------------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
| 7 | B-MISC |
| 8 | I-MISC |
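In code, this table corresponds to the `id2label` / `label2id` entries stored in the model config, along the lines of:

```python
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}
label2id = {label: idx for idx, label in id2label.items()}
```

The mapping shipped with the checkpoint can be inspected via `AutoConfig.from_pretrained("violetar/ner-model").id2label`.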
## Training Procedure
### Hyperparameter Search
An **Optuna** study (20 trials) maximized validation F1 over the following search space (see the code sketch after these lists):
- Learning rate: `[1e-5, 5e-4]` (log scale)
- Batch size per device: `[8, 16, 32]`
- Number of epochs: `[2, 6]`
- Weight decay: `[0.0, 0.1]`
- Warmup ratio: `[0.0, 0.2]`
- Gradient accumulation steps: `[1, 4]`
Other fixed training arguments:
- Evaluation batch size: 8
- Max sequence length: 256
- Evaluation strategy: epoch
- Save strategy: epoch
- Best model selection based on validation F1
- Seed: 42
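A minimal sketch of how such a search can be wired together with `Trainer.hyperparameter_search` and the Optuna backend. The exact training script is not reproduced here: `tokenized` is assumed to be the dataset produced by the alignment step described below, and `compute_metrics` is the seqeval function sketched above.

```python
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

def model_init():
    # A fresh model for every trial so runs do not share weights
    return AutoModelForTokenClassification.from_pretrained(
        "answerdotai/ModernBERT-large", num_labels=9
    )

def hp_space(trial):
    # Mirrors the search space listed above
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 6),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.0, 0.2),
        "gradient_accumulation_steps": trial.suggest_categorical(
            "gradient_accumulation_steps", [1, 4]
        ),
    }

args = TrainingArguments(
    output_dir="ner-model",
    eval_strategy="epoch",   # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    per_device_eval_batch_size=8,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    seed=42,
)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized["train"],   # assumed: tokenized CoNLL-2003 splits
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,    # assumed: seqeval function from above
)

best_trial = trainer.hyperparameter_search(
    direction="maximize", backend="optuna", hp_space=hp_space, n_trials=20
)
```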
### Training Data
- **Training set:** CoNLL-2003 `train` split
- **Validation set:** CoNLL-2003 `validation` split (used for early stopping / best model selection)
- **Test set:** CoNLL-2003 `test` split (final evaluation)
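The splits can be loaded directly from the Hub (a quick sketch):

```python
from datasets import load_dataset

dataset = load_dataset("lhoestq/conll2003")
print(dataset)  # DatasetDict with train / validation / test splits
```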
### Tokenizer Alignment
During tokenization, the original words are split into subwords. Continuation subwords of the same word are assigned the **inside** (`I-`) label of the corresponding entity class, where applicable. For example, if “Microsoft” is split into two subwords and the original tag is `B-ORG`, the first subword is labeled `B-ORG` and the second `I-ORG`. This is implemented in the `align_labels` function.
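A minimal sketch of what such an alignment function can look like, assuming a fast tokenizer (so `word_ids()` is available), the label ids above, and the standard convention of masking special tokens with `-100`; the repository's actual `align_labels` may differ in details:

```python
def align_labels(examples, tokenizer, max_length=256):
    # Tokenize pre-split words; `is_split_into_words` keeps word boundaries recoverable
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=max_length,
        is_split_into_words=True,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        labels = []
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)  # special tokens: ignored by the loss
            elif word_id != previous_word:
                labels.append(word_labels[word_id])  # first subword keeps the word's tag
            else:
                label = word_labels[word_id]
                # Continuation subwords: map B-X (odd ids) to the matching I-X (id + 1)
                labels.append(label + 1 if label % 2 == 1 else label)
            previous_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized
```

It would typically be applied with `dataset.map(lambda batch: align_labels(batch, tokenizer), batched=True)`.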
## Evaluation Results
After hyperparameter search, the best trial achieved the following results on the **test** set:
- **Precision:** 0.87
- **Recall:** 0.91
- **F1:** 0.89
- **Accuracy:** 0.97
## How to Use
### Quick Pipeline
```python
from transformers import pipeline

# `aggregation_strategy="simple"` merges subword predictions into whole-entity spans
ner = pipeline("token-classification", model="violetar/ner-model", aggregation_strategy="simple")

sentence = "John Smith works at Microsoft in New York."
results = ner(sentence)

for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} (score: {entity['score']:.2f})")
```
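### Manual Inference
For finer-grained control, e.g. inspecting per-token predictions instead of aggregated spans, the model can also be called directly (a short sketch):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("violetar/ner-model")
model = AutoModelForTokenClassification.from_pretrained("violetar/ner-model")

inputs = tokenizer("John Smith works at Microsoft in New York.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Per-token argmax, mapped back to the BIO label names
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```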