---
license: apache-2.0
tags:
- generated_from_trainer
- token-classification
- ner
- nlp
datasets:
- conll2003
language:
- en
pipeline_tag: token-classification
library_name: transformers
base_model: bert-base-uncased
model-index:
- name: token-classification-ai-fine-tune
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition (NER)
    dataset:
      name: CoNLL-2003
      type: conll2003
    metrics:
    - name: Validation Loss
      type: loss
      value: 0.0474
widget:
- text: "Apple is buying a U.K. startup for $1 billion"
---
# token-classification-ai-fine-tune
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/bniladridas/token-classification-ai-fine-tune)
This model is a fine-tuned version of [bert-base-uncased](https://huggingface.co/bert-base-uncased) on the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset, achieving a validation loss of **0.0474**.
## Model Description
This is a token-classification model fine-tuned for **Named Entity Recognition (NER)** on top of the `bert-base-uncased` architecture. It identifies entities such as persons, organizations, and locations in English text, and this upload is tuned for CPU-friendly deployment. Uploaded by [bniladridas](https://huggingface.co/bniladridas), it performs well on the CoNLL-2003 benchmark. For a GPU-accelerated version with CUDA support, see the [GitHub repository](https://github.com/bniladridas/token-classification-ai-fine-tune).
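A quick way to try the model is through the `transformers` pipeline API. A minimal sketch (the example sentence is the widget example from the metadata above):
```python
from transformers import pipeline

# Load the fine-tuned NER model from the Hub; runs on CPU by default.
ner = pipeline(
    "token-classification",
    model="bniladridas/token-classification-ai-fine-tune",
    aggregation_strategy="simple",  # merge word-piece tokens into whole entities
)

# Returns a list of dicts with entity_group, score, word, start, and end.
print(ner("Apple is buying a U.K. startup for $1 billion"))
```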
## Intended Uses & Limitations
### Intended Uses
- Extracting named entities from unstructured text (e.g., news articles, reports)
- Powering NLP pipelines on CPU-based systems
- Research or lightweight production use
### Limitations
- Trained on English text from CoNLL-2003, so it may not generalize well to other languages or domains
- Uses `bert-base-uncased` tokenization (lowercase-only), so casing cues such as capitalized proper nouns are lost (see the example after this list)
- Optimized for NER; additional tuning needed for other token-classification tasks
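To see the casing limitation concretely: `bert-base-uncased` lowercases text before tokenizing, so capitalization never reaches the model. A minimal sketch:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Both spellings produce identical tokens, so the case cue is lost.
print(tokenizer.tokenize("Apple unveiled a new phone"))   # ['apple', 'unveiled', 'a', 'new', 'phone']
print(tokenizer.tokenize("apple unveiled a new phone"))   # ['apple', 'unveiled', 'a', 'new', 'phone']
```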
## Training and Evaluation Data
The model was trained and evaluated on the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset, a standard NER benchmark of annotated English news articles covering persons, organizations, locations, and miscellaneous names, split into training, validation, and test sets. The metrics reported here come from the validation split.
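The dataset loads directly with the `datasets` library; a minimal sketch:
```python
from datasets import load_dataset

dataset = load_dataset("conll2003")

# Standard splits: train / validation / test.
print(dataset)

# NER labels use the IOB2 scheme: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC.
print(dataset["train"].features["ner_tags"].feature.names)
```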
## Training Procedure
### Training Hyperparameters
The following hyperparameters were used during training (see the `TrainingArguments` sketch after the list):
- **learning_rate**: 2e-05
- **train_batch_size**: 8
- **eval_batch_size**: 8
- **seed**: 42
- **optimizer**: Adam with betas=(0.9,0.999) and epsilon=1e-08
- **lr_scheduler_type**: linear
- **lr_scheduler_warmup_steps**: 500
- **num_epochs**: 3
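These settings map directly onto `transformers.TrainingArguments`. A minimal sketch, with the listed Adam betas and epsilon spelled out (they are also the library defaults) and per-epoch evaluation assumed from the results table below:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="token-classification-ai-fine-tune",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
    adam_beta1=0.9,      # Adam betas/epsilon as listed above (also the defaults)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    evaluation_strategy="epoch",  # assumption: inferred from the per-epoch losses below
)
```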
### Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.048 | 1.0 | 1756 | 0.0531 |
| 0.0251 | 2.0 | 3512 | 0.0473 |
| 0.016 | 3.0 | 5268 | 0.0474 |
### Framework Versions
- **Transformers**: 4.28.1
- **PyTorch**: 2.0.1
- **Datasets**: 1.18.3
- **Tokenizers**: 0.13.3
### Additional Notes
This version is optimized for CPU use with these intentional adjustments (sketched in code after the list):
1. **Full-precision training**: fp16 disabled for broader CPU compatibility
2. **Streamlined batch sizes**: Set to 8 for efficient CPU processing
3. **Simplified workflow**: Gradient accumulation skipped for smoother CPU runs
4. **Full feature set**: All monitoring (e.g., TensorBoard) and checkpoint-saving capabilities retained
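In `TrainingArguments` terms, those adjustments amount to roughly the following. This is a hedged sketch, not the original training script, which is not published here:
```python
from transformers import TrainingArguments

cpu_args = TrainingArguments(
    output_dir="token-classification-ai-fine-tune",
    fp16=False,                     # 1. full-precision training, no mixed precision
    per_device_train_batch_size=8,  # 2. streamlined batch size for CPU
    gradient_accumulation_steps=1,  # 3. no gradient accumulation
    report_to=["tensorboard"],      # 4. keep TensorBoard monitoring
    save_strategy="epoch",          # 4. keep checkpoint saving (per-epoch is an assumption)
)
```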
For the GPU version with CUDA, mixed precision, and gradient accumulation, check out the [GitHub repository](https://github.com/bniladridas/token-classification-ai-fine-tune). To clone it, run:
```bash
git clone https://github.com/bniladridas/token-classification-ai-fine-tune.git
```
This model was pushed to the Hugging Face Hub for easy CPU-based deployment.