# Model Card: BERT for Named Entity Recognition (NER)

## Model Overview

This model, **sbert-conll-ner**, is a fine-tuned version of `bert-base-uncased` for Named Entity Recognition (NER), trained on the CoNLL-2003 dataset. It identifies and classifies entities in text, such as **person names (PER)**, **organizations (ORG)**, **locations (LOC)**, and **miscellaneous (MISC)** entities.

### Model Architecture
- **Base Model**: BERT (Bidirectional Encoder Representations from Transformers) with the `bert-base-uncased` architecture.
- **Task**: Token Classification (NER).

## Training Dataset

- **Dataset**: CoNLL-2003, a standard benchmark for NER, containing English newswire sentences annotated with named entity spans.
- **Classes**:
  - `PER` (Person)
  - `ORG` (Organization)
  - `LOC` (Location)
  - `MISC` (Miscellaneous)
  - `O` (Outside of any entity span)
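
The dataset is available on the Hugging Face Hub; a minimal sketch of inspecting it (the `conll2003` dataset id and field names are as published on the Hub, though newer versions of `datasets` may host it under a namespaced id such as `eriktks/conll2003`):

```python
from datasets import load_dataset

# CoNLL-2003 as published on the Hugging Face Hub
ds = load_dataset("conll2003")

example = ds["train"][0]
print(example["tokens"])    # e.g. ['EU', 'rejects', 'German', 'call', ...]
print(example["ner_tags"])  # integer-encoded BIO tags, e.g. [3, 0, 7, 0, ...]
```
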
## Performance Metrics

The model achieves strong results on the CoNLL-2003 evaluation set:

| Metric        | Value  |
|---------------|--------|
| **Loss**      | 0.0649 |
| **Precision** | 93.59% |
| **Recall**    | 95.07% |
| **F1 Score**  | 94.32% |
| **Accuracy**  | 98.79% |

These metrics indicate that the model identifies and classifies entities accurately and robustly.
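
For CoNLL-style NER, precision, recall, and F1 are conventionally computed at the entity level (a predicted span counts only if both its boundaries and its type match), e.g. with the `seqeval` library. A minimal sketch of that computation, with toy sequences for illustration:

```python
# pip install seqeval
from seqeval.metrics import f1_score, precision_score, recall_score

# One tag sequence per sentence, in BIO format
y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "O", "O"]]  # truncates the LOC span

print(precision_score(y_true, y_pred))  # 0.5 -- 1 of 2 predicted spans is exact
print(recall_score(y_true, y_pred))     # 0.5 -- 1 of 2 gold spans is recovered
print(f1_score(y_true, y_pred))         # 0.5
```
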
## Training Details

- **Optimizer**: AdamW (Adam with decoupled weight decay)
- **Learning Rate**: 2e-5
- **Batch Size**: 8
- **Number of Epochs**: 3
- **Scheduler**: Linear decay with warm-up steps
- **Loss Function**: Cross-entropy loss with ignore index `-100`, so padding tokens do not contribute to the loss
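
A minimal sketch of how these hyperparameters map onto the Hugging Face `Trainer` API. The original training script was not published, so the output path, warm-up step count, and the pre-tokenized dataset (`tokenized_train`) below are illustrative assumptions:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# CoNLL-2003 BIO tag set: O plus begin/inside tags for the four classes
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

args = TrainingArguments(
    output_dir="bert-conll-ner",    # hypothetical output path
    learning_rate=2e-5,             # per the list above
    per_device_train_batch_size=8,  # per the list above
    num_train_epochs=3,             # per the list above
    lr_scheduler_type="linear",     # linear decay with warm-up
    warmup_steps=500,               # warm-up count not published; illustrative
)

# The collator pads label sequences with -100, which cross-entropy ignores,
# so padding positions never contribute to the loss
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # assumes a pre-tokenized CoNLL-2003 train split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```
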
## Model Input/Output

- **Input Format**: Tokenized text with the special tokens `[CLS]` and `[SEP]`.
- **Output Format**: Token-level predictions with labels drawn from the BIO tag set (`B-PER`, `I-PER`, etc.), where `B-` marks the first token of an entity and `I-` marks continuation tokens. The full tag set is shown below.
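
For reference, the nine tags the model predicts over. The index order below follows the Hugging Face `conll2003` dataset and is an assumption; the checkpoint's own `config.id2label` is authoritative:

```python
# BIO tags for CoNLL-2003; check model.config.id2label for the actual mapping
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
    7: "B-MISC", 8: "I-MISC",
}
```
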
## How to Use the Model

### Installation

```bash
pip install transformers
```

### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Download the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sfarrukh/modernbert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/modernbert-conll-ner")
```

### Running Inference

```python
from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- tagged subword pieces
# into whole entity spans with character offsets
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

text = "John lives in New York City."
result = nlp(text)
print(result)
```

Example output:

```python
[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]
```

Note that `word` is lowercased because the underlying tokenizer is uncased; use the `start`/`end` character offsets to recover the original surface form.

## Limitations

1. **Domain-Specific Adaptability**: Performance may drop on domain-specific text (e.g., legal or medical) that differs from the newswire data in CoNLL-2003.
2. **Ambiguity**: Ambiguous entities and overlapping spans are not explicitly handled.

## Recommendations

- For domain-specific tasks, consider fine-tuning this model further on a relevant annotated dataset.
- BERT accepts at most 512 tokens per input, so pre-process long texts by splitting them into smaller segments before inference, as sketched below.
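
A minimal chunking sketch, assuming the `nlp` pipeline from the inference example above; the character budget and sentence-boundary heuristic are illustrative and should be tuned for your data:

```python
def ner_long_text(text, nlp, max_chars=1000):
    """Run NER over a long document chunk by chunk, re-basing the
    character offsets of each entity onto the full document."""
    entities = []
    offset = 0
    while offset < len(text):
        chunk = text[offset:offset + max_chars]
        # Prefer to cut at the last sentence boundary inside the chunk
        cut = chunk.rfind(". ")
        if cut != -1 and offset + len(chunk) < len(text):
            chunk = chunk[:cut + 1]
        for entity in nlp(chunk):
            entity["start"] += offset
            entity["end"] += offset
            entities.append(entity)
        offset += len(chunk)
    return entities
```
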
## Acknowledgements

- **Transformers Library**: Hugging Face
- **Dataset**: CoNLL-2003
- **Base Model**: `bert-base-uncased` by Google