---
language: en
license: mit
tags:
- token-classification
- named-entity-recognition
- ner
- contact-management
- roberta
base_model: roberta-base
datasets:
- custom
model-index:
- name: assistant-bot-ner-model
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: accuracy
      value: 0.951
      name: Accuracy
    - type: f1
      value: 0.946
      name: F1 Score
---

# NER Model for Contact Management Assistant Bot

This model is a fine-tuned RoBERTa-base model for Named Entity Recognition (NER) in contact management tasks.

## Model Description

- **Developed by:** Mykyta Kotenko
- **Base Model:** [roberta-base](https://huggingface.co/roberta-base) by Facebook AI
- **Task:** Token Classification (Named Entity Recognition)
- **Language:** English
- **License:** MIT
- **Accuracy:** 95.1%
- **Entity Accuracy:** 93.7%
- **F1 Score:** 94.6%

## Supported Entities

This model extracts the following entity types:

- **NAME**: Person's name (first name only or full name)
- **PHONE**: Phone numbers in various formats
- **EMAIL**: Email addresses
- **ADDRESS**: Full street addresses (including building numbers, street names, apartments, cities, states, ZIP codes)
- **BIRTHDAY**: Dates of birth
- **TAG**: Contact tags
- **NOTE_TEXT**: Note content
- **ID**: Contact/note identifiers
- **DAYS**: Time periods

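
As a rough illustration of how the nine entity types above translate into a token-classification label set, the sketch below expands them into IOB2 labels. The names `ENTITY_TYPES`, `build_iob2_labels`, `LABEL2ID`, and `ID2LABEL` are hypothetical, and the id order shown here is an assumption; the mapping the model actually uses is stored in its configuration (`model.config.id2label`).

```python
# Entity types as listed in this model card.
ENTITY_TYPES = [
    "NAME", "PHONE", "EMAIL", "ADDRESS", "BIRTHDAY",
    "TAG", "NOTE_TEXT", "ID", "DAYS",
]


def build_iob2_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


LABELS = build_iob2_labels(ENTITY_TYPES)

# Hypothetical id assignment for illustration only; check
# model.config.id2label for the mapping shipped with the model.
LABEL2ID = {label: index for index, label in enumerate(LABELS)}
ID2LABEL = {index: label for label, index in LABEL2ID.items()}

print(len(LABELS))  # one "O" label plus B-/I- for each of the 9 types
```

Nine entity types give 19 labels in total: one `O` label plus a `B-` and an `I-` label per type.
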
## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("kms-engineer/assistant-bot-ner-model")
model = AutoModelForTokenClassification.from_pretrained("kms-engineer/assistant-bot-ner-model")

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Merge B-/I- tokens
)

# Extract entities
text = "Add contact John Smith 212-555-0123 john@example.com 123 Broadway, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")
```

**Output:**
```
NAME: John Smith
PHONE: 212-555-0123
EMAIL: john@example.com
ADDRESS: 123 Broadway, New York
```

### Advanced Usage with Address Recognition

```python
# Reuses ner_pipeline from the Basic Usage example above.
# Example with full address including building number
text = "Add contact Alon 212-555-0123 alon@example.com 45, 5 Ave, unit 34, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")
```

**Output:**
```
NAME: Alon
PHONE: 212-555-0123
EMAIL: alon@example.com
ADDRESS: 45, 5 Ave, unit 34, New York
```

### Batch Processing

```python
texts = [
    "Add contact Sarah 718-555-4567 sarah@email.com lives at 123 Broadway, Apt 5B, NY 10001",
    "Create contact Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901",
    "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
]

for text in texts:
    results = ner_pipeline(text)
    print(f"\nText: {text}")
    for result in results:
        print(f"  - {result['entity_group']}: {result['word']}")
```

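
In an application, the flat list returned by the pipeline usually needs to be grouped by entity type before it can be stored as a contact. The following is a minimal sketch of that step, not part of the released code: `to_contact_record` is a hypothetical helper, and `sample` is a hand-written stand-in for pipeline output so the snippet runs without downloading the model.

```python
from collections import defaultdict


def to_contact_record(entities):
    """Group pipeline results by entity type.

    Each item is expected to carry the 'entity_group' and 'word' keys that
    the token-classification pipeline returns when an aggregation strategy
    is set.
    """
    record = defaultdict(list)
    for entity in entities:
        # strip() removes whitespace the tokenizer may leave around a span.
        record[entity["entity_group"]].append(entity["word"].strip())
    return dict(record)


# Hand-written stand-in for ner_pipeline(text); not real model output.
sample = [
    {"entity_group": "NAME", "word": " John Smith"},
    {"entity_group": "PHONE", "word": " 212-555-0123"},
    {"entity_group": "EMAIL", "word": " john@example.com"},
    {"entity_group": "ADDRESS", "word": " 123 Broadway, New York"},
]

contact = to_contact_record(sample)
print(contact["NAME"])
```

Values are kept as lists because a single message can mention several entities of the same type, for example two phone numbers.
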
## Training Details

### Dataset
- **Size:** 2,185 training examples
- **ADDRESS entities:** 543 occurrences (including full street addresses with building numbers)
- **NAME entities:** 1,897 occurrences
- **PHONE entities:** 564 occurrences
- **EMAIL entities:** 415 occurrences
- **BIRTHDAY entities:** 252 occurrences

### Training Configuration
- **Base Model:** roberta-base
- **Learning Rate:** 3e-5
- **Batch Size:** 16
- **Max Length:** 128 tokens
- **Epochs:** 5
- **Optimizer:** AdamW
- **Training Framework:** Hugging Face Transformers

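
The hyperparameters above could be expressed with the Transformers `Trainer` API roughly as follows. This is a sketch under assumptions, not the author's training script: the card does not say whether the batch size of 16 is per device, the output directory name is made up, and `TRAINING_KWARGS` and `build_training_arguments` are hypothetical names. AdamW is the default optimizer of `TrainingArguments`, so it is not set explicitly.

```python
# Maximum sequence length from the card; applied when tokenizing, e.g.
# tokenizer(..., truncation=True, max_length=MAX_LENGTH).
MAX_LENGTH = 128

# Keyword arguments for transformers.TrainingArguments.
# Assumption: the card's "Batch Size: 16" is treated as per-device.
TRAINING_KWARGS = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 5,
}


def build_training_arguments(output_dir="ner-output"):
    """Create TrainingArguments from the values listed in the card.

    transformers is imported lazily so the plain settings above can be
    inspected without the library installed. The output_dir default is a
    placeholder name.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(TRAINING_KWARGS)
```
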
### Performance Metrics

| Metric | Value |
|--------|-------|
| Accuracy | 95.1% |
| Entity Accuracy | 93.7% |
| Precision | 94.9% |
| Recall | 95.1% |
| F1 Score | 94.6% |

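
The card does not state how these metrics were computed or averaged. For orientation only, the sketch below shows one common way to compute token-level accuracy and exact-match entity precision, recall, and F1. The function names `token_accuracy` and `entity_prf` and the toy spans are assumptions for illustration; they are not the evaluation code behind the table.

```python
def token_accuracy(gold_labels, pred_labels):
    """Fraction of tokens whose predicted label equals the gold label."""
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, pred_labels))
    return correct / len(gold_labels)


def entity_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over (start, end, type) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the predicted PHONE span stops one token early.
gold_spans = [(2, 4, "NAME"), (4, 9, "PHONE")]
pred_spans = [(2, 4, "NAME"), (4, 8, "PHONE")]
print(entity_prf(gold_spans, pred_spans))  # (0.5, 0.5, 0.5)
```

Token accuracy counts a span that is cut one token short as almost entirely correct, while exact-match scoring counts it as a miss. This is one reason an entity-level score can sit below token-level accuracy.
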
## Key Features

### ✅ Full Address Recognition
Unlike many NER models that only recognize city names, this model recognizes **complete street addresses** including:
- Building numbers (45, 123, 1234, etc.)
- Street names (Broadway, 5 Ave, Sunset Boulevard, etc.)
- Unit/Apartment numbers (unit 34, Apt 5B, Suite 12, Floor 3)
- Cities and states (New York, NY, Los Angeles, CA, etc.)
- ZIP codes (10001, 90028, 77002, etc.)

### Example: Full Address Recognition

**Before (typical NER models):**
```
Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "New York" ❌ (only city)
```

**After (this model):**
```
Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "45, 5 ave, unit 34, New York" ✅ (full address with building number!)
```

## Example Predictions

### Example 1: Complete Contact
```python
text = "Add contact John Smith 212-555-0123 john@example.com 45, 5 Ave, unit 34, New York"
```
**Extracted Entities:**
- NAME: John Smith
- PHONE: 212-555-0123
- EMAIL: john@example.com
- ADDRESS: 45, 5 Ave, unit 34, New York

### Example 2: Address with ZIP Code
```python
text = "Create contact Sarah at 123 Broadway, Apt 5B, New York, NY 10001"
```
**Extracted Entities:**
- NAME: Sarah
- ADDRESS: 123 Broadway, Apt 5B, New York, NY 10001

### Example 3: Complex Address
```python
text = "Save contact for Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901"
```
**Extracted Entities:**
- NAME: Michael
- PHONE: 917-555-8901
- ADDRESS: 789 Park Avenue, Suite 12, Manhattan, NY 10021

### Example 4: Different City
```python
text = "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
```
**Extracted Entities:**
- NAME: David Martinez
- ADDRESS: 1234 Sunset Boulevard, Los Angeles, CA 90028

## Intended Use

This model is designed for:
- Contact management applications
- Personal assistant bots
- CRM systems with a natural language interface
- Address extraction from text
- Contact information parsing

## Limitations

- **Optimized for US-style addresses:** International addresses are not yet in the training data
- **Best performance on English text:** Other languages are not supported
- **Contact management domain:** May not generalize well to other domains without fine-tuning

## Model Architecture

Based on RoBERTa (Robustly Optimized BERT Pretraining Approach):
- **Layers:** 12 transformer layers
- **Hidden size:** 768
- **Attention heads:** 12
- **Parameters:** ~125M
- **Task:** Token Classification with IOB2 tagging scheme

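
RoBERTa splits words into subword tokens, so word-level IOB2 labels have to be mapped onto subwords before fine-tuning. The card does not describe how this was done. The sketch below shows one widely used convention, in which only the first subword of each word keeps its label. The `align_labels` helper and the example ids are hypothetical; `word_ids` is the kind of list returned by a fast tokenizer's `BatchEncoding.word_ids()`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_label_ids):
    """Map word-level label ids onto subword tokens.

    word_ids holds, for every token, the index of the word it came from,
    or None for special tokens. Only the first subword of each word keeps
    the word's label; special tokens and continuation subwords receive
    IGNORE_INDEX so they do not contribute to the loss.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            aligned.append(word_label_ids[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Four words; the third word is split into two subwords.
word_ids = [None, 0, 1, 2, 2, 3, None]
word_label_ids = [0, 0, 1, 2]
print(align_labels(word_ids, word_label_ids))
# [-100, 0, 0, 1, -100, 2, -100]
```
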
## Entity Label Format

The model uses IOB2 (Inside-Outside-Beginning) format:
- `B-{ENTITY}`: Beginning of entity
- `I-{ENTITY}`: Inside/continuation of entity
- `O`: Outside any entity

Example (word-level tokens shown for readability; the model itself operates on RoBERTa subword tokens):
```
Tokens: ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
Labels: ["O", "O", "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
```

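
To make the tagging scheme concrete, the following sketch decodes a sequence of IOB2 labels back into entities. It is roughly what the pipeline's aggregation step does for you, written out by hand. `iob2_to_spans` is a hypothetical helper, and dropping an `I-` tag that does not continue an open entity is a design choice of this sketch.

```python
def iob2_to_spans(tokens, labels):
    """Decode IOB2 labels into a list of (entity_type, tokens) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity that is currently open, if any.
        if current_type is not None:
            spans.append((current_type, list(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and label[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
labels = ["O", "O", "B-NAME", "I-NAME",
          "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
print(iob2_to_spans(tokens, labels))
# [('NAME', ['John', 'Smith']), ('PHONE', ['212', '-', '555', '-', '0123'])]
```
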
## Citation

If you use this model, please cite:

```bibtex
@misc{kotenko2025nermodel,
  author = {Kotenko, Mykyta},
  title = {NER Model for Contact Management Assistant Bot},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kms-engineer/assistant-bot-ner-model}},
  note = {Based on RoBERTa by Facebook AI. Achieves 95.1\% accuracy with full address recognition including building numbers.}
}
```

## Acknowledgments

- **Base Model:** RoBERTa by Facebook AI Research
- **Framework:** Hugging Face Transformers
- **Training:** Fine-tuned on a custom contact management dataset with 2,185 examples
- **Special Feature:** Enhanced address recognition with building numbers, apartments, and full street addresses

## Technical Improvements

This model includes several technical improvements over standard NER models:

1. **Enhanced Tokenization:** Improved handling of addresses with a fuzzy matching algorithm
2. **Rich Training Data:** 115+ real-world address examples from major US cities
3. **Address Variations:** Multiple formats including "address-first" patterns
4. **High Accuracy:** 95.1% overall accuracy, 93.7% entity-level accuracy

## Updates

- **v1.0.0 (2025-01-18):** Initial release
  - 95.1% accuracy
  - Full address recognition with building numbers
  - 2,185 training examples
  - Support for 9 entity types

## License

MIT License - See LICENSE file for details.

This model is a derivative work based on RoBERTa, which is licensed under MIT License by Facebook, Inc.

## Contact

- **Author:** Mykyta Kotenko
- **Repository:** [assistant-bot](https://github.com/kms-engineer/assistant-bot)
- **Issues:** Please report issues on GitHub
- **Hugging Face:** [kms-engineer](https://huggingface.co/kms-engineer)

## Related Models

- **Intent Classifier:** [kms-engineer/assistant-bot-intent-classifier](https://huggingface.co/kms-engineer/assistant-bot-intent-classifier)
- **Dataset:** [kms-engineer/assistant-bot-ner-dataset](https://huggingface.co/datasets/kms-engineer/assistant-bot-ner-dataset)