---
language: en
license: mit
tags:
- token-classification
- named-entity-recognition
- ner
- contact-management
- roberta
base_model: roberta-base
datasets:
- custom
model-index:
- name: assistant-bot-ner-model
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: accuracy
      value: 0.951
      name: Accuracy
    - type: f1
      value: 0.946
      name: F1 Score
---

# NER Model for Contact Management Assistant Bot

This is a RoBERTa-base model fine-tuned for Named Entity Recognition (NER) on contact management commands, such as adding a contact with a name, phone number, email, and street address.

## Model Description

- **Developed by:** Mykyta Kotenko
- **Base Model:** [roberta-base](https://huggingface.co/roberta-base) by Facebook AI
- **Task:** Token Classification (Named Entity Recognition)
- **Language:** English
- **License:** MIT
- **Accuracy:** 95.1%
- **Entity Accuracy:** 93.7%
- **F1 Score:** 94.6%

## Supported Entities

This model extracts the following entity types:

- **NAME**: Person's full name
- **PHONE**: Phone numbers in various formats
- **EMAIL**: Email addresses
- **ADDRESS**: Full street addresses (including building numbers, street names, apartments, cities, states, ZIP codes)
- **BIRTHDAY**: Dates of birth
- **TAG**: Contact tags
- **NOTE_TEXT**: Note content
- **ID**: Contact/note identifiers
- **DAYS**: Time periods
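
Under the IOB2 scheme this card describes, each entity type contributes a `B-` and an `I-` label, plus one shared `O` label. The sketch below derives that label set from the nine types listed above. The label ordering and the `build_label_maps` helper are assumptions for illustration; the authoritative mapping is the `id2label` field in the model's `config.json`.

```python
# Entity types as listed in this card.
ENTITY_TYPES = [
    "NAME", "PHONE", "EMAIL", "ADDRESS", "BIRTHDAY",
    "TAG", "NOTE_TEXT", "ID", "DAYS",
]


def build_label_maps(entity_types):
    """Build IOB2 label <-> id maps (hypothetical ordering: O first)."""
    labels = ["O"]
    for entity_type in entity_types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(label2id))  # 9 types * 2 prefixes + "O" = 19
```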

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("kms-engineer/assistant-bot-ner-model")
model = AutoModelForTokenClassification.from_pretrained("kms-engineer/assistant-bot-ner-model")

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Merge B-/I- tokens
)

# Extract entities
text = "Add contact John Smith 212-555-0123 john@example.com 123 Broadway, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")
```

**Output:**
```
NAME: John Smith
PHONE: 212-555-0123
EMAIL: john@example.com
ADDRESS: 123 Broadway, New York
```
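
An application usually needs these results as a record rather than a flat list. The following is a minimal sketch of one way to group them; `group_entities` and its `min_score` threshold are hypothetical helpers, and the sample list is hand-written in the shape the pipeline returns (the scores are made up), so the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_entities(results, min_score=0.0):
    """Collect entity text by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            # Strip whitespace that subword merging may leave around the text.
            grouped[item["entity_group"]].append(item["word"].strip())
    return dict(grouped)


# Hand-written sample; scores are illustrative, not real model output.
sample_results = [
    {"entity_group": "NAME", "word": " John Smith", "score": 0.99},
    {"entity_group": "PHONE", "word": " 212-555-0123", "score": 0.98},
    {"entity_group": "EMAIL", "word": " john@example.com", "score": 0.97},
    {"entity_group": "ADDRESS", "word": " 123 Broadway, New York", "score": 0.90},
]

contact = group_entities(sample_results, min_score=0.5)
print(contact)
```

Values are kept as lists because a single message can mention more than one phone number or tag.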

### Advanced Usage with Address Recognition

```python
# Example with full address including building number
text = "Add contact Alon 212-555-0123 alon@example.com 45, 5 Ave, unit 34, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")
```

**Output:**
```
NAME: Alon
PHONE: 212-555-0123
EMAIL: alon@example.com
ADDRESS: 45, 5 Ave, unit 34, New York
```

### Batch Processing

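```python
# Reuses `ner_pipeline` from the Basic Usage example
texts = [
    "Add contact Sarah 718-555-4567 sarah@email.com lives at 123 Broadway, Apt 5B, NY 10001",
    "Create contact Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901",
    "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
]

# Passing the whole list lets the pipeline batch the inputs internally;
# batch_size=8 is an arbitrary choice, tune it for your hardware
all_results = ner_pipeline(texts, batch_size=8)

# One list of entities is returned per input text, in the same order
for text, results in zip(texts, all_results):
    print(f"\nText: {text}")
    for result in results:
        print(f"  - {result['entity_group']}: {result['word']}")
```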

## Training Details

### Dataset
- **Size:** 2,185 training examples
- **ADDRESS entities:** 543 occurrences (including full street addresses with building numbers)
- **NAME entities:** 1,897 occurrences
- **PHONE entities:** 564 occurrences
- **EMAIL entities:** 415 occurrences
- **BIRTHDAY entities:** 252 occurrences

### Training Configuration
- **Base Model:** roberta-base
- **Learning Rate:** 3e-5
- **Batch Size:** 16
- **Max Length:** 128 tokens
- **Epochs:** 5
- **Optimizer:** AdamW
- **Training Framework:** Hugging Face Transformers
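
For a sense of scale, the configuration above implies a short training run. The sketch below works out the number of optimizer steps, assuming a single device, no gradient accumulation, all 2,185 examples used in every epoch, and the last partial batch kept. None of those details are stated in this card, and `TrainConfig` is a hypothetical container, not part of the training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters copied from this card (container is hypothetical)."""
    base_model: str = "roberta-base"
    learning_rate: float = 3e-5
    batch_size: int = 16
    max_length: int = 128
    epochs: int = 5


def optimizer_steps(num_examples, cfg):
    """Return (steps per epoch, total steps) under the stated assumptions."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch, steps_per_epoch * cfg.epochs


cfg = TrainConfig()
per_epoch, total = optimizer_steps(2185, cfg)
print(per_epoch, total)  # 137 685
```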

### Performance Metrics

| Metric | Value |
|--------|-------|
| Accuracy | 95.1% |
| Entity Accuracy | 93.7% |
| Precision | 94.9% |
| Recall | 95.1% |
| F1 Score | 94.6% |
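
This card does not state how the figures above were computed or averaged. As an illustration of one common convention, the sketch below scores predictions at the entity level with exact matching: a predicted span counts as correct only if its type and both boundaries match a gold span. The `span_prf` helper and the sample spans are hypothetical.

```python
def span_prf(gold_spans, pred_spans):
    """Entity-level precision, recall, and F1 with exact span matching.

    Each span is a (entity_type, start_token, end_token) tuple.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: the predicted ADDRESS misses the building number,
# so it does not count as a match.
gold = [("NAME", 2, 4), ("PHONE", 4, 9), ("ADDRESS", 9, 14)]
pred = [("NAME", 2, 4), ("PHONE", 4, 9), ("ADDRESS", 12, 14)]

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under exact matching, a partially recovered address earns no credit, which is why entity-level scores tend to be lower than token-level accuracy.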

## Key Features

### ✅ Full Address Recognition
Many general-purpose NER models tag only the city or another location name inside an address. This model recognizes **complete street addresses**, including:
- Building numbers (45, 123, 1234, etc.)
- Street names (Broadway, 5 Ave, Sunset Boulevard, etc.)
- Unit/Apartment numbers (unit 34, Apt 5B, Suite 12, Floor 3)
- Cities and states (New York, NY, Los Angeles, CA, etc.)
- ZIP codes (10001, 90028, 77002, etc.)

### Example: Full Address Recognition

**Before (typical NER models):**
```
Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "New York" ❌ (only city)
```

**After (this model):**
```
Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "45, 5 ave, unit 34, New York" ✅ (full address with building number!)
```

## Example Predictions

### Example 1: Complete Contact
```python
text = "Add contact John Smith 212-555-0123 john@example.com 45, 5 Ave, unit 34, New York"
```
**Extracted Entities:**
- NAME: John Smith
- PHONE: 212-555-0123
- EMAIL: john@example.com
- ADDRESS: 45, 5 Ave, unit 34, New York

### Example 2: Address with ZIP Code
```python
text = "Create contact Sarah at 123 Broadway, Apt 5B, New York, NY 10001"
```
**Extracted Entities:**
- NAME: Sarah
- ADDRESS: 123 Broadway, Apt 5B, New York, NY 10001

### Example 3: Complex Address
```python
text = "Save contact for Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901"
```
**Extracted Entities:**
- NAME: Michael
- PHONE: 917-555-8901
- ADDRESS: 789 Park Avenue, Suite 12, Manhattan, NY 10021

### Example 4: Different City
```python
text = "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
```
**Extracted Entities:**
- NAME: David Martinez
- ADDRESS: 1234 Sunset Boulevard, Los Angeles, CA 90028

## Intended Use

This model is designed for:
- Contact management applications
- Personal assistant bots
- CRM systems with natural language interface
- Address extraction from text
- Contact information parsing

## Limitations

- **Optimized for US-style addresses** - International addresses not yet in training data
- **Best performance on English text** - Other languages not supported
- **Contact management domain** - May not generalize well to other domains without fine-tuning

## Model Architecture

Based on RoBERTa (Robustly Optimized BERT Pretraining Approach):
- **Layers:** 12 transformer layers
- **Hidden size:** 768
- **Attention heads:** 12
- **Parameters:** ~125M
- **Task:** Token Classification with IOB2 tagging scheme
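
The parameter figure can be roughly reproduced from the dimensions above. The sketch below is a back-of-envelope count, not a readout from the checkpoint. It assumes the standard `roberta-base` values that this card does not list (a 50,265-entry vocabulary, 514 position embeddings, a 3,072-wide feed-forward layer) and a linear classification head over 19 labels (9 entity types with `B-`/`I-` prefixes plus `O`).

```python
def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def count_parameters(hidden=768, layers=12, vocab=50265,
                     max_positions=514, ffn=3072, num_labels=19):
    # Word, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_positions + 1) * hidden + layer_norm(hidden)
    # Query, key, value, and output projections; the head count does not
    # change the total because heads only split the hidden dimension.
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm(hidden)
    encoder = layers * (attention + feed_forward)
    head = linear(hidden, num_labels)  # token classification layer
    return embeddings + encoder + head


total = count_parameters()
print(f"{total / 1e6:.1f}M parameters")
```

The result lands at roughly 124M, in line with the approximate figure quoted above; nearly a third of it sits in the word embedding table.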

## Entity Label Format

The model uses IOB2 (Inside-Outside-Beginning) format:
- `B-{ENTITY}`: Beginning of entity
- `I-{ENTITY}`: Inside/continuation of entity
- `O`: Outside any entity

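Example (shown with word-level tokens for readability; the model's own tokenizer splits text into byte-level BPE subwords):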
```
Tokens:  ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
Labels:  ["O",   "O",       "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
```
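
To make the scheme concrete, the sketch below walks a token/label sequence and collects each entity's tokens. `decode_iob2` is a hypothetical helper written for this illustration; in practice the pipeline's `aggregation_strategy` does the merging for you.

```python
def decode_iob2(tokens, labels):
    """Group tokens into (entity_type, [tokens]) pairs from IOB2 labels."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        if current_type is not None:
            # Any other label closes the open entity.
            entities.append((current_type, current_tokens))
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open entity starts a new one.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, current_tokens))
    return entities


tokens = ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
labels = ["O", "O", "B-NAME", "I-NAME",
          "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]

entities = decode_iob2(tokens, labels)
for entity_type, entity_tokens in entities:
    print(entity_type, entity_tokens)
```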

## Citation

If you use this model, please cite:

```bibtex
@misc{kotenko2025nermodel,
  author = {Kotenko, Mykyta},
  title = {NER Model for Contact Management Assistant Bot},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kms-engineer/assistant-bot-ner-model}},
  note = {Based on RoBERTa by Facebook AI. Achieves 95.1\% accuracy with full address recognition including building numbers.}
}
```

## Acknowledgments

- **Base Model:** RoBERTa by Facebook AI Research
- **Framework:** Hugging Face Transformers
- **Training:** Fine-tuned on a custom contact management dataset with 2,185 examples
- **Special Feature:** Enhanced address recognition with building numbers, apartments, and full street addresses

## Technical Improvements

This model includes several technical improvements over standard NER models:

1. **Enhanced Tokenization:** Improved handling of addresses using a fuzzy matching algorithm
2. **Rich Training Data:** 115+ real-world address examples from major US cities
3. **Address Variations:** Multiple formats including "address-first" patterns
4. **High Accuracy:** 95.1% overall accuracy, 93.7% entity-level accuracy

## Updates

- **v1.0.0 (2025-01-18):** Initial release
  - 95.1% accuracy
  - Full address recognition with building numbers
  - 2,185 training examples
  - Support for 9 entity types

## License

MIT License. See the LICENSE file for details.

This model is a derivative work based on RoBERTa, which is licensed under the MIT License by Facebook, Inc.

## Contact

- **Author:** Mykyta Kotenko
- **Repository:** [assistant-bot](https://github.com/kms-engineer/assistant-bot)
- **Issues:** Please report issues on GitHub
- **Hugging Face:** [kms-engineer](https://huggingface.co/kms-engineer)

## Related Models and Datasets

- **Intent Classifier:** [kms-engineer/assistant-bot-intent-classifier](https://huggingface.co/kms-engineer/assistant-bot-intent-classifier)
- **Dataset:** [kms-engineer/assistant-bot-ner-dataset](https://huggingface.co/datasets/kms-engineer/assistant-bot-ner-dataset)