
metadata

language: en
license: mit
tags:
  - token-classification
  - named-entity-recognition
  - ner
  - contact-management
  - roberta
base_model: roberta-base
datasets:
  - custom
model-index:
  - name: assistant-bot-ner-model
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        metrics:
          - type: accuracy
            value: 0.951
            name: Accuracy
          - type: f1
            value: 0.946
            name: F1 Score

NER Model for Contact Management Assistant Bot

This is a RoBERTa-base model fine-tuned for Named Entity Recognition (NER) in contact management tasks.

Model Description

Developed by: Mykyta Kotenko
Base Model: roberta-base by Facebook AI
Task: Token Classification (Named Entity Recognition)
Language: English
License: MIT
Accuracy: 95.1%
Entity Accuracy: 93.7%
F1 Score: 94.6%

Supported Entities

This model extracts the following entity types:

NAME: Person's full name
PHONE: Phone numbers in various formats
EMAIL: Email addresses
ADDRESS: Full street addresses (including building numbers, street names, apartments, cities, states, ZIP codes)
BIRTHDAY: Dates of birth
TAG: Contact tags
NOTE_TEXT: Note content
ID: Contact/note identifiers
DAYS: Time periods
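
As a minimal sketch of how these nine entity types could expand into token-level labels under the IOB2 scheme described later in this card, the snippet below builds the label list and an id mapping. The ordering and the names `build_iob2_labels`, `ID2LABEL`, and `LABEL2ID` are assumptions for illustration; the released model's actual mapping is whatever `model.config.id2label` reports.

```python
# Entity types as listed in this model card.
ENTITY_TYPES = [
    "NAME", "PHONE", "EMAIL", "ADDRESS", "BIRTHDAY",
    "TAG", "NOTE_TEXT", "ID", "DAYS",
]


def build_iob2_labels(entity_types):
    """Return "O" followed by a B-/I- label pair for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


LABELS = build_iob2_labels(ENTITY_TYPES)

# Hypothetical id mapping: the order here is an assumption, not the
# mapping stored in the released checkpoint.
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(len(LABELS), LABELS[:3])
```

Nine entity types give 19 labels in total: one `O` label plus a `B-` and an `I-` label per type.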

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("kms-engineer/assistant-bot-ner-model")
model = AutoModelForTokenClassification.from_pretrained("kms-engineer/assistant-bot-ner-model")

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Merge B-/I- tokens
)

# Extract entities
text = "Add contact John Smith 212-555-0123 john@example.com 123 Broadway, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")

Output:

NAME: John Smith
PHONE: 212-555-0123
EMAIL: john@example.com
ADDRESS: 123 Broadway, New York

Advanced Usage with Address Recognition

# Example with full address including building number
text = "Add contact Alon 212-555-0123 alon@example.com 45, 5 Ave, unit 34, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")

Output:

NAME: Alon
PHONE: 212-555-0123
EMAIL: alon@example.com
ADDRESS: 45, 5 Ave, unit 34, New York

Batch Processing

texts = [
    "Add contact Sarah 718-555-4567 sarah@email.com lives at 123 Broadway, Apt 5B, NY 10001",
    "Create contact Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901",
    "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
]

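# Passing the whole list runs all inputs through one pipeline call;
# the result is one list of entities per input text.
all_results = ner_pipeline(texts)

for text, results in zip(texts, all_results):
    print(f"\nText: {text}")
    for result in results:
        print(f"  - {result['entity_group']}: {result['word']}")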

Training Details

Dataset

Size: 2,185 training examples
ADDRESS entities: 543 occurrences (including full street addresses with building numbers)
NAME entities: 1,897 occurrences
PHONE entities: 564 occurrences
EMAIL entities: 415 occurrences
BIRTHDAY entities: 252 occurrences

Training Configuration

Base Model: roberta-base
Learning Rate: 3e-5
Batch Size: 16
Max Length: 128 tokens
Epochs: 5
Optimizer: AdamW
Training Framework: Hugging Face Transformers
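
To give a feel for the scale of this run, the arithmetic below estimates the number of optimizer steps implied by the figures above. It is a rough sketch that assumes all 2,185 examples are used for training, a single device, and no gradient accumulation; none of these details are stated in this card.

```python
import math

TRAIN_EXAMPLES = 2185  # dataset size from this card
BATCH_SIZE = 16        # batch size from this card
EPOCHS = 5             # epochs from this card

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```

Under these assumptions the run is short, a few hundred steps in total, which is typical for fine-tuning on a dataset of this size.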

Performance Metrics

Metric	Value
Accuracy	95.1%
Entity Accuracy	93.7%
Precision	94.9%
Recall	95.1%
F1 Score	94.6%
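
This card does not say how the metrics above were computed or averaged. As one common approach, the sketch below scores predictions by strict span matching: a predicted entity counts as correct only if its type and boundaries both match a gold entity. The function name `span_prf` and the toy spans are illustrative and are not taken from the evaluation data.

```python
def span_prf(gold, predicted):
    """Strict-match precision, recall, and F1 over (type, start, end) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three gold entities, four predictions, two exact matches.
gold = [("NAME", 12, 22), ("PHONE", 23, 35), ("EMAIL", 36, 52)]
pred = [("NAME", 12, 22), ("PHONE", 23, 35),
        ("EMAIL", 36, 50), ("ADDRESS", 53, 75)]

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Note that the `EMAIL` prediction is counted as wrong because its end offset differs from the gold span, which is what makes strict matching harsher than token-level accuracy.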

Key Features

✅ Full Address Recognition

Unlike many NER models that only recognize city names, this model recognizes complete street addresses including:

Building numbers (45, 123, 1234, etc.)
Street names (Broadway, 5 Ave, Sunset Boulevard, etc.)
Unit/Apartment numbers (unit 34, Apt 5B, Suite 12, Floor 3)
Cities and states (New York, NY, Los Angeles, CA, etc.)
ZIP codes (10001, 90028, 77002, etc.)

Example: Full Address Recognition

Before (typical NER models):

Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "New York" ❌ (only city)

After (this model):

Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "45, 5 ave, unit 34, New York" ✅ (full address with building number!)

Example Predictions

Example 1: Complete Contact

text = "Add contact John Smith 212-555-0123 john@example.com 45, 5 Ave, unit 34, New York"

Extracted Entities:

NAME: John Smith
PHONE: 212-555-0123
EMAIL: john@example.com
ADDRESS: 45, 5 Ave, unit 34, New York

Example 2: Address with ZIP Code

text = "Create contact Sarah at 123 Broadway, Apt 5B, New York, NY 10001"

Extracted Entities:

NAME: Sarah
ADDRESS: 123 Broadway, Apt 5B, New York, NY 10001

Example 3: Complex Address

text = "Save contact for Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901"

Extracted Entities:

NAME: Michael
PHONE: 917-555-8901
ADDRESS: 789 Park Avenue, Suite 12, Manhattan, NY 10021

Example 4: Different City

text = "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"

Extracted Entities:

NAME: David Martinez
ADDRESS: 1234 Sunset Boulevard, Los Angeles, CA 90028

Intended Use

This model is designed for:

Contact management applications
Personal assistant bots
CRM systems with natural language interface
Address extraction from text
Contact information parsing

Limitations

Optimized for US-style addresses: international addresses are not yet in the training data
Best performance on English text: other languages are not supported
Contact management domain: may not generalize well to other domains without fine-tuning

Model Architecture

Based on RoBERTa (Robustly Optimized BERT Pretraining Approach):

Layers: 12 transformer layers
Hidden size: 768
Attention heads: 12
Parameters: ~125M
Task: Token Classification with IOB2 tagging scheme
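
The parameter figure can be sanity-checked from the dimensions above. The sketch below adds up the weights of the encoder and a linear classification head; the vocabulary size (50,265), position count (514), and feed-forward width (3,072) are the standard roberta-base values and are assumptions here, since this card does not list them. The 19-label head assumes one `O` label plus `B-`/`I-` labels for the nine entity types.

```python
# Dimensions from this card.
HIDDEN = 768
LAYERS = 12

# Standard roberta-base values (assumed; not listed in this card).
VOCAB_SIZE = 50265
MAX_POSITIONS = 514
FFN_SIZE = 3072

# Assumed head size: "O" plus B-/I- labels for 9 entity types.
NUM_LABELS = 19


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position, and token-type embeddings plus their LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + 1) * HIDDEN + layer_norm

# Query, key, value, and output projections.
attention = 4 * linear(HIDDEN, HIDDEN)
feed_forward = linear(HIDDEN, FFN_SIZE) + linear(FFN_SIZE, HIDDEN)
per_layer = attention + feed_forward + 2 * layer_norm

encoder_total = embeddings + LAYERS * per_layer
classifier = linear(HIDDEN, NUM_LABELS)

print(f"encoder:    {encoder_total:,}")
print(f"classifier: {classifier:,}")
print(f"total:      {(encoder_total + classifier) / 1e6:.1f}M")
```

The total lands close to the ~125M quoted above; the classification head itself contributes only a few thousand parameters.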

Entity Label Format

The model uses IOB2 (Inside-Outside-Beginning) format:

B-{ENTITY}: Beginning of entity
I-{ENTITY}: Inside/continuation of entity
O: Outside any entity

Example:

Tokens:  ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
Labels:  ["O",   "O",       "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
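
To make the tagging scheme concrete, here is a minimal sketch of decoding an IOB2 label sequence back into entity spans, using the word-level example above. The function name `iob2_to_spans` is illustrative, and the lenient handling of a stray `I-` label (treating it as the start of a new entity) is a design assumption rather than something this card specifies. In practice the model works on subword tokens, and the pipeline's `aggregation_strategy` performs this merging for you.

```python
def iob2_to_spans(labels):
    """Convert IOB2 labels into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        if label.startswith("I-") and current_type == label[2:]:
            continue  # still inside the current entity
        # Any other label closes the entity that is currently open.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        if label.startswith(("B-", "I-")):
            # "B-" starts an entity; a stray "I-" is treated the same way.
            current_type, start = label[2:], i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


tokens = ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
labels = ["O", "O", "B-NAME", "I-NAME",
          "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]

spans = iob2_to_spans(labels)
for entity_type, start, end in spans:
    print(entity_type, tokens[start:end])
```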

Citation

If you use this model, please cite:

@misc{kotenko2025nermodel,
  author = {Kotenko, Mykyta},
  title = {NER Model for Contact Management Assistant Bot},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kms-engineer/assistant-bot-ner-model}},
  note = {Based on RoBERTa by Facebook AI. Achieves 95.1\% accuracy with full address recognition including building numbers.}
}

Acknowledgments

Base Model: RoBERTa by Facebook AI Research
Framework: Hugging Face Transformers
Training: Fine-tuned on custom contact management dataset with 2,185 examples
Special Feature: Enhanced address recognition with building numbers, apartments, and full street addresses

Technical Improvements

This model includes several technical improvements over standard NER models:

Enhanced Tokenization: Improved handling of addresses using a fuzzy matching algorithm
Rich Training Data: 115+ real-world address examples from major US cities
Address Variations: Multiple formats including "address-first" patterns
High Accuracy: 95.1% overall accuracy, 93.7% entity-level accuracy

Updates

v1.0.0 (2025-01-18): Initial release
- 95.1% accuracy
- Full address recognition with building numbers
- 2,185 training examples
- Support for 9 entity types

License

MIT License - See LICENSE file for details.

This model is a derivative work based on RoBERTa, which is released under the MIT License by Facebook, Inc.

Contact

Author: Mykyta Kotenko
Repository: assistant-bot
Issues: Please report issues on GitHub
Hugging Face: kms-engineer

Related Models and Datasets

Intent Classifier: kms-engineer/assistant-bot-intent-classifier
Dataset: kms-engineer/assistant-bot-ner-dataset