metadata
language: en
license: mit
tags:
  - token-classification
  - named-entity-recognition
  - ner
  - contact-management
  - roberta
base_model: roberta-base
datasets:
  - custom
model-index:
  - name: assistant-bot-ner-model
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        metrics:
          - type: accuracy
            value: 0.951
            name: Accuracy
          - type: f1
            value: 0.946
            name: F1 Score

NER Model for Contact Management Assistant Bot

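This model is roberta-base fine-tuned for Named Entity Recognition (NER) in contact management tasks. It extracts names, phone numbers, email addresses, street addresses, and related fields from natural-language commands.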

Model Description

  • Developed by: Mykyta Kotenko
  • Base Model: roberta-base by Facebook AI
  • Task: Token Classification (Named Entity Recognition)
  • Language: English
  • License: MIT
  • Accuracy: 95.1%
  • Entity Accuracy: 93.7%
  • F1 Score: 94.6%

Supported Entities

This model extracts the following entity types:

  • NAME: Person's full name
  • PHONE: Phone numbers in various formats
  • EMAIL: Email addresses
  • ADDRESS: Full street addresses (including building numbers, street names, apartments, cities, states, ZIP codes)
  • BIRTHDAY: Dates of birth
  • TAG: Contact tags
  • NOTE_TEXT: Note content
  • ID: Contact/note identifiers
  • DAYS: Time periods

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("kms-engineer/assistant-bot-ner-model")
model = AutoModelForTokenClassification.from_pretrained("kms-engineer/assistant-bot-ner-model")

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Merge B-/I- tokens
)

# Extract entities
text = "Add contact John Smith 212-555-0123 john@example.com 123 Broadway, New York"
results = ner_pipeline(text)

for result in results:
    # RoBERTa's byte-level BPE can leave a leading space on 'word', so strip it
    print(f"{result['entity_group']}: {result['word'].strip()}")

Output:

NAME: John Smith
PHONE: 212-555-0123
EMAIL: john@example.com
ADDRESS: 123 Broadway, New York
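
A downstream application usually wants one record per contact rather than a flat list. The sketch below is one possible way to do that, not part of the model itself. It relies on the `start` and `end` character offsets that the Transformers pipeline includes in each aggregated result, and slices the original text so that spacing and casing are preserved. The `to_contact` and `span` helpers are hypothetical names, and `sample_results` is a hand-built stand-in for real pipeline output so the snippet runs without downloading the model.

```python
from collections import defaultdict

def to_contact(text, results):
    """Group pipeline results into {entity_group: [surface strings]}."""
    contact = defaultdict(list)
    for r in results:
        # Slice the original text so spacing and casing are preserved exactly.
        contact[r["entity_group"]].append(text[r["start"]:r["end"]])
    return dict(contact)

text = "Add contact John Smith 212-555-0123 john@example.com 123 Broadway, New York"

def span(group, surface):
    # Hypothetical stand-in for one pipeline result;
    # real results also carry "score" and "word".
    start = text.index(surface)
    return {"entity_group": group, "start": start, "end": start + len(surface)}

sample_results = [
    span("NAME", "John Smith"),
    span("PHONE", "212-555-0123"),
    span("EMAIL", "john@example.com"),
    span("ADDRESS", "123 Broadway, New York"),
]

contact = to_contact(text, sample_results)
print(contact["ADDRESS"])
```

With the real model, `sample_results` would be replaced by `ner_pipeline(text)`.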

Advanced Usage with Address Recognition

# Example with full address including building number
text = "Add contact Alon 212-555-0123 alon@example.com 45, 5 Ave, unit 34, New York"
results = ner_pipeline(text)

for result in results:
    # Strip the leading space that RoBERTa's tokenizer can leave on 'word'
    print(f"{result['entity_group']}: {result['word'].strip()}")

Output:

NAME: Alon
PHONE: 212-555-0123
EMAIL: alon@example.com
ADDRESS: 45, 5 Ave, unit 34, New York

Batch Processing

texts = [
    "Add contact Sarah 718-555-4567 sarah@email.com lives at 123 Broadway, Apt 5B, NY 10001",
    "Create contact Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901",
    "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
]

for text in texts:
    results = ner_pipeline(text)
    print(f"\nText: {text}")
    for result in results:
        # Strip the leading space that RoBERTa's tokenizer can leave on 'word'
        print(f"  - {result['entity_group']}: {result['word'].strip()}")
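
Each aggregated result from the pipeline also carries a `score` field. When processing many texts, it may be useful to drop low-confidence entities before saving them. The following is a minimal sketch under that assumption. The `keep_confident` helper, the 0.80 threshold, and the scores in `sample_results` are all hypothetical values chosen for illustration and are not outputs of this model.

```python
def keep_confident(results, threshold=0.80):
    """Drop entities whose aggregated score falls below the threshold."""
    return [r for r in results if r["score"] >= threshold]

# Hypothetical scores for illustration only; they are not outputs of this model.
sample_results = [
    {"entity_group": "NAME", "word": " Sarah", "score": 0.99},
    {"entity_group": "PHONE", "word": " 718-555-4567", "score": 0.97},
    {"entity_group": "ADDRESS", "word": " NY", "score": 0.42},
]

for r in keep_confident(sample_results):
    print(f"{r['entity_group']}: {r['word'].strip()}")
```

A suitable threshold depends on the application and is best chosen on held-out data.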

Training Details

Dataset

  • Size: 2,185 training examples
  • ADDRESS entities: 543 occurrences (including full street addresses with building numbers)
  • NAME entities: 1,897 occurrences
  • PHONE entities: 564 occurrences
  • EMAIL entities: 415 occurrences
  • BIRTHDAY entities: 252 occurrences

Training Configuration

  • Base Model: roberta-base
  • Learning Rate: 3e-5
  • Batch Size: 16
  • Max Length: 128 tokens
  • Epochs: 5
  • Optimizer: AdamW
  • Training Framework: Hugging Face Transformers
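
To give a sense of scale, the configuration above implies a fairly short training run. The arithmetic below is a rough estimate, assuming all 2,185 examples are used for training on a single device, with no gradient accumulation and the last partial batch kept. The card does not confirm any of these details.

```python
import math

num_examples = 2185   # training examples reported in the Dataset section
batch_size = 16
epochs = 5

# Assumption: the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 137 685
```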

Performance Metrics

| Metric          | Value |
|-----------------|-------|
| Accuracy        | 95.1% |
| Entity Accuracy | 93.7% |
| Precision       | 94.9% |
| Recall          | 95.1% |
| F1 Score        | 94.6% |
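
For reference, F1 computed from a single precision and recall pair is their harmonic mean. The short check below applies that formula to the reported precision and recall. The result is slightly higher than the reported F1 of 94.6%. One possible explanation is that the metrics were averaged differently (for example per entity type), but the card does not state the averaging method, so this is only an assumption.

```python
precision = 0.949
recall = 0.951

# Harmonic mean of precision and recall.
f1_from_pr = 2 * precision * recall / (precision + recall)

print(round(f1_from_pr, 3))  # 0.95
```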

Key Features

✅ Full Address Recognition

Unlike many NER models that only recognize city names, this model recognizes complete street addresses including:

  • Building numbers (45, 123, 1234, etc.)
  • Street names (Broadway, 5 Ave, Sunset Boulevard, etc.)
  • Unit/Apartment numbers (unit 34, Apt 5B, Suite 12, Floor 3)
  • Cities and states (New York, NY, Los Angeles, CA, etc.)
  • ZIP codes (10001, 90028, 77002, etc.)

Example: Full Address Recognition

Before (typical NER models):

Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "New York" ❌ (only city)

After (this model):

Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "45, 5 ave, unit 34, New York" ✅ (full address with building number!)

Example Predictions

Example 1: Complete Contact

text = "Add contact John Smith 212-555-0123 john@example.com 45, 5 Ave, unit 34, New York"

Extracted Entities:

  • NAME: John Smith
  • PHONE: 212-555-0123
  • EMAIL: john@example.com
  • ADDRESS: 45, 5 Ave, unit 34, New York

Example 2: Address with ZIP Code

text = "Create contact Sarah at 123 Broadway, Apt 5B, New York, NY 10001"

Extracted Entities:

  • NAME: Sarah
  • ADDRESS: 123 Broadway, Apt 5B, New York, NY 10001

Example 3: Complex Address

text = "Save contact for Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901"

Extracted Entities:

  • NAME: Michael
  • PHONE: 917-555-8901
  • ADDRESS: 789 Park Avenue, Suite 12, Manhattan, NY 10021

Example 4: Different City

text = "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"

Extracted Entities:

  • NAME: David Martinez
  • ADDRESS: 1234 Sunset Boulevard, Los Angeles, CA 90028
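
The model returns each address as a single ADDRESS span. If an application needs the state and ZIP code separately, a small post-processing step can pull them out. The sketch below is one hypothetical approach and not part of the model. It assumes a US-style address ending in a two-letter state code followed by a 5-digit or ZIP+4 code, and `split_state_zip` is an illustrative name.

```python
import re

# Assumption: US-style "..., ST 12345" or "..., ST 12345-6789" at the end of the string.
STATE_ZIP = re.compile(r"\b([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*$")

def split_state_zip(address):
    """Return (state, zip) if the address ends with them, else (None, None)."""
    m = STATE_ZIP.search(address)
    return (m.group(1), m.group(2)) if m else (None, None)

print(split_state_zip("123 Broadway, Apt 5B, New York, NY 10001"))  # ('NY', '10001')
print(split_state_zip("45, 5 Ave, unit 34, New York"))              # (None, None)
```

Addresses without a trailing state and ZIP code, such as the second one, are left for the caller to handle.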

Intended Use

This model is designed for:

  • Contact management applications
  • Personal assistant bots
  • CRM systems with natural language interface
  • Address extraction from text
  • Contact information parsing

Limitations

  • US-style addresses only: international addresses are not yet in the training data
  • English only: other languages are not supported
  • Contact management domain: the model may not generalize to other domains without further fine-tuning

Model Architecture

Based on RoBERTa (Robustly Optimized BERT Pretraining Approach):

  • Layers: 12 transformer layers
  • Hidden size: 768
  • Attention heads: 12
  • Parameters: ~125M
  • Task: Token Classification with IOB2 tagging scheme
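
With nine entity types and IOB2 tagging, the label space works out to 19 classes: one `O` label plus a `B-` and an `I-` label per entity type. The sketch below builds that label set and estimates the size of the classification head, assuming the standard Transformers token-classification head (a single linear layer on top of the 768-dimensional hidden states). The label order here is an assumption. The actual mapping is whatever `model.config.id2label` contains in the published checkpoint.

```python
ENTITY_TYPES = ["NAME", "PHONE", "EMAIL", "ADDRESS", "BIRTHDAY",
                "TAG", "NOTE_TEXT", "ID", "DAYS"]

# Assumed ordering: "O" first, then B-/I- pairs per entity type.
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))

hidden_size = 768
num_labels = len(labels)
# Weights plus biases of one linear layer mapping hidden states to label logits.
head_params = hidden_size * num_labels + num_labels

print(num_labels, head_params)  # 19 14611
```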

Entity Label Format

The model uses IOB2 (Inside-Outside-Beginning) format:

  • B-{ENTITY}: Beginning of entity
  • I-{ENTITY}: Inside/continuation of entity
  • O: Outside any entity

Example:

Tokens:  ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
Labels:  ["O",   "O",       "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
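
The pipeline's `aggregation_strategy` handles this merging automatically, but the logic is simple enough to show directly. The following is a minimal sketch of an IOB2 decoder applied to the tokens and labels above. `decode_iob2` is a hypothetical helper, and it joins tokens with spaces, which is a simplification compared with the offset-based merging the pipeline performs.

```python
def decode_iob2(tokens, labels):
    """Collapse IOB2 labels into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []

    def flush():
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span and ignore this token.
            flush()
    flush()
    return spans

tokens = ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
labels = ["O", "O", "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]

print(decode_iob2(tokens, labels))
# [('NAME', 'John Smith'), ('PHONE', '212 - 555 - 0123')]
```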

Citation

If you use this model, please cite:

@misc{kotenko2025nermodel,
  author = {Kotenko, Mykyta},
  title = {NER Model for Contact Management Assistant Bot},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kms-engineer/assistant-bot-ner-model}},
  note = {Based on RoBERTa by Facebook AI. Achieves 95.1\% accuracy with full address recognition including building numbers.}
}

Acknowledgments

  • Base Model: RoBERTa by Facebook AI Research
  • Framework: Hugging Face Transformers
  • Training: Fine-tuned on custom contact management dataset with 2,185 examples
  • Special Feature: Enhanced address recognition with building numbers, apartments, and full street addresses

Technical Improvements

This model includes several technical improvements over standard NER models:

  1. Enhanced Tokenization: Improved handling of addresses using a fuzzy matching algorithm
  2. Rich Training Data: 115+ real-world address examples from major US cities
  3. Address Variations: Multiple formats including "address-first" patterns
  4. High Accuracy: 95.1% overall accuracy, 93.7% entity-level accuracy

Updates

  • v1.0.0 (2025-01-18): Initial release
    • 95.1% accuracy
    • Full address recognition with building numbers
    • 2,185 training examples
    • Support for 9 entity types

License

MIT License - See LICENSE file for details.

This model is a derivative work of RoBERTa, which Facebook, Inc. released under the MIT License.
