---
language: en
license: mit
tags:
- token-classification
- named-entity-recognition
- ner
- contact-management
- roberta
base_model: roberta-base
datasets:
- custom
model-index:
- name: assistant-bot-ner-model
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    metrics:
    - type: accuracy
      value: 0.951
      name: Accuracy
    - type: f1
      value: 0.946
      name: F1 Score
---

# NER Model for Contact Management Assistant Bot

This is a RoBERTa-base model fine-tuned for Named Entity Recognition (NER) in contact management tasks.

## Model Description

- **Developed by:** Mykyta Kotenko
- **Base Model:** [roberta-base](https://huggingface.co/roberta-base) by Facebook AI
- **Task:** Token Classification (Named Entity Recognition)
- **Language:** English
- **License:** MIT
- **Accuracy:** 95.1%
- **Entity Accuracy:** 93.7%
- **F1 Score:** 94.6%

## Supported Entities

This model extracts the following entity types:

- **NAME**: Person's full name
- **PHONE**: Phone numbers in various formats
- **EMAIL**: Email addresses
- **ADDRESS**: Full street addresses (including building numbers, street names, apartments, cities, states, ZIP codes)
- **BIRTHDAY**: Dates of birth
- **TAG**: Contact tags
- **NOTE_TEXT**: Note content
- **ID**: Contact/note identifiers
- **DAYS**: Time periods

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("kms-engineer/assistant-bot-ner-model")
model = AutoModelForTokenClassification.from_pretrained("kms-engineer/assistant-bot-ner-model")

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"  # Merge B-/I- tokens
)

# Extract entities
text = "Add contact John Smith 212-555-0123 john@example.com 123 Broadway, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")
```

**Output:**
```
NAME: John Smith
PHONE: 212-555-0123
EMAIL: john@example.com
ADDRESS: 123 Broadway, New York
```
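In an application, the pipeline output usually has to be collected into a single record. The sketch below shows one way to do that, assuming the standard aggregated output of the Transformers token-classification pipeline (dicts with `entity_group`, `score`, `start`, and `end` keys). The helper name `entities_to_contact` and the `min_score` threshold are hypothetical and not part of this model's API. It slices the original text by character offsets instead of reading `word`, because byte-level BPE tokenizers such as RoBERTa's may leave a leading space in `word`.

```python
from collections import defaultdict


def entities_to_contact(text, results, min_score=0.5):
    """Group aggregated pipeline results by entity type.

    `results` is expected to look like the output of a token-classification
    pipeline with an aggregation strategy: a list of dicts with
    `entity_group`, `score`, `start` and `end` keys.
    `min_score` is an illustrative threshold, not a tuned value.
    """
    contact = defaultdict(list)
    for item in results:
        if item["score"] < min_score:
            continue  # skip low-confidence spans
        # Slice the source text so the value keeps its original spacing
        value = text[item["start"]:item["end"]].strip()
        if value:
            contact[item["entity_group"]].append(value)
    return dict(contact)


# Hand-written sample in the pipeline's output shape (not real model output)
sample_text = "Add contact John Smith 212-555-0123"
sample_results = [
    {"entity_group": "NAME", "score": 0.99, "word": " John Smith", "start": 12, "end": 22},
    {"entity_group": "PHONE", "score": 0.98, "word": " 212-555-0123", "start": 23, "end": 35},
]
print(entities_to_contact(sample_text, sample_results))
```

With real model output, pass the original `text` and the list returned by `ner_pipeline(text)`.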

### Advanced Usage with Address Recognition

```python
# Reuses ner_pipeline from Basic Usage; the address includes a building number and a unit
text = "Add contact Alon 212-555-0123 alon@example.com 45, 5 Ave, unit 34, New York"
results = ner_pipeline(text)

for result in results:
    print(f"{result['entity_group']}: {result['word']}")
```

**Output:**
```
NAME: Alon
PHONE: 212-555-0123
EMAIL: alon@example.com
ADDRESS: 45, 5 Ave, unit 34, New York
```

### Batch Processing

```python
texts = [
    "Add contact Sarah 718-555-4567 sarah@email.com lives at 123 Broadway, Apt 5B, NY 10001",
    "Create contact Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901",
    "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
]

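# Passing a list runs the pipeline over every text and returns one result list per input
all_results = ner_pipeline(texts)

for text, results in zip(texts, all_results):
    print(f"\nText: {text}")
    for result in results:
        print(f"  - {result['entity_group']}: {result['word']}")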
```

## Training Details

### Dataset
- **Size:** 2,185 training examples
- **ADDRESS entities:** 543 occurrences (including full street addresses with building numbers)
- **NAME entities:** 1,897 occurrences
- **PHONE entities:** 564 occurrences
- **EMAIL entities:** 415 occurrences
- **BIRTHDAY entities:** 252 occurrences

### Training Configuration
- **Base Model:** roberta-base
- **Learning Rate:** 3e-5
- **Batch Size:** 16
- **Max Length:** 128 tokens
- **Epochs:** 5
- **Optimizer:** AdamW
- **Training Framework:** Hugging Face Transformers
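
For a sense of scale, the configuration above implies a short training run. The arithmetic below is a rough estimate, assuming all 2,185 examples are used for training on a single device with no gradient accumulation. The actual train/validation split and hardware setup are not stated in this card.

```python
import math

num_examples = 2185  # dataset size reported in this card
batch_size = 16
epochs = 5

# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 137
print(total_steps)      # 685
```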

### Performance Metrics

| Metric | Value |
|--------|-------|
| Accuracy | 95.1% |
| Entity Accuracy | 93.7% |
| Precision | 94.9% |
| Recall | 95.1% |
| F1 Score | 94.6% |

## Key Features

### ✅ Full Address Recognition
General-purpose NER models often tag only the city in an address. This model recognizes **complete street addresses**, including:
- Building numbers (45, 123, 1234, etc.)
- Street names (Broadway, 5 Ave, Sunset Boulevard, etc.)
- Unit/Apartment numbers (unit 34, Apt 5B, Suite 12, Floor 3)
- Cities and states (New York, NY, Los Angeles, CA, etc.)
- ZIP codes (10001, 90028, 77002, etc.)

### Example: Full Address Recognition

**Before (typical NER models):**
```
Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "New York" ❌ (only city)
```

**After (this model):**
```
Input: "add address for Alon 45, 5 ave, unit 34, New York"
ADDRESS: "45, 5 ave, unit 34, New York" ✅ (full address with building number!)
```

## Example Predictions

### Example 1: Complete Contact
```python
text = "Add contact John Smith 212-555-0123 john@example.com 45, 5 Ave, unit 34, New York"
```
**Extracted Entities:**
- NAME: John Smith
- PHONE: 212-555-0123
- EMAIL: john@example.com
- ADDRESS: 45, 5 Ave, unit 34, New York

### Example 2: Address with ZIP Code
```python
text = "Create contact Sarah at 123 Broadway, Apt 5B, New York, NY 10001"
```
**Extracted Entities:**
- NAME: Sarah
- ADDRESS: 123 Broadway, Apt 5B, New York, NY 10001

### Example 3: Complex Address
```python
text = "Save contact for Michael at 789 Park Avenue, Suite 12, Manhattan, NY 10021 phone 917-555-8901"
```
**Extracted Entities:**
- NAME: Michael
- PHONE: 917-555-8901
- ADDRESS: 789 Park Avenue, Suite 12, Manhattan, NY 10021

### Example 4: Different City
```python
text = "Register David Martinez 1234 Sunset Boulevard, Los Angeles, CA 90028"
```
**Extracted Entities:**
- NAME: David Martinez
- ADDRESS: 1234 Sunset Boulevard, Los Angeles, CA 90028

## Intended Use

This model is designed for:
- Contact management applications
- Personal assistant bots
- CRM systems with natural language interface
- Address extraction from text
- Contact information parsing

## Limitations

- **Optimized for US-style addresses** - International addresses not yet in training data
- **Best performance on English text** - Other languages not supported
- **Contact management domain** - May not generalize well to other domains without fine-tuning

## Model Architecture

Based on RoBERTa (Robustly Optimized BERT Pretraining Approach):
- **Layers:** 12 transformer layers
- **Hidden size:** 768
- **Attention heads:** 12
- **Parameters:** ~125M
- **Task:** Token Classification with IOB2 tagging scheme
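
The ~125M figure can be checked by hand from the layer dimensions. The sketch below is a back-of-the-envelope count, assuming the published `roberta-base` vocabulary size (50,265), 514 position embeddings, and a 3,072-wide feed-forward layer, none of which are listed in this card. The label count of 19 is also an assumption (9 entity types in IOB2 plus `O`); the model's actual label set may differ.

```python
# Published roberta-base dimensions (assumed, not taken from this card)
vocab_size = 50265
max_positions = 514
hidden = 768
ffn = 3072
layers = 12

# Assumption: 9 entity types in IOB2 give 2 * 9 + 1 = 19 labels
num_labels = 2 * 9 + 1

layer_norm = 2 * hidden  # weight + bias

# Word, position and token-type embeddings plus the embedding LayerNorm
embeddings = (vocab_size + max_positions + 1) * hidden + layer_norm
# Query, key, value and output projections, each with a bias
attention = 4 * (hidden * hidden + hidden)
# Two linear layers of the feed-forward block
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
per_layer = attention + feed_forward + 2 * layer_norm

encoder = embeddings + layers * per_layer
classifier = hidden * num_labels + num_labels

print(encoder)               # 124055040
print(encoder + classifier)  # 124069651
```

Under these assumptions, the token-classification head adds only about 15K parameters to the encoder.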

## Entity Label Format

The model uses IOB2 (Inside-Outside-Beginning) format:
- `B-{ENTITY}`: Beginning of entity
- `I-{ENTITY}`: Inside/continuation of entity
- `O`: Outside any entity

Example:
```
Tokens:  ["Add", "contact", "John",   "Smith",  "212",     "-",       "555",     "-",       "0123"]
Labels:  ["O",   "O",       "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
```
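
To turn IOB2 labels back into entities, consecutive `B-`/`I-` tokens of the same type are merged into one span. The function below is a minimal sketch of that decoding on word-level tokens; `decode_iob2` is a hypothetical helper, and in practice the pipeline's `aggregation_strategy` does this step for you.

```python
def decode_iob2(tokens, labels):
    """Merge IOB2-tagged tokens into (entity_type, token_list) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity == current_type:
            # Continuation of the open span
            current_tokens.append(token)
            continue
        # Any other label closes the open span, if there is one
        if current_type is not None:
            spans.append((current_type, current_tokens))
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open span starts a new span
            current_type, current_tokens = entity, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans


tokens = ["Add", "contact", "John", "Smith", "212", "-", "555", "-", "0123"]
labels = ["O", "O", "B-NAME", "I-NAME", "B-PHONE", "I-PHONE", "I-PHONE", "I-PHONE", "I-PHONE"]
print(decode_iob2(tokens, labels))
# [('NAME', ['John', 'Smith']), ('PHONE', ['212', '-', '555', '-', '0123'])]
```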

## Citation

If you use this model, please cite:

```bibtex
@misc{kotenko2025nermodel,
  author = {Kotenko, Mykyta},
  title = {NER Model for Contact Management Assistant Bot},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/kms-engineer/assistant-bot-ner-model}},
  note = {Based on RoBERTa by Facebook AI. Achieves 95.1\% accuracy with full address recognition including building numbers.}
}
```

## Acknowledgments

- **Base Model:** RoBERTa by Facebook AI Research
- **Framework:** Hugging Face Transformers
- **Training:** Fine-tuned on a custom contact management dataset with 2,185 examples
- **Special Feature:** Enhanced address recognition with building numbers, apartments, and full street addresses

## Technical Improvements

This model includes several technical improvements over standard NER models:

1. **Enhanced Tokenization:** Improved handling of addresses using a fuzzy matching algorithm
2. **Rich Training Data:** 115+ real-world address examples from major US cities
3. **Address Variations:** Multiple formats including "address-first" patterns
4. **High Accuracy:** 95.1% overall accuracy, 93.7% entity-level accuracy

## Updates

- **v1.0.0 (2025-01-18):** Initial release
  - 95.1% accuracy
  - Full address recognition with building numbers
  - 2,185 training examples
  - Support for 9 entity types

## License

MIT License - See LICENSE file for details.

This model is a derivative work based on RoBERTa, which Facebook, Inc. released under the MIT License.

## Contact

- **Author:** Mykyta Kotenko
- **Repository:** [assistant-bot](https://github.com/kms-engineer/assistant-bot)
- **Issues:** Please report issues on GitHub
- **Hugging Face:** [kms-engineer](https://huggingface.co/kms-engineer)

## Related Models and Datasets

- **Intent Classifier:** [kms-engineer/assistant-bot-intent-classifier](https://huggingface.co/kms-engineer/assistant-bot-intent-classifier)
- **Dataset:** [kms-engineer/assistant-bot-ner-dataset](https://huggingface.co/datasets/kms-engineer/assistant-bot-ner-dataset)