---
library_name: transformers
pipeline_tag: token-classification
tags:
- name-splitting
- ner
- modernbert
- names
language:
- es
- en
- pt
license: mit
model-index:
- name: tori2-bilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (bilineal)
    dataset:
      type: custom
      name: Bilineal eval split (MX/CO/ES/PE/CL)
    metrics:
    - type: f1
      value: 0.9948
      name: F1
    - type: precision
      value: 0.9948
      name: Precision
    - type: recall
      value: 0.9949
      name: Recall
- name: tori2-unilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (unilineal)
    dataset:
      type: custom
      name: Unilineal eval split (AR/US/BR/PT)
    metrics:
    - type: f1
      value: 0.9927
      name: F1
    - type: precision
      value: 0.9927
      name: Precision
    - type: recall
      value: 0.9927
      name: Recall
---
# Tori v2 — Name Splitter
ModernBERT-base (149M params) fine-tuned for splitting full name strings into
**forenames** and **surnames** using BIO token classification.
## Evaluation Results
| Variant | F1 | Precision | Recall | Eval Dataset |
|---------|---:|----------:|-------:|--------------|
| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |

Metrics are entity-level (seqeval): a name span counts as correct only if every token in the span is labeled correctly.
## Variants
| Variant | Countries | Surname Pattern | Subfolder |
|---------|-----------|-----------------|-----------|
| **bilineal** (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | `/` (root) |
| **unilineal** | AR, US, BR, PT | Single surname | `unilineal/` |
## Usage
```python
from tori.inference import load_pipeline, split_name

# Default: bilineal model (double-surname countries)
pipe = load_pipeline("ittailup/tori2")
result = split_name(pipe, "Juan Carlos García López")
print(result.forenames)  # ['Juan', 'Carlos']
print(result.surnames)   # ['García', 'López']

# Unilineal model (single-surname countries)
pipe = load_pipeline("ittailup/tori2", variant="unilineal")
result = split_name(pipe, "John Michael Smith")
print(result.forenames)  # ['John', 'Michael']
print(result.surnames)   # ['Smith']
```
## Labels
- `O` — Outside any name entity
- `B-forenames` — Beginning of forename
- `I-forenames` — Inside forename (continuation)
- `B-surnames` — Beginning of surname
- `I-surnames` — Inside surname (continuation)
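To make the scheme concrete, here is a hand-written illustration of how the labels apply to a double-surname name at the word level (an example for exposition, not actual model output):

```python
# Illustrative word-level BIO labeling for "Juan Carlos García López"
# (hand-written example, not actual model output).
tokens = ["Juan", "Carlos", "García", "López"]
labels = ["B-forenames", "I-forenames", "B-surnames", "I-surnames"]

for token, label in zip(tokens, labels):
    print(f"{token:<8} {label}")
```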
## Important: Custom Aggregation Required
This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is
**not compatible** with HuggingFace's built-in `aggregation_strategy="simple"`.
Use the `tori.inference` module, which handles subword aggregation correctly,
or use `aggregation_strategy="none"` and aggregate the tokens yourself using
character offsets.
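As a sketch of the second option, the function below merges subword predictions back into whole words by checking whether each token's character offset starts exactly where the previous one ended. The input dicts imitate the `entity`/`word`/`start`/`end` format that a `transformers` token-classification pipeline emits with `aggregation_strategy="none"`; the exact offsets and token strings depend on the tokenizer, so treat this as an illustration under those assumptions, not the `tori.inference` implementation.

```python
# Minimal sketch of character-offset aggregation for a BPE tokenizer
# with Ġ-prefixed tokens. The dicts below mimic the output format of a
# transformers token-classification pipeline with aggregation_strategy="none".

def aggregate(token_preds):
    """Merge subword predictions into word-level spans.

    A token continues the previous word when it starts exactly at the
    character where the previous token ended (i.e. no space between them).
    """
    words = []
    for tok in token_preds:
        label = tok["entity"].split("-", 1)[-1]  # strip the B-/I- prefix
        if words and tok["start"] == words[-1]["end"]:
            # Continuation subword: extend the current word.
            words[-1]["word"] += tok["word"].lstrip("Ġ")
            words[-1]["end"] = tok["end"]
        else:
            words.append({
                "word": tok["word"].lstrip("Ġ"),
                "label": label,
                "start": tok["start"],
                "end": tok["end"],
            })
    return words

# Hypothetical predictions for "Juan García", with "García" split by the
# BPE tokenizer into two subwords.
raw = [
    {"entity": "B-forenames", "word": "Juan", "start": 0, "end": 4},
    {"entity": "B-surnames",  "word": "ĠGar", "start": 5, "end": 8},
    {"entity": "I-surnames",  "word": "cía",  "start": 8, "end": 11},
]
print(aggregate(raw))
```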
## Training
- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
- **Batch size**: 256 effective (128 x 2 gradient accumulation)
- **Learning rate**: 5e-5, cosine schedule with 10% warmup
- **Epochs**: 3
- **Precision**: bf16
- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
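The hyperparameters above can be expressed as a `transformers` `TrainingArguments` sketch. This is a hypothetical reconstruction, not the published training script: `output_dir` and any argument not listed in the bullets are assumptions.

```python
# Hypothetical reconstruction of the training configuration from the
# hyperparameters listed above; the actual training script may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tori2-bilineal",          # assumed name
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,        # 256 effective batch size
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                     # 10% warmup
    num_train_epochs=3,
    bf16=True,
)
```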