---
library_name: transformers
pipeline_tag: token-classification
tags:
- name-splitting
- ner
- modernbert
- names
language:
- es
- en
- pt
license: mit
model-index:
- name: tori2-bilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (bilineal)
    dataset:
      type: custom
      name: Bilineal eval split (MX/CO/ES/PE/CL)
    metrics:
    - type: f1
      value: 0.9948
      name: F1
    - type: precision
      value: 0.9948
      name: Precision
    - type: recall
      value: 0.9949
      name: Recall
- name: tori2-unilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (unilineal)
    dataset:
      type: custom
      name: Unilineal eval split (AR/US/BR/PT)
    metrics:
    - type: f1
      value: 0.9927
      name: F1
    - type: precision
      value: 0.9927
      name: Precision
    - type: recall
      value: 0.9927
      name: Recall
---

# Tori v2 — Name Splitter

ModernBERT-base (149M params) fine-tuned for splitting full name strings into
**forenames** and **surnames** using BIO token classification.

## Evaluation Results

| Variant | F1 | Precision | Recall | Eval Dataset |
|---------|---:|----------:|-------:|--------------|
| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |

Metrics are entity-level, computed with seqeval: a predicted name span counts as correct only if its label and all of its tokens match the reference span.
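
To illustrate what entity-level scoring means in practice, here is a minimal pure-Python sketch. It is not seqeval's implementation and not the evaluation script used for this model; the helper names `extract_spans` and `entity_scores` are hypothetical, and the label sequences are made up for the example. It shows that a span with one wrong token is counted as a full miss.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, first index, last index inclusive)


def extract_spans(labels: List[str]) -> Set[Span]:
    """Collect (type, start, end) spans from one BIO label sequence."""
    spans: Set[Span] = set()
    start, current = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes the last span
        prefix, _, kind = label.partition("-")
        continues = prefix == "I" and kind == current
        if current is not None and not continues:
            spans.add((current, start, i - 1))
            start, current = None, None
        # A stray I- tag with no open span is treated as a span start.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, kind
    return spans


def entity_scores(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = extract_spans(g), extract_spans(p)
        tp += len(g_spans & p_spans)
        fp += len(p_spans - g_spans)
        fn += len(g_spans - p_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Three of four tokens are right, but the surname span is cut short,
# so only the forename span counts as a match.
gold = [["B-forenames", "I-forenames", "B-surnames", "I-surnames"]]
pred = [["B-forenames", "I-forenames", "B-surnames", "O"]]
print(entity_scores(gold, pred))  # (0.5, 0.5, 0.5)
```

Token-level accuracy on this example would be 0.75, which is why entity-level numbers are the stricter of the two.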

## Variants

| Variant | Countries | Surname Pattern | Subfolder |
|---------|-----------|-----------------|-----------|
| **bilineal** (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | `/` (root) |
| **unilineal** | AR, US, BR, PT | Single surname | `unilineal/` |

## Usage

```python
from tori.inference import load_pipeline, split_name

# Default: bilineal model (double-surname countries)
pipe = load_pipeline("ittailup/tori2")
result = split_name(pipe, "Juan Carlos García López")
print(result.forenames)  # ['Juan', 'Carlos']
print(result.surnames)   # ['García', 'López']

# Unilineal model (single-surname countries)
pipe = load_pipeline("ittailup/tori2", variant="unilineal")
result = split_name(pipe, "John Michael Smith")
print(result.forenames)  # ['John', 'Michael']
print(result.surnames)   # ['Smith']
```

## Labels

- `O` — Outside any name entity
- `B-forenames` — Beginning of forename
- `I-forenames` — Inside forename (continuation)
- `B-surnames` — Beginning of surname
- `I-surnames` — Inside surname (continuation)

## Important: Custom Aggregation Required

This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is
**not compatible** with Hugging Face's built-in `aggregation_strategy="simple"`.
Use the `tori.inference` module, which handles subword aggregation correctly,
or set `aggregation_strategy="none"` and aggregate tokens yourself using
character offsets.
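
If you take the do-it-yourself route, the following is a minimal sketch of offset-based aggregation. It is not the `tori.inference` implementation. It assumes each prediction is a dict with `entity`, `start` and `end` keys, as the `transformers` token-classification pipeline returns with `aggregation_strategy="none"`, and it makes one possible design choice: the first subword of each whitespace-delimited word decides that word's label. The function names and the hand-written token list are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def aggregate_by_offsets(text: str, tokens: List[dict]) -> List[Tuple[str, str]]:
    """Merge subword predictions into one (word, label) pair per whitespace-delimited word."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels: Dict[int, str] = {}
    for tok in sorted(tokens, key=lambda t: t["start"]):
        start, end = tok["start"], tok["end"]
        # Byte-level BPE offsets may include the leading space; skip it.
        while start < end and text[start].isspace():
            start += 1
        if start >= end:
            continue
        for idx, (w_start, w_end) in enumerate(words):
            if w_start <= start < w_end:
                labels.setdefault(idx, tok["entity"])  # first subword decides
                break
    return [(text[s:e], labels.get(i, "O")) for i, (s, e) in enumerate(words)]


def split_labels(pairs: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Separate aggregated words into forenames and surnames by label suffix."""
    forenames = [word for word, label in pairs if label.endswith("forenames")]
    surnames = [word for word, label in pairs if label.endswith("surnames")]
    return forenames, surnames


# Hand-written stand-in for pipeline output (assumed shape, not real model output).
text = "Juan García"
tokens = [
    {"entity": "B-forenames", "start": 0, "end": 4},
    {"entity": "B-surnames", "start": 4, "end": 8},   # offsets include the space
    {"entity": "I-surnames", "start": 8, "end": 11},
]
pairs = aggregate_by_offsets(text, tokens)
print(pairs)                # [('Juan', 'B-forenames'), ('García', 'B-surnames')]
print(split_labels(pairs))  # (['Juan'], ['García'])
```

With a real model, `tokens` would come from something like `pipeline("token-classification", model="ittailup/tori2", aggregation_strategy="none")(text)`. Other voting schemes, such as taking the highest-scoring subword, are equally reasonable.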

## Training

- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
- **Batch size**: 256 effective (128 x 2 gradient accumulation)
- **Learning rate**: 5e-5, cosine schedule with 10% warmup
- **Epochs**: 3
- **Precision**: bf16
- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
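
For readers who want to reproduce a similar setup, the hyperparameters above could be written as keyword arguments along these lines. This is a sketch only: the model card does not say which training loop was used, so the key names are an assumption borrowed from `transformers.TrainingArguments`, and the dict name `training_kwargs` is hypothetical.

```python
# Hypothetical mapping of the listed hyperparameters onto
# transformers.TrainingArguments-style keyword names (an assumption).
training_kwargs = {
    "per_device_train_batch_size": 128,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,  # 10% warmup
    "num_train_epochs": 3,
    "bf16": True,
}

# On a single GPU, effective batch size = per-device batch x accumulation steps.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 256
```

These could then be passed as `TrainingArguments(output_dir=..., **training_kwargs)` on hardware that supports bf16.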