---
library_name: transformers
pipeline_tag: token-classification
tags:
- name-splitting
- ner
- modernbert
- names
language:
- es
- en
- pt
license: mit
model-index:
- name: tori2-bilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (bilineal)
    dataset:
      type: custom
      name: Bilineal eval split (MX/CO/ES/PE/CL)
    metrics:
    - type: f1
      value: 0.9948
      name: F1
    - type: precision
      value: 0.9948
      name: Precision
    - type: recall
      value: 0.9949
      name: Recall
- name: tori2-unilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (unilineal)
    dataset:
      type: custom
      name: Unilineal eval split (AR/US/BR/PT)
    metrics:
    - type: f1
      value: 0.9927
      name: F1
    - type: precision
      value: 0.9927
      name: Precision
    - type: recall
      value: 0.9927
      name: Recall
---
# Tori v2 — Name Splitter
ModernBERT-base (149M params) fine-tuned for splitting full name strings into
**forenames** and **surnames** using BIO token classification.
## Evaluation Results
| Variant | F1 | Precision | Recall | Eval Dataset |
|---------|---:|----------:|-------:|--------------|
| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |

Metrics are entity-level (seqeval): a name span counts as correct only if every token in the span is labeled correctly.
## Variants
| Variant | Countries | Surname Pattern | Subfolder |
|---------|-----------|-----------------|-----------|
| **bilineal** (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | `/` (root) |
| **unilineal** | AR, US, BR, PT | Single surname | `unilineal/` |
## Usage
```python
from tori.inference import load_pipeline, split_name

# Default: bilineal model (double-surname countries)
pipe = load_pipeline("ittailup/tori2")
result = split_name(pipe, "Juan Carlos García López")
print(result.forenames)  # ['Juan', 'Carlos']
print(result.surnames)   # ['García', 'López']

# Unilineal model (single-surname countries)
pipe = load_pipeline("ittailup/tori2", variant="unilineal")
result = split_name(pipe, "John Michael Smith")
print(result.forenames)  # ['John', 'Michael']
print(result.surnames)   # ['Smith']
```
## Labels
- `O` — Outside any name entity
- `B-forenames` — Beginning of forename
- `I-forenames` — Inside forename (continuation)
- `B-surnames` — Beginning of surname
- `I-surnames` — Inside surname (continuation)
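To make the scheme concrete, here is a hand-written illustration of how the labels apply to a double-surname name at the word level (an example for exposition, not actual model output):

```python
# Illustrative word-level BIO labeling for "Juan Carlos García López"
# (hand-written example, not actual model output).
tokens = ["Juan", "Carlos", "García", "López"]
labels = ["B-forenames", "I-forenames", "B-surnames", "I-surnames"]

for token, label in zip(tokens, labels):
    print(f"{token:<8} {label}")
```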
## Important: Custom Aggregation Required
This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is
**not compatible** with HuggingFace's built-in `aggregation_strategy="simple"`.
Use the `tori.inference` module, which handles subword aggregation correctly,
or use `aggregation_strategy="none"` and aggregate the tokens yourself using
character offsets.
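As a sketch of the second option, the function below merges subword predictions back into whole words by checking whether each token's character offset starts exactly where the previous one ended. The input dicts imitate the `entity`/`word`/`start`/`end` format that a `transformers` token-classification pipeline emits with `aggregation_strategy="none"`; the exact offsets and token strings depend on the tokenizer, so treat this as an illustration under those assumptions, not the `tori.inference` implementation.

```python
# Minimal sketch of character-offset aggregation for a BPE tokenizer
# with Ġ-prefixed tokens. The dicts below mimic the output format of a
# transformers token-classification pipeline with aggregation_strategy="none".

def aggregate(token_preds):
    """Merge subword predictions into word-level spans.

    A token continues the previous word when it starts exactly at the
    character where the previous token ended (i.e. no space between them).
    """
    words = []
    for tok in token_preds:
        label = tok["entity"].split("-", 1)[-1]  # strip the B-/I- prefix
        if words and tok["start"] == words[-1]["end"]:
            # Continuation subword: extend the current word.
            words[-1]["word"] += tok["word"].lstrip("Ġ")
            words[-1]["end"] = tok["end"]
        else:
            words.append({
                "word": tok["word"].lstrip("Ġ"),
                "label": label,
                "start": tok["start"],
                "end": tok["end"],
            })
    return words

# Hypothetical predictions for "Juan García", with "García" split by the
# BPE tokenizer into two subwords.
raw = [
    {"entity": "B-forenames", "word": "Juan", "start": 0, "end": 4},
    {"entity": "B-surnames",  "word": "ĠGar", "start": 5, "end": 8},
    {"entity": "I-surnames",  "word": "cía",  "start": 8, "end": 11},
]
print(aggregate(raw))
```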
## Training
- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
- **Batch size**: 256 effective (128 x 2 gradient accumulation)
- **Learning rate**: 5e-5, cosine schedule with 10% warmup
- **Epochs**: 3
- **Precision**: bf16
- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
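The hyperparameters above can be expressed as a `transformers` `TrainingArguments` sketch. This is a hypothetical reconstruction, not the published training script: `output_dir` and any argument not listed in the bullets are assumptions.

```python
# Hypothetical reconstruction of the training configuration from the
# hyperparameters listed above; the actual training script may differ.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="tori2-bilineal",          # assumed name
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,        # 256 effective batch size
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                     # 10% warmup
    num_train_epochs=3,
    bf16=True,
)
```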