---
library_name: transformers
pipeline_tag: token-classification
tags:
- name-splitting
- ner
- modernbert
- names
language:
- es
- en
- pt
license: mit
model-index:
- name: tori2-bilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (bilineal)
    dataset:
      type: custom
      name: Bilineal eval split (MX/CO/ES/PE/CL)
    metrics:
    - type: f1
      value: 0.9948
      name: F1
    - type: precision
      value: 0.9948
      name: Precision
    - type: recall
      value: 0.9949
      name: Recall
- name: tori2-unilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (unilineal)
    dataset:
      type: custom
      name: Unilineal eval split (AR/US/BR/PT)
    metrics:
    - type: f1
      value: 0.9927
      name: F1
    - type: precision
      value: 0.9927
      name: Precision
    - type: recall
      value: 0.9927
      name: Recall
---

# Tori v2: Name Splitter

ModernBERT-base (149M params) fine-tuned for splitting full name strings into **forenames** and **surnames** using BIO token classification.

## Evaluation Results

| Variant | F1 | Precision | Recall | Eval Dataset |
|---------|---:|----------:|-------:|--------------|
| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |

Metrics are entity-level (seqeval): a name span is only correct if all its tokens match.

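
To make the entity-level criterion concrete, here is a minimal pure-Python sketch of span-based scoring. It is a simplified illustration, not seqeval itself; the helper names `bio_spans` and `entity_prf` and the example tag sequences are hypothetical and were not taken from the evaluation code.

```python
def bio_spans(tags):
    """Collect (label, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            if label is not None:
                spans.add((label, start, i))
            if prefix in ("B", "I"):
                start, label = i, name  # a stray "I-" opens a span leniently
            else:
                start, label = None, None
    return spans


def entity_prf(gold, pred):
    """Precision, recall and F1 over exactly matching spans."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical tags: the prediction breaks the surname span in two.
gold = ["B-forenames", "I-forenames", "B-surnames", "I-surnames"]
pred = ["B-forenames", "I-forenames", "B-surnames", "B-surnames"]
precision, recall, f1 = entity_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Three of the four tags agree here, yet only one of the two gold spans is matched exactly, which is why entity-level scores are stricter than per-token accuracy.
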
## Variants

| Variant | Countries | Surname Pattern | Subfolder |
|---------|-----------|-----------------|-----------|
| **bilineal** (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | `/` (root) |
| **unilineal** | AR, US, BR, PT | Single surname | `unilineal/` |

## Usage

```python
from tori.inference import load_pipeline, split_name

# Default: bilineal model (double-surname countries)
pipe = load_pipeline("ittailup/tori2")
result = split_name(pipe, "Juan Carlos García López")
print(result.forenames)  # ['Juan', 'Carlos']
print(result.surnames)   # ['García', 'López']

# Unilineal model (single-surname countries)
pipe = load_pipeline("ittailup/tori2", variant="unilineal")
result = split_name(pipe, "John Michael Smith")
print(result.forenames)  # ['John', 'Michael']
print(result.surnames)   # ['Smith']
```

## Labels

- `O`: Outside any name entity
- `B-forenames`: Beginning of forename
- `I-forenames`: Inside forename (continuation)
- `B-surnames`: Beginning of surname
- `I-surnames`: Inside surname (continuation)

## Important: Custom Aggregation Required

This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is **not compatible** with HuggingFace's built-in `aggregation_strategy="simple"`. Use the `tori.inference` module which handles subword aggregation correctly, or use `aggregation_strategy="none"` and aggregate tokens yourself using character offsets.

## Training

- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
- **Batch size**: 256 effective (128 x 2 gradient accumulation)
- **Learning rate**: 5e-5, cosine schedule with 10% warmup
- **Epochs**: 3
- **Precision**: bf16
- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
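
For the manual route mentioned under "Custom Aggregation Required" (`aggregation_strategy="none"` plus character offsets), the sketch below shows one possible way to regroup sub-token predictions into whole words. It assumes the input is a list of dicts with `entity`, `start` and `end` keys, as the Hugging Face token-classification pipeline returns when aggregation is disabled. The function name `split_by_offsets` and the sample predictions are hypothetical, and this is not the `tori.inference` implementation.

```python
import re


def split_by_offsets(text, token_preds):
    """Assign each whitespace-delimited word the class of the first
    labelled sub-token that overlaps it by character offset."""
    result = {"forenames": [], "surnames": []}
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.start(), match.end()
        label = None
        for tok in token_preds:
            if tok.get("start") is None or tok.get("end") is None:
                continue  # offsets are only present with fast tokenizers
            overlap = min(w_end, tok["end"]) - max(w_start, tok["start"])
            if overlap > 0 and tok["entity"] != "O":
                # Drop the B-/I- prefix and keep only the class name.
                label = tok["entity"].split("-", 1)[1]
                break
        if label in result:
            result[label].append(text[w_start:w_end])
    return result


text = "Juan Carlos García López"
# Hypothetical raw predictions; real tokenization and offsets may differ.
token_preds = [
    {"entity": "B-forenames", "start": 0, "end": 4},
    {"entity": "I-forenames", "start": 4, "end": 11},  # span includes the leading space
    {"entity": "B-surnames", "start": 12, "end": 15},  # first piece of "García"
    {"entity": "I-surnames", "start": 15, "end": 18},  # second piece of "García"
    {"entity": "I-surnames", "start": 19, "end": 24},
]
parts = split_by_offsets(text, token_preds)
print(parts["forenames"], parts["surnames"])
```

Matching on character overlap rather than on the `word` field avoids depending on the `Ġ` prefix, and it tolerates offsets that either include or exclude the leading space. Letting the first labelled sub-token decide the word's class is a deliberate simplification; a majority vote or score-weighted choice would be an equally reasonable design.
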