Add evaluation metrics to model card
README.md
CHANGED

@@ -11,6 +11,43 @@ language:
 - en
 - pt
 license: mit
+model-index:
+- name: tori2-bilineal
+  results:
+  - task:
+      type: token-classification
+      name: Name Splitting (bilineal)
+    dataset:
+      type: custom
+      name: Bilineal eval split (MX/CO/ES/PE/CL)
+    metrics:
+    - type: f1
+      value: 0.9948
+      name: F1
+    - type: precision
+      value: 0.9948
+      name: Precision
+    - type: recall
+      value: 0.9949
+      name: Recall
+- name: tori2-unilineal
+  results:
+  - task:
+      type: token-classification
+      name: Name Splitting (unilineal)
+    dataset:
+      type: custom
+      name: Unilineal eval split (AR/US/BR/PT)
+    metrics:
+    - type: f1
+      value: 0.9927
+      name: F1
+    - type: precision
+      value: 0.9927
+      name: Precision
+    - type: recall
+      value: 0.9927
+      name: Recall
 ---
 
 # Tori v2 — Name Splitter
@@ -18,6 +55,15 @@ license: mit
 ModernBERT-base (149M params) fine-tuned for splitting full name strings into
 **forenames** and **surnames** using BIO token classification.
 
+## Evaluation Results
+
+| Variant | F1 | Precision | Recall | Eval Dataset |
+|---------|---:|----------:|-------:|--------------|
+| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
+| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |
+
+Metrics are entity-level (seqeval) — a name span is only correct if all its tokens match.
+
 ## Variants
 
 | Variant | Countries | Surname Pattern | Subfolder |
@@ -58,3 +104,13 @@ This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is
 Use the `tori.inference` module which handles subword aggregation correctly,
 or use `aggregation_strategy="none"` and aggregate tokens yourself using
 character offsets.
+
+## Training
+
+- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
+- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
+- **Batch size**: 256 effective (128 x 2 gradient accumulation)
+- **Learning rate**: 5e-5, cosine schedule with 10% warmup
+- **Epochs**: 3
+- **Precision**: bf16
+- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
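The card's tokenizer note says to run with `aggregation_strategy="none"` and merge subword predictions by character offsets yourself. A minimal sketch of that merging step, assuming pipeline-style prediction dicts with `entity`, `start`, and `end` keys; the `B-FORENAME`/`I-FORENAME`/`B-SURNAME`/`I-SURNAME` label names are hypothetical, not confirmed by the card:

```python
def aggregate_bio_spans(text, token_preds):
    """Merge per-token BIO predictions into (label, substring) entity spans.

    Each prediction is a dict with "entity" (a BIO label like "B-SURNAME"),
    and "start"/"end" character offsets into `text`, as produced by a
    transformers token-classification pipeline with aggregation_strategy="none".
    """
    spans = []  # list of (tag, start_offset, end_offset)
    for pred in token_preds:
        label = pred["entity"]
        if label == "O":
            continue
        prefix, _, tag = label.partition("-")
        if spans and prefix == "I" and spans[-1][0] == tag:
            # I- tag continuing the previous span of the same type:
            # extend its end offset.
            spans[-1] = (tag, spans[-1][1], pred["end"])
        else:
            # B- tag (or an I- tag with no matching open span) starts a new span.
            spans.append((tag, pred["start"], pred["end"]))
    # Slice the original string by offsets so whitespace is reconstructed exactly.
    return [(tag, text[start:end]) for tag, start, end in spans]

text = "Ana Maria Silva Costa"
preds = [
    {"entity": "B-FORENAME", "start": 0, "end": 3},
    {"entity": "I-FORENAME", "start": 4, "end": 9},
    {"entity": "B-SURNAME", "start": 10, "end": 15},
    {"entity": "I-SURNAME", "start": 16, "end": 21},
]
print(aggregate_bio_spans(text, preds))
# [('FORENAME', 'Ana Maria'), ('SURNAME', 'Silva Costa')]
```

Slicing by offsets, rather than concatenating token strings, sidesteps the Ġ-prefix handling of ModernBERT's BPE tokenizer entirely.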
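The hyperparameters listed under Training map directly onto `transformers.TrainingArguments`; the fragment below is a sketch reconstructed from the card, not the authors' actual training script, and `output_dir` is a hypothetical path:

```python
from transformers import TrainingArguments

# Reconstructed from the card's Training section (an assumption, not the
# authors' script). 128 per-device x 2 accumulation steps = 256 effective.
args = TrainingArguments(
    output_dir="tori2-name-splitter",  # hypothetical output path
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # 10% warmup
    num_train_epochs=3,
    bf16=True,
)
```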