Add evaluation metrics to model card
README.md
CHANGED

@@ -11,6 +11,43 @@ language:
 - en
 - pt
 license: mit
+model-index:
+- name: tori2-bilineal
+  results:
+  - task:
+      type: token-classification
+      name: Name Splitting (bilineal)
+    dataset:
+      type: custom
+      name: Bilineal eval split (MX/CO/ES/PE/CL)
+    metrics:
+    - type: f1
+      value: 0.9948
+      name: F1
+    - type: precision
+      value: 0.9948
+      name: Precision
+    - type: recall
+      value: 0.9949
+      name: Recall
+- name: tori2-unilineal
+  results:
+  - task:
+      type: token-classification
+      name: Name Splitting (unilineal)
+    dataset:
+      type: custom
+      name: Unilineal eval split (AR/US/BR/PT)
+    metrics:
+    - type: f1
+      value: 0.9927
+      name: F1
+    - type: precision
+      value: 0.9927
+      name: Precision
+    - type: recall
+      value: 0.9927
+      name: Recall
 ---
 
 # Tori v2 — Name Splitter
@@ -18,6 +55,15 @@ license: mit
 ModernBERT-base (149M params) fine-tuned for splitting full name strings into
 **forenames** and **surnames** using BIO token classification.
 
+## Evaluation Results
+
+| Variant | F1 | Precision | Recall | Eval Dataset |
+|---------|---:|----------:|-------:|--------------|
+| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
+| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |
+
+Metrics are entity-level (seqeval) — a name span is only correct if all its tokens match.
+
 ## Variants
 
 | Variant | Countries | Surname Pattern | Subfolder |
@@ -58,3 +104,13 @@ This model uses ModernBERT's GPT-style BPE tokenizer (Ġ prefix), which is
 Use the `tori.inference` module which handles subword aggregation correctly,
 or use `aggregation_strategy="none"` and aggregate tokens yourself using
 character offsets.
+
+## Training
+
+- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
+- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
+- **Batch size**: 256 effective (128 x 2 gradient accumulation)
+- **Learning rate**: 5e-5, cosine schedule with 10% warmup
+- **Epochs**: 3
+- **Precision**: bf16
+- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
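The card's tokenizer note says to run with `aggregation_strategy="none"` and merge subword predictions by character offsets yourself. A minimal sketch of that merging step, assuming pipeline-style prediction dicts with `entity`, `start`, and `end` keys; the `B-FORENAME`/`I-FORENAME`/`B-SURNAME`/`I-SURNAME` label names are hypothetical, not confirmed by the card:

```python
def aggregate_bio_spans(text, token_preds):
    """Merge per-token BIO predictions into (label, substring) entity spans.

    Each prediction is a dict with "entity" (a BIO label like "B-SURNAME"),
    and "start"/"end" character offsets into `text`, as produced by a
    transformers token-classification pipeline with aggregation_strategy="none".
    """
    spans = []  # list of (tag, start_offset, end_offset)
    for pred in token_preds:
        label = pred["entity"]
        if label == "O":
            continue
        prefix, _, tag = label.partition("-")
        if spans and prefix == "I" and spans[-1][0] == tag:
            # I- tag continuing the previous span of the same type:
            # extend its end offset.
            spans[-1] = (tag, spans[-1][1], pred["end"])
        else:
            # B- tag (or an I- tag with no matching open span) starts a new span.
            spans.append((tag, pred["start"], pred["end"]))
    # Slice the original string by offsets so whitespace is reconstructed exactly.
    return [(tag, text[start:end]) for tag, start, end in spans]

text = "Ana Maria Silva Costa"
preds = [
    {"entity": "B-FORENAME", "start": 0, "end": 3},
    {"entity": "I-FORENAME", "start": 4, "end": 9},
    {"entity": "B-SURNAME", "start": 10, "end": 15},
    {"entity": "I-SURNAME", "start": 16, "end": 21},
]
print(aggregate_bio_spans(text, preds))
# [('FORENAME', 'Ana Maria'), ('SURNAME', 'Silva Costa')]
```

Slicing by offsets, rather than concatenating token strings, sidesteps the Ġ-prefix handling of ModernBERT's BPE tokenizer entirely.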
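The hyperparameters listed under Training map directly onto `transformers.TrainingArguments`; the fragment below is a sketch reconstructed from the card, not the authors' actual training script, and `output_dir` is a hypothetical path:

```python
from transformers import TrainingArguments

# Reconstructed from the card's Training section (an assumption, not the
# authors' script). 128 per-device x 2 accumulation steps = 256 effective.
args = TrainingArguments(
    output_dir="tori2-name-splitter",  # hypothetical output path
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # 10% warmup
    num_train_epochs=3,
    bf16=True,
)
```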