etr-lora-v4 — Etruscan-side LoRA adapter for LaBSE

Status note. The numbers in the YAML frontmatter and in the Evaluation table below are the LaBSE-only column of the current frozen rosetta-eval-v1 benchmark. That is what the first Hub deposit covers. The v4 column will be added after WBS tasks T2.3 (ingest v4 vectors behind a feature flag) and T2.4 (run the head-to-head eval) land in prod and the benchmark gains its fourth row.

TL;DR

etr-lora-v4 is a LoRA adapter that fine-tunes the Etruscan-side vocabulary projection of a multilingual encoder (XLM-R-base, with LaBSE as the cross-lingual anchor on the Latin/Greek side) so that Etruscan words land in the same 768-dim semantic space as the rest of the multilingual vocabulary. The system is evaluated against held-out Etruscan ↔ Latin equivalences drawn from the philological literature (Bonfante & Bonfante 2002, Wallace 2008, Pallottino 1968), exposed through the rosetta-eval-v1 frozen benchmark.

The pipeline is designed for semantic-neighbourhood retrieval over a low-resource, undeciphered ancient language, not lexical-equivalence translation. See Limitations before you cite the numbers.

Intended use

Cognate / loanword detection. Given an Etruscan word, find orthographically- or semantically-similar Latin or Greek words. Useful for spotting Etruscan→Latin borrowings (e.g. histrio, popa, subulo, satura).
Theonym and place-name alignment. Etruscan deity and place names were often Latinised by Roman authors with regular sound correspondences. The system reliably recovers these: menrva→minerva, hercle→hercules, fanu→fanum.
Within-language semantic-field exploration. For an Etruscan query, the system returns Latin words with related meanings even when the exact target lemma is wrong (e.g. papa→[papa, daddy, pater]).
Multilingual nearest-neighbour browsing as a primitive other ancient-language work (Phoenician, Faliscan, Oscan) can plug into without rebuilding the storage / API layer.

Out of scope

Mechanical Etruscan → Latin translation. Lexical equivalence between unrelated surface forms (clan → filius, puia → uxor, lautn → familia) is not in the model, and no amount of pooling, centering, or LoRA fine-tuning recovers signal that was never in the training corpus.
Decipherment of unknown Etruscan words. Top-k results will be orthographic and semantic neighbours of the source surface form, not authoritative semantic equivalents.
An Etruscan dictionary. This is not a dictionary. We make no such claim. The output is a ranked shortlist for downstream philological judgement, not a translation.

How to use

From `sentence-transformers`

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Eddy1919/etr-lora-v4")
embeddings = model.encode(["fanu", "avil", "clan"])
# embeddings.shape == (3, 768)

Through the hosted API

curl 'https://api.openetruscan.com/neural/rosetta?word=fanu&from=ett&to=lat&embedder=xlmr-lora-v4'

The default embedder is LaBSE; passing embedder=xlmr-lora-v4 routes the query through the v4 adapter. The route currently returns LaBSE results until T2.3 lands the v4 partition in prod.

Training data

Derived from the OpenEtruscan corpus v1 (Zenodo DOI 10.5281/zenodo.20075836):

6,633 unified inscriptions, drawn primarily from the Larth Dataset (Vico & Spanakis 2023; ~~71% of rows) and the Corpus Inscriptionum Etruscarum Vol. I extractions (~~29%).
~8,905 unique Etruscan tokens on the source side after divider-normalisation (see Training procedure below).
No primary-source-attested anchors are used in training — only the raw transcriptions. The Bonfante / Wallace / Pallottino equivalences are held out for evaluation in rosetta-eval-v1.

Upstream provenance chain is documented in research/BIBLIOGRAPHY.md.

Training procedure

LoRA over XLM-R-base (768-dim hidden), trained on Vertex AI in the openetruscan-rosetta GCP project.

Output adapter: gs://openetruscan-rosetta/adapters/etr-lora-v4/
Re-embedded Etruscan vocabulary: gs://openetruscan-rosetta/embeddings/etr-xlmr-lora-v4.jsonl (8,905 rows × 768 dim).
Etruscan-side preprocessing: word-divider normalisation (: and · → space, per Bonfante 2002 §10), preserving . (intra-word phonological marker) and - (compounding marker).

Hyperparameters (matching the v3 → v4 recipe in scripts/training/vertex/submit_etr_lora_v4.sh):

Hyperparameter	Value
Base model	`xlm-roberta-base`
Epochs	5
Learning rate	5e-4
Batch size	16
Max length	64 tokens
LoRA r	8
LoRA alpha	16
LoRA dropout	0.1
Target modules	`q_proj`, `v_proj`
Seed	42
Hardware	1× NVIDIA T4 (Vertex AI `n1-standard-8`)
Wall time	~30–60 min
Compute cost	~$0.40 USD

The training recipe (and divider-normalisation function) is in scripts/training/vertex/train_etruscan_lora.py. The only delta from v3 is the corpus input (etruscan-prod-rawtext-v3.jsonl, the cleaner V3 corpus produced after normalize_inscriptions.py removed Cyrillic / Latin-Ext-B mirror-glyph corruption and unified sibilant variants σ/ś/š/ς → SAN).

Evaluation

All numbers below are from the first frozen run of rosetta-eval-v1, committed at eval/rosetta-eval-v1-20260510T210124Z.json. The model column reflects the LaBSE baseline that prod was serving at the time of the run. The v4 column will be added when T2.3 lands v4 vectors in prod and T2.4 runs the head-to-head.

Headline numbers — 22-pair test split

Metric	random	Levenshtein	LaBSE (current prod)	v4 (after T2.3 / T2.4)
Strict-lexical precision@10	0.0002	0.000	0.0625	to be added
Semantic-field precision@10	0.0081	0.000	0.1875	to be added
Coverage@cos≥0.50	0.000	0.955	1.000	to be added
Coverage@cos≥0.70	0.000	0.273	1.000	to be added
Coverage@cos≥0.85	0.000	0.091	0.6875	to be added
n evaluated (of 22)	22	22	16	to be added
n skipped (OOV on the source side)	0	0	6 (27.3%)	to be added

Per-confidence breakdown (LaBSE column)

Confidence	n	strict @10	field @10
high	10	0.100	0.200
medium	6	0.000	0.167

Per-category breakdown (LaBSE column, field@10)

Category	n	field @10
kinship	3	0.333
theonym	3	0.333
onomastic	2	0.500
religious	2	0.000
time	2	0.000
numeral	3	0.000
verb	1	0.000

The strict-lexical metric measures something the system cannot do without parallel-data supervision; the semantic-field metric measures what it can do, and is the honest reflection of the system's actual research utility. Both are reported side-by-side for historical comparability.

For the full reproducibility manifest (pinned commit hashes, Latin vocab snapshot, baseline math), see research/notes/reproduce-rosetta-eval-v1.md.

Limitations

Honesty matters more here than marketing:

Small held-out test split (n=22 pairs). Confidence intervals are correspondingly wide. RG.4 in the SOTA roadmap adds 95%-bootstrap CIs to every reported number; until that lands, treat single-decimal-point differences between models as noise.
27% OOV rate on the source side. 6 of the 22 test-split pairs are skipped by the model because the Etruscan token has no vector in language_word_embeddings. The other two baselines (random, Levenshtein) evaluate all 22. Comparisons are accordingly not apples-to-apples without per-pair pairing.
No primary-source-attested anchors used in training. The evaluation set is itself the philological consensus. Any training signal that pushed precision up — short of genuinely parallel data we do not have — would be reflecting that same consensus back at us. Work-package P4 (primary-source mining) is the route out.
Philological consensus reflects a school. The Bonfante & Bonfante / Wallace / Pallottino reading is one school's best reading. Categories like verb (n=1) and time (n=2) are under-represented; the per-category breakdown above is indicative, not authoritative.
Cross-language semantic alignment for unrelated surface forms remains weak. clan → filius, puia → uxor, lautn → familia are misses by design; there is no signal in the training corpus that these are equivalent.

Citation

If you use this model, please cite both the software/dataset DOI and the model directly:

@software{openetruscan_2026,
  author    = {OpenEtruscan Contributors},
  title     = {{OpenEtruscan: open-source digital corpus platform for Etruscan epigraphy}},
  year      = {2026},
  version   = {0.5.0},
  doi       = {10.5281/zenodo.20075836},
  url       = {https://doi.org/10.5281/zenodo.20075836},
  publisher = {Zenodo}
}

@misc{openetruscan_etr_lora_v4_2026,
  author       = {OpenEtruscan Contributors},
  title        = {{etr-lora-v4: Etruscan-side LoRA adapter for LaBSE / XLM-R}},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Eddy1919/etr-lora-v4}},
  note         = {Evaluated against the rosetta-eval-v1 frozen benchmark.}
}

The frozen reference benchmark is rosetta-eval-v1; full reproduction instructions live in research/notes/reproduce-rosetta-eval-v1.md.

License

Apache 2.0 — matches the model-artifact licensing scheme of the OpenEtruscan repository (code: MIT, data: CC0 1.0, models: Apache 2.0).

Acknowledgements

Vico, A. and Spanakis, G. (2023). Larth Dataset — primary source for ~71% of the unified corpus.
Compilers of the Corpus Inscriptionum Etruscarum (CIE Vol. I), source of the remaining ~29%.
Bonfante, G. and Bonfante, L. (2002). The Etruscan Language: An Introduction, 2nd edition.
Wallace, R. E. (2008). Zikh Rasna: A Manual of the Etruscan Language and Inscriptions.
Pallottino, M. (1968). Testimonia Linguae Etruscae.
Feng et al. (2020). LaBSE: Language-agnostic BERT Sentence Embedding — the cross-lingual anchor.
The Pelagios Network, the EpiDoc community, and the Classical Language Toolkit.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Eddy1919/etr-lora-v4

Base model

sentence-transformers/LaBSE

Finetuned

(86)

this model

Evaluation results

Semantic-field precision@10 (LaBSE baseline) on rosetta-eval-v1 (test split)
self-reported

0.188
Strict-lexical precision@10 (LaBSE baseline) on rosetta-eval-v1 (test split)
self-reported

0.063