intertext-classical-swedish-sentence

A cross-lingual bi-encoder for finding classical Greek and Latin sentences cited, translated, paraphrased, or alluded to in Swedish prose. It embeds short texts (single sentences or short clauses) into a shared 1024-dimensional space in which translations and citations cluster together across languages.

For passage-level matching (5-sentence windows), see the companion model Ericu950/intertext-classical-swedish-window.

Quick start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")
model.max_seq_length = 192

src = "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."  # Galatians 3:7 (Greek)
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn.",  # Galatians 3:7 (Swedish)
    "Han gick genom rummet och stannade vid fönstret.",             # "He walked through the room and stopped at the window."
    "Det har vi väntat på i fyra hundra år.",                       # "We have waited four hundred years for that."
]

# With normalize_embeddings=True, the dot products below are cosine similarities.
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T
for c, s in zip(candidates, scores):
    print(f"{s:+.3f}  {c}")
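
If everything is wired up correctly, the first candidate (the Swedish rendering of Galatians 3:7) should score well above the two unrelated distractors.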

Training data

Training data comes from Ericu950/classical-swedish-citations (the sentences config).
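
A minimal sketch for inspecting the data with the Hugging Face datasets library; the sentences config name comes from this card, while the train split name is an assumption:

from datasets import load_dataset

# Load the "sentences" config named above (the split name is an assumption).
ds = load_dataset("Ericu950/classical-swedish-citations", "sentences", split="train")
print(ds)     # columns and row count
print(ds[0])  # one example record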

Training procedure

  • Base model: BAAI/bge-m3, via an intermediate trilingual fine-tune on synthetic (Greek/Latin)–English–Swedish parallel pairs from Ericu950/classical-swedish-synthetic-parallel.
  • Losses: MultipleNegativesRankingLoss on the triplets + OnlineContrastiveLoss (margin 0.3) on the contrastive pairs.
  • Multi-dataset sampling: round-robin, so the smaller triplet dataset isn't drowned out by the contrastive pairs (see the sketch after this list).
  • Optimizer: AdamW, learning rate 2e-6, warmup ratio 0.05.
  • Schedule: 1 epoch (the model converged early and the best checkpoint by held-out nDCG@10 was selected).
  • Batch size: 32 per GPU, bf16 mixed precision, 4× A100 80GB.
  • max_seq_length: 192.
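
A minimal sketch of how these pieces could be wired together with the sentence-transformers v3 multi-dataset trainer. It starts from the base model directly (skipping the intermediate trilingual fine-tune described above); the toy datasets, column names, and output directory are illustrative assumptions, and only the hyperparameters come from this card:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import MultiDatasetBatchSamplers

model = SentenceTransformer("BAAI/bge-m3")
model.max_seq_length = 192

# Toy stand-ins for the real triplet and contrastive-pair datasets.
triplets = Dataset.from_dict({
    "anchor": ["Γνῶθι σεαυτόν."],                # "Know thyself"
    "positive": ["Känn dig själv."],             # Swedish translation
    "negative": ["Han stannade vid fönstret."],  # unrelated Swedish sentence
})
pairs = Dataset.from_dict({
    "sentence1": ["Γνῶθι σεαυτόν."],
    "sentence2": ["Det regnade hela dagen."],    # "It rained all day."
    "label": [0],                                # 1 = related pair, 0 = unrelated
})

args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",  # assumption
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-6,
    warmup_ratio=0.05,
    bf16=True,
    # Alternate batches between the two datasets so neither dominates.
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset={"triplets": triplets, "pairs": pairs},
    loss={
        "triplets": losses.MultipleNegativesRankingLoss(model),
        "pairs": losses.OnlineContrastiveLoss(model, margin=0.3),
    },
)
trainer.train()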

Evaluation

Held-out set: 269 unique source anchors (sentences) from Ericu950/classical-swedish-citations, split off before mining (no leakage). The document pool contains:

  • 269 gold target Swedish sentences
  • 17,371 real production false-positive Swedish sentences (pairs labeled as negatives in the same dataset, drawn from its training side)
  • ~5,000 random Swedish sentences sampled from a 4M-sentence corpus

Total document pool: ~22,500 docs per query. The model must surface the gold target above all distractors.
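
A sketch of this protocol using the standard sentence-transformers retrieval evaluator; the IDs and documents below are placeholders, not the actual held-out pool:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")

# One query per source anchor; exactly one gold Swedish sentence per query.
queries = {"q0": "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."}
corpus = {
    "gold0": "Veten därför, att de som äro av tron, de äro Abrahams barn.",
    "neg0": "Han gick genom rummet och stannade vid fönstret.",
    # ... plus the ~22,500 false-positive and random distractors
}
relevant_docs = {"q0": {"gold0"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    accuracy_at_k=[1, 5, 10, 25],
    name="heldout",
)
print(evaluator(model))  # reports nDCG@10 and Accuracy@k, as in the table below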

Metric        v2 base   v3 (this model)   Δ
nDCG@10       0.879     0.881             +0.002
Accuracy@1    78.1%     78.4%             +0.4%
Accuracy@5    93.7%     93.7%
Accuracy@10   95.5%     95.5%
Accuracy@25   100.0%    100.0%

The improvement is modest. The v2 base is already strong for sentence-level retrieval: 100% recall by rank 25 means the gold target is always found, and the fine-tune mainly sharpens the top of the ranking. The window-level companion model has more room to grow and shows a larger improvement.

Related artifacts

  • Ericu950/intertext-classical-swedish-window: the passage-level (5-sentence window) companion model
  • Ericu950/classical-swedish-citations: training and evaluation data
  • Ericu950/classical-swedish-synthetic-parallel: synthetic parallel data used for the intermediate trilingual fine-tune
  • BAAI/bge-m3: base model

Citation

@misc{intertext_classical_swedish_sentence_2026,
  author       = {Cullhed, Eric},
  title        = {intertext-classical-swedish-sentence: a sentence-level bi-encoder for cross-lingual classical citation detection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-sentence}},
}