intertext-classical-swedish-sentence

A cross-lingual bi-encoder for finding classical Greek and Latin sentences cited, translated, paraphrased, or alluded to in Swedish prose. It embeds short texts (single sentences or short clauses) into a shared 1024-dimensional space in which translations and citations cluster together across languages.

For passage-level matching (5-sentence windows), see the companion model Ericu950/intertext-classical-swedish-window.

Quick start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")
model.max_seq_length = 192

src = "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."  # Galatians 3:7 (Greek)
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn.",  # Galatians 3:7 (Swedish)
    "Han gick genom rummet och stannade vid fönstret.",             # "He walked through the room and stopped at the window."
    "Det har vi väntat på i fyra hundra år.",                       # "We have waited four hundred years for that."
]

# With normalize_embeddings=True, the dot products below are cosine similarities.
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T
for c, s in zip(candidates, scores):
    print(f"{s:+.3f}  {c}")
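
If everything is wired up correctly, the first candidate (the Swedish rendering of Galatians 3:7) should score well above the two unrelated distractors.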

Training data

Training data comes from Ericu950/classical-swedish-citations (the sentences config).
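
A minimal sketch for inspecting the data with the Hugging Face datasets library; the sentences config name comes from this card, while the train split name is an assumption:

from datasets import load_dataset

# Load the "sentences" config named above (the split name is an assumption).
ds = load_dataset("Ericu950/classical-swedish-citations", "sentences", split="train")
print(ds)     # columns and row count
print(ds[0])  # one example record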

Training procedure

  • Base model: BAAI/bge-m3, via an intermediate trilingual fine-tune on synthetic (Greek/Latin)–English–Swedish parallel pairs from Ericu950/classical-swedish-synthetic-parallel.
  • Losses: MultipleNegativesRankingLoss on the triplets + OnlineContrastiveLoss (margin 0.3) on the contrastive pairs.
  • Multi-dataset sampling: round-robin, so the smaller triplet dataset isn't drowned out by the contrastive pairs (see the sketch after this list).
  • Optimizer: AdamW, learning rate 2e-6, warmup ratio 0.05.
  • Schedule: 1 epoch (the model converged early and the best checkpoint by held-out nDCG@10 was selected).
  • Batch size: 32 per GPU, bf16 mixed precision, 4× A100 80GB.
  • max_seq_length: 192.
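
A minimal sketch of how these pieces could be wired together with the sentence-transformers v3 multi-dataset trainer. It starts from the base model directly (skipping the intermediate trilingual fine-tune described above); the toy datasets, column names, and output directory are illustrative assumptions, and only the hyperparameters come from this card:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import MultiDatasetBatchSamplers

model = SentenceTransformer("BAAI/bge-m3")
model.max_seq_length = 192

# Toy stand-ins for the real triplet and contrastive-pair datasets.
triplets = Dataset.from_dict({
    "anchor": ["Γνῶθι σεαυτόν."],                # "Know thyself"
    "positive": ["Känn dig själv."],             # Swedish translation
    "negative": ["Han stannade vid fönstret."],  # unrelated Swedish sentence
})
pairs = Dataset.from_dict({
    "sentence1": ["Γνῶθι σεαυτόν."],
    "sentence2": ["Det regnade hela dagen."],    # "It rained all day."
    "label": [0],                                # 1 = related pair, 0 = unrelated
})

args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",  # assumption
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=2e-6,
    warmup_ratio=0.05,
    bf16=True,
    # Alternate batches between the two datasets so neither dominates.
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset={"triplets": triplets, "pairs": pairs},
    loss={
        "triplets": losses.MultipleNegativesRankingLoss(model),
        "pairs": losses.OnlineContrastiveLoss(model, margin=0.3),
    },
)
trainer.train()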

Evaluation

Held-out set: 269 unique source anchors (sentences) from Ericu950/classical-swedish-citations, split off before mining (no leakage). The document pool contains:

  • 269 gold target Swedish sentences
  • 17,371 real production false-positive Swedish sentences (pairs labeled as negatives in the same dataset, drawn from its training side)
  • ~5,000 random Swedish sentences sampled from a 4M-sentence corpus

Total document pool: ~22,500 docs per query. The model must surface the gold target above all distractors.
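
A sketch of this protocol using the standard sentence-transformers retrieval evaluator; the IDs and documents below are placeholders, not the actual held-out pool:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")

# One query per source anchor; exactly one gold Swedish sentence per query.
queries = {"q0": "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."}
corpus = {
    "gold0": "Veten därför, att de som äro av tron, de äro Abrahams barn.",
    "neg0": "Han gick genom rummet och stannade vid fönstret.",
    # ... plus the ~22,500 false-positive and random distractors
}
relevant_docs = {"q0": {"gold0"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    accuracy_at_k=[1, 5, 10, 25],
    name="heldout",
)
print(evaluator(model))  # reports nDCG@10 and Accuracy@k, as in the table below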

Metric        v2 base   v3 (this model)   Δ
nDCG@10       0.879     0.881             +0.002
Accuracy@1    78.1%     78.4%             +0.4%
Accuracy@5    93.7%     93.7%
Accuracy@10   95.5%     95.5%
Accuracy@25   100.0%    100.0%

The improvement is modest. The v2 base is already strong for sentence-level retrieval: 100% recall by rank 25 means the gold target is always found, and the fine-tune mainly sharpens the top of the ranking. The window-level companion model has more room to grow and shows a larger improvement.

Related artifacts

  • Ericu950/intertext-classical-swedish-window: the passage-level (5-sentence window) companion model
  • Ericu950/classical-swedish-citations: training and evaluation data
  • Ericu950/classical-swedish-synthetic-parallel: synthetic parallel data used for the intermediate trilingual fine-tune
  • BAAI/bge-m3: base model

Citation

@misc{intertext_classical_swedish_sentence_2026,
  author       = {Cullhed, Eric},
  title        = {intertext-classical-swedish-sentence: a sentence-level bi-encoder for cross-lingual classical citation detection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-sentence}},
}