---
license: mit
language:
- grc
- la
- sv
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- bge-m3
- cross-lingual
- classical-philology
- intertextuality
- citation-detection
base_model: BAAI/bge-m3
datasets:
- Ericu950/classical-swedish-citations
- Ericu950/classical-swedish-synthetic-parallel
---

# intertext-classical-swedish-sentence

A cross-lingual bi-encoder for finding classical Greek and Latin **sentences** cited, translated, paraphrased, or alluded to in Swedish prose. Embeds short text (single sentences or short clauses) into a shared 1024-dim space where translations and citations cluster together across languages.

For passage-level matching (5-sentence windows), see the companion model **[Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)**.

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")
model.max_seq_length = 192

src = "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn.",  # Galatians 3:7
    "Han gick genom rummet och stannade vid fönstret.",
    "Det har vi väntat på i fyra hundra år.",
]

# With normalize_embeddings=True the dot product equals cosine similarity.
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T  # similarity of the source to each candidate
for c, s in zip(candidates, scores):
    print(f"{s:+.3f} {c}")
```

## Training data

Training data comes from [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations) (sentences config).

## Training procedure

- **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), via an intermediate trilingual fine-tune on synthetic (Greek/Latin)–English–Swedish parallel pairs from [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel).
- **Losses:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) on the triplets + [OnlineContrastiveLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#onlinecontrastiveloss) (margin 0.3) on the contrastive pairs.
- **Multi-dataset sampling:** round-robin so the smaller triplet dataset isn't drowned out by the contrastive pairs.
- **Optimizer:** AdamW, learning rate 2e-6, warmup ratio 0.05.
- **Schedule:** 1 epoch (the model converged early and the best checkpoint by held-out nDCG@10 was selected).
- **Batch size:** 32 per GPU, bf16 mixed precision, 4× A100 80GB.
- **max_seq_length:** 192.

## Evaluation

Held-out set: 269 unique source anchors (sentences) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage).
The document pool contains:

- 269 gold target Swedish sentences
- 17,371 real production false-positive Swedish sentences (labeled negative pairs from the same dataset, training-side)
- ~5,000 random Swedish sentences sampled from a 4M-sentence corpus

Total document pool: ~22,500 docs per query. The model must surface the gold target above all distractors.

| Metric | v2 base | v3 (this model) | Δ |
|---|---|---|---|
| nDCG@10 | 0.879 | **0.881** | +0.002 |
| Accuracy@1 | 78.1% | 78.4% | +0.4 pp |
| Accuracy@5 | 93.7% | 93.7% | 0.0 pp |
| Accuracy@10 | 95.5% | 95.5% | 0.0 pp |
| Accuracy@25 | 100.0% | 100.0% | 0.0 pp |

The improvement is modest. The v2 base is already strong for sentence-level retrieval: 100% accuracy at rank 25 means the gold target is always found within the top 25, so the fine-tune only sharpens the top of the ranking. The window-level companion model has more room to grow and shows a larger improvement.

## Related artifacts

- **Window-level model:** [Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)
- **Labeled citation data:** [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)

## Citation

```bibtex
@misc{intertext_classical_swedish_sentence_2026,
  author       = {Cullhed, Eric},
  title        = {intertext-classical-swedish-sentence: a sentence-level bi-encoder for cross-lingual classical citation detection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-sentence}},
}
```
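
## Appendix: computing the retrieval metrics (sketch)

The Evaluation section reports nDCG@10 and Accuracy@k for a setup with exactly one gold target per query. The evaluation script itself is not part of this card, so the following is only a minimal sketch of how such numbers could be computed from normalized embeddings. The function names (`gold_ranks`, `accuracy_at_k`, `ndcg_at_k`) and the toy vectors are hypothetical, chosen for illustration. With a single relevant document per query, the ideal DCG is 1, so nDCG@k reduces to `1 / log2(1 + rank)` when the gold target is ranked within the top k and 0 otherwise.

```python
import numpy as np


def gold_ranks(query_embs, doc_embs, gold_idx):
    """Return the 1-based rank of each query's gold document (dot-product scoring)."""
    scores = query_embs @ doc_embs.T  # shape: (n_queries, n_docs)
    gold_scores = scores[np.arange(len(gold_idx)), gold_idx]
    # Rank = 1 + number of documents that score strictly higher than the gold.
    return 1 + (scores > gold_scores[:, None]).sum(axis=1)


def accuracy_at_k(ranks, k):
    """Fraction of queries whose gold document is ranked within the top k."""
    return float(np.mean(ranks <= k))


def ndcg_at_k(ranks, k):
    """nDCG@k for one relevant document per query (ideal DCG is 1)."""
    gains = np.where(ranks <= k, 1.0 / np.log2(1.0 + ranks), 0.0)
    return float(gains.mean())


# Toy data (hypothetical): 2 queries, 4 documents, 2-dim vectors.
query_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_embs = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 1.0], [0.5, 0.5]])
gold_idx = np.array([0, 1])  # gold document index for each query

ranks = gold_ranks(query_embs, doc_embs, gold_idx)
acc1 = accuracy_at_k(ranks, 1)
ndcg10 = ndcg_at_k(ranks, 10)
print(ranks, acc1, ndcg10)
```

In a real run, `query_embs` and `doc_embs` would come from `model.encode(..., normalize_embeddings=True)` as in the Quick start, and `gold_idx` would point at each anchor's gold Swedish sentence within the document pool. Ties are resolved in favor of the gold document here, which is a simplifying assumption.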