---
license: mit
language:
- grc
- la
- sv
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- bge-m3
- cross-lingual
- classical-philology
- intertextuality
- citation-detection
base_model: BAAI/bge-m3
datasets:
- Ericu950/classical-swedish-citations
- Ericu950/classical-swedish-synthetic-parallel
---

# intertext-classical-swedish-sentence

A cross-lingual bi-encoder for finding classical Greek and Latin **sentences** that are cited, translated, paraphrased, or alluded to in Swedish prose. It embeds short text (single sentences or short clauses) into a shared 1024-dimensional space where translations and citations cluster together across languages.

For passage-level matching (5-sentence windows), see the companion model **[Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)**.

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")
model.max_seq_length = 192  # same limit as used in training

# Greek source sentence to look for in Swedish text
src = "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn.",  # Galatians 3:7
    "Han gick genom rummet och stannade vid fönstret.",
    "Det har vi väntat på i fyra hundra år.",
]

# With normalized embeddings, the dot product equals cosine similarity.
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T
for c, s in zip(candidates, scores):
    print(f"{s:+.3f} {c}")
```

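
The Quick start scores candidates with a plain dot product, which works as cosine similarity only because the embeddings are L2-normalized. The sketch below illustrates that step with toy 3-dimensional vectors standing in for the model's 1024-dimensional output, and adds a cutoff for keeping only likely matches. The helper names and the `threshold` value are hypothetical; this card does not recommend a specific cutoff.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_candidates(src_vec, cand_vecs, threshold):
    """Return (score, index) pairs with score >= threshold, best first."""
    src_unit = l2_normalize(src_vec)
    scored = [(dot(src_unit, l2_normalize(c)), i) for i, c in enumerate(cand_vecs)]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Toy vectors only; real embeddings come from model.encode(...).
src_vec = [3.0, 4.0, 0.0]
cand_vecs = [[6.0, 8.0, 0.0], [0.0, 0.0, 5.0], [4.0, 3.0, 0.0]]

# 0.5 is an arbitrary placeholder, not a tuned value.
hits = rank_candidates(src_vec, cand_vecs, threshold=0.5)
for score, idx in hits:
    print(f"{score:+.3f} candidate {idx}")
```

In practice a cutoff would need to be tuned on labeled pairs from the target corpus.
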
## Training data

Training data comes from the `sentences` config of [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations), which supplies the triplets and the contrastive pairs referred to under Training procedure.

## Training procedure

- **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), via an intermediate trilingual fine-tune on synthetic (Greek/Latin)–English–Swedish parallel pairs from [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel).
- **Losses:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) on the triplets + [OnlineContrastiveLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#onlinecontrastiveloss) (margin 0.3) on the contrastive pairs.
- **Multi-dataset sampling:** round-robin so the smaller triplet dataset isn't drowned out by the contrastive pairs.
- **Optimizer:** AdamW, learning rate 2e-6, warmup ratio 0.05.
- **Schedule:** 1 epoch (the model converged early and the best checkpoint by held-out nDCG@10 was selected).
- **Batch size:** 32 per GPU, bf16 mixed precision, 4× A100 80GB.
- **max_seq_length:** 192.

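
To make the round-robin point concrete, here is a minimal sketch of the general idea in plain Python: batches are taken from each dataset in turn, and sampling stops once the smallest dataset runs out. It is an illustration only, not the sampler used in training, and the batch counts (2 triplet batches, 6 contrastive-pair batches) are made up.

```python
def round_robin(batches_by_dataset):
    """Alternate batches across datasets; stop once any dataset is exhausted."""
    iterators = {name: iter(batches) for name, batches in batches_by_dataset.items()}
    schedule = []
    while True:
        for name, it in iterators.items():
            batch = next(it, None)
            if batch is None:
                return schedule
            schedule.append((name, batch))


def share(schedule, name):
    """Fraction of training steps drawn from the named dataset."""
    return sum(1 for n, _ in schedule if n == name) / len(schedule)


# Hypothetical batch counts: the triplet set is the smaller one.
batches = {
    "triplets": ["t0", "t1"],
    "pairs": ["p0", "p1", "p2", "p3", "p4", "p5"],
}

rr_schedule = round_robin(batches)
# Baseline for comparison: simply run through every batch of every dataset.
all_batches = [(name, b) for name, bs in batches.items() for b in bs]

print("round-robin triplet share:", share(rr_schedule, "triplets"))
print("use-everything triplet share:", share(all_batches, "triplets"))
```

With these toy numbers the triplets account for half of the round-robin steps but only a quarter of the steps when every batch is used, at the cost of leaving some contrastive pairs unseen in the epoch.
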
## Evaluation

Held-out set: 269 unique source anchors (sentences) from `Ericu950/classical-swedish-citations`, split off before mining to avoid leakage. The document pool contains:

- 269 gold target Swedish sentences
- 17,371 real production false-positive Swedish sentences (labeled negative pairs from the same dataset, training-side)
- ~5,000 random Swedish sentences sampled from a 4M-sentence corpus

Total document pool: ~22,500 docs per query. The model must surface the gold target above all distractors.

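
For readers unfamiliar with the metrics reported below, the following sketch shows how Accuracy@k and nDCG@10 reduce to simple functions of the gold target's rank. It assumes exactly one gold document per query (269 anchors, 269 gold targets) and binary relevance. The function names and the example ranks are hypothetical; this is not the evaluation script behind the reported numbers.

```python
import math


def rank_of_gold(scores, gold_index):
    """1-based rank of the gold document; ties are resolved in its favour."""
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)


def accuracy_at_k(gold_ranks, k):
    """Share of queries whose gold document is ranked within the top k."""
    return sum(1 for r in gold_ranks if r <= k) / len(gold_ranks)


def ndcg_at_k(gold_ranks, k=10):
    """nDCG@k with one relevant document per query.

    The ideal DCG is 1 (gold at rank 1), so the per-query score is
    1 / log2(rank + 1) if the gold is within the top k, else 0.
    """
    gains = [1.0 / math.log2(r + 1) if r <= k else 0.0 for r in gold_ranks]
    return sum(gains) / len(gains)


# Made-up similarity scores for one query; the gold document is at index 2.
example_rank = rank_of_gold([0.41, 0.87, 0.79, 0.12], gold_index=2)

# Made-up gold ranks for four queries.
gold_ranks = [1, 1, 2, 12]
print(example_rank, accuracy_at_k(gold_ranks, 1), ndcg_at_k(gold_ranks, 10))
```
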

| Metric | v2 base | v3 (this model) | Δ |
|---|---|---|---|
| nDCG@10 | 0.879 | **0.881** | +0.002 |
| Accuracy@1 | 78.1% | 78.4% | +0.4 pp |
| Accuracy@5 | 93.7% | 93.7% | 0.0 pp |
| Accuracy@10 | 95.5% | 95.5% | 0.0 pp |
| Accuracy@25 | 100.0% | 100.0% | 0.0 pp |


The improvement is modest. The v2 base is already strong for sentence-level retrieval: 100% recall by rank 25 means the gold target is always found, so the fine-tune only sharpens the top of the ranking (Accuracy@5 and Accuracy@10 are unchanged). The window-level companion model has more room to grow and shows a larger improvement.

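
To put the size of the gain in perspective: with 269 queries, a single query is worth about 0.37 percentage points of accuracy. The sketch below back-computes hit counts that reproduce the rounded Accuracy@1 figures. The counts 210 and 211 are an inference from the rounded percentages, not numbers taken from the evaluation logs.

```python
N_QUERIES = 269  # size of the held-out set


def pct(hits, total=N_QUERIES):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * hits / total, 1)


one_query_pp = 100 / N_QUERIES  # percentage points contributed by one query

# Assumed hit counts, chosen to match the rounded percentages in the table.
v2_hits, v3_hits = 210, 211
delta_pp = round(100 * (v3_hits - v2_hits) / N_QUERIES, 1)

print(f"one query = {one_query_pp:.2f} pp")
print(f"v2: {pct(v2_hits)}%  v3: {pct(v3_hits)}%  delta: +{delta_pp} pp")
```

Under that assumption, the Accuracy@1 difference corresponds to one additional query whose gold target is ranked first.
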
## Related artifacts

- **Window-level model:** [Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)
- **Labeled citation data:** [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)

## Citation

```bibtex
@misc{intertext_classical_swedish_sentence_2026,
  author = {Cullhed, Eric},
  title = {intertext-classical-swedish-sentence: a sentence-level bi-encoder for cross-lingual classical citation detection},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-sentence}},
}
```