---
license: mit
language:
- grc
- la
- sv
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- bge-m3
- cross-lingual
- classical-philology
- intertextuality
- citation-detection
base_model: BAAI/bge-m3
datasets:
- Ericu950/classical-swedish-citations
- Ericu950/classical-swedish-synthetic-parallel
---
# intertext-classical-swedish-sentence
A cross-lingual bi-encoder for finding classical Greek and Latin **sentences** cited, translated, paraphrased, or alluded to in Swedish prose. It embeds short text (single sentences or short clauses) into a shared 1024-dimensional space where translations and citations cluster together across languages.
For passage-level matching (5-sentence windows), see the companion model
**[Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)**.
## Quick start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")
model.max_seq_length = 192  # matches the max_seq_length used in training

src = "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn.",  # Galatians 3:7
    "Han gick genom rummet och stannade vid fönstret.",
    "Det har vi väntat på i fyra hundra år.",
]

# Embeddings are L2-normalized, so the dot product is the cosine similarity.
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

# Print one score per candidate; the translation should score highest.
for c, s in zip(candidates, scores):
    print(f"{s:+.3f} {c}")
```
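
The quick start scores candidates with a plain dot product because `normalize_embeddings=True` returns unit-length vectors, for which the dot product and cosine similarity coincide. The following is a minimal, dependency-free sketch of that equivalence. The 3-dimensional vectors are made-up stand-ins for real 1024-dimensional embeddings, and the helper names (`l2_normalize`, `dot`, `cosine`) are illustrative, not part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, as normalize_embeddings=True does per row."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed values, for illustration only).
query = [1.0, 2.0, 2.0]
docs = [
    [2.0, 4.0, 4.0],   # same direction as the query
    [2.0, -1.0, 0.0],  # orthogonal to the query
    [0.0, 3.0, 4.0],   # partially aligned
]

q = l2_normalize(query)
scores = [dot(q, l2_normalize(d)) for d in docs]

# Rank document indices from most to least similar.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `scores` matches `cosine(query, d)` for each document, and `ranked` puts the parallel vector first. Normalizing once at encoding time lets a large document pool be scored with a single matrix product.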
## Training data
Training data comes from the `sentences` config of
[Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations).
## Training procedure
- **Base model:** [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), via an intermediate trilingual fine-tune on synthetic (Greek/Latin)–English–Swedish parallel pairs from [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel).
- **Losses:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) on the triplets + [OnlineContrastiveLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#onlinecontrastiveloss) (margin 0.3) on the contrastive pairs.
- **Multi-dataset sampling:** round-robin so the smaller triplet dataset isn't drowned out by the contrastive pairs.
- **Optimizer:** AdamW, learning rate 2e-6, warmup ratio 0.05.
- **Schedule:** 1 epoch (the model converged early and the best checkpoint by held-out nDCG@10 was selected).
- **Batch size:** 32 per GPU, bf16 mixed precision, 4× A100 80GB.
- **max_seq_length:** 192.
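
The round-robin sampling listed above can be illustrated with a small, dependency-free sketch. It is not the training script, which this card does not show. The dataset names and batch contents are placeholders, and the stopping rule (end the epoch as soon as one dataset runs out) is an assumption modeled on the round-robin option of `MultiDatasetBatchSamplers` in sentence-transformers.

```python
from collections import Counter


def round_robin_batches(datasets):
    """Yield (name, batch) pairs, visiting each dataset in turn.

    Stops as soon as any dataset is exhausted, so every dataset
    contributes the same number of batches per epoch.
    """
    iterators = {name: iter(batches) for name, batches in datasets.items()}
    while True:
        for name, it in iterators.items():
            batch = next(it, None)
            if batch is None:
                return
            yield name, batch


# Placeholder batches: a small triplet dataset and a larger pair dataset.
datasets = {
    "triplets": [["t0"], ["t1"], ["t2"]],
    "pairs": [[f"p{i}"] for i in range(10)],
}

schedule = [name for name, _ in round_robin_batches(datasets)]
counts = Counter(schedule)

# Share of triplet batches under round-robin vs. size-proportional sampling.
round_robin_share = counts["triplets"] / len(schedule)
proportional_share = len(datasets["triplets"]) / sum(len(b) for b in datasets.values())
```

In this toy setup the triplet dataset supplies half of all batches under round-robin, against 3 of 13 under size-proportional sampling. The trade-off is that part of the larger dataset goes unused in each epoch.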
## Evaluation
Held-out set: 269 unique source anchors (sentences) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage). The document pool contains:
- 269 gold target Swedish sentences
- 17,371 real production false-positive Swedish sentences (labeled negative pairs from the same dataset, training-side)
- ~5,000 random Swedish sentences sampled from a 4M-sentence corpus

Total document pool: ~22,500 docs per query. The model must surface the gold target above all distractors.

| Metric | v2 base | v3 (this model) | Δ |
|---|---|---|---|
| nDCG@10 | 0.879 | **0.881** | +0.002 |
| Accuracy@1 | 78.1% | 78.4% | +0.4 pp |
| Accuracy@5 | 93.7% | 93.7% | — |
| Accuracy@10 | 95.5% | 95.5% | — |
| Accuracy@25 | 100.0% | 100.0% | — |
The improvement is modest. The v2 base is already strong for sentence-level retrieval: Accuracy@25 is 100%, so the gold target always appears in the top 25, and the fine-tune only sharpens the top of the ranking. The window-level companion model has more room to grow and shows a larger improvement.
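
For readers who want to reproduce metrics of this kind, here is a minimal sketch of Accuracy@k and nDCG@k. It assumes exactly one gold document per query, as the 269 anchors and 269 gold targets suggest. The evaluation code behind the table is not shown in this card, so the function names and the pessimistic tie-breaking are illustrative assumptions. With a single relevant document the ideal DCG is 1, so nDCG@k reduces to `1 / log2(rank + 1)` when the gold rank is within k, and 0 otherwise.

```python
import math


def rank_of_gold(scores, gold_index):
    """1-based rank of the gold document; ties count against the gold."""
    gold = scores[gold_index]
    return 1 + sum(
        1 for i, s in enumerate(scores) if i != gold_index and s >= gold
    )


def accuracy_at_k(ranks, k):
    """Fraction of queries whose gold document is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def ndcg_at_k(ranks, k):
    """Mean nDCG@k for queries with a single relevant document each."""
    gains = [1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks]
    return sum(gains) / len(ranks)


# Made-up gold ranks for five queries (not results from this model).
ranks = [1, 1, 2, 4, 12]

acc_at_1 = accuracy_at_k(ranks, 1)    # 0.4
acc_at_5 = accuracy_at_k(ranks, 5)    # 0.8
ndcg_at_10 = ndcg_at_k(ranks, 10)     # about 0.612
```

With 269 queries, a single query moving into the top position shifts Accuracy@1 by 1/269, or about 0.37 percentage points, which gives a sense of scale for small deltas on this held-out set.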
## Related artifacts
- **Window-level model:** [Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)
- **Labeled citation data:** [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)
## Citation
```bibtex
@misc{intertext_classical_swedish_sentence_2026,
author = {Cullhed, Eric},
title = {intertext-classical-swedish-sentence: a sentence-level bi-encoder for cross-lingual classical citation detection},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-sentence}},
}
```