	---
	license: mit
	language:
	- grc
	- la
	- sv
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- bge-m3
	- cross-lingual
	- classical-philology
	- intertextuality
	- citation-detection
	base_model: BAAI/bge-m3
	datasets:
	- Ericu950/classical-swedish-citations
	- Ericu950/classical-swedish-synthetic-parallel
	---

	# intertext-classical-swedish-sentence

	A cross-lingual bi-encoder for finding classical Greek and Latin sentences that are cited, translated, paraphrased, or alluded to in Swedish prose. It embeds short text (single sentences or short clauses) into a shared 1024-dimensional space where translations and citations cluster together across languages.

	For passage-level matching (5-sentence windows), see the companion model
	[Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window).

	## Quick start

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("Ericu950/intertext-classical-swedish-sentence")
	model.max_seq_length = 192

	src = "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ."
	candidates = [
	    "Veten därför, att de som äro av tron, de äro Abrahams barn.",  # Galatians 3:7
	    "Han gick genom rummet och stannade vid fönstret.",
	    "Det har vi väntat på i fyra hundra år.",
	]

	embs = model.encode([src] + candidates, normalize_embeddings=True)
	scores = embs[0] @ embs[1:].T
	for c, s in zip(candidates, scores):
	    print(f"{s:+.3f} {c}")
	```

	## Training data

	Training data comes from the `sentences` config of
	[Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations).

	## Training procedure

	- Base model: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), via an intermediate trilingual fine-tune on synthetic (Greek/Latin)–English–Swedish parallel pairs from [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel).
	- Losses: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) on the triplets + [OnlineContrastiveLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#onlinecontrastiveloss) (margin 0.3) on the contrastive pairs.
	- Multi-dataset sampling: round-robin so the smaller triplet dataset isn't drowned out by the contrastive pairs.
	- Optimizer: AdamW, learning rate 2e-6, warmup ratio 0.05.
	- Schedule: 1 epoch (the model converged early and the best checkpoint by held-out nDCG@10 was selected).
	- Batch size: 32 per GPU, bf16 mixed precision, 4× A100 80GB.
	- max_seq_length: 192.
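
The round-robin sampling mentioned in the list above can be illustrated with a small, library-free sketch. It assumes the common reading of round-robin (take one batch from each dataset in turn and stop once any dataset is exhausted); the card does not state which implementation was used. The dataset names, sizes, and batch size below are hypothetical.

```python
def batches(examples, batch_size):
    """Yield consecutive full batches, dropping a ragged final batch."""
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[start:start + batch_size]


def round_robin(named_datasets, batch_size):
    """Alternate one batch per dataset; stop when any dataset runs out."""
    iterators = {
        name: batches(data, batch_size) for name, data in named_datasets.items()
    }
    while True:
        for name, iterator in iterators.items():
            batch = next(iterator, None)
            if batch is None:
                return  # one dataset is exhausted, so the epoch ends
            yield name, batch


# Hypothetical sizes: a small triplet set and a much larger pair set.
datasets = {
    "triplets": [f"t{i}" for i in range(8)],
    "pairs": [f"p{i}" for i in range(40)],
}
schedule = [name for name, _ in round_robin(datasets, batch_size=4)]
print(schedule)
```

Under this scheme each loss sees the same number of batches, whereas sampling in proportion to dataset size would give the larger pair dataset most of the updates.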

	## Evaluation

	Held-out set: 269 unique source anchors (sentences) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage). The document pool contains:

	- 269 gold target Swedish sentences
	- 17,371 real production false-positive Swedish sentences (labeled negative pairs from the same dataset, training-side)
	- ~5,000 random Swedish sentences sampled from a 4M-sentence corpus

	Total document pool: ~22,500 docs per query. The model must surface the gold target above all distractors.

	| Metric | v2 base | v3 (this model) | Δ |
	|---|---|---|---|
	| nDCG@10 | 0.879 | 0.881 | +0.002 |
	| Accuracy@1 | 78.1% | 78.4% | +0.4% |
	| Accuracy@5 | 93.7% | 93.7% | 0.0% |
	| Accuracy@10 | 95.5% | 95.5% | 0.0% |
	| Accuracy@25 | 100.0% | 100.0% | 0.0% |

	The improvement is modest. The v2 base is already strong for sentence-level retrieval: 100% Accuracy@25 means the gold target is always found within the top 25, so the fine-tune only sharpens the top of the ranking. The window-level companion model has more room to grow and shows a larger improvement.
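
For readers who want to reproduce the metrics in the table, here is a minimal sketch of how Accuracy@k and nDCG@10 can be computed from the rank of the gold target. It assumes each query has exactly one gold document, which the matching counts of 269 anchors and 269 gold targets suggest but the card does not state outright. The function names and example ranks are hypothetical and not taken from the evaluation code.

```python
import math


def accuracy_at_k(gold_ranks, k):
    """Fraction of queries whose gold document is ranked within the top k."""
    return sum(rank <= k for rank in gold_ranks) / len(gold_ranks)


def ndcg_at_k(gold_ranks, k=10):
    """Mean nDCG@k when every query has exactly one relevant document.

    With a single gold document the ideal DCG is 1, so a query scores
    1 / log2(rank + 1) if the gold is in the top k and 0 otherwise.
    """
    gains = [
        1.0 / math.log2(rank + 1) if rank <= k else 0.0 for rank in gold_ranks
    ]
    return sum(gains) / len(gains)


# Hypothetical 1-based ranks of the gold sentence for four queries.
gold_ranks = [1, 1, 3, 12]
print(accuracy_at_k(gold_ranks, 1))
print(accuracy_at_k(gold_ranks, 10))
print(ndcg_at_k(gold_ranks, 10))
```

With a single gold target per query, Accuracy@k coincides with recall at rank k, which is why 100% Accuracy@25 means no gold target ever falls below rank 25.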

	## Related artifacts

	- Window-level model: [Ericu950/intertext-classical-swedish-window](https://huggingface.co/Ericu950/intertext-classical-swedish-window)
	- Labeled citation data: [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)

	## Citation

	```bibtex
	@misc{intertext_classical_swedish_sentence_2026,
	  author = {Cullhed, Eric},
	  title = {intertext-classical-swedish-sentence: a sentence-level bi-encoder for cross-lingual classical citation detection},
	  year = {2026},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-sentence}},
	}
	```