	---
	license: mit
	language:
	- grc
	- la
	- sv
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- bge-m3
	- cross-lingual
	- classical-philology
	- intertextuality
	- citation-detection
	base_model: BAAI/bge-m3
	datasets:
	- Ericu950/classical-swedish-citations
	- Ericu950/classical-swedish-synthetic-parallel
	---

	# intertext-classical-swedish-window

	A cross-lingual bi-encoder for finding classical Greek and Latin citations in Swedish prose, operating on 5-sentence windows rather than single sentences. The wider context lets the model match citations that are paraphrased, expanded, or spread across several Swedish sentences — cases where surface form barely overlaps but meaning does.

	For sentence-level matching, see the companion model
	[Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence).

	## Quick start

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("Ericu950/intertext-classical-swedish-window")
	model.max_seq_length = 320

	# Greek source window (Galatians 3:7-8).
	src = (
	    "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ. "
	    "Προϊδοῦσα δὲ ἡ γραφὴ ὅτι ἐκ πίστεως δικαιοῖ τὰ ἔθνη ὁ θεός, "
	    "προευηγγελίσατο τῷ Ἀβραὰμ ὅτι ἐνευλογηθήσονται ἐν σοὶ πάντα τὰ ἔθνη."
	)
	# Swedish candidate windows: a matching translation and an unrelated passage.
	candidates = [
	    "Veten därför, att de som äro av tron, de äro Abrahams barn. "
	    "Och eftersom Skriften förutsåg att Gud genom tron rättfärdigar hedningarna, "
	    "förkunnade hon i förväg för Abraham detta glada budskap...",

	    "Han gick genom rummet och stannade vid fönstret. "
	    "Han såg ut över taken och funderade på vad som hade hänt. "
	    "Klockan på torget slog tre. Han vände sig om...",
	]

	# Embeddings are L2-normalized, so the dot product is the cosine similarity.
	embs = model.encode([src] + candidates, normalize_embeddings=True)
	scores = embs[0] @ embs[1:].T
	for c, s in zip(candidates, scores):
	    print(f"{s:+.3f} {c[:80]}...")
	```

	## Intended use

	The model is the passage-level retrieval head of a pipeline for discovering classical citations in Swedish literary corpora. Typical use:

	1. Encode classical (Greek/Latin) source windows and Swedish corpus windows with this model.
	2. Run dense retrieval (cosine) to surface candidate citation pairs.
	3. Rerank with a cross-encoder and apply additional features (rarity, sentence-level agreement, contextual support).
	4. Filter survivors with an LLM judge.
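
Steps 1 and 2 above can be sketched as a brute-force cosine search. This is a minimal illustration and not the pipeline's actual retrieval code: the function name `top_k_candidates` is hypothetical, and it assumes both embedding matrices were produced with `model.encode(..., normalize_embeddings=True)` as in the quick start.

```python
import numpy as np


def top_k_candidates(src_embs: np.ndarray, doc_embs: np.ndarray, k: int = 10):
    """For each source window, return (doc_index, cosine) pairs for the k best docs.

    Assumes every row of both matrices is L2-normalized.
    """
    # With unit-length rows, the dot product equals cosine similarity.
    scores = src_embs @ doc_embs.T
    k = min(k, doc_embs.shape[0])
    # argsort is ascending; reverse each row and keep the first k columns.
    order = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return [
        [(int(j), float(scores[i, j])) for j in order[i]]
        for i in range(scores.shape[0])
    ]


# Toy stand-ins for real embeddings (hypothetical values, already unit length).
toy_src = np.array([[1.0, 0.0]])
toy_docs = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])
for doc_index, cosine in top_k_candidates(toy_src, toy_docs, k=2)[0]:
    print(doc_index, round(cosine, 3))
```

A full score matrix is workable for pools of tens of thousands of windows; for a corpus of millions, an approximate nearest-neighbor index would be the more usual choice.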

	The model also functions as a general Greek/Latin/Swedish passage encoder, but it's specifically optimized for citation detection at window granularity (~5 sentences).

	## Training data

	Training data comes from
	[Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations) (windows config). A "window" in this dataset is a 5-sentence chunk centered on a target sentence; for sentences near a work's boundary, the window is truncated accordingly.
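
The window definition above can be illustrated with a short sketch. The helper `build_windows` is hypothetical, and joining sentences with a single space is an assumption; the dataset's own construction may differ in such details.

```python
from typing import List


def build_windows(sentences: List[str], size: int = 5) -> List[str]:
    """Return one window per sentence, centered on it and clipped at the edges."""
    half = size // 2
    windows = []
    for i in range(len(sentences)):
        # Clip the span to the bounds of the work instead of padding it.
        start = max(0, i - half)
        end = min(len(sentences), i + half + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


# In a six-sentence work, the first window holds three sentences and the third holds five.
work = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
for window in build_windows(work):
    print(window)
```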


	## Evaluation

	Held-out set: 367 unique source anchors (windows) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage). The document pool contains:

	- 367 gold target Swedish windows
	- ~39,500 real production false-positive Swedish windows (labeled negative pairs from the same dataset, training-side)
	- ~5,000 random Swedish windows sampled from a 4M-window corpus

	Total document pool: ~45,000 docs per query.

	| Metric | v2 base | v3 (this model) | Δ |
	|---|---|---|---|
	| nDCG@10 | 0.839 | 0.853 | +0.014 |
	| Accuracy@1 | 63.2% | 65.4% | +2.2 pp |
	| Accuracy@5 | 99.7% | 100.0% | +0.3 pp |
	| Accuracy@10 | 99.7% | 100.0% | +0.3 pp |
	| Accuracy@25 | 100.0% | 100.0% | 0 |

	Window retrieval is intrinsically harder than sentence retrieval: longer text means more surface overlap with distractors. The fine-tune improves the top of the ranking, the only place with room left. Gold is now always found within the top 5, the top-1 hit rate rises by 2.2 percentage points (63.2% to 65.4%), and nDCG@10 rises from 0.839 to 0.853.
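
For readers who want to reproduce metrics of this kind, the sketch below computes Accuracy@k and nDCG@k from the 1-based rank of the gold window for each query. It assumes exactly one gold target per query, in which case the ideal DCG is 1; the function names are illustrative, and the evaluation code actually used for the table above is not shown in this card.

```python
import math
from typing import Sequence


def accuracy_at_k(gold_ranks: Sequence[int], k: int) -> float:
    """Share of queries whose gold document is ranked at position <= k (1-based)."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)


def ndcg_at_k(gold_ranks: Sequence[int], k: int) -> float:
    """nDCG@k assuming a single relevant document per query (ideal DCG is 1)."""
    gains = [1.0 / math.log2(r + 1) if r <= k else 0.0 for r in gold_ranks]
    return sum(gains) / len(gains)


# Hypothetical ranks for three queries: gold at rank 1, rank 2, and rank 11.
ranks = [1, 2, 11]
print(accuracy_at_k(ranks, 1), accuracy_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```

With 367 queries, one query entering or leaving the top k moves Accuracy@k by roughly 0.27 percentage points, which is why 99.7% corresponds to a single miss.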

	## Limitations

	- Domain: trained primarily on biblical, philosophical, and literary citations. Performance on other domains is unknown.
	- Granularity: optimized for 5-sentence windows. For tight single-line citations, the sentence-level companion model may be sharper.
	- Edge windows: sentences near a work's start or end have shorter windows (1–4 sentences). The model sees these but performance on them may differ from full 5-sentence windows.
	- Language coverage: Greek, Latin, and Swedish only. The base BGE-M3 is multilingual, but this fine-tune may have shifted geometry away from other languages.
	- Citations vs. translations: the model conflates citation, translation, and close paraphrase. It cannot distinguish between "this passage is quoting Plato" and "this passage independently translates Plato."
	- Sequence length: `max_seq_length` is 320 tokens. Very long Swedish sentences (or windows packed with long compounds) may be truncated.
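
To gauge how often the sequence-length limit applies to a given corpus, one option is to count tokens per window before encoding. The sketch below is an illustration only: `find_truncated` is a hypothetical helper, and the whitespace counter stands in for the model's real tokenizer so that the example runs without downloading anything.

```python
from typing import Callable, List, Sequence


def find_truncated(
    windows: Sequence[str],
    count_tokens: Callable[[str], int],
    max_seq_length: int = 320,
) -> List[int]:
    """Return indices of windows whose token count exceeds max_seq_length."""
    return [i for i, w in enumerate(windows) if count_tokens(w) > max_seq_length]


# Stand-in counter (assumption): whitespace tokens undercount subword tokens.
# With the model loaded, a closer count would be something like:
#   count_tokens = lambda text: len(model.tokenizer(text)["input_ids"])
whitespace_count = lambda text: len(text.split())

sample = ["Kort fönster.", "ord " * 400]
print(find_truncated(sample, whitespace_count))
```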

	## Related artifacts

	- Sentence-level model: [Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)
	- Labeled citation data: [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)
	- Synthetic parallel data: [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel)
	- Source corpus: [Ericu950/classical-swedish-corpus](https://huggingface.co/datasets/Ericu950/classical-swedish-corpus)

	## Citation

	```bibtex
	@misc{intertext_classical_swedish_window_2026,
	  author = {Cullhed, Eric},
	  title = {intertext-classical-swedish-window: a window-level bi-encoder for cross-lingual classical citation detection},
	  year = {2026},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-window}},
	}
	```