intertext-classical-swedish-window

A cross-lingual bi-encoder for finding classical Greek and Latin citations in Swedish prose, operating on 5-sentence windows rather than single sentences. The wider context lets the model match citations that are paraphrased, expanded, or spread across several Swedish sentences — cases where surface form barely overlaps but meaning does.

For sentence-level matching, see the companion model Ericu950/intertext-classical-swedish-sentence.

Quick start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-window")
model.max_seq_length = 320  # the model's maximum; longer windows are truncated

src = (
    "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ. "
    "Προϊδοῦσα δὲ ἡ γραφὴ ὅτι ἐκ πίστεως δικαιοῖ τὰ ἔθνη ὁ θεός, "
    "προευηγγελίσατο τῷ Ἀβραὰμ ὅτι ἐνευλογηθήσονται ἐν σοὶ πάντα τὰ ἔθνη."
)
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn. "
    "Och eftersom Skriften förutsåg att Gud genom tron rättfärdigar hedningarna, "
    "förkunnade hon i förväg för Abraham detta glada budskap...",

    "Han gick genom rummet och stannade vid fönstret. "
    "Han såg ut över taken och funderade på vad som hade hänt. "
    "Klockan på torget slog tre. Han vände sig om...",
]

# Embeddings are L2-normalized, so the dot product equals cosine similarity
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T
for c, s in zip(candidates, scores):
    print(f"{s:+.3f}  {c[:80]}...")

Intended use

The model is the passage-level retrieval head of a pipeline for discovering classical citations in Swedish literary corpora. Typical use:

  1. Encode classical (Greek/Latin) source windows and Swedish corpus windows with this model.
  2. Run dense retrieval (cosine) to surface candidate citation pairs.
  3. Rerank with a cross-encoder and apply additional features (rarity, sentence-level agreement, contextual support).
  4. Filter survivors with an LLM judge.
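Step 2 of the pipeline above can be sketched with plain NumPy. This is an illustrative implementation, not the project's actual retrieval code; at corpus scale an ANN index (e.g. FAISS) would replace the brute-force matrix product.

```python
import numpy as np

def topk_candidates(query_embs: np.ndarray, doc_embs: np.ndarray, k: int = 25):
    """Return top-k document indices and cosine scores for each query.

    Assumes both matrices are already L2-normalized (as produced by
    model.encode(..., normalize_embeddings=True)), so the dot product
    equals cosine similarity.
    """
    scores = query_embs @ doc_embs.T                 # (n_queries, n_docs)
    order = np.argsort(-scores, axis=1)[:, :k]       # best-first indices
    top_scores = np.take_along_axis(scores, order, axis=1)
    return order, top_scores
```

The surviving (query, doc) pairs then move on to the cross-encoder reranker in step 3.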

The model also functions as a general Greek/Latin/Swedish passage encoder, but it's specifically optimized for citation detection at window granularity (~5 sentences).

Training data

Training data comes from Ericu950/classical-swedish-citations (windows config). A "window" in this dataset is a 5-sentence chunk centered on a target sentence; for sentences near a work's boundary, the window is truncated accordingly.
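The windowing scheme described above can be sketched as follows. This is a minimal reconstruction from the description, not the dataset's actual preprocessing code:

```python
def make_window(sentences: list[str], center: int, size: int = 5) -> list[str]:
    """Build a `size`-sentence window centered on index `center`.

    For size=5 we take up to two sentences on each side; near the start
    or end of a work the window truncates to 1-4 sentences, matching the
    edge-window behavior noted in the Limitations section.
    """
    half = size // 2
    lo = max(0, center - half)
    hi = min(len(sentences), center + half + 1)
    return sentences[lo:hi]
```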

Evaluation

Held-out set: 367 unique source anchors (windows) from Ericu950/classical-swedish-citations, split off before mining (no leakage). The document pool contains:

  • 367 gold target Swedish windows
  • ~39,500 real production false-positive Swedish windows (labeled negative pairs from the same dataset, training-side)
  • ~5,000 random Swedish windows sampled from a 4M-window corpus

Total document pool: ~45,000 docs per query.
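With exactly one gold window per query, the metrics in the table below have simple closed forms; a sketch (assuming 1-based gold ranks, not the project's actual evaluation script):

```python
import math

def accuracy_at_k(gold_ranks: list[int], k: int) -> float:
    """Fraction of queries whose gold document appears within the top k."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)

def ndcg_at_10(gold_ranks: list[int]) -> float:
    """With a single relevant doc per query, nDCG@10 reduces to
    1/log2(rank + 1) for ranks within the cutoff, else 0."""
    gains = [1.0 / math.log2(r + 1) if r <= 10 else 0.0 for r in gold_ranks]
    return sum(gains) / len(gains)
```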

Metric        v2 base   v3 (this model)   Δ
nDCG@10        0.839        0.853         +0.014
Accuracy@1     63.2%        65.4%         +2.2 pp
Accuracy@5     99.7%       100.0%         +0.3 pp
Accuracy@10    99.7%       100.0%         +0.3 pp
Accuracy@25   100.0%       100.0%          0.0 pp

Window retrieval is intrinsically harder than sentence retrieval — longer text means more surface overlap with distractors. The fine-tune produces a meaningful improvement at the top of the ranking (the only place there's room): gold is now always found by rank 10, and the top-1 hit rate improves by 2.2 absolute percentage points.

Limitations

  • Domain: trained primarily on biblical, philosophical, and literary citations. Performance on other domains is unknown.
  • Granularity: optimized for 5-sentence windows. For tight single-line citations, the sentence-level companion model may be sharper.
  • Edge windows: sentences near a work's start or end have shorter windows (1–4 sentences). The model sees these but performance on them may differ from full 5-sentence windows.
  • Language coverage: Greek, Latin, and Swedish only. The base BGE-M3 is multilingual, but this fine-tune may have shifted geometry away from other languages.
  • Citations vs. translations: the model conflates citation, translation, and close paraphrase. It cannot distinguish between "this passage is quoting Plato" and "this passage independently translates Plato."
  • Sequence length: max_seq_length is 320 tokens. Very long Swedish sentences (or windows packed with long compounds) may be truncated.
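To flag windows at risk of truncation before encoding, a rough pre-check can help. This is a heuristic proxy only: BGE-M3 uses a subword tokenizer, so the authoritative count comes from the model's own tokenizer, and the characters-per-token ratio below is an assumed rough average for Swedish prose, not a measured constant.

```python
def likely_truncated(window: str, max_tokens: int = 320,
                     chars_per_token: float = 3.5) -> bool:
    """Heuristic: flag windows whose estimated token count exceeds the
    model's 320-token limit. Estimate = character count / assumed
    average characters per subword token."""
    return len(window) / chars_per_token > max_tokens
```

For exact counts, tokenize with the model's tokenizer and compare the length against `model.max_seq_length`.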

Related artifacts

  • Sentence-level companion model: Ericu950/intertext-classical-swedish-sentence
  • Training/evaluation dataset: Ericu950/classical-swedish-citations (windows config)

Citation

@misc{intertext_classical_swedish_window_2026,
  author       = {Cullhed, Eric},
  title        = {intertext-classical-swedish-window: a window-level bi-encoder for cross-lingual classical citation detection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-window}},
}
Model details

Base model: BAAI/bge-m3, fine-tuned (0.6B params, F32 safetensors).