---
license: mit
language:
- grc
- la
- sv
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- bge-m3
- cross-lingual
- classical-philology
- intertextuality
- citation-detection
base_model: BAAI/bge-m3
datasets:
- Ericu950/classical-swedish-citations
- Ericu950/classical-swedish-synthetic-parallel
---

# intertext-classical-swedish-window

A cross-lingual bi-encoder for finding classical Greek and Latin citations in Swedish prose, operating on **5-sentence windows** rather than single sentences. The wider context lets the model match citations that are paraphrased, expanded, or spread across several Swedish sentences: cases where the surface forms barely overlap but the meaning does.

For sentence-level matching, see the companion model **[Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)**.

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-window")
model.max_seq_length = 320

# Greek source window.
src = (
    "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ. "
    "Προϊδοῦσα δὲ ἡ γραφὴ ὅτι ἐκ πίστεως δικαιοῖ τὰ ἔθνη ὁ θεός, "
    "προευηγγελίσατο τῷ Ἀβραὰμ ὅτι ἐνευλογηθήσονται ἐν σοὶ πάντα τὰ ἔθνη."
)

# Swedish candidate windows: a matching passage, then an unrelated one.
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn. "
    "Och eftersom Skriften förutsåg att Gud genom tron rättfärdigar hedningarna, "
    "förkunnade hon i förväg för Abraham detta glada budskap...",
    "Han gick genom rummet och stannade vid fönstret. "
    "Han såg ut över taken och funderade på vad som hade hänt. "
    "Klockan på torget slog tre. Han vände sig om...",
]

# With normalized embeddings, the dot product equals cosine similarity.
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

for c, s in zip(candidates, scores):
    print(f"{s:+.3f} {c[:80]}...")
```

## Intended use

The model is the passage-level retrieval head of a pipeline for discovering classical citations in Swedish literary corpora. Typical use:

1. Encode classical (Greek/Latin) source windows and Swedish corpus windows with this model.
2. Run dense retrieval (cosine similarity) to surface candidate citation pairs.
3. Rerank with a cross-encoder and apply additional features (rarity, sentence-level agreement, contextual support).
4. Filter survivors with an LLM judge.

The model also functions as a general Greek/Latin/Swedish passage encoder, but it is specifically optimized for citation detection at window granularity (~5 sentences).
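
Steps 1 and 2 of the list above can be sketched as follows. This is a minimal illustration, not the pipeline's actual retrieval code: `top_k_candidates` is a hypothetical helper, a brute-force NumPy matrix product stands in for whatever index the production pipeline uses, and the toy 2-dimensional vectors stand in for real embeddings produced with `model.encode(..., normalize_embeddings=True)`.

```python
import numpy as np


def top_k_candidates(source_embs, corpus_embs, k=10):
    """Return (indices, scores) of the k best corpus windows per source window.

    Assumes both matrices are L2-normalized, so the dot product is the cosine.
    """
    sims = source_embs @ corpus_embs.T          # shape: (n_sources, n_corpus)
    k = min(k, corpus_embs.shape[0])
    order = np.argsort(-sims, axis=1)[:, :k]    # best-first column indices
    return order, np.take_along_axis(sims, order, axis=1)


# Toy stand-ins for normalized embeddings (hypothetical values).
source_embs = np.array([[1.0, 0.0]])
corpus_embs = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])

indices, scores = top_k_candidates(source_embs, corpus_embs, k=2)
print(indices.tolist())  # [[1, 2]]
```

For a corpus of millions of windows, an approximate nearest-neighbor index would likely replace the brute-force product. The surviving candidates then go on to reranking (step 3).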

## Training data

Training data comes from [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations) (`windows` config). A "window" in this dataset is a 5-sentence chunk centered on a target sentence; for sentences near a work's boundary, the window is truncated accordingly.
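
The window definition above can be illustrated with a short sketch. It assumes the work is already split into sentences and that a window is formed by joining sentences with single spaces. The dataset's actual sentence splitter and joining convention are not described here, and `sentence_windows` is a hypothetical helper.

```python
def sentence_windows(sentences, size=5):
    """Build one window per target sentence, centered on that sentence.

    Windows near the start or end of the work are truncated, not padded.
    """
    half = size // 2
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - half)
        end = min(len(sentences), i + half + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


work = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
for window in sentence_windows(work):
    print(window)
```

In this sketch the first and last targets get 3-sentence windows, their neighbors get 4, and interior targets get the full 5.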

## Evaluation

Held-out set: 367 unique source anchors (windows) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage). The document pool contains:

- 367 gold target Swedish windows
- ~39,500 real production false-positive Swedish windows (labeled negative pairs from the same dataset, training-side)
- ~5,000 random Swedish windows sampled from a 4M-window corpus

Total document pool: ~45,000 docs per query.

| Metric | v2 base | v3 (this model) | Δ |
|---|---|---|---|
| nDCG@10 | 0.839 | **0.853** | +0.014 |
| Accuracy@1 | 63.2% | **65.4%** | +2.2 pp |
| Accuracy@5 | 99.7% | 100.0% | +0.3 pp |
| Accuracy@10 | 99.7% | 100.0% | +0.3 pp |
| Accuracy@25 | 100.0% | 100.0% | 0.0 pp |

Window retrieval is intrinsically harder than sentence retrieval: longer text means more surface overlap with distractors. The fine-tune improves the top of the ranking, the only place with room left: gold is now always found within the top 5 (Accuracy@5 = 100.0%), and the top-1 hit rate rises by 2.2 percentage points.
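
For readers who want to reproduce metrics of this kind, here is a minimal sketch of how Accuracy@k and nDCG@10 can be computed from the rank of the gold window. It is not the evaluation script behind the table. It assumes binary relevance and exactly one gold window per query (suggested by the 367 anchors and 367 gold windows, but still an assumption). The ranks in the example are invented.

```python
import math


def accuracy_at_k(gold_ranks, k):
    """Fraction of queries whose gold window is ranked at position k or better."""
    return sum(rank <= k for rank in gold_ranks) / len(gold_ranks)


def ndcg_at_k(gold_ranks, k=10):
    """Mean nDCG@k when each query has exactly one relevant document.

    With a single gold document the ideal DCG is 1, so the per-query score is
    1 / log2(rank + 1) if the gold is within the top k, and 0 otherwise.
    """
    gains = [1.0 / math.log2(rank + 1) if rank <= k else 0.0 for rank in gold_ranks]
    return sum(gains) / len(gains)


# Hypothetical 1-based gold ranks for four queries (not the model's results).
gold_ranks = [1, 1, 2, 12]
print(accuracy_at_k(gold_ranks, 1))         # 0.5
print(accuracy_at_k(gold_ranks, 10))        # 0.75
print(round(ndcg_at_k(gold_ranks, 10), 3))  # 0.658
```

Under these assumptions nDCG@10 rewards moving a gold window from rank 2 or 3 to rank 1. That is why it can still improve when Accuracy@5 is already near its ceiling.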

## Limitations

- **Domain:** trained primarily on biblical, philosophical, and literary citations. Performance on other domains is unknown.
- **Granularity:** optimized for 5-sentence windows. For tight single-line citations, the sentence-level companion model may be sharper.
- **Edge windows:** sentences near a work's start or end have shorter windows (1–4 sentences). The model sees these but performance on them may differ from full 5-sentence windows.
- **Language coverage:** Greek, Latin, and Swedish only. The base BGE-M3 is multilingual, but this fine-tune may have shifted geometry away from other languages.
- **Citations vs. translations:** the model conflates citation, translation, and close paraphrase. It cannot distinguish between "this passage is quoting Plato" and "this passage independently translates Plato."
- **Sequence length:** `max_seq_length` is 320 tokens. Very long Swedish sentences (or windows packed with long compounds) may be truncated.
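
To gauge how often the sequence-length limit applies to a given corpus, one option is to count tokens per window before encoding. The sketch below is an illustration under assumptions: `find_truncated` is a hypothetical helper, and the whitespace counter is only a stand-in so the example runs without downloading the model. Subword counts from the real tokenizer will be higher.

```python
def find_truncated(windows, count_tokens, max_seq_length=320):
    """Return (index, token_count) for every window longer than max_seq_length."""
    flagged = []
    for i, text in enumerate(windows):
        n_tokens = count_tokens(text)
        if n_tokens > max_seq_length:
            flagged.append((i, n_tokens))
    return flagged


# Stand-in tokenizer for illustration only: counts whitespace-separated words.
# A real check would count subword tokens from the model's own tokenizer.
def whitespace_count(text):
    return len(text.split())


windows = ["kort fönster", "ord " * 400]
print(find_truncated(windows, whitespace_count))  # [(1, 400)]
```

With the loaded model, `count_tokens` could be something like `lambda t: len(model.tokenizer(t)["input_ids"])`, assuming `model.tokenizer` exposes the underlying Hugging Face tokenizer as it does for standard sentence-transformers models.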

## Related artifacts

- **Sentence-level model:** [Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)
- **Labeled citation data:** [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)
- **Synthetic parallel data:** [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel)
- **Source corpus:** [Ericu950/classical-swedish-corpus](https://huggingface.co/datasets/Ericu950/classical-swedish-corpus)

## Citation

```bibtex
@misc{intertext_classical_swedish_window_2026,
  author = {Cullhed, Eric},
  title = {intertext-classical-swedish-window: a window-level bi-encoder for cross-lingual classical citation detection},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-window}},
}
```