---
license: mit
language:
- grc
- la
- sv
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- bge-m3
- cross-lingual
- classical-philology
- intertextuality
- citation-detection
base_model: BAAI/bge-m3
datasets:
- Ericu950/classical-swedish-citations
- Ericu950/classical-swedish-synthetic-parallel
---
# intertext-classical-swedish-window
A cross-lingual bi-encoder for finding classical Greek and Latin citations in Swedish prose, operating on **5-sentence windows** rather than single sentences. The wider context lets the model match citations that are paraphrased, expanded, or spread across several Swedish sentences, where surface forms barely overlap but the meaning does.
For sentence-level matching, see the companion model
**[Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)**.
## Quick start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-window")
model.max_seq_length = 320

# Greek source window
src = (
    "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ. "
    "Προϊδοῦσα δὲ ἡ γραφὴ ὅτι ἐκ πίστεως δικαιοῖ τὰ ἔθνη ὁ θεός, "
    "προευηγγελίσατο τῷ Ἀβραὰμ ὅτι ἐνευλογηθήσονται ἐν σοὶ πάντα τὰ ἔθνη."
)

# Swedish candidate windows: a matching passage and an unrelated one
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn. "
    "Och eftersom Skriften förutsåg att Gud genom tron rättfärdigar hedningarna, "
    "förkunnade hon i förväg för Abraham detta glada budskap...",
    "Han gick genom rummet och stannade vid fönstret. "
    "Han såg ut över taken och funderade på vad som hade hänt. "
    "Klockan på torget slog tre. Han vände sig om...",
]

# Embeddings are L2-normalized, so the dot product is the cosine similarity
embs = model.encode([src] + candidates, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

for c, s in zip(candidates, scores):
    print(f"{s:+.3f} {c[:80]}...")
```
## Intended use
The model is the passage-level retrieval head of a pipeline for discovering classical citations in Swedish literary corpora. Typical use:
1. Encode classical (Greek/Latin) source windows and Swedish corpus windows with this model.
2. Run dense retrieval (cosine) to surface candidate citation pairs.
3. Rerank with a cross-encoder and apply additional features (rarity, sentence-level agreement, contextual support).
4. Filter survivors with an LLM judge.
The model also functions as a general Greek/Latin/Swedish passage encoder, but it's specifically optimized for citation detection at window granularity (~5 sentences).
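Steps 1 and 2 above can be sketched as a brute-force cosine search. This is a minimal illustration, not the pipeline's actual retrieval code: the function name `top_k_candidates` and the default `k` are hypothetical, and the embeddings are assumed to come from `model.encode(..., normalize_embeddings=True)` as in the quick start.

```python
import numpy as np


def top_k_candidates(src_embs, tgt_embs, k=10):
    """Return (indices, scores) of the k most similar targets per source.

    Assumes both inputs are L2-normalized, so the dot product equals
    the cosine similarity.
    """
    src = np.asarray(src_embs, dtype=np.float32)
    tgt = np.asarray(tgt_embs, dtype=np.float32)
    sims = src @ tgt.T                       # shape: (n_src, n_tgt)
    k = min(k, sims.shape[1])
    # Sort each row by descending similarity and keep the first k columns
    order = np.argsort(-sims, axis=1)[:, :k]
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores
```

Each row of the returned index array lists candidate Swedish windows for one classical source window, ready to hand to the reranking step. For a corpus of millions of windows, an approximate nearest-neighbor index would likely replace the full matrix product.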
## Training data
Training data comes from
[Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations) (windows config). A "window" in this dataset is a 5-sentence chunk centered on a target sentence; for sentences near a work's boundary, the window is truncated accordingly.
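To make the window definition concrete, here is a small sketch of how such windows could be built from one work's sentences. It is an assumption-laden illustration: the helper `sentence_windows` is hypothetical, sentence splitting is taken as already done, and joining sentences with a single space is a guess rather than the dataset's documented procedure.

```python
def sentence_windows(sentences, size=5):
    """Build one window per sentence, centered on that sentence.

    Windows near the start or end of the work are truncated, so they
    contain fewer than `size` sentences.
    """
    half = size // 2
    windows = []
    for i in range(len(sentences)):
        lo = max(0, i - half)                    # clip at the work's start
        hi = min(len(sentences), i + half + 1)   # clip at the work's end
        windows.append((i, " ".join(sentences[lo:hi])))
    return windows
```

With the default `size=5`, a sentence in the middle of a work gets two sentences of context on each side, while the first sentence of a work only gets the two that follow it.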
## Evaluation
Held-out set: 367 unique source anchors (windows) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage). The document pool contains:
- 367 gold target Swedish windows
- ~39,500 real production false-positive Swedish windows (labeled negative pairs from the same dataset, training-side)
- ~5,000 random Swedish windows sampled from a 4M-window corpus
Total document pool: ~45,000 docs per query.
| Metric | v2 base | v3 (this model) | Δ |
|---|---|---|---|
| nDCG@10 | 0.839 | **0.853** | +0.014 |
| Accuracy@1 | 63.2% | **65.4%** | +2.2 pp |
| Accuracy@5 | 99.7% | 100.0% | +0.3 pp |
| Accuracy@10 | 99.7% | 100.0% | +0.3 pp |
| Accuracy@25 | 100.0% | 100.0% | 0.0 pp |
Window retrieval is intrinsically harder than sentence retrieval: longer text means more surface overlap with distractors. The fine-tune improves the top of the ranking, the only place with room left: the gold window is now always found by rank 5 (Accuracy@5 rises from 99.7% to 100.0%), and the top-1 hit rate improves by 2.2 percentage points (63.2% to 65.4%).
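For readers who want to reproduce metrics of this kind, the sketch below computes Accuracy@k and nDCG@k from the 1-based rank of the gold window for each query. It assumes exactly one gold target per query (consistent with 367 anchors and 367 gold windows) and the standard binary-relevance nDCG definition; whether the reported numbers were computed exactly this way is not stated here, and the function names are illustrative.

```python
import math


def accuracy_at_k(gold_ranks, k):
    """Fraction of queries whose gold window is ranked at position <= k."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)


def ndcg_at_k(gold_ranks, k=10):
    """Mean nDCG@k with a single relevant document per query.

    With one gold document the ideal DCG is 1, so the per-query score
    is 1 / log2(rank + 1) if the gold is inside the top k, else 0.
    """
    gains = (1.0 / math.log2(r + 1) if r <= k else 0.0 for r in gold_ranks)
    return sum(gains) / len(gold_ranks)
```

Under this definition a query whose gold window sits at rank 2 contributes about 0.63 to nDCG@10, which is why nDCG can stay well below 1.0 even when Accuracy@5 is 100%.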
## Limitations
- **Domain:** trained primarily on biblical, philosophical, and literary citations. Performance on other domains is unknown.
- **Granularity:** optimized for 5-sentence windows. For tight single-line citations, the sentence-level companion model may be sharper.
- **Edge windows:** sentences near a work's start or end have shorter windows (1–4 sentences). The model sees these but performance on them may differ from full 5-sentence windows.
- **Language coverage:** Greek, Latin, and Swedish only. The base BGE-M3 is multilingual, but this fine-tune may have shifted geometry away from other languages.
- **Citations vs. translations:** the model conflates citation, translation, and close paraphrase. It cannot distinguish between "this passage is quoting Plato" and "this passage independently translates Plato."
- **Sequence length:** `max_seq_length` is 320 tokens. Very long Swedish sentences (or windows packed with long compounds) may be truncated.
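To gauge how often the sequence-length limit bites on a given corpus, one option is to flag windows whose token count exceeds it. The helper below is a hypothetical sketch: `over_length` and its `count_tokens` argument are illustrative names, and the token counter is passed in so the check does not depend on any particular tokenizer.

```python
from typing import Callable, Iterable, List


def over_length(
    windows: Iterable[str],
    count_tokens: Callable[[str], int],
    max_len: int = 320,
) -> List[int]:
    """Return the indices of windows whose token count exceeds max_len."""
    return [i for i, w in enumerate(windows) if count_tokens(w) > max_len]
```

With the model loaded as in the quick start, `count_tokens` could plausibly be `lambda t: len(model.tokenizer(t)["input_ids"])`; this assumes the tokenizer's count (including special tokens) is what the truncation is measured against.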
## Related artifacts
- **Sentence-level model:** [Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)
- **Labeled citation data:** [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)
- **Synthetic parallel data:** [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel)
- **Source corpus:** [Ericu950/classical-swedish-corpus](https://huggingface.co/datasets/Ericu950/classical-swedish-corpus)
## Citation
```bibtex
@misc{intertext_classical_swedish_window_2026,
author = {Cullhed, Eric},
title = {intertext-classical-swedish-window: a window-level bi-encoder for cross-lingual classical citation detection},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-window}},
}
```