---
license: mit
language:
- grc
- la
- sv
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- bge-m3
- cross-lingual
- classical-philology
- intertextuality
- citation-detection
base_model: BAAI/bge-m3
datasets:
- Ericu950/classical-swedish-citations
- Ericu950/classical-swedish-synthetic-parallel
---

# intertext-classical-swedish-window

A cross-lingual bi-encoder for finding classical Greek and Latin citations in Swedish prose, operating on **5-sentence windows** rather than single sentences. The wider context lets the model match citations that are paraphrased, expanded, or spread across several Swedish sentences — cases where surface form barely overlaps but meaning does.

For sentence-level matching, see the companion model
**[Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)**.

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Ericu950/intertext-classical-swedish-window")
model.max_seq_length = 320

src = (
    "Γινώσκετε ἄρα ὅτι οἱ ἐκ πίστεως, οὗτοι υἱοί εἰσιν Ἀβραάμ. "
    "Προϊδοῦσα δὲ ἡ γραφὴ ὅτι ἐκ πίστεως δικαιοῖ τὰ ἔθνη ὁ θεός, "
    "προευηγγελίσατο τῷ Ἀβραὰμ ὅτι ἐνευλογηθήσονται ἐν σοὶ πάντα τὰ ἔθνη."
)
candidates = [
    "Veten därför, att de som äro av tron, de äro Abrahams barn. "
    "Och eftersom Skriften förutsåg att Gud genom tron rättfärdigar hedningarna, "
    "förkunnade hon i förväg för Abraham detta glada budskap...",

    "Han gick genom rummet och stannade vid fönstret. "
    "Han såg ut över taken och funderade på vad som hade hänt. "
    "Klockan på torget slog tre. Han vände sig om...",
]

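# Normalized embeddings make the dot product equal to cosine similarity.
embs = model.encode([src] + candidates, normalize_embeddings=True)
# Compare the classical source window (row 0) with each Swedish candidate.
scores = embs[0] @ embs[1:].T
for c, s in zip(candidates, scores):
    print(f"{s:+.3f}  {c[:80]}...")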
```

## Intended use

The model is the passage-level retrieval stage of a pipeline for discovering classical citations in Swedish literary corpora. A typical workflow:

1. Encode classical (Greek/Latin) source windows and Swedish corpus windows with this model.
2. Run dense retrieval (cosine) to surface candidate citation pairs.
3. Rerank with a cross-encoder and apply additional features (rarity, sentence-level agreement, contextual support).
4. Filter survivors with an LLM judge.

The model also functions as a general Greek/Latin/Swedish passage encoder, but it's specifically optimized for citation detection at window granularity (~5 sentences).
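
Step 2 of the workflow above (dense cosine retrieval) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `top_k_cosine` is a hypothetical helper, and the random arrays stand in for embeddings that would in practice come from `model.encode(..., normalize_embeddings=True)` as in the Quick start.

```python
import numpy as np


def top_k_cosine(query_embs, doc_embs, k=10):
    """Return (indices, scores) of the k most similar docs for each query.

    Assumes both arrays are L2-normalized, so the dot product is the cosine.
    """
    sims = query_embs @ doc_embs.T                 # shape: (n_queries, n_docs)
    k = min(k, sims.shape[1])
    top = np.argsort(-sims, axis=1)[:, :k]         # best-first doc indices
    return top, np.take_along_axis(sims, top, axis=1)


# Toy stand-ins for real embeddings (assumption: 6 "Swedish" windows, dim 8).
rng = np.random.default_rng(0)
docs = rng.normal(size=(6, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Two "classical" queries that are near-copies of docs 2 and 4.
queries = docs[[2, 4]] + 0.01 * rng.normal(size=(2, 8))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

idx, scores = top_k_cosine(queries, docs, k=3)
for q, (row_idx, row_scores) in enumerate(zip(idx, scores)):
    print(q, row_idx.tolist(), np.round(row_scores, 3).tolist())
```

A brute-force matrix product like this is fine for small pools; for a corpus of millions of windows an approximate nearest-neighbor index would be the more usual choice.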

## Training data

Training data comes from
[Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations) (windows config). A "window" in this dataset is a 5-sentence chunk centered on a target sentence; for sentences near a work's boundary, the window is truncated accordingly.
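
The window definition above can be illustrated with a short sketch. The function name `build_windows`, the symmetric two-sentences-either-side centering, and joining sentences with a single space are assumptions made for illustration; the dataset's own preprocessing may differ in detail.

```python
def build_windows(sentences, size=5):
    """Return one window per sentence, centered on it.

    Windows near the start or end of a work are truncated rather than padded.
    """
    half = size // 2
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - half)                    # clip at the work's start
        end = min(len(sentences), i + half + 1)     # clip at the work's end
        windows.append(" ".join(sentences[start:end]))
    return windows


sents = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
windows = build_windows(sents)
for target, window in zip(sents, windows):
    print(f"{target} -> {window}")
```

In this toy work of six sentences, only the windows for `S3.` and `S4.` contain the full five sentences; the others are shortened by the boundary.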


## Evaluation

Held-out set: 367 unique source anchors (windows) from `Ericu950/classical-swedish-citations`, split off before mining (no leakage). The document pool contains:

- 367 gold target Swedish windows
- ~39,500 real production false-positive Swedish windows (labeled negative pairs from the same dataset, training-side)
- ~5,000 random Swedish windows sampled from a 4M-window corpus

Total document pool: ~45,000 docs per query.

| Metric | v2 base | v3 (this model) | Δ |
|---|---|---|---|
| nDCG@10 | 0.839 | **0.853** | +0.014 |
| Accuracy@1 | 63.2% | **65.4%** | +2.2 pp |
| Accuracy@5 | 99.7% | 100.0% | +0.3 pp |
| Accuracy@10 | 99.7% | 100.0% | +0.3 pp |
| Accuracy@25 | 100.0% | 100.0% | — |

Window retrieval is intrinsically harder than sentence retrieval, because longer text means more surface overlap with distractors. The fine-tune improves the top of the ranking, the only place where the base model left room: the gold window is now always found within the top 5 (Accuracy@5 = 100.0%), and the top-1 hit rate improves by 2.2 absolute percentage points.
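
For readers who want to reproduce metrics of this kind, here is a minimal sketch of Accuracy@k and nDCG@10. It assumes each query has exactly one gold window, so the ideal DCG is 1. The function names and the example ranks are illustrative, and the evaluation script behind the table may compute these differently.

```python
import math


def accuracy_at_k(gold_ranks, k):
    """Fraction of queries whose gold window appears at rank <= k (1-based)."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)


def ndcg_at_k(gold_ranks, k=10):
    """Mean nDCG@k with a single relevant document per query.

    With one gold item the ideal DCG is 1, so nDCG reduces to
    1 / log2(rank + 1) when the gold item is within the top k, else 0.
    """
    gains = [1.0 / math.log2(r + 1) if r <= k else 0.0 for r in gold_ranks]
    return sum(gains) / len(gains)


# Hypothetical 1-based ranks of the gold window for four queries.
ranks = [1, 1, 2, 4]
print(f"Accuracy@1: {accuracy_at_k(ranks, 1):.3f}")
print(f"Accuracy@5: {accuracy_at_k(ranks, 5):.3f}")
print(f"nDCG@10:    {ndcg_at_k(ranks, 10):.3f}")
```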

## Limitations

- **Domain:** trained primarily on biblical, philosophical, and literary citations. Performance on other domains is unknown.
- **Granularity:** optimized for 5-sentence windows. For tight single-line citations, the sentence-level companion model may be sharper.
- **Edge windows:** sentences near a work's start or end have shorter windows (1–4 sentences). The model sees these but performance on them may differ from full 5-sentence windows.
- **Language coverage:** Greek, Latin, and Swedish only. The base BGE-M3 is multilingual, but this fine-tune may have shifted geometry away from other languages.
- **Citations vs. translations:** the model conflates citation, translation, and close paraphrase. It cannot distinguish between "this passage is quoting Plato" and "this passage independently translates Plato."
- **Sequence length:** `max_seq_length` is 320 tokens. Very long Swedish sentences (or windows packed with long compounds) may be truncated.
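
To gauge how often the sequence-length limit applies to a given corpus, windows can be checked against the 320-token budget before encoding. The sketch below is a hypothetical helper, not part of the model's API: `flag_truncated` takes any token-counting callable, and a whitespace counter stands in for the real tokenizer so the example runs without downloading the model.

```python
def flag_truncated(windows, count_tokens, max_tokens=320):
    """Return indices of windows whose token count exceeds max_tokens."""
    return [i for i, w in enumerate(windows) if count_tokens(w) > max_tokens]


def whitespace_count(text):
    """Crude stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


# With the real model loaded as in the Quick start, a counter could instead be
# (assumption, untested here):
#   def count_tokens(text):
#       return len(model.tokenizer(text)["input_ids"])

toy = ["kort mening", " ".join(["ord"] * 400)]
print(flag_truncated(toy, whitespace_count))
```

Subword tokenizers usually produce more tokens than whitespace splitting, especially for long Swedish compounds, so the stand-in counter underestimates how many windows would be cut.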

## Related artifacts

- **Sentence-level model:** [Ericu950/intertext-classical-swedish-sentence](https://huggingface.co/Ericu950/intertext-classical-swedish-sentence)
- **Labeled citation data:** [Ericu950/classical-swedish-citations](https://huggingface.co/datasets/Ericu950/classical-swedish-citations)
- **Synthetic parallel data:** [Ericu950/classical-swedish-synthetic-parallel](https://huggingface.co/datasets/Ericu950/classical-swedish-synthetic-parallel)
- **Source corpus:** [Ericu950/classical-swedish-corpus](https://huggingface.co/datasets/Ericu950/classical-swedish-corpus)

## Citation

```bibtex
@misc{intertext_classical_swedish_window_2026,
  author       = {Cullhed, Eric},
  title        = {intertext-classical-swedish-window: a window-level bi-encoder for cross-lingual classical citation detection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ericu950/intertext-classical-swedish-window}},
}
```