cmboulanger committed
Commit 8a7ede1 Β· 1 Parent(s): 482c2e9

Add detailed explanation
.gitignore CHANGED
@@ -8,4 +8,5 @@ wheels/

# Virtual environments
.venv
- .env*
+ .env*
+ .DS_Store
README.md CHANGED
@@ -12,6 +12,73 @@ Works with any inference endpoint through an injected `call_fn: (str) -> str`

---

+ ## Pipeline diagram
+
+ ```text
+ Input text (may contain XML markup)
+                β”‚
+                β–Ό
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ Strip existing XML tags            β”‚ pipeline.py
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+                β–Ό (optional)
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ GLiNER pre-detection               β”‚ detection/
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+                β–Ό
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ Chunk text                         β”‚ chunking/
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+         ╔══════╧══════╗
+         β•‘  per chunk  β•‘
+         β•šβ•β•β•β•β•β•β•€β•β•β•β•β•β•β•
+                β”‚
+                β–Ό
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ Build LLM prompt                   β”‚ prompting/
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+                β–Ό
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ LLM inference                      β”‚ inference/
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+                β–Ό
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ Parse JSON response                β”‚ postprocessing/
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+       ╔════════╧═════════╗
+       β•‘ merge all chunks β•‘
+       β•šβ•β•β•β•β•β•β•β•β•€β•β•β•β•β•β•β•β•β•β•
+                β”‚
+                β–Ό
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+ β”‚ Resolve spans β†’ char offsets       β”‚ postprocessing/
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
+ β”‚ Validate against schema            β”‚ postprocessing/
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
+ β”‚ Inject XML tags                    β”‚ postprocessing/
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+                β”‚
+                β–Ό
+       Annotated XML output
+ ```
+
+ Detailed documentation for each stage:
+ [Data models](tei_annotator/models/README.md) Β·
+ [GLiNER detection](tei_annotator/detection/README.md) Β·
+ [Chunking](tei_annotator/chunking/README.md) Β·
+ [Prompt building](tei_annotator/prompting/README.md) Β·
+ [Inference configuration](tei_annotator/inference/README.md) Β·
+ [Post-processing](tei_annotator/postprocessing/README.md) Β·
+ [Evaluation](tei_annotator/evaluation/README.md)
+
+ ---
+
## Installation

Requires Python β‰₯ 3.12 and [uv](https://docs.astral.sh/uv/).
tei_annotator/chunking/README.md ADDED
@@ -0,0 +1,60 @@
# Chunking

Long texts exceed the context window (or the practical attention span) of most LLMs. The chunker splits the source text into overlapping windows so that each LLM call sees a manageable piece of text while entity boundaries between windows are never lost.

---

## API

```python
from tei_annotator.chunking.chunker import chunk_text, Chunk

chunks: list[Chunk] = chunk_text(text, chunk_size=1500, chunk_overlap=200)
```

Each `Chunk` is a dataclass:

```python
@dataclass
class Chunk:
    text: str          # content of this window
    start_offset: int  # character position of chunk[0] in the original text
```

`start_offset` is used by the resolver to convert chunk-local character positions back to positions in the full source text.

---

## Splitting algorithm

1. Walk the text left-to-right, accumulating characters.
2. When the accumulated length reaches `chunk_size`, scan backwards for the last whitespace character at or before that position and cut there. The boundary is always between words, never mid-token.
3. The **next chunk begins** `chunk_overlap` characters before the cut point, so consecutive chunks share a strip of text.
4. Repeat until all text is consumed.

The final chunk is whatever remains after the last cut, even if it is shorter than `chunk_size`.
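A rough sketch of this loop, returning plain `(text, start_offset)` tuples instead of `Chunk` objects (illustrative only; the real `chunker.py` is additionally tag-aware, as described under "XML safety"):

```python
def chunk_text_sketch(text: str, chunk_size: int = 1500, chunk_overlap: int = 200):
    """Greedy whitespace-aligned splitting into overlapping (text, start_offset) windows."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):                      # final chunk: whatever remains
            chunks.append((text[start:], start))
            break
        cut = end
        while cut > start and not text[cut - 1].isspace():
            cut -= 1                              # scan back to the last whitespace
        if cut == start:                          # no whitespace in window: hard cut
            cut = end
        chunks.append((text[start:cut], start))
        start = max(cut - chunk_overlap, start + 1)  # next window starts inside the overlap
    return chunks
```

The `start + 1` guard keeps the loop advancing even when `chunk_overlap` is as large as the window itself.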
---

## XML safety

If the source text contains existing XML markup, the splitter is tag-aware: it never places a cut inside an XML tag. Tags are treated as **zero-width** for length accounting β€” only visible text characters count toward `chunk_size`. This ensures that pre-existing markup is always preserved intact within whichever chunk it falls in.

---

## Why overlap matters

Without overlap, an entity that straddles the boundary between two chunks could be missed by both LLM calls β€” the first call sees the entity's opening but not its close, and vice versa. With `chunk_overlap` characters of shared context, the entity appears complete in at least one chunk with enough surrounding text for the model to recognise and annotate it.

Because the overlap causes the same entity to appear in two consecutive chunks, the pipeline deduplicates resolved spans after collecting results from all chunks: identical `(element, start, end)` triples are merged and only one instance is kept.
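That merge amounts to a first-wins pass over the collected spans; a minimal sketch with spans as plain dicts (the pipeline's own code may differ):

```python
def dedupe_spans(spans):
    """Keep the first span seen for each (element, start, end) triple."""
    seen = set()
    unique = []
    for span in spans:
        key = (span["element"], span["start"], span["end"])
        if key not in seen:
            seen.add(key)
            unique.append(span)
    return unique
```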
---

## Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `chunk_size` | `1500` | Larger values send more text per LLM call (cheaper) but risk exceeding context limits. Reduce for models with small windows. |
| `chunk_overlap` | `200` | Larger values make boundary entities safer but increase the number of duplicate detections that must be deduplicated. |

A `chunk_size` of 1500 characters corresponds to roughly 375–500 tokens for Latin-script text, leaving ample room for the prompt preamble and JSON response within a 4k-token limit.
tei_annotator/detection/README.md ADDED
@@ -0,0 +1,67 @@
# GLiNER pre-detection (optional)

[GLiNER](https://github.com/urchade/GLiNER) is a zero-shot named-entity recognition model that runs locally on CPU. The pre-detection pass uses GLiNER to quickly surface entity candidates before the (slower, more expensive) LLM annotation pass.

---

## Why a pre-detection pass?

Pre-detection serves two purposes:

1. **Improved recall** β€” GLiNER is fast and broad. It catches many candidate spans that the LLM can then verify, correct, and extend. This is especially useful in longer texts where lower-salience entities might otherwise be missed.

2. **Cost reduction** β€” By surfacing strong candidates in the prompt, the LLM can spend less effort on exhaustive detection and more on attribute filling and disambiguation.

The step is entirely optional: passing `gliner_model=None` to `annotate()` skips it and the LLM works from the raw text alone.

---

## How it works

`detect_spans(text, schema, model_name)` in `gliner_detector.py`:

1. **Label mapping** β€” Each `TEIElement.description` becomes a GLiNER label string. For example, `TEIElement(tag="persName", description="a person's name")` produces the label `"a person's name"`. GLiNER's zero-shot design means no fine-tuning is needed β€” labels are matched by semantic similarity at inference time.

2. **Inference** β€” The GLiNER model runs span prediction on the full source text using those labels. It returns a list of `(text, label, score)` triples.

3. **Context windowing** β€” For each detected span, a context window of Β±60 characters is extracted from the source text to populate `SpanDescriptor.context`. This context string is what the resolver later uses to locate the span in the source text when converting it to a character offset.

4. **Output** β€” Each detection becomes a `SpanDescriptor(element=<tag>, text=<matched text>, context=<window>, score=<confidence>)`.
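Step 3 amounts to slicing a fixed window around the detected offsets; a minimal sketch (function name and clamping behaviour assumed, not taken from `gliner_detector.py`):

```python
def extract_context(text: str, start: int, end: int, window: int = 60) -> str:
    """Return the span text plus up to `window` characters on each side,
    clamped to the bounds of the source text."""
    return text[max(0, start - window):min(len(text), end + window)]
```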
---

## Candidates in the LLM prompt

The `SpanDescriptor` list is passed to the prompt builder, which renders the candidates as a JSON block inside the LLM prompt. The LLM is instructed to treat them as suggestions, not ground truth: it may confirm, correct, split, merge, or discard them based on its own reading of the text.

---

## Installation

The GLiNER integration is an optional dependency group:

```bash
uv sync --extra gliner
```

This installs `gliner`, PyTorch, and Hugging Face Transformers. Without this extra, importing `detect_spans` raises `ImportError` with a clear message. The `annotate()` function imports GLiNER lazily, so the base package always loads without it.

---

## Choosing a model

Any GLiNER-compatible Hugging Face model can be used. Recommended starting points:

| Model | Download size | Notes |
|-------|--------------|-------|
| `numind/NuNER_Zero` | ~200 MB | Good general-purpose zero-shot NER; default in the README examples |
| `urchade/gliner_medium-v2.1` | ~400 MB | Broader entity type coverage |
| `urchade/gliner_large-v2.1` | ~800 MB | Higher accuracy on ambiguous entities |

Models are downloaded automatically from Hugging Face Hub on first use and cached in the default HF cache directory (`~/.cache/huggingface/`).

---

## Performance characteristics

GLiNER inference is CPU-based and typically completes in milliseconds to low seconds per text chunk, depending on text length and hardware. It is orders of magnitude faster than a remote LLM call. For batch evaluation workloads the GLiNER pass adds negligible overhead.
tei_annotator/evaluation/README.md ADDED
@@ -0,0 +1,143 @@
# Evaluation

The `evaluation` module measures how accurately the annotator reproduces a hand-annotated (gold-standard) TEI XML file. It computes **precision**, **recall**, and **F1 score** per element type and in aggregate.

---

## Evaluation flow

For each gold element (e.g. a `<biblStruct>` record in the gold file):

```
Gold XML element
        β”‚
        β–Ό
1. extract_spans()   β†’ gold EvaluationSpans + plain text   (extractor.py)
        β”‚
        β–Ό
2. annotate()        β†’ predicted XML string                (pipeline.py)
        β”‚
        β–Ό
3. extract_spans()   β†’ predicted EvaluationSpans           (extractor.py)
        β”‚
        β–Ό
4. compute_metrics() β†’ EvaluationResult                    (metrics.py)
```

Because the annotator receives *exactly the same plain text* that the gold spans are anchored to, character offsets align without any additional normalisation.

---

## Span extraction (`extractor.py`)

### `EvaluationSpan`

```python
@dataclass
class EvaluationSpan:
    tag: str    # element name, e.g. "author"
    start: int  # start offset in the element's plain text (inclusive)
    end: int    # end offset (exclusive)
    text: str   # raw text content
```

`normalized_text` is a computed property that collapses runs of internal whitespace to a single space and strips leading/trailing whitespace. It is used by the `TEXT` match mode to compare spans independent of formatting differences.

### `extract_spans(element)`

Walks the lxml element tree depth-first. At each descendant element it records the cumulative plain-text length seen so far as `start`, then adds the element's own text content to get `end`. Tail text (text following a closing tag but before the next sibling's opening tag) is accumulated but not itself attributed to a span.

Returns `(plain_text: str, spans: list[EvaluationSpan])` β€” the plain text is what `annotate()` will receive; the spans are what it should ideally produce.
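The offset bookkeeping can be illustrated with the standard library's `ElementTree` in place of lxml (a simplified sketch that returns bare `(tag, start, end)` tuples rather than `EvaluationSpan` objects):

```python
import xml.etree.ElementTree as ET

def extract_spans_sketch(root):
    """Return (plain_text, spans); spans cover each descendant's text content."""
    parts, spans = [], []
    pos = 0

    def visit(el, record):
        nonlocal pos
        start = pos
        if el.text:
            parts.append(el.text)
            pos += len(el.text)
        for child in el:
            visit(child, True)
            if child.tail:              # tail text belongs to the parent's content
                parts.append(child.tail)
                pos += len(child.tail)
        if record:                      # the root itself is not recorded as a span
            spans.append((el.tag, start, pos))

    visit(root, False)
    return "".join(parts), spans
```

Running it on `<bibl><author>Kant</author>, <title>Critique</title>, 1781</bibl>` yields the plain text `"Kant, Critique, 1781"` with spans for `author` and `title`.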
---

## Match modes (`metrics.py`)

Three strategies determine when a predicted span "matches" a gold span:

| Mode | Match condition |
|------|-----------------|
| `EXACT` | Same element tag **and** identical `(start, end)` offsets |
| `TEXT` *(default)* | Same element tag **and** `normalized_text` is equal |
| `OVERLAP` | Same element tag **and** intersection-over-union of offset ranges β‰₯ threshold (default 0.5) |

`TEXT` mode is the most useful in practice: it is invariant to small character-position differences (which can arise from whitespace normalisation or fuzzy span matching) while still requiring the annotator to have found the right text.

`EXACT` is strictest and useful for regression testing. `OVERLAP` is the most lenient and is appropriate when partial matches should count as correct.

### Greedy matching algorithm

`match_spans(gold, predicted, mode)`:

1. Enumerate all `(gold_span, predicted_span)` candidate pairs that share the same element tag.
2. Score each pair:
   - `EXACT` / `TEXT` β†’ 1.0 if matching, 0.0 otherwise.
   - `OVERLAP` β†’ IoU value in [0, 1].
3. Sort by score descending, then greedily assign: take the highest-scoring unmatched pair, mark both spans as matched.
4. Return `SpanMatch` objects for matched pairs, plus lists of unmatched gold and unmatched predicted spans.
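A compact sketch of that greedy assignment, with spans reduced to `(tag, start, end)` tuples and the match mode supplied as a scoring callback (an illustration, not the actual `match_spans` code):

```python
def match_spans_sketch(gold, pred, score_fn):
    """Greedy one-to-one matching. score_fn(g, p) returns a score in [0, 1]."""
    pairs = [
        (score_fn(g, p), gi, pi)
        for gi, g in enumerate(gold)
        for pi, p in enumerate(pred)
        if g[0] == p[0]                      # element tags must agree
    ]
    pairs.sort(key=lambda t: t[0], reverse=True)
    used_g, used_p, matches = set(), set(), []
    for score, gi, pi in pairs:
        if score > 0 and gi not in used_g and pi not in used_p:
            used_g.add(gi)
            used_p.add(pi)
            matches.append((gi, pi, score))
    unmatched_gold = [g for i, g in enumerate(gold) if i not in used_g]
    unmatched_pred = [p for i, p in enumerate(pred) if i not in used_p]
    return matches, unmatched_gold, unmatched_pred
```

An `EXACT`-style scorer is simply `lambda g, p: 1.0 if g[1:] == p[1:] else 0.0`; an `OVERLAP` scorer would return the IoU of the two offset ranges instead.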
---

## Metrics (`metrics.py`)

### `ElementMetrics`

Per-element-type counts (`tp`, `fp`, `fn`) and derived rates:

- **Precision** = TP / (TP + FP)
- **Recall** = TP / (TP + FN)
- **F1** = 2 Β· P Β· R / (P + R)

### `EvaluationResult`

Aggregates `SpanMatch` objects and per-element `ElementMetrics`. Exposes two averaging strategies:

| Strategy | How computed | When to prefer |
|----------|-------------|----------------|
| **Micro** | Sum TP / FP / FN across all element types, then compute P / R / F1 | Imbalanced element distributions β€” frequent types dominate the score, which reflects real-world impact |
| **Macro** | Compute P / R / F1 per element type, then average | All element types weighted equally regardless of frequency β€” highlights performance on rare types |
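The difference between the two averages is easy to see with two element types of very different frequency (counts here are illustrative):

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from raw counts, guarding against zero division."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

counts = {"author": (24, 2, 1), "biblScope": (12, 4, 8)}  # (tp, fp, fn)

# Micro: pool the counts first, then compute rates once
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = prf(tp, fp, fn)

# Macro: compute rates per element type, then average them
per_el = [prf(*c) for c in counts.values()]
macro = tuple(sum(m[i] for m in per_el) / len(per_el) for i in range(3))
```

Because `author` is the more frequent (and better-handled) type, the micro precision exceeds the macro precision here.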
`result.report()` prints a formatted table with both averages and a per-element breakdown.

### `aggregate(results)`

Merges a list of `EvaluationResult` objects (one per document record) into a single corpus-level result by summing TP / FP / FN counts across all records before computing rates.

---

## High-level API (`evaluator.py`)

### `evaluate_bibl(element, schema, endpoint, match_mode)`

Evaluates annotation of a single lxml `_Element`. Handles:

- XML parsing errors in the annotated output (caught and counted as zero predictions).
- Literal `<` / `>` characters in the annotated text that are not valid XML tags: `_escape_nonschema_brackets()` escapes any angle bracket that is not part of a known schema element tag, so lxml can parse the result without error.

### `evaluate_file(gold_xml_path, schema, endpoint, match_mode, max_items)`

Evaluates an entire TEI XML file:

1. Parses the XML file with lxml.
2. Finds all first-level child elements of the root.
3. Calls `evaluate_bibl()` on each, up to `max_items`.
4. Returns `(list[EvaluationResult], EvaluationResult)` β€” individual results per record and the corpus-level aggregate.

---

## Example output

```
=== Evaluation Results ===
Micro  P=0.821  R=0.754  F1=0.786   (TP=83 FP=18 FN=27)
Macro  P=0.834  R=0.762  F1=0.791

Per-element breakdown:
  author      P=0.923  R=0.960  F1=0.941  (TP=24 FP=2 FN=1)
  biblScope   P=0.750  R=0.600  F1=0.667  (TP=12 FP=4 FN=8)
  date        P=0.867  R=0.867  F1=0.867  (TP=13 FP=2 FN=2)
  title       P=0.900  R=0.818  F1=0.857  (TP=18 FP=2 FN=4)
  ...
```

The `matched`, `unmatched_gold`, and `unmatched_pred` lists on each `EvaluationResult` are available for detailed error analysis beyond the summary table.
tei_annotator/inference/README.md ADDED
@@ -0,0 +1,143 @@
# Inference configuration

The annotator is endpoint-agnostic: it talks to any language model (or extraction model) through a single `call_fn: (str) -> str` callable. `EndpointConfig` wires together a capability declaration and that callable.

---

## `EndpointCapability`

```python
from tei_annotator import EndpointCapability
```

| Value | When to use |
|-------|-------------|
| `TEXT_GENERATION` | Standard chat/completion LLM. JSON is requested via the prompt. If the response cannot be parsed, the pipeline sends a self-correction follow-up and retries once. |
| `JSON_ENFORCED` | Constrained-decoding endpoint that guarantees syntactically valid JSON output (e.g. a vLLM server with `--guided-decoding-backend`). The correction retry is skipped because output is always parseable. |
| `EXTRACTION` | Native extraction model (GLiNER2 / NuExtract-style). The raw source text is passed directly; no Jinja2 prompt is built. Used internally when `gliner_model=` is set on `annotate()`; do not wrap these models in `EndpointConfig`. |

---

## `EndpointConfig`

```python
from tei_annotator import EndpointConfig, EndpointCapability

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=my_call_fn,
)
```

`call_fn` receives the complete prompt string and must return the model's raw response string. Any implementation is valid β€” an `openai.Client`, an `anthropic.Anthropic` client, a local `requests.post` to Ollama, or a function that reads from a file for testing.

---

## Examples

### Anthropic (Claude)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

def call_fn(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=call_fn,
)
```

### OpenAI

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

def call_fn(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=call_fn,
)
```

### Google Gemini

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def call_fn(prompt: str) -> str:
    return model.generate_content(prompt).text

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=call_fn,
)
```

### Ollama (local)

```python
import requests

def call_fn(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]

endpoint = EndpointConfig(
    capability=EndpointCapability.TEXT_GENERATION,
    call_fn=call_fn,
)
```

### vLLM with constrained JSON decoding

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def call_fn(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        extra_body={"guided_json": True},
    )
    return resp.choices[0].message.content

endpoint = EndpointConfig(
    capability=EndpointCapability.JSON_ENFORCED,  # skip correction retry
    call_fn=call_fn,
)
```

---

## How the capability affects pipeline behaviour

| Capability | Prompt template | Retry on parse failure |
|------------|-----------------|------------------------|
| `TEXT_GENERATION` | `text_gen.jinja2` (verbose, with instructions) | Yes β€” one self-correction attempt |
| `JSON_ENFORCED` | `json_enforced.jinja2` (compact) | No |
| `EXTRACTION` | None β€” raw text passed directly | N/A |
tei_annotator/models/README.md ADDED
@@ -0,0 +1,93 @@
# Data models

Two groups of data classes power the pipeline: **schema models** (what the annotator is allowed to produce) and **span models** (the annotations as they flow through each stage).

---

## Schema models (`schema.py`)

### `TEIAttribute`

Describes a single XML attribute that may appear on a `TEIElement`.

| Field | Type | Description |
|-------|------|-------------|
| `name` | `str` | Attribute name, e.g. `"ref"` |
| `description` | `str` | Human-readable explanation included in LLM prompts |
| `required` | `bool` | Whether the attribute must be present (informational only β€” not enforced by the pipeline) |
| `allowed_values` | `list[str] \| None` | If set, the validator rejects any value not in this list |

### `TEIElement`

Describes one XML element the annotator may produce.

| Field | Type | Description |
|-------|------|-------------|
| `tag` | `str` | XML tag name, e.g. `"persName"` |
| `description` | `str` | Human-readable explanation included in LLM prompts |
| `children` | `list[TEIElement]` | Allowed child elements (enables nested annotation) |
| `attributes` | `list[TEIAttribute]` | Allowed attributes |

Elements may nest: a `biblStruct` can contain `author`, which can contain `persName`. The injector uses the children hierarchy to build valid nesting trees.

### `TEISchema`

A flat container of `TEIElement` objects with a `get(tag) -> TEIElement | None` method for O(1) lookup by tag name.

**Building a schema programmatically:**

```python
from tei_annotator import TEISchema, TEIElement, TEIAttribute

schema = TEISchema(elements=[
    TEIElement(
        tag="persName",
        description="a person's name",
        attributes=[TEIAttribute(name="ref", description="authority URI")],
    ),
    TEIElement(tag="placeName", description="a geographical place name"),
])
```

**Building from a RELAX NG file** (see [`tei.py`](../tei.py)):

```python
from tei_annotator import create_schema

schema = create_schema("schema/tei-bib.rng", element="biblStruct", depth=1)
```

`create_schema` walks the RNG content model breadth-first to `depth` levels, collecting allowed child elements and their attribute definitions automatically.

---

## Span models (`spans.py`)

### `SpanDescriptor`

Produced by the LLM (via the parser) or by the GLiNER detector. A `SpanDescriptor` is **context-anchored**: instead of character offsets (which LLMs count unreliably), it stores the surrounding text around the entity. The resolver later searches the source text for `context` to determine where `text` lives.

| Field | Type | Description |
|-------|------|-------------|
| `element` | `str` | TEI tag to apply, e.g. `"persName"` |
| `text` | `str` | Verbatim text of the entity |
| `context` | `str` | Surrounding text used to locate `text` in the source |
| `attributes` | `dict[str, str]` | Attribute key/value pairs |
| `score` | `float \| None` | Confidence score (populated by GLiNER; LLM-produced spans use `None`) |

All `SpanDescriptor` objects are **flat** β€” they carry no children. Nesting is inferred geometrically by the injector once offsets are known.

### `ResolvedSpan`

The output of the resolver and the input to the injector. Offsets are absolute positions in the plain (tag-stripped) source text.

| Field | Type | Description |
|-------|------|-------------|
| `element` | `str` | TEI tag |
| `start` | `int` | Start character offset (inclusive) |
| `end` | `int` | End character offset (exclusive) |
| `attributes` | `dict[str, str]` | Attribute key/value pairs |
| `children` | `list[ResolvedSpan]` | Nested child spans (populated by the injector via offset containment) |
| `fuzzy_match` | `bool` | `True` if the context was located by fuzzy rather than exact matching |

Span A is a child of span B if `B.start <= A.start` and `A.end <= B.end`. The injector constructs this nesting tree from the flat list of `ResolvedSpan` objects.
tei_annotator/postprocessing/README.md ADDED
@@ -0,0 +1,112 @@
# Post-processing

Post-processing converts the LLM's raw text output into XML tags inserted at the correct positions in the source text. It is split into four focused modules that execute in sequence:

```
LLM response (string)
        β”‚
        β–Ό
1. parse    β†’ list[SpanDescriptor]             (parser.py)
        β”‚
        β–Ό
2. resolve  β†’ list[ResolvedSpan]               (resolver.py)
        β”‚
        β–Ό
3. validate β†’ list[ResolvedSpan] (filtered)    (validator.py)
        β”‚
        β–Ό
4. inject   β†’ annotated XML string             (injector.py)
```

---

## 1. Parse (`parser.py`)

**Input:** raw string returned by `call_fn`
**Output:** `list[SpanDescriptor]`

Handles the messiness of real LLM output:

- **Fence stripping** β€” models often wrap JSON in markdown code fences (`` ```json … ``` ``). These are detected and removed before parsing, regardless of the language tag used.
- **JSON extraction** β€” the cleaned string is parsed with `json.loads`. Only a JSON array is accepted; any other top-level type raises `ValueError`.
- **Dict-to-span conversion** β€” each dict is validated for required keys (`element`, `text`, `context`) and coerced to a `SpanDescriptor`. Dicts with missing or non-string values are silently dropped.
- **Retry on failure** β€” for `TEXT_GENERATION` endpoints, if the initial parse fails the pipeline calls `make_correction_prompt()` and retries exactly once with the same `call_fn`. If the second parse also fails, the chunk is skipped with a warning.
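The first three steps can be sketched as follows (a simplified stand-in for `parser.py`; the regex and helper name are assumptions):

```python
import json
import re

# Matches an optional surrounding markdown fence with any language tag.
FENCE = re.compile(r"^```[\w-]*\n(.*)\n```$", re.DOTALL)

def parse_response_sketch(raw: str) -> list[dict]:
    """Strip an optional code fence, require a JSON array, keep well-formed dicts."""
    cleaned = raw.strip()
    m = FENCE.match(cleaned)
    if m:
        cleaned = m.group(1)
    data = json.loads(cleaned)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    required = ("element", "text", "context")
    return [
        d for d in data
        if isinstance(d, dict)
        and all(isinstance(d.get(k), str) for k in required)
    ]
```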
---

## 2. Resolve (`resolver.py`)

**Input:** `list[SpanDescriptor]`, plain source text
**Output:** `list[ResolvedSpan]`

Converts context-anchored descriptors to absolute character offsets.

### Why context anchoring?

LLMs cannot reliably count characters in long strings, so asking them to return numeric offsets produces frequent errors. Instead, each span is described with its surrounding text (`SpanDescriptor.context`). The resolver searches the source text for that context string to determine where the entity lives. This approach trades a small risk of positional ambiguity (two identical context strings in different positions) for a large gain in robustness. In practice, entity text is distinctive enough that collisions are rare.

### Resolution algorithm

For each `SpanDescriptor`:

1. **Locate context** β€” search for `context` in the source text:
   - First, try exact substring search (`str.find`).
   - If not found, fall back to **fuzzy matching** via [rapidfuzz](https://github.com/rapidfuzz/RapidFuzz) with a default similarity threshold of 0.92. Fuzzy-matched spans receive `ResolvedSpan.fuzzy_match = True` and appear in `AnnotationResult.fuzzy_spans` for human review.
   - If neither match succeeds, the span is discarded.

2. **Locate text within context** β€” once the context window is pinned to a position in the source text, do a substring search for `SpanDescriptor.text` within that window.
   - If not found, the span is discarded.

3. **Compute offsets** β€” the context window's start offset plus the text's offset within the window gives `(start, end)` in the original source text.
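The exact-match path of steps 1–3 reduces to two substring searches (the fuzzy fallback is omitted here; in the real resolver a failed `str.find` falls through to rapidfuzz):

```python
def resolve_span_sketch(source: str, span_text: str, context: str):
    """Exact-match resolution: find the context, then the text inside it."""
    ctx_start = source.find(context)    # step 1: locate the context window
    if ctx_start == -1:
        return None                     # real resolver falls back to fuzzy matching here
    inner = context.find(span_text)     # step 2: locate the text inside the window
    if inner == -1:
        return None
    start = ctx_start + inner           # step 3: absolute offsets in the source
    return start, start + len(span_text)
```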
61
+
62
+ ### Fuzzy matching details
63
+
64
+ rapidfuzz's `extractOne` function scores candidate substrings using the `partial_ratio` scorer, which finds the best-matching substring of the target for the query string. The 0.92 threshold was chosen empirically to accept minor OCR errors, whitespace normalisation differences, and Unicode equivalences while rejecting clearly wrong matches.
65
+
66
+ ---
67
+
68
+ ## 3. Validate (`validator.py`)
69
+
70
+ **Input:** `list[ResolvedSpan]`, `TEISchema`
71
+ **Output:** `list[ResolvedSpan]` (filtered)
72
+
73
+ Three checks are applied. Spans failing any check are silently dropped and logged at `WARNING` level:
74
+
75
+ 1. **Element exists** β€” the span's `element` must appear in the schema's element list.
76
+ 2. **Attribute names** β€” every key in `ResolvedSpan.attributes` must be declared in the element's `TEIAttribute` list.
77
+ 3. **Attribute values** β€” if `TEIAttribute.allowed_values` is set, the value must be one of those strings.
78
+
79
+ Additionally, spans with invalid bounds (`start < 0`, `end > len(text)`, or `start >= end`) are rejected.
80
+
81
+ Validation is a safety net against hallucinations: the LLM occasionally invents element tags or attribute names that are not in the schema. Dropping them (with a logged warning) keeps the output schema-valid.
82
+
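A minimal sketch of these checks, with the schema flattened to a plain dict (the real `TEISchema` and `ResolvedSpan` types differ; the shapes here are purely illustrative):

```python
# Illustrative schema: element -> {attribute name: allowed_values or None (= any)}.
schema = {
    "persName": {"ref": None},
    "placeName": {"type": ["city", "country"]},
}


def is_valid(span: dict, schema: dict, text_len: int) -> bool:
    elem = schema.get(span["element"])
    if elem is None:                       # 1. element must exist in the schema
        return False
    for key, value in span.get("attributes", {}).items():
        if key not in elem:                # 2. attribute name must be declared
            return False
        allowed = elem[key]
        if allowed is not None and value not in allowed:
            return False                   # 3. value must be in allowed_values
    start, end = span["start"], span["end"]
    return 0 <= start < end <= text_len    # bounds check


spans = [
    {"element": "persName", "attributes": {"ref": "#mc"}, "start": 14, "end": 25},
    {"element": "orgName", "attributes": {}, "start": 0, "end": 3},   # unknown element
    {"element": "placeName", "attributes": {"type": "river"}, "start": 38, "end": 44},
]
kept = [s for s in spans if is_valid(s, schema, text_len=45)]
print([s["element"] for s in kept])  # -> ['persName']
```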
83
+ ---
84
+
85
+ ## 4. Inject (`injector.py`)
86
+
87
+ **Input:** plain source text, `list[ResolvedSpan]`
88
+ **Output:** annotated XML string
89
+
90
+ ### Building the nesting tree
91
+
92
+ Spans are flat at this point (no children). The injector infers nesting geometrically: span A is a child of span B if `B.start <= A.start` and `A.end <= B.end`.
93
+
94
+ `_build_nesting_tree` implements a greedy algorithm:
95
+
96
+ 1. Sort spans by start offset, then by length descending (so parents — which are longer — sort before their children).
97
+ 2. For each span, find its tightest enclosing parent by scanning already-placed spans.
98
+ 3. Attach it as a child of that parent, or as a root-level span if no enclosing parent exists.
99
+
100
+ **Overlapping spans** — where two spans partially overlap without one containing the other — are detected and logged as warnings. The offending span (the one with the later start position) is skipped, since overlapping XML tags are not well-formed.
101
+
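The greedy algorithm can be sketched as follows. The `Span` dataclass and function name are simplified stand-ins for the actual implementation; only the sort order, tightest-parent scan, and overlap skip mirror the description above:

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    element: str
    start: int
    end: int
    children: list = field(default_factory=list)


def build_nesting_tree(spans):
    """A is a child of B when B.start <= A.start and A.end <= B.end."""
    # Parents (longer) sort before their children at the same start offset.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    roots, placed = [], []
    for span in ordered:
        # Partial overlap without containment: skip, tags would not be well-formed.
        if any(c.start < span.start < c.end < span.end for c in placed):
            continue
        # Tightest enclosing parent = last placed span that contains this one.
        parent = None
        for candidate in placed:
            if candidate.start <= span.start and span.end <= candidate.end:
                parent = candidate
        (parent.children if parent else roots).append(span)
        placed.append(span)
    return roots


spans = [Span("p", 0, 40), Span("persName", 4, 15), Span("surname", 10, 15)]
roots = build_nesting_tree(spans)
print(roots[0].element, [c.element for c in roots[0].children])  # -> p ['persName']
```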
102
+ ### Recursive tag injection
103
+
104
+ `_inject_recursive` walks the nesting tree depth-first. At each node it emits:
105
+
106
+ 1. Any source text between the previous sibling's end and this span's start.
107
+ 2. The opening tag with any attributes, e.g. `<persName ref="...">`.
108
+ 3. Recursively, the content of all child spans.
109
+ 4. Any source text after the last child and before this span's end.
110
+ 5. The closing tag, e.g. `</persName>`.
111
+
112
+ Because the injector works from pre-computed offsets on the **original source text**, the text content is never altered — only tags are inserted. This preserves the exact wording, whitespace, and punctuation of the input.
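A compact sketch of the depth-first walk, assuming a simplified `Span` shape and sibling spans that are already sorted and non-overlapping (illustrative names, not the actual code):

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    element: str
    start: int
    end: int
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def inject(text: str, spans: list, lo: int, hi: int) -> str:
    """Emit text[lo:hi] with tags for `spans` inserted around their offsets."""
    out, cursor = [], lo
    for span in spans:
        out.append(text[cursor:span.start])             # 1. text before this span
        attrs = "".join(f' {k}="{v}"' for k, v in span.attributes.items())
        out.append(f"<{span.element}{attrs}>")          # 2. opening tag
        out.append(inject(text, span.children,          # 3+4. children and inner text
                          span.start, span.end))
        out.append(f"</{span.element}>")                # 5. closing tag
        cursor = span.end
    out.append(text[cursor:hi])                         # trailing text
    return "".join(out)


text = "The scientist Marie Curie was born in Warsaw."
tree = [Span("persName", 14, 25, {"ref": "#mc"}),
        Span("placeName", 38, 44)]
print(inject(text, tree, 0, len(text)))
```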
tei_annotator/prompting/README.md ADDED
@@ -0,0 +1,69 @@
1
+ # Prompt building
2
+
3
+ The prompt builder (`builder.py`) constructs the full instruction string sent to the LLM for each text chunk. It uses [Jinja2](https://jinja.palletsprojects.com/) templates so that the prompt structure can be read and modified independently of the Python code.
4
+
5
+ ---
6
+
7
+ ## Templates
8
+
9
+ Two templates live in `templates/`, one per `EndpointCapability`:
10
+
11
+ ### `text_gen.jinja2` — for `TEXT_GENERATION` endpoints
12
+
13
+ Used with standard chat/completion LLMs. The template includes:
14
+
15
+ 1. **Role preamble** — declares the model's role as a TEI XML annotation assistant.
16
+ 2. **Schema description** — for each `TEIElement`: tag name, description, allowed attributes (with descriptions and allowed values if constrained), and allowed child elements.
17
+ 3. **Pre-detected candidates** (if any) — rendered as a JSON array so the LLM sees what GLiNER found and can decide whether to confirm, correct, or discard each candidate.
18
+ 4. **Source text** — the chunk to annotate, presented verbatim.
19
+ 5. **Output instructions** — instruct the model to return a JSON array of objects with no prose, markdown, or explanation:
20
+
21
+ ```json
22
+ [
23
+ {
24
+ "element": "persName",
25
+ "text": "Marie Curie",
26
+ "context": "scientist Marie Curie was born in",
27
+ "attrs": {"ref": "https://viaf.org/viaf/36924049/"}
28
+ }
29
+ ]
30
+ ```
31
+
32
+ ### `json_enforced.jinja2` — for `JSON_ENFORCED` endpoints
33
+
34
+ A compact variant for constrained-decoding endpoints (e.g. vLLM with guided JSON). Verbose explanations are omitted because the endpoint guarantees syntactically valid JSON output; less hand-holding is needed. The expected JSON structure is identical.
35
+
36
+ ---
37
+
38
+ ## The `context` field: why it exists
39
+
40
+ LLMs cannot reliably count characters in long strings, so asking them to return numeric offsets produces errors. Instead, each span is described by its surrounding text (`context`). The resolver later anchors each span to the source text by searching for `context` as a substring (exact match first, then fuzzy). This makes the pipeline robust to small model errors in character arithmetic.
41
+
42
+ The prompt instructs the model to copy a ~20–40 character window around the entity verbatim from the source text — enough to uniquely identify the span in most real-world texts.
43
+
44
+ ---
45
+
46
+ ## Self-correction / retry
47
+
48
+ When a `TEXT_GENERATION` response cannot be parsed as JSON, `make_correction_prompt()` constructs a follow-up prompt that:
49
+
50
+ 1. Quotes the malformed response.
51
+ 2. States the parse error.
52
+ 3. Asks the model to return only a corrected JSON array, nothing else.
53
+
54
+ The pipeline sends this correction prompt to the same `call_fn`. If the corrected response also fails to parse, the chunk is skipped with a warning — one retry is the limit.
55
+
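The retry flow can be sketched as follows. The prompt wording and the `annotate_chunk` helper are illustrative; only the correction-prompt structure and the one-retry limit mirror the pipeline:

```python
import json


def make_correction_prompt(bad_response: str, error: str) -> str:
    # Mirrors the three parts described above (wording is illustrative).
    return (
        "Your previous response could not be parsed as JSON.\n"
        f"Response:\n{bad_response}\n"
        f"Parse error: {error}\n"
        "Return only the corrected JSON array, nothing else."
    )


def annotate_chunk(prompt: str, call_fn):
    """Parse the LLM reply; on failure, retry once with a correction prompt."""
    reply = call_fn(prompt)
    try:
        return json.loads(reply)
    except json.JSONDecodeError as err:
        retry = call_fn(make_correction_prompt(reply, str(err)))
        try:
            return json.loads(retry)
        except json.JSONDecodeError:
            return None  # chunk is skipped: one retry is the limit


# Hypothetical call_fn that fails once, then returns valid JSON.
replies = iter(['[{"element": "persName",]', '[{"element": "persName"}]'])
print(annotate_chunk("annotate ...", lambda p: next(replies)))
```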
56
+ ---
57
+
58
+ ## Jinja2 environment
59
+
60
+ `_get_env()` initialises a `jinja2.Environment` with a custom `tojson` filter (a thin wrapper around `json.dumps`). This is necessary because `tojson` is provided automatically only in Flask applications; without it, serialising Python objects to JSON inside templates would fail.
61
+
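A minimal sketch of such an environment; the filter body and template text are illustrative, not the actual `_get_env()` code:

```python
import json

from jinja2 import Environment


def get_env() -> Environment:
    # Register a tojson filter explicitly, as described above.
    env = Environment()
    env.filters["tojson"] = lambda value: json.dumps(value, ensure_ascii=False)
    return env


template = get_env().from_string(
    "Pre-detected candidates:\n{{ candidates | tojson }}"
)
print(template.render(candidates=[{"element": "persName", "text": "Marie Curie"}]))
```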
62
+ ---
63
+
64
+ ## Prompt size considerations
65
+
66
+ Each prompt is built per chunk. With `chunk_size=1500` characters, the schema preamble typically adds 200–500 characters, depending on the number of elements and attributes. For very large schemas (many elements, many attributes with long descriptions), consider:
67
+
68
+ - Reducing `chunk_size` to keep total prompt length within model limits.
69
+ - Splitting the schema into focused subsets and running multiple `annotate()` passes, one per element group.