# Implementation Plan: `tei-annotator`

Prompt:

> Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
>
> **Inputs:**
>
> - A text string to annotate (may already contain partial XML)
> - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
> - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
> - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
>
> The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.

## Package Structure

```
tei_annotator/
├── __init__.py
├── models/
│   ├── schema.py           # TEI element/attribute data structures
│   └── spans.py            # Span manifest data structures
├── detection/
│   └── gliner_detector.py  # Optional local GLiNER first-pass span detection
├── chunking/
│   └── chunker.py          # Overlap-aware text chunker, XML-safe boundaries
├── prompting/
│   ├── builder.py          # Prompt assembly
│   └── templates/
│       ├── text_gen.jinja2         # For plain text-generation endpoints
│       └── json_enforced.jinja2    # For JSON-mode / constrained endpoints
├── inference/
│   └── endpoint.py         # Endpoint wrapper + capability enum
├── postprocessing/
│   ├── resolver.py         # Context-anchor → char offset resolution
│   ├── validator.py        # Span verification + schema validation
│   └── injector.py         # Deterministic XML construction
└── pipeline.py             # Top-level orchestration
```

---

## Data Structures (`models/`)

```python
# schema.py
from dataclasses import dataclass
@dataclass
class TEIAttribute:
    name: str                        # e.g. "ref", "type", "cert"
    description: str
    required: bool = False
    allowed_values: list[str] | None = None   # None = free string

@dataclass
class TEIElement:
    tag: str                         # e.g. "persName"
    description: str                 # from TEI Guidelines
    allowed_children: list[str]      # tags of legal child elements
    attributes: list[TEIAttribute]

@dataclass
class TEISchema:
    elements: list[TEIElement]
    # convenience lookup
    def get(self, tag: str) -> TEIElement | None: ...

# spans.py
@dataclass
class SpanDescriptor:
    element: str
    text: str
    context: str                     # must contain text as substring
    attrs: dict[str, str]
    # always flat — nesting is inferred from offset containment in resolver/injector,
    # not emitted by the model (models produce unreliable nested trees)
    confidence: float | None = None  # passed through from GLiNER

@dataclass
class ResolvedSpan:
    element: str
    start: int
    end: int
    attrs: dict[str, str]
    children: list["ResolvedSpan"]
    fuzzy_match: bool = False        # flagged for human review
```
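
The `get` stub above can be read as a simple linear lookup. A runnable sketch (field types and defaults trimmed for brevity; a dict index would serve better for large schemas):

```python
from dataclasses import dataclass


@dataclass
class TEIElement:
    tag: str
    description: str
    allowed_children: list
    attributes: list


@dataclass
class TEISchema:
    elements: list

    def get(self, tag: str):
        # linear scan over the in-scope elements; None if out of scope
        for el in self.elements:
            if el.tag == tag:
                return el
        return None


schema = TEISchema(elements=[
    TEIElement("persName", "a person's name", [], []),
])
assert schema.get("persName").description == "a person's name"
assert schema.get("orgName") is None
```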

---

## GLiNER Dependency (`detection/gliner_detector.py`)

`gliner` is declared as an optional extra (`[gliner]`) in `pyproject.toml`; installing the extra is the only setup step.

Model weights are fetched from HuggingFace Hub automatically on first use of `GLiNER.from_pretrained(model_id)` and cached in `~/.cache/huggingface/`. If the import fails at runtime (e.g. the optional extra was not installed), the module raises a standard `ImportError` with a clear message — no wrapper needed.

Recommended models (specified as `gliner_model` parameter):

- `urchade/gliner_medium-v2.1` — balanced, Apache 2.0
- `numind/NuNER_Zero` — stronger multi-word entities, MIT (default)
- `knowledgator/gliner-multitask-large-v0.5` — adds relation extraction

All models run on CPU; no GPU required.

---

## Endpoint Abstraction (`inference/endpoint.py`)

```python
class EndpointCapability(Enum):
    TEXT_GENERATION = "text_generation"   # plain LLM, JSON via prompt only
    JSON_ENFORCED   = "json_enforced"     # constrained decoding guaranteed
    EXTRACTION      = "extraction"        # GLiNER2/NuExtract-style native

@dataclass
class EndpointConfig:
    capability: EndpointCapability
    call_fn: Callable[[str], str]
    # call_fn signature: takes a prompt string, returns a response string
    # caller is responsible for auth, model selection, retries
```

The `call_fn` injection means the library is agnostic about whether the caller is hitting Anthropic, OpenAI, a local Ollama instance, or Fastino's GLiNER2 API. The library just hands it a string and gets a string back.
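
Because the contract is just `(str) -> str`, cross-cutting concerns such as logging or metrics can be layered on with plain wrappers. A hypothetical sketch, not part of the library:

```python
import time
from typing import Callable


def with_logging(call_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any call_fn to record prompt size, response size, and latency."""
    log: list[dict] = []

    def wrapped(prompt: str) -> str:
        t0 = time.perf_counter()
        response = call_fn(prompt)
        log.append({
            "prompt_chars": len(prompt),
            "response_chars": len(response),
            "seconds": time.perf_counter() - t0,
        })
        return response

    wrapped.log = log  # attach the record list for inspection after a run
    return wrapped


# a mock endpoint for tests -- any (str) -> str callable works
mock = with_logging(lambda prompt: "[]")
assert mock("annotate this") == "[]"
assert mock.log[0]["prompt_chars"] == 13
```

The same shape works for retries, caching, or swapping providers without the library knowing.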

---

## Pipeline (`pipeline.py`)

```python
@dataclass
class AnnotationResult:
    xml: str                          # annotated XML string
    fuzzy_spans: list[ResolvedSpan]   # spans flagged for human review

def annotate(
    text: str,                        # may contain existing XML tags
    schema: TEISchema,                # subset of TEI elements in scope
    endpoint: EndpointConfig,         # injected inference dependency
    gliner_model: str | None = "numind/NuNER_Zero",  # None disables GLiNER pass
    chunk_size: int = 1500,           # chars
    chunk_overlap: int = 200,
) -> AnnotationResult:
```

### Execution Flow

```
1. SETUP
   strip existing XML tags from text for processing,
   preserve them as a restoration map for final merge

2. GLINER PASS  (skipped if gliner_model=None, endpoint is EXTRACTION, or text is short)
   map TEISchema elements → flat label list for GLiNER
     e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
   chunk text if len(text) > chunk_size (with overlap)
   run gliner.predict_entities() on each chunk
   merge cross-chunk duplicates by span text + context overlap
   output: list[SpanDescriptor] with text + context + element + confidence
   (GLiNER is a pre-filter only; the LLM may reject, correct, or extend its candidates)

3. PROMPT ASSEMBLY  
   select template based on EndpointCapability:
     TEXT_GENERATION:   include JSON structure example + "output only JSON" instruction
     JSON_ENFORCED:     minimal prompt, schema enforced externally
     EXTRACTION:        pass schema directly in endpoint's native format, skip LLM prompt
   inject into prompt:
     - TEIElement descriptions + allowed attributes for in-scope elements
     - GLiNER pre-detected spans as candidates for the model to enrich/correct
     - source text chunk
     - instruction to emit one SpanDescriptor per occurrence, not per unique entity

4. INFERENCE
   call endpoint.call_fn(prompt) → raw response string

5. POSTPROCESSING  (per chunk, then merged)

   a. Parse
      JSON_ENFORCED/EXTRACTION: parse directly
      TEXT_GENERATION: strip markdown fences, parse JSON,
                       on failure: retry once with a correction prompt that includes
                       the original (bad) response and the parse error message,
                       so the model can self-correct rather than starting from scratch

   b. Resolve  (resolver.py)
      for each SpanDescriptor:
        find context string in source → exact match preferred
        find text within context window
        assert source[start:end] == span.text → reject on mismatch
        fuzzy fallback (threshold 0.92) → flag for review

   c. Validate  (validator.py)
      reject spans where text not in source
      check attributes against TEISchema allowed values
      check element is in schema scope

   d. Inject  (injector.py)
      infer nesting from offset containment (child ⊂ parent by [start, end] bounds)
      check inferred nesting: children must be within parent bounds
      sort ResolvedSpans by start offset, handle nesting depth-first
      insert tags into a copy of the original source string
      restore previously existing XML tags from step 1

   e. Final validation
      parse output as XML → reject malformed documents
      optionally validate against full TEI RelaxNG schema via lxml

6. RETURN
   AnnotationResult(
     xml=annotated_xml_string,
     fuzzy_spans=list_of_flagged_resolved_spans,
   )
```
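
The resolve step (5b) can be sketched as follows for the exact-match path; the fuzzy fallback is omitted and the function name is illustrative:

```python
def resolve(source: str, context: str, text: str):
    """Resolve a (context, text) anchor pair to absolute char offsets.

    Returns (start, end) on success, None if the span cannot be verified.
    """
    ctx_start = source.find(context)
    if ctx_start == -1:
        return None                  # context not in source -> reject
    rel = context.find(text)
    if rel == -1:
        return None                  # text not inside its context -> reject
    start = ctx_start + rel
    end = start + len(text)
    # the invariant the whole pipeline depends on: output text == input text
    assert source[start:end] == text
    return start, end


src = "He said John Smith yesterday."
assert resolve(src, "said John Smith yesterday", "John Smith") == (8, 18)
assert resolve(src, "not in source", "John Smith") is None
```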

---

## Key Design Constraints

- The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
- The **GLiNER pass is optional** (`gliner_model=None` disables it). It is most useful for long texts with `TEXT_GENERATION` endpoints; it is skipped automatically for `EXTRACTION` endpoints or short inputs. When enabled, GLiNER is a pre-filter only — the LLM may reject, correct, or extend its candidates.
- **Span nesting is inferred from offsets**, never emitted by the model. `SpanDescriptor` is always flat; `ResolvedSpan.children` is populated by the injector from containment relationships.
- `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
- Fuzzy-matched spans are **surfaced, not silently accepted** — `AnnotationResult.fuzzy_spans` provides a reviewable list alongside the XML.
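
The offset-containment rule can be sketched as a sort + stack pass (illustrative; the real injector additionally warns on dropped overlaps):

```python
def build_nesting(spans):
    """Infer a containment tree from flat (start, end, tag) spans.

    Sort by (start asc, length desc) so parents precede their children;
    partial overlaps are silently dropped in this sketch.
    """
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    roots, stack = [], []   # stack holds the chain of currently-open parents
    for start, end, tag in ordered:
        node = {"tag": tag, "start": start, "end": end, "children": []}
        while stack and start >= stack[-1]["end"]:
            stack.pop()                 # close parents that ended before us
        if stack and end > stack[-1]["end"]:
            continue                    # partial (non-nesting) overlap: drop
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


tree = build_nesting([(8, 18, "persName"), (0, 29, "s"), (13, 18, "surname")])
assert tree[0]["tag"] == "s"
assert tree[0]["children"][0]["tag"] == "persName"
assert tree[0]["children"][0]["children"][0]["tag"] == "surname"
```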

---

## Testing Strategy

### Mocking philosophy

**Always mock `call_fn` and the GLiNER detector in unit tests.** Do not use a real GLiNER model as a substitute for a remote LLM endpoint — GLiNER is a span-labelling model that cannot produce JSON responses; it cannot exercise the parse/resolve/inject pipeline. Using a real model also makes tests slow (~seconds per inference on CPU), non-deterministic across versions and hardware, and dependent on a 400MB+ download.

The `call_fn: (str) -> str` design makes mocking trivial — a lambda returning a hardcoded JSON string is sufficient. No mock framework is needed.

### Test layers

**Layer 1 — Unit tests** (always run, <1s total, fully mocked):

```
tests/
├── test_chunker.py        # chunker unit tests
├── test_resolver.py       # resolver unit tests
├── test_validator.py      # validator unit tests
├── test_injector.py       # injector unit tests
├── test_builder.py        # prompt builder unit tests
├── test_parser.py         # JSON parse + retry unit tests
└── test_pipeline.py       # full pipeline smoke test (mocked call_fn + GLiNER)
```

**Layer 2 — Integration tests** (opt-in, gated by `pytest -m integration`):

```
tests/integration/
├── test_gliner_detector.py   # real GLiNER model, real HuggingFace download
└── test_pipeline_e2e.py      # full annotate() with real GLiNER + mocked call_fn
```

Integration tests are excluded from CI by default via `pyproject.toml`:

```toml
[tool.pytest.ini_options]
addopts = "-m 'not integration'"
```

### TDD cycle (red → green → refactor)

Each module is written test-first. Write a failing test, implement the minimum code to pass, refactor.

### Key test cases per module

**`chunker.py`**

- Short text below `chunk_size` → single chunk, offset 0
- Long text → multiple chunks with correct `start_offset` per chunk
- Span exactly at a chunk boundary → appears in both chunks with correct global offset
- Input with existing XML tags → chunk boundaries never split a tag
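
A minimal chunker satisfying the first two cases might look like this (sketch; the XML-safe boundary adjustment is omitted):

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200):
    """Split text into overlapping chunks, each paired with its global offset."""
    if len(text) <= chunk_size:
        return [(0, text)]
    chunks, start = [], 0
    step = chunk_size - chunk_overlap   # advance per chunk
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break                        # last chunk reached the end
        start += step
    return chunks


assert chunk_text("short") == [(0, "short")]
parts = chunk_text("a" * 3000, chunk_size=1500, chunk_overlap=200)
assert parts[0][0] == 0 and parts[1][0] == 1300
# adjacent chunks share chunk_overlap characters
assert parts[0][1][-200:] == parts[1][1][:200]
```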

**`resolver.py`**

- Exact context match → `ResolvedSpan` with `fuzzy_match=False`
- Context not found in source → span rejected
- `source[start:end] != span.text` → span rejected
- Context found but span text not within it → span rejected
- Context found, span text found, score < 0.92 → span rejected
- Context found, span text found, 0.92 ≤ score < 1.0 → `fuzzy_match=True`
- Multiple occurrences of context → first match used, or rejection if ambiguous

**`validator.py`**

- Element not in schema → span rejected
- Attribute not in schema → span rejected
- Attribute value not in `allowed_values` → span rejected
- Valid span → passes through unchanged

**`injector.py`**

- Two non-overlapping spans → both tags inserted correctly
- Span B offset-contained in span A → B is child of A in output
- Overlapping (non-nesting) spans → reject or flatten with warning
- Restored XML tags from step 1 do not conflict with injected tags

**`builder.py`**

- `TEXT_GENERATION` capability → prompt contains JSON example and "output only JSON" instruction
- `JSON_ENFORCED` capability → prompt is minimal (no JSON scaffolding)
- GLiNER candidates present → candidates appear in prompt
- GLiNER candidates absent (pass skipped) → prompt has no candidate section

**`parser.py` (parse + retry logic)**

- Valid JSON response → parsed to `list[SpanDescriptor]` without retry
- Markdown-fenced JSON → fences stripped, parsed correctly
- Invalid JSON on first attempt → retry triggered with correction prompt that includes original bad response + parse error message
- Invalid JSON on second attempt → exception raised, chunk skipped
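
The behaviour these cases describe can be sketched as follows (names and signature are illustrative, not the module's exact API):

```python
import json
import re


def parse_response(raw: str, call_fn=None, make_correction_prompt=None):
    """Parse a model response into span dicts, with one self-correction retry."""

    def strip_fences(s: str) -> str:
        # remove a leading ```json / ``` fence and a trailing ``` fence
        return re.sub(r"^```(?:json)?\s*|\s*```$", "", s.strip())

    try:
        return json.loads(strip_fences(raw))
    except json.JSONDecodeError as err:
        if call_fn is None:              # JSON_ENFORCED / EXTRACTION: no retry
            raise
        # feed the bad response and the parse error back for self-correction
        retry = call_fn(make_correction_prompt(raw, str(err)))
        return json.loads(strip_fences(retry))   # second failure propagates


good = '```json\n[{"element": "persName", "text": "Ada"}]\n```'
assert parse_response(good) == [{"element": "persName", "text": "Ada"}]

fixed = parse_response(
    "oops, not json",
    call_fn=lambda prompt: "[]",
    make_correction_prompt=lambda bad, err: f"Fix this:\n{bad}\nError: {err}",
)
assert fixed == []
```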

**`pipeline.py` (smoke test)**

```python
def mock_call_fn(prompt: str) -> str:
    return json.dumps([
        {"element": "persName", "text": "John Smith",
         "context": "...said John Smith yesterday...", "attrs": {}}
    ])

def test_annotate_smoke():
    schema = TEISchema(elements=[
        TEIElement(tag="persName", description="a person's name",
                   allowed_children=[], attributes=[])
    ])
    endpoint = EndpointConfig(
        capability=EndpointCapability.JSON_ENFORCED,
        call_fn=mock_call_fn,
    )
    result = annotate(
        text="He said John Smith yesterday.",
        schema=schema,
        endpoint=endpoint,
        gliner_model=None,   # disable GLiNER in unit tests
    )
    assert "persName" in result.xml
    assert "John Smith" in result.xml
    assert result.xml.count("John Smith") == 1   # text not duplicated
```

---

## Implementation Status

**Completed 2026-02-28** — full implementation per the plan above.

### What was built

All modules in the package structure were implemented:

| File | Notes |
| --- | --- |
| `tei_annotator/models/schema.py` | `TEIAttribute`, `TEIElement`, `TEISchema` dataclasses |
| `tei_annotator/models/spans.py` | `SpanDescriptor`, `ResolvedSpan` dataclasses |
| `tei_annotator/inference/endpoint.py` | `EndpointCapability` enum, `EndpointConfig` dataclass |
| `tei_annotator/chunking/chunker.py` | `chunk_text()` — overlap chunker, XML-safe boundaries |
| `tei_annotator/detection/gliner_detector.py` | `detect_spans()` — optional, raises `ImportError` if `[gliner]` extra not installed |
| `tei_annotator/prompting/builder.py` | `build_prompt()` + `make_correction_prompt()` |
| `tei_annotator/prompting/templates/text_gen.jinja2` | Verbose prompt with JSON example, "output only JSON" instruction |
| `tei_annotator/prompting/templates/json_enforced.jinja2` | Minimal prompt for constrained-decoding endpoints |
| `tei_annotator/postprocessing/parser.py` | `parse_response()` — fence stripping, one-shot self-correction retry |
| `tei_annotator/postprocessing/resolver.py` | `resolve_spans()` — context-anchor → char offset, rapidfuzz fuzzy fallback at threshold 0.92 |
| `tei_annotator/postprocessing/validator.py` | `validate_spans()` — element, attribute name, allowed-value checks |
| `tei_annotator/postprocessing/injector.py` | `inject_xml()` — stack-based nesting tree, recursive tag insertion |
| `tei_annotator/pipeline.py` | `annotate()` — full orchestration, tag strip/restore, deduplication across chunks, lxml final validation |

### Dependencies added

Runtime: `jinja2`, `lxml`, `rapidfuzz`. Optional extra `[gliner]` for GLiNER support. Dev: `pytest`, `pytest-cov`.

### Tests

- **63 unit tests** (Layer 1) — fully mocked, run in < 0.1 s via `uv run pytest`
- **9 integration tests** (Layer 2, no GLiNER) — complex resolver/injector/pipeline scenarios, run via `uv run pytest --override-ini="addopts=" -m integration tests/integration/test_pipeline_e2e.py -k "not real_gliner"`
- **1 GLiNER integration test** — requires `[gliner]` extra and HuggingFace model download

### Smoke script

`scripts/smoke_test_llm.py` — end-to-end test with real LLM calls (no GLiNER). Verified against:

- **Google Gemini 2.0 Flash** (`GEMINI_API_KEY` from `.env`)
- **KISSKI `llama-3.3-70b-instruct`** (`KISSKI_API_KEY` from `.env`, OpenAI-compatible API at `https://chat-ai.academiccloud.de/v1`)

Run with `uv run scripts/smoke_test_llm.py`.

### Key implementation notes

- The `_strip_existing_tags` / `_restore_existing_tags` pair in `pipeline.py` preserves original markup by tracking plain-text offsets of each stripped tag and re-inserting them after annotation.
- `_build_nesting_tree` in `injector.py` uses a sort-by-(start-asc, length-desc) + stack algorithm; partial overlaps are dropped with a `warnings.warn`.
- The resolver does an exact `str.find` first; fuzzy search (sliding-window rapidfuzz) is only attempted if exact fails and rapidfuzz is installed.
- `parse_response` passes `call_fn` and `make_correction_prompt` only for `TEXT_GENERATION` endpoints; `JSON_ENFORCED` and `EXTRACTION` never retry.

---

## RNG Schema Parsing (`tei_annotator/tei.py`)

**Added 2026-02-28** — `create_schema()` parses a RELAX NG schema file and produces a `TEISchema` for use with `annotate()`.

### Function signature

```python
def create_schema(
    schema_path: str | Path,
    element: str = "text",
    depth: int = 1,
) -> TEISchema
```

- `schema_path` — path to a `.rng` file (e.g. `schema/tei-bib.rng`).
- `element` — name of the root TEI element to start from (default: `"text"`).
- `depth` — how many levels of descendant elements to include. `depth=1` adds the root element **and** its direct children; `depth=2` adds grandchildren too; `depth=0` adds only the root itself.
- Raises `ValueError` if the element is not found in the schema.

### How it works

1. Parses the `.rng` with lxml, builds a `{define_name: element_node}` lookup table.
2. Builds a reverse map from TEI element names (e.g. `"persName"`) to their RNG define names (e.g. `"tbibpersName"`).
3. BFS from the requested element, collecting `TEIElement` entries level by level up to `depth`.
4. For each element:
   - **description** — extracted from the first `<a:documentation>` child inside the RNG `<element>` node.
   - **`allowed_children`** — content-model `<ref>` nodes are expanded recursively through macro/model groups; attribute-group refs (names containing `"att."`) are skipped; element-bearing defines are short-circuited (just record the element name, don't recurse — correctly handles self-referential elements like `idno`).
   - **`attributes`** — attribute-group refs are followed recursively; inline `<attribute>` elements are also collected; `required` is inferred from the immediate parent (`<optional>` / `<zeroOrMore>` → False, otherwise True); `allowed_values` comes from `<choice><value>` enumeration inside the attribute definition.
5. Deduplicates children and attributes (preserving order) before constructing each `TEIElement`.
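
Steps 1-2 and the description/children extraction can be illustrated on a toy RNG fragment (stdlib `ElementTree` here for brevity; the module itself uses lxml, and the define names below are made up):

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
DOC_NS = "http://relaxng.org/ns/compatibility/annotations/1.0"

RNG = f"""<grammar xmlns="{RNG_NS}" xmlns:a="{DOC_NS}">
  <define name="tbibpersName">
    <element name="persName">
      <a:documentation>personal name</a:documentation>
      <ref name="tbibsurname"/>
    </element>
  </define>
  <define name="tbibsurname">
    <element name="surname">
      <a:documentation>family name</a:documentation>
    </element>
  </define>
</grammar>"""

root = ET.fromstring(RNG)

# step 1: {define_name: element_node} for defines that carry an <element>
defines = {
    d.get("name"): d.find(f"{{{RNG_NS}}}element")
    for d in root.findall(f"{{{RNG_NS}}}define")
    if d.find(f"{{{RNG_NS}}}element") is not None
}
# step 2: reverse map from TEI element name to its define name
by_element = {el.get("name"): name for name, el in defines.items()}

el = defines[by_element["persName"]]
description = el.find(f"{{{DOC_NS}}}documentation").text
children = [defines[ref.get("name")].get("name")
            for ref in el.findall(f"{{{RNG_NS}}}ref")]
assert description == "personal name"
assert children == ["surname"]
```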

### Tests (`tests/test_tei.py`)

15 unit tests — run in < 0.05 s, no network, no model downloads:

- `idno` (leaf-like, self-referential, enumerated `type` values: `ISBN`, `ISSN`, `DOI`, …)
- `biblStruct` (explicit named children: `analytic`, `monogr`, `series`, `relatedItem`, `citedRange`; plus model-group expansion: `note`, `ptr`)
- `depth=0` vs `depth=1` behaviour
- Duplicate element / attribute detection
- Unknown element raises `ValueError`

Total unit tests after this addition: **78** (63 original + 15 new).

---

## Evaluation Module (`tei_annotator/evaluation/`)

**Added 2026-02-28** — compares annotator output against a gold-standard TEI XML file to compute precision, recall, and F1 score.

### Package structure

```
tei_annotator/evaluation/
├── __init__.py        # public API exports
├── extractor.py       # EvaluationSpan, extract_spans(), spans_from_xml_string()
├── metrics.py         # MatchMode, SpanMatch, ElementMetrics, EvaluationResult,
│                      # match_spans(), compute_metrics(), aggregate()
└── evaluator.py       # evaluate_bibl(), evaluate_file()
```

### Algorithm

For each gold-standard element (e.g. `<bibl>`):

1. **Extract gold spans** — walk the element tree, record `(tag, start, end, text)` for every descendant element using absolute char offsets in the element's plain text (`"".join(element.itertext())`).
2. **Strip tags** — the same plain text is passed to `annotate()`.
3. **Annotate** — run the full pipeline with the injected `EndpointConfig`.
4. **Extract predicted spans** — parse `result.xml` as `<_root>…</_root>`, then run the same span extractor.
5. **Match** — greedily pair (gold, pred) spans by score (highest first); each span matched at most once.
6. **Compute metrics** — count TP/FP/FN per element type; derive precision, recall, F1.

### Match modes (`MatchMode`)

| Mode | A match if… |
| --- | --- |
| `TEXT` (default) | same element tag + normalised text content |
| `EXACT` | same element tag + identical `(start, end)` offsets |
| `OVERLAP` | same element tag + IoU ≥ `overlap_threshold` (default 0.5) |
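
The IoU score used by `OVERLAP` mode reduces to a few lines (sketch; the module's exact helper may differ):

```python
def iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection-over-union of two [start, end) character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


# identical spans
assert iou((0, 10), (0, 10)) == 1.0
# half-overlapping spans: intersection 5, union 15
assert iou((0, 10), (5, 15)) == 5 / 15
# disjoint spans
assert iou((0, 5), (5, 10)) == 0.0
```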

### Key entry points

```python
# Single element evaluation (lxml element)
result = evaluate_bibl(gold_element, schema, endpoint, gliner_model=None)
print(result.report())

# Full file evaluation
per_record, overall = evaluate_file(
    "tests/fixtures/blbl-examples.tei.xml",
    schema=schema,
    endpoint=endpoint,
    max_items=10,   # optional — first N bibls only
)
print(overall.report())
```

`EvaluationResult` exposes:

- `micro_precision / micro_recall / micro_f1` — aggregate counts, then compute rates
- `macro_precision / macro_recall / macro_f1` — average per-element rates
- `per_element: dict[str, ElementMetrics]` — per-element breakdown
- `matched / unmatched_gold / unmatched_pred` — full span lists for inspection
- `report()` — human-readable summary string
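
The micro/macro distinction in one sketch (hypothetical helper, precision only; recall and F1 follow the same pattern):

```python
def micro_macro(per_element: dict[str, tuple[int, int, int]]):
    """Compute micro vs macro precision from per-element (tp, fp, fn) counts."""
    tps = sum(t for t, _, _ in per_element.values())
    fps = sum(f for _, f, _ in per_element.values())
    # micro: pool all counts, then compute one rate
    micro_p = tps / (tps + fps) if tps + fps else 0.0
    # macro: compute a rate per element, then average the rates
    rates = [t / (t + f) if t + f else 0.0
             for t, f, _ in per_element.values()]
    macro_p = sum(rates) / len(rates) if rates else 0.0
    return micro_p, macro_p


# a frequent, accurate element dominates micro; macro weights elements equally
micro, macro = micro_macro({"persName": (90, 10, 0), "idno": (1, 9, 0)})
assert micro == 91 / 110
assert macro == (0.9 + 0.1) / 2
```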

### Evaluation Tests

39 new unit tests in `tests/test_evaluation.py` — all mocked, run in < 0.1 s.

Total unit tests: **117** (78 + 39).