# Implementation Plan: `tei-annotator`
Prompt:
> Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
>
> **Inputs:**
>
> - A text string to annotate (may already contain partial XML)
> - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
> - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
> - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
> The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.
## Package Structure
```
tei_annotator/
├── __init__.py
├── models/
│   ├── schema.py           # TEI element/attribute data structures
│   └── spans.py            # Span manifest data structures
├── detection/
│   └── gliner_detector.py  # Optional local GLiNER first-pass span detection
├── chunking/
│   └── chunker.py          # Overlap-aware text chunker, XML-safe boundaries
├── prompting/
│   ├── builder.py          # Prompt assembly
│   └── templates/
│       ├── text_gen.jinja2       # For plain text-generation endpoints
│       └── json_enforced.jinja2  # For JSON-mode / constrained endpoints
├── inference/
│   └── endpoint.py         # Endpoint wrapper + capability enum
├── postprocessing/
│   ├── resolver.py         # Context-anchor → char offset resolution
│   ├── validator.py        # Span verification + schema validation
│   └── injector.py         # Deterministic XML construction
└── pipeline.py             # Top-level orchestration
```
---
## Data Structures (`models/`)
```python
# schema.py
from dataclasses import dataclass

@dataclass
class TEIAttribute:
    name: str                                # e.g. "ref", "type", "cert"
    description: str
    required: bool = False
    allowed_values: list[str] | None = None  # None = free string

@dataclass
class TEIElement:
    tag: str                     # e.g. "persName"
    description: str             # from TEI Guidelines
    allowed_children: list[str]  # tags of legal child elements
    attributes: list[TEIAttribute]

@dataclass
class TEISchema:
    elements: list[TEIElement]

    # convenience lookup
    def get(self, tag: str) -> TEIElement | None: ...


# spans.py
@dataclass
class SpanDescriptor:
    element: str
    text: str
    context: str          # must contain text as a substring
    attrs: dict[str, str]
    # Always flat: nesting is inferred from offset containment in the
    # resolver/injector, not emitted by the model (models produce
    # unreliable nested trees).
    confidence: float | None = None  # passed through from GLiNER

@dataclass
class ResolvedSpan:
    element: str
    start: int
    end: int
    attrs: dict[str, str]
    children: list["ResolvedSpan"]
    fuzzy_match: bool = False  # flagged for human review
```
---
## GLiNER Dependency (`detection/gliner_detector.py`)
`gliner` is an optional dependency, declared as the `[gliner]` extra in `pyproject.toml`. No manual setup step is needed beyond installing the extra.
Model weights are fetched from the HuggingFace Hub automatically on first use of `GLiNER.from_pretrained(model_id)` and cached in `~/.cache/huggingface/`. If the import fails at runtime (i.e. the optional extra was not installed), the module raises a standard `ImportError` with a clear message; no wrapper is needed.
Recommended models (specified via the `gliner_model` parameter):
- `urchade/gliner_medium-v2.1`: balanced, Apache 2.0
- `numind/NuNER_Zero`: stronger multi-word entities, MIT (default)
- `knowledgator/gliner-multitask-large-v0.5`: adds relation extraction

All models run on CPU; no GPU required.
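As a sketch of how the detection pass might adapt GLiNER output: the helper names and dict shapes below are illustrative, not the library's API; only `predict_entities()` and its result fields (`text`, `label`, `start`, `end`, `score`) come from GLiNER itself.

```python
def schema_to_labels(elements: list[tuple[str, str]]) -> dict[str, str]:
    """Map human-readable GLiNER labels back to TEI tags.

    elements: [(tag, description), ...], e.g. [("persName", "a person's name")]
    """
    return {description: tag for tag, description in elements}

def predictions_to_descriptors(text: str, predictions: list[dict],
                               label_map: dict[str, str],
                               context_chars: int = 30) -> list[dict]:
    """Turn raw GLiNER predictions into flat span-descriptor dicts."""
    out = []
    for p in predictions:  # GLiNER emits {"text", "label", "start", "end", "score"}
        start, end = p["start"], p["end"]
        # surround the span with a little context for later anchor resolution
        context = text[max(0, start - context_chars):end + context_chars]
        out.append({
            "element": label_map[p["label"]],
            "text": p["text"],
            "context": context,
            "attrs": {},
            "confidence": p["score"],
        })
    return out

# Example with a hand-written prediction (the kind of dict GLiNER might emit):
text = "He met John Smith in Berlin."
preds = [{"text": "John Smith", "label": "a person's name",
          "start": 7, "end": 17, "score": 0.93}]
labels = schema_to_labels([("persName", "a person's name")])
spans = predictions_to_descriptors(text, preds, labels)
```

The real detector would call `model.predict_entities(chunk, list_of_labels)` per chunk and feed the converted descriptors into the prompt-assembly step.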
---
## Endpoint Abstraction (`inference/endpoint.py`)
```python
from collections.abc import Callable
from dataclasses import dataclass
from enum import Enum

class EndpointCapability(Enum):
    TEXT_GENERATION = "text_generation"  # plain LLM, JSON via prompt only
    JSON_ENFORCED = "json_enforced"      # constrained decoding guaranteed
    EXTRACTION = "extraction"            # GLiNER2/NuExtract-style native

@dataclass
class EndpointConfig:
    capability: EndpointCapability
    call_fn: Callable[[str], str]
    # call_fn takes a prompt string and returns a response string;
    # the caller is responsible for auth, model selection, and retries.
```
The `call_fn` injection means the library is agnostic about whether the caller is hitting Anthropic, OpenAI, a local Ollama instance, or Fastino's GLiNER2 API. The library just hands it a string and gets a string back.
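Because the contract is just `(str) -> str`, cross-cutting concerns compose as plain function wrappers. A minimal sketch (the `with_logging` helper is illustrative, not part of the library):

```python
from collections.abc import Callable

def with_logging(call_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any call_fn to record (prompt, response) pairs without changing its type."""
    log: list[tuple[str, str]] = []

    def wrapped(prompt: str) -> str:
        response = call_fn(prompt)
        log.append((prompt, response))
        return response

    wrapped.log = log  # attach the log for later inspection
    return wrapped

# A mock endpoint is just a function:
mock = with_logging(lambda prompt: '{"spans": []}')
result = mock("annotate this")
```

The same pattern covers retries, rate limiting, or caching, all without the library knowing which provider sits behind the function.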
---
## Pipeline (`pipeline.py`)
```python
@dataclass
class AnnotationResult:
    xml: str                         # annotated XML string
    fuzzy_spans: list[ResolvedSpan]  # spans flagged for human review

def annotate(
    text: str,                 # may contain existing XML tags
    schema: TEISchema,         # subset of TEI elements in scope
    endpoint: EndpointConfig,  # injected inference dependency
    gliner_model: str | None = "numind/NuNER_Zero",  # None disables GLiNER pass
    chunk_size: int = 1500,    # chars
    chunk_overlap: int = 200,
) -> AnnotationResult: ...
```
### Execution Flow
```
1. SETUP
   strip existing XML tags from the text for processing,
   preserving them as a restoration map for the final merge

2. GLINER PASS (skipped if gliner_model=None, endpoint is EXTRACTION, or text is short)
   map TEISchema elements -> flat label list for GLiNER
     e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
   chunk text if len(text) > chunk_size (with overlap)
   run gliner.predict_entities() on each chunk
   merge cross-chunk duplicates by span text + context overlap
   output: list[SpanDescriptor] with text + context + element + confidence
   (GLiNER is a pre-filter only; the LLM may reject, correct, or extend its candidates)

3. PROMPT ASSEMBLY
   select template based on EndpointCapability:
     TEXT_GENERATION: include JSON structure example + "output only JSON" instruction
     JSON_ENFORCED:   minimal prompt, schema enforced externally
     EXTRACTION:      pass schema directly in the endpoint's native format, skip LLM prompt
   inject into the prompt:
     - TEIElement descriptions + allowed attributes for in-scope elements
     - GLiNER pre-detected spans as candidates for the model to enrich/correct
     - source text chunk
     - instruction to emit one SpanDescriptor per occurrence, not per unique entity

4. INFERENCE
   call endpoint.call_fn(prompt) -> raw response string

5. POSTPROCESSING (per chunk, then merged)
   a. Parse
      JSON_ENFORCED/EXTRACTION: parse directly
      TEXT_GENERATION: strip markdown fences, parse JSON;
        on failure, retry once with a correction prompt that includes
        the original (bad) response and the parse error message,
        so the model can self-correct rather than starting from scratch
   b. Resolve (resolver.py)
      for each SpanDescriptor:
        find the context string in the source (exact match preferred)
        find the span text within the context window
        assert source[start:end] == span.text -> reject on mismatch
        fuzzy fallback (threshold 0.92) -> flag for review
   c. Validate (validator.py)
      reject spans whose text is not in the source
      check attributes against TEISchema allowed values
      check the element is in schema scope
   d. Inject (injector.py)
      infer nesting from offset containment (child -> parent by [start, end] bounds)
      check inferred nesting: children must lie within parent bounds
      sort ResolvedSpans by start offset, handle nesting depth-first
      insert tags into a copy of the original source string
      restore previously existing XML tags from step 1
   e. Final validation
      parse the output as XML -> reject malformed documents
      optionally validate against the full TEI RelaxNG schema via lxml

6. RETURN
   AnnotationResult(
     xml=annotated_xml_string,
     fuzzy_spans=list_of_flagged_resolved_spans,
   )
```
---
## Key Design Constraints
- The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
- The **GLiNER pass is optional** (`gliner_model=None` disables it). It is most useful for long texts with `TEXT_GENERATION` endpoints; it is skipped automatically for `EXTRACTION` endpoints or short inputs. When enabled, GLiNER is a pre-filter only; the LLM may reject, correct, or extend its candidates.
- **Span nesting is inferred from offsets**, never emitted by the model. `SpanDescriptor` is always flat; `ResolvedSpan.children` is populated by the injector from containment relationships.
- `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
- Fuzzy-matched spans are **surfaced, not silently accepted**: `AnnotationResult.fuzzy_spans` provides a reviewable list alongside the XML.
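The offset-containment inference from the constraints above can be sketched as follows (names and dict shapes are illustrative; the real injector works on `ResolvedSpan` objects):

```python
def build_tree(spans: list[dict]) -> list[dict]:
    """spans: [{"element", "start", "end"}, ...] -> forest with 'children' lists."""
    # Sort by (start asc, length desc) so a parent always precedes its children.
    ordered = sorted(spans, key=lambda s: (s["start"], -(s["end"] - s["start"])))
    roots, stack = [], []
    for span in ordered:
        node = {**span, "children": []}
        while stack and stack[-1]["end"] <= node["start"]:
            stack.pop()  # enclosing span closed before this one starts
        if stack:
            parent = stack[-1]
            if node["end"] > parent["end"]:
                continue  # partial overlap: drop (the real injector warns)
            parent["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots

tree = build_tree([
    {"element": "persName", "start": 8, "end": 30},
    {"element": "forename", "start": 8, "end": 12},
    {"element": "surname", "start": 13, "end": 30},
])
```

The stack keeps the chain of currently open spans, so each new span attaches to its nearest enclosing one in a single pass.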
---
## Testing Strategy
### Mocking philosophy
**Always mock `call_fn` and the GLiNER detector in unit tests.** Do not use a real GLiNER model as a substitute for a remote LLM endpoint: GLiNER is a span-labelling model that cannot produce JSON responses, so it cannot exercise the parse/resolve/inject pipeline. Using a real model also makes tests slow (~seconds per inference on CPU), non-deterministic across versions and hardware, and dependent on a 400 MB+ download.
The `call_fn: (str) -> str` design makes mocking trivial: a lambda returning a hardcoded JSON string is sufficient. No mock framework is needed.
### Test layers
**Layer 1: Unit tests** (always run, <1s total, fully mocked):
```
tests/
├── test_chunker.py    # chunker unit tests
├── test_resolver.py   # resolver unit tests
├── test_validator.py  # validator unit tests
├── test_injector.py   # injector unit tests
├── test_builder.py    # prompt builder unit tests
├── test_parser.py     # JSON parse + retry unit tests
└── test_pipeline.py   # full pipeline smoke test (mocked call_fn + GLiNER)
```
**Layer 2: Integration tests** (opt-in, gated by `pytest -m integration`):
```
tests/integration/
├── test_gliner_detector.py  # real GLiNER model, real HuggingFace download
└── test_pipeline_e2e.py     # full annotate() with real GLiNER + mocked call_fn
```
Integration tests are excluded from CI by default via `pyproject.toml`:
```toml
[tool.pytest.ini_options]
addopts = "-m 'not integration'"
```
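To keep pytest from warning about an unknown mark, the `integration` marker could also be registered in the same section (the description string here is illustrative):

```toml
[tool.pytest.ini_options]
addopts = "-m 'not integration'"
markers = [
    "integration: requires network access or model downloads",
]
```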
### TDD cycle (red → green → refactor)
Each module is written test-first. Write a failing test, implement the minimum code to pass, refactor.
### Key test cases per module
**`chunker.py`**
- Short text below `chunk_size` → single chunk, offset 0
- Long text → multiple chunks with the correct `start_offset` per chunk
- Span exactly at a chunk boundary → appears in both chunks with the correct global offset
- Input with existing XML tags → chunk boundaries never split a tag
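The overlap-and-offset behaviour these cases exercise can be sketched as follows (an illustrative helper; the real chunker additionally keeps boundaries XML-safe, which this sketch omits):

```python
def chunk_text(text: str, chunk_size: int = 1500,
               overlap: int = 200) -> list[tuple[int, str]]:
    """Return (start_offset, chunk) pairs; consecutive chunks share `overlap` chars."""
    if len(text) <= chunk_size:
        return [(0, text)]
    chunks, start = [], 0
    step = chunk_size - overlap  # advance less than a full chunk each time
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += step
    return chunks

chunks = chunk_text("a" * 3000, chunk_size=1500, overlap=200)
```

The per-chunk `start_offset` is what lets resolved spans be translated back into global character offsets before merging.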
**`resolver.py`**
- Exact context match → `ResolvedSpan` with `fuzzy_match=False`
- Context not found in source → span rejected
- `source[start:end] != span.text` → span rejected
- Context found but span text not within it → span rejected
- Context found, span text found, score < 0.92 → span rejected
- Context found, span text found, 0.92 ≤ score < 1.0 → `fuzzy_match=True`
- Multiple occurrences of the context → first match used, or rejection if ambiguous
**`validator.py`**
- Element not in schema → span rejected
- Attribute not in schema → span rejected
- Attribute value not in `allowed_values` → span rejected
- Valid span → passes through unchanged
**`injector.py`**
- Two non-overlapping spans → both tags inserted correctly
- Span B offset-contained in span A → B is a child of A in the output
- Overlapping (non-nesting) spans → reject or flatten with a warning
- Restored XML tags from step 1 do not conflict with injected tags
**`builder.py`**
- `TEXT_GENERATION` capability → prompt contains a JSON example and the "output only JSON" instruction
- `JSON_ENFORCED` capability → prompt is minimal (no JSON scaffolding)
- GLiNER candidates present → candidates appear in the prompt
- GLiNER candidates absent (pass skipped) → prompt has no candidate section
**`parser.py` (parse + retry logic)**
- Valid JSON response → parsed to `list[SpanDescriptor]` without retry
- Markdown-fenced JSON → fences stripped, parsed correctly
- Invalid JSON on the first attempt → retry triggered with a correction prompt that includes the original bad response + parse error message
- Invalid JSON on the second attempt → exception raised, chunk skipped
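The fence-stripping and single-retry behaviour these cases target might look roughly like this (a sketch: the function names, correction-prompt wording, and retry plumbing are all illustrative):

```python
import json
import re

def strip_fences(raw: str) -> str:
    """Extract the body of a ```json ... ``` fence, or return raw unchanged."""
    m = re.search(r"```(?:json)?\s*(.*?)```", raw, flags=re.DOTALL)
    return m.group(1) if m else raw

def parse_response(raw: str, call_fn=None, max_retries: int = 1) -> list:
    try:
        return json.loads(strip_fences(raw))
    except json.JSONDecodeError as err:
        if call_fn is None or max_retries < 1:
            raise  # second failure: propagate, chunk is skipped upstream
        correction = (f"Your previous response was not valid JSON ({err}). "
                      f"Previous response:\n{raw}\n"
                      f"Return only the corrected JSON.")
        # one self-correction attempt, then give up
        return parse_response(call_fn(correction), call_fn=None)

spans = parse_response('```json\n[{"element": "persName"}]\n```')
```

Passing the bad response and the parse error back to the model lets it repair its own output instead of regenerating from scratch.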
**`pipeline.py` (smoke test)**
```python
import json

def mock_call_fn(prompt: str) -> str:
    return json.dumps([
        {"element": "persName", "text": "John Smith",
         "context": "...said John Smith yesterday...", "attrs": {}}
    ])

def test_annotate_smoke():
    schema = TEISchema(elements=[
        TEIElement(tag="persName", description="a person's name",
                   allowed_children=[], attributes=[])
    ])
    endpoint = EndpointConfig(
        capability=EndpointCapability.JSON_ENFORCED,
        call_fn=mock_call_fn,
    )
    result = annotate(
        text="He said John Smith yesterday.",
        schema=schema,
        endpoint=endpoint,
        gliner_model=None,  # disable GLiNER in unit tests
    )
    assert "persName" in result.xml
    assert "John Smith" in result.xml
    assert result.xml.count("John Smith") == 1  # text not duplicated
```
---
## Implementation Status
**Completed 2026-02-28**: full implementation per the plan above.
### What was built
All modules in the package structure were implemented:
| File | Notes |
| --- | --- |
| `tei_annotator/models/schema.py` | `TEIAttribute`, `TEIElement`, `TEISchema` dataclasses |
| `tei_annotator/models/spans.py` | `SpanDescriptor`, `ResolvedSpan` dataclasses |
| `tei_annotator/inference/endpoint.py` | `EndpointCapability` enum, `EndpointConfig` dataclass |
| `tei_annotator/chunking/chunker.py` | `chunk_text()`: overlap chunker, XML-safe boundaries |
| `tei_annotator/detection/gliner_detector.py` | `detect_spans()`: optional, raises `ImportError` if the `[gliner]` extra is not installed |
| `tei_annotator/prompting/builder.py` | `build_prompt()` + `make_correction_prompt()` |
| `tei_annotator/prompting/templates/text_gen.jinja2` | Verbose prompt with JSON example, "output only JSON" instruction |
| `tei_annotator/prompting/templates/json_enforced.jinja2` | Minimal prompt for constrained-decoding endpoints |
| `tei_annotator/postprocessing/parser.py` | `parse_response()`: fence stripping, one-shot self-correction retry |
| `tei_annotator/postprocessing/resolver.py` | `resolve_spans()`: context-anchor → char-offset resolution, rapidfuzz fuzzy fallback at threshold 0.92 |
| `tei_annotator/postprocessing/validator.py` | `validate_spans()`: element, attribute-name, and allowed-value checks |
| `tei_annotator/postprocessing/injector.py` | `inject_xml()`: stack-based nesting tree, recursive tag insertion |
| `tei_annotator/pipeline.py` | `annotate()`: full orchestration, tag strip/restore, deduplication across chunks, lxml final validation |
### Dependencies added
Runtime: `jinja2`, `lxml`, `rapidfuzz`. Optional extra `[gliner]` for GLiNER support. Dev: `pytest`, `pytest-cov`.
### Tests
- **63 unit tests** (Layer 1): fully mocked, run in < 0.1 s via `uv run pytest`
- **9 integration tests** (Layer 2, no GLiNER): complex resolver/injector/pipeline scenarios, run via `uv run pytest --override-ini="addopts=" -m integration tests/integration/test_pipeline_e2e.py -k "not real_gliner"`
- **1 GLiNER integration test**: requires the `[gliner]` extra and a HuggingFace model download
### Smoke script
`scripts/smoke_test_llm.py`: end-to-end test with real LLM calls (no GLiNER). Verified against:
- **Google Gemini 2.0 Flash** (`GEMINI_API_KEY` from `.env`)
- **KISSKI `llama-3.3-70b-instruct`** (`KISSKI_API_KEY` from `.env`, OpenAI-compatible API at `https://chat-ai.academiccloud.de/v1`)
Run with `uv run scripts/smoke_test_llm.py`.
### Key implementation notes
- The `_strip_existing_tags` / `_restore_existing_tags` pair in `pipeline.py` preserves original markup by tracking plain-text offsets of each stripped tag and re-inserting them after annotation.
- `_build_nesting_tree` in `injector.py` uses a sort-by-(start-asc, length-desc) + stack algorithm; partial overlaps are dropped with a `warnings.warn`.
- The resolver does an exact `str.find` first; fuzzy search (sliding-window rapidfuzz) is only attempted if exact fails and rapidfuzz is installed.
- `parse_response` passes `call_fn` and `make_correction_prompt` only for `TEXT_GENERATION` endpoints; `JSON_ENFORCED` and `EXTRACTION` never retry.
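The strip/restore idea from the first note can be sketched as follows (illustrative helpers: record each tag's offset in the *plain* text, then re-insert from the end so earlier offsets stay valid; the real pipeline additionally shifts saved offsets past newly injected tags):

```python
import re

def strip_tags(text: str) -> tuple[str, list[tuple[int, str]]]:
    """Remove XML tags, recording (plain-text offset, tag) for each one."""
    plain_parts, saved = [], []
    plain_len, pos = 0, 0
    for m in re.finditer(r"<[^>]+>", text):
        plain_parts.append(text[pos:m.start()])
        plain_len += m.start() - pos
        saved.append((plain_len, m.group()))  # offset in the plain text
        pos = m.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), saved

def restore_tags(plain: str, saved: list[tuple[int, str]]) -> str:
    """Re-insert tags at their recorded offsets, last-to-first."""
    for offset, tag in reversed(saved):
        plain = plain[:offset] + tag + plain[offset:]
    return plain

plain, saved = strip_tags("He met <hi>John</hi> today.")
```

Round-tripping through these two functions reproduces the original markup exactly, which is what guarantees the model never sees or rewrites pre-existing tags.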
---
## RNG Schema Parsing (`tei_annotator/tei.py`)
**Added 2026-02-28**: `create_schema()` parses a RELAX NG schema file and produces a `TEISchema` for use with `annotate()`.
### Function signature
```python
from pathlib import Path

def create_schema(
    schema_path: str | Path,
    element: str = "text",
    depth: int = 1,
) -> TEISchema: ...
```
- `schema_path`: path to a `.rng` file (e.g. `schema/tei-bib.rng`).
- `element`: name of the root TEI element to start from (default: `"text"`).
- `depth`: how many levels of descendant elements to include. `depth=1` adds the root element **and** its direct children; `depth=2` adds grandchildren too; `depth=0` adds only the root itself.
- Raises `ValueError` if the element is not found in the schema.
### How it works
1. Parses the `.rng` with lxml, builds a `{define_name: element_node}` lookup table.
2. Builds a reverse map from TEI element names (e.g. `"persName"`) to their RNG define names (e.g. `"tbibpersName"`).
3. BFS from the requested element, collecting `TEIElement` entries level by level up to `depth`.
4. For each element:
- **description**: extracted from the first `<a:documentation>` child inside the RNG `<element>` node.
- **`allowed_children`**: content-model `<ref>` nodes are expanded recursively through macro/model groups; attribute-group refs (names containing `"att."`) are skipped; element-bearing defines are short-circuited (just record the element name, don't recurse), which correctly handles self-referential elements like `idno`.
- **`attributes`**: attribute-group refs are followed recursively; inline `<attribute>` elements are also collected; `required` is inferred from the immediate parent (`<optional>` / `<zeroOrMore>` → False, otherwise True); `allowed_values` comes from `<choice><value>` enumerations inside the attribute definition.
5. Deduplicates children and attributes (preserving order) before constructing each `TEIElement`.
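Steps 1-2 can be illustrated on a tiny inline fragment. This sketch uses stdlib `ElementTree` for self-containment (the real module uses lxml), and the define names below are made up:

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
rng = f"""
<grammar xmlns="{RNG_NS}">
  <define name="tbib.persName">
    <element name="persName"><text/></element>
  </define>
  <define name="tbib.placeName">
    <element name="placeName"><text/></element>
  </define>
</grammar>"""

root = ET.fromstring(rng)
defines = {}            # define name -> its <element> node (step 1)
element_to_define = {}  # TEI tag -> define name (the reverse map, step 2)
for d in root.findall(f"{{{RNG_NS}}}define"):
    el = d.find(f"{{{RNG_NS}}}element")
    if el is not None:  # only element-bearing defines enter the tables
        defines[d.get("name")] = el
        element_to_define[el.get("name")] = d.get("name")
```

With these two tables in place, the BFS of step 3 only ever has to resolve `<ref name="...">` hops through `defines`.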
### Tests (`tests/test_tei.py`)
15 unit tests, running in < 0.05 s, no network, no model downloads:
- `idno` (leaf-like, self-referential, enumerated `type` values: `ISBN`, `ISSN`, `DOI`, …)
- `biblStruct` (explicit named children: `analytic`, `monogr`, `series`, `relatedItem`, `citedRange`; plus model-group expansion: `note`, `ptr`)
- `depth=0` vs `depth=1` behaviour
- Duplicate element / attribute detection
- Unknown element raises `ValueError`
Total unit tests after this addition: **78** (63 original + 15 new).
---
## Evaluation Module (`tei_annotator/evaluation/`)
**Added 2026-02-28**: compares annotator output against a gold-standard TEI XML file to compute precision, recall, and F1 score.
### Package structure
```
tei_annotator/evaluation/
├── __init__.py   # public API exports
├── extractor.py  # EvaluationSpan, extract_spans(), spans_from_xml_string()
├── metrics.py    # MatchMode, SpanMatch, ElementMetrics, EvaluationResult,
│                 # match_spans(), compute_metrics(), aggregate()
└── evaluator.py  # evaluate_bibl(), evaluate_file()
```
### Algorithm
For each gold-standard element (e.g. `<bibl>`):
1. **Extract gold spans**: walk the element tree, recording `(tag, start, end, text)` for every descendant element, using absolute char offsets into the element's plain text (`"".join(element.itertext())`).
2. **Strip tags**: the same plain text is passed to `annotate()`.
3. **Annotate**: run the full pipeline with the injected `EndpointConfig`.
4. **Extract predicted spans**: parse `result.xml` as `<_root>…</_root>`, then run the same span extractor.
5. **Match**: greedily pair (gold, pred) spans by score (highest first); each span is matched at most once.
6. **Compute metrics**: count TP/FP/FN per element type; derive precision, recall, and F1.
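Steps 5-6 can be sketched as follows (a simplified binary TEXT-mode match rather than score-ranked greedy pairing; function names and tuple shapes are illustrative):

```python
def match_spans(gold: list[tuple], pred: list[tuple]) -> list[tuple]:
    """Greedy one-to-one matching; spans are (tag, start, end, text) tuples."""
    pairs = []
    for gi, g in enumerate(gold):
        for pi, p in enumerate(pred):
            if g[0] == p[0] and g[3].split() == p[3].split():  # tag + normalised text
                pairs.append((gi, pi))
    matched, used_g, used_p = [], set(), set()
    for gi, pi in pairs:  # each span matched at most once
        if gi not in used_g and pi not in used_p:
            matched.append((gi, pi))
            used_g.add(gi)
            used_p.add(pi)
    return matched

def micro_prf(n_gold: int, n_pred: int, n_matched: int) -> tuple[float, float, float]:
    """Precision, recall, F1 from match counts (matched = TP)."""
    precision = n_matched / n_pred if n_pred else 0.0
    recall = n_matched / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("persName", 8, 18, "John Smith"), ("placeName", 22, 28, "Berlin")]
pred = [("persName", 8, 18, "John Smith"), ("persName", 0, 2, "He")]
m = match_spans(gold, pred)
p, r, f1 = micro_prf(len(gold), len(pred), len(m))
```

Here one gold span is matched, one gold span is missed (FN), and one prediction is spurious (FP), giving precision = recall = 0.5.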
### Match modes (`MatchMode`)
| Mode | A match if… |
| --- | --- |
| `TEXT` (default) | same element tag + normalised text content |
| `EXACT` | same element tag + identical `(start, end)` offsets |
| `OVERLAP` | same element tag + IoU ≥ `overlap_threshold` (default 0.5) |
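The IoU used by `OVERLAP` mode is standard interval intersection-over-union; a minimal sketch:

```python
def span_iou(a: tuple[int, int], b: tuple[int, int]) -> float:
    """Intersection-over-union of two (start, end) character intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0
```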
### Key entry points
```python
# Single-element evaluation (lxml element)
result = evaluate_bibl(gold_element, schema, endpoint, gliner_model=None)
print(result.report())

# Full-file evaluation
per_record, overall = evaluate_file(
    "tests/fixtures/blbl-examples.tei.xml",
    schema=schema,
    endpoint=endpoint,
    max_items=10,  # optional: first N bibls only
)
print(overall.report())
```
`EvaluationResult` exposes:
- `micro_precision / micro_recall / micro_f1`: aggregate counts, then compute rates
- `macro_precision / macro_recall / macro_f1`: average of per-element rates
- `per_element: dict[str, ElementMetrics]`: per-element breakdown
- `matched / unmatched_gold / unmatched_pred`: full span lists for inspection
- `report()`: human-readable summary string
### Evaluation Tests
39 new unit tests in `tests/test_evaluation.py`: all mocked, run in < 0.1 s.
Total unit tests: **117** (78 + 39).