cmboulanger committed
Commit 37eaffd · 1 Parent(s): 790b4e5

full implementation
README.md CHANGED
@@ -0,0 +1,135 @@
+ # tei-annotator
+
+ A Python library for annotating text with [TEI XML](https://tei-c.org/) tags using a two-stage LLM pipeline.
+
+ The pipeline:
+
+ 1. **(Optional) GLiNER pre-detection** — fast CPU-based span labelling generates candidates for the LLM to verify and extend.
+ 2. **LLM annotation** — a prompted language model identifies entities and returns structured spans (element + verbatim text + surrounding context + attributes).
+ 3. **Deterministic post-processing** — spans are resolved to character offsets, validated against the schema, and injected as XML tags. The source text is **never modified** by any model call.
+
+ Works with any inference endpoint through an injected `call_fn: (str) -> str` — Anthropic, OpenAI, Gemini, a local Ollama instance, or a constrained-decoding API.
+
+ ---
+
+ ## Installation
+
+ Requires Python ≥ 3.12 and [uv](https://docs.astral.sh/uv/).
+
+ ```bash
+ git clone <repo>
+ cd tei-annotator
+ uv sync                 # installs runtime deps: jinja2, lxml, rapidfuzz
+ uv sync --extra gliner  # also installs gliner for the optional pre-detection pass
+ ```
+
+ API keys for real LLM endpoints go in `.env` in the project root (the smoke-test script reads `GEMINI_API_KEY` and `KISSKI_API_KEY` from it).
+
+ ---
+
+ ## Quick example
+
+ ```python
+ from tei_annotator import (
+     annotate,
+     TEISchema, TEIElement, TEIAttribute,
+     EndpointConfig, EndpointCapability,
+ )
+
+ # 1. Describe the elements you want to annotate
+ schema = TEISchema(elements=[
+     TEIElement(
+         tag="persName",
+         description="a person's name",
+         attributes=[TEIAttribute(name="ref", description="authority URI")],
+     ),
+     TEIElement(
+         tag="placeName",
+         description="a geographical place name",
+         attributes=[],
+     ),
+ ])
+
+ # 2. Wrap your inference endpoint
+ def my_call_fn(prompt: str) -> str:
+     # replace with any LLM call — Anthropic, OpenAI, Gemini, Ollama, …
+     ...
+
+ endpoint = EndpointConfig(
+     capability=EndpointCapability.TEXT_GENERATION,
+     call_fn=my_call_fn,
+ )
+
+ # 3. Annotate
+ result = annotate(
+     text="Marie Curie was born in Warsaw and later worked in Paris.",
+     schema=schema,
+     endpoint=endpoint,
+     gliner_model=None,  # set to e.g. "numind/NuNER_Zero" to enable pre-detection
+ )
+
+ print(result.xml)
+ # <persName>Marie Curie</persName> was born in <placeName>Warsaw</placeName>
+ # and later worked in <placeName>Paris</placeName>.
+
+ if result.fuzzy_spans:
+     print("Review these spans — context was matched approximately:")
+     for span in result.fuzzy_spans:
+         print(f"  <{span.element}>{span.text}</{span.element}>")
+ ```
+
+ The input text may already contain XML markup; existing tags are stripped before the LLM sees the text and restored in the final output.
+
+ ### Real-endpoint smoke test
+
+ `scripts/smoke_test_llm.py` runs the full pipeline against **Gemini 2.0 Flash** and **KISSKI `llama-3.3-70b-instruct`** using API keys from `.env`:
+
+ ```bash
+ uv run scripts/smoke_test_llm.py
+ ```
+
+ ---
+
+ ## `annotate()` parameters
+
+ | Parameter | Default | Description |
+ | --- | --- | --- |
+ | `text` | — | Input text; may contain existing XML tags |
+ | `schema` | — | `TEISchema` describing elements and attributes in scope |
+ | `endpoint` | — | `EndpointConfig` wrapping any `call_fn: (str) -> str` |
+ | `gliner_model` | `"numind/NuNER_Zero"` | HuggingFace model for optional pre-detection; `None` to disable |
+ | `chunk_size` | `1500` | Maximum characters per LLM prompt chunk |
+ | `chunk_overlap` | `200` | Character overlap between consecutive chunks |
+
+ ### `EndpointCapability` values
+
+ | Value | When to use |
+ | --- | --- |
+ | `TEXT_GENERATION` | Plain LLM — JSON requested via prompt, with one automatic retry on parse failure |
+ | `JSON_ENFORCED` | Constrained-decoding endpoint that guarantees valid JSON output |
+ | `EXTRACTION` | Native extraction model (GLiNER2 / NuExtract-style); raw text is passed directly |
+
+ ---
+
+ ## Testing
+
+ ```bash
+ # Unit tests (fully mocked, < 0.1 s)
+ uv run pytest
+
+ # Integration tests — complex pipeline scenarios, no model download needed
+ uv run pytest --override-ini="addopts=" -m integration \
+     tests/integration/test_pipeline_e2e.py -k "not real_gliner"
+
+ # Integration tests — real GLiNER model (downloads ~400 MB on first run)
+ uv run pytest --override-ini="addopts=" -m integration \
+     tests/integration/test_gliner_detector.py \
+     tests/integration/test_pipeline_e2e.py::test_pipeline_with_real_gliner
+ ```
+
+ Integration tests are excluded from the default `pytest` run via `pyproject.toml`:
+
+ ```toml
+ [tool.pytest.ini_options]
+ addopts = "-m 'not integration'"
+ ```
implementation-plan.md CHANGED
@@ -340,3 +340,55 @@ def test_annotate_smoke():
  assert "John Smith" in result.xml
  assert result.xml.count("John Smith") == 1  # text not duplicated
  ```
+
344
+ ---
345
+
346
+ ## Implementation Status
347
+
348
+ **Completed 2026-02-28** — full implementation per the plan above.
349
+
350
+ ### What was built
351
+
352
+ All modules in the package structure were implemented:
353
+
354
+ | File | Notes |
355
+ | --- | --- |
356
+ | `tei_annotator/models/schema.py` | `TEIAttribute`, `TEIElement`, `TEISchema` dataclasses |
357
+ | `tei_annotator/models/spans.py` | `SpanDescriptor`, `ResolvedSpan` dataclasses |
358
+ | `tei_annotator/inference/endpoint.py` | `EndpointCapability` enum, `EndpointConfig` dataclass |
359
+ | `tei_annotator/chunking/chunker.py` | `chunk_text()` — overlap chunker, XML-safe boundaries |
360
+ | `tei_annotator/detection/gliner_detector.py` | `detect_spans()` — optional, raises `ImportError` if `[gliner]` extra not installed |
361
+ | `tei_annotator/prompting/builder.py` | `build_prompt()` + `make_correction_prompt()` |
362
+ | `tei_annotator/prompting/templates/text_gen.jinja2` | Verbose prompt with JSON example, "output only JSON" instruction |
363
+ | `tei_annotator/prompting/templates/json_enforced.jinja2` | Minimal prompt for constrained-decoding endpoints |
364
+ | `tei_annotator/postprocessing/parser.py` | `parse_response()` — fence stripping, one-shot self-correction retry |
365
+ | `tei_annotator/postprocessing/resolver.py` | `resolve_spans()` — context-anchor → char offset, rapidfuzz fuzzy fallback at threshold 0.92 |
366
+ | `tei_annotator/postprocessing/validator.py` | `validate_spans()` — element, attribute name, allowed-value checks |
367
+ | `tei_annotator/postprocessing/injector.py` | `inject_xml()` — stack-based nesting tree, recursive tag insertion |
368
+ | `tei_annotator/pipeline.py` | `annotate()` — full orchestration, tag strip/restore, deduplication across chunks, lxml final validation |
369
+
370
+ ### Dependencies added
371
+
372
+ Runtime: `jinja2`, `lxml`, `rapidfuzz`. Optional extra `[gliner]` for GLiNER support. Dev: `pytest`, `pytest-cov`.
373
+
374
+ ### Tests
375
+
376
+ - **63 unit tests** (Layer 1) — fully mocked, run in < 0.1 s via `uv run pytest`
377
+ - **9 integration tests** (Layer 2, no GLiNER) — complex resolver/injector/pipeline scenarios, run via `uv run pytest --override-ini="addopts=" -m integration tests/integration/test_pipeline_e2e.py -k "not real_gliner"`
378
+ - **1 GLiNER integration test** — requires `[gliner]` extra and HuggingFace model download
379
+
380
+ ### Smoke script
381
+
382
+ `scripts/smoke_test_llm.py` — end-to-end test with real LLM calls (no GLiNER). Verified against:
383
+
384
+ - **Google Gemini 2.0 Flash** (`GEMINI_API_KEY` from `.env`)
385
+ - **KISSKI `llama-3.3-70b-instruct`** (`KISSKI_API_KEY` from `.env`, OpenAI-compatible API at `https://chat-ai.academiccloud.de/v1`)
386
+
387
+ Run with `uv run scripts/smoke_test_llm.py`.
388
+
389
+ ### Key implementation notes
390
+
391
+ - The `_strip_existing_tags` / `_restore_existing_tags` pair in `pipeline.py` preserves original markup by tracking plain-text offsets of each stripped tag and re-inserting them after annotation.
392
+ - `_build_nesting_tree` in `injector.py` uses a sort-by-(start-asc, length-desc) + stack algorithm; partial overlaps are dropped with a `warnings.warn`.
393
+ - The resolver does an exact `str.find` first; fuzzy search (sliding-window rapidfuzz) is only attempted if exact fails and rapidfuzz is installed.
394
+ - `parse_response` passes `call_fn` and `make_correction_prompt` only for `TEXT_GENERATION` endpoints; `JSON_ENFORCED` and `EXTRACTION` never retry.
pyproject.toml CHANGED
@@ -1,7 +1,30 @@
  [project]
  name = "tei-annotator"
  version = "0.1.0"
- description = "Add your description here"
+ description = "TEI XML annotation library using LLM pipelines"
  readme = "README.md"
- requires-python = "==3.12"
+ requires-python = ">=3.12"
- dependencies = []
+ dependencies = [
+     "jinja2>=3.1",
+     "lxml>=5.0",
+     "rapidfuzz>=3.0",
+ ]
+
+ [project.optional-dependencies]
+ gliner = ["gliner>=0.2"]
+
+ [tool.pytest.ini_options]
+ addopts = "-m 'not integration'"
+ markers = [
+     "integration: marks tests as integration tests (require GLiNER model download)",
+ ]
+
+ [dependency-groups]
+ dev = [
+     "pytest>=8.0",
+     "pytest-cov>=5.0",
+ ]
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
scripts/smoke_test_llm.py ADDED
@@ -0,0 +1,235 @@
+ #!/usr/bin/env python
+ """
+ End-to-end smoke test: tei-annotator pipeline with real LLM endpoints.
+
+ Providers tested:
+   • Google Gemini 2.0 Flash
+   • KISSKI (OpenAI-compatible API, llama-3.3-70b-instruct)
+
+ Reads API keys from .env in the project root.
+
+ Usage:
+     uv run scripts/smoke_test_llm.py
+     python scripts/smoke_test_llm.py   # if venv is already activated
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import sys
+ import textwrap
+ import urllib.error
+ import urllib.request
+ from pathlib import Path
+ from typing import Callable
+
+
+ # ---------------------------------------------------------------------------
+ # .env loader (stdlib-only, no python-dotenv needed)
+ # ---------------------------------------------------------------------------
+
+ def _load_env(path: str | Path = ".env") -> None:
+     try:
+         with open(path) as fh:
+             for line in fh:
+                 line = line.strip()
+                 if not line or line.startswith("#") or "=" not in line:
+                     continue
+                 key, _, value = line.partition("=")
+                 value = value.strip().strip('"').strip("'")
+                 os.environ.setdefault(key.strip(), value)
+     except FileNotFoundError:
+         pass
+
+
+ _load_env(Path(__file__).parent.parent / ".env")
+
+
+ # ---------------------------------------------------------------------------
+ # HTTP helper (stdlib urllib)
+ # ---------------------------------------------------------------------------
+
+ def _post_json(url: str, payload: dict, headers: dict) -> dict:
+     body = json.dumps(payload).encode()
+     req = urllib.request.Request(url, data=body, headers=headers, method="POST")
+     try:
+         with urllib.request.urlopen(req, timeout=60) as resp:
+             return json.loads(resp.read())
+     except urllib.error.HTTPError as exc:
+         detail = exc.read().decode(errors="replace")
+         raise RuntimeError(f"HTTP {exc.code} from {url}: {detail}") from exc
+
+
+ # ---------------------------------------------------------------------------
+ # call_fn factories
+ # ---------------------------------------------------------------------------
+
+ def make_gemini_call_fn(api_key: str, model: str = "gemini-2.0-flash") -> Callable[[str], str]:
+     """Return a call_fn that sends a prompt to Gemini and returns the text reply."""
+     url = (
+         f"https://generativelanguage.googleapis.com/v1beta/models"
+         f"/{model}:generateContent?key={api_key}"
+     )
+
+     def call_fn(prompt: str) -> str:
+         payload = {
+             "contents": [{"parts": [{"text": prompt}]}],
+             "generationConfig": {"temperature": 0.1},
+         }
+         result = _post_json(url, payload, {"Content-Type": "application/json"})
+         return result["candidates"][0]["content"]["parts"][0]["text"]
+
+     call_fn.__name__ = f"gemini/{model}"
+     return call_fn
+
+
+ def make_kisski_call_fn(
+     api_key: str,
+     base_url: str = "https://chat-ai.academiccloud.de/v1",
+     model: str = "llama-3.3-70b-instruct",
+ ) -> Callable[[str], str]:
+     """Return a call_fn that sends a prompt to a KISSKI-hosted OpenAI-compatible model."""
+     url = f"{base_url}/chat/completions"
+     headers = {
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {api_key}",
+     }
+
+     def call_fn(prompt: str) -> str:
+         payload = {
+             "model": model,
+             "messages": [{"role": "user", "content": prompt}],
+             "temperature": 0.1,
+         }
+         result = _post_json(url, payload, headers)
+         return result["choices"][0]["message"]["content"]
+
+     call_fn.__name__ = f"kisski/{model}"
+     return call_fn
+
+
+ # ---------------------------------------------------------------------------
+ # Test scenario
+ # ---------------------------------------------------------------------------
+
+ TEST_TEXT = (
+     "Marie Curie was born in Warsaw, Poland, and later conducted her research "
+     "in Paris, France. Together with her husband Pierre Curie, she discovered "
+     "polonium and radium."
+ )
+
+ # We just check that the pipeline runs and produces *some* annotation.
+ # Whether the LLM chose the right entities is not asserted here.
+ EXPECTED_TAGS = ["persName", "placeName"]
+
+
+ def _build_schema():
+     from tei_annotator.models.schema import TEIAttribute, TEIElement, TEISchema
+
+     return TEISchema(
+         elements=[
+             TEIElement(
+                 tag="persName",
+                 description="a person's name",
+                 attributes=[TEIAttribute(name="ref", description="authority URI")],
+             ),
+             TEIElement(
+                 tag="placeName",
+                 description="a geographical place name",
+                 attributes=[TEIAttribute(name="ref", description="authority URI")],
+             ),
+         ]
+     )
+
+
+ def run_smoke_test(provider_name: str, call_fn: Callable[[str], str]) -> bool:
+     """
+     Run the full annotate() pipeline with *call_fn* and print results.
+     Returns True on success, False on failure.
+     """
+     import re
+
+     from tei_annotator.inference.endpoint import EndpointCapability, EndpointConfig
+     from tei_annotator.pipeline import annotate
+
+     print(f"\n{'─' * 60}")
+     print(f"  Provider : {provider_name}")
+     print(f"  Input    : {TEST_TEXT[:80]}…")
+     print(f"{'─' * 60}")
+
+     try:
+         result = annotate(
+             text=TEST_TEXT,
+             schema=_build_schema(),
+             endpoint=EndpointConfig(
+                 capability=EndpointCapability.TEXT_GENERATION,
+                 call_fn=call_fn,
+             ),
+             gliner_model=None,  # skip GLiNER for speed
+         )
+     except Exception as exc:
+         print(f"  ✗ FAILED — exception during annotate(): {exc}")
+         return False
+
+     # Verify plain text is unmodified
+     plain = re.sub(r"<[^>]+>", "", result.xml)
+     if plain != TEST_TEXT:
+         print("  ✗ FAILED — plain text was modified by the pipeline")
+         print(f"    Expected : {TEST_TEXT!r}")
+         print(f"    Got      : {plain!r}")
+         return False
+
+     # Verify at least one annotation was injected (LLM must have found something)
+     has_any_tag = any(f"<{t}>" in result.xml for t in EXPECTED_TAGS)
+     if not has_any_tag:
+         print("  ✗ FAILED — no annotation tags found in output")
+         print(f"    Output XML: {result.xml}")
+         return False
+
+     # Pretty-print the result
+     tags_found = [t for t in EXPECTED_TAGS if f"<{t}>" in result.xml]
+     print("  ✓ PASSED")
+     print(f"    Tags found : {', '.join(tags_found)}")
+     if result.fuzzy_spans:
+         print(f"    Fuzzy spans: {[s.text for s in result.fuzzy_spans]}")
+     print("    Output XML :")
+     for line in textwrap.wrap(result.xml, width=72, subsequent_indent="      "):
+         print(f"      {line}")
+     return True
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+
+ def main() -> int:
+     gemini_key = os.environ.get("GEMINI_API_KEY", "")
+     kisski_key = os.environ.get("KISSKI_API_KEY", "")
+
+     if not gemini_key:
+         print("ERROR: GEMINI_API_KEY not set (check .env)", file=sys.stderr)
+         return 1
+     if not kisski_key:
+         print("ERROR: KISSKI_API_KEY not set (check .env)", file=sys.stderr)
+         return 1
+
+     providers: list[tuple[str, Callable[[str], str]]] = [
+         ("Gemini 2.0 Flash", make_gemini_call_fn(gemini_key)),
+         ("KISSKI / llama-3.3-70b-instruct", make_kisski_call_fn(kisski_key)),
+     ]
+
+     results: list[bool] = []
+     for name, fn in providers:
+         results.append(run_smoke_test(name, fn))
+
+     print(f"\n{'═' * 60}")
+     passed = sum(results)
+     total = len(results)
+     print(f"  Result: {passed}/{total} providers passed")
+     print(f"{'═' * 60}\n")
+
+     return 0 if all(results) else 1
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
tei_annotator/__init__.py ADDED
@@ -0,0 +1,20 @@
+ """
+ tei-annotator: TEI XML annotation library using a two-stage LLM pipeline.
+ """
+
+ from .inference.endpoint import EndpointCapability, EndpointConfig
+ from .models.schema import TEIAttribute, TEIElement, TEISchema
+ from .models.spans import ResolvedSpan, SpanDescriptor
+ from .pipeline import AnnotationResult, annotate
+
+ __all__ = [
+     "annotate",
+     "AnnotationResult",
+     "TEISchema",
+     "TEIElement",
+     "TEIAttribute",
+     "SpanDescriptor",
+     "ResolvedSpan",
+     "EndpointConfig",
+     "EndpointCapability",
+ ]
tei_annotator/chunking/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .chunker import Chunk, chunk_text
+
+ __all__ = ["Chunk", "chunk_text"]
tei_annotator/chunking/chunker.py ADDED
@@ -0,0 +1,63 @@
+ from __future__ import annotations
+
+ import re
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class Chunk:
+     text: str
+     start_offset: int  # position of text[0] in the original source
+
+
+ def chunk_text(
+     text: str,
+     chunk_size: int = 1500,
+     overlap: int = 200,
+ ) -> list[Chunk]:
+     """
+     Split text into overlapping chunks, never splitting inside an XML tag.
+
+     Each chunk's start_offset satisfies:
+         original_text[chunk.start_offset : chunk.start_offset + len(chunk.text)] == chunk.text
+     """
+     if len(text) <= chunk_size:
+         return [Chunk(text=text, start_offset=0)]
+
+     # Build a set of character positions that are inside XML tags (inclusive).
+     tag_positions: set[int] = set()
+     for m in re.finditer(r"<[^>]*>", text):
+         tag_positions.update(range(m.start(), m.end()))
+
+     chunks: list[Chunk] = []
+     start = 0
+
+     while start < len(text):
+         end = min(start + chunk_size, len(text))
+
+         if end < len(text):
+             # Step back out of any XML tag
+             candidate = end
+             while candidate > start and candidate in tag_positions:
+                 candidate -= 1
+
+             # Try to break at a whitespace boundary near the target
+             break_pos = candidate
+             for i in range(candidate, max(start, candidate - 100), -1):
+                 if i not in tag_positions and text[i].isspace():
+                     break_pos = i + 1
+                     break
+
+             end = max(start + 1, break_pos)  # guarantee forward progress
+
+         chunks.append(Chunk(text=text[start:end], start_offset=start))
+
+         if end >= len(text):
+             break
+
+         next_start = end - overlap
+         if next_start <= start:
+             next_start = end
+         start = next_start
+
+     return chunks
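Ignoring the tag-safe and whitespace-boundary adjustments, the loop above reduces to fixed overlapping windows. The hypothetical helper below (not part of the package) shows how `chunk_size` and `overlap` interact, i.e. why consecutive chunks share 200 characters by default:

```python
def window_offsets(n: int, chunk_size: int = 1500, overlap: int = 200) -> list[tuple[int, int]]:
    """Start/end offsets of overlapping windows over a text of length n,
    mirroring chunk_text's stride: next_start = end - overlap."""
    if n <= chunk_size:
        return [(0, n)]
    offsets: list[tuple[int, int]] = []
    start = 0
    while start < n:
        end = min(start + chunk_size, n)
        offsets.append((start, end))
        if end >= n:
            break
        start = end - overlap  # consecutive windows share `overlap` chars
    return offsets


print(window_offsets(3000))
# [(0, 1500), (1300, 2800), (2600, 3000)]
```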
tei_annotator/detection/__init__.py ADDED
File without changes
tei_annotator/detection/gliner_detector.py ADDED
@@ -0,0 +1,56 @@
+ from __future__ import annotations
+
+ from ..models.schema import TEISchema
+ from ..models.spans import SpanDescriptor
+
+ try:
+     from gliner import GLiNER as _GLiNER
+ except ImportError as _e:
+     raise ImportError(
+         "The 'gliner' package is required for GLiNER detection. "
+         "Install it with: pip install tei-annotator[gliner]"
+     ) from _e
+
+
+ def detect_spans(
+     text: str,
+     schema: TEISchema,
+     model_id: str = "numind/NuNER_Zero",
+ ) -> list[SpanDescriptor]:
+     """
+     Detect entity spans in *text* using a GLiNER model.
+
+     Model weights are fetched from HuggingFace Hub on first use and cached in
+     ~/.cache/huggingface/. All listed models run on CPU; no GPU required.
+
+     Recommended model_id values:
+       - "numind/NuNER_Zero"                        (MIT, default)
+       - "urchade/gliner_medium-v2.1"               (Apache-2.0, balanced)
+       - "knowledgator/gliner-multitask-large-v0.5" (adds relation extraction)
+     """
+     model = _GLiNER.from_pretrained(model_id)
+
+     # Map TEI element descriptions to their tags
+     labels = [elem.description for elem in schema.elements]
+     tag_for_label = {elem.description: elem.tag for elem in schema.elements}
+
+     entities = model.predict_entities(text, labels)
+
+     spans: list[SpanDescriptor] = []
+     for entity in entities:
+         ctx_start = max(0, entity["start"] - 60)
+         ctx_end = min(len(text), entity["end"] + 60)
+         context = text[ctx_start:ctx_end]
+
+         tag = tag_for_label.get(entity["label"], entity["label"])
+         spans.append(
+             SpanDescriptor(
+                 element=tag,
+                 text=entity["text"],
+                 context=context,
+                 attrs={},
+                 confidence=entity.get("score"),
+             )
+         )
+
+     return spans
tei_annotator/inference/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .endpoint import EndpointCapability, EndpointConfig
+
+ __all__ = ["EndpointCapability", "EndpointConfig"]
tei_annotator/inference/endpoint.py ADDED
@@ -0,0 +1,19 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from enum import Enum
+ from typing import Callable
+
+
+ class EndpointCapability(Enum):
+     TEXT_GENERATION = "text_generation"  # plain LLM, JSON via prompt only
+     JSON_ENFORCED = "json_enforced"      # constrained decoding guaranteed
+     EXTRACTION = "extraction"            # GLiNER2/NuExtract-style native
+
+
+ @dataclass
+ class EndpointConfig:
+     capability: EndpointCapability
+     call_fn: Callable[[str], str]
+     # call_fn signature: takes a prompt string, returns a response string.
+     # Caller is responsible for auth, model selection, and retries.
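Because `EndpointConfig` carries nothing but a plain `Callable[[str], str]`, a test double needs no mocking framework. A sketch of a canned-response endpoint for offline runs follows; note that the `{"spans": [...]}` shape is an assumption here, since the prompt templates define the real response contract:

```python
import json
from typing import Callable


def make_mock_call_fn(spans: list[dict]) -> Callable[[str], str]:
    """Return a call_fn that ignores its prompt and answers with canned spans,
    so a pipeline can be exercised without any network access."""
    response = json.dumps({"spans": spans})

    def call_fn(prompt: str) -> str:
        return response

    return call_fn


fn = make_mock_call_fn([{"element": "persName", "text": "Ada", "context": "Ada wrote"}])
print(json.loads(fn("any prompt"))["spans"][0]["element"])  # persName
```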
tei_annotator/models/__init__.py ADDED
@@ -0,0 +1,10 @@
+ from .schema import TEIAttribute, TEIElement, TEISchema
+ from .spans import SpanDescriptor, ResolvedSpan
+
+ __all__ = [
+     "TEIAttribute",
+     "TEIElement",
+     "TEISchema",
+     "SpanDescriptor",
+     "ResolvedSpan",
+ ]
tei_annotator/models/schema.py ADDED
@@ -0,0 +1,30 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+
+
+ @dataclass
+ class TEIAttribute:
+     name: str
+     description: str
+     required: bool = False
+     allowed_values: list[str] | None = None
+
+
+ @dataclass
+ class TEIElement:
+     tag: str
+     description: str
+     allowed_children: list[str] = field(default_factory=list)
+     attributes: list[TEIAttribute] = field(default_factory=list)
+
+
+ @dataclass
+ class TEISchema:
+     elements: list[TEIElement] = field(default_factory=list)
+
+     def get(self, tag: str) -> TEIElement | None:
+         for elem in self.elements:
+             if elem.tag == tag:
+                 return elem
+         return None
tei_annotator/models/spans.py ADDED
@@ -0,0 +1,24 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+
+
+ @dataclass
+ class SpanDescriptor:
+     """Flat span emitted by the LLM or GLiNER — always context-anchored, never nested."""
+     element: str
+     text: str
+     context: str  # must contain text as a substring
+     attrs: dict[str, str] = field(default_factory=dict)
+     confidence: float | None = None  # passed through from GLiNER
+
+
+ @dataclass
+ class ResolvedSpan:
+     """Span resolved to absolute char offsets in the source text."""
+     element: str
+     start: int
+     end: int
+     attrs: dict[str, str] = field(default_factory=dict)
+     children: list[ResolvedSpan] = field(default_factory=list)
+     fuzzy_match: bool = False  # flagged for human review
tei_annotator/pipeline.py ADDED
@@ -0,0 +1,278 @@
+ from __future__ import annotations
+
+ import warnings
+ from dataclasses import dataclass, field
+
+ from .chunking.chunker import chunk_text
+ from .inference.endpoint import EndpointCapability, EndpointConfig
+ from .models.schema import TEISchema
+ from .models.spans import ResolvedSpan, SpanDescriptor
+ from .postprocessing.injector import inject_xml
+ from .postprocessing.parser import parse_response
+ from .postprocessing.resolver import resolve_spans
+ from .postprocessing.validator import validate_spans
+ from .prompting.builder import build_prompt, make_correction_prompt
+
+
+ @dataclass
+ class AnnotationResult:
+     xml: str
+     fuzzy_spans: list[ResolvedSpan] = field(default_factory=list)
+
+
+ # ---------------------------------------------------------------------------
+ # Internal helpers
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class _TagEntry:
+     plain_offset: int  # position in plain text before which this tag should be re-inserted
+     tag: str
+
+
+ def _strip_existing_tags(text: str) -> tuple[str, list[_TagEntry]]:
+     """
+     Remove XML tags from *text*.
+
+     Returns (plain_text, restore_map) where restore_map records each stripped
+     tag and the plain-text offset at which it should be re-inserted.
+     """
+     plain: list[str] = []
+     restore: list[_TagEntry] = []
+     i = 0
+     while i < len(text):
+         if text[i] == "<":
+             j = text.find(">", i)
+             if j != -1:
+                 restore.append(_TagEntry(plain_offset=len(plain), tag=text[i : j + 1]))
+                 i = j + 1
+             else:
+                 plain.append(text[i])
+                 i += 1
+         else:
+             plain.append(text[i])
+             i += 1
+     return "".join(plain), restore
+
+
+ def _restore_existing_tags(annotated_xml: str, restore_map: list[_TagEntry]) -> str:
+     """
+     Re-insert original XML tags into *annotated_xml*.
+
+     The tags are keyed by their position in the *plain text* (before annotation),
+     so we walk the annotated XML tracking plain-text position (i.e. advancing only
+     on non-tag characters).
+     """
+     if not restore_map:
+         return annotated_xml
+
+     inserts: dict[int, list[str]] = {}
+     for entry in restore_map:
+         inserts.setdefault(entry.plain_offset, []).append(entry.tag)
+
+     result: list[str] = []
+     plain_pos = 0
+     i = 0
+
+     while i < len(annotated_xml):
+         # Flush any original tags due at the current plain position
+         for tag in inserts.pop(plain_pos, []):
+             result.append(tag)
+
+         if annotated_xml[i] == "<":
+             # Existing (newly injected) tag — copy verbatim, don't advance plain_pos
+             j = annotated_xml.find(">", i)
+             if j != -1:
+                 result.append(annotated_xml[i : j + 1])
+                 i = j + 1
+             else:
+                 result.append(annotated_xml[i])
+                 plain_pos += 1
+                 i += 1
+         else:
+             result.append(annotated_xml[i])
+             plain_pos += 1
+             i += 1
+
+     # Flush any remaining original tags (e.g. trailing tags in the original)
+     for pos in sorted(inserts.keys()):
+         for tag in inserts[pos]:
+             result.append(tag)
+
+     return "".join(result)
+
+
+ def _run_gliner(
+     text: str,
+     schema: TEISchema,
+     model_id: str,
+ ) -> list[SpanDescriptor]:
+     """Run GLiNER detection; returns [] if the optional dependency is missing."""
+     try:
+         from .detection.gliner_detector import detect_spans
+
+         return detect_spans(text, schema, model_id)
+     except ImportError:
+         warnings.warn(
+             "gliner is not installed; skipping GLiNER pre-detection pass. "
+             "Install it with: pip install tei-annotator[gliner]",
+             stacklevel=3,
+         )
+         return []
+
+
+ # ---------------------------------------------------------------------------
+ # Public API
+ # ---------------------------------------------------------------------------
+
+
+ def annotate(
+     text: str,
+     schema: TEISchema,
+     endpoint: EndpointConfig,
+     gliner_model: str | None = "numind/NuNER_Zero",
+     chunk_size: int = 1500,
+     chunk_overlap: int = 200,
+ ) -> AnnotationResult:
+     """
+     Annotate *text* with TEI XML tags using a two-stage LLM pipeline.
+
+     The source text is **never modified** — models only contribute tag positions
+     and attribute values. All text in the output comes from the original input.
+
+     Parameters
+     ----------
+     text:
+         Input text, which may already contain partial XML markup.
+     schema:
+         A TEISchema describing the elements (and their attributes) in scope.
+     endpoint:
+         Injected inference dependency (wraps any call_fn: str → str).
+     gliner_model:
+         HuggingFace model ID for the optional GLiNER pre-detection pass.
+         Pass None to disable.
+     chunk_size:
+         Maximum characters per chunk sent to the LLM.
+     chunk_overlap:
+         Characters of overlap between consecutive chunks.
+     """
+     # ------------------------------------------------------------------ #
+     # STEP 1   Strip existing XML tags; save restoration map             #
+     # ------------------------------------------------------------------ #
+     plain_text, restore_map = _strip_existing_tags(text)
+
+     # ------------------------------------------------------------------ #
+     # STEP 2   Optional GLiNER pre-detection pass                        #
+     # ------------------------------------------------------------------ #
+     gliner_candidates: list[SpanDescriptor] = []
+     if (
+         gliner_model is not None
+         and endpoint.capability != EndpointCapability.EXTRACTION
+         and len(plain_text) > 200
+     ):
+         gliner_candidates = _run_gliner(plain_text, schema, gliner_model)
+
+     # ------------------------------------------------------------------ #
+     # STEPS 3–5   Chunk → prompt → infer → postprocess                   #
+     # ------------------------------------------------------------------ #
+     chunks = chunk_text(plain_text, chunk_size=chunk_size, overlap=chunk_overlap)
+     all_resolved: list[ResolvedSpan] = []
+
+     for chunk in chunks:
+         # Narrow GLiNER candidates to those plausibly within this chunk
+         chunk_candidates: list[SpanDescriptor] | None = None
+         if gliner_candidates:
+             chunk_candidates = [
+                 c
+                 for c in gliner_candidates
+                 if c.context and chunk.text.find(c.context[:30]) != -1
+             ] or None
+
+         # 3. Build prompt / raw request
+         if endpoint.capability == EndpointCapability.EXTRACTION:
+             raw_response = endpoint.call_fn(chunk.text)
+         else:
+             prompt = build_prompt(
+                 source_text=chunk.text,
+                 schema=schema,
+                 capability=endpoint.capability,
+                 candidates=chunk_candidates,
+             )
+             raw_response = endpoint.call_fn(prompt)
+
+         # 4. Parse response → SpanDescriptors
+         retry_fn = (
+             endpoint.call_fn
+             if endpoint.capability == EndpointCapability.TEXT_GENERATION
207
+ else None
208
+ )
209
+ correction_fn = (
210
+ make_correction_prompt
211
+ if endpoint.capability == EndpointCapability.TEXT_GENERATION
212
+ else None
213
+ )
214
+ try:
215
+ span_descs = parse_response(
216
+ raw_response,
217
+ call_fn=retry_fn,
218
+ make_correction_prompt=correction_fn,
219
+ )
220
+ except ValueError:
221
+ warnings.warn(
222
+ f"Could not parse LLM response for chunk at offset "
223
+ f"{chunk.start_offset}; skipping chunk.",
224
+ stacklevel=2,
225
+ )
226
+ continue
227
+
228
+ # 5a. Resolve within chunk text → positions relative to chunk
229
+ chunk_resolved = resolve_spans(chunk.text, span_descs)
230
+
231
+ # 5b. Shift to global (plain_text) offsets
232
+ for span in chunk_resolved:
233
+ span.start += chunk.start_offset
234
+ span.end += chunk.start_offset
235
+
236
+ # 5c. Validate against schema
237
+ chunk_resolved = validate_spans(chunk_resolved, schema, plain_text)
238
+
239
+ all_resolved.extend(chunk_resolved)
240
+
241
+ # ------------------------------------------------------------------ #
242
+ # Deduplicate spans that appeared in overlapping chunks #
243
+ # ------------------------------------------------------------------ #
244
+ seen: set[tuple[str, int, int]] = set()
245
+ deduped: list[ResolvedSpan] = []
246
+ for span in all_resolved:
247
+ key = (span.element, span.start, span.end)
248
+ if key not in seen:
249
+ seen.add(key)
250
+ deduped.append(span)
251
+
252
+ # ------------------------------------------------------------------ #
253
+ # STEP 5d Inject XML tags into the plain text #
254
+ # ------------------------------------------------------------------ #
255
+ annotated_text = inject_xml(plain_text, deduped)
256
+
257
+ # ------------------------------------------------------------------ #
258
+ # STEP 5d (cont.) Restore original XML tags #
259
+ # ------------------------------------------------------------------ #
260
+ final_xml = _restore_existing_tags(annotated_text, restore_map)
261
+
262
+ # ------------------------------------------------------------------ #
263
+ # STEP 5e Final XML validation (best-effort) #
264
+ # ------------------------------------------------------------------ #
265
+ try:
266
+ from lxml import etree
267
+
268
+ try:
269
+ etree.fromstring(f"<_root>{final_xml}</_root>".encode())
270
+ except etree.XMLSyntaxError as exc:
271
+ warnings.warn(f"Output XML validation failed: {exc}", stacklevel=2)
272
+ except ImportError:
273
+ pass
274
+
275
+ return AnnotationResult(
276
+ xml=final_xml,
277
+ fuzzy_spans=[s for s in deduped if s.fuzzy_match],
278
+ )
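The deduplication step above (spans found twice because of chunk overlap collapse to one entry keyed by `(element, start, end)`) can be sketched standalone — the `Span` tuple here is a hypothetical stand-in for the library's `ResolvedSpan`, not its real definition:

```python
# Minimal sketch of the overlap-dedup step: first occurrence wins,
# keyed by (element, start, end).
from typing import NamedTuple

class Span(NamedTuple):  # hypothetical stand-in for ResolvedSpan
    element: str
    start: int
    end: int

def dedupe(spans: list[Span]) -> list[Span]:
    seen: set[tuple[str, int, int]] = set()
    out: list[Span] = []
    for s in spans:
        key = (s.element, s.start, s.end)
        if key not in seen:
            seen.add(key)
            out.append(s)
    return out

spans = [Span("persName", 10, 20), Span("persName", 10, 20), Span("placeName", 30, 35)]
deduped = dedupe(spans)
```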
tei_annotator/postprocessing/__init__.py ADDED
@@ -0,0 +1,6 @@
+from .injector import inject_xml
+from .parser import parse_response
+from .resolver import resolve_spans
+from .validator import validate_spans
+
+__all__ = ["inject_xml", "parse_response", "resolve_spans", "validate_spans"]
tei_annotator/postprocessing/injector.py ADDED
@@ -0,0 +1,112 @@
+from __future__ import annotations
+
+import warnings
+
+from ..models.spans import ResolvedSpan
+
+
+def _build_nesting_tree(flat_spans: list[ResolvedSpan]) -> list[ResolvedSpan]:
+    """
+    Populate ResolvedSpan.children based on offset containment and return root spans.
+
+    Spans are sorted so that outer (longer) spans are processed before inner ones.
+    Overlapping (non-nesting) spans are skipped with a warning.
+    """
+    # Sort: start asc, then end desc so outer spans come before inner at same start
+    spans = sorted(flat_spans, key=lambda s: (s.start, -(s.end - s.start)))
+
+    # Clear any children left from a previous call
+    for s in spans:
+        s.children = []
+
+    roots: list[ResolvedSpan] = []
+    stack: list[ResolvedSpan] = []
+
+    for span in spans:
+        rejected = False
+
+        # Pop stack entries that are fully before (or incompatibly overlap) this span
+        while stack:
+            top = stack[-1]
+            if top.start <= span.start and span.end <= top.end:
+                break  # top properly contains span → it's the parent
+            elif span.start >= top.end:
+                stack.pop()  # span comes after top → pop and continue
+            else:
+                # Partial overlap (neither contained nor after) → reject span
+                warnings.warn(
+                    f"Overlapping spans [{top.start},{top.end}] and "
+                    f"[{span.start},{span.end}] cannot be nested; "
+                    f"skipping <{span.element}> span.",
+                    stacklevel=3,
+                )
+                rejected = True
+                break
+
+        if rejected:
+            continue
+
+        if stack:
+            stack[-1].children.append(span)
+        else:
+            roots.append(span)
+
+        stack.append(span)
+
+    return roots
+
+
+def _inject_recursive(
+    text: str,
+    spans: list[ResolvedSpan],
+    offset: int,
+) -> str:
+    """
+    Insert XML open/close tags for *spans* into *text*.
+
+    *offset* is the absolute position of text[0] in the original source, used
+    to translate span.start/end (absolute) to positions within *text*.
+    """
+    if not spans:
+        return text
+
+    result: list[str] = []
+    cursor = 0  # relative position within text
+
+    for span in sorted(spans, key=lambda s: s.start):
+        rel_start = span.start - offset
+        rel_end = span.end - offset
+
+        # Text before this span
+        result.append(text[cursor:rel_start])
+
+        # Build tag strings
+        attrs_str = " ".join(f'{k}="{v}"' for k, v in span.attrs.items())
+        open_tag = f"<{span.element}" + (f" {attrs_str}" if attrs_str else "") + ">"
+        close_tag = f"</{span.element}>"
+
+        # Recursively inject children inside this span's content
+        inner = text[rel_start:rel_end]
+        if span.children:
+            inner = _inject_recursive(inner, span.children, offset=span.start)
+
+        result.append(open_tag)
+        result.append(inner)
+        result.append(close_tag)
+
+        cursor = rel_end
+
+    result.append(text[cursor:])
+    return "".join(result)
+
+
+def inject_xml(source: str, spans: list[ResolvedSpan]) -> str:
+    """
+    Insert XML tags into *source* at the positions defined by *spans*.
+
+    Nesting is inferred from offset containment via _build_nesting_tree.
+    """
+    if not spans:
+        return source
+    root_spans = _build_nesting_tree(spans)
+    return _inject_recursive(source, root_spans, offset=0)
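The containment rule in `_build_nesting_tree` — sort by `(start, -(length))` so outer spans precede inner ones, then walk a stack, rejecting partial overlaps — can be exercised with bare `(start, end, tag)` tuples. This is a standalone sketch of the same stack logic, not an import of the module:

```python
# Standalone check of the sort-and-stack containment rule: contained spans
# become children, partially overlapping spans are dropped.
def nest(spans):
    # sort: start ascending, longer (outer) spans first at equal start
    spans = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    roots, stack = [], []
    children = {s[2]: [] for s in spans}
    for span in spans:
        rejected = False
        while stack:
            top = stack[-1]
            if top[0] <= span[0] and span[1] <= top[1]:
                break                 # top contains span → it is the parent
            if span[0] >= top[1]:
                stack.pop()           # span lies entirely after top
            else:
                rejected = True       # partial overlap → cannot nest
                break
        if rejected:
            continue
        (children[stack[-1][2]] if stack else roots).append(span)
        stack.append(span)
    return roots, children

roots, children = nest([
    (3, 13, "persName"),
    (3, 7, "forename"),
    (8, 13, "surname"),
    (10, 16, "bad"),      # overlaps surname without containing it → rejected
])
```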
tei_annotator/postprocessing/parser.py ADDED
@@ -0,0 +1,78 @@
+from __future__ import annotations
+
+import json
+import re
+from typing import Callable
+
+from ..models.spans import SpanDescriptor
+
+
+def _strip_fences(text: str) -> str:
+    """Remove markdown code fences, even if preceded by explanatory text."""
+    text = text.strip()
+    m = re.search(r"```(?:json)?\s*\n?(.*?)\n?```", text, re.DOTALL)
+    if m:
+        return m.group(1).strip()
+    return text
+
+
+def _parse_json_list(text: str) -> list[dict] | None:
+    """Parse text as a JSON list; return None on failure."""
+    try:
+        result = json.loads(text)
+        return result if isinstance(result, list) else None
+    except json.JSONDecodeError:
+        return None
+
+
+def _dicts_to_spans(raw: list[dict]) -> list[SpanDescriptor]:
+    spans: list[SpanDescriptor] = []
+    for item in raw:
+        if not isinstance(item, dict):
+            continue
+        element = item.get("element", "")
+        text = item.get("text", "")
+        context = item.get("context", "")
+        attrs = item.get("attrs", {})
+        if not (element and text and context):
+            continue
+        spans.append(
+            SpanDescriptor(
+                element=element,
+                text=text,
+                context=context,
+                attrs=attrs if isinstance(attrs, dict) else {},
+            )
+        )
+    return spans
+
+
+def parse_response(
+    response: str,
+    call_fn: Callable[[str], str] | None = None,
+    make_correction_prompt: Callable[[str, str], str] | None = None,
+) -> list[SpanDescriptor]:
+    """
+    Parse an LLM response string into a list of SpanDescriptors.
+
+    - Strips markdown code fences automatically.
+    - If parsing fails and *call_fn* + *make_correction_prompt* are provided,
+      retries once with a self-correction prompt that includes the bad response.
+    - Raises ValueError if parsing fails after the retry (or if no retry is configured).
+    """
+    cleaned = _strip_fences(response)
+    raw = _parse_json_list(cleaned)
+    if raw is not None:
+        return _dicts_to_spans(raw)
+
+    if call_fn is None or make_correction_prompt is None:
+        raise ValueError(f"Failed to parse JSON from response: {response[:300]!r}")
+
+    error_msg = "Response is not valid JSON"
+    correction_prompt = make_correction_prompt(response, error_msg)
+    retry_response = call_fn(correction_prompt)
+    retry_cleaned = _strip_fences(retry_response)
+    raw = _parse_json_list(retry_cleaned)
+    if raw is None:
+        raise ValueError(f"Failed to parse JSON after retry: {retry_response[:300]!r}")
+    return _dicts_to_spans(raw)
tei_annotator/postprocessing/resolver.py ADDED
@@ -0,0 +1,95 @@
+from __future__ import annotations
+
+from ..models.spans import ResolvedSpan, SpanDescriptor
+
+try:
+    from rapidfuzz import fuzz as _fuzz
+
+    _HAS_RAPIDFUZZ = True
+except ImportError:
+    _HAS_RAPIDFUZZ = False
+
+
+def _find_context(
+    source: str,
+    context: str,
+    threshold: float,
+) -> tuple[int, bool] | None:
+    """
+    Locate *context* in *source*.
+
+    Returns (start_pos, is_fuzzy):
+    - (pos, False) on exact match
+    - (pos, True) on fuzzy match with score >= threshold
+    - None if not found or below threshold
+    """
+    pos = source.find(context)
+    if pos != -1:
+        return pos, False
+
+    if not _HAS_RAPIDFUZZ or not context:
+        return None
+
+    win = len(context)
+    if win > len(source):
+        return None
+
+    best_score = 0.0
+    best_pos = -1
+    for i in range(len(source) - win + 1):
+        score = _fuzz.ratio(context, source[i : i + win]) / 100.0
+        if score > best_score:
+            best_score = score
+            best_pos = i
+
+    if best_score >= threshold:
+        return best_pos, True
+    return None
+
+
+def resolve_spans(
+    source: str,
+    spans: list[SpanDescriptor],
+    fuzzy_threshold: float = 0.92,
+) -> list[ResolvedSpan]:
+    """
+    Convert context-anchored SpanDescriptors to char-offset ResolvedSpans.
+
+    Rejects spans whose text cannot be reliably located in *source*.
+    Spans that required fuzzy context matching are flagged with fuzzy_match=True.
+    """
+    resolved: list[ResolvedSpan] = []
+
+    for span in spans:
+        result = _find_context(source, span.context, fuzzy_threshold)
+        if result is None:
+            continue  # context not found → reject
+
+        ctx_start, context_is_fuzzy = result
+
+        # Find span.text within the located context window
+        window = source[ctx_start : ctx_start + len(span.context)]
+        text_pos = window.find(span.text)
+        if text_pos == -1:
+            continue  # text not in context window → reject
+
+        abs_start = ctx_start + text_pos
+        abs_end = abs_start + len(span.text)
+
+        # Verify verbatim match (should always hold after exact context find,
+        # but important guard after fuzzy context find)
+        if source[abs_start:abs_end] != span.text:
+            continue
+
+        resolved.append(
+            ResolvedSpan(
+                element=span.element,
+                start=abs_start,
+                end=abs_end,
+                attrs=span.attrs.copy(),
+                children=[],
+                fuzzy_match=context_is_fuzzy,
+            )
+        )
+
+    return resolved
tei_annotator/postprocessing/validator.py ADDED
@@ -0,0 +1,49 @@
+from __future__ import annotations
+
+from ..models.schema import TEISchema
+from ..models.spans import ResolvedSpan
+
+
+def validate_spans(
+    spans: list[ResolvedSpan],
+    schema: TEISchema,
+    source: str,
+) -> list[ResolvedSpan]:
+    """
+    Filter out spans that fail schema validation.
+
+    Rejected when:
+    - element is not in the schema
+    - an attribute name is not listed for that element
+    - an attribute value is not in the element's allowed_values (when constrained)
+    - span bounds are out of range
+    """
+    valid: list[ResolvedSpan] = []
+
+    for span in spans:
+        # Bounds sanity check
+        if span.start < 0 or span.end > len(source) or span.start >= span.end:
+            continue
+
+        elem = schema.get(span.element)
+        if elem is None:
+            continue  # element not in schema
+
+        allowed_names = {a.name for a in elem.attributes}
+        attr_ok = True
+        for attr_name, attr_value in span.attrs.items():
+            if attr_name not in allowed_names:
+                attr_ok = False
+                break
+            attr_def = next((a for a in elem.attributes if a.name == attr_name), None)
+            if attr_def and attr_def.allowed_values is not None:
+                if attr_value not in attr_def.allowed_values:
+                    attr_ok = False
+                    break
+
+        if not attr_ok:
+            continue
+
+        valid.append(span)
+
+    return valid
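The attribute checks above reduce to a small decision table. A standalone sketch with the schema flattened to a plain dict (`tag → {attr_name: allowed_values_or_None}` — an illustrative shape, not the library's `TEISchema`):

```python
# Standalone sketch of the attribute validation rules in validate_spans.
def attrs_ok(element: str, attrs: dict, schema: dict) -> bool:
    allowed = schema.get(element)
    if allowed is None:
        return False                       # element not in schema
    for name, value in attrs.items():
        if name not in allowed:
            return False                   # unknown attribute name
        values = allowed[name]
        if values is not None and value not in values:
            return False                   # value outside allowed_values
    return True

schema = {"persName": {"ref": None, "cert": ["high", "low"]}}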
tei_annotator/prompting/__init__.py ADDED
@@ -0,0 +1,3 @@
+from .builder import build_prompt, make_correction_prompt
+
+__all__ = ["build_prompt", "make_correction_prompt"]
tei_annotator/prompting/builder.py ADDED
@@ -0,0 +1,82 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from ..inference.endpoint import EndpointCapability
+from ..models.schema import TEISchema
+from ..models.spans import SpanDescriptor
+
+try:
+    from jinja2 import Environment, FileSystemLoader
+
+    _HAS_JINJA = True
+except ImportError:
+    _HAS_JINJA = False
+
+_TEMPLATE_DIR = Path(__file__).parent / "templates"
+
+
+def _get_env() -> "Environment":
+    if not _HAS_JINJA:
+        raise ImportError(
+            "jinja2 is required for prompt building. Install it with: pip install jinja2"
+        )
+    env = Environment(loader=FileSystemLoader(str(_TEMPLATE_DIR)), keep_trailing_newline=True)
+    env.filters["tojson"] = lambda x, **kw: json.dumps(x, ensure_ascii=False, **kw)
+    return env
+
+
+def build_prompt(
+    source_text: str,
+    schema: TEISchema,
+    capability: EndpointCapability,
+    candidates: list[SpanDescriptor] | None = None,
+) -> str:
+    """
+    Build an LLM prompt for the given endpoint capability.
+
+    Raises ValueError for EXTRACTION endpoints (they don't use text prompts).
+    """
+    if capability == EndpointCapability.EXTRACTION:
+        raise ValueError(
+            "EXTRACTION endpoints use their own native format; no text prompt needed."
+        )
+
+    env = _get_env()
+    template_name = (
+        "text_gen.jinja2"
+        if capability == EndpointCapability.TEXT_GENERATION
+        else "json_enforced.jinja2"
+    )
+    template = env.get_template(template_name)
+
+    candidate_dicts: list[dict] | None = None
+    if candidates:
+        candidate_dicts = [
+            {
+                "element": c.element,
+                "text": c.text,
+                "context": c.context,
+                "attrs": c.attrs,
+                **({"confidence": c.confidence} if c.confidence is not None else {}),
+            }
+            for c in candidates
+        ]
+
+    return template.render(
+        schema=schema,
+        source_text=source_text,
+        candidates=candidate_dicts,
+    )
+
+
+def make_correction_prompt(original_response: str, error_message: str) -> str:
+    """Build a self-correction retry prompt that includes the bad response and the error."""
+    return (
+        "Your previous response could not be parsed as JSON.\n"
+        f"Error: {error_message}\n\n"
+        f"Your previous response was:\n{original_response}\n\n"
+        "Please fix the JSON and return only a valid JSON array of span objects. "
+        "Do not include any markdown formatting or explanation."
+    )
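To see roughly what the rendered schema section looks like without loading the Jinja templates, here is a jinja-free approximation using plain f-strings — the exact wording is illustrative, not the templates' output verbatim:

```python
# Jinja-free approximation of the "## TEI Schema" section the builder renders.
def schema_section(elements: list[dict]) -> str:
    lines = ["## TEI Schema"]
    for elem in elements:
        attrs = ", ".join(f"`{a}`" for a in elem.get("attributes", []))
        suffix = f" (attributes: {attrs})" if attrs else ""
        lines.append(f"- `{elem['tag']}`: {elem['description']}{suffix}")
    return "\n".join(lines)

section = schema_section([
    {"tag": "persName", "description": "a person's name", "attributes": ["ref", "cert"]},
    {"tag": "placeName", "description": "a place name"},
])
```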
tei_annotator/prompting/templates/json_enforced.jinja2 ADDED
@@ -0,0 +1,21 @@
+You are a TEI XML annotation assistant.
+
+## TEI Schema
+{% for elem in schema.elements %}
+- `{{ elem.tag }}`: {{ elem.description }}{% if elem.attributes %} (attributes: {% for attr in elem.attributes %}`{{ attr.name }}`{% if not loop.last %}, {% endif %}{% endfor %}){% endif %}
+{% endfor %}
+
+## Source Text
+
+```
+{{ source_text }}
+```
+{% if candidates %}
+
+## Pre-detected Candidates (verify and extend)
+
+{{ candidates | tojson }}
+{% endif %}
+
+Return a JSON array. Each item must have: `element`, `text`, `context`, `attrs`.
+One entry per occurrence. `text` and `context` must be exact substrings of the source text.
tei_annotator/prompting/templates/text_gen.jinja2 ADDED
@@ -0,0 +1,53 @@
+You are a TEI XML annotation assistant. Your task is to identify named entities and spans in the source text and annotate them with TEI XML tags.
+
+## TEI Schema
+
+The following TEI elements are in scope:
+{% for elem in schema.elements %}
+### `{{ elem.tag }}`
+{{ elem.description }}
+{% if elem.attributes %}
+Attributes:
+{% for attr in elem.attributes %}
+- `{{ attr.name }}`{% if attr.required %} *(required)*{% endif %}: {{ attr.description }}{% if attr.allowed_values %} — allowed values: `{{ attr.allowed_values | join("`, `") }}`{% endif %}
+{% endfor %}
+{% endif %}
+{% endfor %}
+
+## Source Text
+
+```
+{{ source_text }}
+```
+{% if candidates %}
+
+## Pre-detected Candidates
+
+The following spans were pre-detected by a fast model. Use them as hints — verify each one, correct any errors, and add any entities the detector missed:
+
+{{ candidates | tojson }}
+{% endif %}
+
+## Instructions
+
+Identify all occurrences of entities described in the schema above in the source text.
+
+Return a **JSON array** where each item is an object with:
+- `"element"`: the TEI tag name (e.g. `"persName"`)
+- `"text"`: the exact text span to annotate — must appear verbatim in the source text
+- `"context"`: a substring of the source text (50–150 characters) that contains `"text"` as a substring
+- `"attrs"`: a JSON object with attribute name → value pairs (use `{}` if no attributes needed)
+
+Rules:
+- Emit one entry per **occurrence**, not per unique entity
+- `"text"` must be an exact substring of the source text
+- `"context"` must be an exact substring of the source text and must contain `"text"`
+- Do not modify the source text in any way
+
+Output **only** the JSON array. Do not include markdown fences, explanations, or any other text.
+
+Example output:
+[
+  {"element": "persName", "text": "John Smith", "context": "He said John Smith yesterday.", "attrs": {}},
+  {"element": "placeName", "text": "Paris", "context": "traveled to Paris in 1920", "attrs": {"ref": "https://www.wikidata.org/wiki/Q90"}}
+]
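The contract the template states — `text` and `context` must be exact substrings of the source, and `context` must contain `text` — is cheap to check on the consumer side. A minimal sketch (the helper name is illustrative):

```python
# Quick check of the span contract the prompt asks the model to honor.
def span_well_formed(span: dict, source: str) -> bool:
    text, context = span.get("text", ""), span.get("context", "")
    return bool(text) and text in context and context in source

source = "He said John Smith yesterday."
good = {"element": "persName", "text": "John Smith", "context": "said John Smith yesterday."}
bad = {"element": "persName", "text": "Jon Smith", "context": "said John Smith yesterday."}
```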
tests/__init__.py ADDED
File without changes
tests/integration/__init__.py ADDED
File without changes
tests/integration/test_gliner_detector.py ADDED
@@ -0,0 +1,59 @@
+"""
+Integration tests for GLiNER detection.
+
+These tests download a real HuggingFace model (~400 MB) on first run.
+Run with: pytest -m integration
+"""
+
+import pytest
+
+pytestmark = pytest.mark.integration
+
+
+def test_gliner_detects_person_name():
+    from tei_annotator.detection.gliner_detector import detect_spans
+    from tei_annotator.models.schema import TEIElement, TEISchema
+
+    schema = TEISchema(
+        elements=[
+            TEIElement(tag="persName", description="a person's name", attributes=[]),
+        ]
+    )
+    text = "Albert Einstein was born in Ulm in 1879."
+    spans = detect_spans(text, schema, model_id="numind/NuNER_Zero")
+    assert any(s.element == "persName" and "Einstein" in s.text for s in spans), (
+        f"Expected a persName span containing 'Einstein'; got: {spans}"
+    )
+
+
+def test_gliner_confidence_scores_present():
+    from tei_annotator.detection.gliner_detector import detect_spans
+    from tei_annotator.models.schema import TEIElement, TEISchema
+
+    schema = TEISchema(
+        elements=[
+            TEIElement(tag="persName", description="a person's name", attributes=[]),
+        ]
+    )
+    text = "Marie Curie discovered polonium."
+    spans = detect_spans(text, schema, model_id="numind/NuNER_Zero")
+    for span in spans:
+        if span.confidence is not None:
+            assert 0.0 <= span.confidence <= 1.0
+
+
+def test_gliner_context_contains_text():
+    from tei_annotator.detection.gliner_detector import detect_spans
+    from tei_annotator.models.schema import TEIElement, TEISchema
+
+    schema = TEISchema(
+        elements=[
+            TEIElement(tag="persName", description="a person's name", attributes=[]),
+        ]
+    )
+    text = "Charles Darwin published On the Origin of Species."
+    spans = detect_spans(text, schema, model_id="numind/NuNER_Zero")
+    for span in spans:
+        assert span.text in span.context, (
+            f"span.text {span.text!r} not found in context {span.context!r}"
+        )
tests/integration/test_pipeline_e2e.py ADDED
@@ -0,0 +1,413 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ End-to-end integration tests: real GLiNER model + mocked call_fn.
3
+
4
+ Tests that only use mocked call_fn (gliner_model=None) are also here because
5
+ they exercise the full pipeline with non-trivial context resolution scenarios.
6
+
7
+ Run with: pytest -m integration
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import json
13
+ import re
14
+
15
+ import pytest
16
+
17
+ pytestmark = pytest.mark.integration
18
+
19
+ # ---------------------------------------------------------------------------
20
+ # Helpers
21
+ # ---------------------------------------------------------------------------
22
+
23
+
24
+ def _strip_tags(xml: str) -> str:
25
+ return re.sub(r"<[^>]+>", "", xml)
26
+
27
+
28
+ def _schema(*tags: tuple[str, str]):
29
+ """Build a TEISchema from (tag, description) pairs."""
30
+ from tei_annotator.models.schema import TEIAttribute, TEIElement, TEISchema
31
+
32
+ elements = []
33
+ for tag, desc in tags:
34
+ if tag == "persName":
35
+ elements.append(
36
+ TEIElement(
37
+ tag="persName",
38
+ description=desc,
39
+ attributes=[
40
+ TEIAttribute(name="ref", description="URI reference"),
41
+ TEIAttribute(name="cert", description="certainty", allowed_values=["high", "low"]),
42
+ ],
43
+ )
44
+ )
45
+ else:
46
+ elements.append(TEIElement(tag=tag, description=desc, attributes=[]))
47
+ return TEISchema(elements=elements)
48
+
49
+
50
+ def _endpoint(call_fn, capability="json_enforced"):
51
+ from tei_annotator.inference.endpoint import EndpointCapability, EndpointConfig
52
+
53
+ cap = {
54
+ "json_enforced": EndpointCapability.JSON_ENFORCED,
55
+ "text_generation": EndpointCapability.TEXT_GENERATION,
56
+ }[capability]
57
+ return EndpointConfig(capability=cap, call_fn=call_fn)
58
+
59
+
60
+ def _annotate(text, schema, call_fn, capability="json_enforced", gliner_model=None, **kw):
61
+ from tei_annotator.pipeline import annotate
62
+
63
+ return annotate(
64
+ text=text,
65
+ schema=schema,
66
+ endpoint=_endpoint(call_fn, capability),
67
+ gliner_model=gliner_model,
68
+ **kw,
69
+ )
70
+
71
+
72
+ # ---------------------------------------------------------------------------
73
+ # 1. Exact context longer than span text
74
+ # ---------------------------------------------------------------------------
75
+
76
+
77
+ def test_context_longer_than_span_text():
78
+ """Resolver must locate span.text inside a longer context window."""
79
+ source = "The treaty was signed by Cardinal Richelieu in Paris."
80
+ schema = _schema(("persName", "a person's name"), ("placeName", "a place name"))
81
+
82
+ def call_fn(_):
83
+ return json.dumps([
84
+ {
85
+ "element": "persName",
86
+ "text": "Cardinal Richelieu",
87
+ "context": "was signed by Cardinal Richelieu in Paris",
88
+ "attrs": {},
89
+ },
90
+ {
91
+ "element": "placeName",
92
+ "text": "Paris",
93
+ "context": "Cardinal Richelieu in Paris.",
94
+ "attrs": {},
95
+ },
96
+ ])
97
+
98
+ result = _annotate(source, schema, call_fn)
99
+ assert "<persName>Cardinal Richelieu</persName>" in result.xml
100
+ assert "<placeName>Paris</placeName>" in result.xml
101
+ assert _strip_tags(result.xml) == source
102
+
103
+
104
+ # ---------------------------------------------------------------------------
105
+ # 2. Same span text appears twice — context disambiguates
106
+ # ---------------------------------------------------------------------------
107
+
108
+
109
+ def test_multiple_occurrences_disambiguated_by_context():
110
+ """
111
+ 'John Smith' appears twice. LLM returns two spans with distinct contexts
112
+ pointing at each occurrence. Both must be annotated at the correct offset.
113
+ """
114
+ source = "John Smith arrived early. Later, John Smith left."
115
+ schema = _schema(("persName", "a person's name"))
116
+
117
+ def call_fn(_):
118
+ return json.dumps([
119
+ {
120
+ "element": "persName",
121
+ "text": "John Smith",
122
+ "context": "John Smith arrived early.",
123
+ "attrs": {},
124
+ },
125
+ {
126
+ "element": "persName",
127
+ "text": "John Smith",
128
+ "context": "Later, John Smith left.",
129
+ "attrs": {},
130
+ },
131
+ ])
132
+
133
+ result = _annotate(source, schema, call_fn)
134
+ assert result.xml.count("<persName>") == 2
135
+ assert result.xml.count("John Smith") == 2
136
+ assert _strip_tags(result.xml) == source
137
+
138
+
139
+ # ---------------------------------------------------------------------------
140
+ # 3. Long text requiring chunking — global offset calculation
141
+ # ---------------------------------------------------------------------------
142
+
143
+
144
+ def test_long_text_entity_in_second_chunk():
145
+ """
146
+ Entity is far into a long text; its LLM context is relative to a later chunk.
147
+ Offset must be shifted by chunk.start_offset to land at the correct global position.
148
+ """
149
+ # Build a ~2500-char text; entity sits well past the first 1500-char chunk
150
+ filler = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 25 # ~1425 chars
+ target_sentence = "Napoleon Bonaparte was exiled to Saint Helena."
+ source = filler + target_sentence
+
+ schema = _schema(
+ ("persName", "a person's name"),
+ ("placeName", "a place name"),
+ )
+
+ def call_fn(prompt):
+ # The LLM sees either a filler chunk (returns []) or the chunk containing
+ # the target sentence and returns the two spans.
+ if "Napoleon" not in prompt:
+ return "[]"
+ return json.dumps([
+ {
+ "element": "persName",
+ "text": "Napoleon Bonaparte",
+ "context": "Napoleon Bonaparte was exiled to Saint Helena.",
+ "attrs": {},
+ },
+ {
+ "element": "placeName",
+ "text": "Saint Helena",
+ "context": "was exiled to Saint Helena.",
+ "attrs": {},
+ },
+ ])
+
+ result = _annotate(source, schema, call_fn, chunk_size=1500, chunk_overlap=200)
+
+ assert "<persName>Napoleon Bonaparte</persName>" in result.xml
+ assert "<placeName>Saint Helena</placeName>" in result.xml
+ assert _strip_tags(result.xml) == source
+
+ # Verify the annotated positions are truly within the target sentence
+ napoleon_start = result.xml.index("Napoleon Bonaparte")
+ assert napoleon_start > 1400, (
+ f"Napoleon offset {napoleon_start} is too early — chunk offset was not applied"
+ )
+
+
+ # ---------------------------------------------------------------------------
+ # 4. Nested spans resolved end-to-end
+ # ---------------------------------------------------------------------------
+
+
+ def test_nested_spans_end_to_end():
+ """
+ LLM emits an outer persName and inner forename / surname spans.
+ Both are resolved separately and then nested by the injector.
+ """
+ source = "He met John Smith today."
+ schema = _schema(
+ ("persName", "a person's full name"),
+ ("forename", "a forename"),
+ ("surname", "a surname"),
+ )
+
+ def call_fn(_):
+ return json.dumps([
+ {"element": "persName", "text": "John Smith", "context": "met John Smith today.", "attrs": {}},
+ {"element": "forename", "text": "John", "context": "met John Smith today.", "attrs": {}},
+ {"element": "surname", "text": "Smith", "context": "John Smith today.", "attrs": {}},
+ ])
+
+ result = _annotate(source, schema, call_fn)
+
+ assert "<persName>" in result.xml
+ assert "<forename>" in result.xml
+ assert "<surname>" in result.xml
+ # forename and surname must be inside persName
+ p_open = result.xml.index("<persName>")
+ p_close = result.xml.index("</persName>")
+ fn_open = result.xml.index("<forename>")
+ sn_close = result.xml.index("</surname>")
+ assert p_open < fn_open < sn_close < p_close
+ assert _strip_tags(result.xml) == source
+
+
+ # ---------------------------------------------------------------------------
+ # 5. Pre-existing XML preserved after annotation
+ # ---------------------------------------------------------------------------
+
+
+ def test_preexisting_xml_preserved():
+ """
+ Source already has markup (<note> tags). After annotation the original
+ markup must still be present alongside the new TEI annotations.
+ """
+ source = "He met <note>allegedly</note> John Smith yesterday."
+ schema = _schema(("persName", "a person's name"))
+
+ def call_fn(_):
+ # The LLM sees stripped plain text: "He met allegedly John Smith yesterday."
+ return json.dumps([
+ {
+ "element": "persName",
+ "text": "John Smith",
+ "context": "allegedly John Smith yesterday.",
+ "attrs": {},
+ }
+ ])
+
+ result = _annotate(source, schema, call_fn)
+
+ assert "<note>" in result.xml
+ assert "</note>" in result.xml
+ assert "<persName>John Smith</persName>" in result.xml
+ # Plain text must be unchanged
+ assert _strip_tags(result.xml) == _strip_tags(source)
+
+
+ # ---------------------------------------------------------------------------
+ # 6. Attributes preserved end-to-end
+ # ---------------------------------------------------------------------------
+
+
+ def test_attributes_preserved_end_to_end():
+ """Attribute values returned by the LLM must appear verbatim in the output tag."""
+ source = "The emperor Napoleon was defeated at Waterloo."
+ schema = _schema(("persName", "a person's name"))
+
+ def call_fn(_):
+ return json.dumps([
+ {
+ "element": "persName",
+ "text": "Napoleon",
+ "context": "emperor Napoleon was defeated",
+ "attrs": {"ref": "http://viaf.org/viaf/106964661", "cert": "high"},
+ }
+ ])
+
+ result = _annotate(source, schema, call_fn)
+ assert 'ref="http://viaf.org/viaf/106964661"' in result.xml
+ assert 'cert="high"' in result.xml
+ assert _strip_tags(result.xml) == source
+
+
+ # ---------------------------------------------------------------------------
+ # 7. Hallucinated context → span silently rejected
+ # ---------------------------------------------------------------------------
+
+
+ def test_hallucinated_context_span_rejected():
+ """
+ LLM returns a plausible-looking but non-existent context.
+ The resolver must reject the span; the source text is returned unmodified.
+ """
+ source = "Marie Curie discovered polonium."
+ schema = _schema(("persName", "a person's name"))
+
+ def call_fn(_):
+ return json.dumps([
+ {
+ "element": "persName",
+ "text": "Marie Curie",
+ "context": "Dr. Marie Curie discovered polonium", # "Dr. " not in source
+ "attrs": {},
+ }
+ ])
+
+ result = _annotate(source, schema, call_fn)
+ assert "<persName>" not in result.xml
+ assert result.xml == source
+
+
+ # ---------------------------------------------------------------------------
+ # 8. Fuzzy context match → span annotated and flagged
+ # ---------------------------------------------------------------------------
+
+
+ def test_fuzzy_context_match_flags_span():
+ """
+ A context with a single-character typo should still resolve via fuzzy
+ matching (score > 0.92) and be included with fuzzy_match=True.
+ """
+ source = "Galileo Galilei observed the moons of Jupiter."
+ schema = _schema(("persName", "a person's name"))
+
+ def call_fn(_):
+ return json.dumps([
+ {
+ "element": "persName",
+ "text": "Galileo Galilei",
+ # One character different from the source — should trigger fuzzy
+ "context": "Galileo Galilei observd the moons of Jupiter.",
+ "attrs": {},
+ }
+ ])
+
+ result = _annotate(source, schema, call_fn)
+ # The span should still be annotated
+ assert "<persName>Galileo Galilei</persName>" in result.xml
+ # And flagged as fuzzy
+ assert len(result.fuzzy_spans) == 1
+ assert result.fuzzy_spans[0].element == "persName"
+ assert _strip_tags(result.xml) == source
+
+
+ # ---------------------------------------------------------------------------
+ # 9. Source text never modified (plain-text invariant)
+ # ---------------------------------------------------------------------------
+
+
+ def test_plain_text_invariant_with_multiple_entities():
+ """Stripping all tags from the output must yield exactly the input text."""
+ source = (
+ "Leonardo da Vinci was born in Vinci, Tuscany, "
+ "and later worked in Milan and Florence."
+ )
+ schema = _schema(
+ ("persName", "a person's name"),
+ ("placeName", "a place name"),
+ )
+
+ def call_fn(_):
+ return json.dumps([
+ {"element": "persName", "text": "Leonardo da Vinci",
+ "context": "Leonardo da Vinci was born in Vinci", "attrs": {}},
+ {"element": "placeName", "text": "Vinci",
+ "context": "born in Vinci, Tuscany", "attrs": {}},
+ {"element": "placeName", "text": "Tuscany",
+ "context": "Vinci, Tuscany, and later", "attrs": {}},
+ {"element": "placeName", "text": "Milan",
+ "context": "later worked in Milan and Florence", "attrs": {}},
+ {"element": "placeName", "text": "Florence",
+ "context": "in Milan and Florence.", "attrs": {}},
+ ])
+
+ result = _annotate(source, schema, call_fn)
+ assert _strip_tags(result.xml) == source
+ assert result.xml.count("<placeName>") == 4
+ assert "<persName>Leonardo da Vinci</persName>" in result.xml
+
+
+ # ---------------------------------------------------------------------------
+ # 10. Real GLiNER model (requires HuggingFace download)
+ # ---------------------------------------------------------------------------
+
+
+ def test_pipeline_with_real_gliner():
+ """Full pipeline: real GLiNER pre-detection + mocked LLM call_fn."""
+ schema = _schema(("persName", "a person's name"))
+
+ def mock_llm(_: str) -> str:
+ return json.dumps([
+ {
+ "element": "persName",
+ "text": "Albert Einstein",
+ "context": "Albert Einstein was born",
+ "attrs": {},
+ }
+ ])
+
+ result = _annotate(
+ "Albert Einstein was born in Ulm in 1879.",
+ schema,
+ mock_llm,
+ gliner_model="numind/NuNER_Zero",
+ )
+ assert "persName" in result.xml
+ assert "Albert Einstein" in result.xml
+ assert result.xml.count("Albert Einstein") == 1
tests/test_builder.py ADDED
@@ -0,0 +1,83 @@
+ import pytest
+
+ from tei_annotator.inference.endpoint import EndpointCapability
+ from tei_annotator.models.schema import TEIElement, TEISchema
+ from tei_annotator.models.spans import SpanDescriptor
+ from tei_annotator.prompting.builder import build_prompt, make_correction_prompt
+
+
+ def _schema():
+ return TEISchema(
+ elements=[
+ TEIElement(tag="persName", description="a person's name", attributes=[]),
+ TEIElement(tag="placeName", description="a place name", attributes=[]),
+ ]
+ )
+
+
+ def test_text_gen_prompt_contains_json_instruction():
+ prompt = build_prompt("Some text.", _schema(), EndpointCapability.TEXT_GENERATION)
+ assert "JSON" in prompt or "json" in prompt
+
+
+ def test_text_gen_prompt_contains_example():
+ prompt = build_prompt("Some text.", _schema(), EndpointCapability.TEXT_GENERATION)
+ # The template shows an example output array
+ assert "persName" in prompt or "element" in prompt
+
+
+ def test_text_gen_prompt_contains_schema_elements():
+ prompt = build_prompt("Some text.", _schema(), EndpointCapability.TEXT_GENERATION)
+ assert "persName" in prompt
+ assert "placeName" in prompt
+
+
+ def test_text_gen_prompt_contains_source_text():
+ prompt = build_prompt("unique_source_42", _schema(), EndpointCapability.TEXT_GENERATION)
+ assert "unique_source_42" in prompt
+
+
+ def test_json_enforced_prompt_contains_schema():
+ prompt = build_prompt("text", _schema(), EndpointCapability.JSON_ENFORCED)
+ assert "persName" in prompt
+ assert "placeName" in prompt
+
+
+ def test_json_enforced_prompt_shorter_than_text_gen():
+ text_gen = build_prompt("text", _schema(), EndpointCapability.TEXT_GENERATION)
+ json_enf = build_prompt("text", _schema(), EndpointCapability.JSON_ENFORCED)
+ assert len(json_enf) < len(text_gen)
+
+
+ def test_candidates_appear_in_prompt():
+ candidates = [
+ SpanDescriptor(element="persName", text="John", context="said John went", attrs={})
+ ]
+ prompt = build_prompt(
+ "said John went.",
+ _schema(),
+ EndpointCapability.TEXT_GENERATION,
+ candidates=candidates,
+ )
+ assert "John" in prompt
+
+
+ def test_no_candidate_section_when_none():
+ prompt = build_prompt("text", _schema(), EndpointCapability.TEXT_GENERATION, candidates=None)
+ assert "Pre-detected" not in prompt
+
+
+ def test_empty_candidates_list_no_section():
+ prompt = build_prompt("text", _schema(), EndpointCapability.TEXT_GENERATION, candidates=[])
+ assert "Pre-detected" not in prompt
+
+
+ def test_extraction_raises():
+ with pytest.raises(ValueError):
+ build_prompt("text", _schema(), EndpointCapability.EXTRACTION)
+
+
+ def test_correction_prompt_contains_original_response():
+ prompt = make_correction_prompt("bad_json_here", "JSONDecodeError")
+ assert "bad_json_here" in prompt
+ assert "JSONDecodeError" in prompt
tests/test_chunker.py ADDED
@@ -0,0 +1,79 @@
+ from tei_annotator.chunking.chunker import Chunk, chunk_text
+
+
+ def test_short_text_single_chunk():
+ text = "Short text."
+ chunks = chunk_text(text, chunk_size=1500)
+ assert len(chunks) == 1
+ assert chunks[0].text == text
+ assert chunks[0].start_offset == 0
+
+
+ def test_long_text_multiple_chunks():
+ text = "word " * 400 # 2000 chars
+ chunks = chunk_text(text, chunk_size=500, overlap=50)
+ assert len(chunks) > 1
+ for i, chunk in enumerate(chunks):
+ assert chunk.start_offset >= 0
+ if i > 0:
+ assert chunk.start_offset > chunks[i - 1].start_offset
+
+
+ def test_chunk_start_offsets_correct():
+ """Every chunk's text must match a slice of the original at start_offset."""
+ text = "hello world " * 200
+ chunks = chunk_text(text, chunk_size=300, overlap=50)
+ for chunk in chunks:
+ assert (
+ text[chunk.start_offset : chunk.start_offset + len(chunk.text)]
+ == chunk.text
+ )
+
+
+ def test_long_text_covers_all_characters():
+ """Union of all chunk ranges must cover the entire source text."""
+ text = "abcdefghij" * 200 # 2000 chars
+ chunks = chunk_text(text, chunk_size=400, overlap=80)
+ covered: set[int] = set()
+ for chunk in chunks:
+ for j in range(chunk.start_offset, chunk.start_offset + len(chunk.text)):
+ covered.add(j)
+ assert covered == set(range(len(text)))
+
+
+ def test_chunk_boundary_does_not_split_xml_tag():
+ """A chunk boundary must never fall inside an XML tag."""
+ # Place a tag that straddles the natural 500-char boundary
+ prefix = "a" * 495
+ tag = "<someElement>"
+ suffix = "b" * 600
+ text = prefix + tag + suffix
+
+ chunks = chunk_text(text, chunk_size=500, overlap=0)
+
+ for chunk in chunks:
+ # Each chunk must be self-consistent XML-tag-wise:
+ # count of '<' must equal count of '>' within the chunk text
+ # (a split tag would have an unbalanced '<' or '>')
+ assert chunk.text.count("<") == chunk.text.count(">"), (
+ f"Chunk at offset {chunk.start_offset} has unbalanced angle brackets: "
+ f"{chunk.text!r}"
+ )
+
+
+ def test_exact_chunk_size_no_overflow():
+ text = "x" * 1500
+ chunks = chunk_text(text, chunk_size=1500, overlap=0)
+ assert len(chunks) == 1
+ assert chunks[0].text == text
+
+
+ def test_overlap_produces_repeated_content():
+ """With positive overlap, the end of chunk N overlaps with the start of chunk N+1."""
+ text = "word " * 300 # 1500 chars
+ chunks = chunk_text(text, chunk_size=500, overlap=100)
+ assert len(chunks) >= 2
+ # The end of chunk 0 and the start of chunk 1 must share content
+ c0_end = chunks[0].start_offset + len(chunks[0].text)
+ c1_start = chunks[1].start_offset
+ assert c1_start < c0_end, "Expected overlapping content between consecutive chunks"
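Taken together, these chunker tests pin down a clear contract: accurate `start_offset`s, full character coverage, configurable overlap, and no boundary inside an XML tag. A minimal sketch that would satisfy them — not the library's actual implementation, and assuming tags are shorter than `chunk_size` — looks like this:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_offset: int


def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 0) -> list[Chunk]:
    chunks: list[Chunk] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Don't cut inside an XML tag: if the window ends after an unclosed
        # '<', pull the boundary back to just before that tag.
        lt = text.rfind("<", start, end)
        if end < len(text) and lt > start and text.find(">", lt, end) == -1:
            end = lt
        chunks.append(Chunk(text=text[start:end], start_offset=start))
        if end >= len(text):
            break
        # Step forward, keeping `overlap` chars; always make progress
        start = max(end - overlap, start + 1)
    return chunks
```

Each chunk records the offset of its first character, which is what lets spans resolved inside a chunk be mapped back to absolute positions in the source (the invariant checked by the "chunk offset was not applied" assertion in the e2e test).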
tests/test_injector.py ADDED
@@ -0,0 +1,102 @@
+ import pytest
+
+ from tei_annotator.models.spans import ResolvedSpan
+ from tei_annotator.postprocessing.injector import _build_nesting_tree, inject_xml
+
+
+ def _span(element, start, end, attrs=None):
+ return ResolvedSpan(element=element, start=start, end=end, attrs=attrs or {})
+
+
+ def test_no_spans_returns_source():
+ assert inject_xml("hello world", []) == "hello world"
+
+
+ def test_single_span():
+ source = "He said John Smith yesterday."
+ # "John Smith" = [8:18]
+ span = _span("persName", 8, 18)
+ result = inject_xml(source, [span])
+ assert result == "He said <persName>John Smith</persName> yesterday."
+
+
+ def test_two_non_overlapping_spans():
+ source = "John met Mary."
+ # "John" = [0:4], "Mary" = [9:13]
+ spans = [_span("persName", 0, 4), _span("persName", 9, 13)]
+ result = inject_xml(source, spans)
+ assert result == "<persName>John</persName> met <persName>Mary</persName>."
+
+
+ def test_nested_spans():
+ # "Dr. Smith" = outer, "Dr." = inner
+ source = "He met Dr. Smith today."
+ # "Dr. Smith" = [7:16], "Dr." = [7:10]
+ spans = [_span("persName", 7, 16), _span("roleName", 7, 10)]
+ result = inject_xml(source, spans)
+ assert "<persName>" in result
+ assert "<roleName>" in result
+ # roleName must appear inside persName
+ assert result.index("<roleName>") > result.index("<persName>")
+ assert result.index("</roleName>") < result.index("</persName>")
+ # Text is split by the inner tag; check exact output structure
+ assert result == "He met <persName><roleName>Dr.</roleName> Smith</persName> today."
+
+
+ def test_attrs_rendered_in_tag():
+ source = "Visit Paris."
+ span = _span("placeName", 6, 11, {"ref": "http://example.com/paris"})
+ result = inject_xml(source, [span])
+ assert 'ref="http://example.com/paris"' in result
+ assert "<placeName" in result
+ assert "Paris" in result
+
+
+ def test_span_at_start_of_text():
+ source = "John went home."
+ span = _span("persName", 0, 4)
+ result = inject_xml(source, [span])
+ assert result.startswith("<persName>John</persName>")
+
+
+ def test_span_covering_entire_text():
+ source = "John Smith"
+ span = _span("persName", 0, 10)
+ result = inject_xml(source, [span])
+ assert result == "<persName>John Smith</persName>"
+
+
+ def test_span_at_end_of_text():
+ source = "He visited Paris"
+ span = _span("placeName", 11, 16)
+ result = inject_xml(source, [span])
+ assert result.endswith("<placeName>Paris</placeName>")
+
+
+ def test_overlapping_spans_warns_and_skips():
+ source = "Hello World"
+ # Partial overlap: [0,7] and [5,11]
+ spans = [_span("a", 0, 7), _span("b", 5, 11)]
+ with pytest.warns(UserWarning, match="Overlapping"):
+ result = inject_xml(source, spans)
+ # Only the first span should be present
+ assert "<a>" in result
+ assert "<b>" not in result
+
+
+ def test_build_nesting_tree_simple():
+ outer = _span("persName", 0, 20)
+ inner = _span("roleName", 0, 5)
+ roots = _build_nesting_tree([outer, inner])
+ assert len(roots) == 1
+ assert roots[0].element == "persName"
+ assert len(roots[0].children) == 1
+ assert roots[0].children[0].element == "roleName"
+
+
+ def test_build_nesting_tree_siblings():
+ a = _span("a", 0, 5)
+ b = _span("b", 6, 10)
+ roots = _build_nesting_tree([a, b])
+ assert len(roots) == 2
+ assert all(len(r.children) == 0 for r in roots)
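For the non-nested case these tests exercise, the core injection trick is simply to splice tags in from the rightmost span first, so earlier character offsets stay valid. A simplified stand-in (`inject_flat` is a hypothetical name; the real `inject_xml` additionally handles nesting, attributes, and overlap warnings):

```python
def inject_flat(source: str, spans: list[tuple[str, int, int]]) -> str:
    """spans: (element, start, end) tuples, assumed non-overlapping and non-nested."""
    out = source
    # Process right-to-left so insertions never shift offsets still to be used
    for element, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        out = out[:start] + f"<{element}>" + out[start:end] + f"</{element}>" + out[end:]
    return out
```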
tests/test_parser.py ADDED
@@ -0,0 +1,100 @@
+ import json
+
+ import pytest
+
+ from tei_annotator.postprocessing.parser import _strip_fences, parse_response
+
+ VALID_JSON = json.dumps(
+ [{"element": "persName", "text": "John Smith", "context": "said John Smith", "attrs": {}}]
+ )
+
+
+ # ---- _strip_fences ----------------------------------------------------------
+
+
+ def test_strip_fences_json_lang():
+ fenced = f"```json\n{VALID_JSON}\n```"
+ assert _strip_fences(fenced) == VALID_JSON
+
+
+ def test_strip_fences_no_lang():
+ fenced = f"```\n{VALID_JSON}\n```"
+ assert _strip_fences(fenced) == VALID_JSON
+
+
+ def test_strip_fences_no_fences():
+ assert _strip_fences(VALID_JSON) == VALID_JSON
+
+
+ def test_strip_fences_with_preamble():
+ text = f"Here is the JSON:\n```json\n{VALID_JSON}\n```"
+ assert _strip_fences(text) == VALID_JSON
+
+
+ # ---- parse_response ---------------------------------------------------------
+
+
+ def test_valid_json_parsed_directly():
+ spans = parse_response(VALID_JSON)
+ assert len(spans) == 1
+ assert spans[0].element == "persName"
+ assert spans[0].text == "John Smith"
+
+
+ def test_markdown_fenced_json_parsed():
+ spans = parse_response(f"```json\n{VALID_JSON}\n```")
+ assert len(spans) == 1
+
+
+ def test_invalid_json_no_retry_raises():
+ with pytest.raises(ValueError):
+ parse_response("not json at all")
+
+
+ def test_retry_triggered_on_first_failure():
+ call_count = [0]
+
+ def retry_fn(prompt: str) -> str:
+ call_count[0] += 1
+ return VALID_JSON
+
+ def correction_fn(bad: str, err: str) -> str:
+ return f"fix: {bad}"
+
+ spans = parse_response("bad json", call_fn=retry_fn, make_correction_prompt=correction_fn)
+ assert call_count[0] == 1
+ assert len(spans) == 1
+
+
+ def test_retry_still_invalid_raises():
+ def retry_fn(prompt: str) -> str:
+ return "still bad"
+
+ def correction_fn(bad: str, err: str) -> str:
+ return "fix it"
+
+ with pytest.raises(ValueError):
+ parse_response("bad", call_fn=retry_fn, make_correction_prompt=correction_fn)
+
+
+ def test_missing_fields_items_skipped():
+ raw = json.dumps(
+ [
+ {"element": "persName"}, # missing text and context → skip
+ {"element": "persName", "text": "John", "context": "John went"}, # valid
+ ]
+ )
+ spans = parse_response(raw)
+ assert len(spans) == 1
+ assert spans[0].text == "John"
+
+
+ def test_non_list_response_raises():
+ with pytest.raises(ValueError):
+ parse_response(json.dumps({"element": "persName"}))
+
+
+ def test_attrs_defaults_to_empty_dict():
+ raw = json.dumps([{"element": "persName", "text": "x", "context": "x"}])
+ spans = parse_response(raw)
+ assert spans[0].attrs == {}
tests/test_pipeline.py ADDED
@@ -0,0 +1,137 @@
+ import json
+
+ import pytest
+
+ from tei_annotator.inference.endpoint import EndpointCapability, EndpointConfig
+ from tei_annotator.models.schema import TEIElement, TEISchema
+ from tei_annotator.pipeline import annotate
+
+
+ def _schema():
+ return TEISchema(
+ elements=[
+ TEIElement(
+ tag="persName",
+ description="a person's name",
+ allowed_children=[],
+ attributes=[],
+ )
+ ]
+ )
+
+
+ def _mock_call_fn(prompt: str) -> str:
+ return json.dumps(
+ [
+ {
+ "element": "persName",
+ "text": "John Smith",
+ "context": "said John Smith yesterday",
+ "attrs": {},
+ }
+ ]
+ )
+
+
+ def test_annotate_smoke():
+ result = annotate(
+ text="He said John Smith yesterday.",
+ schema=_schema(),
+ endpoint=EndpointConfig(
+ capability=EndpointCapability.JSON_ENFORCED,
+ call_fn=_mock_call_fn,
+ ),
+ gliner_model=None,
+ )
+ assert "persName" in result.xml
+ assert "John Smith" in result.xml
+ assert result.xml.count("John Smith") == 1 # text not duplicated
+
+
+ def test_annotate_empty_response():
+ result = annotate(
+ text="No entities here.",
+ schema=_schema(),
+ endpoint=EndpointConfig(
+ capability=EndpointCapability.JSON_ENFORCED,
+ call_fn=lambda _: "[]",
+ ),
+ gliner_model=None,
+ )
+ assert result.xml == "No entities here."
+ assert result.fuzzy_spans == []
+
+
+ def test_annotate_preserves_existing_xml():
+ # Pre-existing <b> tag must survive
+ def call_fn(prompt: str) -> str:
+ return json.dumps(
+ [
+ {
+ "element": "persName",
+ "text": "John Smith",
+ "context": "said John Smith yesterday",
+ "attrs": {},
+ }
+ ]
+ )
+
+ result = annotate(
+ text="He said <b>John Smith</b> yesterday.",
+ schema=_schema(),
+ endpoint=EndpointConfig(
+ capability=EndpointCapability.JSON_ENFORCED, call_fn=call_fn
+ ),
+ gliner_model=None,
+ )
+ assert "<b>" in result.xml
+ assert "John Smith" in result.xml
+
+
+ def test_annotate_fuzzy_spans_surfaced():
+ """Spans flagged as fuzzy appear in AnnotationResult.fuzzy_spans."""
+ # We cannot force a fuzzy match easily without mocking internals,
+ # so we just verify the field exists and is a list.
+ result = annotate(
+ text="He said John Smith yesterday.",
+ schema=_schema(),
+ endpoint=EndpointConfig(
+ capability=EndpointCapability.JSON_ENFORCED,
+ call_fn=_mock_call_fn,
+ ),
+ gliner_model=None,
+ )
+ assert isinstance(result.fuzzy_spans, list)
+
+
+ def test_annotate_text_generation_endpoint():
+ """TEXT_GENERATION capability path (with retry logic enabled) works end-to-end."""
+ result = annotate(
+ text="He said John Smith yesterday.",
+ schema=_schema(),
+ endpoint=EndpointConfig(
+ capability=EndpointCapability.TEXT_GENERATION,
+ call_fn=_mock_call_fn,
+ ),
+ gliner_model=None,
+ )
+ assert "persName" in result.xml
+
+
+ def test_annotate_no_text_modification():
+ """The original text characters must all appear in the output (no hallucination)."""
+ original = "He said John Smith yesterday."
+ result = annotate(
+ text=original,
+ schema=_schema(),
+ endpoint=EndpointConfig(
+ capability=EndpointCapability.JSON_ENFORCED,
+ call_fn=_mock_call_fn,
+ ),
+ gliner_model=None,
+ )
+ # Strip all tags from output; plain text should equal original
+ import re
+
+ plain = re.sub(r"<[^>]+>", "", result.xml)
+ assert plain == original
tests/test_resolver.py ADDED
@@ -0,0 +1,65 @@
+ import pytest
+
+ from tei_annotator.models.spans import SpanDescriptor
+ from tei_annotator.postprocessing.resolver import resolve_spans
+
+ SOURCE = "He said John Smith yesterday, and John Smith agreed."
+
+
+ def _span(element, text, context, attrs=None):
+ return SpanDescriptor(element=element, text=text, context=context, attrs=attrs or {})
+
+
+ def test_exact_context_match():
+ span = _span("persName", "John Smith", "said John Smith yesterday")
+ resolved = resolve_spans(SOURCE, [span])
+ assert len(resolved) == 1
+ rs = resolved[0]
+ assert rs.start == SOURCE.index("John Smith")
+ assert rs.end == rs.start + len("John Smith")
+ assert not rs.fuzzy_match
+
+
+ def test_context_not_found_rejected():
+ span = _span("persName", "John Smith", "this context does not exist xyz987")
+ assert resolve_spans(SOURCE, [span]) == []
+
+
+ def test_text_not_in_context_window_rejected():
+ span = _span("persName", "Jane Doe", "said John Smith yesterday")
+ assert resolve_spans(SOURCE, [span]) == []
+
+
+ def test_source_slice_verified():
+ span = _span("persName", "John Smith", "said John Smith yesterday")
+ resolved = resolve_spans(SOURCE, [span])
+ assert len(resolved) == 1
+ rs = resolved[0]
+ assert SOURCE[rs.start : rs.end] == "John Smith"
+
+
+ def test_attrs_preserved():
+ span = _span("persName", "John Smith", "said John Smith yesterday", {"ref": "#js"})
+ resolved = resolve_spans(SOURCE, [span])
+ assert len(resolved) == 1
+ assert resolved[0].attrs == {"ref": "#js"}
+
+
+ def test_multiple_spans_resolved():
+ spans = [
+ _span("persName", "John Smith", "He said John Smith yesterday"),
+ _span("persName", "John Smith", "and John Smith agreed"),
+ ]
+ resolved = resolve_spans(SOURCE, spans)
+ assert len(resolved) == 2
+ assert resolved[0].start != resolved[1].start
+
+
+ def test_empty_span_list():
+ assert resolve_spans(SOURCE, []) == []
+
+
+ def test_children_start_empty():
+ span = _span("persName", "John Smith", "said John Smith yesterday")
+ resolved = resolve_spans(SOURCE, [span])
+ assert resolved[0].children == []
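The resolution strategy these tests pin down is: anchor on the `context` string first (so repeated entity text like the two "John Smith" occurrences lands at the right occurrence), then locate the span `text` inside that window, rejecting spans whose context never matches. A rough sketch of that logic, with `resolve_offset` as a hypothetical name and stdlib `difflib` standing in for the rapidfuzz scorer the project depends on:

```python
import difflib


def resolve_offset(source: str, text: str, context: str, threshold: float = 0.92):
    """Return (start, end, fuzzy) for `text` located via its `context`, or None."""
    pos = source.find(context)
    fuzzy = False
    if pos == -1:
        # Fuzzy fallback: score every window of the context's length
        best_pos, best_score = -1, 0.0
        for i in range(max(len(source) - len(context), 0) + 1):
            score = difflib.SequenceMatcher(
                None, source[i : i + len(context)], context
            ).ratio()
            if score > best_score:
                best_pos, best_score = i, score
        if best_score <= threshold:
            return None  # hallucinated context: reject the span entirely
        pos, fuzzy = best_pos, True
    # Locate the verbatim text inside (or just after) the matched window
    inner = source.find(text, pos, pos + len(context) + len(text))
    if inner == -1:
        return None
    return inner, inner + len(text), fuzzy
```

Returning `None` rather than guessing is what guarantees the pipeline's "source text is never modified" invariant: an unverifiable span simply produces no tag.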
tests/test_validator.py ADDED
@@ -0,0 +1,82 @@
+ from tei_annotator.models.schema import TEIAttribute, TEIElement, TEISchema
+ from tei_annotator.models.spans import ResolvedSpan
+ from tei_annotator.postprocessing.validator import validate_spans
+
+ SOURCE = "He met John Smith."
+
+
+ def _schema():
+ return TEISchema(
+ elements=[
+ TEIElement(
+ tag="persName",
+ description="a person's name",
+ attributes=[
+ TEIAttribute(name="ref", description="reference URI"),
+ TEIAttribute(
+ name="cert",
+ description="certainty",
+ allowed_values=["high", "low"],
+ ),
+ ],
+ )
+ ]
+ )
+
+
+ def _span(element, start, end, attrs=None):
+ return ResolvedSpan(element=element, start=start, end=end, attrs=attrs or {})
+
+
+ # SOURCE: "He met John Smith."
+ # positions: H=0 e=1 ' '=2 m=3 e=4 t=5 ' '=6 J=7 o=8 h=9 n=10 ' '=11 S=12 m=13 i=14 t=15 h=16 .=17
+ # "John Smith" => [7:17]
+
+
+ def test_valid_span_passes():
+ result = validate_spans([_span("persName", 7, 17)], _schema(), SOURCE)
+ assert len(result) == 1
+
+
+ def test_unknown_element_rejected():
+ result = validate_spans([_span("orgName", 7, 17)], _schema(), SOURCE)
+ assert len(result) == 0
+
+
+ def test_unknown_attribute_rejected():
+ result = validate_spans(
+ [_span("persName", 7, 17, {"unknown_attr": "val"})], _schema(), SOURCE
+ )
+ assert len(result) == 0
+
+
+ def test_invalid_attribute_value_rejected():
+ result = validate_spans(
+ [_span("persName", 7, 17, {"cert": "medium"})], _schema(), SOURCE
+ )
+ assert len(result) == 0
+
+
+ def test_valid_constrained_attribute_passes():
+ result = validate_spans(
+ [_span("persName", 7, 17, {"cert": "high"})], _schema(), SOURCE
+ )
+ assert len(result) == 1
+
+
+ def test_free_string_attribute_passes():
+ result = validate_spans(
+ [_span("persName", 7, 17, {"ref": "http://example.com/p/1"})], _schema(), SOURCE
+ )
+ assert len(result) == 1
+
+
+ def test_out_of_bounds_span_rejected():
+ result = validate_spans([_span("persName", -1, 5)], _schema(), SOURCE)
+ assert len(result) == 0
+ result2 = validate_spans([_span("persName", 5, 200)], _schema(), SOURCE)
+ assert len(result2) == 0
+
+
+ def test_empty_span_list():
+ assert validate_spans([], _schema(), SOURCE) == []
uv.lock ADDED
The diff for this file is too large to render. See raw diff