cmboulanger committed
Commit 790b4e5 · 1 Parent(s): 484306e

Update implementation plan

Files changed (1):
  1. implementation-plan.md +158 -45
implementation-plan.md CHANGED
@@ -5,21 +5,12 @@ Prompt:
  > Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
  >
  > **Inputs:**
  > - A text string to annotate (may already contain partial XML)
  > - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
  > - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
  > - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
- >
- > **Pipeline:**
- > 1. Check for local GLiNER installation and print setup instructions if missing
- > 2. Run a local GLiNER model as a first-pass span detector, mapping TEI element descriptions to GLiNER labels
- > 3. Chunk long texts with overlap, tracking global character offsets
- > 4. Assemble a prompt using the GLiNER candidates, TEI schema context, and JSON output instructions tailored to the endpoint capability
- > 5. Parse and validate the returned span manifest — each span has `element`, `text`, `context` (surrounding text for position resolution), and `attrs`
- > 6. Resolve spans to character offsets by searching for the context string in the source, then locating the span text within it — reject any span where `source[start:end] != span.text`
- > 7. Inject tags deterministically into the original source text, handling nesting
- > 8. Return the annotated XML plus a list of fuzzy-matched spans flagged for human review
- >
  > The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.
 
 ## Package Structure
@@ -31,8 +22,7 @@ tei_annotator/
  │ ├── schema.py # TEI element/attribute data structures
  │ └── spans.py # Span manifest data structures
  ├── detection/
- │ ├── gliner_detector.py # Local GLiNER first-pass span detection
- │ └── setup.py # GLiNER availability check + install instructions
  ├── chunking/
  │ └── chunker.py # Overlap-aware text chunker, XML-safe boundaries
  ├── prompting/
@@ -82,7 +72,8 @@ class SpanDescriptor:
      text: str
      context: str  # must contain text as substring
      attrs: dict[str, str]
-     children: list["SpanDescriptor"]  # for nested annotations
      confidence: float | None = None  # passed through from GLiNER

  @dataclass
@@ -97,28 +88,19 @@ class ResolvedSpan:

 ---

- ## GLiNER Setup (`detection/setup.py`)
-
- ```python
- def check_gliner() -> bool:
-     try:
-         import gliner
-         return True
-     except ImportError:
-         print("""
- GLiNER is not installed. To install:
-
-     pip install gliner
-
- Recommended models (downloaded automatically on first use):
-     urchade/gliner_medium-v2.1               # balanced, Apache 2.0
-     numind/NuNER_Zero                        # stronger multi-word entities, MIT
-     knowledgator/gliner-multitask-large-v0.5 # adds relation extraction
-
- Models run on CPU; no GPU required.
- """)
-         return False
- ```

 ---
@@ -145,31 +127,36 @@ The `call_fn` injection means the library is agnostic about whether the caller i
 ## Pipeline (`pipeline.py`)

 ```python
 def annotate(
     text: str,                 # may contain existing XML tags
     schema: TEISchema,         # subset of TEI elements in scope
     endpoint: EndpointConfig,  # injected inference dependency
-    gliner_model: str = "numind/NuNER_Zero",
     chunk_size: int = 1500,    # chars
     chunk_overlap: int = 200,
- ) -> str:  # returns annotated XML string
 ```

 ### Execution Flow

 ```
 1. SETUP
-    check_gliner() → prompt user to install if missing
     strip existing XML tags from text for processing,
     preserve them as a restoration map for final merge

- 2. GLINER PASS
     map TEISchema elements → flat label list for GLiNER
     e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
     chunk text if len(text) > chunk_size (with overlap)
     run gliner.predict_entities() on each chunk
     merge cross-chunk duplicates by span text + context overlap
     output: list[SpanDescriptor] with text + context + element + confidence

 3. PROMPT ASSEMBLY
     select template based on EndpointCapability:
@@ -190,7 +177,9 @@ def annotate(
  a. Parse
      JSON_ENFORCED/EXTRACTION: parse directly
      TEXT_GENERATION: strip markdown fences, parse JSON,
-         retry once with correction prompt on failure

  b. Resolve (resolver.py)
      for each SpanDescriptor:
@@ -201,11 +190,12 @@ def annotate(

  c. Validate (validator.py)
      reject spans where text not in source
-     check nesting: children must be within parent bounds
      check attributes against TEISchema allowed values
      check element is in schema scope

  d. Inject (injector.py)
      sort ResolvedSpans by start offset, handle nesting depth-first
      insert tags into a copy of the original source string
      restore previously existing XML tags from step 1
@@ -215,8 +205,10 @@ def annotate(
      optionally validate against full TEI RelaxNG schema via lxml

  6. RETURN
-     annotated XML string
-     + list of fuzzy-matched spans flagged for human review
 ```

 ---
@@ -224,6 +216,127 @@ def annotate(
 ## Key Design Constraints

 - The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
- - GLiNER is a **pre-filter**, not the authority. The LLM can reject, correct, or add to its candidates. GLiNER's value is positional reliability; the LLM's value is schema reasoning.
 - `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
- - Fuzzy-matched spans are **surfaced, not silently accepted** — the return value includes a reviewable list alongside the XML.
 > Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
 >
 > **Inputs:**
+ >
 > - A text string to annotate (may already contain partial XML)
 > - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
 > - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
 > - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
+
 > The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.

 ## Package Structure
 
 │ ├── schema.py # TEI element/attribute data structures
 │ └── spans.py # Span manifest data structures
 ├── detection/
+ │ └── gliner_detector.py # Optional local GLiNER first-pass span detection
 ├── chunking/
 │ └── chunker.py # Overlap-aware text chunker, XML-safe boundaries
 ├── prompting/
 
     text: str
     context: str  # must contain text as substring
     attrs: dict[str, str]
+    # always flat — nesting is inferred from offset containment in resolver/injector,
+    # not emitted by the model (models produce unreliable nested trees)
     confidence: float | None = None  # passed through from GLiNER

 @dataclass
 

 ---

+ ## GLiNER Dependency (`detection/gliner_detector.py`)
+
+ `gliner` is a regular package dependency declared in `pyproject.toml` and installed via `uv add gliner`. No manual setup step is needed.
+
+ Model weights are fetched from HuggingFace Hub automatically on first use of `GLiNER.from_pretrained(model_id)` and cached in `~/.cache/huggingface/`. If the import fails at runtime (e.g. the optional extra was not installed), the module raises a standard `ImportError` with a clear message — no wrapper needed.
+
+ Recommended models (specified as `gliner_model` parameter):
+
+ - `urchade/gliner_medium-v2.1` — balanced, Apache 2.0
+ - `numind/NuNER_Zero` — stronger multi-word entities, MIT (default)
+ - `knowledgator/gliner-multitask-large-v0.5` — adds relation extraction
+
+ All models run on CPU; no GPU required.
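Because the prediction call is an injectable boundary, the label-mapping half of the detector can be exercised without model weights. A sketch (the `detect_candidates` helper and `Candidate` shape are hypothetical; `predict` stands in for `GLiNER.predict_entities`, whose entity dicts carry `text`, `label`, `start`, `end`, and `score`):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    element: str       # TEI element name, e.g. "persName"
    text: str          # surface string found by the detector
    context: str       # surrounding text, kept for later offset resolution
    confidence: float

def detect_candidates(
    text: str,
    labels: dict[str, str],  # TEI element -> description used as GLiNER label
    predict: Callable[[str, list[str]], list[dict]],  # injected model call
    context_pad: int = 30,
) -> list[Candidate]:
    """Run the injected predictor and wrap hits as flat candidates."""
    descriptions = list(labels.values())
    desc_to_element = {v: k for k, v in labels.items()}
    out = []
    for ent in predict(text, descriptions):
        start, end = ent["start"], ent["end"]
        # context = span plus padding on both sides, clipped to the text
        context = text[max(0, start - context_pad):end + context_pad]
        out.append(Candidate(
            element=desc_to_element[ent["label"]],
            text=ent["text"],
            context=context,
            confidence=ent.get("score", 0.0),
        ))
    return out
```

With `predict` injected, unit tests pass a stub and integration tests pass the real model method.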

 ---
 
 ## Pipeline (`pipeline.py`)

 ```python
+ @dataclass
+ class AnnotationResult:
+     xml: str                         # annotated XML string
+     fuzzy_spans: list[ResolvedSpan]  # spans flagged for human review
+
 def annotate(
     text: str,                 # may contain existing XML tags
     schema: TEISchema,         # subset of TEI elements in scope
     endpoint: EndpointConfig,  # injected inference dependency
+    gliner_model: str | None = "numind/NuNER_Zero",  # None disables GLiNER pass
     chunk_size: int = 1500,    # chars
     chunk_overlap: int = 200,
+ ) -> AnnotationResult:
 ```
144
 
145
  ### Execution Flow
146
 
147
  ```
148
  1. SETUP
 
149
  strip existing XML tags from text for processing,
150
  preserve them as a restoration map for final merge
151
 
152
+ 2. GLINER PASS (skipped if gliner_model=None, endpoint is EXTRACTION, or text is short)
153
  map TEISchema elements β†’ flat label list for GLiNER
154
  e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
155
  chunk text if len(text) > chunk_size (with overlap)
156
  run gliner.predict_entities() on each chunk
157
  merge cross-chunk duplicates by span text + context overlap
158
  output: list[SpanDescriptor] with text + context + element + confidence
159
+ (GLiNER is a pre-filter only; the LLM may reject, correct, or extend its candidates)
160
 
161
  3. PROMPT ASSEMBLY
162
  select template based on EndpointCapability:
 
 a. Parse
     JSON_ENFORCED/EXTRACTION: parse directly
     TEXT_GENERATION: strip markdown fences, parse JSON,
+        on failure: retry once with a correction prompt that includes
+        the original (bad) response and the parse error message,
+        so the model can self-correct rather than starting from scratch

 b. Resolve (resolver.py)
     for each SpanDescriptor:
 

 c. Validate (validator.py)
     reject spans where text not in source
     check attributes against TEISchema allowed values
     check element is in schema scope

 d. Inject (injector.py)
+    infer nesting from offset containment (child ⊂ parent by [start, end] bounds)
+    check inferred nesting: children must be within parent bounds
     sort ResolvedSpans by start offset, handle nesting depth-first
     insert tags into a copy of the original source string
     restore previously existing XML tags from step 1
 
     optionally validate against full TEI RelaxNG schema via lxml

 6. RETURN
+    AnnotationResult(
+        xml=annotated_xml_string,
+        fuzzy_spans=list_of_flagged_resolved_spans,
+    )
 ```

 ---
 
 ## Key Design Constraints

 - The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
+ - The **GLiNER pass is optional** (`gliner_model=None` disables it). It is most useful for long texts with `TEXT_GENERATION` endpoints; it is skipped automatically for `EXTRACTION` endpoints or short inputs. When enabled, GLiNER is a pre-filter only — the LLM may reject, correct, or extend its candidates.
+ - **Span nesting is inferred from offsets**, never emitted by the model. `SpanDescriptor` is always flat; `ResolvedSpan.children` is populated by the injector from containment relationships.
 - `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
+ - Fuzzy-matched spans are **surfaced, not silently accepted** — `AnnotationResult.fuzzy_spans` provides a reviewable list alongside the XML.
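The `call_fn` constraint can be seen in a small wrapper: anything that preserves `(str) -> str` composes with the pipeline unchanged. A sketch (the `with_logging` helper is illustrative, not part of the plan):

```python
from typing import Callable

def with_logging(call_fn: Callable[[str], str], log: list[str]) -> Callable[[str], str]:
    """Wrap any (str) -> str endpoint so each exchange is recorded."""
    def wrapped(prompt: str) -> str:
        response = call_fn(prompt)
        log.append(f"{len(prompt)} chars in, {len(response)} chars out")
        return response
    return wrapped

# a mock endpoint for tests is just a lambda -- no framework needed
mock_fn = lambda prompt: "[]"
```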
+
+ ---
+
+ ## Testing Strategy
+
+ ### Mocking philosophy
+
+ **Always mock `call_fn` and the GLiNER detector in unit tests.** Do not use a real GLiNER model as a substitute for a remote LLM endpoint — GLiNER is a span-labelling model that cannot produce JSON responses; it cannot exercise the parse/resolve/inject pipeline. Using a real model also makes tests slow (~seconds per inference on CPU), non-deterministic across versions and hardware, and dependent on a 400MB+ download.
+
+ The `call_fn: (str) -> str` design makes mocking trivial — a lambda returning a hardcoded JSON string is sufficient. No mock framework is needed.
+
+ ### Test layers
+
+ **Layer 1 — Unit tests** (always run, <1s total, fully mocked):
+
+ ```
+ tests/
+ ├── test_chunker.py    # chunker unit tests
+ ├── test_resolver.py   # resolver unit tests
+ ├── test_validator.py  # validator unit tests
+ ├── test_injector.py   # injector unit tests
+ ├── test_builder.py    # prompt builder unit tests
+ ├── test_parser.py     # JSON parse + retry unit tests
+ └── test_pipeline.py   # full pipeline smoke test (mocked call_fn + GLiNER)
+ ```
+
+ **Layer 2 — Integration tests** (opt-in, gated by `pytest -m integration`):
+
+ ```
+ tests/integration/
+ ├── test_gliner_detector.py  # real GLiNER model, real HuggingFace download
+ └── test_pipeline_e2e.py     # full annotate() with real GLiNER + mocked call_fn
+ ```
+
+ Integration tests are excluded from CI by default via `pyproject.toml`:
+
+ ```toml
+ [tool.pytest.ini_options]
+ addopts = "-m 'not integration'"
+ ```
+
+ ### TDD cycle (red → green → refactor)
+
+ Each module is written test-first: write a failing test, implement the minimum code to pass, refactor.
+
+ ### Key test cases per module
+
+ **`chunker.py`**
+
+ - Short text below `chunk_size` → single chunk, offset 0
+ - Long text → multiple chunks with correct `start_offset` per chunk
+ - Span exactly at a chunk boundary → appears in both chunks with correct global offset
+ - Input with existing XML tags → chunk boundaries never split a tag
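The offset behaviour these cases pin down can be sketched as follows (simplified; the real chunker must additionally keep boundaries XML-safe):

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[tuple[int, str]]:
    """Split into (start_offset, chunk) pairs; adjacent chunks share `overlap` chars."""
    if len(text) <= chunk_size:
        return [(0, text)]
    step = chunk_size - overlap  # advance per chunk
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

The `start_offset` lets downstream code convert chunk-local hits back to global character positions.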
+
+ **`resolver.py`**
+
+ - Exact context match → `ResolvedSpan` with `fuzzy_match=False`
+ - Context not found in source → span rejected
+ - `source[start:end] != span.text` → span rejected
+ - Context found but span text not within it → span rejected
+ - Fuzzy context match with score < 0.92 → span rejected
+ - Fuzzy context match with 0.92 ≤ score < 1.0 → `fuzzy_match=True`
+ - Multiple occurrences of context → first match used, or rejection if ambiguous
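These cases reduce to one resolution function. A sketch, using `difflib.SequenceMatcher` as a stand-in fuzzy scorer; the 0.92 threshold comes from the cases above, everything else is illustrative:

```python
from difflib import SequenceMatcher

def resolve(source: str, span_text: str, context: str, threshold: float = 0.92):
    """Return (start, end, fuzzy_match) or None if the span cannot be trusted."""
    ctx_start = source.find(context)
    fuzzy = False
    if ctx_start == -1:
        # fall back to a fuzzy scan over equal-length windows of the source
        best_score, best_i = 0.0, -1
        for i in range(len(source) - len(context) + 1):
            score = SequenceMatcher(None, context, source[i:i + len(context)]).ratio()
            if score > best_score:
                best_score, best_i = score, i
        if best_score < threshold:
            return None
        ctx_start, fuzzy = best_i, True
    window = source[ctx_start:ctx_start + len(context)]
    inner = window.find(span_text)
    if inner == -1:
        return None
    start = ctx_start + inner
    end = start + len(span_text)
    if source[start:end] != span_text:
        return None  # hard invariant: output text must equal source text
    return start, end, fuzzy
```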
+
+ **`validator.py`**
+
+ - Element not in schema → span rejected
+ - Attribute not in schema → span rejected
+ - Attribute value not in `allowed_values` → span rejected
+ - Valid span → passes through unchanged
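A minimal sketch of the gate these cases describe, assuming a simplified schema shape (`element -> {attr: allowed_values}`, an empty set meaning any value is allowed):

```python
def validate(element: str, attrs: dict[str, str], schema: dict[str, dict[str, set[str]]]) -> bool:
    """Return True only if element, attribute names, and attribute values are all in scope."""
    if element not in schema:
        return False
    allowed = schema[element]
    for name, value in attrs.items():
        if name not in allowed:
            return False
        if allowed[name] and value not in allowed[name]:
            return False
    return True
```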
+
+ **`injector.py`**
+
+ - Two non-overlapping spans → both tags inserted correctly
+ - Span B offset-contained in span A → B is child of A in output
+ - Overlapping (non-nesting) spans → reject or flatten with warning
+ - Restored XML tags from step 1 do not conflict with injected tags
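The containment ordering can be made deterministic by inserting tags back-to-front with tie-breaking keys at shared boundaries (wider spans open first and close last). An illustrative sketch that assumes spans either nest properly or are disjoint:

```python
def inject(source: str, spans: list[tuple[int, int, str]]) -> str:
    """Insert <elem>...</elem> tags for (start, end, elem) spans.

    Tags are inserted from the end of the string backwards so earlier
    offsets stay valid; nesting falls out of offset containment.
    """
    inserts = []
    for start, end, elem in spans:
        width = end - start
        inserts.append((start, (1, -width), f"<{elem}>"))  # open tag
        inserts.append((end, (0, width), f"</{elem}>"))    # close tag
    out = source
    # descending sort: at equal positions, later-inserted text ends up first,
    # so close tags precede open tags and wider spans wrap narrower ones
    for pos, _, tag in sorted(inserts, key=lambda t: (t[0], t[1]), reverse=True):
        out = out[:pos] + tag + out[pos:]
    return out
```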
+
+ **`builder.py`**
+
+ - `TEXT_GENERATION` capability → prompt contains JSON example and "output only JSON" instruction
+ - `JSON_ENFORCED` capability → prompt is minimal (no JSON scaffolding)
+ - GLiNER candidates present → candidates appear in prompt
+ - GLiNER candidates absent (pass skipped) → prompt has no candidate section
+
+ **`parser.py` (parse + retry logic)**
+
+ - Valid JSON response → parsed to `list[SpanDescriptor]` without retry
+ - Markdown-fenced JSON → fences stripped, parsed correctly
+ - Invalid JSON on first attempt → retry triggered with correction prompt that includes original bad response + parse error message
+ - Invalid JSON on second attempt → exception raised, chunk skipped
+
+ **`pipeline.py` (smoke test)**
+
+ ```python
+ def mock_call_fn(prompt: str) -> str:
+     return json.dumps([
+         {"element": "persName", "text": "John Smith",
+          "context": "said John Smith yesterday", "attrs": {}}
+     ])
+
+ def test_annotate_smoke():
+     schema = TEISchema(elements=[
+         TEIElement(tag="persName", description="a person's name",
+                    allowed_children=[], attributes=[])
+     ])
+     endpoint = EndpointConfig(
+         capability=EndpointCapability.JSON_ENFORCED,
+         call_fn=mock_call_fn,
+     )
+     result = annotate(
+         text="He said John Smith yesterday.",
+         schema=schema,
+         endpoint=endpoint,
+         gliner_model=None,  # disable GLiNER in unit tests
+     )
+     assert "persName" in result.xml
+     assert "John Smith" in result.xml
+     assert result.xml.count("John Smith") == 1  # text not duplicated
+ ```