Commit · 790b4e5
1
Parent(s): 484306e
Update implementation plan
implementation-plan.md +158 -45
implementation-plan.md
CHANGED
|
@@ -5,21 +5,12 @@ Prompt:
|
|
| 5 |
> Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
|
| 6 |
>
|
| 7 |
> **Inputs:**
|
|
|
|
| 8 |
> - A text string to annotate (may already contain partial XML)
|
| 9 |
> - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
|
| 10 |
> - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
|
| 11 |
> - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
|
| 12 |
-
|
| 13 |
-
> **Pipeline:**
|
| 14 |
-
> 1. Check for local GLiNER installation and print setup instructions if missing
|
| 15 |
-
> 2. Run a local GLiNER model as a first-pass span detector, mapping TEI element descriptions to GLiNER labels
|
| 16 |
-
> 3. Chunk long texts with overlap, tracking global character offsets
|
| 17 |
-
> 4. Assemble a prompt using the GLiNER candidates, TEI schema context, and JSON output instructions tailored to the endpoint capability
|
| 18 |
-
> 5. Parse and validate the returned span manifest: each span has `element`, `text`, `context` (surrounding text for position resolution), and `attrs`
|
| 19 |
-
> 6. Resolve spans to character offsets by searching for the context string in the source, then locating the span text within it; reject any span where `source[start:end] != span.text`
|
| 20 |
-
> 7. Inject tags deterministically into the original source text, handling nesting
|
| 21 |
-
> 8. Return the annotated XML plus a list of fuzzy-matched spans flagged for human review
|
| 22 |
-
>
|
| 23 |
> The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.
|
| 24 |
|
| 25 |
## Package Structure
|
|
@@ -31,8 +22,7 @@ tei_annotator/
|
|
| 31 |
│   ├── schema.py          # TEI element/attribute data structures
|
| 32 |
│   └── spans.py           # Span manifest data structures
|
| 33 |
├── detection/
|
| 34 |
-
│
|
| 35 |
-
│   └── setup.py           # GLiNER availability check + install instructions
|
| 36 |
├── chunking/
|
| 37 |
│   └── chunker.py         # Overlap-aware text chunker, XML-safe boundaries
|
| 38 |
├── prompting/
|
|
@@ -82,7 +72,8 @@ class SpanDescriptor:
|
|
| 82 |
text: str
|
| 83 |
context: str # must contain text as substring
|
| 84 |
attrs: dict[str, str]
|
| 85 |
-
|
|
|
|
| 86 |
confidence: float | None = None # passed through from GLiNER
|
| 87 |
|
| 88 |
@dataclass
|
|
@@ -97,28 +88,19 @@ class ResolvedSpan:
|
|
| 97 |
|
| 98 |
---
|
| 99 |
|
| 100 |
-
## GLiNER
|
| 101 |
|
| 102 |
-
```
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
Recommended models (downloaded automatically on first use):
|
| 114 |
-
urchade/gliner_medium-v2.1 # balanced, Apache 2.0
|
| 115 |
-
numind/NuNER_Zero # stronger multi-word entities, MIT
|
| 116 |
-
knowledgator/gliner-multitask-large-v0.5 # adds relation extraction
|
| 117 |
-
|
| 118 |
-
Models run on CPU; no GPU required.
|
| 119 |
-
""")
|
| 120 |
-
return False
|
| 121 |
-
```
|
| 122 |
|
| 123 |
---
|
| 124 |
|
|
@@ -145,31 +127,36 @@ The `call_fn` injection means the library is agnostic about whether the caller i
|
|
| 145 |
## Pipeline (`pipeline.py`)
|
| 146 |
|
| 147 |
```python
|
| 148 |
def annotate(
|
| 149 |
text: str, # may contain existing XML tags
|
| 150 |
schema: TEISchema, # subset of TEI elements in scope
|
| 151 |
endpoint: EndpointConfig, # injected inference dependency
|
| 152 |
-
gliner_model: str = "numind/NuNER_Zero",
|
| 153 |
chunk_size: int = 1500, # chars
|
| 154 |
chunk_overlap: int = 200,
|
| 155 |
-
) ->
|
| 156 |
```
|
| 157 |
|
| 158 |
### Execution Flow
|
| 159 |
|
| 160 |
```
|
| 161 |
1. SETUP
|
| 162 |
-
check_gliner() → prompt user to install if missing
|
| 163 |
strip existing XML tags from text for processing,
|
| 164 |
preserve them as a restoration map for final merge
|
| 165 |
|
| 166 |
-
2. GLINER PASS
|
| 167 |
map TEISchema elements → flat label list for GLiNER
|
| 168 |
e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
|
| 169 |
chunk text if len(text) > chunk_size (with overlap)
|
| 170 |
run gliner.predict_entities() on each chunk
|
| 171 |
merge cross-chunk duplicates by span text + context overlap
|
| 172 |
output: list[SpanDescriptor] with text + context + element + confidence
|
|
|
|
| 173 |
|
| 174 |
3. PROMPT ASSEMBLY
|
| 175 |
select template based on EndpointCapability:
|
|
@@ -190,7 +177,9 @@ def annotate(
|
|
| 190 |
a. Parse
|
| 191 |
JSON_ENFORCED/EXTRACTION: parse directly
|
| 192 |
TEXT_GENERATION: strip markdown fences, parse JSON,
|
| 193 |
-
retry once with correction prompt
|
|
|
|
|
|
|
| 194 |
|
| 195 |
b. Resolve (resolver.py)
|
| 196 |
for each SpanDescriptor:
|
|
@@ -201,11 +190,12 @@ def annotate(
|
|
| 201 |
|
| 202 |
c. Validate (validator.py)
|
| 203 |
reject spans where text not in source
|
| 204 |
-
check nesting: children must be within parent bounds
|
| 205 |
check attributes against TEISchema allowed values
|
| 206 |
check element is in schema scope
|
| 207 |
|
| 208 |
d. Inject (injector.py)
|
|
|
|
|
|
|
| 209 |
sort ResolvedSpans by start offset, handle nesting depth-first
|
| 210 |
insert tags into a copy of the original source string
|
| 211 |
restore previously existing XML tags from step 1
|
|
@@ -215,8 +205,10 @@ def annotate(
|
|
| 215 |
optionally validate against full TEI RelaxNG schema via lxml
|
| 216 |
|
| 217 |
6. RETURN
|
| 218 |
-
|
| 219 |
-
|
|
|
|
|
|
|
| 220 |
```
|
| 221 |
|
| 222 |
---
|
|
@@ -224,6 +216,127 @@ def annotate(
|
|
| 224 |
## Key Design Constraints
|
| 225 |
|
| 226 |
- The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
|
| 227 |
-
- GLiNER is
|
|
|
|
| 228 |
- `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
|
| 229 |
-
- Fuzzy-matched spans are **surfaced, not silently accepted**
|
| 5 |
> Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
|
| 6 |
>
|
| 7 |
> **Inputs:**
|
| 8 |
+
>
|
| 9 |
> - A text string to annotate (may already contain partial XML)
|
| 10 |
> - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
|
| 11 |
> - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
|
| 12 |
> - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
|
| 13 |
+
|
| 14 |
> The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.
|
| 15 |
|
| 16 |
## Package Structure
|
|
|
|
| 22 |
│   ├── schema.py          # TEI element/attribute data structures
|
| 23 |
│   └── spans.py           # Span manifest data structures
|
| 24 |
├── detection/
|
| 25 |
+
│   └── gliner_detector.py # Optional local GLiNER first-pass span detection
|
|
|
|
| 26 |
├── chunking/
|
| 27 |
│   └── chunker.py         # Overlap-aware text chunker, XML-safe boundaries
|
| 28 |
├── prompting/
|
|
|
|
| 72 |
text: str
|
| 73 |
context: str # must contain text as substring
|
| 74 |
attrs: dict[str, str]
|
| 75 |
+
# always flat: nesting is inferred from offset containment in resolver/injector,
|
| 76 |
+
# not emitted by the model (models produce unreliable nested trees)
|
| 77 |
confidence: float | None = None # passed through from GLiNER
|
| 78 |
|
| 79 |
@dataclass
|
|
|
|
| 88 |
|
| 89 |
---
|
| 90 |
|
| 91 |
+
## GLiNER Dependency (`detection/gliner_detector.py`)
|
| 92 |
|
| 93 |
+
`gliner` is a regular package dependency declared in `pyproject.toml` and installed via `uv add gliner`. No manual setup step is needed.
|
| 94 |
+
|
| 95 |
+
Model weights are fetched from HuggingFace Hub automatically on first use of `GLiNER.from_pretrained(model_id)` and cached in `~/.cache/huggingface/`. If the import fails at runtime (e.g. the optional extra was not installed), the module raises a standard `ImportError` with a clear message, so no wrapper is needed.
|
| 96 |
+
|
| 97 |
+
Recommended models (specified as `gliner_model` parameter):
|
| 98 |
+
|
| 99 |
+
- `urchade/gliner_medium-v2.1`: balanced, Apache 2.0
|
| 100 |
+
- `numind/NuNER_Zero`: stronger multi-word entities, MIT (default)
|
| 101 |
+
- `knowledgator/gliner-multitask-large-v0.5`: adds relation extraction
|
| 102 |
+
|
| 103 |
+
All models run on CPU; no GPU required.
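
As a rough illustration of the first pass, the sketch below maps schema descriptions to GLiNER labels and converts predictions back to TEI element names. `detect_candidates`, the `labels_by_element` dict, and the `FakeModel` stand-in are hypothetical names, not part of the plan. The only GLiNER call assumed is `predict_entities(text, labels, threshold=...)`, which is expected to return dicts with `start`, `end`, `text`, `label`, and `score` keys.

```python
def detect_candidates(model, text, labels_by_element, threshold=0.5, context_chars=20):
    """Run a GLiNER-style model and return flat span candidates as dicts."""
    # GLiNER is prompted with natural-language labels, so invert the mapping
    # to translate each predicted label back to its TEI element name.
    element_by_label = {label: element for element, label in labels_by_element.items()}
    candidates = []
    for ent in model.predict_entities(text, list(element_by_label), threshold=threshold):
        start, end = ent["start"], ent["end"]
        candidates.append({
            "element": element_by_label[ent["label"]],
            "text": text[start:end],  # copied from the source, never from model output
            "context": text[max(0, start - context_chars):end + context_chars],
            "attrs": {},
            "confidence": ent["score"],
        })
    return candidates


class FakeModel:
    """Hypothetical stand-in with the same method shape as a loaded GLiNER model."""

    def predict_entities(self, text, labels, threshold=0.5):
        start = text.index("John Smith")
        return [{"start": start, "end": start + 10, "text": "John Smith",
                 "label": "a person's name", "score": 0.97}]


candidates = detect_candidates(
    FakeModel(), "He said John Smith yesterday.", {"persName": "a person's name"}
)
```

With the real package, `FakeModel()` would presumably be replaced by `GLiNER.from_pretrained(gliner_model)`; the rest of the function should not need to change.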
|
| 104 |
|
| 105 |
---
|
| 106 |
|
|
|
|
| 127 |
## Pipeline (`pipeline.py`)
|
| 128 |
|
| 129 |
```python
|
| 130 |
+
@dataclass
|
| 131 |
+
class AnnotationResult:
|
| 132 |
+
xml: str # annotated XML string
|
| 133 |
+
fuzzy_spans: list[ResolvedSpan] # spans flagged for human review
|
| 134 |
+
|
| 135 |
def annotate(
|
| 136 |
text: str, # may contain existing XML tags
|
| 137 |
schema: TEISchema, # subset of TEI elements in scope
|
| 138 |
endpoint: EndpointConfig, # injected inference dependency
|
| 139 |
+
gliner_model: str | None = "numind/NuNER_Zero", # None disables GLiNER pass
|
| 140 |
chunk_size: int = 1500, # chars
|
| 141 |
chunk_overlap: int = 200,
|
| 142 |
+
) -> AnnotationResult:
|
| 143 |
```
|
| 144 |
|
| 145 |
### Execution Flow
|
| 146 |
|
| 147 |
```
|
| 148 |
1. SETUP
|
|
|
|
| 149 |
strip existing XML tags from text for processing,
|
| 150 |
preserve them as a restoration map for final merge
|
| 151 |
|
| 152 |
+
2. GLINER PASS (skipped if gliner_model=None, endpoint is EXTRACTION, or text is short)
|
| 153 |
map TEISchema elements → flat label list for GLiNER
|
| 154 |
e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
|
| 155 |
chunk text if len(text) > chunk_size (with overlap)
|
| 156 |
run gliner.predict_entities() on each chunk
|
| 157 |
merge cross-chunk duplicates by span text + context overlap
|
| 158 |
output: list[SpanDescriptor] with text + context + element + confidence
|
| 159 |
+
(GLiNER is a pre-filter only; the LLM may reject, correct, or extend its candidates)
|
| 160 |
|
| 161 |
3. PROMPT ASSEMBLY
|
| 162 |
select template based on EndpointCapability:
|
|
|
|
| 177 |
a. Parse
|
| 178 |
JSON_ENFORCED/EXTRACTION: parse directly
|
| 179 |
TEXT_GENERATION: strip markdown fences, parse JSON,
|
| 180 |
+
on failure: retry once with a correction prompt that includes
|
| 181 |
+
the original (bad) response and the parse error message,
|
| 182 |
+
so the model can self-correct rather than starting from scratch
|
| 183 |
|
| 184 |
b. Resolve (resolver.py)
|
| 185 |
for each SpanDescriptor:
|
|
|
|
| 190 |
|
| 191 |
c. Validate (validator.py)
|
| 192 |
reject spans where text not in source
|
|
|
|
| 193 |
check attributes against TEISchema allowed values
|
| 194 |
check element is in schema scope
|
| 195 |
|
| 196 |
d. Inject (injector.py)
|
| 197 |
+
infer nesting from offset containment (child ⊂ parent by [start, end] bounds)
|
| 198 |
+
check inferred nesting: children must be within parent bounds
|
| 199 |
sort ResolvedSpans by start offset, handle nesting depth-first
|
| 200 |
insert tags into a copy of the original source string
|
| 201 |
restore previously existing XML tags from step 1
|
|
|
|
| 205 |
optionally validate against full TEI RelaxNG schema via lxml
|
| 206 |
|
| 207 |
6. RETURN
|
| 208 |
+
AnnotationResult(
|
| 209 |
+
xml=annotated_xml_string,
|
| 210 |
+
fuzzy_spans=list_of_flagged_resolved_spans,
|
| 211 |
+
)
|
| 212 |
```
|
| 213 |
|
| 214 |
---
|
|
|
|
| 216 |
## Key Design Constraints
|
| 217 |
|
| 218 |
- The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
|
| 219 |
+
- The **GLiNER pass is optional** (`gliner_model=None` disables it). It is most useful for long texts with `TEXT_GENERATION` endpoints; it is skipped automatically for `EXTRACTION` endpoints or short inputs. When enabled, GLiNER is a pre-filter only: the LLM may reject, correct, or extend its candidates.
|
| 220 |
+
- **Span nesting is inferred from offsets**, never emitted by the model. `SpanDescriptor` is always flat; `ResolvedSpan.children` is populated by the injector from containment relationships.
|
| 221 |
- `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
|
| 222 |
+
- Fuzzy-matched spans are **surfaced, not silently accepted**: `AnnotationResult.fuzzy_spans` provides a reviewable list alongside the XML.
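
To make the offset-based nesting constraint concrete, here is a minimal sketch of how a flat list of resolved spans could be turned into a tree by containment. The `Node` class and `build_tree` function are illustrative names, not the plan's `ResolvedSpan` or injector API, and raising on partial overlap is only one possible policy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    element: str
    start: int
    end: int
    children: list["Node"] = field(default_factory=list)


def build_tree(spans: list[Node]) -> list[Node]:
    """Nest flat spans by offset containment; raise on partial overlap."""
    roots: list[Node] = []
    stack: list[Node] = []  # chain of currently open ancestors
    # Outer spans first: earlier start, and for equal starts the longer span.
    for span in sorted(spans, key=lambda s: (s.start, -s.end)):
        # Drop ancestors that end before this span begins.
        while stack and stack[-1].end <= span.start:
            stack.pop()
        if stack and span.end > stack[-1].end:
            raise ValueError(f"{span.element} partially overlaps {stack[-1].element}")
        (stack[-1].children if stack else roots).append(span)
        stack.append(span)
    return roots
```

Because the model only ever returns flat spans, a malformed tree cannot come from the model; any nesting error surfaces here as a deterministic offset check.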
|
| 223 |
+
|
| 224 |
+
---
|
| 225 |
+
|
| 226 |
+
## Testing Strategy
|
| 227 |
+
|
| 228 |
+
### Mocking philosophy
|
| 229 |
+
|
| 230 |
+
**Always mock `call_fn` and the GLiNER detector in unit tests.** Do not use a real GLiNER model as a substitute for a remote LLM endpoint: GLiNER is a span-labelling model that cannot produce JSON responses, so it cannot exercise the parse/resolve/inject pipeline. Using a real model also makes tests slow (seconds per inference on CPU), non-deterministic across versions and hardware, and dependent on a 400 MB+ download.
|
| 231 |
+
|
| 232 |
+
The `call_fn: (str) -> str` design makes mocking trivial: a lambda returning a hardcoded JSON string is sufficient. No mock framework is needed.
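
A small sketch of what that can look like in practice. The `recording` helper is a hypothetical convenience for inspecting prompts and is not part of the planned library; the JSON shape follows the span manifest fields described in this plan.

```python
import json

# A lambda returning a fixed JSON manifest is enough to stand in for an endpoint.
fake_call_fn = lambda prompt: json.dumps([
    {"element": "persName", "text": "John Smith",
     "context": "said John Smith yesterday", "attrs": {}}
])


def recording(call_fn, log):
    """Wrap any (str) -> str callable so a test can inspect the prompts sent."""
    def wrapper(prompt: str) -> str:
        log.append(prompt)
        return call_fn(prompt)
    return wrapper


prompts: list[str] = []
call_fn = recording(fake_call_fn, prompts)
spans = json.loads(call_fn("Annotate: He said John Smith yesterday."))
```

The wrapper keeps the `(str) -> str` shape, so it can be passed anywhere the plain lambda could.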
|
| 233 |
+
|
| 234 |
+
### Test layers
|
| 235 |
+
|
| 236 |
+
**Layer 1: Unit tests** (always run, <1 s total, fully mocked):
|
| 237 |
+
|
| 238 |
+
```
|
| 239 |
+
tests/
|
| 240 |
+
├── test_chunker.py        # chunker unit tests
|
| 241 |
+
├── test_resolver.py       # resolver unit tests
|
| 242 |
+
├── test_validator.py      # validator unit tests
|
| 243 |
+
├── test_injector.py       # injector unit tests
|
| 244 |
+
├── test_builder.py        # prompt builder unit tests
|
| 245 |
+
├── test_parser.py         # JSON parse + retry unit tests
|
| 246 |
+
└── test_pipeline.py       # full pipeline smoke test (mocked call_fn + GLiNER)
|
| 247 |
+
```
|
| 248 |
+
|
| 249 |
+
**Layer 2: Integration tests** (opt-in, gated by `pytest -m integration`):
|
| 250 |
+
|
| 251 |
+
```
|
| 252 |
+
tests/integration/
|
| 253 |
+
├── test_gliner_detector.py   # real GLiNER model, real HuggingFace download
|
| 254 |
+
└── test_pipeline_e2e.py      # full annotate() with real GLiNER + mocked call_fn
|
| 255 |
+
```
|
| 256 |
+
|
| 257 |
+
Integration tests are excluded from CI by default via `pyproject.toml`:
|
| 258 |
+
|
| 259 |
+
```toml
|
| 260 |
+
[tool.pytest.ini_options]
|
| 261 |
+
addopts = "-m 'not integration'"
|
| 262 |
+
```
|
| 263 |
+
|
| 264 |
+
### TDD cycle (red → green → refactor)
|
| 265 |
+
|
| 266 |
+
Each module is written test-first. Write a failing test, implement the minimum code to pass, refactor.
|
| 267 |
+
|
| 268 |
+
### Key test cases per module
|
| 269 |
+
|
| 270 |
+
**`chunker.py`**
|
| 271 |
+
|
| 272 |
+
- Short text below `chunk_size` → single chunk, offset 0
|
| 273 |
+
- Long text → multiple chunks with correct `start_offset` per chunk
|
| 274 |
+
- Span exactly at a chunk boundary → appears in both chunks with correct global offset
|
| 275 |
+
- Input with existing XML tags → chunk boundaries never split a tag
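
The first three cases could be exercised against a chunker along these lines. This is a minimal sketch, assuming character-based windows and a hypothetical `Chunk` dataclass; it deliberately leaves out the XML-safe boundary handling from the last case.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_offset: int  # position of text[0] in the full source


def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[Chunk]:
    """Split text into overlapping windows, recording each window's global offset."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [Chunk(text, 0)]
    chunks: list[Chunk] = []
    step = chunk_size - chunk_overlap
    start = 0
    while True:
        chunks.append(Chunk(text[start:start + chunk_size], start))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

A local offset inside a chunk converts to a global one by adding `start_offset`, which is what lets spans found in overlapping chunks be deduplicated later.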
|
| 276 |
+
|
| 277 |
+
**`resolver.py`**
|
| 278 |
+
|
| 279 |
+
- Exact context match → `ResolvedSpan` with `fuzzy_match=False`
|
| 280 |
+
- Context not found in source → span rejected
|
| 281 |
+
- `source[start:end] != span.text` → span rejected
|
| 282 |
+
- Context found but span text not within it → span rejected
|
| 283 |
+
- Context found, span text found, score < 0.92 → span rejected
|
| 284 |
+
- Context found, span text found, 0.92 ≤ score < 1.0 → `fuzzy_match=True`
|
| 285 |
+
- Multiple occurrences of context → first match used, or rejection if ambiguous
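
One way the resolver behaviour listed above could be sketched is shown below. The plan does not name a similarity measure, so `difflib.SequenceMatcher` is used here purely as a stand-in assumption, and the `resolve` function with its tuple return value is hypothetical. The sliding window is quadratic and only suitable for illustration.

```python
from difflib import SequenceMatcher


def resolve(source: str, text: str, context: str, threshold: float = 0.92):
    """Return (start, end, fuzzy_match), or None if the span is rejected."""
    fuzzy = False
    ctx_len = len(context)
    ctx_start = source.find(context)  # first exact occurrence wins
    if ctx_start == -1:
        # Fallback: slide a context-sized window over the source, keep the best score.
        best_score = 0.0
        for i in range(max(1, len(source) - ctx_len + 1)):
            score = SequenceMatcher(None, source[i:i + ctx_len], context).ratio()
            if score > best_score:
                best_score, ctx_start = score, i
        if best_score < threshold:
            return None
        fuzzy = True
    window = source[ctx_start:ctx_start + ctx_len]
    offset = window.find(text)
    if offset == -1:
        return None  # context located, but span text is not inside it
    start = ctx_start + offset
    end = start + len(text)
    if source[start:end] != text:  # final guard from the plan
        return None
    return start, end, fuzzy
```

The span text itself is always matched exactly; only the context lookup is allowed to be fuzzy, which is why a fuzzy result can still satisfy `source[start:end] == text`.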
|
| 286 |
+
|
| 287 |
+
**`validator.py`**
|
| 288 |
+
|
| 289 |
+
- Element not in schema → span rejected
|
| 290 |
+
- Attribute not in schema → span rejected
|
| 291 |
+
- Attribute value not in `allowed_values` → span rejected
|
| 292 |
+
- Valid span → passes through unchanged
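
These checks are simple enough to sketch with plain dictionaries. The schema shape used here (element name → attribute name → list of allowed values, with `None` meaning any value) is an assumption for illustration and is not the plan's `TEISchema` structure; `validate_span` is likewise a hypothetical name.

```python
from __future__ import annotations


def validate_span(
    element: str,
    attrs: dict[str, str],
    schema: dict[str, dict[str, list[str] | None]],
) -> list[str]:
    """Return a list of problems; an empty list means the span passes."""
    if element not in schema:
        return [f"element <{element}> is not in the schema"]
    problems: list[str] = []
    allowed_attrs = schema[element]
    for name, value in attrs.items():
        if name not in allowed_attrs:
            problems.append(f"attribute '{name}' is not allowed on <{element}>")
            continue
        allowed_values = allowed_attrs[name]
        # None means the attribute accepts free-form values.
        if allowed_values is not None and value not in allowed_values:
            problems.append(f"value '{value}' is not allowed for '{name}'")
    return problems
```

Returning the reasons rather than a bare boolean makes rejected spans easier to log or surface for review.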
|
| 293 |
+
|
| 294 |
+
**`injector.py`**
|
| 295 |
+
|
| 296 |
+
- Two non-overlapping spans → both tags inserted correctly
|
| 297 |
+
- Span B offset-contained in span A → B is child of A in output
|
| 298 |
+
- Overlapping (non-nesting) spans → reject or flatten with warning
|
| 299 |
+
- Restored XML tags from step 1 do not conflict with injected tags
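
For the first two cases, tag insertion can be sketched as an ordered list of open/close events over the untouched source string. The `inject` function and its tuple-based span format are hypothetical; the sketch assumes spans are already properly nested and does not restore pre-existing tags or escape characters in the source text.

```python
from xml.sax.saxutils import quoteattr


def inject(source: str, spans) -> str:
    """Insert tags for (element, start, end, attrs) spans, assumed properly nested."""
    events = []
    for idx, (element, start, end, attrs) in enumerate(spans):
        attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        # At one position, close tags (kind 0) are emitted before open tags (kind 1).
        # Closes: inner spans (later start) first. Opens: outer spans (later end) first.
        events.append((end, 0, -start, -idx, f"</{element}>"))
        events.append((start, 1, -end, idx, f"<{element}{attr_str}>"))
    events.sort(key=lambda e: e[:4])
    out, cursor = [], 0
    for pos, _, _, _, tag in events:
        out.append(source[cursor:pos])  # copy source text verbatim
        out.append(tag)
        cursor = pos
    out.append(source[cursor:])
    return "".join(out)
```

Stripping the tags from the result should give back the source exactly, which is a cheap invariant to assert in `test_injector.py`.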
|
| 300 |
+
|
| 301 |
+
**`builder.py`**
|
| 302 |
+
|
| 303 |
+
- `TEXT_GENERATION` capability → prompt contains JSON example and "output only JSON" instruction
|
| 304 |
+
- `JSON_ENFORCED` capability → prompt is minimal (no JSON scaffolding)
|
| 305 |
+
- GLiNER candidates present → candidates appear in prompt
|
| 306 |
+
- GLiNER candidates absent (pass skipped) → prompt has no candidate section
|
| 307 |
+
|
| 308 |
+
**`parser.py` (parse + retry logic)**
|
| 309 |
+
|
| 310 |
+
- Valid JSON response → parsed to `list[SpanDescriptor]` without retry
|
| 311 |
+
- Markdown-fenced JSON → fences stripped, parsed correctly
|
| 312 |
+
- Invalid JSON on first attempt → retry triggered with correction prompt that includes original bad response + parse error message
|
| 313 |
+
- Invalid JSON on second attempt → exception raised, chunk skipped
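
The parse-and-retry cases could be covered by something like the following sketch. The function names, the fence-stripping regex, and the wording of the correction prompt are all assumptions; only the behaviour (strip fences, retry once with the bad response and the error, then let the second failure propagate) comes from the plan.

```python
import json
import re
from typing import Callable

# Matches a response wrapped in a single Markdown code fence, optionally tagged "json".
FENCE_RE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def strip_fences(raw: str) -> str:
    match = FENCE_RE.match(raw)
    return match.group(1) if match else raw.strip()


def parse_with_retry(call_fn: Callable[[str], str], prompt: str):
    """Parse the model's JSON reply, retrying once with a correction prompt."""
    raw = call_fn(prompt)
    try:
        return json.loads(strip_fences(raw))
    except json.JSONDecodeError as err:
        correction = (
            f"{prompt}\n\nYour previous response was not valid JSON.\n"
            f"Response:\n{raw}\n\nParse error: {err}\n"
            "Reply with corrected JSON only."
        )
    # A second failure raises json.JSONDecodeError; the caller can skip the chunk.
    return json.loads(strip_fences(call_fn(correction)))
```

Converting the parsed list into `SpanDescriptor` objects is left out here, since that constructor is defined elsewhere in the plan.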
|
| 314 |
+
|
| 315 |
+
**`pipeline.py` (smoke test)**
|
| 316 |
+
|
| 317 |
+
```python
|
| 318 |
+
import json


def mock_call_fn(prompt: str) -> str:
|
| 319 |
+
return json.dumps([
|
| 320 |
+
{"element": "persName", "text": "John Smith",
|
| 321 |
+
"context": "said John Smith yesterday", "attrs": {}}  # exact substring of the input text
|
| 322 |
+
])
|
| 323 |
+
|
| 324 |
+
def test_annotate_smoke():
|
| 325 |
+
schema = TEISchema(elements=[
|
| 326 |
+
TEIElement(tag="persName", description="a person's name",
|
| 327 |
+
allowed_children=[], attributes=[])
|
| 328 |
+
])
|
| 329 |
+
endpoint = EndpointConfig(
|
| 330 |
+
capability=EndpointCapability.JSON_ENFORCED,
|
| 331 |
+
call_fn=mock_call_fn,
|
| 332 |
+
)
|
| 333 |
+
result = annotate(
|
| 334 |
+
text="He said John Smith yesterday.",
|
| 335 |
+
schema=schema,
|
| 336 |
+
endpoint=endpoint,
|
| 337 |
+
gliner_model=None, # disable GLiNER in unit tests
|
| 338 |
+
)
|
| 339 |
+
assert "persName" in result.xml
|
| 340 |
+
assert "John Smith" in result.xml
|
| 341 |
+
assert result.xml.count("John Smith") == 1 # text not duplicated
|
| 342 |
+
```
|