cmboulanger committed
Commit 790b4e5 · 1 Parent(s): 484306e

Update implementation plan

Files changed (1):
  1. implementation-plan.md +158 -45
implementation-plan.md CHANGED
@@ -5,21 +5,12 @@ Prompt:
  > Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
  >
  > **Inputs:**
  > - A text string to annotate (may already contain partial XML)
  > - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
  > - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
  > - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
- >
- > **Pipeline:**
- > 1. Check for local GLiNER installation and print setup instructions if missing
- > 2. Run a local GLiNER model as a first-pass span detector, mapping TEI element descriptions to GLiNER labels
- > 3. Chunk long texts with overlap, tracking global character offsets
- > 4. Assemble a prompt using the GLiNER candidates, TEI schema context, and JSON output instructions tailored to the endpoint capability
- > 5. Parse and validate the returned span manifest — each span has `element`, `text`, `context` (surrounding text for position resolution), and `attrs`
- > 6. Resolve spans to character offsets by searching for the context string in the source, then locating the span text within it — reject any span where `source[start:end] != span.text`
- > 7. Inject tags deterministically into the original source text, handling nesting
- > 8. Return the annotated XML plus a list of fuzzy-matched spans flagged for human review
- >
  > The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.
 
 ## Package Structure
@@ -31,8 +22,7 @@ tei_annotator/
  │ ├── schema.py # TEI element/attribute data structures
  │ └── spans.py # Span manifest data structures
  ├── detection/
- │ ├── gliner_detector.py # Local GLiNER first-pass span detection
- │ └── setup.py # GLiNER availability check + install instructions
  ├── chunking/
  │ └── chunker.py # Overlap-aware text chunker, XML-safe boundaries
  ├── prompting/
@@ -82,7 +72,8 @@ class SpanDescriptor:
      text: str
      context: str  # must contain text as substring
      attrs: dict[str, str]
-     children: list["SpanDescriptor"]  # for nested annotations
      confidence: float | None = None  # passed through from GLiNER

  @dataclass
@@ -97,28 +88,19 @@ class ResolvedSpan:

 ---

- ## GLiNER Setup (`detection/setup.py`)
-
- ```python
- def check_gliner() -> bool:
-     try:
-         import gliner
-         return True
-     except ImportError:
-         print("""
- GLiNER is not installed. To install:
-
-     pip install gliner
-
- Recommended models (downloaded automatically on first use):
-     urchade/gliner_medium-v2.1               # balanced, Apache 2.0
-     numind/NuNER_Zero                        # stronger multi-word entities, MIT
-     knowledgator/gliner-multitask-large-v0.5 # adds relation extraction
-
- Models run on CPU; no GPU required.
- """)
-         return False
- ```

 ---
@@ -145,31 +127,36 @@ The `call_fn` injection means the library is agnostic about whether the caller i
 ## Pipeline (`pipeline.py`)

 ```python
 def annotate(
     text: str,                 # may contain existing XML tags
     schema: TEISchema,         # subset of TEI elements in scope
     endpoint: EndpointConfig,  # injected inference dependency
-    gliner_model: str = "numind/NuNER_Zero",
     chunk_size: int = 1500,    # chars
     chunk_overlap: int = 200,
- ) -> str:  # returns annotated XML string
 ```

 ### Execution Flow

 ```
 1. SETUP
-    check_gliner() → prompt user to install if missing
     strip existing XML tags from text for processing,
     preserve them as a restoration map for final merge

- 2. GLINER PASS
     map TEISchema elements → flat label list for GLiNER
     e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
     chunk text if len(text) > chunk_size (with overlap)
     run gliner.predict_entities() on each chunk
     merge cross-chunk duplicates by span text + context overlap
     output: list[SpanDescriptor] with text + context + element + confidence

 3. PROMPT ASSEMBLY
     select template based on EndpointCapability:
@@ -190,7 +177,9 @@ def annotate(
  a. Parse
      JSON_ENFORCED/EXTRACTION: parse directly
      TEXT_GENERATION: strip markdown fences, parse JSON,
-         retry once with correction prompt on failure

  b. Resolve (resolver.py)
      for each SpanDescriptor:
@@ -201,11 +190,12 @@ def annotate(

  c. Validate (validator.py)
      reject spans where text not in source
-     check nesting: children must be within parent bounds
      check attributes against TEISchema allowed values
      check element is in schema scope

  d. Inject (injector.py)
      sort ResolvedSpans by start offset, handle nesting depth-first
      insert tags into a copy of the original source string
      restore previously existing XML tags from step 1
@@ -215,8 +205,10 @@ def annotate(
      optionally validate against full TEI RelaxNG schema via lxml

  6. RETURN
-     annotated XML string
-     + list of fuzzy-matched spans flagged for human review
 ```

 ---
@@ -224,6 +216,127 @@ def annotate(
 ## Key Design Constraints

 - The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
- - GLiNER is a **pre-filter**, not the authority. The LLM can reject, correct, or add to its candidates. GLiNER's value is positional reliability; the LLM's value is schema reasoning.
 - `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
- - Fuzzy-matched spans are **surfaced, not silently accepted** — the return value includes a reviewable list alongside the XML.
 > Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
 >
 > **Inputs:**
+ >
 > - A text string to annotate (may already contain partial XML)
 > - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
 > - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
 > - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
+
 > The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.

 ## Package Structure
 
 │ ├── schema.py # TEI element/attribute data structures
 │ └── spans.py # Span manifest data structures
 ├── detection/
+ │ └── gliner_detector.py # Optional local GLiNER first-pass span detection
 ├── chunking/
 │ └── chunker.py # Overlap-aware text chunker, XML-safe boundaries
 ├── prompting/
 
     text: str
     context: str  # must contain text as substring
     attrs: dict[str, str]
+    # always flat — nesting is inferred from offset containment in resolver/injector,
+    # not emitted by the model (models produce unreliable nested trees)
     confidence: float | None = None  # passed through from GLiNER

 @dataclass
 

 ---

+ ## GLiNER Dependency (`detection/gliner_detector.py`)
+
+ `gliner` is a regular package dependency declared in `pyproject.toml` and installed via `uv add gliner`. No manual setup step is needed.
+
+ Model weights are fetched from HuggingFace Hub automatically on first use of `GLiNER.from_pretrained(model_id)` and cached in `~/.cache/huggingface/`. If the import fails at runtime (e.g. the optional extra was not installed), the module raises a standard `ImportError` with a clear message — no wrapper needed.
+
+ Recommended models (specified as `gliner_model` parameter):
+
+ - `urchade/gliner_medium-v2.1` — balanced, Apache 2.0
+ - `numind/NuNER_Zero` — stronger multi-word entities, MIT (default)
+ - `knowledgator/gliner-multitask-large-v0.5` — adds relation extraction
+
+ All models run on CPU; no GPU required.
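Because the prediction call is an injectable boundary, the label-mapping half of the detector can be exercised without model weights. A sketch (the `detect_candidates` helper and `Candidate` shape are hypothetical; `predict` stands in for `GLiNER.predict_entities`, whose entity dicts carry `text`, `label`, `start`, `end`, and `score`):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    element: str       # TEI element name, e.g. "persName"
    text: str          # surface string found by the detector
    context: str       # surrounding text, kept for later offset resolution
    confidence: float

def detect_candidates(
    text: str,
    labels: dict[str, str],  # TEI element -> description used as GLiNER label
    predict: Callable[[str, list[str]], list[dict]],  # injected model call
    context_pad: int = 30,
) -> list[Candidate]:
    """Run the injected predictor and wrap hits as flat candidates."""
    descriptions = list(labels.values())
    desc_to_element = {v: k for k, v in labels.items()}
    out = []
    for ent in predict(text, descriptions):
        start, end = ent["start"], ent["end"]
        # context = span plus padding on both sides, clipped to the text
        context = text[max(0, start - context_pad):end + context_pad]
        out.append(Candidate(
            element=desc_to_element[ent["label"]],
            text=ent["text"],
            context=context,
            confidence=ent.get("score", 0.0),
        ))
    return out
```

With `predict` injected, unit tests pass a stub and integration tests pass the real model method.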

 ---
 
 ## Pipeline (`pipeline.py`)

 ```python
+ @dataclass
+ class AnnotationResult:
+     xml: str                         # annotated XML string
+     fuzzy_spans: list[ResolvedSpan]  # spans flagged for human review
+
 def annotate(
     text: str,                 # may contain existing XML tags
     schema: TEISchema,         # subset of TEI elements in scope
     endpoint: EndpointConfig,  # injected inference dependency
+    gliner_model: str | None = "numind/NuNER_Zero",  # None disables GLiNER pass
     chunk_size: int = 1500,    # chars
     chunk_overlap: int = 200,
+ ) -> AnnotationResult:
 ```
144
 
145
  ### Execution Flow
146
 
147
  ```
148
  1. SETUP
 
149
  strip existing XML tags from text for processing,
150
  preserve them as a restoration map for final merge
151
 
152
+ 2. GLINER PASS (skipped if gliner_model=None, endpoint is EXTRACTION, or text is short)
153
  map TEISchema elements β†’ flat label list for GLiNER
154
  e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
155
  chunk text if len(text) > chunk_size (with overlap)
156
  run gliner.predict_entities() on each chunk
157
  merge cross-chunk duplicates by span text + context overlap
158
  output: list[SpanDescriptor] with text + context + element + confidence
159
+ (GLiNER is a pre-filter only; the LLM may reject, correct, or extend its candidates)
160
 
161
  3. PROMPT ASSEMBLY
162
  select template based on EndpointCapability:
 
 a. Parse
     JSON_ENFORCED/EXTRACTION: parse directly
     TEXT_GENERATION: strip markdown fences, parse JSON,
+        on failure: retry once with a correction prompt that includes
+        the original (bad) response and the parse error message,
+        so the model can self-correct rather than starting from scratch

 b. Resolve (resolver.py)
     for each SpanDescriptor:
 

 c. Validate (validator.py)
     reject spans where text not in source
     check attributes against TEISchema allowed values
     check element is in schema scope

 d. Inject (injector.py)
+    infer nesting from offset containment (child ⊂ parent by [start, end] bounds)
+    check inferred nesting: children must be within parent bounds
     sort ResolvedSpans by start offset, handle nesting depth-first
     insert tags into a copy of the original source string
     restore previously existing XML tags from step 1
 
     optionally validate against full TEI RelaxNG schema via lxml

 6. RETURN
+    AnnotationResult(
+        xml=annotated_xml_string,
+        fuzzy_spans=list_of_flagged_resolved_spans,
+    )
 ```

 ---
 
 ## Key Design Constraints

 - The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
+ - The **GLiNER pass is optional** (`gliner_model=None` disables it). It is most useful for long texts with `TEXT_GENERATION` endpoints; it is skipped automatically for `EXTRACTION` endpoints or short inputs. When enabled, GLiNER is a pre-filter only — the LLM may reject, correct, or extend its candidates.
+ - **Span nesting is inferred from offsets**, never emitted by the model. `SpanDescriptor` is always flat; `ResolvedSpan.children` is populated by the injector from containment relationships.
 - `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
+ - Fuzzy-matched spans are **surfaced, not silently accepted** — `AnnotationResult.fuzzy_spans` provides a reviewable list alongside the XML.
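The `call_fn` constraint can be seen in a small wrapper: anything that preserves `(str) -> str` composes with the pipeline unchanged. A sketch (the `with_logging` helper is illustrative, not part of the plan):

```python
from typing import Callable

def with_logging(call_fn: Callable[[str], str], log: list[str]) -> Callable[[str], str]:
    """Wrap any (str) -> str endpoint so each exchange is recorded."""
    def wrapped(prompt: str) -> str:
        response = call_fn(prompt)
        log.append(f"{len(prompt)} chars in, {len(response)} chars out")
        return response
    return wrapped

# a mock endpoint for tests is just a lambda -- no framework needed
mock_fn = lambda prompt: "[]"
```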
+
+ ---
+
+ ## Testing Strategy
+
+ ### Mocking philosophy
+
+ **Always mock `call_fn` and the GLiNER detector in unit tests.** Do not use a real GLiNER model as a substitute for a remote LLM endpoint — GLiNER is a span-labelling model that cannot produce JSON responses; it cannot exercise the parse/resolve/inject pipeline. Using a real model also makes tests slow (~seconds per inference on CPU), non-deterministic across versions and hardware, and dependent on a 400MB+ download.
+
+ The `call_fn: (str) -> str` design makes mocking trivial — a lambda returning a hardcoded JSON string is sufficient. No mock framework is needed.
+
+ ### Test layers
+
+ **Layer 1 — Unit tests** (always run, <1s total, fully mocked):
+
+ ```
+ tests/
+ ├── test_chunker.py    # chunker unit tests
+ ├── test_resolver.py   # resolver unit tests
+ ├── test_validator.py  # validator unit tests
+ ├── test_injector.py   # injector unit tests
+ ├── test_builder.py    # prompt builder unit tests
+ ├── test_parser.py     # JSON parse + retry unit tests
+ └── test_pipeline.py   # full pipeline smoke test (mocked call_fn + GLiNER)
+ ```
+
+ **Layer 2 — Integration tests** (opt-in, gated by `pytest -m integration`):
+
+ ```
+ tests/integration/
+ ├── test_gliner_detector.py  # real GLiNER model, real HuggingFace download
+ └── test_pipeline_e2e.py     # full annotate() with real GLiNER + mocked call_fn
+ ```
+
+ Integration tests are excluded from CI by default via `pyproject.toml`:
+
+ ```toml
+ [tool.pytest.ini_options]
+ addopts = "-m 'not integration'"
+ ```
+
+ ### TDD cycle (red → green → refactor)
+
+ Each module is written test-first: write a failing test, implement the minimum code to pass, refactor.
+
+ ### Key test cases per module
+
+ **`chunker.py`**
+
+ - Short text below `chunk_size` → single chunk, offset 0
+ - Long text → multiple chunks with correct `start_offset` per chunk
+ - Span exactly at a chunk boundary → appears in both chunks with correct global offset
+ - Input with existing XML tags → chunk boundaries never split a tag
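The offset behaviour these cases pin down can be sketched as follows (simplified; the real chunker must additionally keep boundaries XML-safe):

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[tuple[int, str]]:
    """Split into (start_offset, chunk) pairs; adjacent chunks share `overlap` chars."""
    if len(text) <= chunk_size:
        return [(0, text)]
    step = chunk_size - overlap  # advance per chunk
    chunks, start = [], 0
    while True:
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

The `start_offset` lets downstream code convert chunk-local hits back to global character positions.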
+
+ **`resolver.py`**
+
+ - Exact context match → `ResolvedSpan` with `fuzzy_match=False`
+ - Context not found in source → span rejected
+ - `source[start:end] != span.text` → span rejected
+ - Context found but span text not within it → span rejected
+ - Fuzzy context match with score < 0.92 → span rejected
+ - Fuzzy context match with 0.92 ≤ score < 1.0 → `fuzzy_match=True`
+ - Multiple occurrences of context → first match used, or rejection if ambiguous
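These cases reduce to one resolution function. A sketch, using `difflib.SequenceMatcher` as a stand-in fuzzy scorer; the 0.92 threshold comes from the cases above, everything else is illustrative:

```python
from difflib import SequenceMatcher

def resolve(source: str, span_text: str, context: str, threshold: float = 0.92):
    """Return (start, end, fuzzy_match) or None if the span cannot be trusted."""
    ctx_start = source.find(context)
    fuzzy = False
    if ctx_start == -1:
        # fall back to a fuzzy scan over equal-length windows of the source
        best_score, best_i = 0.0, -1
        for i in range(len(source) - len(context) + 1):
            score = SequenceMatcher(None, context, source[i:i + len(context)]).ratio()
            if score > best_score:
                best_score, best_i = score, i
        if best_score < threshold:
            return None
        ctx_start, fuzzy = best_i, True
    window = source[ctx_start:ctx_start + len(context)]
    inner = window.find(span_text)
    if inner == -1:
        return None
    start = ctx_start + inner
    end = start + len(span_text)
    if source[start:end] != span_text:
        return None  # hard invariant: output text must equal source text
    return start, end, fuzzy
```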
+
+ **`validator.py`**
+
+ - Element not in schema → span rejected
+ - Attribute not in schema → span rejected
+ - Attribute value not in `allowed_values` → span rejected
+ - Valid span → passes through unchanged
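A minimal sketch of the gate these cases describe, assuming a simplified schema shape (`element -> {attr: allowed_values}`, an empty set meaning any value is allowed):

```python
def validate(element: str, attrs: dict[str, str], schema: dict[str, dict[str, set[str]]]) -> bool:
    """Return True only if element, attribute names, and attribute values are all in scope."""
    if element not in schema:
        return False
    allowed = schema[element]
    for name, value in attrs.items():
        if name not in allowed:
            return False
        if allowed[name] and value not in allowed[name]:
            return False
    return True
```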
+
+ **`injector.py`**
+
+ - Two non-overlapping spans → both tags inserted correctly
+ - Span B offset-contained in span A → B is child of A in output
+ - Overlapping (non-nesting) spans → reject or flatten with warning
+ - Restored XML tags from step 1 do not conflict with injected tags
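The containment ordering can be made deterministic by inserting tags back-to-front with tie-breaking keys at shared boundaries (wider spans open first and close last). An illustrative sketch that assumes spans either nest properly or are disjoint:

```python
def inject(source: str, spans: list[tuple[int, int, str]]) -> str:
    """Insert <elem>...</elem> tags for (start, end, elem) spans.

    Tags are inserted from the end of the string backwards so earlier
    offsets stay valid; nesting falls out of offset containment.
    """
    inserts = []
    for start, end, elem in spans:
        width = end - start
        inserts.append((start, (1, -width), f"<{elem}>"))  # open tag
        inserts.append((end, (0, width), f"</{elem}>"))    # close tag
    out = source
    # descending sort: at equal positions, later-inserted text ends up first,
    # so close tags precede open tags and wider spans wrap narrower ones
    for pos, _, tag in sorted(inserts, key=lambda t: (t[0], t[1]), reverse=True):
        out = out[:pos] + tag + out[pos:]
    return out
```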
+
+ **`builder.py`**
+
+ - `TEXT_GENERATION` capability → prompt contains JSON example and "output only JSON" instruction
+ - `JSON_ENFORCED` capability → prompt is minimal (no JSON scaffolding)
+ - GLiNER candidates present → candidates appear in prompt
+ - GLiNER candidates absent (pass skipped) → prompt has no candidate section
+
+ **`parser.py` (parse + retry logic)**
+
+ - Valid JSON response → parsed to `list[SpanDescriptor]` without retry
+ - Markdown-fenced JSON → fences stripped, parsed correctly
+ - Invalid JSON on first attempt → retry triggered with correction prompt that includes original bad response + parse error message
+ - Invalid JSON on second attempt → exception raised, chunk skipped
+
+ **`pipeline.py` (smoke test)**
+
+ ```python
+ def mock_call_fn(prompt: str) -> str:
+     return json.dumps([
+         {"element": "persName", "text": "John Smith",
+          "context": "said John Smith yesterday", "attrs": {}}
+     ])
+
+ def test_annotate_smoke():
+     schema = TEISchema(elements=[
+         TEIElement(tag="persName", description="a person's name",
+                    allowed_children=[], attributes=[])
+     ])
+     endpoint = EndpointConfig(
+         capability=EndpointCapability.JSON_ENFORCED,
+         call_fn=mock_call_fn,
+     )
+     result = annotate(
+         text="He said John Smith yesterday.",
+         schema=schema,
+         endpoint=endpoint,
+         gliner_model=None,  # disable GLiNER in unit tests
+     )
+     assert "persName" in result.xml
+     assert "John Smith" in result.xml
+     assert result.xml.count("John Smith") == 1  # text not duplicated
+ ```