cmboulanger committed on
Commit
484306e
·
0 Parent(s):

First commit

Files changed (6)
  1. .gitignore +10 -0
  2. .python-version +1 -0
  3. README.md +0 -0
  4. implementation-plan.md +229 -0
  5. main.py +6 -0
  6. pyproject.toml +7 -0
.gitignore ADDED
@@ -0,0 +1,10 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
README.md ADDED
File without changes
implementation-plan.md ADDED
@@ -0,0 +1,229 @@
+ # Implementation Plan: `tei-annotator`
+
+ Prompt:
+
+ > Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
+ >
+ > **Inputs:**
+ > - A text string to annotate (may already contain partial XML)
+ > - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
+ > - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
+ > - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
+ >
+ > **Pipeline:**
+ > 1. Check for local GLiNER installation and print setup instructions if missing
+ > 2. Run a local GLiNER model as a first-pass span detector, mapping TEI element descriptions to GLiNER labels
+ > 3. Chunk long texts with overlap, tracking global character offsets
+ > 4. Assemble a prompt using the GLiNER candidates, TEI schema context, and JSON output instructions tailored to the endpoint capability
+ > 5. Parse and validate the returned span manifest; each span has `element`, `text`, `context` (surrounding text for position resolution), and `attrs`
+ > 6. Resolve spans to character offsets by searching for the context string in the source, then locating the span text within it; reject any span where `source[start:end] != span.text`
+ > 7. Inject tags deterministically into the original source text, handling nesting
+ > 8. Return the annotated XML plus a list of fuzzy-matched spans flagged for human review
+ >
+ > The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.
+
+ ## Package Structure
+
+ ```
+ tei_annotator/
+ ├── __init__.py
+ ├── models/
+ │   ├── schema.py              # TEI element/attribute data structures
+ │   └── spans.py               # Span manifest data structures
+ ├── detection/
+ │   ├── gliner_detector.py     # Local GLiNER first-pass span detection
+ │   └── setup.py               # GLiNER availability check + install instructions
+ ├── chunking/
+ │   └── chunker.py             # Overlap-aware text chunker, XML-safe boundaries
+ ├── prompting/
+ │   ├── builder.py             # Prompt assembly
+ │   └── templates/
+ │       ├── text_gen.jinja2        # For plain text-generation endpoints
+ │       └── json_enforced.jinja2   # For JSON-mode / constrained endpoints
+ ├── inference/
+ │   └── endpoint.py            # Endpoint wrapper + capability enum
+ ├── postprocessing/
+ │   ├── resolver.py            # Context-anchor → char offset resolution
+ │   ├── validator.py           # Span verification + schema validation
+ │   └── injector.py            # Deterministic XML construction
+ └── pipeline.py                # Top-level orchestration
+ ```
+
+ ---
+
+ ## Data Structures (`models/`)
+
+ ```python
+ from dataclasses import dataclass
+
+ # schema.py
+ @dataclass
+ class TEIAttribute:
+     name: str  # e.g. "ref", "type", "cert"
+     description: str
+     required: bool = False
+     allowed_values: list[str] | None = None  # None = free string
+
+ @dataclass
+ class TEIElement:
+     tag: str  # e.g. "persName"
+     description: str  # from TEI Guidelines
+     allowed_children: list[str]  # tags of legal child elements
+     attributes: list[TEIAttribute]
+
+ @dataclass
+ class TEISchema:
+     elements: list[TEIElement]
+     # convenience lookup
+     def get(self, tag: str) -> TEIElement | None: ...
+
+ # spans.py
+ @dataclass
+ class SpanDescriptor:
+     element: str
+     text: str
+     context: str  # must contain text as substring
+     attrs: dict[str, str]
+     children: list["SpanDescriptor"]  # for nested annotations
+     confidence: float | None = None  # passed through from GLiNER
+
+ @dataclass
+ class ResolvedSpan:
+     element: str
+     start: int
+     end: int
+     attrs: dict[str, str]
+     children: list["ResolvedSpan"]
+     fuzzy_match: bool = False  # flagged for human review
+ ```
+
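The `get` lookup on `TEISchema` is left as `...` in the plan. The following is a minimal sketch of one way to fill it in, assuming a linear scan is acceptable for the small element subsets the plan targets. The helper `attribute_errors` and the sample schema are hypothetical illustrations of the validator's attribute check, not part of the plan, and the list fields are given empty defaults here only to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TEIAttribute:
    name: str
    description: str
    required: bool = False
    allowed_values: list[str] | None = None  # None = free string


@dataclass
class TEIElement:
    tag: str
    description: str
    allowed_children: list[str] = field(default_factory=list)
    attributes: list[TEIAttribute] = field(default_factory=list)


@dataclass
class TEISchema:
    elements: list[TEIElement]

    def get(self, tag: str) -> TEIElement | None:
        # Linear scan; adequate for a handful of in-scope elements.
        return next((e for e in self.elements if e.tag == tag), None)


def attribute_errors(schema: TEISchema, tag: str, attrs: dict[str, str]) -> list[str]:
    """Hypothetical helper: describe every problem with `attrs` on element `tag`."""
    element = schema.get(tag)
    if element is None:
        return [f"element <{tag}> is not in schema scope"]
    known = {a.name: a for a in element.attributes}
    errors = []
    for name, value in attrs.items():
        spec = known.get(name)
        if spec is None:
            errors.append(f"attribute {name!r} not allowed on <{tag}>")
        elif spec.allowed_values is not None and value not in spec.allowed_values:
            errors.append(f"value {value!r} not allowed for {name!r}")
    for spec in element.attributes:
        if spec.required and spec.name not in attrs:
            errors.append(f"missing required attribute {spec.name!r}")
    return errors


# Illustrative schema with a single element (sample data, not from the plan).
SCHEMA = TEISchema(elements=[
    TEIElement(
        tag="persName",
        description="a person's name",
        attributes=[TEIAttribute("cert", "certainty", allowed_values=["high", "medium", "low"])],
    ),
])
```

Returning a list of messages rather than raising on the first problem lets a caller report all issues for a span at once before deciding whether to reject it.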
+ ---
+
+ ## GLiNER Setup (`detection/setup.py`)
+
+ ```python
+ def check_gliner() -> bool:
+     try:
+         import gliner
+         return True
+     except ImportError:
+         print("""
+ GLiNER is not installed. To install:
+
+     pip install gliner
+
+ Recommended models (downloaded automatically on first use):
+     urchade/gliner_medium-v2.1                # balanced, Apache 2.0
+     numind/NuNER_Zero                         # stronger multi-word entities, MIT
+     knowledgator/gliner-multitask-large-v0.5  # adds relation extraction
+
+ Models run on CPU; no GPU required.
+ """)
+         return False
+ ```
+
+ ---
+
+ ## Endpoint Abstraction (`inference/endpoint.py`)
+
+ ```python
+ from dataclasses import dataclass
+ from enum import Enum
+ from typing import Callable
+
+ class EndpointCapability(Enum):
+     TEXT_GENERATION = "text_generation"  # plain LLM, JSON via prompt only
+     JSON_ENFORCED = "json_enforced"      # constrained decoding guaranteed
+     EXTRACTION = "extraction"            # GLiNER2/NuExtract-style native
+
+ @dataclass
+ class EndpointConfig:
+     capability: EndpointCapability
+     call_fn: Callable[[str], str]
+     # call_fn signature: takes a prompt string, returns a response string
+     # caller is responsible for auth, model selection, retries
+ ```
+
+ The `call_fn` injection means the library is agnostic about whether the caller is hitting Anthropic, OpenAI, a local Ollama instance, or Fastino's GLiNER2 API. The library just hands it a string and gets a string back.
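To illustrate what the `(str) -> str` contract allows, here is a small sketch of a canned-response test double and a logging wrapper. `make_mock_call_fn`, `with_logging`, and the sample span are hypothetical names and data introduced for this example. The two classes are repeated from the plan only so the block runs on its own.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class EndpointCapability(Enum):
    TEXT_GENERATION = "text_generation"
    JSON_ENFORCED = "json_enforced"
    EXTRACTION = "extraction"


@dataclass
class EndpointConfig:
    capability: EndpointCapability
    call_fn: Callable[[str], str]


def make_mock_call_fn(spans: list[dict]) -> Callable[[str], str]:
    """Hypothetical test double: ignores the prompt and returns canned JSON."""
    payload = json.dumps(spans)

    def call_fn(prompt: str) -> str:
        return payload

    return call_fn


def with_logging(call_fn: Callable[[str], str], log: list) -> Callable[[str], str]:
    """Wrap any call_fn so that each (prompt, response) pair is recorded in `log`."""

    def wrapped(prompt: str) -> str:
        response = call_fn(prompt)
        log.append((prompt, response))
        return response

    return wrapped


calls: list = []
endpoint = EndpointConfig(
    capability=EndpointCapability.JSON_ENFORCED,
    call_fn=with_logging(
        make_mock_call_fn([
            {"element": "persName", "text": "Ada Lovelace",
             "context": "by Ada Lovelace in", "attrs": {}},
        ]),
        calls,
    ),
)
response = endpoint.call_fn("annotate: ...")
```

Because the wrapper returns another `(str) -> str` callable, it composes with a real client function in exactly the same way as with the mock.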
+
+ ---
+
+ ## Pipeline (`pipeline.py`)
+
+ ```python
+ def annotate(
+     text: str,                 # may contain existing XML tags
+     schema: TEISchema,         # subset of TEI elements in scope
+     endpoint: EndpointConfig,  # injected inference dependency
+     gliner_model: str = "numind/NuNER_Zero",
+     chunk_size: int = 1500,    # chars
+     chunk_overlap: int = 200,
+ ) -> tuple[str, list[ResolvedSpan]]:  # (annotated XML, fuzzy-matched spans for review)
+     ...
+ ```
+
+ ### Execution Flow
+
+ ```
+ 1. SETUP
+    check_gliner() → prompt user to install if missing
+    strip existing XML tags from text for processing,
+    preserve them as a restoration map for final merge
+
+ 2. GLINER PASS
+    map TEISchema elements → flat label list for GLiNER
+      e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
+    chunk text if len(text) > chunk_size (with overlap)
+    run gliner.predict_entities() on each chunk
+    merge cross-chunk duplicates by span text + context overlap
+    output: list[SpanDescriptor] with text + context + element + confidence
+
+ 3. PROMPT ASSEMBLY
+    select template based on EndpointCapability:
+      TEXT_GENERATION: include JSON structure example + "output only JSON" instruction
+      JSON_ENFORCED:   minimal prompt, schema enforced externally
+      EXTRACTION:      pass schema directly in endpoint's native format, skip LLM prompt
+    inject into prompt:
+      - TEIElement descriptions + allowed attributes for in-scope elements
+      - GLiNER pre-detected spans as candidates for the model to enrich/correct
+      - source text chunk
+      - instruction to emit one SpanDescriptor per occurrence, not per unique entity
+
+ 4. INFERENCE
+    call endpoint.call_fn(prompt) → raw response string
+
+ 5. POSTPROCESSING (per chunk, then merged)
+
+    a. Parse
+       JSON_ENFORCED/EXTRACTION: parse directly
+       TEXT_GENERATION: strip markdown fences, parse JSON,
+                        retry once with correction prompt on failure
+
+    b. Resolve (resolver.py)
+       for each SpanDescriptor:
+         find context string in source → exact match preferred
+         find text within context window
+         assert source[start:end] == span.text → reject on mismatch
+         fuzzy fallback (threshold 0.92) → flag for review
+
+    c. Validate (validator.py)
+       reject spans where text not in source
+       check nesting: children must be within parent bounds
+       check attributes against TEISchema allowed values
+       check element is in schema scope
+
+    d. Inject (injector.py)
+       sort ResolvedSpans by start offset, handle nesting depth-first
+       insert tags into a copy of the original source string
+       restore previously existing XML tags from step 1
+
+    e. Final validation
+       parse output as XML → reject malformed documents
+       optionally validate against full TEI RelaxNG schema via lxml
+
+ 6. RETURN
+    annotated XML string
+    + list of fuzzy-matched spans flagged for human review
+ ```
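
Step 2 of the flow chunks long text with overlap, and later steps need offsets that refer to the full source rather than to a chunk. The sketch below shows one way to do that, assuming fixed-size character windows and ignoring the XML-safe boundary handling the package layout mentions. `Chunk`, `chunk_text`, `to_global`, and `SAMPLE` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: int  # offset of text[0] within the full source


def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[Chunk]:
    """Split `text` into windows of at most `chunk_size` chars that overlap by `chunk_overlap`."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if len(text) <= chunk_size:
        return [Chunk(text, 0)]
    step = chunk_size - chunk_overlap  # always >= 1, so the loop advances
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(text[start:end], start))
        if end == len(text):
            break
        start += step
    return chunks


def to_global(chunk: Chunk, local_start: int, local_end: int) -> tuple[int, int]:
    """Translate chunk-local offsets into offsets in the full source."""
    return chunk.start + local_start, chunk.start + local_end


SAMPLE = "In 1843 Ada Lovelace published her notes in London."
chunks = chunk_text(SAMPLE, chunk_size=20, chunk_overlap=5)
```

A span that GLiNER reports at local offsets inside the second chunk can then be mapped back with `to_global(chunks[1], local_start, local_end)` before cross-chunk duplicates are merged.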
+
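Step 5b (Resolve) can be sketched as follows. The plan gives a fuzzy threshold of 0.92 but does not name a similarity measure, so this example assumes `difflib.SequenceMatcher` over a sliding window the length of the context string. `Resolved` and `resolve` are hypothetical names, and the brute-force window scan is meant only for short chunks.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Resolved:
    start: int
    end: int
    fuzzy_match: bool  # True when the context was only matched approximately


def resolve(source: str, text: str, context: str, threshold: float = 0.92) -> Resolved | None:
    """Locate `text` in `source` via its `context`; return None when the span is rejected."""
    if not text or text not in context:
        return None  # malformed span: the context must contain the text
    fuzzy = False
    ctx_len = len(context)
    ctx_start = source.find(context)  # exact match preferred
    if ctx_start == -1:
        # Fuzzy fallback (assumed measure): best-scoring window of the same length.
        best_ratio, best_pos = 0.0, -1
        for pos in range(max(1, len(source) - ctx_len + 1)):
            window = source[pos:pos + ctx_len]
            ratio = SequenceMatcher(None, window, context, autojunk=False).ratio()
            if ratio > best_ratio:
                best_ratio, best_pos = ratio, pos
        if best_ratio < threshold:
            return None
        ctx_start, fuzzy = best_pos, True
    # The span text itself must still occur verbatim inside the context window.
    start = source.find(text, ctx_start, ctx_start + ctx_len)
    if start == -1:
        return None
    end = start + len(text)
    if source[start:end] != text:  # the invariant the plan requires
        return None
    return Resolved(start, end, fuzzy)


SOURCE = "The letter was signed by Ada Lovelace in London."
exact = resolve(SOURCE, "Ada Lovelace", "signed by Ada Lovelace in")
approx = resolve(SOURCE, "Ada Lovelace", "signed by Ada Lovelace in Londen")
missing = resolve(SOURCE, "Charles Babbage", "built by Charles Babbage himself")
```

Only the context is allowed to match approximately in this sketch. The span text is always required to appear verbatim, which keeps `source[start:end] == span.text` true for every accepted span, including the ones flagged for review.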
+ ---
+
+ ## Key Design Constraints
+
+ - The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
+ - GLiNER is a **pre-filter**, not the authority. The LLM can reject, correct, or add to its candidates. GLiNER's value is positional reliability; the LLM's value is schema reasoning.
+ - `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
+ - Fuzzy-matched spans are **surfaced, not silently accepted**: the return value includes a reviewable list alongside the XML.
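
The first constraint can be checked mechanically: if tags are only ever inserted at resolved offsets, stripping them from the output must give back the source unchanged. The sketch below is one possible injector under simplifying assumptions: spans are passed as a flat list and nesting is derived from containment rather than from a `children` field, the source contains no markup-significant characters, and restoration of pre-existing tags is omitted. `inject` is a hypothetical name.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from xml.sax.saxutils import quoteattr


@dataclass
class ResolvedSpan:
    element: str
    start: int
    end: int
    attrs: dict[str, str] = field(default_factory=dict)


def inject(source: str, spans: list[ResolvedSpan]) -> str:
    """Insert open/close tags at span offsets without altering any source character."""
    # Outer spans first: sort by start offset, then longest first.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    stack: list[ResolvedSpan] = []
    for s in ordered:
        if not 0 <= s.start < s.end <= len(source):
            raise ValueError(f"span out of bounds or empty: {s}")
        while stack and stack[-1].end <= s.start:
            stack.pop()
        if stack and s.end > stack[-1].end:
            raise ValueError(f"span crosses an enclosing span: {s}")
        stack.append(s)
    # At one position: closing tags before opening tags, inner closes before
    # outer closes, and outer opens before inner opens.
    events = []
    for idx, s in enumerate(ordered):
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in s.attrs.items())
        length = s.end - s.start
        events.append((s.start, 1, -length, idx, f"<{s.element}{attrs}>"))
        events.append((s.end, 0, length, -idx, f"</{s.element}>"))
    events.sort(key=lambda e: e[:4])
    out, cursor = [], 0
    for pos, _, _, _, markup in events:
        out.append(source[cursor:pos])
        out.append(markup)
        cursor = pos
    out.append(source[cursor:])
    return "".join(out)


SOURCE = "Ada Lovelace met Charles Babbage in London."
place = SOURCE.index("London")
xml = inject(SOURCE, [
    ResolvedSpan("persName", 0, len("Ada Lovelace")),
    ResolvedSpan("forename", 0, len("Ada")),
    ResolvedSpan("placeName", place, place + len("London"), {"ref": "#london"}),
])
# Removing every tag must reproduce the original text exactly.
assert re.sub(r"<[^>]+>", "", xml) == SOURCE
```

Raising on crossing spans, instead of trying to repair them, keeps the injector deterministic and leaves the decision about such spans to the validation step.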
main.py ADDED
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from tei-annotator!")
+
+
+ if __name__ == "__main__":
+     main()
pyproject.toml ADDED
@@ -0,0 +1,7 @@
+ [project]
+ name = "tei-annotator"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = "==3.12.*"
+ dependencies = []