Commit · 484306e
0 Parent(s)
First commit
- .gitignore +10 -0
- .python-version +1 -0
- README.md +0 -0
- implementation-plan.md +229 -0
- main.py +6 -0
- pyproject.toml +7 -0
.gitignore
ADDED
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.python-version
ADDED
@@ -0,0 +1 @@
3.12
README.md
ADDED
File without changes
implementation-plan.md
ADDED
@@ -0,0 +1,229 @@
# Implementation Plan: `tei-annotator`

Prompt:

> Design a Python library called `tei-annotator` for annotating text with TEI XML tags using a two-stage LLM pipeline. The library should:
>
> **Inputs:**
> - A text string to annotate (may already contain partial XML)
> - An injected `call_fn: (str) -> str` for calling an arbitrary inference endpoint
> - An `EndpointCapability` enum indicating whether the endpoint is plain text generation, JSON-constrained, or a native extraction model like GLiNER2
> - A `TEISchema` data structure describing a subset of TEI elements with descriptions, allowed attributes, and legal child elements
>
> **Pipeline:**
> 1. Check for local GLiNER installation and print setup instructions if missing
> 2. Run a local GLiNER model as a first-pass span detector, mapping TEI element descriptions to GLiNER labels
> 3. Chunk long texts with overlap, tracking global character offsets
> 4. Assemble a prompt using the GLiNER candidates, TEI schema context, and JSON output instructions tailored to the endpoint capability
> 5. Parse and validate the returned span manifest: each span has `element`, `text`, `context` (surrounding text for position resolution), and `attrs`
> 6. Resolve spans to character offsets by searching for the context string in the source, then locating the span text within it; reject any span where `source[start:end] != span.text`
> 7. Inject tags deterministically into the original source text, handling nesting
> 8. Return the annotated XML plus a list of fuzzy-matched spans flagged for human review
>
> The source text must never be modified by any model. Provide a package structure, all key data structures, and a step-by-step execution flow.

## Package Structure

```
tei_annotator/
├── __init__.py
├── models/
│   ├── schema.py                 # TEI element/attribute data structures
│   └── spans.py                  # Span manifest data structures
├── detection/
│   ├── gliner_detector.py        # Local GLiNER first-pass span detection
│   └── setup.py                  # GLiNER availability check + install instructions
├── chunking/
│   └── chunker.py                # Overlap-aware text chunker, XML-safe boundaries
├── prompting/
│   ├── builder.py                # Prompt assembly
│   └── templates/
│       ├── text_gen.jinja2       # For plain text-generation endpoints
│       └── json_enforced.jinja2  # For JSON-mode / constrained endpoints
├── inference/
│   └── endpoint.py               # Endpoint wrapper + capability enum
├── postprocessing/
│   ├── resolver.py               # Context-anchor → char offset resolution
│   ├── validator.py              # Span verification + schema validation
│   └── injector.py               # Deterministic XML construction
└── pipeline.py                   # Top-level orchestration
```

---

## Data Structures (`models/`)

```python
from dataclasses import dataclass  # needed by both schema.py and spans.py

# schema.py
@dataclass
class TEIAttribute:
    name: str                                # e.g. "ref", "type", "cert"
    description: str
    required: bool = False
    allowed_values: list[str] | None = None  # None = free string

@dataclass
class TEIElement:
    tag: str                     # e.g. "persName"
    description: str             # from TEI Guidelines
    allowed_children: list[str]  # tags of legal child elements
    attributes: list[TEIAttribute]

@dataclass
class TEISchema:
    elements: list[TEIElement]
    # convenience lookup
    def get(self, tag: str) -> TEIElement | None: ...

# spans.py
@dataclass
class SpanDescriptor:
    element: str
    text: str
    context: str                      # must contain text as substring
    attrs: dict[str, str]
    children: list["SpanDescriptor"]  # for nested annotations
    confidence: float | None = None   # passed through from GLiNER

@dataclass
class ResolvedSpan:
    element: str
    start: int
    end: int
    attrs: dict[str, str]
    children: list["ResolvedSpan"]
    fuzzy_match: bool = False         # flagged for human review
```

---

## GLiNER Setup (`detection/setup.py`)

```python
def check_gliner() -> bool:
    try:
        import gliner
        return True
    except ImportError:
        print("""
GLiNER is not installed. To install:

    pip install gliner

Recommended models (downloaded automatically on first use):
    urchade/gliner_medium-v2.1                  # balanced, Apache 2.0
    numind/NuNER_Zero                           # stronger multi-word entities, MIT
    knowledgator/gliner-multitask-large-v0.5    # adds relation extraction

Models run on CPU; no GPU required.
""")
        return False
```
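
A variant that avoids importing the package just to test for it could use `importlib.util.find_spec`, which only locates the module. The sketch below shows one possible shape; the helper name `is_installed` is hypothetical and not part of the plan.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken parent packages or odd module state
        return False


if __name__ == "__main__":
    print("gliner available:", is_installed("gliner"))
```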

---

## Endpoint Abstraction (`inference/endpoint.py`)

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class EndpointCapability(Enum):
    TEXT_GENERATION = "text_generation"  # plain LLM, JSON via prompt only
    JSON_ENFORCED = "json_enforced"      # constrained decoding guaranteed
    EXTRACTION = "extraction"            # GLiNER2/NuExtract-style native

@dataclass
class EndpointConfig:
    capability: EndpointCapability
    call_fn: Callable[[str], str]
    # call_fn signature: takes a prompt string, returns a response string
    # caller is responsible for auth, model selection, retries
```

The `call_fn` injection means the library is agnostic about whether the caller is hitting Anthropic, OpenAI, a local Ollama instance, or Fastino's GLiNER2 API. The library just hands it a string and gets a string back.

---

## Pipeline (`pipeline.py`)

```python
from tei_annotator.inference.endpoint import EndpointConfig
from tei_annotator.models.schema import TEISchema
from tei_annotator.models.spans import ResolvedSpan

def annotate(
    text: str,                        # may contain existing XML tags
    schema: TEISchema,                # subset of TEI elements in scope
    endpoint: EndpointConfig,         # injected inference dependency
    gliner_model: str = "numind/NuNER_Zero",
    chunk_size: int = 1500,           # chars
    chunk_overlap: int = 200,
) -> tuple[str, list[ResolvedSpan]]:  # annotated XML + fuzzy-matched spans (step 6)
    ...
```
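
The `chunk_size` and `chunk_overlap` parameters imply a sliding window over the text. A minimal sketch of such a chunker follows, recording each chunk's global start offset so chunk-local positions can be mapped back to the source. The `Chunk` type and the `chunk_text` and `to_global` names are assumptions; this version cuts at fixed character positions and does not attempt the XML-safe boundaries the plan calls for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    start: int  # offset of this chunk's first character in the full source


def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[Chunk]:
    """Split `text` into overlapping windows, tracking global offsets."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if len(text) <= chunk_size:
        return [Chunk(text, 0)]

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(text[start:end], start))
        if end == len(text):
            break
        start += step
    return chunks


def to_global(chunk: Chunk, local_start: int, local_end: int) -> tuple[int, int]:
    """Translate chunk-local offsets to offsets in the full source text."""
    return chunk.start + local_start, chunk.start + local_end
```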

### Execution Flow

```
1. SETUP
   check_gliner() → prompt user to install if missing
   strip existing XML tags from text for processing,
   preserve them as a restoration map for final merge

2. GLINER PASS
   map TEISchema elements → flat label list for GLiNER
     e.g. [("persName", "a person's name"), ("placeName", "a place name"), ...]
   chunk text if len(text) > chunk_size (with overlap)
   run gliner.predict_entities() on each chunk
   merge cross-chunk duplicates by span text + context overlap
   output: list[SpanDescriptor] with text + context + element + confidence

3. PROMPT ASSEMBLY
   select template based on EndpointCapability:
     TEXT_GENERATION: include JSON structure example + "output only JSON" instruction
     JSON_ENFORCED:   minimal prompt, schema enforced externally
     EXTRACTION:      pass schema directly in endpoint's native format, skip LLM prompt
   inject into prompt:
     - TEIElement descriptions + allowed attributes for in-scope elements
     - GLiNER pre-detected spans as candidates for the model to enrich/correct
     - source text chunk
     - instruction to emit one SpanDescriptor per occurrence, not per unique entity

4. INFERENCE
   call endpoint.call_fn(prompt) → raw response string

5. POSTPROCESSING (per chunk, then merged)

   a. Parse
      JSON_ENFORCED/EXTRACTION: parse directly
      TEXT_GENERATION: strip markdown fences, parse JSON,
                       retry once with correction prompt on failure

   b. Resolve (resolver.py)
      for each SpanDescriptor:
        find context string in source (exact match preferred)
        find text within context window
        assert source[start:end] == span.text → reject on mismatch
        fuzzy fallback (threshold 0.92) → flag for review

   c. Validate (validator.py)
      reject spans where text not in source
      check nesting: children must be within parent bounds
      check attributes against TEISchema allowed values
      check element is in schema scope

   d. Inject (injector.py)
      sort ResolvedSpans by start offset, handle nesting depth-first
      insert tags into a copy of the original source string
      restore previously existing XML tags from step 1

   e. Final validation
      parse output as XML → reject malformed documents
      optionally validate against full TEI RelaxNG schema via lxml

6. RETURN
   annotated XML string
   + list of fuzzy-matched spans flagged for human review
```

---

## Key Design Constraints

- The source text is **never modified by any model call**. All text in the output comes from the original input; models only contribute tag positions and attributes.
- GLiNER is a **pre-filter**, not the authority. The LLM can reject, correct, or add to its candidates. GLiNER's value is positional reliability; the LLM's value is schema reasoning.
- `call_fn` has **no required signature beyond `(str) -> str`**, making it trivial to swap endpoints, add logging, or inject mock functions for testing.
- Fuzzy-matched spans are **surfaced, not silently accepted**: the return value includes a reviewable list alongside the XML.
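
The first constraint can be checked mechanically: if every inserted tag is stripped from the output, the original text should come back unchanged. The sketch below injects tags for flat, non-overlapping `(start, end, element, attrs)` tuples and asserts that invariant. It works under simplifying assumptions that the plan does not make: no nesting, no pre-existing XML in the source, and a hypothetical `inject_flat` helper. Source characters are XML-escaped with `xml.sax.saxutils`, and the check undoes that escaping before comparing.

```python
import re
from xml.sax.saxutils import escape, quoteattr, unescape

Span = tuple[int, int, str, dict[str, str]]  # (start, end, element, attrs)


def inject_flat(source: str, spans: list[Span]) -> str:
    """Wrap non-overlapping spans of `source` in tags; text comes only from `source`."""
    out, cursor = [], 0
    for start, end, element, attrs in sorted(spans, key=lambda s: s[0]):
        if start < cursor or not start < end <= len(source):
            raise ValueError(f"overlapping or out-of-range span: {(start, end)}")
        attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        out.append(escape(source[cursor:start]))
        out.append(f"<{element}{attr_str}>{escape(source[start:end])}</{element}>")
        cursor = end
    out.append(escape(source[cursor:]))
    annotated = "".join(out)

    # Invariant: removing the tags and undoing escaping recovers the source exactly.
    assert unescape(re.sub(r"<[^>]+>", "", annotated)) == source
    return annotated


source = "Ada Lovelace met Babbage & co. in London."
xml = inject_flat(source, [(34, 40, "placeName", {}), (0, 12, "persName", {"ref": "#ada"})])
```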
main.py
ADDED
@@ -0,0 +1,6 @@
def main():
    print("Hello from tei-annotator!")


if __name__ == "__main__":
    main()
pyproject.toml
ADDED
@@ -0,0 +1,7 @@
[project]
name = "tei-annotator"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = "==3.12"
dependencies = []