
Post-processing

Post-processing converts the LLM's raw text output into XML tags inserted at the correct positions in the source text. It is split into four focused modules that execute in sequence:

LLM response (string)
       │
       ▼
 1. parse      →  list[SpanDescriptor]          (parser.py)
       │
       ▼
 2. resolve    →  list[ResolvedSpan]            (resolver.py)
       │
       ▼
 3. validate   →  list[ResolvedSpan] (filtered) (validator.py)
       │
       ▼
 4. inject     →  annotated XML string          (injector.py)
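The two intermediate types in the diagram can be sketched as dataclasses. Field names follow this document (element, text, context, attributes, fuzzy_match); the actual definitions in the codebase may differ:

```python
from dataclasses import dataclass, field

@dataclass
class SpanDescriptor:
    """One annotation as described by the LLM (output of stage 1)."""
    element: str                  # TEI element name, e.g. "persName"
    text: str                     # the exact entity text
    context: str                  # surrounding text used as a search anchor
    attributes: dict[str, str] = field(default_factory=dict)

@dataclass
class ResolvedSpan:
    """A descriptor anchored to absolute offsets (output of stage 2)."""
    element: str
    text: str
    start: int                    # offset into the original source text
    end: int                      # exclusive end offset
    attributes: dict[str, str] = field(default_factory=dict)
    fuzzy_match: bool = False     # True when fuzzy matching located the context
```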

1. Parse (parser.py)

Input: raw string returned by call_fn
Output: list[SpanDescriptor]

Handles the messiness of real LLM output:

  • Fence stripping — models often wrap JSON in markdown code fences (```json … ```). These are detected and removed before parsing, regardless of the language tag used.
  • JSON extraction — the cleaned string is parsed with json.loads. Only a JSON array is accepted; any other top-level type raises ValueError.
  • Dict-to-span conversion — each dict is validated for required keys (element, text, context) and coerced to a SpanDescriptor. Dicts with missing or non-string values are silently dropped.
  • Retry on failure — for TEXT_GENERATION endpoints, if the initial parse fails the pipeline calls make_correction_prompt() and retries exactly once with the same call_fn. If the second parse also fails, the chunk is skipped with a warning.
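The first three bullets can be sketched as follows. This is illustrative, not the library's code: the fence-stripping regex and the REQUIRED_KEYS tuple are assumptions, and the retry logic is omitted:

```python
import json
import re

REQUIRED_KEYS = ("element", "text", "context")

def strip_fences(raw: str) -> str:
    """Remove a markdown code fence, whatever its language tag."""
    m = re.search(r"```[a-zA-Z]*\s*\n(.*?)```", raw, re.DOTALL)
    return m.group(1) if m else raw

def parse(raw: str) -> list[dict]:
    """Parse the LLM response into a list of span dicts."""
    data = json.loads(strip_fences(raw))
    if not isinstance(data, list):
        # Only a top-level JSON array is accepted.
        raise ValueError("expected a top-level JSON array")
    spans = []
    for item in data:
        # Drop dicts with missing or non-string required values.
        if isinstance(item, dict) and all(
            isinstance(item.get(k), str) for k in REQUIRED_KEYS
        ):
            spans.append(item)
    return spans
```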

2. Resolve (resolver.py)

Input: list[SpanDescriptor], plain source text
Output: list[ResolvedSpan]

Converts context-anchored descriptors to absolute character offsets.

Why context anchoring?

LLMs cannot reliably count characters in long strings, so asking them to return numeric offsets produces frequent errors. Instead, each span is described with its surrounding text (SpanDescriptor.context). The resolver searches the source text for that context string to determine where the entity lives. This approach trades a small risk of positional ambiguity (two identical context strings in different positions) for a large gain in robustness. In practice, entity text is distinctive enough that collisions are rare.
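For example, a context-anchored descriptor for a person name might look like this (a hypothetical payload; the keys come from SpanDescriptor):

```python
descriptor = {
    "element": "persName",
    "text": "Johann Wolfgang von Goethe",
    # The context repeats the entity plus a few surrounding words, so the
    # resolver can locate it by substring search; no character offsets
    # are required from the model.
    "context": "Johann Wolfgang von Goethe was born in",
}
source = "Johann Wolfgang von Goethe was born in Frankfurt am Main."
```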

Resolution algorithm

For each SpanDescriptor:

  1. Locate context — search for context in the source text:

    • First, try exact substring search (str.find).
    • If not found, fall back to fuzzy matching via rapidfuzz with a default similarity threshold of 0.92. Fuzzy-matched spans receive ResolvedSpan.fuzzy_match = True and appear in AnnotationResult.fuzzy_spans for human review.
    • If neither match succeeds, the span is silently discarded.
  2. Locate text within context — once the context window is pinned to a position in the source text, do a substring search for SpanDescriptor.text within that window.

    • If not found, the span is discarded.
  3. Compute offsets — the context window's start offset plus the text's offset within the window gives (start, end) in the original source text.

Fuzzy matching details

rapidfuzz's extractOne function scores candidate substrings using the partial_ratio scorer, which finds the best-matching substring of the target for the query string. The 0.92 threshold was chosen empirically to accept minor OCR errors, whitespace normalisation differences, and Unicode equivalences while rejecting clearly wrong matches.


3. Validate (validator.py)

Input: list[ResolvedSpan], TEISchema
Output: list[ResolvedSpan] (filtered)

Three checks are applied. Spans failing any check are dropped and logged at WARNING level:

  1. Element exists — the span's element must appear in the schema's element list.
  2. Attribute names — every key in ResolvedSpan.attributes must be declared in the element's TEIAttribute list.
  3. Attribute values — if TEIAttribute.allowed_values is set, the value must be one of those strings.

Additionally, spans with invalid bounds (start < 0, end > len(text), or start >= end) are rejected.
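The three checks plus the bounds check can be sketched like this. The dict-based schema representation is an assumption for brevity; the real TEISchema and TEIAttribute classes are richer:

```python
import logging

logger = logging.getLogger(__name__)

def validate(spans: list[dict], schema: dict, source_len: int) -> list[dict]:
    """Keep only spans passing the schema checks and bounds check.

    `schema` maps element names to {attribute: allowed_values},
    where allowed_values is None when any string is accepted.
    """
    kept = []
    for span in spans:
        attrs_spec = schema.get(span["element"])
        if attrs_spec is None:                               # 1. element exists
            logger.warning("unknown element %r", span["element"])
            continue
        ok = True
        for name, value in span.get("attributes", {}).items():
            if name not in attrs_spec:                       # 2. attribute name
                logger.warning("undeclared attribute %r", name)
                ok = False
            elif attrs_spec[name] and value not in attrs_spec[name]:
                logger.warning("disallowed value %r for %r", value, name)  # 3. value
                ok = False
        # Bounds: reject start < 0, end > len(text), or start >= end.
        if ok and 0 <= span["start"] < span["end"] <= source_len:
            kept.append(span)
    return kept
```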

Validation is a safety net against hallucinations: the LLM occasionally invents element tags or attribute names that are not in the schema. Dropping those spans keeps the output valid.


4. Inject (injector.py)

Input: plain source text, list[ResolvedSpan]
Output: annotated XML string

Building the nesting tree

Spans are flat at this point (no children). The injector infers nesting geometrically: span A is a child of span B if B.start <= A.start and A.end <= B.end.

_build_nesting_tree implements a greedy algorithm:

  1. Sort spans by start offset, then by length descending (so parents, which are longer, sort before their children).
  2. For each span, find its tightest enclosing parent by scanning already-placed spans.
  3. Attach it as a child of that parent, or as a root-level span if no enclosing parent exists.

Overlapping spans, where two spans partially overlap without one containing the other, are detected and logged as warnings. The offending span (the one with the later start position) is skipped, since overlapping XML tags are not well-formed.
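A sketch of the greedy algorithm, using plain dicts for spans (the real _build_nesting_tree works on ResolvedSpan objects):

```python
def build_nesting_tree(spans: list[dict]) -> list[dict]:
    """Nest flat spans by geometric containment; returns root-level spans."""
    # 1. Sort by start, then length descending, so parents sort before children.
    ordered = sorted(spans, key=lambda s: (s["start"], -(s["end"] - s["start"])))
    roots: list[dict] = []
    placed: list[dict] = []
    for span in ordered:
        # Skip spans that partially overlap an already-placed span:
        # overlapping XML tags would not be well-formed.
        if any(c["start"] < span["start"] < c["end"] < span["end"] for c in placed):
            continue
        span["children"] = []
        parent = None
        # 2. Find the tightest enclosing parent among already-placed spans.
        for cand in placed:
            if cand["start"] <= span["start"] and span["end"] <= cand["end"]:
                if parent is None or (cand["end"] - cand["start"]) < (
                    parent["end"] - parent["start"]
                ):
                    parent = cand
        # 3. Attach as a child, or as a root when no parent encloses it.
        if parent is not None:
            parent["children"].append(span)
        else:
            roots.append(span)
        placed.append(span)
    return roots
```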

Recursive tag injection

_inject_recursive walks the nesting tree depth-first. At each node it emits:

  1. Any source text between the previous sibling's end and this span's start.
  2. The opening tag with any attributes, e.g. <persName ref="...">.
  3. Recursively, the content of all child spans.
  4. Any source text after the last child and before this span's end.
  5. The closing tag, e.g. </persName>.

Because the injector works from pre-computed offsets on the original source text, the text content is never altered; only tags are inserted. This preserves the exact wording, whitespace, and punctuation of the input.
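The depth-first emission can be sketched as follows (a simplified stand-in for _inject_recursive; XML attribute-value escaping is omitted):

```python
def inject(source: str, roots: list[dict]) -> str:
    """Emit source text with tags inserted at pre-computed offsets."""
    def attrs(span: dict) -> str:
        return "".join(f' {k}="{v}"' for k, v in span.get("attributes", {}).items())

    def emit(spans: list[dict], start: int, end: int) -> str:
        parts, cursor = [], start
        for span in sorted(spans, key=lambda s: s["start"]):
            parts.append(source[cursor:span["start"]])   # 1. text before this span
            parts.append(f'<{span["element"]}{attrs(span)}>')  # 2. opening tag
            # 3./4. child spans plus the text between and around them
            parts.append(emit(span.get("children", []), span["start"], span["end"]))
            parts.append(f'</{span["element"]}>')        # 5. closing tag
            cursor = span["end"]
        parts.append(source[cursor:end])                 # trailing text
        return "".join(parts)

    return emit(roots, 0, len(source))
```

Because every slice comes straight from `source`, the character content between tags is byte-for-byte identical to the input.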