# Chunking

Long texts exceed the context window (or the practical attention span) of most LLMs. The chunker splits the source text into overlapping windows so that each LLM call sees a manageable piece of text while entities that straddle window boundaries are never lost.

---
## API

```python
from tei_annotator.chunking.chunker import chunk_text, Chunk

chunks: list[Chunk] = chunk_text(text, chunk_size=1500, chunk_overlap=200)
```
Each `Chunk` is a dataclass:

```python
@dataclass
class Chunk:
    text: str          # content of this window
    start_offset: int  # position of this chunk's first character in the original text
```

`start_offset` is used by the resolver to convert chunk-local character positions back to positions in the full source text.
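The offset arithmetic can be sketched as follows. `to_global` and the sample text are illustrative only, not part of the library's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str          # content of this window
    start_offset: int  # position of this chunk's first character in the original text

def to_global(chunk: Chunk, local_pos: int) -> int:
    """Convert a chunk-local character position to a position in the full text."""
    return chunk.start_offset + local_pos

source = "Johann Wolfgang von Goethe was born in Frankfurt."
chunk = Chunk(text=source[20:], start_offset=20)
local = chunk.text.find("Frankfurt")
assert source[to_global(chunk, local):].startswith("Frankfurt")
```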
---
## Splitting algorithm

1. Walk the text left-to-right, accumulating characters.
2. When the accumulated length reaches `chunk_size`, scan backwards for the last whitespace character at or before that position and cut there. The boundary is always between words, never mid-token.
3. The **next chunk begins** `chunk_overlap` characters before the cut point, so consecutive chunks share a strip of text.
4. Repeat until all text is consumed.

The final chunk is whatever remains after the last cut, even if it is shorter than `chunk_size`.
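The steps above can be sketched as a short loop. This is a plain-text illustration under the stated rules, not the library's internal implementation, and it ignores the tag-aware behaviour covered in the next section:

```python
def split_text(text: str, chunk_size: int = 1500,
               chunk_overlap: int = 200) -> list[tuple[str, int]]:
    """Split text into overlapping windows; returns (chunk_text, start_offset) pairs."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):
            # final chunk: whatever remains, even if shorter than chunk_size
            chunks.append((text[start:], start))
            break
        # scan backwards for the last whitespace at or before the cut point
        cut = end
        while cut > start and not text[cut].isspace():
            cut -= 1
        if cut == start:      # no whitespace found: fall back to a hard cut
            cut = end
        chunks.append((text[start:cut], start))
        # next chunk begins chunk_overlap characters before the cut point;
        # max(...) guarantees forward progress even for extreme parameters
        start = max(cut - chunk_overlap, start + 1)
    return chunks
```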
---
## XML safety

If the source text contains existing XML markup, the splitter is tag-aware: it never places a cut inside an XML tag. Tags are treated as **zero-width** for length accounting, so only visible text characters count toward `chunk_size`. This ensures that pre-existing markup is always preserved intact within whichever chunk it falls in.
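Zero-width accounting can be illustrated with a regex that strips tags before measuring. This is an assumed approach for illustration, not the library's actual code:

```python
import re

TAG_RE = re.compile(r"<[^>]*>")

def visible_length(text: str) -> int:
    """Count only visible characters, treating XML tags as zero-width."""
    return len(TAG_RE.sub("", text))
```

So `<persName>Goethe</persName>` contributes 6 characters toward `chunk_size`, not 31.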
---
## Why overlap matters

Without overlap, an entity that straddles the boundary between two chunks could be missed by both LLM calls: the first call sees the entity's opening but not its close, and vice versa. With `chunk_overlap` characters of shared context, the entity appears complete in at least one chunk with enough surrounding text for the model to recognise and annotate it.

Because the overlap causes the same entity to appear in two consecutive chunks, the pipeline deduplicates resolved spans after collecting results from all chunks: identical `(element, start, end)` triples are merged and only one instance is kept.
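The deduplication step amounts to order-preserving set membership over the resolved triples. A minimal sketch with illustrative names:

```python
def deduplicate(spans: list[tuple[str, int, int]]) -> list[tuple[str, int, int]]:
    """Keep only the first occurrence of each (element, start, end) triple."""
    seen: set[tuple[str, int, int]] = set()
    unique = []
    for span in spans:
        if span not in seen:
            seen.add(span)
            unique.append(span)
    return unique
```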
---
## Tuning

| Parameter | Default | Effect |
|-----------|---------|--------|
| `chunk_size` | `1500` | Larger values send more text per LLM call (cheaper) but risk exceeding context limits. Reduce for models with small windows. |
| `chunk_overlap` | `200` | Larger values make boundary entities safer but increase the number of duplicate detections that must be deduplicated. |

A `chunk_size` of 1500 characters corresponds to roughly 375–500 tokens for Latin-script text, leaving ample room for the prompt preamble and JSON response within a 4 k-token limit.
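The token estimate comes from the common rule of thumb of 3–4 characters per token for Latin-script text (a heuristic, not an exact tokenizer count):

```python
def estimate_tokens(chunk_size: int) -> tuple[int, int]:
    """Rough (low, high) token estimate assuming 3-4 chars per token."""
    return chunk_size // 4, chunk_size // 3
```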