Spaces:

cmboulanger
/

tei-annotator

Sleeping

App Files Files Community

tei-annotator / docs /batch-annotation-experiment.md

cmboulanger

Add report for whole-dataset performance experiment

dd4de5c 13 days ago

preview code

raw

history blame contribute delete

8.69 kB

Batch Annotation Experiment

Background

The evaluation script (scripts/evaluate_llm.py) originally called the LLM once per <bibl> record — one round-trip per citation. For a gold standard of 162 records this means 162 sequential API calls, creating considerable latency.

The question: does sending multiple records in a single LLM call degrade annotation quality, and does the answer differ by model?

Research (BatchPrompt, ICLR 2024) suggests batch prompting can maintain competitive accuracy while reducing token overhead by 18–30 %. The primary known risk is the lost in the middle effect: models tend to perform less reliably on items that appear in the middle of a long context window.

Implementation

A --batch-size N flag was added to evaluate_llm.py (default 1, fully backward-compatible). A --timeout SECONDS flag controls the HTTP read timeout per API call (default 120).

When batch_size > 1 the evaluator:

Collects the plain text of N consecutive <bibl> elements.
Joins them with a distinctive ASCII sentinel:
```
\n---RECORD|||SEP|||BOUNDARY---\n
```
Passes the combined text to the existing annotate() pipeline as a single call.
Splits the annotated XML output on the same sentinel to recover N per-record fragments.
Evaluates each fragment against its gold standard independently.
Falls back to empty predictions for the entire batch if the separator count mismatches.

The separator is guaranteed to survive the annotation pass unchanged because inject_xml() never modifies text characters — it only inserts XML tags. However, very complex prompts can cause a model to omit or duplicate the separator, triggering the fallback.

Experiment Setup

Gold standard: tests/fixtures/blbl-examples.tei.xml (162 <bibl> records)
Match mode: text (normalised whitespace)
GLiNER: disabled

Two models were tested:

Model	Provider	Inference speed
Gemini 2.0 Flash	Google	Fast (~3–6 s/record)
llama-3.3-70b-instruct	KISSKI / AcademicCloud	Slow (~14–20 s/record)

Note on concurrency: Runs against the same KISSKI API key must be sequential — concurrent jobs cause GPU resource contention, inflating response times and triggering timeouts.

Results summary

Model	Batch size	Records	Timeout	Micro F1	Macro F1	API calls
Gemini 2.0 Flash	1	30	120 s	0.825	0.754	30
Gemini 2.0 Flash	10	30	120 s	0.847	0.779	3
Gemini 2.0 Flash	1	162	120 s	—	—	—
Gemini 2.0 Flash	10	162	120 s	0.744	0.627	17
KISSKI Llama 3.3 70b	1	162	120 s	0.867	0.636	162
KISSKI Llama 3.3 70b	10	162	120 s	timed out	—	—
KISSKI Llama 3.3 70b	10	162	600 s	0.854	0.619	17

A full Gemini batch-1 run on 162 records has not been executed.

Detailed results: Gemini 2.0 Flash

30-record pilot (batch sizes 1 and 10)

Metric	Batch-1	Batch-10	Δ
Micro F1	0.825	0.847	+0.022
Micro Precision	0.807	0.825	+0.018
Micro Recall	0.843	0.870	+0.027
Macro F1	0.754	0.779	+0.025
API calls	30	3	−90 %
Wall time	~2:14	~2:21	+7 s

On the 30-record pilot, batch-10 was slightly better across every metric.

Full 162-record run (batch-10)

Metric	30-rec pilot (batch-10)	162-rec full (batch-10)	Δ
Micro F1	0.847	0.744	−0.103
Micro Precision	0.825	0.799	−0.026
Micro Recall	0.870	0.696	−0.174
Macro F1	0.779	0.627	−0.152

Quality degraded significantly on the full dataset. Inspection of the worst records reveals that five consecutive entries (#141–145, all within the same batch) scored F1=0.000 with every gold span missed and nothing predicted. This is the batch split failure pattern: the model omitted the separator in that prompt, the mismatch fallback triggered, and all 10 records in the batch received empty predictions.

Per-element breakdown (Gemini, 162 records, batch-10)

Element	Batch-10 F1	TP	FP	FN
`surname`	0.878	258	6	66
`label`	0.842	8	0	3
`date`	0.832	131	16	37
`forename`	0.820	242	27	79
`pubPlace`	0.794	54	15	13
`ptr`	0.750	3	0	2
`biblScope`	0.758	111	12	59
`publisher`	0.750	45	10	20
`title`	0.634	149	55	117
`author`	0.568	96	88	58
`idno`	0.667	1	0	1
`note`	0.359	7	10	15
`orgName`	0.300	6	23	5
`editor`	0.448	15	21	16

Detailed results: KISSKI llama-3.3-70b-instruct

Full 162-record run, all batch sizes

Batch size	Timeout	Records completed	Micro F1	Macro F1	Notes
1	120 s	162 / 162	0.867	0.636	Full, reliable
10	120 s	22 / 162	0.843*	0.672*	14/16 batches timed out
50	120 s	0 / 162	—	—	All timed out
10	600 s	162 / 162	0.854	0.619	Full run with raised timeout

* Biased sample — only 3 of 16 batches completed; not representative.

Batch-1 vs batch-10 (600 s) — full 162 records

Metric	Batch-1	Batch-10 (600 s)	Δ
Micro F1	0.867	0.854	−0.013
Micro Precision	0.876	0.869	−0.007
Micro Recall	0.858	0.839	−0.019
Macro F1	0.636	0.619	−0.017
API calls	162	17	−90 %
Wall time	~43 min	~57 min	+14 min

A small but consistent quality drop across the board. No complete batch failures (no batch scored all-zero), confirming Llama respects the separator more reliably than Gemini on this dataset.

Per-element comparison (KISSKI, 162 records)

Element	Batch-1 F1	Batch-10 F1	Δ
`forename`	0.954	0.945	−0.009
`surname`	0.983	0.972	−0.011
`date`	0.949	0.913	−0.036
`editor`	0.793	0.828	+0.035
`biblScope`	0.847	0.804	−0.043
`pubPlace`	0.853	0.855	+0.002
`publisher`	0.825	0.794	−0.031
`author`	0.778	0.778	0
`title`	0.743	0.733	−0.010
`note`	0.485	0.485	0
`orgName`	0.421	0.457	+0.036
`label`	0.235	0.222	−0.013

Most regressions are small (< 5 pp). editor and orgName actually improved slightly in batch mode.

Conclusions

Batch size is viable but model-dependent

Model	Batch-10 outcome	Recommendation
Gemini 2.0 Flash	−10 pp F1 at scale; risk of complete batch failure	Use batch-1 for full runs; batch-10 acceptable for quick pilots
KISSKI Llama 3.3 70b	−1.3 pp F1; no batch failures; requires 600 s timeout	Use `--batch-size 10 --timeout 600`

Key findings

Small batches on small datasets look deceptively good. Gemini batch-10 on 30 records was better than batch-1 (+2 pp). On 162 records it was significantly worse (−10 pp). The 30-record pilot was not representative because it happened to avoid the complex, multi-author batches near the end of the corpus.
Batch failure is catastrophic, not graceful. When Gemini loses the separator in a complex batch, all 10 records score 0.000 — a hard floor that drags down corpus-level metrics far more than occasional per-record errors.
Llama respects the separator more reliably. With adequate timeout (600 s), all 162 batches completed and there were no separator-loss events. The quality drop is small (−1.3 pp) and the latency gain is real (162 → 17 API calls).
Wall time does not improve with batching for slow models. Llama batch-10 took longer than batch-1 (~57 min vs ~43 min) because each batch generates ~10× the tokens in a single call. The benefit is fewer API calls, which matters for rate-limit management, not raw speed.
The --timeout flag is essential for slow models. Without --timeout 600, Llama batch-10 timed out on 87 % of batches.

Recommendations

Scenario	Command
Gemini, quality-critical full run	`--provider gemini --batch-size 1`
Gemini, quick pilot (≤ 30 records)	`--provider gemini --batch-size 10`
KISSKI Llama, full run (balanced)	`--provider kisski --batch-size 10 --timeout 600`
KISSKI Llama, full run (highest quality)	`--provider kisski --batch-size 1`