Spaces:
Sleeping
Sleeping
| # Batch Annotation Experiment | |
| ## Background | |
| The evaluation script (`scripts/evaluate_llm.py`) originally called the LLM once per `<bibl>` record β one round-trip per citation. For a gold standard of 162 records this means 162 sequential API calls, creating considerable latency. | |
| The question: **does sending multiple records in a single LLM call degrade annotation quality, and does the answer differ by model?** | |
| Research (BatchPrompt, ICLR 2024) suggests batch prompting can maintain competitive accuracy while reducing token overhead by 18β30 %. The primary known risk is the *lost in the middle* effect: models tend to perform less reliably on items that appear in the middle of a long context window. | |
| ## Implementation | |
| A `--batch-size N` flag was added to `evaluate_llm.py` (default `1`, fully backward-compatible). | |
| A `--timeout SECONDS` flag controls the HTTP read timeout per API call (default `120`). | |
| When `batch_size > 1` the evaluator: | |
| 1. Collects the plain text of N consecutive `<bibl>` elements. | |
| 2. Joins them with a distinctive ASCII sentinel: | |
| ```text | |
| \n---RECORD|||SEP|||BOUNDARY---\n | |
| ``` | |
| 3. Passes the combined text to the existing `annotate()` pipeline as a single call. | |
| 4. Splits the annotated XML output on the same sentinel to recover N per-record fragments. | |
| 5. Evaluates each fragment against its gold standard independently. | |
| 6. Falls back to empty predictions for the entire batch if the separator count mismatches. | |
| The separator is guaranteed to survive the annotation pass unchanged because `inject_xml()` never modifies text characters β it only inserts XML tags. However, very complex prompts can cause a model to omit or duplicate the separator, triggering the fallback. | |
| ## Experiment Setup | |
| - **Gold standard:** `tests/fixtures/blbl-examples.tei.xml` (162 `<bibl>` records) | |
| - **Match mode:** `text` (normalised whitespace) | |
| - **GLiNER:** disabled | |
| Two models were tested: | |
| | Model | Provider | Inference speed | | |
| | --- | --- | --- | | |
| | Gemini 2.0 Flash | Google | Fast (~3β6 s/record) | | |
| | llama-3.3-70b-instruct | KISSKI / AcademicCloud | Slow (~14β20 s/record) | | |
| > **Note on concurrency:** Runs against the same KISSKI API key must be **sequential** β concurrent jobs cause GPU resource contention, inflating response times and triggering timeouts. | |
| --- | |
| ## Results summary | |
| | Model | Batch size | Records | Timeout | Micro F1 | Macro F1 | API calls | | |
| | --- | --- | --- | --- | --- | --- | --- | | |
| | Gemini 2.0 Flash | 1 | 30 | 120 s | 0.825 | 0.754 | 30 | | |
| | Gemini 2.0 Flash | 10 | 30 | 120 s | 0.847 | 0.779 | 3 | | |
| | Gemini 2.0 Flash | 1 | 162 | 120 s | β | β | β | | |
| | Gemini 2.0 Flash | 10 | 162 | 120 s | **0.744** | 0.627 | 17 | | |
| | KISSKI Llama 3.3 70b | 1 | 162 | 120 s | **0.867** | 0.636 | 162 | | |
| | KISSKI Llama 3.3 70b | 10 | 162 | 120 s | timed out | β | β | | |
| | KISSKI Llama 3.3 70b | 10 | 162 | **600 s** | **0.854** | 0.619 | 17 | | |
| > A full Gemini batch-1 run on 162 records has not been executed. | |
| --- | |
| ## Detailed results: Gemini 2.0 Flash | |
| ### 30-record pilot (batch sizes 1 and 10) | |
| | Metric | Batch-1 | Batch-10 | Ξ | | |
| | --- | --- | --- | --- | | |
| | Micro F1 | 0.825 | **0.847** | +0.022 | | |
| | Micro Precision | 0.807 | **0.825** | +0.018 | | |
| | Micro Recall | 0.843 | **0.870** | +0.027 | | |
| | Macro F1 | 0.754 | **0.779** | +0.025 | | |
| | API calls | 30 | 3 | β90 % | | |
| | Wall time | ~2:14 | ~2:21 | +7 s | | |
| On the 30-record pilot, batch-10 was slightly *better* across every metric. | |
| ### Full 162-record run (batch-10) | |
| | Metric | 30-rec pilot (batch-10) | 162-rec full (batch-10) | Ξ | | |
| | --- | --- | --- | --- | | |
| | Micro F1 | 0.847 | **0.744** | β0.103 | | |
| | Micro Precision | 0.825 | 0.799 | β0.026 | | |
| | Micro Recall | 0.870 | 0.696 | **β0.174** | | |
| | Macro F1 | 0.779 | 0.627 | β0.152 | | |
| Quality degraded significantly on the full dataset. Inspection of the worst records reveals that five consecutive entries (#141β145, all within the same batch) scored F1=0.000 with every gold span missed and nothing predicted. This is the **batch split failure** pattern: the model omitted the separator in that prompt, the mismatch fallback triggered, and all 10 records in the batch received empty predictions. | |
| ### Per-element breakdown (Gemini, 162 records, batch-10) | |
| | Element | Batch-10 F1 | TP | FP | FN | | |
| | --- | --- | --- | --- | --- | | |
| | `surname` | **0.878** | 258 | 6 | 66 | | |
| | `label` | 0.842 | 8 | 0 | 3 | | |
| | `date` | 0.832 | 131 | 16 | 37 | | |
| | `forename` | 0.820 | 242 | 27 | 79 | | |
| | `pubPlace` | 0.794 | 54 | 15 | 13 | | |
| | `ptr` | 0.750 | 3 | 0 | 2 | | |
| | `biblScope` | 0.758 | 111 | 12 | 59 | | |
| | `publisher` | 0.750 | 45 | 10 | 20 | | |
| | `title` | 0.634 | 149 | 55 | 117 | | |
| | `author` | 0.568 | 96 | 88 | 58 | | |
| | `idno` | 0.667 | 1 | 0 | 1 | | |
| | `note` | 0.359 | 7 | 10 | 15 | | |
| | `orgName` | 0.300 | 6 | 23 | 5 | | |
| | `editor` | 0.448 | 15 | 21 | 16 | | |
| --- | |
| ## Detailed results: KISSKI llama-3.3-70b-instruct | |
| ### Full 162-record run, all batch sizes | |
| | Batch size | Timeout | Records completed | Micro F1 | Macro F1 | Notes | | |
| | --- | --- | --- | --- | --- | --- | | |
| | 1 | 120 s | **162 / 162** | **0.867** | **0.636** | Full, reliable | | |
| | 10 | 120 s | 22 / 162 | 0.843\* | 0.672\* | 14/16 batches timed out | | |
| | 50 | 120 s | 0 / 162 | β | β | All timed out | | |
| | 10 | **600 s** | **162 / 162** | 0.854 | 0.619 | Full run with raised timeout | | |
| \* Biased sample β only 3 of 16 batches completed; not representative. | |
| ### Batch-1 vs batch-10 (600 s) β full 162 records | |
| | Metric | Batch-1 | Batch-10 (600 s) | Ξ | | |
| | --- | --- | --- | --- | | |
| | Micro F1 | **0.867** | 0.854 | β0.013 | | |
| | Micro Precision | 0.876 | 0.869 | β0.007 | | |
| | Micro Recall | **0.858** | 0.839 | β0.019 | | |
| | Macro F1 | **0.636** | 0.619 | β0.017 | | |
| | API calls | 162 | 17 | β90 % | | |
| | Wall time | ~43 min | ~57 min | +14 min | | |
| A small but consistent quality drop across the board. No complete batch failures (no batch scored all-zero), confirming Llama respects the separator more reliably than Gemini on this dataset. | |
| ### Per-element comparison (KISSKI, 162 records) | |
| | Element | Batch-1 F1 | Batch-10 F1 | Ξ | | |
| | --- | --- | --- | --- | | |
| | `forename` | **0.954** | 0.945 | β0.009 | | |
| | `surname` | **0.983** | 0.972 | β0.011 | | |
| | `date` | **0.949** | 0.913 | β0.036 | | |
| | `editor` | 0.793 | **0.828** | +0.035 | | |
| | `biblScope` | 0.847 | **0.804** | β0.043 | | |
| | `pubPlace` | 0.853 | **0.855** | +0.002 | | |
| | `publisher` | **0.825** | 0.794 | β0.031 | | |
| | `author` | **0.778** | 0.778 | 0 | | |
| | `title` | **0.743** | 0.733 | β0.010 | | |
| | `note` | 0.485 | 0.485 | 0 | | |
| | `orgName` | 0.421 | **0.457** | +0.036 | | |
| | `label` | **0.235** | 0.222 | β0.013 | | |
| Most regressions are small (< 5 pp). `editor` and `orgName` actually improved slightly in batch mode. | |
| --- | |
| ## Conclusions | |
| ### Batch size is viable but model-dependent | |
| | Model | Batch-10 outcome | Recommendation | | |
| | --- | --- | --- | | |
| | Gemini 2.0 Flash | β10 pp F1 at scale; risk of complete batch failure | Use batch-1 for full runs; batch-10 acceptable for quick pilots | | |
| | KISSKI Llama 3.3 70b | β1.3 pp F1; no batch failures; requires 600 s timeout | Use `--batch-size 10 --timeout 600` | | |
| ### Key findings | |
| 1. **Small batches on small datasets look deceptively good.** Gemini batch-10 on 30 records was *better* than batch-1 (+2 pp). On 162 records it was significantly worse (β10 pp). The 30-record pilot was not representative because it happened to avoid the complex, multi-author batches near the end of the corpus. | |
| 2. **Batch failure is catastrophic, not graceful.** When Gemini loses the separator in a complex batch, all 10 records score 0.000 β a hard floor that drags down corpus-level metrics far more than occasional per-record errors. | |
| 3. **Llama respects the separator more reliably.** With adequate timeout (600 s), all 162 batches completed and there were no separator-loss events. The quality drop is small (β1.3 pp) and the latency gain is real (162 β 17 API calls). | |
| 4. **Wall time does not improve with batching for slow models.** Llama batch-10 took *longer* than batch-1 (~57 min vs ~43 min) because each batch generates ~10Γ the tokens in a single call. The benefit is fewer API calls, which matters for rate-limit management, not raw speed. | |
| 5. **The `--timeout` flag is essential for slow models.** Without `--timeout 600`, Llama batch-10 timed out on 87 % of batches. | |
| ## Recommendations | |
| | Scenario | Command | | |
| | --- | --- | | |
| | Gemini, quality-critical full run | `--provider gemini --batch-size 1` | | |
| | Gemini, quick pilot (β€ 30 records) | `--provider gemini --batch-size 10` | | |
| | KISSKI Llama, full run (balanced) | `--provider kisski --batch-size 10 --timeout 600` | | |
| | KISSKI Llama, full run (highest quality) | `--provider kisski --batch-size 1` | | |