	# NER Benchmark: PharmaDetect on Pill Packaging Text

	Date: 2026-04-14
	Model: `OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M` (108M params)
	Dataset: 11,796 synthesized pack-label texts from HuggingFace `MattBastar/Medicine_Details`
	Pipeline tested: OCR cleaner → NER entity extraction (no RxNorm validation)

	## Dataset

	### Source

	[MattBastar/Medicine_Details](https://huggingface.co/datasets/MattBastar/Medicine_Details)
	— 11,825 Indian brand medicines with structured Composition, Manufacturer,
	Uses, and Image URL fields. We use the Medicine Name + Composition +
	Manufacturer columns to synthesize realistic pack-label OCR text.

	### How It Works

	`prepare_hf_dataset.py` takes each row and:

	1. Synthesizes a pack label from a random template (blister pack,
	box label, prescription-style, syrup label, etc.):

	```
	Augmentin 625 Duo Tablet
	Each tablet contains:
	Amoxycillin 500mg
	Clavulanic Acid 125mg
	Glaxo SmithKline Pharmaceuticals Ltd
	```

	2. Parses ground-truth labels from the Composition field:
	`"Amoxycillin (500mg) + Clavulanic Acid (125mg)"` → `["Amoxycillin", "Clavulanic Acid"]`

	3. Optionally injects OCR noise (`--noise light|heavy`) using
	pharma-specific distortion patterns drawn from real OCR failures:

	| Pattern | Example | Source |
	|---------|---------|--------|
	| m→rn | Metformin → Metforrnin | Glyph confusion in serif fonts |
	| I→l (word start) | Ibuprofen → lbuprofen | Uppercase I vs lowercase L |
	| l→1 (interior) | Alprazolam → A1prazolam | l/1 confusion |
	| o→0 / O→0 | Omeprazole → 0mepraz0le | Letter/digit confusion |
	| cl→d | Clavulanic → Davulanic | Ligature misread |
	| mg→rng | 500mg → 500rng | m→rn in dosage suffix |
	| Mid-word splits | Bevacizumab → Bevacizu mab | Line-wrap OCR artifact |
	| All-caps | ATORVASTATIN 40MG | Uppercase printed labels |
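Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the code in `prepare_hf_dataset.py`: the function names, the rule subset, and the meaning of `rate` (a per-match probability) are assumptions made for the example.

```python
import random
import re


def parse_composition(composition: str) -> list[str]:
    """Split 'A (500mg) + B (125mg)' into ingredient names ['A', 'B']."""
    names = []
    for part in composition.split("+"):
        # Drop the parenthesized strength, keep the ingredient name.
        name = re.sub(r"\([^)]*\)", "", part).strip()
        if name:
            names.append(name)
    return names


# A subset of the distortion patterns from the table above.
# Order matters: the dosage-suffix rule runs before the generic m→rn rule.
NOISE_RULES = [
    (re.compile(r"(?<=\d)mg\b"), "rng"),   # mg→rng in dosage suffix
    (re.compile(r"\bI(?=[a-z])"), "l"),    # I→l at word start
    (re.compile(r"m"), "rn"),              # m→rn
    (re.compile(r"[oO]"), "0"),            # o→0 / O→0
]


def inject_noise(text: str, rate: float, rng: random.Random) -> str:
    """Apply each rule to each match independently with probability `rate`."""
    for pattern, replacement in NOISE_RULES:
        text = pattern.sub(
            lambda m, r=replacement: r if rng.random() < rate else m.group(0),
            text,
        )
    return text
```

With `rate=1.0` every rule fires on every match, so `inject_noise("Metformin 500mg", 1.0, random.Random(0))` yields `Metf0rrnin 500rng`; a seeded `random.Random` keeps generated datasets reproducible.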

	### Category Breakdown

	| Category | N | Description |
	|----------|---|-------------|
	| single_ingredient | 7,081 | One active ingredient (e.g., Azithromycin) |
	| dual_ingredient | 3,591 | Two active ingredients (e.g., Amoxycillin + Clavulanic Acid) |
	| multi_ingredient | 1,124 | Three or more active ingredients |
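The category rule in the table reduces to a count of parsed ingredients. A small sketch, where `ingredient_category` is a hypothetical helper name rather than one taken from the repository:

```python
def ingredient_category(ingredients: list[str]) -> str:
    """Map a parsed ingredient list to the category names used in the table."""
    count = len(ingredients)
    if count == 0:
        raise ValueError("a case needs at least one ground-truth ingredient")
    if count == 1:
        return "single_ingredient"
    if count == 2:
        return "dual_ingredient"
    return "multi_ingredient"


# The three category sizes add up to the 11,796 synthesized cases.
assert 7_081 + 3_591 + 1_124 == 11_796
```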

	### Reproducing

	```bash
	uv run python eval/prepare_hf_dataset.py # clean text (default)
	uv run python eval/prepare_hf_dataset.py --noise light # light OCR artifacts
	uv run python eval/prepare_hf_dataset.py --noise heavy # heavy OCR distortion
	uv run python eval/prepare_hf_dataset.py --limit 500 # smaller sample
	```

	## About the Model

	### Architecture

	PharmaDetect-BioPatient-108M is a token-classification (NER) model from the
	[OpenMed NER](https://arxiv.org/abs/2508.01630) suite (Panahi, 2025). It
	detects chemical entities using BIO tagging (`B-CHEM`, `I-CHEM`).

	| Property | Value |
	|----------|-------|
	| Base model | [`Bio_Discharge_Summary_BERT`](https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT) (BioBERT v1.0 → MIMIC-III discharge summaries, ~880M words) |
	| DAPT corpus | 350k passages (90M tokens): 100k PubMed, 100k arXiv, 100k MIMIC-III, 50k ClinicalTrials.gov |
	| DAPT method | LoRA (rank=16, α=32, dropout=0.05) on query/value matrices, 3 epochs, single A100 ~4h |
	| Fine-tuning dataset | BC5CDR-CHEM (BioCreative V Chemical-Disease Relation) |
	| Parameters | 108M (1.4% trainable via LoRA during DAPT) |
	| Entity types | Chemical entities only (`B-CHEM` / `I-CHEM`) |
	| Published F1 | 95.83% on BC5CDR-CHEM test set |
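To make the BIO scheme concrete, the sketch below groups `B-CHEM` / `I-CHEM` tagged WordPiece tokens into entity strings. It is an illustration only: the token splits in the usage note are invented for the example and are not real tokenizer output, and `decode_bio` is a hypothetical name, not the merging code in `ner_model.py`.

```python
def join_wordpieces(pieces: list[str]) -> str:
    """Join WordPiece tokens; a '##' prefix marks a continuation of the previous piece."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        elif text:
            text += " " + piece
        else:
            text = piece
    return text


def decode_bio(tokens: list[str], tags: list[str]) -> list[str]:
    """Group B-CHEM / I-CHEM tagged tokens into entity strings."""
    entities: list[str] = []
    current: list[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-CHEM":
            # A new entity starts; flush the one in progress.
            if current:
                entities.append(join_wordpieces(current))
            current = [token]
        elif tag == "I-CHEM" and current:
            current.append(token)
        else:
            # 'O', or an I-CHEM with no preceding B-CHEM (treated as outside).
            if current:
                entities.append(join_wordpieces(current))
            current = []
    if current:
        entities.append(join_wordpieces(current))
    return entities
```

For example, tokens `["am", "##oxy", "##cillin", "500", "cl", "##avulanic", "acid"]` with tags `["B-CHEM", "I-CHEM", "I-CHEM", "O", "B-CHEM", "I-CHEM", "I-CHEM"]` decode to `["amoxycillin", "clavulanic acid"]`.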

	### Domain Gap: Literature vs. Packaging

	| Aspect | BC5CDR (training) | Pill packaging (our use) |
	|--------|-------------------|--------------------------|
	| Text style | Scientific prose | Formulaic labels |
	| Length | Full abstracts (~200 words) | Short labels (~5-20 words) |
	| Chemical mentions | In sentence context | Standalone, prominent |
	| Brand names | Rare | Very common |
	| Salt forms | Part of scientific name | Separated on packaging |
	| OCR artifacts | None | Common (m→rn, o→0, etc.) |

	## Our Pipeline

	```
	Raw OCR text
	→ ocr_cleaner.py (fix rn→m, 0→o, ligatures, whitespace)
	→ ner_model.py (PharmaDetect with manual sub-word merging)
	→ drug_analyzer.py (filter CHEM labels, dedupe, skip single-char/punctuation)
	→ rxnorm_client.py (validate against RxNorm, get rxcui)
	→ [if NER finds nothing] rxnorm fallback (approximate term search)
	```

	The bare-NER results test the first three steps (OCR cleaner → NER → basic filtering)
	without RxNorm validation, to isolate the NER model's behavior on packaging text.
	Rows labeled "Full Pipeline" add the RxNorm steps.
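The cleaning and filtering steps could look roughly like the sketch below. The regexes and the entity shape (dicts with `text` and `label` keys) are assumptions for illustration; the actual rules in `ocr_cleaner.py` and `drug_analyzer.py` are not shown in this document.

```python
import re

# Hypothetical cleaner rules covering a few of the artifacts named above.
CLEAN_RULES = [
    (re.compile(r"(?<=[A-Za-z])0(?=[A-Za-z])"), "o"),  # 0→o between letters
    (re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])"), "l"),  # 1→l between letters
    (re.compile(r"(?<=\d)rng\b"), "mg"),               # rng→mg after a number
    (re.compile(r"[ \t]+"), " "),                      # collapse runs of spaces/tabs
]


def clean_ocr(text: str) -> str:
    """Undo a few common OCR substitutions and normalize whitespace."""
    for pattern, replacement in CLEAN_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def filter_entities(entities: list[dict]) -> list[str]:
    """Keep CHEM entities, skip single-char/punctuation spans, dedupe case-insensitively."""
    seen: set[str] = set()
    kept: list[str] = []
    for entity in entities:
        text = entity["text"].strip()
        if entity["label"] != "CHEM":
            continue
        if len(text) < 2 or not any(ch.isalpha() for ch in text):
            continue
        key = text.lower()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```

Restricting the digit fixes to positions between letters leaves strengths such as `500` untouched, at the cost of missing a leading `0` as in `0meprazole`.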

	## Results

	### By Noise Level (Bare NER, 500-case samples)

	| Noise | Precision | Recall | F1 | Detection |
	|-------|-----------|--------|------|-----------|
	| none (clean) | 46.9% | 84.4% | 60.3% | 99.6% |
	| light (5-15% char errors) | 44.9% | 79.8% | 57.5% | 99.8% |
	| heavy (40% errors + splits) | 26.2% | 53.5% | 35.2% | 99.8% |

	Detection rate = percentage of cases where NER found at least one entity.
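For reference, the four metrics can be computed from per-case predicted and gold ingredient sets as sketched below. Micro-averaging (summing TP/FP/FN over all cases before dividing) and exact matching on normalized names are assumptions here, not details taken from `eval/benchmark.py`.

```python
def score(cases: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    """Micro-averaged precision/recall/F1 plus detection rate.

    Each case is a (predicted, gold) pair of normalized ingredient-name sets.
    """
    tp = fp = fn = detected = 0
    for predicted, gold in cases:
        tp += len(predicted & gold)   # correctly extracted ingredients
        fp += len(predicted - gold)   # extracted but not in ground truth
        fn += len(gold - predicted)   # ground-truth ingredients missed
        detected += bool(predicted)   # case counts if NER found anything
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    detection = detected / len(cases) if cases else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "detection": detection}
```

Note that a case counts as detected even when every extracted entity is wrong, which is why the detection rate stays near 100% while precision is low.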

	### Full Pipeline vs Bare NER (50-case samples)

	| Pipeline Step | Noise | Precision | Recall | F1 | Latency / Case |
	|---------------|-------|-----------|--------|----|----------------|
	| Bare NER | none | 48.6% | 81.0% | 60.7% | 64ms |
	| Full Pipeline | none | 71.6% | 81.0% | 76.0% | 961ms |
	| Bare NER | light | 47.5% | 79.8% | 59.6% | 69ms |
	| Full Pipeline | light | 74.4% | 79.8% | 77.0% | 1089ms |
	| Bare NER | heavy | 24.7% | 47.6% | 32.5% | 78ms |
	| Full Pipeline | heavy | 65.6% | 47.6% | 55.2% | 2597ms |

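	The RxNorm validation step rejected 37 false positives, raising clean-text precision by 23 points (48.6% → 71.6%), and left recall unchanged at every noise level. However, relying on an external HTTP API adds about 900 ms of latency per case on clean text (64 ms → 961 ms) and about 2.5 s under heavy noise (78 ms → 2597 ms).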
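A validation step of this kind can be sketched as below. This is not `rxnorm_client.py`: the choice of RxNav's `rxcui.json` exact-name lookup, the function names, and the in-memory cache are assumptions for illustration. The lookup is injectable so the logic can be exercised offline.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

RXNAV_RXCUI_URL = "https://rxnav.nlm.nih.gov/REST/rxcui.json"


def fetch_rxcui(name: str) -> Optional[str]:
    """Exact-name lookup against RxNav; returns the first rxcui or None."""
    url = f"{RXNAV_RXCUI_URL}?{urllib.parse.urlencode({'name': name})}"
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    ids = payload.get("idGroup", {}).get("rxnormId") or []
    return ids[0] if ids else None


def validate_entities(
    entities: list[str],
    lookup: Callable[[str], Optional[str]] = fetch_rxcui,
) -> dict[str, str]:
    """Keep only entities that resolve to an RxNorm ID; returns {entity: rxcui}."""
    cache: dict[str, Optional[str]] = {}
    validated: dict[str, str] = {}
    for name in entities:
        key = name.lower()
        if key not in cache:          # one HTTP round trip per distinct name
            cache[key] = lookup(key)
        rxcui = cache[key]
        if rxcui is not None:
            validated[name] = rxcui
    return validated
```

Since latency is dominated by HTTP round trips, caching repeated names (or keeping the cache across cases) is one plausible way to reduce the per-case cost; whether the benchmarked pipeline does this is not stated.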

	### By Category (clean text, 500 cases)

	| Category | N | Precision | Recall | F1 | TP | FP | FN |
	|----------|---|-----------|--------|-----|-----|-----|-----|
	| single_ingredient | 279 | 41% | 86% | 55% | 239 | 347 | 40 |
	| dual_ingredient | 155 | 49% | 85% | 62% | 261 | 270 | 47 |
	| multi_ingredient | 66 | 54% | 82% | 65% | 170 | 143 | 37 |

	### GLiNER Pipeline Augmentation (500-case samples, clean text)

	To improve precision (rejecting brands, salts, and manufacturers) and recall (catching edge cases), we evaluated `urchade/gliner_medium-v2.1` as a secondary observer and adjudicator alongside the baseline PharmaDetect + RxNorm pipeline.

	#### Evaluated Architectures

	1. `baseline` (Current): PharmaDetect NER → RxNorm Validation. If no entities are found, falls back to raw text blocks via RxNorm's `approximateTerm`.
	2. `gliner_filter` (Precision): GLiNER runs alongside PharmaDetect. If GLiNER explicitly tags a PharmaDetect entity as a negative label (e.g., `brand or trade name`, `manufacturer`, `salt or counter-ion`), the entity is rejected before hitting RxNorm.
	3. `gliner_sequential` (Speed & Precision): PharmaDetect runs first. Only the short entity spans extracted by PharmaDetect are passed to GLiNER for classification. If GLiNER tags the snippet as an `active pharmaceutical ingredient`, it proceeds to RxNorm. This avoids the CPU cost of running GLiNER on the whole document.
	4. `gliner_fallback` (Recall): If PharmaDetect returns zero entities, the pipeline queries GLiNER for `active pharmaceutical ingredient` spans instead of running the raw text `approximateTerm` fallback.
	5. `gliner_union` (Recall & Precision): Both PharmaDetect and GLiNER run. All GLiNER `active pharmaceutical ingredient` spans are merged with the PharmaDetect results, deduplicated by their resolved RxNorm IDs.
	6. `gliner_adjudicated` (Complex Logic): An advanced version of the filter. It rejects negative GLiNER labels, but is "salt-aware"—it won't reject a standalone salt form if GLiNER detects an active ingredient immediately adjacent to it in the text.
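Two of these modes are sketched below to make the control flow concrete. The span shape (dicts with `text` and `label` keys) and the `resolve` callback that maps a name to an RxNorm ID are assumptions for illustration; the function bodies are not the repository's implementation.

```python
from typing import Callable, Optional

NEGATIVE_LABELS = {"brand or trade name", "manufacturer", "salt or counter-ion"}
API_LABEL = "active pharmaceutical ingredient"


def gliner_filter(pharma_entities: list[str], gliner_spans: list[dict]) -> list[str]:
    """Drop PharmaDetect entities that GLiNER tagged with a negative label."""
    rejected = {
        span["text"].lower()
        for span in gliner_spans
        if span["label"] in NEGATIVE_LABELS
    }
    return [entity for entity in pharma_entities if entity.lower() not in rejected]


def gliner_union(
    pharma_entities: list[str],
    gliner_spans: list[dict],
    resolve: Callable[[str], Optional[str]],
) -> dict[str, str]:
    """Merge both models' candidates, deduplicated by resolved RxNorm ID."""
    candidates = list(pharma_entities) + [
        span["text"] for span in gliner_spans if span["label"] == API_LABEL
    ]
    by_rxcui: dict[str, str] = {}
    for name in candidates:
        rxcui = resolve(name)
        # Unresolvable names are dropped; the first name seen for an ID wins.
        if rxcui is not None and rxcui not in by_rxcui:
            by_rxcui[rxcui] = name
    return by_rxcui
```

Deduplicating on the resolved ID rather than on the surface string means spelling variants of the same ingredient collapse into one result.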

	| Experiment Mode | Pipeline Precision | Pipeline Recall | Pipeline F1 | Avg Latency (ms)* |
	|-----------------|--------------------|-----------------|-------------|-------------------|
	| baseline (Current) | 76.2% | 84.1% | 80.0% | 984.5 |
	| gliner_filter | 77.8% | 84.0% | 80.8% | 1221.9 |
	| gliner_fallback | 76.5% | 84.5% | 80.3% | 1208.5 |
	| gliner_sequential | 84.9% | 58.6% | 69.3% | 990.9 |
	| gliner_adjudicated | 78.2% | 84.4% | 81.2% | 1218.8 |
	| gliner_union | 78.0% | 93.6% | 85.1% | 1266.3 |

	\* Note: Latency is inflated by concurrent background testing hitting CPU limits, but the relative differences show the overhead of GLiNER. The sequential mode adds almost no latency over baseline (990.9 ms vs. 984.5 ms) because it only runs GLiNER on short snippets.

	Findings:
	1. Recall Improvement (`gliner_union`): Union mode raised recall from 84.1% to 93.6% (+9.5pp) and precision from 76.2% to 78.0% (+1.8pp). GLiNER identifies active ingredients that PharmaDetect misses entirely, and the added spans did not lower precision.
	2. The Context Trap (`gliner_sequential`): Sequential mode achieved the highest precision (84.9%) and the lowest latency of the GLiNER modes (990.9 ms, within 1% of baseline), but its recall fell to 58.6%. Because GLiNER only saw the short snippets extracted by PharmaDetect, it lost the surrounding context it relies on to classify entities, so many valid active ingredients were wrongly filtered out.
	3. Precision Improvement (`gliner_filter` and `gliner_adjudicated`): Both modes stripped out false positives (brands, salts). The filter raised precision by 1.6pp at a cost of 0.1pp recall; the salt-aware adjudicator raised precision by 2.0pp and recall by 0.3pp.
	4. Conclusion: `gliner_union` produced the best overall F1 score (85.1%), driven by its recall gain; running GLiNER on the full text preserves the context it needs.

	### Interpretation

	Recall is strong (84%). The model finds the active ingredient in most
	cases. Multi-ingredient labels are slightly harder (82%) due to more
	complex text. Under light OCR noise, recall drops only to 80% — the
	model is reasonably robust to minor distortion.

	Precision is poor (47%). The model tags brand names, manufacturer
	names, salt forms, and dosage form words as chemical entities. This is
	correct per BC5CDR training ("find all chemicals") but wrong for our use
	case ("find active pharmaceutical ingredients only").

	Heavy noise cuts recall by over a third (84% → 54%). Mid-word splits (`Bevacizu mab`)
	and character corruption (`Metforrnin`) disrupt the tokenizer's sub-word segmentation. This is
	the target for Phase 4 (OCR Modernization with dictionary-backed fuzzy
	matching).

	## False Positive Analysis

	The false positives fall into four predictable categories:

	### 1. Brand Names
	The model tags brand names as chemical entities: Augmentin, Avastin,
	Allegra, Lipitor, etc. These are the medicine's trade names, not active
	ingredients.

	### 2. Salt Forms
	The model tags salt/counter-ion names separately: Sodium, Hydrochloride,
	Calcium, Phosphate, Maleate, Potassium. These appear in compositions
	like "Atorvastatin Calcium" but are not the active drug.

	### 3. Manufacturer Names
	Pharmaceutical company names such as "Cipla" and "Lupin" occasionally get
	tagged, apparently because they look chemical-like to the model.

	### 4. Dosage Form Words
	Words like "Tablet", "Capsule", "Syrup", "Injection" sometimes get
	tagged, especially in compact label layouts.
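A simple way to bucket false positives into these four categories is sketched below. The word lists reuse only the examples given above, and the substring matching against the case's brand and manufacturer fields is a crude assumption; `categorize_false_positive` is a hypothetical helper, not part of the benchmark code.

```python
SALT_FORMS = {"sodium", "hydrochloride", "calcium", "phosphate", "maleate", "potassium"}
DOSAGE_FORMS = {"tablet", "capsule", "syrup", "injection"}


def categorize_false_positive(entity: str, brand_name: str, manufacturer: str) -> str:
    """Assign a false-positive entity to one of the categories described above."""
    token = entity.strip().lower()
    if not token:
        return "other"
    if token in SALT_FORMS:
        return "salt_form"
    if token in DOSAGE_FORMS:
        return "dosage_form"
    # Fall back to the case's own metadata for manufacturer and brand matches.
    if token in manufacturer.lower():
        return "manufacturer"
    if token in brand_name.lower():
        return "brand_name"
    return "other"
```

Counting the returned labels over all false positives would show which category dominates and therefore which filter is worth building first.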

	## Noise Degradation Analysis

	| Metric | None → Light | Light → Heavy |
	|--------|-------------|---------------|
	| Recall | 84.4% → 79.8% (−4.6pp) | 79.8% → 53.5% (−26.3pp) |
	| Precision | 46.9% → 44.9% (−2.0pp) | 44.9% → 26.2% (−18.7pp) |
	| F1 | 60.3% → 57.5% (−2.8pp) | 57.5% → 35.2% (−22.3pp) |

	Light noise has minimal impact — the OCR cleaner handles common `o→0`
	and `l→1` substitutions. Heavy noise causes severe degradation because:

	- Mid-word splits break tokenization (`Amoxy cillin` becomes two tokens)
	- m→rn corruption in drug names (`Metforrnin`) escapes the cleaner's
	limited regex patterns
	- All-caps text shifts token distributions away from training data

	## Comparison with Published Benchmarks

	| Benchmark | Precision | Recall | F1 |
	|-----------|-----------|--------|-----|
	| BC5CDR-CHEM (published) | 95.1% | 96.6% | 95.8% |
	| Our packaging, clean | 46.9% | 84.4% | 60.3% |
	| Our packaging, light noise | 44.9% | 79.8% | 57.5% |
	| Our packaging, heavy noise | 26.2% | 53.5% | 35.2% |

	The precision gap (95% → 47%) reflects a task mismatch, not model
	quality. BC5CDR measures "find all chemicals" — we measure "find only
	active pharmaceutical ingredients." The recall gap (97% → 84%) reflects
	the domain shift from clean scientific text to formulaic packaging labels.

	## Remediation Plan

	| Phase | Target | Expected Impact |
	|-------|--------|-----------------|
	| Phase 2: Entity Linking | Filter NER output through DrugBank to reject non-drug entities | Precision 47% → ~85%+ |
	| Phase 3: Fallback | DrugBank fuzzy search when NER finds 0 entities | Recall +5-10% on edge cases |
	| Phase 4: OCR Modernization | Dictionary-backed fuzzy correction before NER | Heavy-noise recall 54% → ~75%+ |

	See `docs/plans/phase{2,3,4}_*.md` for detailed designs.
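As a rough illustration of what Phase 4's dictionary-backed fuzzy correction might involve, the sketch below rejoins mid-word splits and snaps near-miss tokens to a drug vocabulary using the standard library's `difflib`. The function name, the `cutoff` value, and the two-step strategy are assumptions; the actual design lives in the phase 4 plan.

```python
import difflib


def correct_tokens(text: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Rejoin split words and fuzzy-correct tokens against a drug vocabulary."""
    lookup = {word.lower(): word for word in vocabulary}
    keys = list(lookup)
    tokens = text.split()
    corrected: list[str] = []
    i = 0
    while i < len(tokens):
        # Step 1: try to rejoin a mid-word split ('Bevacizu mab').
        if i + 1 < len(tokens):
            joined = (tokens[i] + tokens[i + 1]).lower()
            if joined in lookup:
                corrected.append(lookup[joined])
                i += 2
                continue
        # Step 2: snap a corrupted token to its closest vocabulary entry.
        matches = difflib.get_close_matches(tokens[i].lower(), keys, n=1, cutoff=cutoff)
        corrected.append(lookup[matches[0]] if matches else tokens[i])
        i += 1
    return " ".join(corrected)
```

A high cutoff keeps dosage tokens such as `500mg` untouched, but any fuzzy matcher of this kind can also pull a brand name toward a similarly spelled ingredient, so the threshold would need tuning against the heavy-noise set.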

	## Reproducing

	```bash
	# Generate dataset (once)
	uv run python eval/prepare_hf_dataset.py # clean
	uv run python eval/prepare_hf_dataset.py --noise light # light OCR noise
	uv run python eval/prepare_hf_dataset.py --noise heavy # heavy OCR noise

	# Run benchmark
	uv run python eval/benchmark.py # full dataset
	uv run python eval/benchmark.py --limit 500 # quick run
	uv run python eval/benchmark.py --with-rxnorm # full pipeline (needs network)
	uv run python eval/benchmark.py -v # per-case details
	```

	Results are written to `eval/results.json`.

	## References

	- [OpenMed NER paper](https://arxiv.org/abs/2508.01630) — Panahi, 2025
	- [BC5CDR corpus](https://pmc.ncbi.nlm.nih.gov/articles/PMC4860626/) — Li et al., 2016
	- [Bio_Discharge_Summary_BERT](https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT) — Alsentzer et al.
	- [PharmaDetect-BioPatient-108M](https://huggingface.co/OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M) — OpenMed
	- [MattBastar/Medicine_Details](https://huggingface.co/datasets/MattBastar/Medicine_Details) — HuggingFace dataset
