NER Benchmark: PharmaDetect on Pill Packaging Text

Date: 2026-04-14
Model: OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M (108M params)
Dataset: 11,796 synthesized pack-label texts from HuggingFace MattBastar/Medicine_Details
Pipeline tested: OCR cleaner → NER entity extraction (no RxNorm validation)

Dataset

Source

MattBastar/Medicine_Details — 11,825 Indian brand medicines with structured Composition, Manufacturer, Uses, and Image URL fields. We use the Medicine Name + Composition + Manufacturer columns to synthesize realistic pack-label OCR text.

How It Works

prepare_hf_dataset.py takes each row and:

  1. Synthesizes a pack label from a random template (blister pack, box label, prescription-style, syrup label, etc.):

    Augmentin 625 Duo Tablet
    Each tablet contains:
    Amoxycillin 500mg
    Clavulanic Acid 125mg
    Glaxo SmithKline Pharmaceuticals Ltd
    
  2. Parses ground-truth labels from the Composition field: "Amoxycillin (500mg) + Clavulanic Acid (125mg)" → ["Amoxycillin", "Clavulanic Acid"]

  3. Optionally injects OCR noise (--noise light|heavy) using pharma-specific distortion patterns drawn from real OCR failures:

    | Pattern | Example | Source |
    |---|---|---|
    | m→rn | Metformin → Metforrnin | Glyph confusion in serif fonts |
    | I→l (word start) | Ibuprofen → lbuprofen | Uppercase I vs lowercase L |
    | l→1 (interior) | Alprazolam → A1prazolam | l/1 confusion |
    | o→0 / O→0 | Omeprazole → 0mepraz0le | Letter/digit confusion |
    | cl→d | Clavulanic → Davulanic | Ligature misread |
    | mg→rng | 500mg → 500rng | m→rn in dosage suffix |
    | Mid-word splits | Bevacizumab → Bevacizu mab | Line-wrap OCR artifact |
    | All-caps | ATORVASTATIN 40MG | Uppercase printed labels |
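The character-level distortions in step 3 can be sketched as a list of regex substitutions, each applied to a match with some probability. This is a minimal illustration, not the code in prepare_hf_dataset.py: the function name `inject_noise`, the `rate` parameter, and the order of the patterns are assumptions, and mid-word splits and all-caps conversion are left out.

```python
import random
import re

# (pattern, replacement) pairs mirroring the distortion table, applied in order.
# The ordering is an assumption made for this sketch.
OCR_PATTERNS = [
    (r"Cl", "D"),                        # cl→d ligature misread (capitalised form)
    (r"cl", "d"),                        # cl→d ligature misread
    (r"m", "rn"),                        # m→rn, which also turns "mg" into "rng"
    (r"(?<=[A-Za-z])l(?=[a-z])", "1"),   # l→1 in the interior of a word
    (r"\bI(?=[a-z])", "l"),              # I→l at the start of a word
    (r"[oO]", "0"),                      # o→0 / O→0
]


def inject_noise(text: str, rate: float, rng: random.Random) -> str:
    """Corrupt each pattern match independently with probability `rate`."""

    def corrupt(pattern: str, replacement: str, s: str) -> str:
        return re.sub(
            pattern,
            lambda m: replacement if rng.random() < rate else m.group(0),
            s,
        )

    for pattern, replacement in OCR_PATTERNS:
        text = corrupt(pattern, replacement, text)
    return text
```

With `rate=0.0` the text passes through unchanged; with `rate=1.0` every match is corrupted, so "Metformin 500mg" becomes "Metf0rrnin 500rng". Passing an explicit `random.Random` instance keeps a generated dataset reproducible for a given seed.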

Category Breakdown

| Category | N | Description |
|---|---|---|
| single_ingredient | 7,081 | One active ingredient (e.g., Azithromycin) |
| dual_ingredient | 3,591 | Two active ingredients (e.g., Amoxycillin + Clavulanic Acid) |
| multi_ingredient | 1,124 | Three or more active ingredients |
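As a rough sketch of how a Composition string could be turned into ground-truth labels and a category, assuming ingredients are separated by "+" and doses sit in parentheses as in the Augmentin example. The function names `parse_composition` and `categorize` are hypothetical, and the real script may handle more edge cases.

```python
import re


def parse_composition(composition: str) -> list[str]:
    """Split 'A (500mg) + B (125mg)' into ['A', 'B'], dropping the doses."""
    names = []
    for part in composition.split("+"):
        # Remove any parenthesised dose, then trim surrounding whitespace.
        name = re.sub(r"\([^)]*\)", "", part).strip()
        if name:
            names.append(name)
    return names


def categorize(ingredients: list[str]) -> str:
    """Map the number of active ingredients to a benchmark category."""
    if not ingredients:
        raise ValueError("composition has no ingredients")
    if len(ingredients) == 1:
        return "single_ingredient"
    if len(ingredients) == 2:
        return "dual_ingredient"
    return "multi_ingredient"
```

For the Augmentin row this yields `["Amoxycillin", "Clavulanic Acid"]` and the category `dual_ingredient`. The three category counts sum to the 11,796 texts in the dataset.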

Reproducing

uv run python eval/prepare_hf_dataset.py                  # clean text (default)
uv run python eval/prepare_hf_dataset.py --noise light     # light OCR artifacts
uv run python eval/prepare_hf_dataset.py --noise heavy     # heavy OCR distortion
uv run python eval/prepare_hf_dataset.py --limit 500       # smaller sample

About the Model

Architecture

PharmaDetect-BioPatient-108M is a token-classification (NER) model from the OpenMed NER suite (Panahi, 2025). It detects chemical entities using BIO tagging (B-CHEM, I-CHEM).

| Property | Value |
|---|---|
| Base model | Bio_Discharge_Summary_BERT (BioBERT v1.0 → MIMIC-III discharge summaries, ~880M words) |
| DAPT corpus | 350k passages (90M tokens): 100k PubMed, 100k arXiv, 100k MIMIC-III, 50k ClinicalTrials.gov |
| DAPT method | LoRA (rank=16, α=32, dropout=0.05) on query/value matrices, 3 epochs, single A100 ~4h |
| Fine-tuning dataset | BC5CDR-CHEM (BioCreative V Chemical-Disease Relation) |
| Parameters | 108M (1.4% trainable via LoRA during DAPT) |
| Entity types | Chemical entities only (B-CHEM / I-CHEM) |
| Published F1 | 95.83% on BC5CDR-CHEM test set |
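To make the BIO scheme concrete, here is a small sketch of how word-level B-CHEM / I-CHEM tags can be grouped into entity strings. It assumes the tags are already aligned to whole words. The function name `bio_to_spans` is hypothetical, and treating a stray I-CHEM with no preceding B-CHEM as "outside" is one convention among several.

```python
def bio_to_spans(tokens: list[str], tags: list[str]) -> list[str]:
    """Group B-CHEM / I-CHEM tagged words into entity strings."""
    spans: list[str] = []
    current: list[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-CHEM":
            # A new entity starts; close any entity that was still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-CHEM" and current:
            # Continue the open entity.
            current.append(token)
        else:
            # "O", or an I-CHEM with nothing to continue: close the open entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

For example, the words "Clavulanic Acid 125mg" tagged B-CHEM, I-CHEM, O produce the single entity "Clavulanic Acid".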

Domain Gap: Literature vs. Packaging

| Aspect | BC5CDR (training) | Pill packaging (our use) |
|---|---|---|
| Text style | Scientific prose | Formulaic labels |
| Length | Full abstracts (~200 words) | Short labels (~5-20 words) |
| Chemical mentions | In sentence context | Standalone, prominent |
| Brand names | Rare | Very common |
| Salt forms | Part of scientific name | Separated on packaging |
| OCR artifacts | None | Common (m→rn, o→0, etc.) |

Our Pipeline

Raw OCR text
    → ocr_cleaner.py (fix rn→m, 0→o, ligatures, whitespace)
    → ner_model.py (PharmaDetect with manual sub-word merging)
    → drug_analyzer.py (filter CHEM labels, dedupe, skip single-char/punctuation)
    → rxnorm_client.py (validate against RxNorm, get rxcui)
    → [if NER finds nothing] rxnorm fallback (approximate term search)

The benchmark tests the first three steps (OCR cleaner → NER → basic filtering) without RxNorm validation, to isolate the NER model's behavior on packaging text.
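The three benchmarked steps can be sketched roughly as follows. None of this is the project's actual code: the function names, the specific regexes, and the choice to keep the first piece's label when merging are all assumptions. The merge step assumes BERT-style WordPiece output, where continuation pieces start with "##".

```python
import re
import string


def clean_ocr(text: str) -> str:
    """A few conservative OCR fixes (assumed rules, not the real cleaner)."""
    text = re.sub(r"(?<=\d)\s*rng\b", "mg", text)             # 500rng -> 500mg
    text = re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "o", text)   # zero between letters
    text = re.sub(r"\b0(?=[A-Za-z]{2})", "O", text)           # zero starting a word
    text = re.sub(r"[ \t]+", " ", text)                       # collapse whitespace
    return text.strip()


def merge_wordpieces(pieces: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge '##' continuation pieces into whole words, keeping the first label."""
    words: list[tuple[str, str]] = []
    for token, label in pieces:
        if token.startswith("##") and words:
            prev_token, prev_label = words[-1]
            words[-1] = (prev_token + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


def filter_entities(words: list[tuple[str, str]]) -> list[str]:
    """Keep CHEM-labelled words, skipping single characters, punctuation, and duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for token, label in words:
        if not label.endswith("CHEM"):
            continue
        if len(token) < 2 or all(ch in string.punctuation for ch in token):
            continue
        key = token.lower()
        if key not in seen:
            seen.add(key)
            kept.append(token)
    return kept
```

The cleaner deliberately avoids a blanket rn→m rewrite, since "rn" also occurs in correctly spelled words. That choice belongs to this sketch and may differ from ocr_cleaner.py.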

Results

By Noise Level (Bare NER, 500-case samples)

| Noise | Precision | Recall | F1 | Detection |
|---|---|---|---|---|
| none (clean) | 46.9% | 84.4% | 60.3% | 99.6% |
| light (5-15% char errors) | 44.9% | 79.8% | 57.5% | 99.8% |
| heavy (40% errors + splits) | 26.2% | 53.5% | 35.2% | 99.8% |

Detection rate = percentage of cases where NER found at least one entity.
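The metrics in these tables follow the usual micro-averaged definitions. The sketch below shows one way to compute them from per-case predicted and gold ingredient sets. The matching rule (case-insensitive exact match) and the function names are assumptions, and eval/benchmark.py may match more leniently.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from pooled counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(cases: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    """cases: (predicted, gold) name sets, one pair per pack label."""
    tp = fp = fn = detected = 0
    for predicted, gold in cases:
        predicted = {p.lower() for p in predicted}
        gold = {g.lower() for g in gold}
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
        detected += bool(predicted)  # at least one entity found
    precision, recall, f1 = prf(tp, fp, fn)
    detection = detected / len(cases) if cases else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "detection": detection}
```

As a consistency check, summing the TP/FP/FN columns of the by-category table for clean text gives 670, 760, and 124, and `prf(670, 760, 124)` reproduces the 46.9% / 84.4% / 60.3% clean-text row.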

Full Pipeline vs Bare NER (50-case samples)

| Pipeline Step | Noise | Precision | Recall | F1 | Latency / Case |
|---|---|---|---|---|---|
| Bare NER | none | 48.6% | 81.0% | 60.7% | 64ms |
| Full Pipeline | none | 71.6% | 81.0% | 76.0% | 961ms |
| Bare NER | light | 47.5% | 79.8% | 59.6% | 69ms |
| Full Pipeline | light | 74.4% | 79.8% | 77.0% | 1089ms |
| Bare NER | heavy | 24.7% | 47.6% | 32.5% | 78ms |
| Full Pipeline | heavy | 65.6% | 47.6% | 55.2% | 2597ms |

The RxNorm validation step rejected 37 false positives, raising clean-text precision by 23 points (48.6% → 71.6%) without reducing recall at any noise level. The cost is latency: the external HTTP API adds about 900ms per case on clean text (64ms → 961ms) and about 2.5s under heavy noise (78ms → 2597ms).
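One way to picture the validation step, and a possible way to soften its latency cost, is a filter that keeps only candidates an external lookup can resolve, with results memoized so repeated names do not trigger repeated HTTP calls. This is a sketch under assumptions: `lookup` stands in for whatever rxnorm_client.py does, `make_validator` is a hypothetical name, and the caching is a suggestion rather than something the benchmark measured.

```python
from functools import lru_cache
from typing import Callable, Iterable, Optional


def make_validator(lookup: Callable[[str], Optional[str]]):
    """Wrap a name -> identifier lookup (e.g. an RxNorm query) with a cache."""

    @lru_cache(maxsize=4096)
    def cached(name: str) -> Optional[str]:
        # Misses (None) are cached too, so rejected names are not re-queried.
        return lookup(name)

    def validate(candidates: Iterable[str]) -> dict[str, str]:
        """Return {identifier: first surface form} for candidates that resolve."""
        kept: dict[str, str] = {}
        for name in candidates:
            identifier = cached(name.strip().lower())
            if identifier is not None:
                kept.setdefault(identifier, name)
        return kept

    return validate
```

In use, `validate(["Tablet", "Amoxycillin"])` would drop "Tablet" if the lookup returns `None` for it, which is the mechanism by which validation raises precision while leaving recall unchanged.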

By Category (clean text, 500 cases)

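| Category | N | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|---|
| single_ingredient | 279 | 41% | 86% | 55% | 239 | 347 | 40 |
| dual_ingredient | 155 | 49% | 85% | 62% | 261 | 270 | 47 |
| multi_ingredient | 66 | 54% | 82% | 65% | 170 | 143 | 37 |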

GLiNER Pipeline Augmentation (500-case samples, clean text)

To improve precision (rejecting brands, salts, and manufacturers) and recall (catching edge cases), we evaluated urchade/gliner_medium-v2.1 as a secondary observer and adjudicator alongside the baseline PharmaDetect + RxNorm pipeline.

Evaluated Architectures

  1. baseline (Current): PharmaDetect NER → RxNorm Validation. If no entities are found, falls back to raw text blocks via RxNorm's approximateTerm.
  2. gliner_filter (Precision): GLiNER runs alongside PharmaDetect. If GLiNER explicitly tags a PharmaDetect entity as a negative label (e.g., brand or trade name, manufacturer, salt or counter-ion), the entity is rejected before hitting RxNorm.
  3. gliner_sequential (Speed & Precision): PharmaDetect runs first. Only the short entity spans extracted by PharmaDetect are passed to GLiNER for classification. If GLiNER tags the snippet as an active pharmaceutical ingredient, it proceeds to RxNorm. This avoids running GLiNER over the whole document, which keeps latency close to the baseline (see the table below).
  4. gliner_fallback (Recall): If PharmaDetect returns zero entities, the pipeline queries GLiNER for active pharmaceutical ingredient spans instead of running the raw text approximateTerm fallback.
  5. gliner_union (Recall & Precision): Both PharmaDetect and GLiNER run. All GLiNER active pharmaceutical ingredient spans are merged with the PharmaDetect results, deduplicated by their resolved RxNorm IDs.
  6. gliner_adjudicated (Complex Logic): An advanced version of the filter. It rejects negative GLiNER labels but is "salt-aware": it will not reject a standalone salt form if GLiNER detects an active ingredient immediately adjacent to it in the text.

| Experiment Mode | Pipeline Precision | Pipeline Recall | Pipeline F1 | Avg Latency (ms)* |
|---|---|---|---|---|
| baseline (Current) | 76.2% | 84.1% | 80.0% | 984.5 |
| gliner_filter | 77.8% | 84.0% | 80.8% | 1221.9 |
| gliner_fallback | 76.5% | 84.5% | 80.3% | 1208.5 |
| gliner_sequential | 84.9% | 58.6% | 69.3% | 990.9 |
| gliner_adjudicated | 78.2% | 84.4% | 81.2% | 1218.8 |
| gliner_union | 78.0% | 93.6% | 85.1% | 1266.3 |

*Note: Latency is inflated by concurrent background testing that hit CPU limits, but the relative differences still show GLiNER's overhead. The sequential mode stays within a few milliseconds of the baseline because it runs GLiNER only on short snippets.

Findings:

  1. Recall Improvement (gliner_union): The union mode raised recall from 84.1% to 93.6% (+9.5pp) while also nudging precision up from 76.2% to 78.0%. GLiNER identifies active ingredients that PharmaDetect misses entirely, and the spans it adds did not lower precision.
  2. The Context Trap (gliner_sequential): The sequential mode achieved the highest precision (84.9%) and the lowest latency of the GLiNER modes (990.9ms, within 7ms of the baseline). However, its recall fell to 58.6%. Because GLiNER was fed only the short spans extracted by PharmaDetect, it lost the surrounding context it relies on to classify entities. Without that context it failed to recognize many valid active ingredients, and they were wrongly filtered out.
  3. Precision Improvement (gliner_filter and gliner_adjudicated): The filter raised precision by 1.6pp (to 77.8%) and the salt-aware adjudicator by 2.0pp (to 78.2%) by stripping out false positives such as brands and salts. Recall was essentially unchanged (84.0% and 84.4% versus 84.1%).
  4. Conclusion: gliner_union produced the best overall F1 score (85.1%) because running GLiNER on the full text preserves its contextual reasoning.
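The filter and union modes reduce to simple set logic once both models have run. The sketch below assumes GLiNER output is a list of dicts with "text" and "label" keys and that `resolve` maps a name to an RxNorm ID or `None`. The function names and exact-text matching are illustrative assumptions, not the evaluated implementation.

```python
from typing import Callable, Optional

# Label strings taken from the mode descriptions above.
NEGATIVE_LABELS = {"brand or trade name", "manufacturer", "salt or counter-ion"}
API_LABEL = "active pharmaceutical ingredient"


def gliner_filter(ner_entities: list[str], gliner_spans: list[dict]) -> list[str]:
    """Drop NER entities that GLiNER tagged with a negative label."""
    rejected = {
        span["text"].lower()
        for span in gliner_spans
        if span["label"] in NEGATIVE_LABELS
    }
    return [entity for entity in ner_entities if entity.lower() not in rejected]


def gliner_union(
    ner_entities: list[str],
    gliner_spans: list[dict],
    resolve: Callable[[str], Optional[str]],
) -> list[str]:
    """Merge both models' candidates, deduplicated by resolved identifier."""
    candidates = list(ner_entities) + [
        span["text"] for span in gliner_spans if span["label"] == API_LABEL
    ]
    merged: dict[str, str] = {}
    for name in candidates:
        identifier = resolve(name)
        if identifier is not None:
            merged.setdefault(identifier, name)  # keep the first surface form
    return list(merged.values())
```

Both functions take GLiNER spans computed on the full label text, which is the property the findings credit for union's recall gain. The sequential mode differs precisely in that GLiNER never sees that full text.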

Interpretation

Recall is strong (84%). The model finds the active ingredient in most cases. Multi-ingredient labels are slightly harder (82%) due to more complex text. Under light OCR noise, recall drops only to 80% — the model is reasonably robust to minor distortion.

Precision is poor (47%). The model tags brand names, manufacturer names, salt forms, and dosage form words as chemical entities. This is correct per BC5CDR training ("find all chemicals") but wrong for our use case ("find active pharmaceutical ingredients only").

Heavy noise cuts recall to 54%, a 30-point drop from clean text. Mid-word splits (Bevacizu mab) and character corruption (Metforrnin) break the tokenizer. This is the target for Phase 4 (OCR Modernization with dictionary-backed fuzzy matching).

False Positive Analysis

All false positives fall into predictable categories:

1. Brand Names

The model tags brand names as chemical entities: Augmentin, Avastin, Allegra, Lipitor, etc. These are the medicine's trade names, not active ingredients.

2. Salt Forms

The model tags salt/counter-ion names separately: Sodium, Hydrochloride, Calcium, Phosphate, Maleate, Potassium. These appear in compositions like "Atorvastatin Calcium" but are not the active drug.

3. Manufacturer Names

Pharmaceutical company names such as "Cipla" and "Lupin" occasionally get tagged, apparently because they look chemical-like to the model.

4. Dosage Form Words

Words like "Tablet", "Capsule", "Syrup", "Injection" sometimes get tagged, especially in compact label layouts.
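Two of these categories (salt forms and dosage form words) are closed enough vocabularies that a stoplist can catch them. The sketch below uses only the example words listed above. It is not part of the benchmarked pipeline, `strip_non_api` is a hypothetical name, and a real list would need care because words like "Calcium" can also be the active ingredient. Brand and manufacturer names need a dictionary or a second model and are not handled here.

```python
# Example words from the false-positive categories above; not exhaustive.
SALT_WORDS = {"sodium", "hydrochloride", "calcium", "phosphate", "maleate", "potassium"}
DOSAGE_FORM_WORDS = {"tablet", "capsule", "syrup", "injection"}


def strip_non_api(entities: list[str]) -> list[str]:
    """Trim trailing salt words and drop standalone salt or dosage-form words."""
    kept: list[str] = []
    for entity in entities:
        words = entity.split()
        # "Atorvastatin Calcium" -> "Atorvastatin"; never trim the last word away.
        while len(words) > 1 and words[-1].lower() in SALT_WORDS:
            words.pop()
        candidate = " ".join(words)
        if candidate.lower() in SALT_WORDS | DOSAGE_FORM_WORDS:
            continue
        if candidate:
            kept.append(candidate)
    return kept
```

For instance, `["Atorvastatin Calcium", "Hydrochloride", "Tablet", "Metformin"]` reduces to `["Atorvastatin", "Metformin"]`.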

Noise Degradation Analysis

| Metric | None → Light | Light → Heavy |
|---|---|---|
| Recall | 84.4% → 79.8% (−4.6pp) | 79.8% → 53.5% (−26.3pp) |
| Precision | 46.9% → 44.9% (−2.0pp) | 44.9% → 26.2% (−18.7pp) |
| F1 | 60.3% → 57.5% (−2.8pp) | 57.5% → 35.2% (−22.3pp) |

Light noise has minimal impact — the OCR cleaner handles common o→0 and l→1 substitutions. Heavy noise causes severe degradation because:

  • Mid-word splits break tokenization (Amoxy cillin becomes two tokens)
  • m→rn corruption in drug names (Metforrnin) escapes the cleaner's limited regex patterns
  • All-caps text shifts token distributions away from training data

Comparison with Published Benchmarks

| Benchmark | Precision | Recall | F1 |
|---|---|---|---|
| BC5CDR-CHEM (published) | 95.1% | 96.6% | 95.8% |
| Our packaging (clean) | 46.9% | 84.4% | 60.3% |
| Our packaging (light noise) | 44.9% | 79.8% | 57.5% |
| Our packaging (heavy noise) | 26.2% | 53.5% | 35.2% |

The precision gap (95% → 47%) reflects a task mismatch, not model quality. BC5CDR measures "find all chemicals" — we measure "find only active pharmaceutical ingredients." The recall gap (97% → 84%) reflects the domain shift from clean scientific text to formulaic packaging labels.

Remediation Plan

| Phase | Target | Expected Impact |
|---|---|---|
| Phase 2: Entity Linking | Filter NER output through DrugBank to reject non-drug entities | Precision 47% → ~85%+ |
| Phase 3: Fallback | DrugBank fuzzy search when NER finds 0 entities | Recall +5-10% on edge cases |
| Phase 4: OCR Modernization | Dictionary-backed fuzzy correction before NER | Heavy-noise recall 54% → ~75%+ |

See docs/plans/phase{2,3,4}_*.md for detailed designs.
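As an illustration of what the Phase 4 idea could look like, the sketch below corrects tokens against a drug-name vocabulary using the standard library's `difflib.get_close_matches`. It first tries to re-join a mid-word split under a stricter cutoff, then falls back to single-token correction. The function name, both cutoffs, and the tiny vocabulary are assumptions for illustration only. The actual design is in the Phase 4 plan document.

```python
import difflib


def fuzzy_correct(
    text: str,
    vocabulary: list[str],
    cutoff: float = 0.8,
    join_cutoff: float = 0.9,
) -> str:
    """Replace noisy tokens with their closest vocabulary entry, if close enough."""
    by_lower = {name.lower(): name for name in vocabulary}
    keys = list(by_lower)

    def lookup(word: str, threshold: float):
        match = difflib.get_close_matches(word.lower(), keys, n=1, cutoff=threshold)
        return by_lower[match[0]] if match else None

    tokens = text.split()
    out: list[str] = []
    i = 0
    while i < len(tokens):
        # Try to undo a mid-word split; the stricter cutoff avoids
        # swallowing an unrelated neighbour such as a dose.
        if i + 1 < len(tokens):
            joined = lookup(tokens[i] + tokens[i + 1], join_cutoff)
            if joined is not None:
                out.append(joined)
                i += 2
                continue
        single = lookup(tokens[i], cutoff)
        out.append(single if single is not None else tokens[i])
        i += 1
    return " ".join(out)
```

With the vocabulary `["Metformin", "Bevacizumab", "Atorvastatin"]`, "Metforrnin 500mg" is corrected to "Metformin 500mg" and "Bevacizu mab 400mg" to "Bevacizumab 400mg". A split fragment that is also corrupted (for example "Bevacizu rnab") can still slip past the stricter join cutoff, so the cutoffs would need tuning against real OCR output.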

Reproducing

# Generate dataset (once)
uv run python eval/prepare_hf_dataset.py                  # clean
uv run python eval/prepare_hf_dataset.py --noise light     # light OCR noise
uv run python eval/prepare_hf_dataset.py --noise heavy     # heavy OCR noise

# Run benchmark
uv run python eval/benchmark.py                    # full dataset
uv run python eval/benchmark.py --limit 500        # quick run
uv run python eval/benchmark.py --with-rxnorm      # full pipeline (needs network)
uv run python eval/benchmark.py -v                 # per-case details

Results are written to eval/results.json.

References
