| # NER Benchmark: PharmaDetect on Pill Packaging Text | |
| **Date**: 2026-04-14 | |
| **Model**: `OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M` (108M params) | |
| **Dataset**: 11,796 synthesized pack-label texts from HuggingFace `MattBastar/Medicine_Details` | |
| **Pipeline tested**: OCR cleaner → NER entity extraction (no RxNorm validation) | |
| ## Dataset | |
| ### Source | |
| [MattBastar/Medicine_Details](https://huggingface.co/datasets/MattBastar/Medicine_Details) | |
| — 11,825 Indian brand medicines with structured Composition, Manufacturer, | |
| Uses, and Image URL fields. We use the Medicine Name + Composition + | |
| Manufacturer columns to synthesize realistic pack-label OCR text. | |
| ### How It Works | |
| `prepare_hf_dataset.py` takes each row and: | |
| 1. **Synthesizes a pack label** from a random template (blister pack, | |
| box label, prescription-style, syrup label, etc.): | |
| ``` | |
| Augmentin 625 Duo Tablet | |
| Each tablet contains: | |
| Amoxycillin 500mg | |
| Clavulanic Acid 125mg | |
| Glaxo SmithKline Pharmaceuticals Ltd | |
| ``` | |
| 2. **Parses ground-truth labels** from the Composition field: | |
| `"Amoxycillin (500mg) + Clavulanic Acid (125mg)"` → `["Amoxycillin", "Clavulanic Acid"]` | |
| 3. **Optionally injects OCR noise** (`--noise light|heavy`) using | |
| pharma-specific distortion patterns drawn from real OCR failures: | |
| | Pattern | Example | Source | | |
| |---------|---------|--------| | |
| | m→rn | Metformin → Metforrnin | Glyph confusion in serif fonts | | |
| | I→l (word start) | Ibuprofen → lbuprofen | Uppercase I vs lowercase L | | |
| | l→1 (interior) | Alprazolam → A1prazolam | l/1 confusion | | |
| | o→0 / O→0 | Omeprazole → 0mepraz0le | Letter/digit confusion | | |
| | cl→d | Clavulanic → Davulanic | Ligature misread | | |
| | mg→rng | 500mg → 500rng | m→rn in dosage suffix | | |
| | Mid-word splits | Bevacizumab → Bevacizu mab | Line-wrap OCR artifact | | |
| | All-caps | ATORVASTATIN 40MG | Uppercase printed labels | | |
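Steps 2 and 3 above can be sketched as follows. This is a minimal illustration rather than the code in `prepare_hf_dataset.py`: the function names, the rule subset, and the per-match `rate` parameter are assumptions made for the example.

```python
import random
import re


def parse_composition(composition: str) -> list[str]:
    """Split 'A (500mg) + B (125mg)' into ['A', 'B']."""
    names = []
    for part in composition.split("+"):
        # Drop any parenthesised strength such as "(500mg)".
        name = re.sub(r"\([^)]*\)", "", part).strip()
        if name:
            names.append(name)
    return names


# Hypothetical subset of the distortion table above (applied in this order).
NOISE_RULES = [
    (re.compile(r"mg\b"), "rng"),           # 500mg -> 500rng
    (re.compile(r"\bI"), "l"),              # Ibuprofen -> lbuprofen (word start)
    (re.compile(r"(?<=\w)l(?=\w)"), "1"),   # Alprazolam -> A1prazolam (interior)
    (re.compile(r"[oO]"), "0"),             # Omeprazole -> 0mepraz0le
]


def inject_noise(text: str, rate: float, rng: random.Random) -> str:
    """Corrupt each rule match independently with probability `rate`."""
    for pattern, replacement in NOISE_RULES:
        text = pattern.sub(
            lambda m: replacement if rng.random() < rate else m.group(0),
            text,
        )
    return text


if __name__ == "__main__":
    print(parse_composition("Amoxycillin (500mg) + Clavulanic Acid (125mg)"))
    print(inject_noise("Ibuprofen 500mg", rate=0.15, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the generated noise reproducible between runs, which matters when comparing benchmark results across noise levels.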
| ### Category Breakdown | |
| | Category | N | Description | | |
| |----------|---|-------------| | |
| | single_ingredient | 7,081 | One active ingredient (e.g., Azithromycin) | | |
| | dual_ingredient | 3,591 | Two active ingredients (e.g., Amoxycillin + Clavulanic Acid) | | |
| | multi_ingredient | 1,124 | Three or more active ingredients | | |
| ### Reproducing | |
| ```bash | |
| uv run python eval/prepare_hf_dataset.py # clean text (default) | |
| uv run python eval/prepare_hf_dataset.py --noise light # light OCR artifacts | |
| uv run python eval/prepare_hf_dataset.py --noise heavy # heavy OCR distortion | |
| uv run python eval/prepare_hf_dataset.py --limit 500 # smaller sample | |
| ``` | |
| ## About the Model | |
| ### Architecture | |
| PharmaDetect-BioPatient-108M is a token-classification (NER) model from the | |
| [OpenMed NER](https://arxiv.org/abs/2508.01630) suite (Panahi, 2025). It | |
| detects chemical entities using BIO tagging (`B-CHEM`, `I-CHEM`). | |
| | Property | Value | | |
| |----------|-------| | |
| | Base model | [`Bio_Discharge_Summary_BERT`](https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT) (BioBERT v1.0 → MIMIC-III discharge summaries, ~880M words) | | |
| | DAPT corpus | 350k passages (90M tokens): 100k PubMed, 100k arXiv, 100k MIMIC-III, 50k ClinicalTrials.gov | | |
| | DAPT method | LoRA (rank=16, α=32, dropout=0.05) on query/value matrices, 3 epochs, single A100 ~4h | | |
| | Fine-tuning dataset | BC5CDR-CHEM (BioCreative V Chemical-Disease Relation) | | |
| | Parameters | 108M (1.4% trainable via LoRA during DAPT) | | |
| | Entity types | Chemical entities only (`B-CHEM` / `I-CHEM`) | | |
| | Published F1 | 95.83% on BC5CDR-CHEM test set | | |
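To make the BIO scheme concrete, the sketch below groups word-level `B-CHEM` / `I-CHEM` tags into entity strings. It is an illustration only: `decode_bio` is a hypothetical helper, and treating a stray `I-CHEM` as the start of a new span is a design choice of this example, not documented model behavior.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[str]:
    """Group word-level BIO tags into chemical entity strings."""
    spans: list[str] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            spans.append(" ".join(current))
            current.clear()

    for token, tag in zip(tokens, tags):
        if tag == "B-CHEM" or (tag == "I-CHEM" and not current):
            # A B- tag (or a stray I- tag) opens a new entity.
            flush()
            current.append(token)
        elif tag == "I-CHEM":
            current.append(token)
        else:
            # Any other tag ("O") closes the open entity.
            flush()
    flush()
    return spans


if __name__ == "__main__":
    tokens = ["Each", "tablet", "contains", "Clavulanic", "Acid", "125mg"]
    tags = ["O", "O", "O", "B-CHEM", "I-CHEM", "O"]
    print(decode_bio(tokens, tags))
```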
| ### Domain Gap: Literature vs. Packaging | |
| | Aspect | BC5CDR (training) | Pill packaging (our use) | | |
| |--------|-------------------|--------------------------| | |
| | Text style | Scientific prose | Formulaic labels | | |
| | Length | Full abstracts (~200 words) | Short labels (~5-20 words) | | |
| | Chemical mentions | In sentence context | Standalone, prominent | | |
| | Brand names | Rare | Very common | | |
| | Salt forms | Part of scientific name | Separated on packaging | | |
| | OCR artifacts | None | Common (m→rn, o→0, etc.) | |
| ## Our Pipeline | |
| ``` | |
| Raw OCR text | |
| → ocr_cleaner.py (fix rn→m, 0→o, ligatures, whitespace) | |
| → ner_model.py (PharmaDetect with manual sub-word merging) | |
| → drug_analyzer.py (filter CHEM labels, dedupe, skip single-char/punctuation) | |
| → rxnorm_client.py (validate against RxNorm, get rxcui) | |
| → [if NER finds nothing] rxnorm fallback (approximate term search) | |
| ``` | |
| The benchmark tests the first three steps (OCR cleaner → NER → basic filtering) | |
| without RxNorm validation, to isolate the NER model's behavior on packaging text. | |
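A rough sketch of the three benchmarked stages, without the model call itself, might look like this. The regexes and function names are illustrative assumptions and are not taken from `ocr_cleaner.py`, `ner_model.py`, or `drug_analyzer.py`. The sub-word merge assumes BERT-style WordPiece tokens, where continuation pieces start with `##`.

```python
import re
import string


def clean_ocr(text: str) -> str:
    """Undo a few common OCR confusions before NER."""
    # A zero between two letters is almost always a misread 'o'.
    text = re.sub(r"(?<=[A-Za-z])0(?=[A-Za-z])", "o", text)
    # 'rng' directly after a digit is a misread 'mg' dosage suffix.
    text = re.sub(r"(?<=\d)rng\b", "mg", text)
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()


def merge_wordpieces(pieces: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Glue '##' continuation pieces onto the preceding word.

    The merged word keeps the label of its first piece.
    """
    words: list[tuple[str, str]] = []
    for token, label in pieces:
        if token.startswith("##") and words:
            prev_token, prev_label = words[-1]
            words[-1] = (prev_token + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


def filter_entities(entities: list[tuple[str, str]]) -> list[str]:
    """Keep CHEM entities, dropping duplicates, single chars and punctuation."""
    seen: set[str] = set()
    kept: list[str] = []
    for text, label in entities:
        text = text.strip()
        if not label.endswith("CHEM"):
            continue
        if len(text) < 2 or all(ch in string.punctuation for ch in text):
            continue
        if text.lower() in seen:
            continue
        seen.add(text.lower())
        kept.append(text)
    return kept


if __name__ == "__main__":
    print(clean_ocr("Omepraz0le   500rng"))
    pieces = [("am", "B-CHEM"), ("##oxy", "I-CHEM"), ("##cillin", "I-CHEM"), ("500mg", "O")]
    print(filter_entities(merge_wordpieces(pieces)))
```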
| ## Results | |
| ### By Noise Level (Bare NER, 500-case samples) | |
| | Noise | Precision | Recall | F1 | Detection | | |
| |-------|-----------|--------|------|-----------| | |
| | **none** (clean) | **46.9%** | **84.4%** | **60.3%** | 99.6% | | |
| | **light** (5-15% char errors) | 44.9% | 79.8% | 57.5% | 99.8% | | |
| | **heavy** (40% errors + splits) | 26.2% | 53.5% | 35.2% | 99.8% | | |
| Detection rate = percentage of cases where NER found at least one entity. | |
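For reference, the metrics in this table can be computed along the following lines. The sketch assumes exact, case-insensitive matching of predicted names against the ground-truth ingredient list and micro-averaging over cases. The actual matching rule in `eval/benchmark.py` may differ, and the function names are hypothetical.

```python
def score_case(predicted: list[str], gold: list[str]) -> tuple[int, int, int]:
    """Return (TP, FP, FN) for one label using case-insensitive set matching."""
    pred = {p.strip().lower() for p in predicted}
    true = {g.strip().lower() for g in gold}
    return len(pred & true), len(pred - true), len(true - pred)


def aggregate(cases: list[tuple[list[str], list[str]]]) -> dict[str, float]:
    """Micro-average precision, recall and F1, plus the detection rate."""
    tp = fp = fn = detected = 0
    for predicted, gold in cases:
        case_tp, case_fp, case_fn = score_case(predicted, gold)
        tp += case_tp
        fp += case_fp
        fn += case_fn
        detected += bool(predicted)  # at least one entity found
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "detection": detected / len(cases) if cases else 0.0,
    }


if __name__ == "__main__":
    cases = [
        (["Augmentin", "Amoxycillin", "Clavulanic Acid"], ["Amoxycillin", "Clavulanic Acid"]),
        ([], ["Azithromycin"]),
    ]
    print(aggregate(cases))
```

Note that a case counts as "detected" even when every entity found is a false positive, which is why detection stays near 100% while precision is below 50%.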
| ### Full Pipeline vs Bare NER (50-case samples) | |
| | Pipeline Step | Noise | Precision | Recall | F1 | Latency / Case | | |
| |---------------|-------|-----------|--------|----|----------------| | |
| | **Bare NER** | none | 48.6% | 81.0% | 60.7% | 64ms | | |
| | **Full Pipeline** | none | 71.6% | 81.0% | 76.0% | 961ms | | |
| | **Bare NER** | light | 47.5% | 79.8% | 59.6% | 69ms | | |
| | **Full Pipeline** | light | 74.4% | 79.8% | 77.0% | 1089ms | | |
| | **Bare NER** | heavy | 24.7% | 47.6% | 32.5% | 78ms | | |
| | **Full Pipeline** | heavy | 65.6% | 47.6% | 55.2% | 2597ms | | |
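| *The RxNorm validation step rejected 37 false positives without hurting recall, boosting precision by about 23 points on clean text (48.6% → 71.6%) and by about 41 points under heavy noise (24.7% → 65.6%). However, relying on an external HTTP API adds roughly 900 ms of latency per case on clean text, rising to about 2.5 s under heavy noise.* | |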
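Because the latency cost comes from repeated HTTP round trips, one possible mitigation is to cache lookups per normalized name. The sketch below is not the code in `rxnorm_client.py`. It assumes the public RxNav endpoint `rxcui.json?name=...` and a response of the form `{"idGroup": {"rxnormId": [...]}}`, both of which should be checked against the RxNorm API documentation. The fetch function is injectable so the logic can be exercised offline.

```python
import json
from functools import lru_cache
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST"  # assumed public endpoint


def http_get_json(url: str) -> dict:
    """Default fetcher: one blocking HTTP GET, parsed as JSON."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def make_validator(
    fetch_json: Callable[[str], dict] = http_get_json,
    cache_size: int = 4096,
) -> Callable[[str], Optional[str]]:
    """Build a name -> rxcui lookup that hits the network once per name."""

    @lru_cache(maxsize=cache_size)
    def lookup(normalized_name: str) -> Optional[str]:
        url = f"{RXNAV_BASE}/rxcui.json?{urlencode({'name': normalized_name})}"
        data = fetch_json(url)
        # Assumed response shape; a missing key means "no match".
        ids = data.get("idGroup", {}).get("rxnormId") or []
        return ids[0] if ids else None

    def validate(name: str) -> Optional[str]:
        # Normalize so "Aspirin" and " aspirin " share a cache entry.
        return lookup(name.strip().lower())

    return validate
```

Negative results are cached too, so a recurring false positive such as a dosage-form word costs one request in total rather than one per label.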
| ### By Category (clean text, 500 cases) | |
| | Category | N | Precision | Recall | F1 | TP | FP | FN | | |
| |----------|---|-----------|--------|-----|-----|-----|-----| | |
| | single_ingredient | 279 | 41% | 86% | 55% | 239 | 347 | 40 | | |
| | dual_ingredient | 155 | 49% | 85% | 62% | 261 | 270 | 47 | | |
| | multi_ingredient | 66 | 54% | 82% | 65% | 170 | 143 | 37 | | |
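As a consistency check, pooling the TP/FP/FN columns above reproduces the overall clean-text row (46.9% / 84.4% / 60.3%). The snippet uses only numbers from this table and the count-based form of F1, 2TP / (2TP + FP + FN).

```python
# (TP, FP, FN) per category, copied from the table above.
COUNTS = {
    "single_ingredient": (239, 347, 40),
    "dual_ingredient": (261, 270, 47),
    "multi_ingredient": (170, 143, 37),
}


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Sum each column across categories, then compute micro-averaged metrics.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*COUNTS.values()))
precision, recall, f1 = prf(total_tp, total_fp, total_fn)

if __name__ == "__main__":
    print(total_tp, total_fp, total_fn)
    print(f"P={precision:.1%} R={recall:.1%} F1={f1:.1%}")
```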
| ### GLiNER Pipeline Augmentation (500-case samples, clean text) | |
| To improve precision (rejecting brands, salts, and manufacturers) and recall (catching edge cases), we evaluated `urchade/gliner_medium-v2.1` as a secondary observer and adjudicator alongside the baseline PharmaDetect + RxNorm pipeline. | |
| #### Evaluated Architectures | |
| 1. **`baseline` (Current)**: PharmaDetect NER → RxNorm Validation. If no entities are found, falls back to raw text blocks via RxNorm's `approximateTerm`. | |
| 2. **`gliner_filter` (Precision)**: GLiNER runs alongside PharmaDetect. If GLiNER explicitly tags a PharmaDetect entity as a negative label (e.g., `brand or trade name`, `manufacturer`, `salt or counter-ion`), the entity is rejected before hitting RxNorm. | |
| 3. **`gliner_sequential` (Speed & Precision)**: PharmaDetect runs first. Only the short entity spans extracted by PharmaDetect are passed to GLiNER for classification. If GLiNER tags the snippet as an `active pharmaceutical ingredient`, it proceeds to RxNorm. This avoids the CPU cost of running GLiNER on the whole document. | |
| 4. **`gliner_fallback` (Recall)**: If PharmaDetect returns zero entities, the pipeline queries GLiNER for `active pharmaceutical ingredient` spans instead of running the raw text `approximateTerm` fallback. | |
| 5. **`gliner_union` (Recall & Precision)**: Both PharmaDetect and GLiNER run. All GLiNER `active pharmaceutical ingredient` spans are merged with the PharmaDetect results, deduplicated by their resolved RxNorm IDs. | |
| 6. **`gliner_adjudicated` (Complex Logic)**: An advanced version of the filter. It rejects negative GLiNER labels, but is "salt-aware"—it won't reject a standalone salt form if GLiNER detects an active ingredient immediately adjacent to it in the text. | |
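The `gliner_filter` and `gliner_union` modes can be sketched as pure functions over span dictionaries. This is an illustration under assumptions: spans are dicts with `text`, `start`, `end`, and `label` keys, "tags a PharmaDetect entity" is interpreted as character-offset overlap, and `resolve` stands in for the RxNorm lookup. None of these names come from the project code.

```python
from typing import Callable, Optional

NEGATIVE_LABELS = {"brand or trade name", "manufacturer", "salt or counter-ion"}
API_LABEL = "active pharmaceutical ingredient"


def make_span(text: str, phrase: str, label: str) -> dict:
    """Helper for examples: locate `phrase` in `text` and build a span dict."""
    start = text.index(phrase)
    return {"text": phrase, "start": start, "end": start + len(phrase), "label": label}


def overlaps(a: dict, b: dict) -> bool:
    """True when two half-open character ranges intersect."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def gliner_filter(ner_spans: list[dict], gliner_spans: list[dict]) -> list[dict]:
    """Drop NER spans that GLiNER tagged with a negative label."""
    negatives = [g for g in gliner_spans if g["label"] in NEGATIVE_LABELS]
    return [s for s in ner_spans if not any(overlaps(s, g) for g in negatives)]


def gliner_union(
    ner_spans: list[dict],
    gliner_spans: list[dict],
    resolve: Callable[[str], Optional[str]],
) -> dict[str, str]:
    """Merge both sources and deduplicate by resolved ID (first mention wins)."""
    candidates = list(ner_spans) + [g for g in gliner_spans if g["label"] == API_LABEL]
    by_id: dict[str, str] = {}
    for span in candidates:
        resolved = resolve(span["text"])
        if resolved is not None and resolved not in by_id:
            by_id[resolved] = span["text"]
    return by_id
```

In the union mode, unresolvable spans are simply dropped, which mirrors the baseline's use of RxNorm as the final gate on both sources.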
| | Experiment Mode | Pipeline Precision | Pipeline Recall | Pipeline F1 | Avg Latency (ms)* | | |
| |-----------------|--------------------|-----------------|-------------|-------------------| | |
| | **baseline** (Current) | 76.2% | 84.1% | 80.0% | 984.5 | | |
| | **gliner_filter** | 77.8% | 84.0% | 80.8% | 1221.9 | | |
| | **gliner_fallback** | 76.5% | 84.5% | 80.3% | 1208.5 | | |
| | **gliner_sequential** | **84.9%** | 58.6% | 69.3% | **990.9** | | |
| | **gliner_adjudicated** | 78.2% | 84.4% | 81.2% | 1218.8 | | |
| | **gliner_union** | 78.0% | **93.6%** | **85.1%** | 1266.3 | | |
| *\*Note: Latency is inflated by concurrent background testing hitting CPU limits, but relative differences show the overhead of GLiNER. The sequential mode adds only about 6 ms over the baseline because it runs GLiNER only on short snippets.* | |
| **Findings:** | |
| 1. **Recall Improvement (`gliner_union`)**: The union mode raised recall to **93.6%** (up 9.5pp from the 84.1% baseline) and also raised precision slightly, from 76.2% to 78.0%. GLiNER identifies active ingredients that PharmaDetect misses entirely, and its additions did not reduce overall precision. | |
| 2. **The Context Trap (`gliner_sequential`)**: The sequential mode achieved the highest precision (84.9%) and the lowest latency of the GLiNER modes (990.9 ms, within 7 ms of the baseline). However, its recall fell to **58.6%**. Because GLiNER was fed only the short text snippets extracted by PharmaDetect, it lost all surrounding context. GLiNER relies heavily on context to classify entities; without it, GLiNER failed to recognize many valid active ingredients, so they were wrongly filtered out. | |
| 3. **Precision Improvement (`gliner_filter` and `gliner_adjudicated`)**: Both modes stripped out false positives such as brands and salts. The plain filter raised precision by 1.6pp (to 77.8%) and the salt-aware adjudicator by 2.0pp (to 78.2%), with recall essentially unchanged (84.0% and 84.4% versus 84.1%). | |
| 4. **Conclusion**: **`gliner_union`** produced the best overall F1 score (**85.1%**) because running GLiNER on the full text preserves its contextual reasoning. | |
| ### Interpretation | |
| **Recall is strong (84%).** The model finds the active ingredient in most | |
| cases. Multi-ingredient labels are slightly harder (82%) due to more | |
| complex text. Under light OCR noise, recall drops only to 80% — the | |
| model is reasonably robust to minor distortion. | |
| **Precision is poor (47%).** The model tags brand names, manufacturer | |
| names, salt forms, and dosage form words as chemical entities. This is | |
| correct per BC5CDR training ("find all chemicals") but wrong for our use | |
| case ("find active pharmaceutical ingredients only"). | |
| **Heavy noise cuts recall to 54% (down 31pp from clean text).** Mid-word splits | |
| (`Bevacizu mab`) and character corruption (`Metforrnin`) break tokenization. This is | |
| the target for Phase 4 (OCR Modernization with dictionary-backed fuzzy | |
| matching). | |
| ## False Positive Analysis | |
| The false positives fall into four predictable categories: | |
| ### 1. Brand Names | |
| The model tags brand names as chemical entities: Augmentin, Avastin, | |
| Allegra, Lipitor, etc. These are the medicine's trade names, not active | |
| ingredients. | |
| ### 2. Salt Forms | |
| The model tags salt/counter-ion names separately: Sodium, Hydrochloride, | |
| Calcium, Phosphate, Maleate, Potassium. These appear in compositions | |
| like "Atorvastatin Calcium" but are not the active drug. | |
| ### 3. Manufacturer Names | |
| Pharmaceutical company names such as "Cipla" and "Lupin" are occasionally | |
| tagged, since they look chemical-like to the model. | |
| ### 4. Dosage Form Words | |
| Words like "Tablet", "Capsule", "Syrup", "Injection" sometimes get | |
| tagged, especially in compact label layouts. | |
| ## Noise Degradation Analysis | |
| | Metric | None → Light | Light → Heavy | | |
| |--------|-------------|---------------| | |
| | Recall | 84.4% → 79.8% (−4.6pp) | 79.8% → 53.5% (−26.3pp) | |
| | Precision | 46.9% → 44.9% (−2.0pp) | 44.9% → 26.2% (−18.7pp) | |
| | F1 | 60.3% → 57.5% (−2.8pp) | 57.5% → 35.2% (−22.3pp) | |
| Light noise has minimal impact — the OCR cleaner handles common `o→0` | |
| and `l→1` substitutions. Heavy noise causes severe degradation because: | |
| - **Mid-word splits** break tokenization (`Amoxy cillin` becomes two tokens) | |
| - **m→rn corruption** in drug names (`Metforrnin`) escapes the cleaner's | |
| limited regex patterns | |
| - **All-caps text** shifts token distributions away from training data | |
| ## Comparison with Published Benchmarks | |
| | Benchmark | Precision | Recall | F1 | | |
| |-----------|-----------|--------|-----| | |
| | BC5CDR-CHEM (published) | 95.1% | 96.6% | 95.8% | | |
| | Our packaging — clean | 46.9% | 84.4% | 60.3% | | |
| | Our packaging — light noise | 44.9% | 79.8% | 57.5% | | |
| | Our packaging — heavy noise | 26.2% | 53.5% | 35.2% | | |
| The precision gap (95% → 47%) reflects a **task mismatch**, not model | |
| quality. BC5CDR measures "find all chemicals" — we measure "find only | |
| active pharmaceutical ingredients." The recall gap (97% → 84%) reflects | |
| the domain shift from clean scientific text to formulaic packaging labels. | |
| ## Remediation Plan | |
| | Phase | Target | Expected Impact | | |
| |-------|--------|-----------------| | |
| | **Phase 2: Entity Linking** | Filter NER output through DrugBank to reject non-drug entities | Precision 47% → ~85%+ | | |
| | **Phase 3: Fallback** | DrugBank fuzzy search when NER finds 0 entities | Recall +5-10% on edge cases | | |
| | **Phase 4: OCR Modernization** | Dictionary-backed fuzzy correction before NER | Heavy-noise recall 54% → ~75%+ | | |
| See `docs/plans/phase{2,3,4}_*.md` for detailed designs. | |
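To illustrate the Phase 4 idea, the sketch below applies dictionary-backed correction before NER using the standard-library `difflib`. It is a toy under stated assumptions: the vocabulary is a hand-written list, not DrugBank. The 0.8 cutoff and the 5-character minimum are arbitrary. Split words are rejoined only when the concatenation is an exact vocabulary entry, a deliberately conservative rule chosen so that a drug name never swallows the dosage that follows it.

```python
import difflib


def fuzzy_correct(text: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Repair mid-word splits and near-miss spellings against a drug list."""
    canonical = {name.lower(): name for name in vocabulary}
    keys = list(canonical)
    words = text.split()
    out: list[str] = []
    i = 0
    while i < len(words):
        # 1. Rejoin a mid-word split only on an exact dictionary hit.
        if i + 1 < len(words):
            joined = (words[i] + words[i + 1]).lower()
            if joined in canonical:
                out.append(canonical[joined])
                i += 2
                continue
        # 2. Fuzzy-match longer alphabetic words; leave everything else alone.
        word = words[i]
        if word.isalpha() and len(word) >= 5:
            match = difflib.get_close_matches(word.lower(), keys, n=1, cutoff=cutoff)
            if match:
                word = canonical[match[0]]
        out.append(word)
        i += 1
    return " ".join(out)


if __name__ == "__main__":
    vocab = ["Metformin", "Bevacizumab", "Amoxycillin"]
    print(fuzzy_correct("Metforrnin 500rng", vocab))
    print(fuzzy_correct("Bevacizu mab 100mg", vocab))
```

A production version would need a much larger dictionary and an indexed matcher, since `get_close_matches` scans the whole vocabulary for every word.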
| ## Reproducing | |
| ```bash | |
| # Generate dataset (once) | |
| uv run python eval/prepare_hf_dataset.py # clean | |
| uv run python eval/prepare_hf_dataset.py --noise light # light OCR noise | |
| uv run python eval/prepare_hf_dataset.py --noise heavy # heavy OCR noise | |
| # Run benchmark | |
| uv run python eval/benchmark.py # full dataset | |
| uv run python eval/benchmark.py --limit 500 # quick run | |
| uv run python eval/benchmark.py --with-rxnorm # full pipeline (needs network) | |
| uv run python eval/benchmark.py -v # per-case details | |
| ``` | |
| Results are written to `eval/results.json`. | |
| ## References | |
| - [OpenMed NER paper](https://arxiv.org/abs/2508.01630) — Panahi, 2025 | |
| - [BC5CDR corpus](https://pmc.ncbi.nlm.nih.gov/articles/PMC4860626/) — Li et al., 2016 | |
| - [Bio_Discharge_Summary_BERT](https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT) — Alsentzer et al. | |
| - [PharmaDetect-BioPatient-108M](https://huggingface.co/OpenMed/OpenMed-NER-PharmaDetect-BioPatient-108M) — OpenMed | |
| - [MattBastar/Medicine_Details](https://huggingface.co/datasets/MattBastar/Medicine_Details) — HuggingFace dataset | |