Mira-Q2: Clinical Extraction SLM (v2)

By DILR. Enterprise-grade clinical document extraction: reads documents and outputs structured, source-grounded JSON. Deployed on-prem.

Comprehensive Evaluation (782 docs across 4 test sets)

| Eval Set | N | Type | JSON Validity [95% CI] | Identifier Leak | Field-F1 [95% CI] |
| --- | --- | --- | --- | --- | --- |
| test_gold | 200 | Same distribution (held-out) | 100.0% [100.0-100.0] | 0.0% | 1.000 [0.999-1.000] |
| synthetic_v2 | 150 | Different formatting dialect | 100.0% [100.0-100.0] | 0.0% | n/a (unlabeled) |
| extraction_relevant | 150 | Real physician docs (on-schema) | 94.7% [90.7-98.0] | 0.0% | n/a (unlabeled) |
| mtsamples | 282 | Real physician docs (39 specialties) | 85.8% [81.9-89.7] | 0.0% | n/a (unlabeled) |

95% bootstrap CIs (1000 resamples). Zero identifier leaks across all 782 documents.
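As an illustration of how intervals like these can be produced, here is a minimal sketch of a percentile bootstrap over per-document pass/fail flags. The function name `bootstrap_ci`, the seed, and the 242-of-282 count (inferred from the reported 85.8% rate, not read from the result files) are assumptions; the evaluation script itself may differ.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-document 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample n documents with replacement and record the pass rate.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(round((alpha / 2) * n_resamples))]
    hi = means[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return sum(outcomes) / n, lo, hi


# Hypothetical flags: 242 valid outputs out of 282 documents (about 85.8%).
flags = [1] * 242 + [0] * 40
point, lo, hi = bootstrap_ci(flags)
print(f"{point:.1%} [{lo:.1%}-{hi:.1%}]")
```

When every document passes, each resample also passes, so the interval collapses to a single point. That is why the 100% rows report a zero-width interval.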

Three-Way Comparison

| Model | Training Data | Validity (test_gold) | F1 (test_gold) |
| --- | --- | --- | --- |
| Qwen2.5-3B zero-shot | - | 0% (invents own schema) | 0.0 |
| Mira-Q1 (v1) | 3,438 examples | 98% (50-example eval) | - |
| Mira-Q2 (this model) | 8,400 examples | 100% (200-example eval) | 1.000 |

Training

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-3B-Instruct (via Unsloth) |
| Method | QLoRA (4-bit, r=16, alpha=32) |
| Training data | 8,400 examples (6,400 gold-by-construction + 2,000 schema variants) |
| Data sources | Real ICD-10 codes (71K), NLM drug names, curated lab reference ranges |
| Schema variants | Renamed fields, dropped fields, minimal schemas (for generalization) |
| Epochs | 2 |
| Final train loss | 0.132 |
| Final eval loss | 0.142 |
| Overfit gap | 0.010 (healthy) |

Loss Curve

Step   50: 1.0723  (epoch 0.1)
Step  200: 0.1556  (epoch 0.4)
Step  525: 0.1414  (epoch 1.0) - checkpoint
Step  750: 0.1318  (epoch 1.4) - lowest
Step 1050: 0.1320  (epoch 2.0) - final
Eval:      0.1418  (epoch 2.0)
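
The overfit gap can be checked directly from these logged values. The sketch below parses the lines as printed here (the `LOG` string and variable names are illustrative, not part of any training script) and recomputes the lowest training loss and the eval-minus-train gap.

```python
import re

# Loss lines copied from the curve above; the eval loss is the epoch 2.0 value.
LOG = """\
Step   50: 1.0723  (epoch 0.1)
Step  200: 0.1556  (epoch 0.4)
Step  525: 0.1414  (epoch 1.0)
Step  750: 0.1318  (epoch 1.4)
Step 1050: 0.1320  (epoch 2.0)
"""
EVAL_LOSS = 0.1418

# Extract (step, loss) pairs from each "Step N: loss" line.
points = [(int(s), float(v)) for s, v in re.findall(r"Step\s+(\d+):\s+([\d.]+)", LOG)]
best_step, best_loss = min(points, key=lambda p: p[1])
final_step, final_loss = points[-1]
gap = EVAL_LOSS - final_loss

print(f"lowest train loss {best_loss:.4f} at step {best_step}")
print(f"overfit gap {gap:.3f}")
```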

What's New vs Mira-Q1

  • 2.4x the training data (8,400 vs 3,438 examples)
  • Gold-by-construction data: real ICD-10 codes, NLM drugs, real lab reference ranges (not Synthea-rendered)
  • Schema-variant training: 2,000 examples with modified schemas for schema-as-input generalization
  • 8% lower loss (0.132 vs 0.143)
  • 100% validity on 200-example gold eval (vs 98% on 50 examples)
  • Comprehensive eval on 782 docs including real physician dictations
  • Zero identifier leaks verified across all test sets
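
The two headline ratios in this list follow from the figures quoted in the bullets; a short check, using only those numbers:

```python
# Figures taken from the bullets above.
q1_examples, q2_examples = 3438, 8400
q1_loss, q2_loss = 0.143, 0.132

data_ratio = q2_examples / q1_examples
loss_drop = (q1_loss - q2_loss) / q1_loss  # relative reduction

print(f"{data_ratio:.1f}x training data, {loss_drop:.0%} lower loss")
```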

Synthetic-to-Real Gap

The honest finding: Mira-Q2 reaches 100% JSON validity on training-distribution data but 85.8% on general real physician prose (MTSamples). This is expected for a model trained on synthetic data: it learned our generator's patterns well but struggles with document types it never saw, such as operative notes and physical exams. On real documents that match the schema, the gap narrows to about 5 percentage points (94.7%).

We expect this gap to close with real partner data retraining (v1), broader document type coverage in training, and OCR pipeline integration.
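
For reference, the gaps in percentage points can be recomputed from the reported validity rates. The dictionary below restates the evaluation table; the variable names are illustrative.

```python
# JSON validity (%) per eval set, as reported in the evaluation table.
validity = {
    "test_gold": 100.0,
    "synthetic_v2": 100.0,
    "extraction_relevant": 94.7,
    "mtsamples": 85.8,
}

baseline = validity["test_gold"]
gaps = {name: baseline - rate for name, rate in validity.items()}
for name, gap_pp in gaps.items():
    print(f"{name}: {gap_pp:.1f} pp below test_gold")
```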

Usage

# IMPORTANT: Load with Unsloth, not standard PeftModel (quantization mismatch)
from unsloth import FastLanguageModel
import torch  # needed for dtype=torch.float16 below

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="dilr/Mira-Q2",
    max_seq_length=4096,
    dtype=torch.float16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a clinical information extraction system..."},
    {"role": "user", "content": "Patient: 45/M\nHb 12.5 g/dL (13-17) LOW\nWBC 8.2 x10^9/L (4-11) Normal"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False,
                          eos_token_id=[tokenizer.eos_token_id,
                                        tokenizer.convert_tokens_to_ids("<|im_end|>")])
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Note: Do NOT load with PeftModel.from_pretrained(base, "dilr/Mira-Q2"). The adapter was trained with Unsloth's quantization, which differs from standard bitsandbytes. Use FastLanguageModel as shown above.
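
Since the output is meant to be JSON, a caller will usually parse the decoded text before using it. The helper below is a minimal sketch: `parse_extraction` is a hypothetical name, the sample string is made up for illustration, and how the scorecard itself defines JSON validity is not stated here.

```python
import json


def parse_extraction(decoded):
    """Return the parsed JSON object, or None if the text is not a JSON object."""
    try:
        obj = json.loads(decoded.strip())
    except json.JSONDecodeError:
        return None
    # A bare list or string is valid JSON but not an extraction record.
    return obj if isinstance(obj, dict) else None


# Hypothetical model output, shortened for illustration.
sample = '{"document_type": "lab_report", "patient": {"age": 45, "sex": "M"}}'
result = parse_extraction(sample)
print(result["document_type"] if result else "invalid JSON")
```

Returning None instead of raising lets a batch job count invalid outputs and route them to human review.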

Schema

Extracts 10 required fields:

  • document_type: lab_report | medication_list | discharge_summary | pathology_report | intake_form | progress_note | other
  • patient: {age, sex} (de-identified; never includes names or MRN)
  • encounter: {date (ISO), department}
  • vitals[], labs[], medications[], diagnoses[], procedures[], allergies[]
  • extraction_notes
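
A structural check against this field list might look like the sketch below. The function `check_extraction` and the example record are hypothetical, and the rule that `patient` may carry only `age` and `sex` is an assumption drawn from the de-identification note above; value types beyond "these six fields are lists" are not specified here and are not checked.

```python
REQUIRED_FIELDS = [
    "document_type", "patient", "encounter", "vitals", "labs",
    "medications", "diagnoses", "procedures", "allergies", "extraction_notes",
]
DOCUMENT_TYPES = (
    "lab_report", "medication_list", "discharge_summary", "pathology_report",
    "intake_form", "progress_note", "other",
)
LIST_FIELDS = ["vitals", "labs", "medications", "diagnoses", "procedures", "allergies"]


def check_extraction(obj):
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in obj]
    if "document_type" in obj and obj["document_type"] not in DOCUMENT_TYPES:
        problems.append(f"unknown document_type: {obj['document_type']!r}")
    for name in LIST_FIELDS:
        if name in obj and not isinstance(obj[name], list):
            problems.append(f"{name} should be a list")
    patient = obj.get("patient")
    if isinstance(patient, dict):
        # Assumption: any key beyond age/sex is treated as a possible identifier.
        extra = sorted(set(patient) - {"age", "sex"})
        if extra:
            problems.append(f"unexpected patient keys: {extra}")
    return problems


# Hypothetical record with made-up values.
example = {
    "document_type": "lab_report",
    "patient": {"age": 45, "sex": "M"},
    "encounter": {"date": "2024-01-15", "department": "Hematology"},
    "vitals": [], "labs": [], "medications": [],
    "diagnoses": [], "procedures": [], "allergies": [],
    "extraction_notes": "",
}
print(check_extraction(example))  # prints []
```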

Architecture: Schema-as-Input

Mira-Q2 is trained with schema-variant examples, so the model learns to follow whatever extraction schema is injected in the system prompt, not just the clinical one. This enables customer onboarding with zero code changes: only a schema file and seed examples are needed.
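
One way to picture schema injection is a helper that serializes the schema into the system message. This is a sketch under assumptions: `build_messages`, the instruction wording, the "Schema:" layout, and the invoice schema are all hypothetical, and the prompt format actually used in training is not given here.

```python
import json


def build_messages(instruction, schema, document_text):
    """Build chat messages with the extraction schema embedded in the system prompt."""
    system = f"{instruction}\n\nSchema:\n{json.dumps(schema, indent=2)}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": document_text},
    ]


# Hypothetical non-clinical schema, to show that only the schema changes.
invoice_schema = {"vendor": "string", "total": "number", "line_items": "array"}
messages = build_messages(
    "Extract the fields defined in the schema and return JSON only.",
    invoice_schema,
    "ACME Corp\nTotal: 120.00",
)
print(messages[0]["content"])
```

The resulting `messages` list has the same shape as the one passed to `tokenizer.apply_chat_template` in the Usage section, so swapping schemas would not require touching the generation code.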

Eval Data

The eval/ directory contains:

  • comprehensive_scorecard.json: full results with bootstrap CIs
  • test_gold_200_result.json: test_gold scorecard
  • mtsamples_282_result.json: real MTSamples probe
  • extraction_relevant_150_result.json: on-schema real docs
  • synthetic_v2_150_result.json: format robustness probe

Limitations

  • English only
  • Trained on synthetic data; retraining on real clinical documents should improve accuracy (v1 with design partner)
  • 85.8% JSON validity on general real docs (39 specialties); strongest on the lab, discharge, and medication document types it was trained on
  • Every output is a draft for human review, not for autonomous clinical decisions
  • Must load with Unsloth (not vanilla PeftModel)

License

Apache-2.0. Note that the base model, Qwen/Qwen2.5-3B-Instruct, is published under the Qwen Research License, not Apache-2.0.
