MediBill SFT v2 (drift-aware teacher)

LoRA adapter for Qwen2.5-3B-Instruct trained on the MediBill OpenEnv environment β€” an OpenEnv environment that tests whether an LLM agent can detect silent policy drift in Indian health insurance claims.

Submission for Meta Γ— Scaler OpenEnv Hackathon 2026, Theme 3.1 β€” Professional World Modeling.

Headline result

task (n=5 held-out, seeds 16-20) Base Qwen 2.5 3B SFT v2 lift
easy_cashless 0.0000 1.0000 +1.0000
medium_multi_payer 0.0000 1.0000 +1.0000
hard_drift 0.0000 0.9996 +/- 0.0008 +0.9996
average 0.0000 0.9999 +0.9999

Zero parse failures. Codex reproducibility protocol verified (sha256 + fresh subprocess x 2).

Why this matters

Base Qwen 2.5 3B scores 0.0000 on every task β€” it cannot parse the OpenEnv tool-call protocol, so every action is rejected. After SFT v2 fine-tuning on 7,890 trajectories from the scripted_drift_aware teacher, the same model hits near-perfect 0.9996 on the hardest task (silent policy drift detection). Lift: +0.9999 average across all 3 tiers.

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    load_in_4bit=True,
)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

model = PeftModel.from_pretrained(
    base,
    "Anuj424614/medibill-sft-v2",
)

Training setup

  • Base model: Qwen/Qwen2.5-3B-Instruct
  • LoRA: r=32, alpha=64, dropout=0.05, target modules: q/k/v/o/gate/up/down_proj
  • Dataset: 7,890 SFT examples from scripted_drift_aware policy (drift-aware teacher)
  • Optimizer: AdamW, lr=2e-4, cosine decay, warmup 50 steps
  • Schedule: 3 epochs, effective batch size 16 (2 x 8 grad accum)
  • Total optimizer steps: 1,482
  • Hardware: Colab L4 (free tier β€” earlier runs on T4 also work)
  • Wallclock: ~33 minutes
  • Initial loss: 0.42 -> Final loss: 0.011

The journey (3 checkpoints)

  1. SFT v1 (scripted teacher) β€” 0.7573 on hard_drift (capped at scripted teacher ceiling 0.7611)
  2. GRPO over SFT v1 β€” saturated at 0.7575 (couldn't break the ceiling β€” reward already maxed out for compliance, no signal to learn drift detection)
  3. SFT v2 (drift-aware teacher) β€” 0.9996 <- when RL gets stuck, build a smarter teacher

Repo

Downloads last month
1
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Anuj424614/medibill-sft-v2

Base model

Qwen/Qwen2.5-3B
Adapter
(1270)
this model