MediBill SFT v2 (drift-aware teacher)

LoRA adapter for Qwen2.5-3B-Instruct trained on the MediBill OpenEnv environment — an OpenEnv environment that tests whether an LLM agent can detect silent policy drift in Indian health insurance claims.

Submission for Meta × Scaler OpenEnv Hackathon 2026, Theme 3.1 — Professional World Modeling.

Headline result

task (n=5 held-out, seeds 16-20)	Base Qwen 2.5 3B	SFT v2	lift
easy_cashless	0.0000	1.0000	+1.0000
medium_multi_payer	0.0000	1.0000	+1.0000
hard_drift	0.0000	0.9996 +/- 0.0008	+0.9996
average	0.0000	0.9999	+0.9999

Zero parse failures. Codex reproducibility protocol verified (sha256 + fresh subprocess x 2).

Why this matters

Base Qwen 2.5 3B scores 0.0000 on every task — it cannot parse the OpenEnv tool-call protocol, so every action is rejected. After SFT v2 fine-tuning on 7,890 trajectories from the scripted_drift_aware teacher, the same model hits near-perfect 0.9996 on the hardest task (silent policy drift detection). Lift: +0.9999 average across all 3 tiers.

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    load_in_4bit=True,
)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

model = PeftModel.from_pretrained(
    base,
    "Anuj424614/medibill-sft-v2",
)

Training setup

Base model: Qwen/Qwen2.5-3B-Instruct
LoRA: r=32, alpha=64, dropout=0.05, target modules: q/k/v/o/gate/up/down_proj
Dataset: 7,890 SFT examples from scripted_drift_aware policy (drift-aware teacher)
Optimizer: AdamW, lr=2e-4, cosine decay, warmup 50 steps
Schedule: 3 epochs, effective batch size 16 (2 x 8 grad accum)
Total optimizer steps: 1,482
Hardware: Colab L4 (free tier — earlier runs on T4 also work)
Wallclock: ~33 minutes
Initial loss: 0.42 -> Final loss: 0.011

The journey (3 checkpoints)

SFT v1 (scripted teacher) — 0.7573 on hard_drift (capped at scripted teacher ceiling 0.7611)
GRPO over SFT v1 — saturated at 0.7575 (couldn't break the ceiling — reward already maxed out for compliance, no signal to learn drift detection)
SFT v2 (drift-aware teacher) — 0.9996 <- when RL gets stuck, build a smarter teacher

Repo

Code, data, eval scripts: https://github.com/Algoace1403/METAHackthon2026
Live OpenEnv space: https://huggingface.co/spaces/Anuj424614/medibill

Downloads last month: 1

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Anuj424614/medibill-sft-v2

Base model

Qwen/Qwen2.5-3B

Finetuned

Qwen/Qwen2.5-3B-Instruct

Adapter

(1270)

this model