MediBill SFT v2 (drift-aware teacher)
LoRA adapter for Qwen2.5-3B-Instruct trained on MediBill, an OpenEnv environment that tests whether an LLM agent can detect silent policy drift in Indian health insurance claims.
Submission for the Meta × Scaler OpenEnv Hackathon 2026, Theme 3.1: Professional World Modeling.
Headline result
| task (n=5 held-out, seeds 16-20) | Base Qwen 2.5 3B | SFT v2 | lift |
|---|---|---|---|
| easy_cashless | 0.0000 | 1.0000 | +1.0000 |
| medium_multi_payer | 0.0000 | 1.0000 | +1.0000 |
| hard_drift | 0.0000 | 0.9996 +/- 0.0008 | +0.9996 |
| average | 0.0000 | 0.9999 | +0.9999 |
Zero parse failures. Codex reproducibility protocol verified (sha256 + fresh subprocess x 2).
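The card does not describe the reproducibility protocol in detail. One plausible reading of "sha256 + fresh subprocess x 2" is that the evaluation is run twice in separate processes and the SHA-256 digests of the outputs are compared. The sketch below illustrates that idea only; the function names and the demo command are hypothetical stand-ins, not the scripts from the repo.

```python
import hashlib
import subprocess
import sys


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def run_fresh(cmd: list[str]) -> bytes:
    """Run a command in a new subprocess and return its stdout bytes."""
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout


def is_reproducible(cmd: list[str], runs: int = 2) -> bool:
    """True if every fresh run of `cmd` produces byte-identical stdout."""
    digests = {sha256_hex(run_fresh(cmd)) for _ in range(runs)}
    return len(digests) == 1


# Hypothetical stand-in for an evaluation command (assumption, not the real script).
demo_cmd = [sys.executable, "-c", "print('hard_drift 0.9996')"]
reproducible = is_reproducible(demo_cmd)
print("reproducible:", reproducible)
```

Comparing digests rather than raw text keeps the check cheap to log and easy to pin in a report.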
Why this matters
Base Qwen 2.5 3B scores 0.0000 on every task: it cannot parse the OpenEnv tool-call protocol, so every action is rejected. After SFT v2 fine-tuning on 7,890 trajectories from the scripted_drift_aware teacher, the same model reaches a near-perfect 0.9996 on the hardest task (silent policy drift detection). Lift: +0.9999 on average across all 3 tiers.
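The averages quoted above can be checked directly from the per-task scores in the headline table. This short sketch assumes the "average" row is the unweighted mean of the three task scores, which the card does not state explicitly.

```python
# Scores copied from the headline table (held-out seeds 16-20).
base = {"easy_cashless": 0.0, "medium_multi_payer": 0.0, "hard_drift": 0.0}
sft_v2 = {"easy_cashless": 1.0, "medium_multi_payer": 1.0, "hard_drift": 0.9996}

# Per-task lift is the SFT v2 score minus the base score.
lift = {task: sft_v2[task] - base[task] for task in base}

# Assumption: the reported average is an unweighted mean over the three tasks.
avg_base = sum(base.values()) / len(base)
avg_sft = sum(sft_v2.values()) / len(sft_v2)
avg_lift = avg_sft - avg_base

print(f"average SFT v2: {avg_sft:.4f}")    # average SFT v2: 0.9999
print(f"average lift:   {avg_lift:+.4f}")  # average lift:   +0.9999
```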
Usage
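```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "Qwen/Qwen2.5-3B-Instruct"

# Load the base model in 4-bit (requires the bitsandbytes package and a CUDA GPU).
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
tok = AutoTokenizer.from_pretrained(base_id)

# Attach the MediBill LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(
    base,
    "Anuj424614/medibill-sft-v2",
)
```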
Training setup
- Base model: Qwen/Qwen2.5-3B-Instruct
- LoRA: r=32, alpha=64, dropout=0.05, target modules: q/k/v/o/gate/up/down_proj
- Dataset: 7,890 SFT examples from scripted_drift_aware policy (drift-aware teacher)
- Optimizer: AdamW, lr=2e-4, cosine decay, warmup 50 steps
- Schedule: 3 epochs, effective batch size 16 (2 x 8 grad accum)
- Total optimizer steps: 1,482
- Hardware: Colab L4 (free tier; earlier runs on T4 also work)
- Wallclock: ~33 minutes
- Initial loss: 0.42 -> Final loss: 0.011
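The step count and learning-rate schedule follow from the numbers in this list. The sketch below assumes the partial last batch of each epoch counts as an optimizer step (rounding up reproduces the reported 1,482) and that "cosine decay, warmup 50 steps" means linear warmup from zero followed by cosine decay to zero. Neither detail is confirmed by the card, and `lr_at` is a hypothetical helper rather than the training code.

```python
import math

NUM_EXAMPLES = 7890
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_STEPS = 50

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM               # 2 x 8 = 16
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)   # ceil(493.125) = 494
total_steps = steps_per_epoch * EPOCHS                        # 494 x 3 = 1482


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total optimizer steps:", total_steps)
for step in (0, 25, 50, total_steps // 2, total_steps):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at step 50 and reaches zero at the final optimizer step.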
The journey (3 checkpoints)
- SFT v1 (scripted teacher): 0.7573 on hard_drift (capped at the scripted teacher's ceiling of 0.7611)
- GRPO over SFT v1: saturated at 0.7575 (could not break the ceiling; the reward was already maxed out for compliance, leaving no signal to learn drift detection)
- SFT v2 (drift-aware teacher): 0.9996 <- when RL gets stuck, build a smarter teacher
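The GRPO plateau is consistent with how group-relative advantages are usually computed: each sampled completion is scored against the mean of its group, so a group whose rewards are all equal yields zero advantage and no gradient. The sketch below uses the common mean/std normalization with made-up reward values. It is an illustration of the general mechanism, not the training code or rewards from this project, and implementations differ on details such as which standard deviation they use.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and (population) std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group where every sample earns the same reward: no learning signal.
saturated = group_relative_advantages([0.76, 0.76, 0.76, 0.76])

# Hypothetical group where one sample does better: that sample gets a positive advantage.
mixed = group_relative_advantages([0.76, 0.76, 0.76, 1.00])

print("saturated:", saturated)
print("mixed:    ", [round(a, 3) for a in mixed])
```

When every rollout already matches the teacher's behavior, the group looks like the first case, which is one way to read the "no signal" observation above.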
Repo
- Code, data, eval scripts: https://github.com/Algoace1403/METAHackthon2026
- Live OpenEnv space: https://huggingface.co/spaces/Anuj424614/medibill