DriftCall: Gemma-3n-E2B LoRA (Apache-2.0)

LoRA adapter for unsloth/gemma-3n-E2B-it, GRPO-tuned on DriftCall, an OpenEnv-compliant voice-first Indic concierge environment where vendor APIs mutate mid-episode and the agent must keep its promise to the user across the schema drift.

trained on:    DriftCall (OpenEnv v1.0; 5 reward components, 20 drift patterns)
hardware:      1× NVIDIA H100 80GB HBM3 (bf16, 16-bit LoRA)
trainer:       native PyTorch GRPO (no TRL)
curriculum:    3 stages (70 + 100 + 70 = 240 GRPO steps) · group size 2
reward:        five deterministic components (no LLM judge), Brier-calibrated,
               uncertain-floor at 0.50

The companion env, demo, REST API, and full project site all live at one HF Space: https://huggingface.co/spaces/saumilyajj/driftcall.


Model details

Field          Value
Base model     unsloth/gemma-3n-E2B-it (Gemma-3n-E2B instruction-tuned, Unsloth-quantised checkpoint)
Adapter type   PEFT / LoRA
r              16
lora_alpha     32
lora_dropout   0.0 (Unsloth fast path)
Precision      16-bit LoRA on a bf16 base
File           adapter_model.safetensors · 84.6 MB · plus tokenizer (33.4 MB)
Languages      Hindi · Tamil · Kannada · English · Hinglish
License        Apache-2.0
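
For reference, the hyperparameters above correspond roughly to the following PEFT config. This is a sketch only: target_modules is an assumption, and adapter_config.json in this repo is authoritative.

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,  # 0.0 enables the Unsloth fast path (see table above)
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed set
)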

This is an adapter-only release. No merged-fp16 weights are published: naive 4-bit → 16-bit merging produces silently broken weights for this base (see DriftCall DESIGN.md §10.5). Always load the adapter on top of the base.


Training

Stage   Drift regime           Steps   Initial weights
1       no drift               70      base Gemma-3n-E2B-it
2       single-pattern drift   100     stage-1 adapter
3       compound drift         70      stage-2 adapter
  • Algorithm: Group Relative Policy Optimization (GRPO), native PyTorch loop in scripts/train_driftcall_grpo.py (1300 LOC, no TRL dependency); a minimal advantage sketch follows this list.
  • Group size (G): 2 rollouts per goal; small for GRPO, so signal is compounded across the curriculum rather than per-step.
  • Curriculum: language weights and drift patterns are stage-controlled (no drift → single pattern → compound). Held-out 50-episode eval + 200-episode reward-hacking probe (cells/step_18..20).
  • Wandb runs: vasudeo118-lnmiit/driftcall project; three runs (mypquww4, the stage-2 run, og9xqlwy).
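
To make the update concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO. It is illustrative only; the function name and shapes are assumptions, not code from scripts/train_driftcall_grpo.py.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_goals, G) -- G rollouts sampled per goal (G=2 in this run).
    # Each rollout's reward is normalised against its own group, so the
    # policy gradient favours the better rollout within each goal.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.30, 0.18],   # goal 1: rollouts differ
                        [0.25, 0.25]])  # goal 2: identical rewards -> advantage 0
print(group_relative_advantages(rewards))

With G=2 the normalisation leaves only the sign of the pairwise difference, which is why the curriculum, not the per-step gradient, carries most of the learning signal.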

Reward function: five components, no LLM judge

ID   Component              Weight   Implementation
R1   task_completion        0.40     cells.step_08_rewards:task_completion
R2   drift_detection        0.20     cells.step_08_rewards:drift_detection
R3   constraint_adherence   0.20     cells.step_08_rewards:constraint_adherence
R4   format_compliance      0.10     cells.step_08_rewards:format_compliance
R5   anti_hack_penalty      0.10     cells.step_08_rewards:anti_hack_penalty

Calibration pipeline:

quality    = combine_quality(R1..R5, weights)
brier      = brier_penalty(confidence, R1)
reward_raw = quality * (1 - brier)
reward     = apply_uncertain_floor(reward_raw, confidence, quality)  # floor = 0.50
final      = clamp(reward, -1.0, 1.0)

Hard rule: every reward bit traces to a deterministic schema- and trace-grounded check. There is no LLM-as-a-judge anywhere in the pipeline.
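
A plain-Python sketch of the same pipeline, with heavily assumed helper semantics (in particular, brier_penalty is taken here to be the squared gap between stated confidence and R1); the real implementations live in cells.step_08_rewards:

WEIGHTS = {"R1": 0.40, "R2": 0.20, "R3": 0.20, "R4": 0.10, "R5": 0.10}

def combine_quality(components: dict[str, float]) -> float:
    # Weighted sum of the five deterministic component scores.
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

def brier_penalty(confidence: float, r1: float) -> float:
    # Assumed form: squared miscalibration between the agent's stated
    # confidence and the task_completion outcome (R1).
    return (confidence - r1) ** 2

def apply_uncertain_floor(reward_raw: float, confidence: float,
                          quality: float, floor: float = 0.50) -> float:
    # Assumed semantics: below-floor confidence caps the reward at the floor
    # (quality is part of the real signature; unused in this sketch).
    return min(reward_raw, floor) if confidence < floor else reward_raw

def final_reward(components: dict[str, float], confidence: float) -> float:
    quality = combine_quality(components)
    reward_raw = quality * (1 - brier_penalty(confidence, components["R1"]))
    reward = apply_uncertain_floor(reward_raw, confidence, quality)
    return max(-1.0, min(1.0, reward))  # clamp to [-1, 1]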


How to use

from unsloth import FastModel
from peft import PeftModel

model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-3n-E2B-it",
    max_seq_length=4096,
    load_in_4bit=False,         # 16-bit LoRA path; matches training
    full_finetuning=False,
)
model = PeftModel.from_pretrained(model, "DGXAI/gemma-3n-e2b-driftcall-lora")
model.eval()

# The BRIEF is code-switched Hinglish: "One veg thali under ₹500 in
# Indiranagar, before 9 o'clock."
prompt = (
    "BRIEF: 9 baje se pehle ek veg thali ₹500 ke andar Indiranagar mein.\n\n"
    "Reply with EXACTLY one JSON object matching the DriftCallAction schema."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))

Or: run it against the live env over OpenEnv REST

# Public bearer token for the hackathon Space.
curl -X POST https://saumilyajj-driftcall.hf.space/reset \
  -H "Authorization: Bearer driftcall-demo" \
  -H "X-Session-Id: smoke-001" \
  -H "Content-Type: application/json" \
  -d '{"seed": 42, "curriculum_stage": 2}'

The OpenEnv gym client lives at deploy/inference/ and wraps /reset, /step, /state, /close in a gymnasium-style API.
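
A minimal requests-based sketch of the same loop from Python. Only the /reset contract is taken from the curl example above; the /step payload (an "action" field carrying a DriftCallAction-shaped object, with a hypothetical cab.book call) is an assumption, and the client in deploy/inference/ is authoritative:

import requests

BASE = "https://saumilyajj-driftcall.hf.space"
HEADERS = {
    "Authorization": "Bearer driftcall-demo",  # public hackathon token
    "X-Session-Id": "smoke-002",
    "Content-Type": "application/json",
}

# Matches the curl /reset call above.
obs = requests.post(f"{BASE}/reset", headers=HEADERS,
                    json={"seed": 42, "curriculum_stage": 2}).json()

# Payload shape below is assumed, including the tool name and arguments.
action = {"action": {"tool": "cab.book", "args": {"destination": "Indiranagar"}}}
step = requests.post(f"{BASE}/step", headers=HEADERS, json=action).json()
print(step)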


Limitations

  • Small training run. 240 GRPO steps at G=2 is a smoke test plus push validation, not a learning run. From step 0 onward, reward fluctuates in [0.175, 0.300], largely against the uncertain-floor at 0.50. Real lift comes after several thousand steps with G=4-8.
  • Tool-use, not tool-execution. The agent emits JSON DriftCallAction payloads. Side effects (cab.book, payment.charge, …) are realised by the env's mock vendor surface, not by real infrastructure.
  • Indic ASR is upstream. Voice input goes through faster-whisper-small; this model never sees raw audio. Code-switched Hinglish accuracy is bounded by Whisper.
  • Reward components are deterministic, not perfect. R5 (anti_hack_penalty) catches known patterns; novel exploits would need to be added to the probe set in cells/step_20_probe.py.
  • Not safety-aligned beyond Gemma-3n's defaults. Off-task or adversarial inputs are not specifically guarded against in this run.

Citation / acknowledgement

DriftCall is built on top of:

Source: https://github.com/saumilyagupta/openenv-DGXAI · branch google/gemma-3n-E4B-it.

Hackathon: DGX Hackathon 2026, Indic Voice + RL track.
