# DriftCall: Gemma-3n-E2B LoRA (apache-2.0)
LoRA adapter for `unsloth/gemma-3n-E2B-it`, GRPO-tuned on DriftCall, an OpenEnv-compliant voice-first Indic concierge environment where vendor APIs mutate mid-episode and the agent must keep its promise to the user across the schema drift.
- trained on: DriftCall (OpenEnv v1.0; 5 reward components, 20 drift patterns)
- hardware: 1× NVIDIA H100 80GB HBM3 (bf16, 16-bit LoRA)
- trainer: native PyTorch GRPO (no TRL)
- curriculum: 3 stages, 240 GRPO steps total · group size 2
- reward: five deterministic components (no LLM judge), Brier-calibrated, uncertain-floor at 0.50
The companion env, demo, REST API, and full project site all live at one HF Space: https://huggingface.co/spaces/saumilyajj/driftcall.
## Model details
| Field | Value |
|---|---|
| Base model | `unsloth/gemma-3n-E2B-it` (Gemma-3n-E2B Instruction-tuned, Unsloth-quantised checkpoint) |
| Adapter type | PEFT / LoRA |
| `r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.0 (Unsloth fast path) |
| Precision | 16-bit LoRA on bf16 base |
| File | `adapter_model.safetensors` · 84.6 MB · plus tokenizer (33.4 MB) |
| Languages | Hindi · Tamil · Kannada · English · Hinglish |
| License | Apache-2.0 |
This is an adapter-only release. No merged-fp16 weights are published: naive 4-bit → 16-bit merging produces silently broken weights for this base (see DriftCall DESIGN.md §10.5). Always load the adapter on top of the base.
## Training
| Stage | Drift regime | Steps | Initial weights |
|---|---|---|---|
| 1 | no drift | 70 | base Gemma-3n-E2B-it |
| 2 | single-pattern drift | 100 | stage-1 adapter |
| 3 | compound drift | 70 | stage-2 adapter |
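The "Initial weights" column means each stage resumes from the previous stage's adapter rather than re-initializing LoRA. In PEFT terms that looks roughly like the following sketch (the checkpoint path is hypothetical, and the actual training script may wire this differently):

```python
from peft import PeftModel

# Stage 2 init: attach the stage-1 adapter to the already-loaded base model
# and keep it trainable, instead of creating fresh LoRA weights.
model = PeftModel.from_pretrained(
    base_model,                      # base Gemma-3n-E2B-it, loaded as in "How to use"
    "checkpoints/stage1-adapter",    # hypothetical local path to the stage-1 adapter
    is_trainable=True,               # loaded adapters are frozen by default
)
```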
- Algorithm: Group Relative Policy Optimization (GRPO), a native PyTorch loop in `scripts/train_driftcall_grpo.py` (1300 LOC, no TRL dependency).
- Group size (`G`): 2 rollouts per goal. This is small for GRPO, so signal is primarily compounded across the curriculum rather than per step (see the sketch after this list).
- Curriculum: language weights and drift patterns are stage-controlled (no drift → single pattern → compound). Held-out 50-episode eval + 200-episode reward-hacking probe (`cells/step_18..20`).
- Wandb runs: `vasudeo118-lnmiit/driftcallproject`, three runs (`mypquww4`, the stage-2 run, `og9xqlwy`).
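For context, the group-relative advantage GRPO optimizes can be sketched in a few lines (illustrative only; the function name and shapes are hypothetical, not the training script's API):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: each rollout is scored against the mean and
    std of its own group instead of a learned value baseline.

    rewards: (num_goals, G) tensor; one row per goal, G rollouts per goal.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# With G=2 the two rollouts in a pair get equal and opposite advantages:
# one is pushed up and the other down, which is why per-step signal is weak
# and learning relies on compounding across the curriculum.
print(grpo_advantages(torch.tensor([[0.300, 0.175]])))  # ≈ [[0.71, -0.71]]
```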
## Reward function: five components, no LLM judge
| ID | Component | Weight | Implementation |
|---|---|---|---|
| R1 | `task_completion` | 0.40 | `cells.step_08_rewards:task_completion` |
| R2 | `drift_detection` | 0.20 | `cells.step_08_rewards:drift_detection` |
| R3 | `constraint_adherence` | 0.20 | `cells.step_08_rewards:constraint_adherence` |
| R4 | `format_compliance` | 0.10 | `cells.step_08_rewards:format_compliance` |
| R5 | `anti_hack_penalty` | 0.10 | `cells.step_08_rewards:anti_hack_penalty` |
Calibration pipeline:

```
quality    = combine_quality(R1..R5, weights)
brier      = brier_penalty(confidence, R1)
reward_raw = quality * (1 - brier)
reward     = apply_uncertain_floor(reward_raw, confidence, quality)  # floor = 0.50
final      = clamp(reward, -1.0, 1.0)
```
Hard rule: every reward bit traces to a deterministic schema- and trace-grounded check. There is no LLM-as-a-judge anywhere in the pipeline.
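A runnable sketch of that pipeline, assuming simple forms for the helpers (the real implementations live in `cells.step_08_rewards`; the bodies below are illustrative, not the repo's code):

```python
def combine_quality(components, weights):
    # Weighted sum of the five deterministic components R1..R5.
    return sum(w * r for w, r in zip(weights, components))

def brier_penalty(confidence, r1):
    # Assumed form: squared gap between stated confidence and task completion.
    return (confidence - r1) ** 2

def apply_uncertain_floor(reward_raw, confidence, quality, floor=0.50):
    # Assumed form: sub-floor confidence caps the reward at floor * quality.
    return min(reward_raw, floor * quality) if confidence < floor else reward_raw

def final_reward(components, weights, confidence):
    quality = combine_quality(components, weights)
    brier = brier_penalty(confidence, components[0])  # R1 = task_completion
    reward = apply_uncertain_floor(quality * (1 - brier), confidence, quality)
    return max(-1.0, min(1.0, reward))  # clamp to [-1, 1]

# Weights from the table above; a decent rollout reported at confidence 0.8.
print(final_reward([0.6, 1.0, 1.0, 1.0, 0.0], [0.40, 0.20, 0.20, 0.10, 0.10], 0.8))
```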
## How to use
```python
from unsloth import FastModel
from peft import PeftModel

# Load the 16-bit base (same precision path as training), then attach the adapter.
model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-3n-E2B-it",
    max_seq_length=4096,
    load_in_4bit=False,   # 16-bit LoRA path; matches training
    full_finetuning=False,
)
model = PeftModel.from_pretrained(model, "DGXAI/gemma-3n-e2b-driftcall-lora")
model.eval()

prompt = (
    "BRIEF: 9 baje se pehle ek veg thali ₹500 ke andar Indiranagar mein.\n\n"
    "Reply with EXACTLY one JSON object matching the DriftCallAction schema."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
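The contract is exactly one JSON object, so it is worth validating the decoded text before acting on it. Continuing the snippet above (the parse step is illustrative; the env runs its own `format_compliance` check server-side):

```python
import json

text = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
try:
    action = json.loads(text.strip())  # must be a single JSON object, no surrounding prose
    print("parsed DriftCallAction:", action)
except json.JSONDecodeError:
    print("format_compliance failure: output was not a single JSON object")
```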
### Or: run it against the live env over OpenEnv REST
```bash
# Public bearer token for the hackathon Space.
curl -X POST https://saumilyajj-driftcall.hf.space/reset \
  -H "Authorization: Bearer driftcall-demo" \
  -H "X-Session-Id: smoke-001" \
  -H "Content-Type: application/json" \
  -d '{"seed": 42, "curriculum_stage": 2}'
```
The OpenEnv gym client lives at `deploy/inference/` and wraps `/reset`, `/step`, `/state`, `/close` in a gymnasium-style API.
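If you would rather not pull the client, a hand-rolled loop over the same endpoints is a few lines of `requests`. This is a sketch: the headers and `/reset` payload mirror the curl example, but the `/step` action body and the `done` response field are assumptions about the env's schema:

```python
import requests

BASE = "https://saumilyajj-driftcall.hf.space"
HEADERS = {
    "Authorization": "Bearer driftcall-demo",  # public hackathon token (see above)
    "X-Session-Id": "smoke-002",
    "Content-Type": "application/json",
}

# /reset starts an episode; seed and curriculum_stage mirror the curl example.
obs = requests.post(f"{BASE}/reset", headers=HEADERS,
                    json={"seed": 42, "curriculum_stage": 2}).json()

for _ in range(8):  # cap episode length for a smoke test
    # In a real loop the action is the model's DriftCallAction JSON; a
    # placeholder stands in here, since the schema is defined by the env.
    step = requests.post(f"{BASE}/step", headers=HEADERS,
                         json={"action": {"type": "noop"}}).json()  # hypothetical action
    if step.get("done"):
        break

requests.post(f"{BASE}/close", headers=HEADERS)  # end the session
```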
## Limitations
- Small training run. 240 GRPO steps at `G=2` is a smoke + push validation, not a learning run. From step 0 onward, reward fluctuates in [0.175, 0.300], largely against the uncertain-floor at 0.50. Real lift comes after several thousand steps with `G=4-8`.
- Tool-use, not tool-execution. The agent emits JSON `DriftCallAction` payloads. Side effects (`cab.book`, `payment.charge`, …) are realised by the env's mock vendor surface, not by real infrastructure.
- Indic ASR is upstream. Voice input goes through `faster-whisper-small`; this model never sees raw audio. Code-switched Hinglish accuracy is bounded by Whisper.
- Reward components are deterministic, not perfect. R5 (`anti_hack_penalty`) catches known patterns; novel exploits would need to be added to the probe set in `cells/step_20_probe.py`.
- Not safety-aligned beyond Gemma-3n's defaults. Off-task or adversarial inputs are not specifically guarded against in this run.
## Citation / acknowledgement
DriftCall is built on top of:
- `unsloth/gemma-3n-E2B-it`: base model
- Unsloth: fast LoRA path
- `hexgrad/Kokoro-82M`: TTS in the env's audio pipeline
- `Systran/faster-whisper-small`: ASR in the env's audio pipeline
Source: https://github.com/saumilyagupta/openenv-DGXAI · branch `google/gemma-3n-E4B-it`.
Hackathon: DGX Hackathon 2026, Indic Voice + RL track.