DriftCall: Gemma-3n-E2B LoRA (Apache-2.0)

LoRA adapter for unsloth/gemma-3n-E2B-it, GRPO-tuned on DriftCall, an OpenEnv-compliant voice-first Indic concierge environment where vendor APIs mutate mid-episode and the agent must keep its promise to the user across the schema drift.

trained on:    DriftCall (OpenEnv v1.0; 5 reward components, 20 drift patterns)
hardware:      1× NVIDIA H100 80GB HBM3 (bf16, 16-bit LoRA)
trainer:       native PyTorch GRPO (no TRL)
curriculum:    3 stages (70 + 100 + 70 = 240 GRPO steps) · group size 2
reward:        five deterministic components (no LLM judge), Brier-calibrated,
               uncertain-floor at 0.50

The companion env, demo, REST API, and full project site all live at one HF Space: https://huggingface.co/spaces/saumilyajj/driftcall.


Model details

Field          Value
Base model     unsloth/gemma-3n-E2B-it (Gemma-3n-E2B instruction-tuned, Unsloth-quantised checkpoint)
Adapter type   PEFT / LoRA
r              16
lora_alpha     32
lora_dropout   0.0 (Unsloth fast path)
Precision      16-bit LoRA on a bf16 base
File           adapter_model.safetensors · 84.6 MB · plus tokenizer (33.4 MB)
Languages      Hindi · Tamil · Kannada · English · Hinglish
License        Apache-2.0
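
For reference, the hyperparameters above correspond roughly to the following PEFT config. This is a sketch only: target_modules is an assumption, and adapter_config.json in this repo is authoritative.

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.0,  # 0.0 enables the Unsloth fast path (see table above)
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed set
)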

This is an adapter-only release. No merged-fp16 weights are published: naive 4-bit → 16-bit merging produces silently broken weights for this base (see DriftCall DESIGN.md §10.5). Always load the adapter on top of the base.


Training

Stage   Drift regime           Steps   Initial weights
1       no drift               70      base Gemma-3n-E2B-it
2       single-pattern drift   100     stage-1 adapter
3       compound drift         70      stage-2 adapter
  • Algorithm: Group Relative Policy Optimization (GRPO), native PyTorch loop in scripts/train_driftcall_grpo.py (1300 LOC, no TRL dependency); a minimal advantage sketch follows this list.
  • Group size (G): 2 rollouts per goal; small for GRPO, so signal is compounded across the curriculum rather than per-step.
  • Curriculum: language weights and drift patterns are stage-controlled (no drift → single pattern → compound). Held-out 50-episode eval + 200-episode reward-hacking probe (cells/step_18..20).
  • Wandb runs: vasudeo118-lnmiit/driftcall project; three runs (mypquww4, the stage-2 run, og9xqlwy).
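
To make the update concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO. It is illustrative only; the function name and shapes are assumptions, not code from scripts/train_driftcall_grpo.py.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_goals, G) -- G rollouts sampled per goal (G=2 in this run).
    # Each rollout's reward is normalised against its own group, so the
    # policy gradient favours the better rollout within each goal.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.30, 0.18],   # goal 1: rollouts differ
                        [0.25, 0.25]])  # goal 2: identical rewards -> advantage 0
print(group_relative_advantages(rewards))

With G=2 the normalisation leaves only the sign of the pairwise difference, which is why the curriculum, not the per-step gradient, carries most of the learning signal.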

Reward function: five components, no LLM judge

ID   Component              Weight   Implementation
R1   task_completion        0.40     cells.step_08_rewards:task_completion
R2   drift_detection        0.20     cells.step_08_rewards:drift_detection
R3   constraint_adherence   0.20     cells.step_08_rewards:constraint_adherence
R4   format_compliance      0.10     cells.step_08_rewards:format_compliance
R5   anti_hack_penalty      0.10     cells.step_08_rewards:anti_hack_penalty

Calibration pipeline:

quality    = combine_quality(R1..R5, weights)
brier      = brier_penalty(confidence, R1)
reward_raw = quality * (1 - brier)
reward     = apply_uncertain_floor(reward_raw, confidence, quality)  # floor = 0.50
final      = clamp(reward, -1.0, 1.0)

Hard rule: every reward bit traces to a deterministic schema- and trace-grounded check. There is no LLM-as-a-judge anywhere in the pipeline.
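
A plain-Python sketch of the same pipeline, with heavily assumed helper semantics (in particular, brier_penalty is taken here to be the squared gap between stated confidence and R1); the real implementations live in cells.step_08_rewards:

WEIGHTS = {"R1": 0.40, "R2": 0.20, "R3": 0.20, "R4": 0.10, "R5": 0.10}

def combine_quality(components: dict[str, float]) -> float:
    # Weighted sum of the five deterministic component scores.
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

def brier_penalty(confidence: float, r1: float) -> float:
    # Assumed form: squared miscalibration between the agent's stated
    # confidence and the task_completion outcome (R1).
    return (confidence - r1) ** 2

def apply_uncertain_floor(reward_raw: float, confidence: float,
                          quality: float, floor: float = 0.50) -> float:
    # Assumed semantics: below-floor confidence caps the reward at the floor
    # (quality is part of the real signature; unused in this sketch).
    return min(reward_raw, floor) if confidence < floor else reward_raw

def final_reward(components: dict[str, float], confidence: float) -> float:
    quality = combine_quality(components)
    reward_raw = quality * (1 - brier_penalty(confidence, components["R1"]))
    reward = apply_uncertain_floor(reward_raw, confidence, quality)
    return max(-1.0, min(1.0, reward))  # clamp to [-1, 1]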


How to use

from unsloth import FastModel
from peft import PeftModel

model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-3n-E2B-it",
    max_seq_length=4096,
    load_in_4bit=False,         # 16-bit LoRA path; matches training
    full_finetuning=False,
)
model = PeftModel.from_pretrained(model, "DGXAI/gemma-3n-e2b-driftcall-lora")
model.eval()

# The BRIEF is code-switched Hinglish: "One veg thali under ₹500 in
# Indiranagar, before 9 o'clock."
prompt = (
    "BRIEF: 9 baje se pehle ek veg thali ₹500 ke andar Indiranagar mein.\n\n"
    "Reply with EXACTLY one JSON object matching the DriftCallAction schema."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))

Or: run it against the live env over OpenEnv REST

# Public bearer token for the hackathon Space.
curl -X POST https://saumilyajj-driftcall.hf.space/reset \
  -H "Authorization: Bearer driftcall-demo" \
  -H "X-Session-Id: smoke-001" \
  -H "Content-Type: application/json" \
  -d '{"seed": 42, "curriculum_stage": 2}'

The OpenEnv gym client lives at deploy/inference/ and wraps /reset, /step, /state, /close in a gymnasium-style API.
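
A minimal requests-based sketch of the same loop from Python. Only the /reset contract is taken from the curl example above; the /step payload (an "action" field carrying a DriftCallAction-shaped object, with a hypothetical cab.book call) is an assumption, and the client in deploy/inference/ is authoritative:

import requests

BASE = "https://saumilyajj-driftcall.hf.space"
HEADERS = {
    "Authorization": "Bearer driftcall-demo",  # public hackathon token
    "X-Session-Id": "smoke-002",
    "Content-Type": "application/json",
}

# Matches the curl /reset call above.
obs = requests.post(f"{BASE}/reset", headers=HEADERS,
                    json={"seed": 42, "curriculum_stage": 2}).json()

# Payload shape below is assumed, including the tool name and arguments.
action = {"action": {"tool": "cab.book", "args": {"destination": "Indiranagar"}}}
step = requests.post(f"{BASE}/step", headers=HEADERS, json=action).json()
print(step)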


Limitations

  • Small training run. 240 GRPO steps at G=2 is a smoke test plus push validation, not a learning run. From step 0 onward, reward fluctuates in [0.175, 0.300], largely against the uncertain-floor at 0.50. Real lift comes after several thousand steps with G=4-8.
  • Tool-use, not tool-execution. The agent emits JSON DriftCallAction payloads. Side effects (cab.book, payment.charge, …) are realised by the env's mock vendor surface, not by real infrastructure.
  • Indic ASR is upstream. Voice input goes through faster-whisper-small; this model never sees raw audio. Code-switched Hinglish accuracy is bounded by Whisper.
  • Reward components are deterministic, not perfect. R5 (anti_hack_penalty) catches known patterns; novel exploits would need to be added to the probe set in cells/step_20_probe.py.
  • Not safety-aligned beyond Gemma-3n's defaults. Off-task or adversarial inputs are not specifically guarded against in this run.

Citation / acknowledgement

DriftCall is built on top of:

Source: https://github.com/saumilyagupta/openenv-DGXAI · branch google/gemma-3n-E4B-it.

Hackathon: DGX Hackathon 2026, Indic Voice + RL track.
