---
license: apache-2.0
base_model: unsloth/gemma-3n-E2B-it
library_name: peft
tags:
  - lora
  - peft
  - rl
  - grpo
  - openenv
  - voice
  - indic
  - hindi
  - tamil
  - kannada
  - hinglish
  - schema-drift
  - gemma-3n
  - text-generation
  - tool-use
language:
  - en
  - hi
  - ta
  - kn
pipeline_tag: text-generation
datasets: []
inference: false
---

# DriftCall — Gemma-3n-E2B LoRA (apache-2.0)

LoRA adapter for **`unsloth/gemma-3n-E2B-it`**, GRPO-tuned on
[**DriftCall**](https://saumilyajj-driftcall.hf.space) — an OpenEnv-compliant
voice-first Indic concierge environment where vendor APIs **mutate
mid-episode** and the agent must keep its promise to the user across the
schema drift.

```
trained on:    DriftCall (OpenEnv v1.0 — 5 reward components, 20 drift patterns)
hardware:      1× NVIDIA H100 80GB HBM3 (bf16, 16-bit LoRA)
trainer:       native PyTorch GRPO (no TRL)
curriculum:    3 stages, 240 GRPO steps total (70 + 100 + 70) · group size 2
reward:        five deterministic components (no LLM judge), Brier-calibrated,
               uncertain-floor at 0.50
```

The companion env, demo, REST API, and full project site all live at one
HF Space: **<https://huggingface.co/spaces/saumilyajj/driftcall>**.

---

## Model details

| Field | Value |
|---|---|
| Base model | [`unsloth/gemma-3n-E2B-it`](https://huggingface.co/unsloth/gemma-3n-E2B-it) (Gemma-3n-E2B Instruction-tuned, Unsloth-quantised checkpoint) |
| Adapter type | PEFT / LoRA |
| `r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.0 (Unsloth fast path) |
| Precision | 16-bit LoRA on bf16 base |
| File | `adapter_model.safetensors` · 84.6 MB · plus tokenizer (33.4 MB) |
| Languages | Hindi · Tamil · Kannada · English · Hinglish |
| License | Apache-2.0 |

**This is an adapter-only release.** No merged-fp16 weights are published —
naive 4-bit → 16-bit merging produces silently broken weights for this base
(see DriftCall DESIGN.md §10.5). Always load on top of the base.

---

## Training

| Stage | Drift regime | Steps | Initial weights |
|---|---|---:|---|
| 1 | no drift | 70 | base Gemma-3n-E2B-it |
| 2 | single-pattern drift | 100 | stage-1 adapter |
| 3 | compound drift | 70 | stage-2 adapter |

- **Algorithm:** Group Relative Policy Optimization (GRPO), native PyTorch
  loop in `scripts/train_driftcall_grpo.py` (1300 LOC, no TRL dependency).
- **Group size (`G`):** 2 rollouts per goal — small for GRPO; signal is
  primarily compounded across the curriculum rather than per-step.
- **Curriculum:** language weights and drift patterns are stage-controlled
  (no drift → single pattern → compound). Evaluation uses a held-out
  50-episode set plus a 200-episode reward-hacking probe (`cells/step_18..20`).
- **Wandb runs:** `vasudeo118-lnmiit/driftcall` project — three runs
  (`mypquww4`, the s2 run, `og9xqlwy`).
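
To make the group-size remark concrete, here is a minimal sketch of group-relative advantages as they are commonly computed in GRPO. It assumes mean/standard-deviation normalisation within each group; the exact formula in `scripts/train_driftcall_grpo.py` is not shown in this card, and the function name `group_relative_advantages` is hypothetical.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward against its own group (one goal).

    Assumption: advantage = (r - group mean) / (group std + eps), the usual
    GRPO formulation. The training script may differ in detail.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the G rollouts
    return [(r - mean) / (std + eps) for r in rewards]


# With G = 2 the two advantages are always (close to) -1 and +1 ...
pair = group_relative_advantages([0.175, 0.300])
# ... or both 0 when the two rollouts tie, which yields no gradient signal.
tie = group_relative_advantages([0.25, 0.25])
print(pair, tie)
```

Under this assumption a group of two can only say which rollout was better, not by how much, which is one way to read why G=2 is described as small.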

### Reward function — five components, no LLM judge

| ID | Component | Weight | Implementation |
|---:|---|---:|---|
| R1 | `task_completion` | 0.40 | `cells.step_08_rewards:task_completion` |
| R2 | `drift_detection` | 0.20 | `cells.step_08_rewards:drift_detection` |
| R3 | `constraint_adherence` | 0.20 | `cells.step_08_rewards:constraint_adherence` |
| R4 | `format_compliance` | 0.10 | `cells.step_08_rewards:format_compliance` |
| R5 | `anti_hack_penalty` | 0.10 | `cells.step_08_rewards:anti_hack_penalty` |

Calibration pipeline:

```
quality        = combine_quality(R1..R5, weights)
brier          = brier_penalty(confidence, R1)
reward_raw     = quality * (1 - brier)
reward         = apply_uncertain_floor(reward_raw, confidence, quality)  # floor=0.50
final          = clamp(reward, -1.0, 1.0)
```
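
The pipeline above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `cells.step_08_rewards`: `combine_quality` is taken to be a plain weighted sum, `brier_penalty` the squared error between confidence and R1, and the uncertain-floor rule is left as a caller-supplied function because its exact behaviour is not specified in this card.

```python
# Weights from the reward table above.
WEIGHTS = {
    "task_completion": 0.40,
    "drift_detection": 0.20,
    "constraint_adherence": 0.20,
    "format_compliance": 0.10,
    "anti_hack_penalty": 0.10,
}


def combine_quality(components, weights=WEIGHTS):
    # Assumption: a plain weighted sum of already-signed component scores.
    return sum(weights[name] * components[name] for name in weights)


def brier_penalty(confidence, outcome):
    # Assumption: the standard Brier score, i.e. squared error.
    return (confidence - outcome) ** 2


def clamp(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


def final_reward(components, confidence, floor_fn=None):
    quality = combine_quality(components)
    brier = brier_penalty(confidence, components["task_completion"])
    reward = quality * (1.0 - brier)
    if floor_fn is not None:
        # Placeholder for apply_uncertain_floor; its rule is not given here.
        reward = floor_fn(reward, confidence, quality)
    return clamp(reward)


example = {
    "task_completion": 1.0,
    "drift_detection": 0.5,
    "constraint_adherence": 1.0,
    "format_compliance": 1.0,
    "anti_hack_penalty": 0.0,
}
score = final_reward(example, confidence=0.9)
print(round(score, 3))  # 0.8 * (1 - 0.01) = 0.792
```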

**Hard rule:** every reward bit traces to a deterministic schema- and
trace-grounded check. There is no LLM-as-a-judge anywhere in the pipeline.

---

## How to use

```python
from unsloth import FastModel
from peft import PeftModel

model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-3n-E2B-it",
    max_seq_length=4096,
    load_in_4bit=False,         # 16-bit LoRA path; matches training
    full_finetuning=False,
)
model = PeftModel.from_pretrained(model, "DGXAI/gemma-3n-e2b-driftcall-lora")
model.eval()

prompt = (
    "BRIEF: 9 baje se pehle ek veg thali ₹500 ke andar Indiranagar mein.\n\n"
    "Reply with EXACTLY one JSON object matching the DriftCallAction schema."
)
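# Pass the prompt by keyword: the loader may return a multimodal processor
# whose first positional argument is not the text.
inputs = tokenizer(text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True))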
```
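
Because the model is asked for exactly one JSON object, the decoded text usually needs a tolerant parse before it can be used. The helper below is a minimal sketch using only the standard library; the name `first_json_object` and the field names in `sample` are hypothetical, since the DriftCallAction schema is not reproduced in this card.

```python
import json


def first_json_object(text):
    """Return the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores
            # whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Hypothetical payload; the real field names come from the env's schema.
sample = 'Sure. {"tool": "restaurant.search", "args": {"max_price": 500}} Done.'
action = first_json_object(sample)
print(action)
```

A `None` result is a reasonable signal to treat the turn as a format failure rather than trying to repair the output.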

### Or — run it against the live env over OpenEnv REST

```bash
# Public bearer token for the hackathon Space.
curl -X POST https://saumilyajj-driftcall.hf.space/reset \
  -H "Authorization: Bearer driftcall-demo" \
  -H "X-Session-Id: smoke-001" \
  -H "Content-Type: application/json" \
  -d '{"seed": 42, "curriculum_stage": 2}'
```

The OpenEnv gym client lives at
[`deploy/inference/`](https://github.com/saumilyagupta/openenv-DGXAI/tree/google/gemma-3n-E4B-it/DRIFTCALL/deploy/inference)
and wraps `/reset`, `/step`, `/state`, `/close` in a gymnasium-style API.
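
For a quick check without the gym client, the same `/reset` call can be built with the Python standard library. This is a sketch that mirrors the `curl` command above; the helper name `build_reset_request` is hypothetical, and treating the response body as JSON is an assumption.

```python
import json
import urllib.request

BASE_URL = "https://saumilyajj-driftcall.hf.space"


def build_reset_request(seed, curriculum_stage, session_id, token="driftcall-demo"):
    """Build (but do not send) the POST /reset request shown in the curl example."""
    body = json.dumps({"seed": seed, "curriculum_stage": curriculum_stage}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/reset",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "X-Session-Id": session_id,
            "Content-Type": "application/json",
        },
    )


req = build_reset_request(seed=42, curriculum_stage=2, session_id="smoke-001")

# To actually send it (needs network access; JSON response is assumed):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     observation = json.load(resp)
```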

---

## Limitations

- **Small training run.** 240 GRPO steps at G=2 is a smoke test and push validation,
  not a learning run. From step 0 onward, reward fluctuates in `[0.175, 0.300]`,
  largely against the uncertain-floor at 0.50. Real lift is expected only after
  several thousand steps with G=4–8.
- **Tool-use, not tool-execution.** The agent emits JSON DriftCallAction
  payloads. Side effects (`cab.book`, `payment.charge`, …) are realised by
  the env's mock vendor surface, not by real infrastructure.
- **Indic ASR is upstream.** Voice input goes through `faster-whisper-small`;
  this model never sees raw audio. Code-switched Hinglish accuracy is bounded
  by Whisper.
- **Reward components are deterministic, not perfect.** R5 (`anti_hack_penalty`)
  catches known patterns; novel exploits would need to be added to the probe
  set in `cells/step_20_probe.py`.
- **Not safety-aligned beyond Gemma-3n's defaults.** Off-task or adversarial
  inputs are not specifically guarded for in this run.

---

## Citation / acknowledgement

DriftCall is built on top of:

- [`unsloth/gemma-3n-E2B-it`](https://huggingface.co/unsloth/gemma-3n-E2B-it) — base model
- [Unsloth](https://github.com/unslothai/unsloth) — fast LoRA path
- [`hexgrad/Kokoro-82M`](https://huggingface.co/hexgrad/Kokoro-82M) — TTS in the env's audio pipeline
- [`Systran/faster-whisper-small`](https://huggingface.co/Systran/faster-whisper-small) — ASR in the env's audio pipeline

Source: <https://github.com/saumilyagupta/openenv-DGXAI> · branch `google/gemma-3n-E4B-it`.

Hackathon: DGX Hackathon 2026 — Indic Voice + RL track.