	---
	license: apache-2.0
	base_model: unsloth/gemma-3n-E2B-it
	library_name: peft
	tags:
	- lora
	- peft
	- rl
	- grpo
	- openenv
	- voice
	- indic
	- hindi
	- tamil
	- kannada
	- hinglish
	- schema-drift
	- gemma-3n
	- text-generation
	- tool-use
	language:
	- en
	- hi
	- ta
	- kn
	pipeline_tag: text-generation
	datasets: []
	inference: false
	---

	# DriftCall — Gemma-3n-E2B LoRA (apache-2.0)

	LoRA adapter for `unsloth/gemma-3n-E2B-it`, GRPO-tuned on
	[DriftCall](https://saumilyajj-driftcall.hf.space) — an OpenEnv-compliant
	voice-first Indic concierge environment where vendor APIs **mutate
	mid-episode** and the agent must keep its promise to the user across the
	schema drift.

	```
	trained on: DriftCall (OpenEnv v1.0 — 5 reward components, 20 drift patterns)
	hardware: 1× NVIDIA H100 80GB HBM3 (bf16, 16-bit LoRA)
	trainer: native PyTorch GRPO (no TRL)
	curriculum: 3 stages · 240 GRPO steps total (70 + 100 + 70) · group size 2
	reward: five deterministic components (no LLM judge), Brier-calibrated,
	uncertain-floor at 0.50
	```

	The companion env, demo, REST API, and full project site all live at one
	HF Space: <https://huggingface.co/spaces/saumilyajj/driftcall>.

	---

	## Model details

	| Field | Value |
	|---|---|
	| Base model | [`unsloth/gemma-3n-E2B-it`](https://huggingface.co/unsloth/gemma-3n-E2B-it) (Gemma-3n-E2B Instruction-tuned, Unsloth-quantised checkpoint) |
	| Adapter type | PEFT / LoRA |
	| `r` | 16 |
	| `lora_alpha` | 32 |
	| `lora_dropout` | 0.0 (Unsloth fast path) |
	| Precision | 16-bit LoRA on bf16 base |
	| File | `adapter_model.safetensors` · 84.6 MB · plus tokenizer (33.4 MB) |
	| Languages | Hindi · Tamil · Kannada · English · Hinglish |
	| License | Apache-2.0 |

	This is an adapter-only release. No merged-fp16 weights are published —
	naive 4-bit → 16-bit merging produces silently broken weights for this base
	(see DriftCall DESIGN.md §10.5). Always load on top of the base.

	---

	## Training

	| Stage | Drift regime | Steps | Initial weights |
	|---|---|---:|---|
	| 1 | no drift | 70 | base Gemma-3n-E2B-it |
	| 2 | single-pattern drift | 100 | stage-1 adapter |
	| 3 | compound drift | 70 | stage-2 adapter |

	- Algorithm: Group Relative Policy Optimization (GRPO), implemented as a
	native PyTorch loop in `scripts/train_driftcall_grpo.py` (1300 LOC, no TRL
	dependency).
	- Group size (`G`): 2 rollouts per goal. This is small for GRPO, so each
	step carries a weak learning signal; improvement is expected to accumulate
	across the curriculum rather than within a single step.
	- Curriculum: language weights and drift patterns are stage-controlled
	(no drift → single pattern → compound). Evaluation uses a held-out
	50-episode eval set and a 200-episode reward-hacking probe
	(`cells/step_18..20`).
	- Wandb runs: three runs in the `vasudeo118-lnmiit/driftcall` project
	(`mypquww4`, the s2 run, `og9xqlwy`).
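
To see why `G = 2` is small, it helps to look at the group-relative advantage GRPO assigns to each rollout. The sketch below is only an illustration and is not taken from `scripts/train_driftcall_grpo.py`: the function name `group_advantages`, the use of the population standard deviation, and the `eps` value are all assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one rollout group: (r - mean) / (std + eps).

    Assumption: population standard deviation; the real trainer may differ.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# With G = 2, only the ordering of the two rollouts survives normalisation.
pair = group_advantages([0.175, 0.300])  # roughly [-1.0, +1.0]
tied = group_advantages([0.25, 0.25])    # [0.0, 0.0]: no gradient signal
```

Under this normalisation any two distinct rewards map to about -1 and +1, so the size of the reward gap is discarded, and a tie contributes nothing. Larger groups retain more of that information per step.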

	### Reward function — five components, no LLM judge

	| ID | Component | Weight | Implementation |
	|---:|---|---:|---|
	| R1 | `task_completion` | 0.40 | `cells.step_08_rewards:task_completion` |
	| R2 | `drift_detection` | 0.20 | `cells.step_08_rewards:drift_detection` |
	| R3 | `constraint_adherence` | 0.20 | `cells.step_08_rewards:constraint_adherence` |
	| R4 | `format_compliance` | 0.10 | `cells.step_08_rewards:format_compliance` |
	| R5 | `anti_hack_penalty` | 0.10 | `cells.step_08_rewards:anti_hack_penalty` |

	Calibration pipeline:

	```
	quality = combine_quality(R1..R5, weights)
	brier = brier_penalty(confidence, R1)
	reward_raw = quality * (1 - brier)
	reward = apply_uncertain_floor(reward_raw, confidence, quality) # floor=0.50
	final = clamp(reward, -1.0, 1.0)
	```
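
The pipeline above can be sketched in plain Python. Only the function names, the weights, the `quality * (1 - brier)` step, the 0.50 floor, and the final clamp come from this card. Everything inside the function bodies is an assumption made for illustration: a weighted sum for `combine_quality`, the squared-error form of the Brier score, component scores in `[0, 1]`, and in particular the floor rule, which is hypothetical. The real logic lives in `cells.step_08_rewards` and may differ.

```python
from typing import Mapping

# Weights copied from the reward table (R1..R5).
WEIGHTS = {
    "task_completion": 0.40,
    "drift_detection": 0.20,
    "constraint_adherence": 0.20,
    "format_compliance": 0.10,
    "anti_hack_penalty": 0.10,
}


def combine_quality(components: Mapping[str, float],
                    weights: Mapping[str, float] = WEIGHTS) -> float:
    """Assumed form: weighted sum of component scores, each in [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)


def brier_penalty(confidence: float, outcome: float) -> float:
    """Brier score of a single forecast: squared gap between confidence and outcome."""
    return (confidence - outcome) ** 2


def apply_uncertain_floor(reward_raw: float, confidence: float,
                          quality: float, floor: float = 0.50) -> float:
    """Hypothetical rule: an answer reported with confidence below `floor`
    keeps at least `floor * quality`."""
    if confidence < floor:
        return max(reward_raw, floor * quality)
    return reward_raw


def clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def calibrated_reward(components: Mapping[str, float], confidence: float) -> float:
    quality = combine_quality(components)
    brier = brier_penalty(confidence, components["task_completion"])
    reward_raw = quality * (1.0 - brier)
    reward = apply_uncertain_floor(reward_raw, confidence, quality)
    return clamp(reward)
```

With every component at 1.0 and a confidence of 1.0, the Brier term vanishes and the sketch returns a reward of about 1.0. The same episode reported with a confidence of 0.0 would score 0 before the floor and 0.5 after it under the hypothetical rule.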

	Hard rule: every reward bit traces to a deterministic schema- and
	trace-grounded check. There is no LLM-as-a-judge anywhere in the pipeline.

	---

	## How to use

	```python
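	from unsloth import FastModel
	from peft import PeftModel

	# Load the base model in 16-bit; the adapter must sit on top of it.
	model, tokenizer = FastModel.from_pretrained(
	    "unsloth/gemma-3n-E2B-it",
	    max_seq_length=4096,
	    load_in_4bit=False,  # 16-bit LoRA path; matches training
	    full_finetuning=False,
	)
	# Attach the DriftCall LoRA weights to the base model.
	model = PeftModel.from_pretrained(model, "DGXAI/gemma-3n-e2b-driftcall-lora")
	model.eval()

	# Hinglish brief: "one veg thali under ₹500 in Indiranagar before 9 o'clock".
	prompt = (
	    "BRIEF: 9 baje se pehle ek veg thali ₹500 ke andar Indiranagar mein.\n\n"
	    "Reply with EXACTLY one JSON object matching the DriftCallAction schema."
	)
	# Pass the prompt as `text=`: for Gemma-3n the returned object may be a
	# multimodal processor rather than a plain tokenizer.
	inputs = tokenizer(text=prompt, return_tensors="pt").to(model.device)
	out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
	# Decode only the newly generated tokens, skipping the prompt.
	prompt_len = inputs["input_ids"].shape[1]
	print(tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True))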
	```
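
Because the prompt asks for exactly one JSON object, the decoded text usually needs parsing before it can be sent to the environment. The helper below is a hedged sketch using only the standard library. The name `extract_first_json_object` is made up for this example, and since the `DriftCallAction` fields are not listed in this card, it does not validate the object against that schema.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in `text`, or None if there is none.

    Tolerates stray text or code fences around the object.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None
```

A caller would pass the string printed above and treat a `None` result as a format failure.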

	### Or — run it against the live env over OpenEnv REST

	```bash
	# Public bearer token for the hackathon Space.
	curl -X POST https://saumilyajj-driftcall.hf.space/reset \
	  -H "Authorization: Bearer driftcall-demo" \
	  -H "X-Session-Id: smoke-001" \
	  -H "Content-Type: application/json" \
	  -d '{"seed": 42, "curriculum_stage": 2}'
	```

	The OpenEnv gym client lives at
	[`deploy/inference/`](https://github.com/saumilyagupta/openenv-DGXAI/tree/google/gemma-3n-E4B-it/DRIFTCALL/deploy/inference)
	and wraps `/reset`, `/step`, `/state`, `/close` in a gymnasium-style API.
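
For scripts that cannot pull in the gym client, the same `/reset` call can be made from Python's standard library. This is a minimal sketch that mirrors the curl example; it is not the client in `deploy/inference/`. The helper names `build_reset_request` and `send` are made up, and the response is assumed to be a JSON object because this card does not document its shape.

```python
import json
from urllib import request

BASE_URL = "https://saumilyajj-driftcall.hf.space"
TOKEN = "driftcall-demo"  # public demo token from the curl example


def build_reset_request(seed: int, curriculum_stage: int,
                        session_id: str) -> request.Request:
    """Build the POST /reset request without sending it."""
    body = json.dumps({"seed": seed, "curriculum_stage": curriculum_stage})
    return request.Request(
        url=f"{BASE_URL}/reset",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "X-Session-Id": session_id,
            "Content-Type": "application/json",
        },
    )


def send(req: request.Request, timeout: float = 30.0) -> dict:
    """Send the request; assumes the Space replies with a JSON object."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (performs a network call):
# first_observation = send(build_reset_request(42, 2, "smoke-001"))
```

Building the request separately from sending it makes the payload and headers easy to inspect or unit-test offline.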

	---

	## Limitations

	- Small training run. 240 GRPO steps at G=2 is a smoke test of the training
	and push pipeline, not a learning run. From step 0 onward the reward
	fluctuates in `[0.175, 0.300]`, largely against the uncertain-floor at 0.50.
	Meaningful lift is expected to need several thousand steps with G=4–8.
	- Tool-use, not tool-execution. The agent emits JSON DriftCallAction
	payloads. Side effects (`cab.book`, `payment.charge`, …) are realised by
	the env's mock vendor surface, not by real infrastructure.
	- Indic ASR is upstream. Voice input goes through `faster-whisper-small`;
	this model never sees raw audio. Code-switched Hinglish accuracy is bounded
	by Whisper.
	- Reward components are deterministic, not perfect. R5 (`anti_hack_penalty`)
	catches known patterns; novel exploits would need to be added to the probe
	set in `cells/step_20_probe.py`.
	- Not safety-aligned beyond Gemma-3n's defaults. This run adds no specific
	guards against off-task or adversarial inputs.

	---

	## Citation / acknowledgement

	DriftCall is built on top of:

	- [`unsloth/gemma-3n-E2B-it`](https://huggingface.co/unsloth/gemma-3n-E2B-it) — base model
	- [Unsloth](https://github.com/unslothai/unsloth) — fast LoRA path
	- [`hexgrad/Kokoro-82M`](https://huggingface.co/hexgrad/Kokoro-82M) — TTS in the env's audio pipeline
	- [`Systran/faster-whisper-small`](https://huggingface.co/Systran/faster-whisper-small) — ASR in the env's audio pipeline

	Source: <https://github.com/saumilyagupta/openenv-DGXAI> · branch `google/gemma-3n-E4B-it`.

	Hackathon: DGX Hackathon 2026 — Indic Voice + RL track.