# deploy_env_space.md — DriftCall Env HF Space Deployment

**Owner:** Person D (Deploy & Story)
**Implements:** DESIGN.md §3.3 (Deployed Env Topology), §11.1 (Env Space files), §13 (Deliverables)
**Depends on:** `docs/modules/env.md` (FastAPI surface contract), `docs/modules/models.md` (dataclass wire format), `docs/modules/audio.md` (Kokoro + Whisper runtime)
**Status:** DRAFT — pending ≥ 2 fresh critic rounds

---

## 1. Purpose

`driftcall-env` is the production hosting target for the DriftCall OpenEnv RL environment. It runs on a **free-tier Hugging Face Space (Docker SDK, CPU basic, 2 vCPU / 16 GB RAM)** and is the artifact the hackathon judges exercise via `openenv validate`. The Space exposes a FastAPI application implementing the OpenEnv REST contract (`/reset`, `/step`, `/state`, `/close`) plus a lightweight session cache so concurrent training / evaluation runs can share one deployment without state bleed.

The Space is **intentionally CPU-only**. Kokoro TTS (82 M params) and `faster-whisper-small` int8 (~244 M params) both run at roughly real-time on a single modern CPU core; the training topology (DESIGN.md §3.2, §9.4) never loads TTS/ASR because GRPO operates text-in / text-out. This module owns:

1. The Dockerfile (multi-stage build, <2 GB final image, pre-pulled audio weights).
2. `openenv.yaml` metadata (required for `openenv validate`).
3. `requirements.txt` pin set (fastapi, uvicorn, openenv, kokoro, faster-whisper, plus transitive deps).
4. The Space README (Space card) — must satisfy HF Space schema + hackathon submission rules.
5. The session cache implementation sketch delegated to `app.py` (full code in `docs/modules/env.md`; this doc specifies the cache's **deployment constraints** only).
6. The deployment command set (build, push, validate).

This doc is a design spec, not an executable. It must contain every decision needed so a single operator can ship the env Space in one 30-minute sitting on Apr 25 morning (DESIGN.md §12.2 pre-onsite hour 16 gate).

---

## 2. Interface

### 2.1 External HTTP surface (served by the Space)

The Space exposes the OpenEnv REST surface on **port 7860** (the HF Spaces Docker SDK default, matching `app_port` in §4.4; the Space proxy forwards only to that port). The agent-facing endpoints (`/reset`, `/step`, `/state`, `/close`) accept and return `application/json`. Session identity is carried as a request header so the cache can dispatch to the right env instance.

```
POST   /reset           → 200 application/json   # create or recycle a session, return initial observation
POST   /step            → 200 application/json   # advance one turn; returns observation + reward + done
GET    /state           → 200 application/json   # read the current DriftCallState (debug / judge inspection)
POST   /close           → 200 application/json   # explicitly evict a session
GET    /healthz         → 200 text/plain "ok"    # Space healthcheck (HF pings this to mark the Space "running")
GET    /                → 200 text/html          # minimal landing page (see §4.4); NOT the agent surface
```

**Headers (`/reset`, `/step`, `/state`, `/close`):**

| Header | Required | Notes |
|---|---|---|
| `Authorization: Bearer <DRIFTCALL_ENV_TOKEN>` | yes (see §3.5) | Space secret; judge receives this via submission form |
| `X-Session-Id: <uuid4-or-caller-chosen>` | yes | Opaque string, max 64 chars, `[A-Za-z0-9_-]` only |
| `Content-Type: application/json` | yes, whenever a body is sent | UTF-8 |

The endpoint contracts (request / response shapes) are owned by `docs/modules/env.md` and serialize the `DriftCallObservation` / `DriftCallState` / `DriftCallAction` dataclasses defined in `docs/modules/models.md`. This doc only pins the **deployment-visible** aspects: port, headers, auth, status codes.
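
The `X-Session-Id` constraint from the header table can be checked before any cache lookup. A minimal sketch, assuming a hypothetical helper name `is_valid_session_id` and a minimum length of one character (the table only states the maximum):

```python
import re

# Pattern from the header table: up to 64 chars drawn from [A-Za-z0-9_-].
# The lower bound of 1 is an assumption; the spec only gives the maximum.
_SESSION_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_session_id(value: str) -> bool:
    """Return True if value is an acceptable X-Session-Id header value."""
    return _SESSION_ID_RE.fullmatch(value) is not None
```

A handler would map a failed check to the `400` row of §2.2. Using `fullmatch` rather than `match` matters here, because `match` would accept a valid prefix followed by illegal characters.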

> **Cross-doc sync note (2026-04-24):** DESIGN.md §3.3 was updated to match this doc's choice of carrying session identity via the `X-Session-Id` HTTP header (previously documented there as a `session_id` query param). Both docs now agree. No behavior change in this spec — the note is recorded so reviewers don't perceive divergence.

#### 2.1.1 Success body shapes (top-level only)

Top-level JSON shapes for each success response. Inner dataclass fields (`DriftCallObservation`, `DriftCallAction`, `DriftCallState`) are owned by `docs/modules/env.md` and `docs/modules/models.md` — this section pins only the envelope each endpoint returns.

**`POST /reset`**

Request:
```json
{
  "config": {
    "curriculum_stage": 1,
    "language_weights": { "hi": 0.4, "ta": 0.2, "kn": 0.2, "hinglish": 0.2 },
    "audio_boundary_enabled": true
  },
  "seed": 42
}
```
- `config.curriculum_stage`: `1 | 2 | 3`
- `config.language_weights`: object, keys are language codes, values sum to 1.0
- `config.audio_boundary_enabled`: bool
- `seed`: `int | null`
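
The field constraints listed above could be checked with something like the following. This is a sketch only: `validate_reset_config` is a hypothetical name, it treats every field as optional (as the `/reset` examples in §8 do), and the real validation is owned by `docs/modules/env.md`.

```python
import math
from typing import List


def validate_reset_config(config: dict) -> List[str]:
    """Return a list of problems; an empty list means the config is acceptable."""
    problems = []
    if "curriculum_stage" in config:
        stage = config["curriculum_stage"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(stage, bool) or stage not in (1, 2, 3):
            problems.append("curriculum_stage must be 1, 2, or 3")
    if "language_weights" in config:
        total = sum(config["language_weights"].values())
        # Tolerate float rounding: 0.4 + 0.2 + 0.2 + 0.2 is not exactly 1.0.
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            problems.append("language_weights must sum to 1.0")
    if "audio_boundary_enabled" in config:
        if not isinstance(config["audio_boundary_enabled"], bool):
            problems.append("audio_boundary_enabled must be a bool")
    return problems
```

The tolerance value `1e-6` is an assumption; the spec only says the weights sum to 1.0.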

Response:
```json
{
  "observation": { "...DriftCallObservation..." },
  "episode_id": "uuid4-string",
  "max_turns": 12
}
```

**`POST /step`**

Request:
```json
{ "action": { "...DriftCallAction..." } }
```

Response:
```json
{
  "observation": { "...DriftCallObservation..." },
  "reward": 0.0,
  "done": false,
  "info": { "...opaque..." }
}
```
- `reward`: `float | null` (null when reward is deferred to episode end)

**`GET /state`**

Response:
```json
{
  "state": { "...DriftCallState..." },
  "turn": 3
}
```

**`POST /close`**

Response:
```json
{
  "closed": true,
  "final_state": { "...DriftCallState... | null" }
}
```
- `final_state`: `object | null` (null if session was already evicted)

Deeper field-level detail for `DriftCallObservation`, `DriftCallAction`, and `DriftCallState` lives in `docs/modules/env.md` and `docs/modules/models.md` — do not duplicate it here.

### 2.2 Status code map

| Code | Meaning | Triggered by |
|---|---|---|
| 200 | Success | Normal return |
| 400 | Malformed JSON / missing header / invalid action shape | Parsing or dataclass validation failure |
| 401 | Missing or bad bearer | §3.5 auth check |
| 404 | `X-Session-Id` not in cache (for `/step` / `/state` / `/close`) | Session expired, evicted, or never created |
| 409 | Concurrent `/reset` on same session id (see §7, case 1) | Cache key collision during init |
| 413 | Request body exceeds 1 MiB | §5, mode M11 |
| 429 | Max concurrent sessions reached | §3.2 cap hit |
| 500 | Unhandled exception inside env step | Bug; logged, stack trace NOT returned in body |
| 503 | Model weights not yet loaded on cold-start | §7, case 3 |

All error bodies are `{"error": {"code": "<slug>", "message": "<user-safe string>"}}`. Internal stack traces never cross the wire.
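
The envelope is small enough to build in one place. A sketch, assuming hypothetical helper names `error_body` and `internal_error_response`; the generic message text is illustrative, not specified by this doc:

```python
import json
from typing import Tuple


def error_body(code: str, message: str) -> str:
    """Serialize the error envelope shared by every non-200 response."""
    return json.dumps({"error": {"code": code, "message": message}})


def internal_error_response(exc: Exception) -> Tuple[int, str]:
    """Map an unhandled exception to a 500 without leaking its details."""
    # str(exc) is deliberately unused: details belong in the server log,
    # never in the response body.
    return 500, error_body("internal_error", "Internal error; see server logs.")
```

Routing every failure through one serializer makes it hard for a handler to return an ad-hoc body that includes exception text.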

### 2.3 Outbound network

The Space makes **zero outbound HTTP calls at runtime**. Kokoro and Whisper weights are baked into the image (§4.2); no HF Hub fetches, no telemetry, no phone-home. This is load-bearing because HF Spaces free CPU tier often has slow / rate-limited egress, and because reproducibility demands an offline image.

### 2.4 Container entrypoint

```dockerfile
CMD ["uvicorn", "app:app", \
     "--host", "0.0.0.0", \
     "--port", "7860", \
     "--workers", "2", \
     "--timeout-keep-alive", "30", \
     "--log-level", "info"]
```

Two uvicorn workers (not four): CPU basic tier has 2 vCPUs, and Kokoro synthesis and Whisper transcription are CPU-bound, so more workers just contend for the same cores. Each worker is a separate process and loads its own copy of the models.

---

## 3. Behavior Spec

### 3.1 Session lifecycle

A session is an instance of `DriftCallEnvironment` (the class whose full behavior lives in `docs/modules/env.md`). The deployment layer treats each session as an opaque object with `reset()`, `step()`, `state()`, `close()` methods and does not introspect it.

```
client                              Space (app.py)                cache
   │  POST /reset {seed, config}      │                              │
   │  X-Session-Id: S1                │                              │
   ├─────────────────────────────────▶│  look up S1                  │
   │                                  ├─────────────────────────────▶│
   │                                  │◀───── miss ──────────────────┤
   │                                  │  construct env, bind seed    │
   │                                  │  store (env, last_touched)   │
   │                                  ├─────────────────────────────▶│
   │                                  │  env.reset(...) → obs        │
   │◀────────────  200 obs ───────────┤                              │
   │                                  │                              │
   │  POST /step                      │                              │
   ├─────────────────────────────────▶│  lookup S1 → hit             │
   │                                  │  touch last_touched = now    │
   │                                  │  env.step(...) → obs,r,done  │
   │◀────────── 200 obs,r ────────────┤                              │
```

### 3.2 Cache policy (deployment-level invariants)

The cache is an in-process dict, keyed by `X-Session-Id`. The implementation lives in `app.py` (`docs/modules/env.md` §3 "session cache"), but this doc locks the policy:

| Invariant | Value | Source |
|---|---|---|
| Max concurrent sessions | **10** | DESIGN.md §3.3 |
| TTL (time since `last_touched`) | **3600 s = 1 hr** | DESIGN.md §3.3 |
| Storage | In-memory only (no Redis, no disk) | Free tier has no persistent disk writable at runtime; container state resets on Space rebuild |
| Eviction policy | LRU when cap reached; stale-TTL sweep every 60 s | §3.3 |
| Cross-process sharing | None — each uvicorn worker has its own cache | Acceptable because cache is advisory; clients that get routed to a different worker on re-connect re-issue `/reset` |

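**Consequence of the "per-worker cache" choice:** a client's session id may land on worker W1 for `/reset` and W2 for `/step` (the workers share one listening socket, and the kernel decides which worker accepts each new connection). In that case `/step` returns 404 and the client must re-`/reset`. The 10-session cap and LRU order are likewise enforced per worker, so the Space as a whole can hold up to 20 sessions. This is acceptable for the hackathon because: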

1. Training / eval runs keep a persistent HTTP connection via `requests.Session`, which typically pins to one worker for the life of the socket.
2. Judges use one session end-to-end; they hit `/reset` and then replay steps over the same connection.
3. Two-worker degradation is documented in the Space README so judges don't get silently surprised.

A future hardening path (not in-scope for this hackathon) is to run `--workers 1` with a thread pool, or to share the cache via `multiprocessing.Manager`. The `--workers 1` option is tracked in §9 (item 2).
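
On the client side, the cross-worker 404 can be handled by falling back to `/reset`. The sketch below assumes a hypothetical transport hook `post(path, payload) -> (status, body)` so that it stays independent of any particular HTTP library; it restarts the episode and does not resume it.

```python
from typing import Callable, Tuple

# Hypothetical transport hook: sends a JSON POST and returns (status, body).
Post = Callable[[str, dict], Tuple[int, dict]]


def step_with_recovery(post: Post, action: dict, reset_payload: dict) -> Tuple[str, dict]:
    """Try /step; on 404 (session unknown to this worker) start a fresh episode."""
    status, body = post("/step", {"action": action})
    if status == 404:
        # The session is on another worker or was evicted; it cannot be resumed.
        status, body = post("/reset", reset_payload)
        return ("reset", body) if status == 200 else ("error", body)
    return ("stepped", body) if status == 200 else ("error", body)
```

The returned tag tells the caller whether the action was applied (`"stepped"`) or whether it now holds a fresh initial observation (`"reset"`) and must replan from turn 0.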

### 3.3 Eviction sweep

A background asyncio task (started in `app.py` `lifespan`) runs every 60 s:

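```python
now = time.monotonic()  # same clock as SessionEntry.last_touched (§4.1)
for sid, entry in list(cache.items()):  # snapshot, since we pop while iterating
    if now - entry.last_touched > TTL:
        env = cache.pop(sid).env
        env.close()  # frees whatever audio buffers the env holds
```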

LRU eviction on `/reset` when `len(cache) >= 10` drops the oldest `last_touched` entry first; the new session replaces it.
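
Both eviction paths can be expressed as pure functions over the cache dict. A minimal sketch, assuming a stand-in `Entry` type and hypothetical function names; it omits the `env.close()` call and the `429` path of §5 (M5):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_SESSIONS = 10       # cap from §3.2
TTL_SECONDS = 3600.0    # TTL from §3.2


@dataclass(frozen=True)
class Entry:
    """Stand-in for SessionEntry; only the field the policy needs."""
    last_touched: float


def sweep_expired(cache: Dict[str, Entry], now: float) -> List[str]:
    """Drop every entry idle for longer than the TTL; return the evicted ids."""
    expired = [sid for sid, e in cache.items() if now - e.last_touched > TTL_SECONDS]
    for sid in expired:
        del cache[sid]  # the real cache would also call env.close() here
    return expired


def evict_lru_if_full(cache: Dict[str, Entry]) -> Optional[str]:
    """Before admitting a new session, drop the least recently touched one if at cap."""
    if len(cache) < MAX_SESSIONS:
        return None
    victim = min(cache, key=lambda sid: cache[sid].last_touched)
    del cache[victim]
    return victim
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the policy testable without sleeping.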

### 3.4 Streaming / keep-alive

All endpoint responses are single JSON bodies: **no SSE, no websockets, no chunked streaming**. OpenEnv's client library (`openenv.HTTPEnvClient`) uses blocking `POST` + `json()` and a shared `requests.Session`; anything exotic risks failing `openenv validate`. A `/step` call typically takes up to ~5 s when an audio pass is involved (Kokoro synth + Whisper transcribe on CPU); worst cases are covered in §7.5. `--timeout-keep-alive 30` controls how long uvicorn keeps an idle keep-alive connection open between requests, so a client that pauses for less than 30 s reuses its socket (and therefore its worker, see §3.2). It does not bound an in-flight request; the limit that matters there is the 60 s HF Spaces proxy timeout.

### 3.5 Authentication

A single shared-secret bearer guards `/reset`, `/step`, `/state`, and `/close` (the read-only `/state` is included; see §9, item 7). The token is injected as a HF Space **Secret** named `DRIFTCALL_ENV_TOKEN` and read by `app.py` at import time. `/healthz` is **unauthenticated** (HF Space probes have no bearer).

- Token format: 32+ byte URL-safe random (`secrets.token_urlsafe(32)`).
- Token rotation: delete the Space secret and push a new one; all in-flight sessions 401 on the next request.
- Missing secret at Space boot → container exits 1 (fail-fast).
- The token is bundled with the hackathon submission package so judges can exercise `openenv validate` against the live Space.
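
Token generation and the bearer check might look like the sketch below. The helper names are hypothetical, and the constant-time comparison via `hmac.compare_digest` is an added suggestion, not something this spec mandates.

```python
import hmac
import secrets
from typing import Optional


def new_token() -> str:
    """Generate a token in the format described above (32 random bytes, URL-safe)."""
    return secrets.token_urlsafe(32)


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Check an Authorization header value against the configured token."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

A `False` result maps to M1 (`401 unauthorized`) in §5.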

### 3.6 Determinism

The deployment does not itself introduce nondeterminism. `env.py` owns seed handling; the cache is a pass-through. However, **two CPU-bound sources of wall-clock variance** can change observable latency (`tool_results[i].latency_ms` is wall-clock, not simulated):

1. Kokoro synth time on the first call after cold start can be 2–3× steady-state due to JIT / lazy graph compile.
2. Whisper VAD + decode time varies with input length.

Neither perturbs reward math — `latency_ms` is informational, never scored.

### 3.7 Logging

Structured JSON logs to stdout (HF Spaces captures stdout into the Logs tab). One log line per request, fields: `ts`, `level`, `session_id`, `endpoint`, `status`, `latency_ms`, `turn`, `err_code` (nullable). No PII, no audio bytes, no bearer token. The full `DriftCallAction` body is logged at DEBUG only, disabled by default.
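
One way to emit such a line, as a sketch: the function name is hypothetical and the `ts` format (epoch seconds) is an assumption, since the doc does not fix one.

```python
import json
import time
from typing import Optional


def request_log_line(level: str, session_id: str, endpoint: str, status: int,
                     latency_ms: float, turn: Optional[int] = None,
                     err_code: Optional[str] = None) -> str:
    """Build the single JSON log line for one request."""
    record = {
        "ts": time.time(),        # assumption: epoch seconds
        "level": level,
        "session_id": session_id,
        "endpoint": endpoint,
        "status": status,
        "latency_ms": latency_ms,
        "turn": turn,
        "err_code": err_code,     # null on success
    }
    # No request body, audio bytes, or bearer token is ever passed in.
    return json.dumps(record)
```

Because the function takes only the listed fields as parameters, there is no code path by which a token or action body can reach the log at the default level.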

---

## 4. Data structures

### 4.1 `SessionEntry`

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionEntry:
    env: DriftCallEnvironment        # opaque; see docs/modules/env.md
    created_at: float                # time.monotonic() at /reset
    last_touched: float              # time.monotonic() at every /step|/state
    reset_count: int                 # incremented on in-place /reset (§7, case 1)
```

Frozen per project rule (CLAUDE.md §7). `last_touched` updates produce a new `SessionEntry`; the cache dict replaces the old entry.
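
With a frozen dataclass, a "touch" is a copy-and-swap. A small sketch using `dataclasses.replace`; the `touch` helper is a hypothetical name and `env` is typed as `object` so the example stands alone:

```python
import time
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class SessionEntry:
    env: object            # DriftCallEnvironment in the real code
    created_at: float
    last_touched: float
    reset_count: int


def touch(cache: Dict[str, SessionEntry], sid: str) -> SessionEntry:
    """Swap in a copy of the entry with last_touched set to now."""
    fresh = replace(cache[sid], last_touched=time.monotonic())
    cache[sid] = fresh     # the old entry object is left unmodified
    return fresh
```

`replace` copies every other field by reference, so the same `env` object carries over; only the bookkeeping record is new.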

### 4.2 Dockerfile layout

Multi-stage build. Stage 1 installs wheels into a throwaway image; stage 2 copies only the site-packages dir and the app code. Target final image < 2 GB (DESIGN.md Risk 10).

```dockerfile
# -------- Stage 1: builder --------
FROM python:3.11-slim AS builder
ENV PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git libsndfile1 ffmpeg && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt ./
RUN pip install --prefix=/install -r requirements.txt

# Pre-pull model weights so first /reset is fast
RUN pip install --prefix=/install huggingface_hub
RUN PYTHONPATH=/install/lib/python3.11/site-packages \
    python -c "from huggingface_hub import snapshot_download; \
               snapshot_download('hexgrad/Kokoro-82M', cache_dir='/weights'); \
               snapshot_download('Systran/faster-whisper-small', cache_dir='/weights')"

# -------- Stage 2: runtime --------
FROM python:3.11-slim
ENV PYTHONUNBUFFERED=1 \
    HF_HOME=/root/.cache/huggingface \
    TRANSFORMERS_OFFLINE=1 \
    HF_HUB_OFFLINE=1
RUN apt-get update && apt-get install -y --no-install-recommends \
        libsndfile1 ffmpeg ca-certificates && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local
# snapshot_download(cache_dir='/weights') writes the hub cache layout
# (models--<org>--<name>/...), which the runtime resolves under $HF_HOME/hub.
COPY --from=builder /weights /root/.cache/huggingface/hub
WORKDIR /app
COPY app.py openenv.yaml ./
COPY driftcall/ ./driftcall/
COPY data/ ./data/
EXPOSE 7860
HEALTHCHECK --interval=30s --timeout=5s --start-period=45s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:7860/healthz', timeout=4).read()" || exit 1
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "2", "--timeout-keep-alive", "30", "--log-level", "info"]
```

Key decisions:

- `python:3.11-slim` base: smallest stable Python base with glibc (alpine would force musl-incompatible wheels for `faster-whisper` / `ctranslate2`).
- `ffmpeg` installed because Whisper's audio loader shells out to it for anything non-WAV.
- `HF_HUB_OFFLINE=1` + `TRANSFORMERS_OFFLINE=1` are hard guarantees: if a download is attempted at runtime it raises instead of silently fetching and hanging (§5, mode M6).
- Weights land under `/root/.cache/huggingface/hub`, the hub cache directory derived from `HF_HOME`; that is where `huggingface_hub` resolves repo ids by default, and both Kokoro and faster-whisper load through it.

### 4.3 `openenv.yaml`

```yaml
# openenv.yaml — consumed by `openenv validate`
# Schema source: https://github.com/meta-pytorch/OpenEnv
schema_version: "1.0"
env:
  id: driftcall
  version: "0.1.0"
  display_name: "DriftCall — Indic Voice Concierge under Schema Drift"
  description: >
    OpenEnv-compliant RL environment where a voice-first agent must complete
    Indic consumer concierge tasks while the vendor APIs undergo mid-episode
    schema, policy, T&C, pricing, and auth drift. Five independent reward
    components; deterministic seeded drift; Hindi/Tamil/Kannada/Hinglish
    briefs via Kokoro TTS + faster-whisper ASR.
  license: apache-2.0
  tags:
    - openenv
    - rl
    - voice
    - indic
    - schema-drift
  entrypoint:
    type: http
    base_url: "https://<team>-driftcall-env.hf.space"
    endpoints:
      reset: "/reset"
      step: "/step"
      state: "/state"
      close: "/close"
      health: "/healthz"
    auth:
      type: bearer
      secret_env: DRIFTCALL_ENV_TOKEN
  action_space:
    ref: "docs/modules/models.md#DriftCallAction"
  observation_space:
    ref: "docs/modules/models.md#DriftCallObservation"
  episode:
    max_turns: 16        # worst case, stage-3 curriculum (DESIGN.md §4.5)
    reset_config:
      seed: { type: int, required: false }
      curriculum_stage: { type: int, range: [1, 3], required: false }
      language_weights: { type: object, required: false }
  reward:
    shape: scalar
    range: [-1.0, 1.0]
    components:
      ref: "docs/modules/rewards.md"
```

Field names match the OpenEnv v1.0 schema (`entrypoint.type`, `action_space.ref`, etc.). The `ref` pointers resolve to paths inside the repo; `openenv validate` reads them to assert the env is self-describing.

### 4.4 `README.md` (Space card)

```
---
title: DriftCall Env
emoji: 🧭
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
short_description: OpenEnv — Indic voice concierge under schema drift.
---
```

Below the YAML header: one-paragraph description, `openenv validate` command, auth note, link to GitHub, link to the demo Space, link to the HF Hub model + dataset. The README is also rendered as the root `/` route's fallback (Docker Spaces serve nothing at `/` otherwise).

### 4.5 `requirements.txt`

```
fastapi==0.115.*
uvicorn[standard]==0.32.*
pydantic==2.*
openenv==0.2.*            # or whatever is current at build time; version-pin in PR
kokoro==0.9.*
faster-whisper==1.1.*
ctranslate2==4.5.*        # pinned to match faster-whisper's wheel
soundfile==0.12.*
numpy<2.0
huggingface_hub==0.26.*   # only used at build time (snapshot_download)
```

The version set matches `docs/modules/audio.md` §6.1 (upstream consumer) exactly. Pinning is deliberate: the env Space is a reproducibility artifact; judges may rebuild it months from now.

---

## 5. Error modes

Every failure path that can cross the HTTP boundary:

| ID | When | HTTP | Body `error.code` | Recovery |
|---|---|---|---|---|
| M1 | No `Authorization` header, or bad bearer | 401 | `unauthorized` | Client fixes token |
| M2 | No `X-Session-Id` on `/reset`/`/step`/`/state`/`/close` | 400 | `missing_session_id` | Client adds header |
| M3 | `/step`/`/state`/`/close` with unknown session id | 404 | `session_not_found` | Client re-issues `/reset` |
| M4 | Session was in cache but TTL expired between request and handler | 404 | `session_expired` | Client re-issues `/reset` |
| M5 | `/reset` when cache is full and LRU victim cannot be evicted (all 10 slots freshly `last_touched`) | 429 | `max_sessions` | Client backs off and retries; `Retry-After: 30` header set |
| M6 | Kokoro or Whisper model weights missing at startup (image build was broken) | 503 | `model_not_ready` | **Operator** fixes image; client cannot recover |
| M7 | Malformed JSON in request body | 400 | `bad_json` | Client fixes payload |
| M8 | Action fails pydantic / dataclass validation (wrong `ActionType`, missing `tool_name` for `TOOL_CALL`) | 400 | `invalid_action` | Client fixes action |
| M9 | Unhandled exception in `env.step` | 500 | `internal_error` | Logged with request id; client SHOULD NOT retry same action |
| M10 | Disk full writing tmp WAV in audio pipeline | 500 | `io_error` | Very rare on HF Spaces (no writable persistent disk, but /tmp is tmpfs and can fill); operator action |
| M11 | Request body exceeds 1 MiB | 413 | `payload_too_large` | Client trims (should never happen; actions are small) |
| M12 | Concurrent `/reset` on same session id (two requests race) | 409 | `reset_in_progress` | Client serializes resets on its side |

Rules:

- No stack traces in response bodies. A `request_id` is included so operators can grep logs; neither ASGI nor uvicorn assigns one, so `app.py` has to generate it per request.
- All error responses include `Cache-Control: no-store`.
- M5 (`429`) is the **only** code that includes `Retry-After`. Others are terminal for the request.
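
The header rules above reduce to a few lines. A sketch with a hypothetical helper name; the `30` comes from the M5 row:

```python
from typing import Dict


def error_headers(status: int) -> Dict[str, str]:
    """Headers attached to every error response, per the rules above."""
    headers = {"Cache-Control": "no-store"}
    if status == 429:
        # Only the max-sessions case (M5) tells the client when to retry.
        headers["Retry-After"] = "30"
    return headers
```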

---

## 6. Dependencies

### 6.1 Upstream (consumed by the deployment artifact)

- **`docs/modules/env.md`** — defines `DriftCallEnvironment.__init__/reset/step/state/close` and the FastAPI route handlers. This doc references but does not duplicate env behavior.
- **`docs/modules/models.md`** — every dataclass crossing the HTTP boundary.
- **`docs/modules/audio.md`** — Kokoro + Whisper integration; tells this doc which weights to pre-pull and what CPU footprint to budget.
- **`docs/modules/rewards.md`** — cited from `openenv.yaml` `reward.components.ref`.
- **DESIGN.md §3.3, §9.1, §9.2, §11.1, §13, Risk 10** — authoritative.

### 6.2 External runtime dependencies (pinned in §4.5)

`fastapi`, `uvicorn[standard]`, `openenv`, `kokoro`, `faster-whisper`, `ctranslate2`, `soundfile`, `pydantic`, `numpy<2.0`, `huggingface_hub` (build-time only).

### 6.3 Hugging Face platform dependencies

- **Space SDK:** `docker` (NOT `gradio`/`static`). The Docker SDK is the only path that lets us bake weights into the image and pin `uvicorn` workers.
- **Space hardware:** `cpu-basic` (free). 2 vCPU, 16 GB RAM, 50 GB ephemeral disk, **no persistent storage**, no GPU.
- **Space secrets:** `DRIFTCALL_ENV_TOKEN` (required).
- **Space env vars:** none (all config is baked in or via `X-Session-Id`).
- **Space region:** default (us-east-1); we do not need region pinning for CPU-basic.

### 6.4 Downstream consumers (who pings this Space)

- `training/eval_baseline.py` and `training/eval_final.py` (DESIGN.md §12) — the training-side `HTTPEnvClient`.
- `demo/app_gradio.py` — the demo Space (documented in `docs/modules/deploy_demo_space.md`) uses this env over HTTP for live runs.
- `openenv validate .` — run against the Space URL as part of the hackathon submission gate.
- Hackathon judges — direct HTTP exercise via curl / the `openenv` CLI.

### 6.5 Explicit non-dependencies

- **No GPU** at runtime (load-bearing; DESIGN.md §3.3).
- **No LLM weights** on the env Space (Gemma 4 lives on the demo Space or on the trainer's local V100).
- **No training code** (`training/` is NOT copied into the image; see §4.2 `COPY` list).
- **No HF Hub network** at runtime (§2.3, §4.2 offline envs).

---

## 7. Edge cases

Six cases the deployment plan must handle correctly. Each is load-bearing for either the 30-min deploy window or the judge's `openenv validate` run.

### 7.1 Concurrent `/reset` on the same session id

Client A and client B both POST `/reset` with `X-Session-Id: S1` within the same ~100 ms window. The cache uses a per-session asyncio lock; the second request observes the session mid-construction.

**Handling:**
- If the first request is still inside `env.__init__`, the second request gets `409 reset_in_progress`. Client is expected to serialize on its side.
- If the first request has completed, the second request performs an in-place reset: the old env is `.close()`'d, a new env replaces it, `reset_count += 1`. This matches `gym`'s idempotent reset semantics.
- `seed` is honored on the winning reset; the losing (409'd) request's seed is discarded.
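
The 409 path can be sketched with one `asyncio.Lock` per session id. `ResetGuard` and `build_env` are hypothetical names; the real handler also performs the in-place reset and `reset_count` bookkeeping, which are omitted here.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class ResetGuard:
    """Rejects a /reset that arrives while the same session is still being built."""

    def __init__(self) -> None:
        self._locks: Dict[str, asyncio.Lock] = {}

    async def reset(self, sid: str, build_env: Callable[[], Awaitable[None]]) -> int:
        lock = self._locks.setdefault(sid, asyncio.Lock())
        if lock.locked():
            return 409  # reset_in_progress: the caller must serialize
        # No await occurs between the check and the acquire, so on a single
        # event loop no other coroutine can slip in between them.
        async with lock:
            await build_env()
            return 200


async def race_two_resets() -> list:
    """Fire two resets for the same id and collect both status codes."""
    guard = ResetGuard()

    async def slow_build() -> None:
        await asyncio.sleep(0.05)  # stands in for env construction

    return await asyncio.gather(
        guard.reset("S1", slow_build),
        guard.reset("S1", slow_build),
    )
```

Running `asyncio.run(race_two_resets())` yields one `200` and one `409`. The guard is per process, so it only covers races that land on the same uvicorn worker.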

### 7.2 `/step` on an evicted session

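A client idles for 65 minutes between `/step` calls. The sweep task evicts the session within 60 s of the 1 h TTL elapsing. The client's next `/step` returns 404: `session_not_found` (M3) once the sweep has removed the entry, or `session_expired` (M4) if the handler finds a stale entry before the sweep does.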

**Handling:**
- The client MUST re-issue `/reset` with the same or new seed; it cannot resume mid-episode. This is explicit in the Space README.
- No attempt is made to persist episode state across evictions. The free tier has no writable persistent disk, and replaying a seeded episode is cheap (< 1 s on the CPU basic tier).
- `env.close()` is called on eviction to release the Kokoro audio buffer (saves ~80 MB resident per lingering session).

### 7.3 Cold-start model-weight load race

The Space boots. Uvicorn workers start and each lazily triggers a Kokoro + Whisper load on the first audio-involving `/step`. Whisper's CTranslate2 model load takes ~3–5 s; Kokoro takes ~2 s. A `/step` arriving before load completes can block up to ~8 s.

**Handling:**
- `app.py`'s `lifespan` startup hook performs an **eager** load of both models during container boot. This turns cold-start latency into Space "Starting…" time (which HF surfaces via the spinner) instead of a hung client request.
- If eager load fails (bad weights, disk corruption), the container exits 1 and HF's Space restart loop catches it — operator sees the Space status as "Error" instead of silently hanging.
- The first `/healthz` probe is expected at +30 s (`--start-period=45s` on the HEALTHCHECK gives us a comfortable margin).

### 7.4 Kokoro voice pack missing for a language

Kokoro is loaded at startup but an individual voice pack for `language="kn"` (Kannada) is missing from the snapshot cache due to a partial download.

**Handling:**
- `audio/tts_kokoro.py` (per `docs/modules/audio.md` §5) raises `VoicePackMissingError`. The env treats this as a SPEAK-action failure and returns a `tool_results` entry with `status="schema_error"` and `response={"error": "voice_unavailable"}`. The episode continues; reward R4 (format compliance) may drop but R1/R2 are unaffected.
- The image build in §4.2 pre-pulls the **full** Kokoro snapshot (`snapshot_download('hexgrad/Kokoro-82M')`), which includes all voice packs. If a voice pack is missing at runtime, the image is broken — operator fixes the Dockerfile and rebuilds.

### 7.5 HTTP timeout mid-`/step`

A `/step` takes 35 s because Whisper is processing a long utterance and the Space is also handling three concurrent episodes. The HF Space edge proxy has a 60 s idle timeout, so this call fits, but the margin shrinks quickly as concurrency rises.

**Handling:**
- `--timeout-keep-alive 30` only affects idle connections between requests; uvicorn does not cut off an in-flight `/step`. The HTTP client's read timeout should be ≥ 60 s (`requests` applies no timeout unless one is passed, which is safe here).
- Inside `env.step`, audio ops have **hard caps** owned by `audio/*.py`: Whisper `max_duration_s=30`, Kokoro synth implicitly bounded by text length. The env cannot produce a `/step` longer than ~40 s at p99.
- If a `/step` does exceed 60 s (e.g., 10 concurrent sessions all doing audio at once on 2 vCPU), the proxy closes the socket and the client sees `ConnectionError`. Client re-issues; the session is still in the cache and the step was effectively a no-op on the server side because responses are atomic-on-return (state is only mutated after all work succeeds — see `docs/modules/env.md` §3 transactional step semantics).

### 7.6 Out-of-memory during concurrent audio

Five sessions simultaneously run audio-heavy `/step`s. Each Whisper int8 model takes ~250 MB RAM; Kokoro takes ~350 MB. Naive loading would hit `5 × 600 MB = 3 GB` plus Python overhead — well within the 16 GB tier budget, but the Space can still OOM if the image unexpectedly loads fp32 weights.

**Handling:**
- Whisper is forced to `compute_type="int8"` and Kokoro to fp32 (its default is already smallest viable). `audio/*.py` asserts these at load time.
- The models are **singletons** shared across the sessions of a worker process (they are stateless w.r.t. concurrent calls; CTranslate2 releases the GIL during decode). Memory budget is therefore `~600 MB` per worker, about 1.2 GB across the two uvicorn workers, not per-session.
- If an OOM happens, the container is killed by the HF Space OOM-killer and auto-restarts. We lose all in-flight sessions; clients re-`/reset`. The eviction sweep and TTL ensure no permanently-dead sessions pile up.
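
The memory arithmetic in this case, written out. The per-model figures are the ones quoted above; the assumption that each uvicorn worker process holds its own copy follows from workers being separate OS processes.

```python
WHISPER_INT8_MB = 250    # figure quoted in §7.6
KOKORO_MB = 350          # figure quoted in §7.6
UVICORN_WORKERS = 2      # from the container entrypoint


def resident_mb(model_copies: int) -> int:
    """Approximate resident memory for a given number of model-pair copies."""
    return model_copies * (WHISPER_INT8_MB + KOKORO_MB)


naive_mb = resident_mb(5)                  # one copy per concurrent session
shared_mb = resident_mb(UVICORN_WORKERS)   # one singleton pair per worker
print(naive_mb, shared_mb)                 # prints "3000 1200"
```

Either figure is far below the 16 GB tier limit, which is why the fp32 misload is the realistic OOM trigger and not session count.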

---

## 8. Examples

### 8.1 End-to-end `/reset` → `/step` flow via curl

```bash
# Assume DRIFTCALL_ENV_TOKEN is set locally for scripting convenience.
TOKEN="${DRIFTCALL_ENV_TOKEN:?export DRIFTCALL_ENV_TOKEN first}"
BASE="https://<team>-driftcall-env.hf.space"

# 1. Reset with seed 42, stage 2 curriculum.
curl -sS -X POST "$BASE/reset" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Session-Id: demo-001" \
  -H "Content-Type: application/json" \
  -d '{"seed": 42, "config": {"curriculum_stage": 2}}'
# → 200 {"observation": {"turn": 0, "goal": {...}, "last_transcript": "Bhai Friday ko...", ...}, "episode_id": "...", "max_turns": 12}

# 2. Step: call airline.search.
curl -sS -X POST "$BASE/step" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Session-Id: demo-001" \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "action_type": "tool_call",
      "tool_name": "airline.search",
      "tool_args": {"origin": "DEL", "destination": "BLR", "date": "2026-04-26"}
    }
  }'
# → 200 {"observation": {...}, "reward": 0.0, "done": false, "info": {"drift_fired": []}}

# 3. Inspect state (judge-only, optional).
curl -sS "$BASE/state" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Session-Id: demo-001"
# → 200 {"state": {"episode_id": "...", "max_turns": 12, "drift_schedule": [...], ...}, "turn": 1}

# 4. Close.
curl -sS -X POST "$BASE/close" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Session-Id: demo-001"
# → 200 {"closed": true, "final_state": {...}}
```

### 8.2 Container build + smoke + push

```bash
# Local build (from DRIFTCALL/ repo root)
docker build -t driftcall-env:local .

# Local smoke (bind a dummy secret)
docker run --rm -p 7860:7860 \
  -e DRIFTCALL_ENV_TOKEN=dev-local-token \
  driftcall-env:local

# In another shell:
curl -sS http://localhost:7860/healthz            # → "ok"
curl -sS -X POST http://localhost:7860/reset \
  -H "Authorization: Bearer dev-local-token" \
  -H "X-Session-Id: smoke" \
  -H "Content-Type: application/json" -d '{}'
# → 200 with initial observation

# Push to HF Space via the new `hf` CLI.
# The team-lead brief flags that `huggingface-cli` is deprecated; we migrate
# DriftCall/CLAUDE.md §6 row "HF push env" to `hf upload` in a follow-up PR.
hf upload <team>/driftcall-env . --repo-type=space
# (Requires a huggingface_hub release that ships the `hf` CLI, 0.34 or later:
#  `pip install -U "huggingface_hub>=0.34"`, plus a completed `hf auth login`.)
```

### 8.3 `openenv validate` against the live Space

```bash
# Against local container:
openenv validate http://localhost:7860 \
  --auth-bearer dev-local-token

# Against deployed Space:
openenv validate https://<team>-driftcall-env.hf.space \
  --auth-bearer "$DRIFTCALL_ENV_TOKEN"

# Expected output:
#   ✓ openenv.yaml parses, schema v1.0
#   ✓ GET  /healthz → 200 ok
#   ✓ POST /reset   → 200, observation matches observation_space.ref
#   ✓ POST /step    → 200, observation + reward + done
#   ✓ GET  /state   → 200, DriftCallState matches schema
#   ✓ POST /close   → 200
#   ✓ 5 endpoints validated, 0 errors
```

Running this before submission is the DESIGN.md §12.2 hour-16 gate. If it fails, we fix before moving to training.

---

## 9. Open questions

1. **OpenEnv schema version pin:** `openenv==0.2.*` in §4.5 is a placeholder. Confirm the exact current release on the hackathon kickoff morning (Apr 25) and tighten the pin; `openenv validate` schema fields may have shifted between 0.1 and 0.2.
2. **Per-worker cache divergence:** documented in §3.2 as acceptable. Re-evaluate after local load-testing — if even training hits the cross-worker 404 path > 1% of the time, switch to `--workers 1` with a bigger thread pool.
3. **HF Space CPU cold-start time:** the free CPU basic tier can sleep on idle and take 60–120 s to wake. This doc assumes Space is "always-on" because we exercise it during development; if the judge hits a cold Space, the first `/reset` may appear hung. Risk-register coverage owned by `docs/modules/risk_book.md`.
4. **`DRIFTCALL_ENV_TOKEN` rotation during the hackathon:** if the token leaks mid-judging, rotating it 401s the judge mid-run. Do we need a two-token grace period? Likely no (hackathon is 48 h and we trust submission channels), but flag for Person D's risk book.
5. **CLAUDE.md §6 `hf upload` migration:** the hackathon briefing flags `huggingface-cli` as deprecated. Update `DRIFTCALL/CLAUDE.md` §6 rows ("HF push env", "HF push dataset") to `hf upload ... --repo-type=...` in a separate small PR so this design doc doesn't diverge from the command catalogue. Own: Person D.
6. **Image-size margin vs §1.1 Whisper upgrade path:** if `docs/modules/audio.md` §1.1's WER bail-out triggers and we swap to `faster-whisper-medium`, final image grows from ~1.2 GB to ~1.8 GB. Still under the 2 GB Risk-10 bound but with less slack. Re-check image size after any audio-weights change.
7. **`/state` access control:** should `/state` require the same bearer as mutating endpoints, or should we expose a narrower "episode summary" for judges without the full vendor-states dump? Current design keeps full state behind the bearer; revisit if leaderboard ops ask for a public read-only pane.