gary-boon Claude Opus 4.5 committed
Commit ab4534a · 1 Parent(s): 343dd57

Add Devstral + DGX Spark implementation plan


Phased plan for:
- Securing GPU HF Space
- Adding Devstral model support
- Frontend dynamic layer handling
- DGX Spark deployment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs/devstral-spark-plan-full.md ADDED

# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)

> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.

## Overview

Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create the Docker deployment infrastructure for running the backend on the DGX Spark.

## Devstral Model Specifications

| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |

*VRAM is a planning guide: 24B parameters × 2 bytes (BF16) ≈ 48GB for the weights alone. Actual usage varies with max context, KV cache, batch size, and attention implementation.

## Deployment Environment Summary

| Environment | Backend Location | Frontend | Use Case |
|-------------|-----------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |

---

## Backend Model Support

### Add Devstral to Model Registry

**File:** `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,  # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0
}
```

### Create MistralAdapter

**File:** `backend/model_adapter.py`

```python
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently"""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```

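A quick smoke test ties the adapter to the registry entry above (a sketch; it assumes the `create_adapter` factory in `model_adapter.py` has been extended to return `MistralAdapter` for the `mistral` architecture, as the phased plan describes):

```python
# Hypothetical check after loading the model and tokenizer
adapter = create_adapter(model, tokenizer, "devstral-small")
assert adapter.get_num_layers() == 40
assert adapter.get_num_heads() == 32
assert adapter.get_num_kv_heads() == 8  # GQA: 4 query heads share each KV head
```
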
### Fix Hardcoded Layer Classification

**File:** `backend/model_service.py` (lines ~1505-1514)

```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```

---

## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

**File:** `components/research/VerticalPipeline.tsx`

```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = layerIdx / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

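To make the fractional boundaries concrete, here are the stage assignments the function above produces (illustrative calls only):

```typescript
// 20-layer CodeGen: layers 1-5 EARLY, 6-15 MIDDLE, 16-20 LATE
getStageInfo(5, 20);   // 0.25  → { color: 'green',  label: 'EARLY' }
getStageInfo(15, 20);  // 0.75  → { color: 'blue',   label: 'MIDDLE' }

// 40-layer Devstral: layers 1-10 EARLY, 11-30 MIDDLE, 31-40 LATE
getStageInfo(10, 40);  // 0.25  → { color: 'green',  label: 'EARLY' }
getStageInfo(31, 40);  // 0.775 → { color: 'purple', label: 'LATE' }
```
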
---

## DGX Spark Docker Deployment

### Dockerfile

```dockerfile
# Bump with care, retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Docker Compose for Spark

**File:** `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```

**Notes:**
- `/health` MUST return immediately (process up), not wait for model load
- Add a `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`

---

## Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |

---

## Hardware Requirements Summary

| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |
docs/devstral-spark-plan-phased.md ADDED

# Devstral + DGX Spark: Phased Implementation Plan

> Incremental approach: prove infrastructure first, then add model support.

## Overview

This plan breaks the Devstral + DGX Spark work into phases that can be validated independently:

0. **Phase 0**: Secure GPU HF Space + verify basic routing (make private, add HF token auth, test auth works)
1. **Phase 0.5**: Fix critical API route routing (backendFetch for key endpoints, prove GPU routing works)
2. **Phase 1**: Deploy existing CodeGen to DGX Spark (prove Docker/GPU infrastructure)
3. **Phase 2**: Add Devstral backend support, test correctness locally
4. **Phase 2b**: Frontend dynamic layer handling
5. **Phase 2c**: Wire Spark into frontend backend router + deploy Devstral to GPU HF Space
6. **Phase 3**: Deploy Devstral to DGX Spark
7. **Phase 4**: Future enhancements (optional)

---

## Existing Backend Routing Infrastructure

The frontend already has a sophisticated backend routing system that switches between multiple backends based on user settings and environment.

### Current Architecture

**File:** `visualisable-ai/lib/backend-router.ts`

```typescript
export type BackendTier = 'free' | 'premium' | 'research' | 'admin' | 'local';

export interface BackendConfig {
  url: string;
  wsUrl: string;
  tier: BackendTier;
  reason: string;
  device: 'cpu' | 'gpu' | 'spark';
  performance: { inferenceSpeed: string; concurrentUsers: string; };
}
```

**Current Backend Targets:**
| Target | URL | When Used |
|--------|-----|-----------|
| Local | `localhost:8000` | Local mode + Remote NOT enabled |
| CPU HuggingFace | `visualisable-ai-api.hf.space` | Free tier (default) |
| GPU HuggingFace | `visualisable-ai-api-gpu.hf.space` | Premium tier (gpuEnabled=true) |

**Routing Logic (from `getBackendForUser`):**
1. **Local mode + no Remote** → `localhost:8000`
2. **Local mode + Remote + GPU** → GPU HF Space
3. **Local mode + Remote + no GPU** → CPU HF Space
4. **Production + GPU** → GPU HF Space
5. **Production + no GPU** → CPU HF Space

### Admin UI Controls

**File:** `visualisable-ai/app/admin/users/page.tsx`

Two toggles per user:
- **GPU Access** (`gpuEnabled`): Routes to GPU HuggingFace Space
- **Remote** (`backendOverride: 'remote'`): In local mode, switches from localhost to HuggingFace

### Environment Variables

```bash
NEXT_PUBLIC_MODE=local                     # Enables local mode (shows Remote toggle)
NEXT_PUBLIC_API_URL=http://localhost:8000  # Local backend URL
NEXT_PUBLIC_CPU_BACKEND_URL=...            # CPU HuggingFace Space
NEXT_PUBLIC_GPU_BACKEND_URL=...            # GPU HuggingFace Space
```

### Current Gap: Server-Side API Routes

**Issue:** The `backend-router.ts` correctly determines the backend URL per user, but many Next.js API routes use a hardcoded `BACKEND_URL`:

```typescript
// These routes use hardcoded BACKEND_URL (NOT per-user routing):
// - /api/research/attention/analyze/route.ts
// - /api/proxy/[...path]/route.ts
// - /api/demos/route.ts
// - /api/vocabulary/search/route.ts
// etc.

const BACKEND_URL = process.env.BACKEND_URL || 'https://visualisable-ai-api.hf.space';
```

**Result:** Even if a user has `gpuEnabled=true`, server-side API routes still call the CPU Space.

**Fix Required:** API routes need to:
1. Get the current user via Clerk
2. Call `getBackendForUser(user)` to get the correct backend URL
3. Use that URL for the fetch

**Resolution:** Phase 0.5 fixes the critical `/api/research/attention/analyze` endpoint to prove routing works. Remaining routes are fixed in Phase 2c.

---

## Phase 0: Secure GPU HF Space + Verify Existing Routing

**Goal:** Before adding Devstral/Spark support, secure the GPU HuggingFace Space to prevent unauthorized wake-ups and cost leakage, then verify the existing CPU/GPU routing works correctly.

### The Problem

Even with API key protection, a **public** HuggingFace Space can be:
- **Discovered** - anyone can find it on HuggingFace
- **Woken up** - visiting the URL or hitting any endpoint (even one returning 401) wakes a sleeping Space
- **Kept awake** - repeated requests keep the GPU running and billing

With high-VRAM GPU tiers (L40S at ~$4/hr, A100 at ~$6/hr), this is a real cost risk.

### 0.1 Make GPU HF Space Private

On HuggingFace:
1. Go to your GPU Space settings
2. Change visibility from **Public** to **Private**
3. This prevents discovery and unauthorized access

**Note:** Private Spaces require authentication via a HuggingFace token.

**Important caveat:** Making the Space private will reduce random discovery and casual wake-ups, but any request that reaches the Space (even one returning 401 Unauthorized) can still wake it, depending on HuggingFace's behavior. Private is still the right move because it prevents casual discovery, but do not assume it is a perfect shield against all wake-ups. The sleep timeout (step 0.4) is the primary defense-in-depth measure.

### 0.2 Add Server-Side HF Token to Vercel

Add a **server-side only** HF token (no `NEXT_PUBLIC_` prefix):

**In Vercel Environment Variables:**
```
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```

Generate this token at https://huggingface.co/settings/tokens with read access to your private Space.

**Important:** Do NOT use `NEXT_PUBLIC_HF_TOKEN` - that exposes the token to the client.

### 0.3 Create Server-Only Auth Module

**Why a separate file?** In Next.js, any code imported into client components can end up in the client bundle. `backend-router.ts` contains `getBackendForUser()`, which may be imported for URL/tier decisions in client code. If we put `process.env.HF_TOKEN` in the same file, it risks being referenced from client bundles (even if tree-shaken, it's fragile).

**Solution:** Keep `backend-router.ts` as "pure decision logic" (URLs, tiers, reasons) and put all server-only headers in a separate module that is **only imported from API routes**.

**File:** `visualisable-ai/lib/backend-auth.server.ts`

```typescript
import 'server-only'; // Next.js guard: errors if accidentally imported from client

/**
 * Server-only authentication headers for backend requests.
 *
 * IMPORTANT: This file must ONLY be imported from Next.js API routes (server-side).
 * Never import this from client components or shared code.
 * The 'server-only' import above will cause a build error if this is violated.
 */

// Accept both env var names for backwards compatibility; standardise on API_KEY going forward
const API_KEY = process.env.API_KEY ||
                process.env.BACKEND_API_KEY ||
                '';

const HF_TOKEN = process.env.HF_TOKEN; // Server-side only, no NEXT_PUBLIC_ prefix

/**
 * Get base authentication headers (API key only).
 * Use this as the foundation, then add the HF token conditionally based on target.
 */
export function getBaseAuthHeaders(): HeadersInit {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };

  if (API_KEY) {
    headers['X-API-Key'] = API_KEY;
  }

  return headers;
}

/**
 * Get the HF-specific auth header (for private Spaces).
 * Only attach this when the target is a HuggingFace Space.
 */
export function getHfAuthHeader(): HeadersInit {
  return HF_TOKEN ? { Authorization: `Bearer ${HF_TOKEN}` } : {};
}

/**
 * Check if a URL is a HuggingFace Space.
 */
export function isHfSpace(url: string): boolean {
  return url.includes('.hf.space');
}
```

**Update existing `getBackendHeaders()` in `backend-router.ts`:**

Leave the existing function for backward compatibility, but remove any server-side secrets:

```typescript
// backend-router.ts - keep as client-safe decision logic only
export function getBackendHeaders(): HeadersInit {
  // Note: This function returns headers safe for client-side use.
  // For server-side requests with API keys/tokens, use backend-auth.server.ts
  return {
    'Content-Type': 'application/json',
  };
}
```

**Rule:** `HF_TOKEN` and `API_KEY` are only used in Next.js API routes (server), never in client code.

### 0.4 Configure Sleep Timeout (Defense in Depth)

On HuggingFace GPU Space settings:
- Set **Sleep timeout** to the minimum (e.g., 5 minutes of inactivity)
- This reduces cost if the Space is somehow woken unexpectedly

**Trade-off note for stakeholders:** A 5-minute sleep timeout protects cost but increases cold starts. When a GPU-enabled user makes their first request after the Space has been sleeping, they will experience a delay while the Space wakes up (container restart + model load). For Devstral (~48GB), this cold start can take several minutes. Options to mitigate:
- **Longer timeout** (e.g., 15-30 minutes) - reduces cold starts but increases cost during idle periods
- **"Keep warm" scheduled pings** - a cron job that pings `/health` every few minutes to prevent sleep (increases cost to ~continuous billing); see the sketch after this list
- **Accept cold starts** - for research/premium users who understand the trade-off

Start with 5 minutes and adjust based on usage patterns and user feedback.
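
If you later choose the keep-warm option, the idea fits in one crontab entry. A sketch, assuming the Space is private (so the ping must carry the HF token) and that `HF_TOKEN` is defined in the crontab environment:

```bash
# Keep-warm sketch: ping /health every 4 minutes so a 5-minute sleep
# timeout never fires. Note this keeps the GPU billed roughly continuously.
*/4 * * * * curl -fsS -H "Authorization: Bearer $HF_TOKEN" https://visualisable-ai-api-gpu.hf.space/health > /dev/null 2>&1
```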

### 0.5 Verify Existing Routing Works

Before proceeding to Phase 1, verify the current CPU/GPU routing is working.

**Note on user-specific tests:** Tests 1 and 2 require testing "as a specific user" because routing depends on Clerk user metadata (`gpuEnabled`). The curl examples cannot easily reproduce this. Use one of these approaches:

1. **Browser test (simplest):** Log in as each user type and trigger the endpoint via the UI, then check backend logs to confirm which backend received the request.

2. **Admin diagnostic endpoint (recommended for automation):** Add a temporary `/api/debug/backend-routing` endpoint that returns the backend URL chosen for the current user:
   ```typescript
   // app/api/debug/backend-routing/route.ts
   import { currentUser } from '@clerk/nextjs/server';
   import { getBackendForUser } from '@/lib/backend-router';

   export async function GET() {
     const user = await currentUser();
     const backend = getBackendForUser(user);
     return Response.json({
       tier: backend.tier,
       url: backend.url,
       device: backend.device,
       userEmail: user?.emailAddresses?.[0]?.emailAddress
     });
   }
   ```
   Then curl with a Clerk session cookie to test routing per-user.

3. **Clerk session token in curl:** If you have tooling to extract a Clerk session token, pass it in the request.

**Test 1: CPU HF Space (free tier user)**
```bash
# Option A: Browser test
# Log in as a user WITHOUT gpuEnabled, trigger analyze, check logs

# Option B: With diagnostic endpoint (if added)
# Log in as free tier user, then:
curl https://your-app.vercel.app/api/debug/backend-routing \
  -H "Cookie: __session=<clerk_session_cookie>"
# Expected: tier=free, url=visualisable-ai-api.hf.space
```

**Test 2: GPU HF Space (GPU-enabled user)**
```bash
# Option A: Browser test
# Log in as a user WITH gpuEnabled=true, trigger analyze, check logs

# Option B: With diagnostic endpoint (if added)
# Log in as GPU-enabled user, then:
curl https://your-app.vercel.app/api/debug/backend-routing \
  -H "Cookie: __session=<clerk_session_cookie>"
# Expected: tier=premium, url=visualisable-ai-api-gpu.hf.space
```

**Test 3: Private Space rejects unauthenticated requests**
```bash
# Direct request to GPU Space without token should fail
curl https://visualisable-ai-api-gpu.hf.space/health
# Expected: 401 Unauthorized or redirect to login (or HTML login page)
```

**Test 4: Private Space accepts authenticated requests**
```bash
# Direct request with HF token should succeed
curl -H "Authorization: Bearer hf_xxxx" \
  https://visualisable-ai-api-gpu.hf.space/health
# Expected: 200 OK
```

**Note on endpoint choice:** These tests use `/health`. Verify your backend actually serves `/health` at the root. Some HF Space setups front a Gradio app or use a different path prefix. If `/health` doesn't exist, substitute any cheap "always exists" endpoint you know is served (even `/` or a simple status endpoint). The goal is to test auth, not the specific endpoint.

**Note on private Space responses:** A private Space may return a redirect or an HTML login page rather than a neat JSON 401. Both indicate the unauthenticated request was rejected, which is what we want to verify.

### 0.6 Validation Criteria

- [ ] GPU HF Space set to **Private** on HuggingFace
- [ ] `HF_TOKEN` (server-side only) added to Vercel environment variables
- [ ] `lib/backend-auth.server.ts` created with `getBaseAuthHeaders()`, `getHfAuthHeader()`, `isHfSpace()`
- [ ] `getBackendHeaders()` in `backend-router.ts` cleaned up (no secrets)
- [ ] Sleep timeout configured on GPU Space (5 minutes recommended)
- [ ] **Test:** Direct unauthenticated request to GPU Space returns 401
- [ ] **Test:** Authenticated request via Vercel API routes succeeds
- [ ] **Test:** CPU HF Space still works for free tier users
- [ ] **Test:** GPU-enabled user requests route to GPU Space and succeed
- [ ] No changes to Devstral/Spark yet - existing CodeGen on both Spaces works

---

## Phase 0.5: Fix Critical API Route Routing

**Goal:** Before investing in Spark infrastructure (Phase 1), fix the most critical API routes to use per-user backend routing. This gives you confidence that GPU routing actually works before paying for a bigger GPU tier.

**Why now?** Phase 0 verifies that the routing logic in `getBackendForUser()` is correct and that the private Space accepts authenticated requests. But many API routes still use a hardcoded `BACKEND_URL`, so GPU-enabled users may not actually reach the GPU Space. Phase 0.5 closes this gap for the key endpoints you use to validate.

### 0.5.1 Create Minimal `backendFetch` Helper

**File:** `visualisable-ai/lib/backend-fetch.ts`

This is the **minimal** helper for simple JSON POST calls. Proxy-style routes (method forwarding, query strings, binary bodies, streaming) will be handled separately in Phase 2c with a `backendProxy` helper.

```typescript
import 'server-only'; // Prevent accidental client import

import { auth, currentUser } from '@clerk/nextjs/server';
import { getBackendForUser } from './backend-router';
import { getBaseAuthHeaders, getHfAuthHeader, isHfSpace } from './backend-auth.server';

/**
 * Fetch from the backend appropriate for the current user.
 *
 * This helper:
 * 1. Gets the current user via Clerk
 * 2. Determines the correct backend (CPU HF, GPU HF, Spark, local)
 * 3. Adds authentication headers (API key, and HF token only for HF targets)
 *
 * Use this in API routes instead of hardcoded BACKEND_URL.
 *
 * Note: For proxy-style routes that need method/query/body forwarding,
 * use backendProxy() instead (added in Phase 2c).
 */
export async function backendFetch(
  endpoint: string,
  options: RequestInit = {}
): Promise<Response> {
  const { userId } = await auth();
  const user = userId ? await currentUser() : null;
  const backend = getBackendForUser(user);

  const url = `${backend.url}${endpoint}`;

  return fetch(url, {
    ...options,
    headers: {
      ...getBaseAuthHeaders(),
      ...(isHfSpace(backend.url) ? getHfAuthHeader() : {}),
      ...options.headers,
    },
  });
}
```

### 0.5.2 Update Critical Endpoints

Choose 1-2 endpoints that you actively use for testing routing:

**Recommended:** `/api/research/attention/analyze` (the main analyze endpoint)

**File:** `visualisable-ai/app/api/research/attention/analyze/route.ts`

```typescript
import { NextRequest, NextResponse } from "next/server";
import { backendFetch } from "@/lib/backend-fetch";

export async function POST(request: NextRequest) {
  try {
    // For small JSON payloads like this, parse-then-stringify is fine.
    // For large/binary/streaming payloads, use backendProxy instead.
    const body = await request.json();
    const { prompt, max_tokens, temperature } = body;

    // Use backendFetch for per-user routing
    const response = await backendFetch('/analyze/research/attention', {
      method: 'POST',
      body: JSON.stringify({
        prompt,
        max_tokens: max_tokens || 8,
        temperature: temperature || 0.7
      })
    });

    if (!response.ok) {
      const error = await response.text();
      throw new Error(`Backend error: ${error}`);
    }

    const data = await response.json();
    return NextResponse.json(data);

  } catch (error) {
    console.error("Research attention analysis error:", error);
    return NextResponse.json(
      { error: error instanceof Error ? error.message : "Analysis failed" },
      { status: 500 }
    );
  }
}
```

### 0.5.3 Validation

Re-run the Phase 0 user-specific tests with the updated endpoint:

**Test:** A GPU-enabled user's request to `/api/research/attention/analyze` actually reaches the GPU HF Space.

How to verify:
1. Add temporary logging inside `backendFetch` (where `backend` is in scope): `console.log('Routing to:', backend.url);`
2. Or check GPU Space logs after triggering a request as a GPU-enabled user.

### 0.5.4 Validation Criteria

- [ ] `lib/backend-fetch.ts` created
- [ ] At least one critical endpoint updated to use `backendFetch`
- [ ] **Test:** GPU-enabled user's analyze request reaches GPU HF Space (verified via logs)
- [ ] **Test:** Free tier user's analyze request still goes to CPU HF Space
- [ ] Remaining API route fixes deferred to Phase 2c (lower priority)

---

## Phase 1: Deploy CodeGen to DGX Spark

**Goal:** Prove the Docker deployment infrastructure works with the existing CodeGen model.

### 1.1 Create Dockerfile

**File:** `Dockerfile`

```dockerfile
# Bump with care, retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```

### 1.2 Create Docker Compose

**File:** `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    # container_name: visualisable-ai-backend  # Uncomment for single-instance; leave commented for multi-branch
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app  # Mount repo for dev hot-reload (requires --reload in command)
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw  # Writable HF cache
      - ../runs:/app/runs  # Outputs (relative to docker/ folder)
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-codegen-350m}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      # Operational tuning (included from day one for self-documentation)
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-fp16}
      # Uncomment if experiencing CUDA memory fragmentation:
      # - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
    # Dev mode: uncomment to enable hot-reload
    # command: ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
```

**Notes:**
- `/srv/models-cache/huggingface` is the writable HF cache directory
- No `/srv/models` mount is needed for Phase 1 (CodeGen downloads to the cache)
- **Multiple branches:** Use `PORT` and Compose project names to avoid collisions:
  ```bash
  PORT=8001 docker compose -p visai-branch-a -f docker/compose.spark.yml --env-file .env.spark up -d --build
  PORT=8002 docker compose -p visai-branch-b -f docker/compose.spark.yml --env-file .env.spark up -d --build
  ```

### 1.3 Create Environment Template

**File:** `.env.spark.example`

```bash
# DGX Spark Environment Configuration
# Copy to .env.spark and fill in values

# Backend port
PORT=8000

# Default model to load
DEFAULT_MODEL=codegen-350m

# Note: fp16 is recommended for GPU runs (faster, lower VRAM).
# Use fp32 only when debugging numerical issues.

# API key for authentication (generate a secure random string)
API_KEY=your-api-key-here

# HuggingFace token (for gated models)
HF_TOKEN=your-hf-token-here

# Model cache location on Spark (must be writable)
HF_HOME=/srv/models-cache/huggingface

# Operational tuning for large models
MAX_CONTEXT=8192
BATCH_SIZE=1
TORCH_DTYPE=fp16
# Uncomment if experiencing CUDA memory fragmentation:
# PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```

### 1.4 Update .gitignore

**File:** `.gitignore` (append)

```
# Spark deployment
.env.spark
runs/*
!runs/.gitkeep
```

**Create the runs directory with placeholder:**

```bash
mkdir -p runs
touch runs/.gitkeep
git add runs/.gitkeep
```

This ensures the `runs/` folder exists in fresh clones (required by the `compose.spark.yml` volume mount `../runs:/app/runs`).

**Important:** Commit `runs/.gitkeep` in the same PR as the `.gitignore` changes.

### 1.5 Ensure /health Returns Fast and Add Debug Endpoints

**CRITICAL:** The `/health` endpoint MUST return immediately (HTTP 200) even while the model is still loading. If it blocks on model load, Compose will mark the container unhealthy during slow Devstral downloads in Phase 3.

Check the existing `/health` implementation:
- Should return `{"status": "ok"}` immediately
- Model loading status should be on a separate `/ready` endpoint

If `/health` currently blocks, add a `/ready` endpoint:
- `/health` → process is up (always fast, always 200)
- `/ready` → model is loaded and ready for inference
- Return **200** when the model is loaded and ready
- Return **503** when the model is still loading (allows `watch` to show a clear state change)

**Also add `/debug/device`** in Phase 1 so validation can verify model placement without relying on logs (a sketch of all three endpoints follows at the end of this section):
- `cuda_available`: whether CUDA is available
- `model_loaded`: whether the model is loaded
- `model_device`: the device the model is on
- `torch_dtype`: the dtype in use
- `model_id`: the loaded model ID

**Security note:** Do not return environment variables, tokens, or other secrets from `/debug/device`.

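A minimal sketch of the three endpoints (FastAPI-style; it assumes the module-level `model_manager` with `model_loaded`, `device`, `dtype`, and `model_id` attributes that the Phase 2.5 endpoints also use; adapt to the real structure of `model_service.py`):

```python
import torch
from fastapi import Response

# `app` is the existing FastAPI instance in backend/model_service.py

@app.get("/health")
def health():
    # Liveness only: must never wait on model load
    return {"status": "ok"}

@app.get("/ready")
def ready(response: Response):
    # Readiness: 200 once the model is loaded, 503 while it is still loading
    if not model_manager.model_loaded:
        response.status_code = 503
        return {"ready": False}
    return {"ready": True}

@app.get("/debug/device")
def debug_device():
    # Placement info only; never expose env vars or tokens here
    loaded = model_manager.model_loaded
    return {
        "cuda_available": torch.cuda.is_available(),
        "model_loaded": loaded,
        "model_device": str(model_manager.device) if loaded else None,
        "torch_dtype": str(model_manager.dtype) if loaded else None,
        "model_id": model_manager.model_id if loaded else None,
    }
```
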
### 1.6 Spark Prep

On the DGX Spark host:

```bash
# Create writable cache directory
sudo mkdir -p /srv/models-cache/huggingface
sudo chown -R root:dgx-ml /srv/models-cache
sudo chmod -R 2775 /srv/models-cache

# Clone repo
cd /srv/projects
git clone <repo> visualisable-ai-backend
cd visualisable-ai-backend

# Create env file
cp .env.spark.example .env.spark
vim .env.spark
```

### 1.7 Test CodeGen on Spark

```bash
# Build and run
docker compose -f docker/compose.spark.yml --env-file .env.spark up -d --build

# Check logs
docker compose -f docker/compose.spark.yml logs -f

# Verify GPU access (deterministic check, not relying on log wording)
docker compose -f docker/compose.spark.yml --env-file .env.spark exec visualisable-ai-backend \
  python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no-cuda')"

# Test endpoints
curl http://spark-c691.local:8000/health
curl http://spark-c691.local:8000/ready  # Returns 503 until model is loaded, then 200
curl -s http://spark-c691.local:8000/debug/device | python -m json.tool
curl -X POST http://spark-c691.local:8000/analyze/research/attention \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def hello():", "max_tokens": 5}'
```

### 1.8 Validation Criteria

- [ ] Container starts and `/health` returns 200 **immediately** (before model loads)
- [ ] `/health` remains fast even during model download
- [ ] CodeGen model loads successfully (check logs)
- [ ] `/ready` returns 200 after the model is loaded
- [ ] `/analyze/research/attention` returns a valid response
- [ ] CUDA is available in the container (`torch.cuda.is_available()` returns `True`)
- [ ] Model device verified via the `/debug/device` endpoint
- [ ] `.env.spark` is gitignored

---

## Phase 2: Add Devstral Backend Support

**Goal:** Add Devstral model support and validate correctness. This is a **correctness test**, not a performance test.

### 2.1 Add MistralAdapter

**File:** `backend/model_adapter.py`

```python
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently"""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers
        elif hasattr(self.model, 'layers'):
            return self.model.layers
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```

Update the factory:
```python
def create_adapter(model, tokenizer, model_id):
    config = get_model_config(model_id)
    architecture = config["architecture"]

    if architecture == "gpt_neox":
        return CodeGenAdapter(model, tokenizer, model_id)
    elif architecture == "llama":
        return CodeLlamaAdapter(model, tokenizer, model_id)
    elif architecture == "mistral":
        return MistralAdapter(model, tokenizer, model_id)
    else:
        raise ValueError(f"Unsupported architecture: {architecture}")
```

### 2.2 Add Devstral to Model Config

**File:** `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,  # Keep True to steer users to Spark
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0
}
```

**Note:** `requires_gpu: True` remains set to guide users toward Spark. CPU inference is technically possible on the Mac Studio (512GB RAM) but is painfully slow and not recommended for regular use.

### 2.3 Fix Hardcoded Layer Classification

**File:** `backend/model_service.py` (~line 1505)

```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```

### 2.4 Wire Env Vars into Model Loader

**File:** `backend/model_service.py` (in `load_model()` or `ModelManager.__init__`)

Ensure the backend reads and applies these environment variables (a sketch follows this list):

- `MAX_CONTEXT`: caps input truncation (tokenizer max_length). If requests include `max_new_tokens`, do not silently override it unless you explicitly want global caps; this prevents confusion when callers expect per-request control.
- `BATCH_SIZE`: wire in where applicable; otherwise leave as reserved for future batching (only meaningful if the service implements request batching)
- `TORCH_DTYPE`: map string to dtype:
  - `bf16` → `torch.bfloat16`
  - `fp16` → `torch.float16`
  - `fp32` → `torch.float32`

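A sketch of the wiring (illustrative; variable names and the exact spot inside `load_model()` are assumptions to adapt to the real `model_service.py`):

```python
import os
import torch

# TORCH_DTYPE: map the env string to a torch dtype (default fp16, per .env.spark.example)
_DTYPE_MAP = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}
torch_dtype = _DTYPE_MAP.get(os.environ.get("TORCH_DTYPE", "fp16"), torch.float16)

# MAX_CONTEXT: cap tokenizer truncation only; leave per-request max_new_tokens alone
max_context = int(os.environ.get("MAX_CONTEXT", "8192"))

# BATCH_SIZE: reserved until the service implements request batching
batch_size = int(os.environ.get("BATCH_SIZE", "1"))

# ...later, when loading and tokenizing (illustrative):
# model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype=torch_dtype, ...)
# inputs = tokenizer(prompt, truncation=True, max_length=max_context, return_tensors="pt")
```
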
### 2.5 Add `/models` and `/models/current` Endpoints

**File:** `backend/model_service.py`

These endpoints are required by the frontend (Phase 2b.4) and for validation (Phase 2c). Add them as explicit Phase 2 deliverables:

**`GET /models`** - List available models:
```python
@app.get("/models")
def list_models():
    """Return list of models this backend can serve."""
    return {
        "models": [
            {
                "id": model_id,
                "name": config["display_name"],
                "available": is_model_available(model_id),  # Check VRAM, etc.
                "requires_gpu": config.get("requires_gpu", False)
            }
            for model_id, config in SUPPORTED_MODELS.items()
        ]
    }
```

**`GET /models/current`** - Return the currently loaded model:
```python
@app.get("/models/current")
def current_model():
    """Return info about the currently loaded model."""
    if not model_manager.model_loaded:
        return {"id": None, "device": None, "dtype": None}
    return {
        "id": model_manager.model_id,
        "device": str(model_manager.device),
        "dtype": str(model_manager.dtype)
    }
```

**Why explicit deliverables?** Phase 2c validation depends on these endpoints. Making them "if missing" creates ambiguity. By adding them in Phase 2, the frontend work in 2b and the validation in 2c can proceed cleanly.

### 2.6 Local Validation (Correctness Only)

**Option A: Full load on Mac Studio (slow, ~96GB RAM needed)**

```bash
export DEFAULT_MODEL=devstral-small
export HF_TOKEN=your-token-here
python -m uvicorn backend.model_service:app --host 0.0.0.0 --port 8000

# Test (will be VERY slow on CPU)
curl -X POST http://localhost:8000/analyze/research/attention \
  -H "Content-Type: application/json" \
  -d '{"prompt": "def hello():", "max_tokens": 2}'
```

**Option B: Unit test without full model load**

Write a test (sketched below) that:
1. Loads the model config and verifies 40 layers
2. Checks the MistralAdapter layer access pattern
3. Validates the layer classification fractions

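A sketch of such a test (pytest-style; it assumes `SUPPORTED_MODELS` is importable and re-states the 2.3 thresholds locally; the adapter access check against a real model object would need a small stub and is left out):

```python
from backend.model_config import SUPPORTED_MODELS

def test_devstral_config_matches_spec():
    cfg = SUPPORTED_MODELS["devstral-small"]
    assert cfg["num_layers"] == 40
    assert cfg["num_heads"] == 32
    assert cfg["num_kv_heads"] == 8

def test_layer_classification_fractions_40_layers():
    # Mirrors 2.3: layer 0 positional, <=0.25 previous_token, <=0.75 induction, else semantic
    n_layers = 40

    def classify(layer_idx):
        layer_fraction = (layer_idx + 1) / n_layers
        if layer_idx == 0:
            return "positional"
        if layer_fraction <= 0.25:
            return "previous_token"
        if layer_fraction <= 0.75:
            return "induction"
        return "semantic"

    assert classify(0) == "positional"
    assert classify(9) == "previous_token"   # (9 + 1) / 40 == 0.25
    assert classify(29) == "induction"       # (29 + 1) / 40 == 0.75
    assert classify(39) == "semantic"
```
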
### 2.7 Validation Criteria

- [ ] Devstral config added to SUPPORTED_MODELS
- [ ] MistralAdapter correctly accesses layers
- [ ] Layer classification works for 40-layer model (percentage-based)
- [ ] Env vars (MAX_CONTEXT, BATCH_SIZE, TORCH_DTYPE) are wired into loader
- [ ] `/models` endpoint returns list of available models
- [ ] `/models/current` endpoint returns currently loaded model info
- [ ] One successful endpoint call (correctness, not performance)

---

## Phase 2b: Frontend Dynamic Layer Handling

**Goal:** Update frontend to handle models with different layer counts and vocab sizes.

### 2b.1 Fix Stage Boundaries

**File:** `components/research/VerticalPipeline.tsx`

Replace hardcoded layer boundaries with percentage-based:

```typescript
// Current (hardcoded for 20 layers):
const getStageInfo = (layerIdx: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  if (layerIdx <= 5) return { color: 'green', label: 'EARLY' };
  if (layerIdx <= 14) return { color: 'blue', label: 'MIDDLE' };
  if (layerIdx <= 19) return { color: 'purple', label: 'LATE' };
  return { color: 'orange', label: 'OUTPUT' };
};

// Fixed (percentage-based):
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = layerIdx / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

Update layer slice operations:

```typescript
const earlyEnd = Math.floor(numLayers * 0.25);
const middleEnd = Math.floor(numLayers * 0.75);

// EARLY LAYERS
{layersData.slice(1, earlyEnd + 1).map(...)}

// MIDDLE LAYERS
{layersData.slice(earlyEnd + 1, middleEnd + 1).map(...)}

// LATE LAYERS (JS slice end is exclusive)
{layersData.slice(middleEnd + 1, numLayers + 1).map(...)}
```

### 2b.2 Fix Hardcoded Vocabulary Display

**File:** `components/research/VerticalPipeline.tsx` (line ~305)

Replace `(51,200 tokens)` with the dynamic value from `modelInfo.vocabSize`.

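For example (a sketch; it assumes `modelInfo.vocabSize` is already available in the component):

```typescript
// Before: hardcoded for CodeGen
// <span>(51,200 tokens)</span>

// After: dynamic per model
<span>({modelInfo.vocabSize.toLocaleString()} tokens)</span>
```
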
### 2b.3 Fix Hardcoded head_dim

**File:** `components/research/SpreadsheetGrid.tsx` (if it exists)

Replace `const dHead = 64` with a dynamic calculation:
```typescript
const dHead = modelInfo.hiddenSize / modelInfo.numHeads;
if (!Number.isInteger(dHead)) {
  console.warn("Non-integer head_dim", { hiddenSize: modelInfo.hiddenSize, numHeads: modelInfo.numHeads });
}
```

### 2b.4 Dynamic Model List from Backend

If the frontend model selector is a static list, update it to populate dynamically from the backend `/models` endpoint (or similar), as sketched below. This ensures:
- Models only appear when actually available on the connected backend
- Devstral only shows when connected to Spark (not HuggingFace)

If the frontend already fetches `supported_models` from the backend, this is naturally handled.

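If it is static today, a minimal sketch of the dynamic version (assumes the Phase 2.5 `GET /models` endpoint, reached here through the existing `/api/proxy/[...path]` route; names and the exact proxy path are illustrative):

```typescript
import { useEffect, useState } from 'react';

interface BackendModel {
  id: string;
  name: string;
  available: boolean;
  requires_gpu: boolean;
}

// Hypothetical hook: populate the model selector from the connected backend
export function useBackendModels(): BackendModel[] {
  const [models, setModels] = useState<BackendModel[]>([]);
  useEffect(() => {
    fetch('/api/proxy/models') // proxies to the selected backend's GET /models
      .then(res => res.json())
      .then(data => setModels((data.models ?? []).filter((m: BackendModel) => m.available)))
      .catch(() => setModels([])); // unreachable backend → empty selector
  }, []);
  return models;
}
```
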
### 2b.5 Validation Criteria

- [ ] Stage boundaries work correctly for 40-layer model
- [ ] Vocab display shows correct value for each model
- [ ] head_dim calculated dynamically (if applicable)
- [ ] UI renders correctly with both CodeGen (20 layers) and Devstral (40 layers)
- [ ] Model selector only shows models available on the connected backend (requires Phase 2c for full test)

---

## Phase 2c: Wire Spark into Frontend Backend Router

**Goal:** Add DGX Spark as a fourth backend option in the existing routing infrastructure, and fix server-side API routes to respect per-user backend selection.

**Dependency:** Phase 2 must be complete (Devstral support merged, `/models` and `/models/current` endpoints added) before enabling Devstral as the `DEFAULT_MODEL` on the GPU HF Space.

### Important Network Constraint

**Spark is a local-network-only backend.** The hostname `spark-c691.local` is only resolvable on your local network (mDNS).

| Environment | Can reach Spark? | Notes |
|-------------|------------------|-------|
| Local dev (your machine) | ✅ Yes | Same LAN as Spark |
| Vercel production | ❌ No | Cannot resolve `.local` hostnames |
| HuggingFace Spaces | ❌ No | Cannot resolve `.local` hostnames |

**Implications:**
- Spark toggle is a **developer/research feature** for local mode only
- Production GPU users should use the **GPU HuggingFace Space** (via `gpuEnabled` toggle)
- Do NOT expose Spark to the public internet without proper security (VPN, auth, etc.)

**Spark authentication:** Spark requests are authenticated via the `X-API-Key` header (same as the local backend). The HF token is only for `.hf.space` targets and is not sent to Spark. For additional security, consider network-level protection (VPN/Tailscale), but the API key alone is sufficient for LAN-only access.

**Fallback when Spark is unreachable:** No automatic fallback initially; fail fast with a user-visible error message and a quick toggle to switch to Remote/Local. This keeps behaviour predictable: users should always know which backend they are hitting. Automatic fallback could be added later if needed, but explicit is safer for v1.

**Important:** Production deployments (Vercel) must NOT set `NEXT_PUBLIC_MODE=local`, otherwise Spark routing could incorrectly activate. Only set this in local development `.env.local` files.

**Backend routing summary:**
| Toggle | Production (Vercel) | Local Mode |
|--------|---------------------|------------|
| Neither | CPU HF Space | localhost:8000 |
| Remote | CPU HF Space | CPU HF Space |
| Remote + GPU | GPU HF Space | GPU HF Space |
| Spark | ❌ Invalid | spark-c691.local:8000 |

### 2c.1 Update Backend Router

**File:** `visualisable-ai/lib/backend-router.ts`

Add a Spark URL constant:
```typescript
const SPARK_BACKEND_URL = process.env.NEXT_PUBLIC_SPARK_BACKEND_URL ||
                          'http://spark-c691.local:8000';
```

Update the `BackendConfig.device` type to include Spark:
```typescript
device: 'cpu' | 'gpu' | 'spark';
```

Add a helper for safe WebSocket URL construction:
```typescript
function toWsUrl(httpUrl: string, wsPath: string = '/ws'): string {
  try {
    const url = new URL(httpUrl);
    url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:';
    url.pathname = url.pathname.replace(/\/$/, '') + wsPath;
    return url.toString();
  } catch {
    // Fallback for malformed URLs
    return httpUrl.replace(/^https:/, 'wss:').replace(/^http:/, 'ws:') + wsPath;
  }
}
```

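For example, `toWsUrl('https://visualisable-ai-api-gpu.hf.space')` yields `wss://visualisable-ai-api-gpu.hf.space/ws`, and `toWsUrl('http://spark-c691.local:8000')` yields `ws://spark-c691.local:8000/ws`.
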
976
+ **Note:** All current backends (localhost, HuggingFace Spaces, Spark) use `/ws` as the WebSocket path. If a future backend uses a different path, pass it as the second argument.
977
+
978
+ Update `getBackendForUser` to handle Spark routing:
979
+ ```typescript
980
+ export function getBackendForUser(user: User | null): BackendConfig {
981
+ const isLocalMode = process.env.NEXT_PUBLIC_MODE === 'local';
982
+ const localBackendUrl = process.env.NEXT_PUBLIC_API_URL || 'http://localhost:8000';
983
+
984
+ // Check user settings
985
+ const hasRemoteOverride = user?.unsafeMetadata?.backendOverride === 'remote';
986
+ const hasSparkOverride = user?.unsafeMetadata?.backendOverride === 'spark';
987
+ const hasGPUAccess = user?.unsafeMetadata?.gpuEnabled === true;
988
+
989
+ // SPARK MODE: Only valid in local mode (Spark not reachable from Vercel)
990
+ // Spark toggle is a developer/research feature for local network only
991
+ if (hasSparkOverride && isLocalMode) {
992
+ return {
993
+ url: SPARK_BACKEND_URL,
994
+ wsUrl: toWsUrl(SPARK_BACKEND_URL),
995
+ tier: 'research',
996
+ reason: 'DGX Spark backend (local network)',
997
+ device: 'spark',
998
+ performance: {
999
+ inferenceSpeed: '50-200ms',
1000
+ concurrentUsers: '10+'
1001
+ }
1002
+ };
1003
+ }
1004
+
1005
+ // LOCAL MODE: Check if we should use localhost
1006
+ if (isLocalMode && !hasRemoteOverride) {
1007
+ return {
1008
+ url: localBackendUrl,
1009
+ wsUrl: toWsUrl(localBackendUrl),
1010
+ tier: 'local' as BackendTier,
1011
+ reason: 'Local development',
1012
+ device: 'cpu',
1013
+ performance: {
1014
+ inferenceSpeed: 'Variable (local)',
1015
+ concurrentUsers: 'Unlimited (local)'
1016
+ }
1017
+ };
1018
+ }
1019
+
1020
+ // ... rest of existing logic (GPU HF, CPU HF)
1021
+ }
1022
+ ```
1023
+
1024
+ **Note:** Spark routing is gated by `isLocalMode` - even if a user has `backendOverride: 'spark'` in production, it will fall through to the HuggingFace backends.
1025
+
1026
+ **Optional extra safety:** If you want to ensure Spark is never accidentally chosen server-side (e.g., during SSR in local mode), add a client-side check:
1027
+ ```typescript
1028
+ if (hasSparkOverride && isLocalMode && typeof window !== 'undefined') {
1029
+ // Only route to Spark from client-side code
1030
+ }
1031
+ ```
1032
+ This is optional since SSR typically doesn't make backend calls, but provides defense-in-depth.
1033
+
1034
+ **Belt-and-braces option:** Since `NEXT_PUBLIC_MODE` is baked into the client bundle at build time, you could add a runtime hostname check as additional defense:
1035
+ ```typescript
1036
+ const isLocalHost = typeof window !== 'undefined' &&
1037
+ (window.location.hostname === 'localhost' || window.location.hostname === '127.0.0.1');
1038
+
1039
+ if (hasSparkOverride && isLocalMode && isLocalHost) {
1040
+ // Spark only available when actually running locally
1041
+ }
1042
+ ```
1043
+ This prevents Spark routing even if someone accidentally deploys a local-mode build.
1044
+
+ ### 2c.2 Update Admin UI
+
+ **File:** `visualisable-ai/app/admin/users/page.tsx`
+
+ Add a third toggle for Spark backend with **mutual exclusivity** (enabling Spark clears Remote, and vice versa):
+
+ ```typescript
+ const toggleSparkBackend = async (userId: string, currentValue: boolean) => {
+   const user = users.find(u => u.id === userId);
+   if (!user) return;
+
+   const newValue = !currentValue;
+
+   // Optimistically update UI - clear Remote if enabling Spark
+   setUsers(prevUsers => prevUsers.map(u => {
+     if (u.id === userId) {
+       return {
+         ...u,
+         unsafeMetadata: {
+           ...u.unsafeMetadata,
+           // Mutual exclusivity: Spark and Remote cannot both be set
+           backendOverride: newValue ? 'spark' : undefined
+         }
+       };
+     }
+     return u;
+   }));
+
+   // ... API call to persist (same pattern as toggleRemoteBackend)
+ };
+
+ // Also update toggleRemoteBackend to clear Spark when enabling Remote:
+ const toggleRemoteBackend = async (userId: string, currentValue: boolean) => {
+   // ... existing code ...
+   // Mutual exclusivity: backendOverride can only be 'remote', 'spark', or undefined
+   backendOverride: newValue ? 'remote' : undefined
+ };
+ ```
+
+ **Only show Spark toggle in local mode** (it's not useful in production):
+ ```tsx
+ {isLocalMode && (
+   <th className="px-6 py-3 text-left text-xs font-medium text-gray-400 uppercase tracking-wider">
+     Spark
+   </th>
+ )}
+
+ // In row:
+ {isLocalMode && (
+   <td className="px-6 py-4 whitespace-nowrap">
+     <button
+       onClick={() => toggleSparkBackend(user.id, hasSparkOverride)}
+       className={`relative inline-flex h-6 w-11 items-center rounded-full transition-colors cursor-pointer hover:opacity-80 ${
+         hasSparkOverride ? 'bg-orange-600' : 'bg-gray-700'
+       }`}
+       title="Use DGX Spark backend (requires local network access)"
+     >
+       <span className={`inline-block h-4 w-4 transform rounded-full bg-white transition-transform ${
+         hasSparkOverride ? 'translate-x-6' : 'translate-x-1'
+       }`} />
+     </button>
+   </td>
+ )}
+ ```
+
+ ### 2c.3 Fix Server-Side API Routes
+
+ **Critical:** Some API routes bypass per-user routing by using a hardcoded `BACKEND_URL`.
+
+ **Routes already correct** (use `getBackendForUser()` + `getBackendHeaders()`):
+ - `app/api/generate/route.ts` ✅
+ - `app/api/swe-bench/route.ts` ✅
+
+ **Routes to update:**
+ - `app/api/research/attention/analyze/route.ts`
+ - `app/api/proxy/[...path]/route.ts`
+ - `app/api/demos/route.ts`
+ - `app/api/demos/run/route.ts`
+ - `app/api/vocabulary/search/route.ts`
+ - `app/api/vocabulary/browse/route.ts`
+ - `app/api/token/metadata/route.ts`
+ - `app/api/backend/[...path]/route.ts`
+
+ **Pattern to apply:**
+
+ By Phase 2c, `lib/backend-fetch.ts` already exists (created in Phase 0.5). Use the appropriate helper:
+
+ - **`backendFetch(endpoint, options)`** - For simple JSON POST calls (most routes)
+ - **`backendProxy(request, endpointPath)`** - For pass-through proxy routes (added below)
+
+ **For simple JSON routes:**
+ ```typescript
+ import { NextRequest } from 'next/server';
+ import { backendFetch } from '@/lib/backend-fetch';
+
+ export async function POST(request: NextRequest) {
+   const body = await request.json();
+   const response = await backendFetch('/some/endpoint', {
+     method: 'POST',
+     body: JSON.stringify(body)
+   });
+   // ...
+ }
+ ```
+
+ **For proxy routes** (e.g., `/api/proxy/[...path]`, `/api/backend/[...path]`):
+
+ Add `backendProxy` to `lib/backend-fetch.ts` (extending the file created in Phase 0.5):
+
+ ```typescript
+ // lib/backend-fetch.ts - ADD to existing file (imports already present from Phase 0.5)
+ // Add this import at the top:
+ import { NextRequest } from 'next/server';
+
+ /**
+  * Proxy a request to the backend with full pass-through.
+  *
+  * Handles:
+  * - Method forwarding (GET, POST, PUT, DELETE, etc.)
+  * - Query string forwarding
+  * - Body forwarding (including binary)
+  * - Header pass-through (excluding hop-by-hop headers)
+  * - Returns raw Response for streaming
+  *
+  * Use for catch-all proxy routes like /api/proxy/[...path].
+  *
+  * @param request - The incoming Next.js request
+  * @param endpointPath - Path to forward to (must NOT include query string)
+  */
+ export async function backendProxy(
+   request: NextRequest,
+   endpointPath: string
+ ): Promise<Response> {
+   const { userId } = await auth();
+   const user = userId ? await currentUser() : null;
+   const backend = getBackendForUser(user);
+
+   // Build URL with query string from original request
+   // Note: endpointPath should be a clean path without query string
+   const url = new URL(endpointPath, backend.url);
+   url.search = request.nextUrl.search;
+
+   // Headers to exclude:
+   // - hop-by-hop headers (not meant to be forwarded)
+   // - auth headers (we add our own server-side auth, don't leak client tokens)
+   // - proxy/CDN headers (avoid confusing upstream, keep logs clean)
+   // - content-length (let fetch recalculate for streaming body)
+   const excludeHeaders = new Set([
+     'host', 'connection', 'keep-alive', 'transfer-encoding',
+     'te', 'trailer', 'upgrade', 'proxy-authorization', 'proxy-authenticate',
+     'authorization', 'cookie', // Don't forward client auth to backend
+     'x-forwarded-for', 'x-forwarded-proto', 'x-forwarded-host', // Proxy headers
+     'cf-connecting-ip', 'cf-ray', 'cf-ipcountry', // Cloudflare headers
+     'content-length' // Let fetch set this for streaming body
+   ]);
+
+   // Forward headers (except those excluded above)
+   const forwardHeaders: Record<string, string> = {};
+   request.headers.forEach((value, key) => {
+     if (!excludeHeaders.has(key.toLowerCase())) {
+       forwardHeaders[key] = value;
+     }
+   });
+
+   // Merge with auth headers (auth headers take precedence)
+   // Only attach HF token for HuggingFace Space targets
+   const headers = {
+     ...forwardHeaders,
+     ...getBaseAuthHeaders(),
+     ...(isHfSpace(backend.url) ? getHfAuthHeader() : {}),
+   };
+
+   // Forward body for methods that have one
+   const hasBody = !['GET', 'HEAD'].includes(request.method);
+   const body = hasBody ? request.body : undefined;
+
+   return fetch(url.toString(), {
+     method: request.method,
+     headers,
+     body,
+     // @ts-expect-error: duplex is required for streaming body but not in types
+     duplex: hasBody ? 'half' : undefined,
+   });
+ }
+ ```
+
1230
+ **Usage in proxy routes:**
1231
+ ```typescript
1232
+ import { NextRequest } from 'next/server';
1233
+ import { backendProxy } from '@/lib/backend-fetch';
1234
+
1235
+ // IMPORTANT: Use Node runtime for streaming body support (duplex: 'half')
1236
+ export const runtime = 'nodejs';
1237
+
1238
+ // app/api/proxy/[...path]/route.ts
1239
+ export async function GET(request: NextRequest, { params }: { params: { path: string[] } }) {
1240
+ // params.path is clean (no query string) - query comes from request.nextUrl.search
1241
+ const endpointPath = '/' + params.path.join('/');
1242
+ return backendProxy(request, endpointPath);
1243
+ }
1244
+
1245
+ export async function POST(request: NextRequest, { params }: { params: { path: string[] } }) {
1246
+ const endpointPath = '/' + params.path.join('/');
1247
+ return backendProxy(request, endpointPath);
1248
+ }
1249
+ // ... same for PUT, DELETE, etc.
1250
+ ```
1251
+
+ **Implementation notes:**
+ - **Runtime requirement:** **All** routes using `backendProxy` must use `export const runtime = 'nodejs'` because `request.body` streaming with `duplex: 'half'` requires Node (not Edge). This includes `/api/proxy/[...path]`, `/api/backend/[...path]`, and any other catch-all proxy routes.
+ - **Authentication is centralized:** Both helpers use `getBaseAuthHeaders()` (API key) and conditionally add `getHfAuthHeader()` (HF token) based on the `isHfSpace()` check.
+ - **HF token only for HF backends:** The `isHfSpace()` check ensures the HF token is only sent to `.hf.space` URLs. This keeps Spark and localhost logs clean and avoids sending credentials to non-HF targets.
+ - **Streaming works automatically:** `backendProxy` returns the raw `Response` without consuming the body.
+ - **Body handling:** Uses `request.body` directly (ReadableStream) with `duplex: 'half'` for streaming request bodies.
+
+ ### 2c.4 Add Environment Variables
+
+ **File:** `visualisable-ai/.env.local` (local development only)
+
+ ```bash
+ # DGX Spark backend URL (for local network access)
+ NEXT_PUBLIC_SPARK_BACKEND_URL=http://spark-c691.local:8000
+
+ # Enable local mode (shows Spark toggle, allows localhost backend)
+ NEXT_PUBLIC_MODE=local
+ ```
+
+ **File:** `visualisable-ai/.env.example` (document but don't set values)
+
+ ```bash
+ # DGX Spark backend URL (for local network access)
+ # NEXT_PUBLIC_SPARK_BACKEND_URL=http://spark-c691.local:8000
+
+ # Local mode - ONLY set in .env.local, NEVER in production
+ # NEXT_PUBLIC_MODE=local
+ ```
+
+ **⚠️ CRITICAL: Do NOT define `NEXT_PUBLIC_MODE` in Vercel**
+
+ This is a belt-and-braces safety measure:
+ - Only define `NEXT_PUBLIC_MODE=local` in `.env.local` (local development)
+ - **Never** add it to Vercel environment variables
+ - This makes accidental Spark exposure impossible, even if someone toggles user metadata incorrectly
+
+ If `NEXT_PUBLIC_MODE` is undefined in production, Spark routing is disabled regardless of user settings.
+
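+ As a sketch, the gating constants in `backend-router.ts` reduce to reading these build-time values (the fallback URL mirrors the value above; treat the exact constant names as assumptions about the existing router code):
+
+ ```typescript
+ // Sketch: NEXT_PUBLIC_* values are inlined at build time, so a production
+ // build (where NEXT_PUBLIC_MODE is unset) compiles isLocalMode to false.
+ const isLocalMode = process.env.NEXT_PUBLIC_MODE === 'local';
+
+ // Spark URL is only meaningful in local mode; the fallback is an assumption.
+ const SPARK_BACKEND_URL =
+   process.env.NEXT_PUBLIC_SPARK_BACKEND_URL ?? 'http://spark-c691.local:8000';
+ ```
+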
+ ### 2c.5 Update TierIndicator (Optional)
+
+ **File:** `visualisable-ai/components/TierIndicator.tsx`
+
+ Add Spark-specific display if the component shows current backend:
+ ```tsx
+ if (device === 'spark') {
+   return { icon: <Cpu />, label: 'Spark', color: 'orange' };
+ }
+ ```
+
+ ### 2c.6 Toggle Behavior Notes
+
+ The three toggles should be mutually exclusive for `backendOverride`:
+ - **Remote** → `backendOverride: 'remote'` (uses HuggingFace)
+ - **Spark** → `backendOverride: 'spark'` (uses DGX Spark, local mode only)
+ - **Neither** → `backendOverride: undefined` (uses localhost in local mode)
+
+ **GPU Access** remains independent; it controls which HuggingFace Space to use when Remote is enabled.
+
+ The code in 2c.2 handles mutual exclusivity by using a single `backendOverride` field that can only hold one value.
+
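+ One way to keep that invariant explicit is to model the override as a single union type rather than separate booleans. A minimal sketch (the type and helper names are illustrative, not existing code):
+
+ ```typescript
+ // Illustrative only: one field, at most one override active at a time.
+ type BackendOverride = 'remote' | 'spark' | undefined;
+
+ // Toggling an override clears the other one implicitly, because the
+ // field can only hold a single value.
+ function applyOverride(
+   current: BackendOverride,
+   toggled: 'remote' | 'spark'
+ ): BackendOverride {
+   return current === toggled ? undefined : toggled;
+ }
+ ```
+
+ With this shape, `applyOverride('remote', 'spark')` returns `'spark'`, matching the clear-on-enable behavior implemented in 2c.2.
+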
+ ### 2c.7 Verify /models Endpoints (Added in Phase 2)
+
+ The frontend model selector (Phase 2b.4) depends on the `/models` and `/models/current` endpoints added in Phase 2.5. Verify these endpoints work correctly on all backends and return:
+
+ ```json
+ {
+   "models": [
+     {
+       "id": "codegen-350m",
+       "name": "CodeGen 350M",
+       "available": true,
+       "requires_gpu": false
+     },
+     {
+       "id": "devstral-small",
+       "name": "Devstral Small 24B",
+       "available": true,
+       "requires_gpu": true
+     }
+   ]
+ }
+ ```
+
+ **Model availability by backend:**
+ | Model | CPU HF Space | GPU HF Space | Spark |
+ |-------|--------------|--------------|-------|
+ | CodeGen | ✅ available (default) | ✅ available | ✅ available |
+ | Devstral | ❌ unavailable | ✅ available (default) | ✅ available |
+
+ **Production model strategy:**
+ - **CPU HF Space**: CodeGen only (free tier users)
+ - **GPU HF Space**: Devstral as default (GPU-enabled users get Devstral automatically)
+ - **Spark**: Both models available (local development/research)
+
+ **Verify `/models/current` endpoint** (added in Phase 2.5) returns the currently loaded model:
+
+ ```json
+ {
+   "id": "devstral-small",
+   "device": "cuda",
+   "dtype": "bf16"
+ }
+ ```
+
+ This is used for:
+ - Frontend to know which model is active without parsing the `/models` list
+ - Debugging to quickly verify which model a backend is running
+ - The model_id acceptance test in Phase 2c validation
+
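+ A minimal sketch of how the frontend selector might consume these two endpoints, assuming the response shapes shown above (the helper name and the way `baseUrl` is obtained are illustrative):
+
+ ```typescript
+ interface ModelInfo {
+   id: string;
+   name: string;
+   available: boolean;
+   requires_gpu: boolean;
+ }
+
+ // Fetch the model list and the currently loaded model from the connected backend.
+ async function loadModelSelectorState(baseUrl: string) {
+   const [modelsRes, currentRes] = await Promise.all([
+     fetch(`${baseUrl}/models`),
+     fetch(`${baseUrl}/models/current`),
+   ]);
+   const { models } = (await modelsRes.json()) as { models: ModelInfo[] };
+   const current = (await currentRes.json()) as { id: string; device: string; dtype: string };
+
+   return {
+     // Only offer models this backend can actually serve.
+     selectable: models.filter(m => m.available),
+     currentId: current.id,
+   };
+ }
+ ```
+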
+ ### 2c.8 Configure GPU HuggingFace Space for Devstral
+
+ **Prerequisites:** The GPU HF Space must have sufficient hardware to run Devstral.
+
+ **Minimum hardware:**
+ - L40S (48GB VRAM) - minimum viable
+ - A100 (80GB VRAM) - recommended for headroom
+
+ **Environment configuration for GPU HF Space:**
+ ```bash
+ DEFAULT_MODEL=devstral-small
+ TORCH_DTYPE=bf16
+ ```
+
+ **How it works:**
+ 1. User has `gpuEnabled=true` in their profile
+ 2. Frontend router sends requests to the GPU HF Space URL
+ 3. GPU HF Space has `DEFAULT_MODEL=devstral-small`, so Devstral loads on startup
+ 4. `/models` endpoint returns `devstral-small` with `available: true`
+ 5. User automatically uses Devstral without touching the model selector
+
+ **Backend decides default (Approach 1 - recommended):**
+ The simplest approach is to let each backend decide its own default model via the `DEFAULT_MODEL` environment variable:
+ - CPU HF Space: `DEFAULT_MODEL=codegen-350m`
+ - GPU HF Space: `DEFAULT_MODEL=devstral-small`
+
+ No frontend logic needed - GPU-enabled users automatically get Devstral because that's what the GPU backend loads.
+
+ **Important: Frontend must not force a model_id**
+
+ For this to work, the frontend must NOT hardcode `model_id=codegen-350m` in API requests. Either:
+ 1. **Omit `model_id`** from requests entirely - backend uses `DEFAULT_MODEL`
+ 2. **Use backend's reported default** - fetch from `/models/current` or `/models` endpoint
+ 3. **Respect user selection** - if user explicitly picks a model, use that
+
+ Check existing API calls (e.g., `/analyze/research/attention`, `/generate`) to ensure they don't always send a static `model_id`. If they do, update them to omit it or use the backend's default - see the sketch below.
+
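+ A minimal sketch of the omit-unless-selected pattern (the function name and `userSelectedModelId` are hypothetical; the `model_id` field matches the examples above):
+
+ ```typescript
+ // Build a request body that includes model_id only when the user explicitly
+ // picked a model; otherwise the backend's DEFAULT_MODEL applies.
+ function buildRequestBody(prompt: string, userSelectedModelId?: string) {
+   return {
+     prompt,
+     ...(userSelectedModelId ? { model_id: userSelectedModelId } : {}),
+   };
+ }
+ ```
+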
+ **Verification steps (do these in 2c-Step-1):**
+ 1. **Grep for hardcoded model_id:** Search the Next.js app for `model_id`, `codegen`, and `codegen-350m` to find any hardcoded references.
+ 2. **Check backend default behavior:** Confirm the backend uses `DEFAULT_MODEL` when `model_id` is omitted from requests. Test with a curl that omits `model_id` and verify it uses the expected default.
+
+ ### 2c.9 HuggingFace Space Deployment Mechanics
+
+ **How deployment works:** The backend is deployed to HuggingFace Spaces via GitHub Actions.
+
+ 1. **Repository:** Backend code lives in `visualisable-ai-backend` repo
+ 2. **Trigger:** Push to `main` branch triggers GitHub Actions workflow
+ 3. **Workflow:** `.github/workflows/security-check.yml` (job: `deploy-to-huggingface`) pushes code to both HF Space git remotes
+ 4. **Space rebuild:** HuggingFace automatically rebuilds the Space when it receives the push
+
+ **Current deployment targets:**
+ - **CPU Space:** `visualisable-ai/api` → `https://huggingface.co/spaces/visualisable-ai/api`
+ - **GPU Space:** `visualisable-ai/api-gpu` → `https://huggingface.co/spaces/visualisable-ai/api-gpu`
+
+ **Key files:**
+ - `.github/workflows/security-check.yml` - security checks + deployment workflow
+ - `Dockerfile` - HF Space build configuration (already exists in repo root)
+ - Space settings on HuggingFace - environment variables, hardware tier, visibility
+
+ **To deploy Devstral to GPU HF Space:**
+ 1. Ensure Phase 2 changes (Devstral support) are merged to `main`
+ 2. GitHub Actions deploys to the Space automatically
+ 3. In HuggingFace Space settings:
+    - Set `DEFAULT_MODEL=devstral-small`
+    - Set `TORCH_DTYPE=bf16`
+    - Upgrade hardware tier to L40S (48GB) or A100 (80GB)
+    - Ensure Space is **Private** (from Phase 0)
+ 4. Space rebuilds and loads Devstral on startup
+
+ **Secrets configuration:**
+ - HuggingFace Space variables are set in Space Settings > Variables
+ - GitHub Actions secrets (for pushing to HF) are in repo Settings > Secrets
+ - Vercel env vars (for API routes) are separate from HF Space vars
+
+ ### 2c.10 Recommended Implementation Order
+
+ To reduce risk, implement Phase 2c in two sub-steps:
+
+ **2c-Step-1: Fix per-user routing (CPU HF vs GPU HF)**
+ - Create `lib/backend-fetch.ts` helper
+ - Update all API routes to use `backendFetch`
+ - Test: GPU toggle correctly routes to GPU HuggingFace Space
+ - This is pure production correctness, no new features
+
+ **2c-Step-2: Add Spark as extra backend option**
+ - Add Spark to `backend-router.ts` (gated by local mode)
+ - Add Spark toggle to admin UI (local mode only)
+ - Test: Spark toggle routes to `spark-c691.local:8000`
+ - This is a local-only developer feature
+
+ ### 2c.11 Validation Criteria
+
+ **Step 1 (Production correctness):**
+ - [ ] `lib/backend-fetch.ts` helper created (with `backendProxy` for proxy routes)
+ - [ ] **All proxy routes** have `export const runtime = 'nodejs'`
+ - [ ] **All API routes updated** to use per-user backend routing (no more hardcoded `BACKEND_URL`)
+ - [ ] **Grep verification:** No hardcoded `model_id=codegen-350m` found in frontend code
+ - [ ] **Backend verification:** Backend uses `DEFAULT_MODEL` when `model_id` is omitted (test with curl)
+ - [ ] **Acceptance test:** Enable GPU toggle (with Remote), confirm requests go to GPU HuggingFace Space
+ - [ ] `/models` endpoint exists on backend and returns available models
+ - [ ] GPU HF Space configured with `DEFAULT_MODEL=devstral-small` and sufficient VRAM (L40S minimum)
+ - [ ] **Acceptance test:** GPU-enabled user in production automatically uses Devstral (no model selector interaction needed)
+ - [ ] **Acceptance test (model_id verification):** As a GPU-enabled user in production (see the sketch after this list):
+   1. Call `/models/current` via your Vercel API route (or hit GPU HF Space directly with auth)
+   2. Expect: `id=devstral-small`, `device=cuda`, `dtype=bf16`
+   3. This proves no hidden `model_id=codegen-350m` is being sent and Devstral is active
+
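+ A minimal sketch of that model_id acceptance check as a standalone script (BASE_URL and the proxy path are assumptions about your deployment; in production this route sits behind auth, so run it with a valid session or hit the Space directly with credentials):
+
+ ```typescript
+ // Acceptance check: a GPU-enabled user should see Devstral active on GPU.
+ const BASE_URL = process.env.BASE_URL ?? 'https://your-app.example.com'; // placeholder
+
+ async function checkGpuDevstral(): Promise<void> {
+   // Path assumes /models/current is reachable via the catch-all proxy route.
+   const res = await fetch(`${BASE_URL}/api/proxy/models/current`);
+   const current = (await res.json()) as { id: string; device: string; dtype: string };
+
+   const ok =
+     current.id === 'devstral-small' &&
+     current.device === 'cuda' &&
+     current.dtype === 'bf16';
+
+   console.log(ok ? 'PASS: Devstral active on GPU' : `FAIL: got ${JSON.stringify(current)}`);
+ }
+
+ checkGpuDevstral();
+ ```
+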
+ **Step 2 (Spark local-only feature):**
+ - [ ] `NEXT_PUBLIC_SPARK_BACKEND_URL` environment variable added
+ - [ ] Backend router recognizes `backendOverride: 'spark'` (only in local mode)
+ - [ ] Admin UI shows Spark toggle (only in local mode)
+ - [ ] Spark toggle is mutually exclusive with Remote toggle
+ - [ ] TierIndicator shows correct status for Spark connection
+ - [ ] **Acceptance test (local mode):** Enable Spark toggle, confirm requests go to `spark-c691.local:8000`
+ - [ ] **Acceptance test (local mode):** Switch between Local/Remote/Spark, confirm correct backend is used each time
+ - [ ] **Acceptance test (production):** Spark toggle has no effect (falls through to HuggingFace)
+
+ ---
+
+ ## Phase 3: Deploy Devstral to DGX Spark
+
+ **Goal:** Run Devstral on DGX Spark with GPU acceleration (BF16).
+
+ ### 3.1 Update Spark Environment
+
+ ```bash
+ # On Spark, update .env.spark
+ DEFAULT_MODEL=devstral-small
+ TORCH_DTYPE=bf16
+ MAX_CONTEXT=8192
+ BATCH_SIZE=1
+ ```
+
+ ### 3.2 Rebuild and Deploy
+
+ ```bash
+ cd /srv/projects/visualisable-ai-backend
+ git pull
+ docker compose -f docker/compose.spark.yml --env-file .env.spark up -d --build
+ ```
+
+ ### 3.3 Monitor First Load
+
+ First load will download ~48GB of model weights. Monitor with:
+
+ ```bash
+ # Watch logs
+ docker compose -f docker/compose.spark.yml logs -f
+
+ # Check health (should return fast even during download)
+ watch -n 5 'curl -s http://spark-c691.local:8000/health'
+
+ # Check readiness (will fail until model loaded)
+ watch -n 10 'curl -s http://spark-c691.local:8000/ready'
+ ```
+
+ **Note:** First download can take a significant amount of time depending on network speed. Disk usage will spike in `/srv/models-cache/huggingface` during download (~48GB for Devstral weights). Ensure sufficient disk space is available before starting.
+
+ ### 3.4 Verify Model on GPU
+
+ Use the `/debug/device` endpoint (added in Phase 1.5) to verify the model is on GPU:
+
+ ```bash
+ curl -s http://spark-c691.local:8000/debug/device | python -m json.tool
+ ```
+
+ Expected response should show `model_device: "cuda:0"` (or similar CUDA device).
+
+ **Why not `python -c` exec?** Importing the module in a separate process creates a fresh manager instance with no model loaded, so it won't reflect the state of the running Uvicorn process. An HTTP endpoint queries the actual running service.
+
+ ### 3.5 Validation Criteria
+
+ - [ ] `/health` returns 200 fast even during model download
+ - [ ] `/ready` returns 200 after model is loaded
+ - [ ] Devstral loads on GPU (verified via deterministic check, not just logs)
+ - [ ] Memory usage is ~48GB VRAM (BF16)
+ - [ ] Inference is fast (GPU-accelerated, <5s for small prompts)
+ - [ ] Analysis endpoint works with Devstral
+ - [ ] Frontend displays 40 layers correctly with proper stage labels
+
+ ---
+
+ ## Phase 4: Future Enhancements (Optional)
+
+ **Note:** Devstral on GPU HuggingFace Space is now a **required** part of Phase 2c (for GPU-enabled production users). This phase covers additional optional enhancements.
+
+ ### 4.1 Runtime Model Switching
+
+ **Current approach:** One-model-per-deployment. Each backend loads a single model on startup via the `DEFAULT_MODEL` environment variable. This is simpler and keeps memory predictable.
+
+ **Future option:** Add a `POST /models/load` endpoint for runtime model switching:
+ ```python
+ @app.post("/models/load")
+ def load_model(model_id: str):
+     """Load a different model at runtime."""
+     # Unload current model
+     # Load new model
+     # Return new model info
+ ```
+
+ **Trade-offs:**
+ - Useful for research (switch models without redeploying)
+ - Adds complexity: queueing, load state management, eviction, edge cases (requests arriving mid-load)
+ - Memory management becomes more complex with multiple large models
+
+ **Recommendation:** Keep one-model-per-deployment for v1. Add runtime switching only if there's a clear need.
+
+ ### 4.2 Quantized Devstral Variant
+
+ **Not applicable for this project.** This is PhD research requiring full-precision BF16 for accurate attention pattern analysis. Quantization introduces numerical artifacts that would compromise research validity.
+
+ For reference, if quantization were acceptable:
+ - 4-bit GPTQ or AWQ quantization reduces VRAM to ~12-16GB
+ - Allows running on smaller GPU tiers (T4, L4)
+ - Trade-off: quality loss makes this unsuitable for research purposes
+
+ ### 4.3 Additional Deployment Targets
+
+ Other optional deployment targets:
+ - **Third HF Space** for specific use cases (e.g., research-only access)
+ - **Self-hosted Kubernetes** with auto-scaling
+ - **Modal/RunPod** for burst capacity
+
+ ### 4.4 Entrypoint Consistency
+
+ The codebase has two service paths:
+ - **Spark backend**: `backend.model_service:app` on port 8000
+ - **HuggingFace wrapper**: `app:app` on port 7860
+
+ If adding new deployment targets, ensure they use consistent entrypoints and expose the same API surface.
+
+ ---
+
+ ## Rollback Procedures
+
+ If a deployment fails or causes issues, use these rollback procedures:
+
+ ### HuggingFace Space Rollback
+
+ **Option A: Revert via GitHub**
+ 1. Revert the problematic commit on `main` branch
+ 2. Push the revert - GitHub Actions will redeploy the previous version
+ 3. In HF Space settings, change `DEFAULT_MODEL` back if needed
+
+ **Option B: Manual Space revert**
+ 1. Go to HuggingFace Space > Files > History
+ 2. Find the last known good commit
+ 3. Click "Revert to this version"
+ 4. Update environment variables if needed
+
+ **Option C: Change model without redeploying**
+ 1. In HF Space settings, change `DEFAULT_MODEL=codegen-350m`
+ 2. Restart the Space (Settings > Restart)
+ 3. Space will reload with CodeGen instead of Devstral
+
+ ### DGX Spark Rollback
+
+ **Quick rollback (change model):**
+ ```bash
+ # On Spark host
+ cd /srv/projects/visualisable-ai-backend
+
+ # Edit .env.spark to change DEFAULT_MODEL
+ vim .env.spark
+ # Change: DEFAULT_MODEL=codegen-350m
+
+ # Restart container
+ docker compose -f docker/compose.spark.yml --env-file .env.spark up -d
+ ```
+
+ **Full rollback (previous code version):**
+ ```bash
+ # On Spark host
+ cd /srv/projects/visualisable-ai-backend
+
+ # Find the last known good commit
+ git log --oneline -10
+
+ # Reset to that commit
+ git checkout <commit-hash>
+
+ # Rebuild and restart
+ docker compose -f docker/compose.spark.yml --env-file .env.spark up -d --build
+ ```
+
+ **Rollback to previous Docker image (if tagged):**
+ ```bash
+ # If you tagged the previous working image
+ docker compose -f docker/compose.spark.yml --env-file .env.spark down
+ docker run -d --gpus all -p 8000:8000 --env-file .env.spark visualisable-ai-backend:last-known-good
+ ```
+
+ ---
+
+
+ ## Monitoring
+
+ Lightweight monitoring approach for v1:
+
+ ### Health Checks
+
+ All backends expose `/health` (process alive) and `/ready` (model loaded):
+
+ ```bash
+ # Quick status check
+ curl -s http://spark-c691.local:8000/health | jq
+ curl -s http://spark-c691.local:8000/ready | jq
+ curl -s http://spark-c691.local:8000/debug/device | jq
+ ```
+
+ ### Uptime Monitoring
+
+ For Spark (local network), use a simple cron job or uptime check:
+
+ ```bash
+ # Add to crontab on a machine that can reach Spark
+ */5 * * * * curl -sf http://spark-c691.local:8000/health > /dev/null || echo "Spark down" | mail -s "Alert: Spark unhealthy" you@example.com
+ ```
+
+ For HuggingFace Spaces:
+ - Use HuggingFace's built-in Space status monitoring
+ - Or set up an external uptime monitor (UptimeRobot, Pingdom, etc.) to check the Space URL
+
+ ### Frontend Status Indicator
+
+ In the app, show backend connection status based on `/health` and `/ready`:
+ - **Connected** (green): `/health` returns 200, `/ready` returns 200
+ - **Loading** (yellow): `/health` returns 200, `/ready` returns 503
+ - **Unreachable** (red): `/health` fails or times out
+
+ This gives users visibility into backend state without needing server-side monitoring.
+
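+ A minimal sketch of that three-state mapping (the type and helper names are illustrative; the 5-second timeout is an arbitrary choice):
+
+ ```typescript
+ type BackendStatus = 'connected' | 'loading' | 'unreachable';
+
+ // Map /health and /ready responses to the three UI states described above.
+ async function getBackendStatus(baseUrl: string): Promise<BackendStatus> {
+   try {
+     const health = await fetch(`${baseUrl}/health`, { signal: AbortSignal.timeout(5000) });
+     if (!health.ok) return 'unreachable';
+
+     const ready = await fetch(`${baseUrl}/ready`, { signal: AbortSignal.timeout(5000) });
+     return ready.ok ? 'connected' : 'loading'; // /ready returns 503 while the model loads
+   } catch {
+     return 'unreachable'; // network error or timeout
+   }
+ }
+ ```
+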
+ ---
+
+ ## Summary: Files to Create/Modify
+
+ ### Phase 0 (Secure GPU HF Space + Verify Basic Routing)
+ | File | Action |
+ |------|--------|
+ | `visualisable-ai/lib/backend-auth.server.ts` | CREATE (getBaseAuthHeaders, getHfAuthHeader, isHfSpace) |
+ | `visualisable-ai/lib/backend-router.ts` | MODIFY (remove secrets from getBackendHeaders) |
+ | **Vercel Environment** | ADD `HF_TOKEN` (server-side only) |
+ | **HuggingFace GPU Space** | CONFIGURE (set to Private, configure sleep timeout) |
+
+ ### Phase 0.5 (Fix Critical API Routing + Prove GPU Routing)
+ | File | Action |
+ |------|--------|
+ | `visualisable-ai/lib/backend-fetch.ts` | CREATE (per-user backend fetch helper) |
+ | `visualisable-ai/app/api/research/attention/analyze/route.ts` | MODIFY (use backendFetch) |
+
+ ### Phase 1 (Infrastructure)
+ | File | Action |
+ |------|--------|
+ | `Dockerfile` | CREATE |
+ | `docker/compose.spark.yml` | CREATE |
+ | `.env.spark.example` | CREATE |
+ | `.gitignore` | MODIFY (add .env.spark, runs/) |
+ | `backend/model_service.py` | MODIFY (ensure /health is fast, add /ready, add /debug/device) |
+
+ ### Phase 2 (Devstral Backend Support)
+ | File | Action |
+ |------|--------|
+ | `backend/model_adapter.py` | MODIFY (add MistralAdapter) |
+ | `backend/model_config.py` | MODIFY (add devstral-small) |
+ | `backend/model_service.py` | MODIFY (fix layer classification, wire env vars, add /models and /models/current endpoints) |
+
+ ### Phase 2b (Frontend Dynamic Handling)
+ | File | Action |
+ |------|--------|
+ | `components/research/VerticalPipeline.tsx` | MODIFY (dynamic layers, vocab) |
+ | `components/research/SpreadsheetGrid.tsx` | MODIFY (dynamic head_dim, if applicable) |
+
+ ### Phase 2c (Frontend Routing + GPU HF Devstral)
+ | File | Action |
+ |------|--------|
+ | `visualisable-ai/lib/backend-router.ts` | MODIFY (add Spark backend option) |
+ | `visualisable-ai/app/admin/users/page.tsx` | MODIFY (add Spark toggle) |
+ | `visualisable-ai/app/api/proxy/[...path]/route.ts` | MODIFY (use backendProxy + runtime='nodejs') |
+ | `visualisable-ai/app/api/backend/[...path]/route.ts` | MODIFY (use backendProxy + runtime='nodejs') |
+ | `visualisable-ai/app/api/demos/route.ts` | MODIFY (use backendFetch) |
+ | `visualisable-ai/app/api/demos/run/route.ts` | MODIFY (use backendFetch) |
+ | `visualisable-ai/app/api/vocabulary/*.ts` | MODIFY (use backendFetch) |
+ | `visualisable-ai/app/api/token/metadata/route.ts` | MODIFY (use backendFetch) |
+ | *(Note: `/api/research/attention/analyze` already updated in Phase 0.5)* | |
+ | `visualisable-ai/.env.local` | MODIFY (add NEXT_PUBLIC_SPARK_BACKEND_URL, NEXT_PUBLIC_MODE) |
+ | `visualisable-ai/.env.example` | MODIFY (document env vars, warn about NEXT_PUBLIC_MODE) |
+ | `visualisable-ai/components/TierIndicator.tsx` | MODIFY (optional: add Spark indicator) |
+ | **GPU HF Space** | CONFIGURE (DEFAULT_MODEL=devstral-small, upgrade to L40S/A100) |
+
+ ### Phase 3 (Spark Deployment)
+ | File | Action |
+ |------|--------|
+ | `.env.spark` | MODIFY (change DEFAULT_MODEL to devstral-small, TORCH_DTYPE=bf16) |
+
+ ---
+
+ ## Quick Checklist
+
+ Before marking each phase complete, verify:
+
+ ### Phase 0 (Secure GPU HF Space)
+ - [ ] GPU HF Space set to Private
+ - [ ] `HF_TOKEN` added to Vercel (server-side only, no `NEXT_PUBLIC_`)
+ - [ ] `lib/backend-auth.server.ts` created with `getBaseAuthHeaders()`, `getHfAuthHeader()`, `isHfSpace()`
+ - [ ] `getBackendHeaders()` in backend-router.ts cleaned up (no secrets)
+ - [ ] Sleep timeout configured (5 minutes)
+ - [ ] Direct unauthenticated request to GPU Space returns 401
+
+ ### Phase 0.5 (Fix Critical Routing)
+ - [ ] `lib/backend-fetch.ts` created with `backendFetch()` (minimal helper)
+ - [ ] At least one critical endpoint uses `backendFetch`
+ - [ ] GPU-enabled user's analyze request reaches GPU HF Space (verified)
+ - [ ] Free tier user's analyze request still goes to CPU HF Space
+ - [ ] (Note: `backendProxy()` added later in Phase 2c for proxy routes)
+
+ ### Phase 1
+ - [ ] `/health` returns fast (< 100ms) even while model is loading
+ - [ ] `/ready` endpoint exists and returns model load status
+ - [ ] `.env.spark` is gitignored
+ - [ ] Multi-branch guidance documented (ports + compose -p)
+
+ ### Phase 2
+ - [ ] MistralAdapter handles layer access correctly
+ - [ ] Layer classification uses percentages, not hardcoded indices
+ - [ ] Env vars (TORCH_DTYPE, MAX_CONTEXT, BATCH_SIZE) are wired into loader
+ - [ ] `requires_gpu: True` for Devstral to guide users to Spark
+ - [ ] `/models` endpoint returns list of available models
+ - [ ] `/models/current` endpoint returns currently loaded model info
+
+ ### Phase 2b
+ - [ ] Frontend stage boundaries are percentage-based
+ - [ ] Vocab size is dynamic, not hardcoded 51,200
+ - [ ] head_dim calculated from hidden_size/num_heads (if used)
+
+ ### Phase 2c
+ - [ ] **All API routes use per-user routing** (no hardcoded BACKEND_URL)
+ - [ ] **All proxy routes** have `export const runtime = 'nodejs'`
+ - [ ] GPU toggle correctly routes to GPU HuggingFace Space
+ - [ ] GPU HF Space has `DEFAULT_MODEL=devstral-small` and sufficient VRAM
+ - [ ] `/models/current` endpoint exists and returns current model info
+ - [ ] **GPU Devstral proof:** `/models/current` returns `id=devstral-small, device=cuda, dtype=bf16`
+ - [ ] GPU-enabled users automatically get Devstral in production
+ - [ ] Spark backend URL configurable via environment variable (local mode only)
+ - [ ] `NEXT_PUBLIC_MODE` only defined in `.env.local`, never in Vercel
+ - [ ] Admin UI has Spark toggle (mutually exclusive with Remote, local mode only)
+ - [ ] Model selector shows available models based on connected backend
+
+ ### Phase 3
+ - [ ] TORCH_DTYPE=bf16 in .env.spark
+ - [ ] Model loads on GPU (verify via `/debug/device`, not just logs)
+ - [ ] Inference is GPU-accelerated (fast)
+ - [ ] Frontend renders 40 layers correctly
+
+ ---
+
+ ## Current Status
+
+ - [ ] **Phase 0**: Secure GPU HF Space + verify basic routing
+ - [ ] **Phase 0.5**: Fix critical API route routing (prove GPU routing works)
+ - [ ] **Phase 1**: Deploy CodeGen to DGX Spark
+ - [ ] **Phase 2**: Add Devstral backend support
+ - [ ] **Phase 2b**: Frontend dynamic layer handling
+ - [ ] **Phase 2c**: Wire Spark into frontend backend router + Deploy Devstral to GPU HF Space
+ - [ ] **Phase 3**: Deploy Devstral to DGX Spark
+ - [ ] **Phase 4**: Future enhancements (optional)