# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)
> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.
## Overview
Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create Docker deployment infrastructure for running the backend on the DGX Spark.
## Devstral Model Specifications
| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |
\*VRAM is a planning guide. Actual usage varies with max context, KV cache size, batch size, and attention implementation.
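The GQA row matters for KV cache sizing: with 8 KV heads serving 32 query heads, Devstral's cache is 4x smaller than an equivalent MHA model. A rough sketch of the arithmetic, assuming `head_dim = hidden_size / num_heads = 160` (derived from the table, not confirmed from the model config):

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, dtype_bytes: int = 2) -> float:
    """Per-sequence KV cache in GiB: one K and one V tensor per layer (BF16 = 2 bytes)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes / 1024**3

# Devstral figures from the table above, at an 8K context
gqa = kv_cache_gib(num_layers=40, num_kv_heads=8, head_dim=160, seq_len=8192)
mha = kv_cache_gib(num_layers=40, num_kv_heads=32, head_dim=160, seq_len=8192)
print(f"GQA: {gqa:.2f} GiB vs hypothetical MHA: {mha:.2f} GiB ({mha / gqa:.0f}x)")
```

This is why the `MAX_CONTEXT` cap in the compose file below directly affects headroom: the cache scales linearly with sequence length.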
## Deployment Environment Summary
| Environment | Backend Location | Frontend | Use Case |
|-------------|-----------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |
---
## Backend Model Support
### Add Devstral to Model Registry
**File:** `backend/model_config.py`
```python
"devstral-small": {
"hf_path": "mistralai/Devstral-Small-2507",
"display_name": "Devstral Small 24B",
"architecture": "mistral",
"size": "24B",
"num_layers": 40,
"num_heads": 32,
"num_kv_heads": 8, # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
"vocab_size": 131072,
"context_length": 131072,
"attention_type": "grouped_query",
"requires_gpu": True,
"min_vram_gb": 48.0,
"min_ram_gb": 96.0
}
```
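The `min_vram_gb` field is meant to gate model selection before loading. A minimal sketch of how that check might look (`pick_device` is a hypothetical helper, not in the codebase; real code would read device properties via `torch.cuda`):

```python
def pick_device(device_vram_gb: list[float], min_vram_gb: float):
    """Return the index of the first GPU whose total VRAM meets the
    registry's min_vram_gb, or None if no device qualifies."""
    for i, vram in enumerate(device_vram_gb):
        if vram >= min_vram_gb:
            return i
    return None

# A 24GB card cannot host Devstral in BF16; an 80GB card can
print(pick_device([24.0, 80.0], min_vram_gb=48.0))  # → 1
print(pick_device([24.0], min_vram_gb=48.0))        # → None
```

Failing fast here is cheaper than an OOM halfway through `from_pretrained`.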
### Create MistralAdapter
**File:** `backend/model_adapter.py`
```python
from typing import Optional


class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently."""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
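For visualization code consuming `get_num_kv_heads()`, it helps to know which KV head serves each query head. Under GQA, consecutive query heads share one KV head; this illustrative helper (not part of the adapter) shows the mapping for Devstral's 32/8 split:

```python
def kv_head_for_query(q_head: int, num_heads: int = 32, num_kv_heads: int = 8) -> int:
    """Map a query head index to the KV head it attends with under GQA.
    Defaults are Devstral's 32 Q / 8 KV heads, i.e. 4 query heads per group."""
    group_size = num_heads // num_kv_heads
    return q_head // group_size

print([kv_head_for_query(h) for h in range(8)])  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Attention heatmaps grouped this way make the shared-KV structure visible rather than misleading.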
### Fix Hardcoded Layer Classification
**File:** `backend/model_service.py` (lines ~1505-1514)
```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
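The thresholds above can be sanity-checked standalone. This sketch mirrors the logic and shows where the boundaries land for Devstral's 40 layers:

```python
def classify_layer(layer_idx: int, n_layers: int) -> str:
    """Mirror of the percentage-based classification above."""
    if layer_idx == 0:
        return "positional"
    layer_fraction = (layer_idx + 1) / n_layers
    if layer_fraction <= 0.25:
        return "previous_token"
    if layer_fraction <= 0.75:
        return "induction"
    return "semantic"

# For 40 layers: 1-9 previous_token, 10-29 induction, 30-39 semantic
print(classify_layer(9, 40), classify_layer(10, 40), classify_layer(30, 40))
```

Because the fraction is 1-indexed, the boundary layers (0.25 and 0.75 exactly) fall into the earlier bucket, which is worth pinning down in a test before porting the thresholds to the frontend.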
---
## Frontend Dynamic Layer Handling
### Fix Hardcoded Layer Boundaries
**File:** `components/research/VerticalPipeline.tsx`
```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = layerIdx / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```
---
## DGX Spark Docker Deployment
### Dockerfile
```dockerfile
# Bump with care; retest CUDA + torch compatibility after any tag change
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /app
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose for Spark
**File:** `docker/compose.spark.yml`
```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
**Notes:**
- `/health` MUST return immediately (process up), not wait for model load
- Add `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
---
## Files to Modify/Create
| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |
---
## Hardware Requirements Summary
| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |