# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)
> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.
## Overview
Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and add Docker deployment infrastructure for running the backend on the DGX Spark.
## Devstral Model Specifications
| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |
*VRAM is a planning guide. Actual usage varies with max context, KV cache, batch size, and attention implementation.
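The VRAM row can be reproduced with back-of-envelope math. A sketch, assuming BF16 at 2 bytes per value and `head_dim = hidden_size / num_heads`; real allocators add overhead on top of these figures:

```python
# Rough memory math for the spec table above (BF16 = 2 bytes/value).

def weight_bytes_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Raw weight memory in GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_val: int = 2) -> int:
    """K and V, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_val

# Devstral: 40 layers, 8 KV heads (GQA), head_dim = 5120 // 32 = 160
gqa_per_token = kv_cache_bytes_per_token(40, 8, 160)   # 204,800 B ~ 200 KiB
mha_per_token = kv_cache_bytes_per_token(40, 32, 160)  # 4x larger without GQA

print(f"weights (24B, BF16): ~{weight_bytes_gb(24):.0f} GB")          # ~48 GB
print(f"KV cache @ 8K ctx:   ~{gqa_per_token * 8192 / 1e9:.1f} GB")   # ~1.7 GB
print(f"KV cache @ 8K, MHA:  ~{mha_per_token * 8192 / 1e9:.1f} GB")   # ~6.7 GB
```

This is why the GQA row in the table matters: the 8 KV heads cut the KV cache to a quarter of what full MHA would need at the same depth.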
## Deployment Environment Summary
| Environment | Backend Location | Frontend | Use Case |
|-------------|-----------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |
---
## Backend Model Support
### Add Devstral to Model Registry
**File:** `backend/model_config.py`
```python
"devstral-small": {
"hf_path": "mistralai/Devstral-Small-2507",
"display_name": "Devstral Small 24B",
"architecture": "mistral",
"size": "24B",
"num_layers": 40,
"num_heads": 32,
"num_kv_heads": 8, # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
"vocab_size": 131072,
"context_length": 131072,
"attention_type": "grouped_query",
"requires_gpu": True,
"min_vram_gb": 48.0,
"min_ram_gb": 96.0
}
```
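A malformed registry entry otherwise only surfaces at load time, so a small validation pass at import is cheap insurance. A sketch under assumptions: the `validate_entry` helper is illustrative, not existing code in `model_config.py`:

```python
# Illustrative sanity check for registry entries (helper name is hypothetical).

def validate_entry(name: str, cfg: dict) -> None:
    """Fail fast on missing keys or inconsistent GQA head counts."""
    required = {"hf_path", "num_layers", "num_heads", "context_length"}
    missing = required - cfg.keys()
    if missing:
        raise ValueError(f"{name}: missing keys {sorted(missing)}")
    kv = cfg.get("num_kv_heads")
    if kv is not None and cfg["num_heads"] % kv != 0:
        raise ValueError(f"{name}: num_heads must be divisible by num_kv_heads")

devstral = {
    "hf_path": "mistralai/Devstral-Small-2507",
    "num_layers": 40, "num_heads": 32, "num_kv_heads": 8,
    "context_length": 131072,
}
validate_entry("devstral-small", devstral)  # passes: 32 / 8 = 4 Q heads per KV head
```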
### Create MistralAdapter
**File:** `backend/model_adapter.py`
```python
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently"""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
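The projections returned by `get_qkv_projections` will not all have the same width: under GQA, K and V are narrower than Q. A quick cross-check using only numbers from the spec table (with `head_dim = hidden_size // num_heads`, the standard HF convention):

```python
# Expected Q/K/V projection widths for Devstral under GQA
# (hidden 5120, 32 Q heads, 8 KV heads, all from the spec table).
hidden, n_heads, n_kv = 5120, 32, 8

head_dim = hidden // n_heads  # 160
q_out = n_heads * head_dim    # 5120: q_proj keeps the full hidden width
kv_out = n_kv * head_dim      # 1280: k_proj/v_proj are 4x narrower

print(head_dim, q_out, kv_out)  # 160 5120 1280
```

Any attention-visualization code that assumes `k_proj.out_features == q_proj.out_features` (true for CodeGen's MHA) will break on Devstral, which is exactly what the adapter layer is meant to absorb.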
### Fix Hardcoded Layer Classification
**File:** `backend/model_service.py` (lines ~1505-1514)
```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
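Applied to both model depths, the percentage-based thresholds partition layers as follows (the `classify` helper mirrors the logic above, with the payload dicts reduced to labels):

```python
from collections import Counter

def classify(layer_idx: int, n_layers: int) -> str:
    """Same thresholds as model_service.py, returning just the pattern label."""
    frac = (layer_idx + 1) / n_layers
    if layer_idx == 0:
        return "positional"
    if frac <= 0.25:
        return "previous_token"
    if frac <= 0.75:
        return "induction"
    return "semantic"

for n in (20, 40):  # CodeGen depth vs Devstral depth
    print(n, Counter(classify(i, n) for i in range(n)))
# 20 layers -> 1 / 4 / 10 / 5 across positional/previous_token/induction/semantic
# 40 layers -> 1 / 9 / 20 / 10 for the same buckets
```

The point of the fix: the old hardcoded boundaries only made sense at 20 layers, while the fraction-based version scales the same proportions to any depth.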
---
## Frontend Dynamic Layer Handling
### Fix Hardcoded Layer Boundaries
**File:** `components/research/VerticalPipeline.tsx`
```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  // Use the same 1-indexed fraction as the backend so stage boundaries agree
  const fraction = (layerIdx + 1) / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```
---
## DGX Spark Docker Deployment
### Dockerfile
```dockerfile
# Bump with care, retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /app
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose for Spark
**File:** `docker/compose.spark.yml`
```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```
**Notes:**
- `/health` MUST return immediately (process up), not wait for model load
- Add `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
---
## Files to Modify/Create
| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |
---
## Hardware Requirements Summary
| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |