# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)
This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.
## Overview
Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create Docker deployment infrastructure for running the backend on the DGX Spark.
## Devstral Model Specifications
| Parameter | Devstral | CodeGen (current) | Code Llama |
|---|---|---|---|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | mistral | gpt_neox | llama |
*VRAM is a planning guide. Actual usage varies with max context, KV cache, batch size, and attention implementation.
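To make that footnote concrete, here is a back-of-envelope KV-cache estimate in BF16 using the GQA dimensions above (the head dimension, hidden_size / num_heads = 5120 / 32 = 160, is derived from the table rather than quoted from the model card, so treat it as an assumption):

```python
# Back-of-envelope KV-cache sizing for Devstral in BF16 (2 bytes/element).
# Assumes head_dim = hidden_size / num_heads = 5120 / 32 = 160.
def kv_cache_bytes(seq_len: int, batch: int = 1,
                   num_layers: int = 40, num_kv_heads: int = 8,
                   head_dim: int = 160, bytes_per_elem: int = 2) -> int:
    # 2x for the separate K and V tensors cached per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

print(kv_cache_bytes(8_192) / 2**30)    # ~1.6 GiB at the compose default MAX_CONTEXT
print(kv_cache_bytes(131_072) / 2**30)  # ~25 GiB at the full 128K context
```

GQA is what keeps this tractable: with 8 KV heads instead of 32, the cache is a quarter of what an equivalent MHA model would need.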
## Deployment Environment Summary
| Environment | Backend Location | Frontend | Use Case |
|---|---|---|---|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |
## Backend Model Support
### Add Devstral to Model Registry

File: `backend/model_config.py`
"devstral-small": {
"hf_path": "mistralai/Devstral-Small-2507",
"display_name": "Devstral Small 24B",
"architecture": "mistral",
"size": "24B",
"num_layers": 40,
"num_heads": 32,
"num_kv_heads": 8, # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
"vocab_size": 131072,
"context_length": 131072,
"attention_type": "grouped_query",
"requires_gpu": True,
"min_vram_gb": 48.0,
"min_ram_gb": 96.0
}
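As one way the new `requires_gpu` / `min_vram_gb` fields could be consumed, a hypothetical preflight helper (illustrative only, not existing code in the backend) might refuse to load a model that won't fit:

```python
import torch

def check_vram(model_cfg: dict) -> None:
    """Hypothetical preflight: refuse to load a model whose min_vram_gb
    exceeds the currently free GPU memory."""
    if not model_cfg.get("requires_gpu"):
        return
    if not torch.cuda.is_available():
        raise RuntimeError(f"{model_cfg['display_name']} requires a GPU")
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gb = free_bytes / 2**30
    if free_gb < model_cfg["min_vram_gb"]:
        raise RuntimeError(
            f"Need ~{model_cfg['min_vram_gb']}GB VRAM, only {free_gb:.1f}GB free"
        )
```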
### Create MistralAdapter

File: `backend/model_adapter.py`
```python
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently."""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
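Adapter selection can then key off the registry's `architecture` field. A minimal sketch, assuming a mapping is kept alongside the adapter classes (the `ADAPTERS` dict, `get_adapter`, and the other adapter class names are assumptions, not current code):

```python
ADAPTERS = {
    "mistral": MistralAdapter,
    "gpt_neox": GPTNeoXAdapter,  # existing CodeGen adapter, name assumed
    "llama": LlamaAdapter,       # existing Code Llama adapter, name assumed
}

def get_adapter(model, architecture: str) -> ModelAdapter:
    """Look up the adapter class for a registry 'architecture' string."""
    try:
        return ADAPTERS[architecture](model)
    except KeyError:
        raise ValueError(f"No adapter registered for architecture '{architecture}'")
```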
### Fix Hardcoded Layer Classification

File: `backend/model_service.py` (lines ~1505-1514)
```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

File: `components/research/VerticalPipeline.tsx`
```tsx
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = layerIdx / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```
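For example, assuming rows are indexed 0 (embedding) through `totalLayers` (transformer blocks), Devstral's 40 layers map to EARLY = 1-10, MIDDLE = 11-30, LATE = 31-40, while CodeGen's 20 layers give 1-5 / 6-15 / 16-20, so no per-model boundaries need to be hardcoded.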
## DGX Spark Docker Deployment

### Dockerfile
```dockerfile
# Bump with care; retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /app
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose for Spark

File: `docker/compose.spark.yml`
```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```
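On the backend side, a minimal sketch of consuming these variables at startup (the variable names match the compose file; the `load_settings` helper itself is an assumption, not current code):

```python
import os
import torch

DTYPES = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}

def load_settings() -> dict:
    """Read model-loading knobs from the environment, with compose defaults."""
    return {
        "model": os.environ.get("DEFAULT_MODEL", "devstral-small"),
        "max_context": int(os.environ.get("MAX_CONTEXT", "8192")),
        "batch_size": int(os.environ.get("BATCH_SIZE", "1")),
        "torch_dtype": DTYPES[os.environ.get("TORCH_DTYPE", "bf16")],
    }
```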
Notes:
- `/health` MUST return immediately (process up), not wait for model load
- Add a `/ready` endpoint for model readiness (see the sketch after this list)
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
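A minimal FastAPI sketch of that `/health` vs `/ready` split (the endpoint bodies are assumptions; wire `model_loaded` to however `model_service` actually tracks load state):

```python
from fastapi import FastAPI, Response

app = FastAPI()
model_loaded = False  # flipped to True by the (possibly slow) model-load task

@app.get("/health")
def health():
    # Process liveness only -- must return immediately, even mid-load
    return {"status": "ok"}

@app.get("/ready")
def ready(response: Response):
    # Readiness: 503 until the model is actually usable
    if not model_loaded:
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}
```

This keeps the compose healthcheck (which polls `/health`) from restarting the container during a long Devstral load, while clients can poll `/ready` before sending requests.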
## Files to Modify/Create
| File | Action | Description |
|---|---|---|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |
## Hardware Requirements Summary
| Model | Deployment | Hardware |
|---|---|---|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |