# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)

> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.

## Overview

Add support for `mistralai/Devstral-Small-2507` (24B parameter Mistral-based code model) to the Research Attention Analyzer, alongside creating Docker deployment infrastructure for running the backend on the DGX Spark.

## Devstral Model Specifications

| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |

\*VRAM is a planning guide. Actual usage varies with max context, KV cache, batch size, and attention implementation.
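The "Min VRAM (BF16)" row follows from simple arithmetic: BF16 stores each parameter in 2 bytes, so weights alone take roughly `parameters × 2` bytes, and the KV cache adds a term that grows with context length. The sketch below works through that estimate. The function names are hypothetical, and `head_dim=128` is an assumption that does not appear in the table above; check it against the model's `config.json` before relying on the numbers. The 8192-token context mirrors the `MAX_CONTEXT` default used in the Spark compose file.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal GB (BF16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
) -> float:
    """Memory for the KV cache in decimal GB: one K and one V tensor per layer."""
    elems = 2 * num_layers * num_kv_heads * head_dim * context_len * batch_size
    return elems * bytes_per_elem / 1e9


# Weights: 24B parameters in BF16.
devstral_weights = weight_memory_gb(24e9)

# KV cache at an 8192-token context, batch size 1.
# head_dim=128 is an ASSUMPTION, not a value taken from the table.
devstral_kv_gqa = kv_cache_gb(num_layers=40, num_kv_heads=8, head_dim=128, context_len=8192)

# Same model shape without GQA (32 KV heads) for comparison.
devstral_kv_mha = kv_cache_gb(num_layers=40, num_kv_heads=32, head_dim=128, context_len=8192)

print(f"weights: {devstral_weights:.1f} GB")
print(f"KV cache (GQA, 8 KV heads): {devstral_kv_gqa:.2f} GB")
print(f"KV cache (MHA, 32 KV heads): {devstral_kv_mha:.2f} GB")
```

Under these assumptions the 4:1 GQA ratio cuts the KV cache to a quarter of the multi-head equivalent, which is why context length matters less for Devstral's footprint than the raw parameter count does. Activations and framework overhead are not modeled here.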
## Deployment Environment Summary

| Environment | Backend Location | Frontend | Use Case |
|-------------|-----------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |

---

## Backend Model Support

### Add Devstral to Model Registry

**File:** `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,  # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0
}
```

### Create MistralAdapter

**File:** `backend/model_adapter.py`

```python
from typing import Optional  # needed for get_num_kv_heads; skip if already imported


class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently"""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```

### Fix Hardcoded Layer Classification

**File:** `backend/model_service.py` (lines ~1505-1514)

```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```

---

## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

**File:** `components/research/VerticalPipeline.tsx`

```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = layerIdx / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

---

## DGX Spark Docker Deployment

### Dockerfile

```dockerfile
# Bump with care, retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Docker Compose for Spark

**File:** `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```

**Notes:**

- `/health` MUST return immediately (process up), not wait for model load
- Add `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`

---

## Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |

---

## Hardware Requirements Summary

| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |
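
The "96GB+ RAM or 48GB+ VRAM" rule for Devstral maps directly onto the `min_vram_gb` and `min_ram_gb` fields in the registry entry. A minimal sketch of a pre-load check follows. The function `fits_hardware` and the sample host numbers are hypothetical, not part of the existing backend, and the sketch deliberately ignores the `requires_gpu` flag, treating either threshold as sufficient, as the table does.

```python
def fits_hardware(entry: dict, vram_gb: float = 0.0, ram_gb: float = 0.0) -> bool:
    """Return True if the host meets either the VRAM or the RAM threshold.

    `entry` is a registry-style dict with `min_vram_gb` and `min_ram_gb` keys.
    This is an illustrative helper, not an existing backend function.
    """
    enough_vram = vram_gb >= entry["min_vram_gb"]
    enough_ram = ram_gb >= entry["min_ram_gb"]
    return enough_vram or enough_ram


# Thresholds copied from the Devstral registry entry in this plan.
DEVSTRAL = {"min_vram_gb": 48.0, "min_ram_gb": 96.0}

# Hypothetical hosts, for illustration only.
print(fits_hardware(DEVSTRAL, vram_gb=24.0, ram_gb=64.0))   # neither threshold met
print(fits_hardware(DEVSTRAL, vram_gb=0.0, ram_gb=128.0))   # RAM threshold met
print(fits_hardware(DEVSTRAL, vram_gb=48.0, ram_gb=32.0))   # VRAM threshold met
```

Running a check like this before model load would let the backend refuse an oversized `DEFAULT_MODEL` with a clear message instead of failing partway through loading weights.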