# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)

> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.

## Overview
Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create Docker deployment infrastructure for running the backend on the DGX Spark.
## Devstral Model Specifications

| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |

*VRAM is a planning guide; actual usage varies with max context, KV cache, batch size, and attention implementation.
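The GQA row drives that caveat: the KV cache scales with Devstral's 8 KV heads, not its 32 query heads. A back-of-envelope sketch of the arithmetic (this assumes `head_dim = hidden_size / num_heads`; some Mistral variants set `head_dim` independently in their config, so check before relying on it):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size: 2 tensors (K and V) per layer, BF16 = 2 bytes."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / 1024**3

# Devstral at an 8K context (the MAX_CONTEXT default used in the compose file below):
# 2 * 40 * 8 * 160 * 8192 * 2 bytes ~= 1.6 GB on top of the ~48GB of weights
print(f"{kv_cache_gb(40, 8, 160, 8192):.1f} GB")
```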
## Deployment Environment Summary

| Environment | Backend Location | Frontend | Use Case |
|-------------|------------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |

---
## Backend Model Support

### Add Devstral to Model Registry

**File:** `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,  # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0,
},
```
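Since these numbers duplicate the model's Hugging Face config, it may be worth verifying them at load time so the registry cannot silently drift. A minimal sketch, with a hypothetical `validate_registry_entry` helper (not existing code):

```python
from transformers import AutoConfig

def validate_registry_entry(name: str, entry: dict) -> None:
    """Compare registry metadata against the model's HF config; raise on mismatch."""
    config = AutoConfig.from_pretrained(entry["hf_path"])
    checks = {
        "num_layers": config.num_hidden_layers,
        "num_heads": config.num_attention_heads,
        "num_kv_heads": getattr(config, "num_key_value_heads", None),
        "vocab_size": config.vocab_size,
    }
    for key, actual in checks.items():
        if entry.get(key) not in (None, actual):
            raise ValueError(f"{name}: registry says {key}={entry[key]}, config says {actual}")
```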
### Create MistralAdapter

**File:** `backend/model_adapter.py`

```python
# Assumes `from typing import Optional` is already imported at module top.
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently."""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # direct MistralModel access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
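One caveat worth planning for in an attention analyzer: recent `transformers` versions default Mistral models to SDPA/FlashAttention kernels, which never materialize per-head attention probabilities. A rough capture sketch under the `eager` implementation (this assumes the checkpoint ships transformers-compatible tokenizer files; some Mistral releases route tokenization through `mistral-common` instead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "eager" is required for output_attentions; fused kernels return None here
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Devstral-Small-2507")

inputs = tokenizer("def fib(n):", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# 40 tensors, each [batch, 32 query heads, seq, seq]; GQA repeats the 8 KV
# heads across query groups before the softmax, so the head axis is still 32
print(len(out.attentions), out.attentions[0].shape)
```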
### Fix Hardcoded Layer Classification

**File:** `backend/model_service.py` (lines ~1505-1514)

```python
# Fixed (percentage-based, 1-indexed fraction over transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
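A quick sanity check of how these thresholds scale across depths (a standalone mirror of the logic above, not code from the service):

```python
from collections import Counter

def classify(layer_idx: int, n_layers: int) -> str:
    """Mirror of the classification above, for quick verification."""
    if layer_idx == 0:
        return "positional"
    frac = (layer_idx + 1) / n_layers
    if frac <= 0.25:
        return "previous_token"
    if frac <= 0.75:
        return "induction"
    return "semantic"

# Devstral (40 layers): 1 positional, 9 previous_token, 20 induction, 10 semantic
print(Counter(classify(i, 40) for i in range(40)))
# CodeGen (20 layers): 1 positional, 4 previous_token, 10 induction, 5 semantic
print(Counter(classify(i, 20) for i in range(20)))
```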
---

## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

**File:** `components/research/VerticalPipeline.tsx`

```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  // 1-indexed fraction, mirroring the backend's (layer_idx + 1) / n_layers
  const fraction = (layerIdx + 1) / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

---
## DGX Spark Docker Deployment

### Dockerfile

```dockerfile
# Bump with care; retest CUDA + torch compatibility when changing the base image
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

# curl is used by the compose healthcheck below
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# The NGC base image already ships torch; keep torch pins out of requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose for Spark

**File:** `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```
**Notes:**

- `/health` MUST return immediately (process up), not wait for model load
- Add a `/ready` endpoint for model readiness (see the sketch below)
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
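A minimal sketch of that liveness/readiness split for the FastAPI app the Dockerfile's CMD points at (the `model_loaded` flag and endpoint bodies are illustrative, not existing code):

```python
from fastapi import FastAPI, Response

app = FastAPI()
model_loaded = False  # flipped by the model-loading code once weights are in memory

@app.get("/health")
def health():
    """Liveness: the process is up. Must not block on model loading."""
    return {"status": "ok"}

@app.get("/ready")
def ready(response: Response):
    """Readiness: the model is loaded and inference requests can be served."""
    if not model_loaded:
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}
```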
---
## Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template (sketched below) |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |
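A plausible starting point for `.env.spark.example`, covering every variable the compose file reads (all values are placeholders):

```bash
# .env.spark.example -- copy to .env and fill in secrets
PORT=8000
DEFAULT_MODEL=devstral-small
API_KEY=change-me
HF_TOKEN=hf_xxx            # needed to pull gated/private weights
MAX_CONTEXT=8192
BATCH_SIZE=1
TORCH_DTYPE=bf16
```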
---
## Hardware Requirements Summary

| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |
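These minimums could be enforced at startup from the registry's `requires_gpu`, `min_vram_gb`, and `min_ram_gb` fields. A hypothetical preflight sketch (assumes `psutil` is available; the CPU-fallback branch mirrors the Mac Studio row above):

```python
import psutil
import torch

def preflight(entry: dict) -> None:
    """Fail fast if the host cannot hold the requested model."""
    if entry.get("requires_gpu") and not torch.cuda.is_available():
        # No GPU: allow the CPU path only if system RAM meets the minimum
        ram_gb = psutil.virtual_memory().total / 1024**3
        if ram_gb < entry.get("min_ram_gb", 0):
            raise RuntimeError(f"Need {entry['min_ram_gb']}GB RAM for CPU fallback, have {ram_gb:.0f}GB")
        return
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if vram_gb < entry.get("min_vram_gb", 0):
            raise RuntimeError(f"Need {entry['min_vram_gb']}GB VRAM, have {vram_gb:.0f}GB")
```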