# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)

> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.

## Overview
Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create Docker deployment infrastructure for running the backend on the DGX Spark.
## Devstral Model Specifications

| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |

*VRAM is a planning guide; actual usage varies with max context, KV cache, batch size, and attention implementation.
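The GQA row drives that caveat: the KV cache scales with Devstral's 8 KV heads, not its 32 query heads. A back-of-envelope sketch of the arithmetic (this assumes `head_dim = hidden_size / num_heads`; some Mistral variants set `head_dim` independently in their config, so check before relying on it):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size: 2 tensors (K and V) per layer, BF16 = 2 bytes."""
    total = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
    return total / 1024**3

# Devstral at an 8K context (the MAX_CONTEXT default used in the compose file below):
# 2 * 40 * 8 * 160 * 8192 * 2 bytes ~= 1.6 GB on top of the ~48GB of weights
print(f"{kv_cache_gb(40, 8, 160, 8192):.1f} GB")
```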
## Deployment Environment Summary

| Environment | Backend Location | Frontend | Use Case |
|-------------|------------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |

---
## Backend Model Support

### Add Devstral to Model Registry

**File:** `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,  # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0,
},
```
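Since these numbers duplicate the model's Hugging Face config, it may be worth verifying them at load time so the registry cannot silently drift. A minimal sketch, with a hypothetical `validate_registry_entry` helper (not existing code):

```python
from transformers import AutoConfig

def validate_registry_entry(name: str, entry: dict) -> None:
    """Compare registry metadata against the model's HF config; raise on mismatch."""
    config = AutoConfig.from_pretrained(entry["hf_path"])
    checks = {
        "num_layers": config.num_hidden_layers,
        "num_heads": config.num_attention_heads,
        "num_kv_heads": getattr(config, "num_key_value_heads", None),
        "vocab_size": config.vocab_size,
    }
    for key, actual in checks.items():
        if entry.get(key) not in (None, actual):
            raise ValueError(f"{name}: registry says {key}={entry[key]}, config says {actual}")
```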
### Create MistralAdapter

**File:** `backend/model_adapter.py`

```python
# Assumes `from typing import Optional` is already imported at module top.
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently."""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # direct MistralModel access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
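One caveat worth planning for in an attention analyzer: recent `transformers` versions default Mistral models to SDPA/FlashAttention kernels, which never materialize per-head attention probabilities. A rough capture sketch under the `eager` implementation (this assumes the checkpoint ships transformers-compatible tokenizer files; some Mistral releases route tokenization through `mistral-common` instead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "eager" is required for output_attentions; fused kernels return None here
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Devstral-Small-2507")

inputs = tokenizer("def fib(n):", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# 40 tensors, each [batch, 32 query heads, seq, seq]; GQA repeats the 8 KV
# heads across query groups before the softmax, so the head axis is still 32
print(len(out.attentions), out.attentions[0].shape)
```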
### Fix Hardcoded Layer Classification

**File:** `backend/model_service.py` (lines ~1505-1514)

```python
# Fixed (percentage-based, 1-indexed fraction over transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
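A quick sanity check of how these thresholds scale across depths (a standalone mirror of the logic above, not code from the service):

```python
from collections import Counter

def classify(layer_idx: int, n_layers: int) -> str:
    """Mirror of the classification above, for quick verification."""
    if layer_idx == 0:
        return "positional"
    frac = (layer_idx + 1) / n_layers
    if frac <= 0.25:
        return "previous_token"
    if frac <= 0.75:
        return "induction"
    return "semantic"

# Devstral (40 layers): 1 positional, 9 previous_token, 20 induction, 10 semantic
print(Counter(classify(i, 40) for i in range(40)))
# CodeGen (20 layers): 1 positional, 4 previous_token, 10 induction, 5 semantic
print(Counter(classify(i, 20) for i in range(20)))
```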
---

## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

**File:** `components/research/VerticalPipeline.tsx`

```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  // 1-indexed fraction, mirroring the backend's (layer_idx + 1) / n_layers
  const fraction = (layerIdx + 1) / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

---
## DGX Spark Docker Deployment

### Dockerfile

```dockerfile
# Bump with care; retest CUDA + torch compatibility when changing the base image
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

# curl is used by the compose healthcheck below
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# The NGC base image already ships torch; keep torch pins out of requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose for Spark

**File:** `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```
**Notes:**

- `/health` MUST return immediately (process up), not wait for model load
- Add a `/ready` endpoint for model readiness (see the sketch below)
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
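A minimal sketch of that liveness/readiness split for the FastAPI app the Dockerfile's CMD points at (the `model_loaded` flag and endpoint bodies are illustrative, not existing code):

```python
from fastapi import FastAPI, Response

app = FastAPI()
model_loaded = False  # flipped by the model-loading code once weights are in memory

@app.get("/health")
def health():
    """Liveness: the process is up. Must not block on model loading."""
    return {"status": "ok"}

@app.get("/ready")
def ready(response: Response):
    """Readiness: the model is loaded and inference requests can be served."""
    if not model_loaded:
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}
```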
---
## Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template (sketched below) |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |
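A plausible starting point for `.env.spark.example`, covering every variable the compose file reads (all values are placeholders):

```bash
# .env.spark.example -- copy to .env and fill in secrets
PORT=8000
DEFAULT_MODEL=devstral-small
API_KEY=change-me
HF_TOKEN=hf_xxx            # needed to pull gated/private weights
MAX_CONTEXT=8192
BATCH_SIZE=1
TORCH_DTYPE=bf16
```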
---
## Hardware Requirements Summary

| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |
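These minimums could be enforced at startup from the registry's `requires_gpu`, `min_vram_gb`, and `min_ram_gb` fields. A hypothetical preflight sketch (assumes `psutil` is available; the CPU-fallback branch mirrors the Mac Studio row above):

```python
import psutil
import torch

def preflight(entry: dict) -> None:
    """Fail fast if the host cannot hold the requested model."""
    if entry.get("requires_gpu") and not torch.cuda.is_available():
        # No GPU: allow the CPU path only if system RAM meets the minimum
        ram_gb = psutil.virtual_memory().total / 1024**3
        if ram_gb < entry.get("min_ram_gb", 0):
            raise RuntimeError(f"Need {entry['min_ram_gb']}GB RAM for CPU fallback, have {ram_gb:.0f}GB")
        return
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if vram_gb < entry.get("min_vram_gb", 0):
            raise RuntimeError(f"Need {entry['min_vram_gb']}GB VRAM, have {vram_gb:.0f}GB")
```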