
# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)

This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.

## Overview

Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create Docker deployment infrastructure for running the backend on the DGX Spark.

## Devstral Model Specifications

| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)\* | ~48GB | 2GB | 14GB |
| Architecture | mistral | gpt_neox | llama |

\*VRAM is a planning guide. Actual usage varies with max context, KV cache, batch size, and attention implementation.

## Deployment Environment Summary

| Environment | Backend Location | Frontend | Use Case |
|-------------|------------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral / larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |

## Backend Model Support

### Add Devstral to Model Registry

File: `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,  # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0,
}
```
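The GQA geometry in this entry (32 Q heads, 8 KV heads, 4:1 ratio) can be checked mechanically when an entry is registered. A minimal sketch; `validate_registry_entry` is a hypothetical helper, not existing code:

```python
# Hypothetical sanity check for a registry entry's attention geometry.
def validate_registry_entry(name: str, cfg: dict) -> int:
    """Validate Q/KV head divisibility; return the queries-per-KV-head factor."""
    n_q = cfg["num_heads"]
    n_kv = cfg.get("num_kv_heads") or n_q  # MHA models: each head has its own KV
    if n_q % n_kv != 0:
        raise ValueError(f"{name}: {n_q} Q heads not divisible by {n_kv} KV heads")
    return n_q // n_kv

devstral = {"num_heads": 32, "num_kv_heads": 8}
print(validate_registry_entry("devstral-small", devstral))  # 4 Q heads per KV head
```

A check like this catches a mistyped `num_kv_heads` before it silently corrupts per-head attention analysis downstream.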

### Create MistralAdapter

File: `backend/model_adapter.py`

```python
from typing import Optional

class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)."""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently."""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        # None for configs without a num_key_value_heads field (pure MHA)
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
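The defensive `_get_layers` lookup matters because Hugging Face exposes the decoder stack at `model.model.layers` on the `MistralForCausalLM` wrapper but at `model.layers` on the bare decoder model. A standalone sketch of the same fallback logic, using stub objects in place of real models:

```python
from types import SimpleNamespace

def get_layers(model):
    # Same fallback order as MistralAdapter._get_layers
    if hasattr(model, "model") and hasattr(model.model, "layers"):
        return model.model.layers  # causal-LM wrapper nesting
    if hasattr(model, "layers"):
        return model.layers        # bare decoder model
    raise AttributeError("Cannot find transformer layers in Mistral model")

# Stubs standing in for the two real nestings
wrapped = SimpleNamespace(model=SimpleNamespace(layers=[f"layer{i}" for i in range(40)]))
bare = SimpleNamespace(layers=["layer0", "layer1"])
print(len(get_layers(wrapped)), len(get_layers(bare)))  # 40 2
```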

### Fix Hardcoded Layer Classification

File: `backend/model_service.py` (lines ~1505-1514)

```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
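For Devstral's 40 layers, this percentage-based scheme places the previous_token/induction boundary after layer 9 and the induction/semantic boundary after layer 29. A self-contained sketch of the same thresholds (returning just the type string, without the elided pattern fields):

```python
def classify_layer(layer_idx: int, n_layers: int) -> str:
    # First layer is positional; the rest classify by depth fraction (1-indexed)
    if layer_idx == 0:
        return "positional"
    frac = (layer_idx + 1) / n_layers
    if frac <= 0.25:
        return "previous_token"
    if frac <= 0.75:
        return "induction"
    return "semantic"

# Devstral (40 layers): boundaries after layers 9 and 29
print(classify_layer(9, 40), classify_layer(10, 40),
      classify_layer(29, 40), classify_layer(30, 40))
# previous_token induction induction semantic
```

Because the thresholds are fractions rather than fixed indices, the same function scales to CodeGen's 20 layers and Code Llama's 32 without modification.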

## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

File: `components/research/VerticalPipeline.tsx`

```tsx
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = layerIdx / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

## DGX Spark Docker Deployment

### Dockerfile

```dockerfile
# Bump with care; retest CUDA + torch compatibility after upgrades
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Docker Compose for Spark

File: `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```

Notes:

- `/health` MUST return immediately (process up), not wait for model load
- Add a `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
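The liveness/readiness split above can be sketched framework-agnostically as two plain functions (the real backend would expose these as FastAPI routes; `model_ready` is a hypothetical flag flipped by the model-loading task):

```python
# Liveness vs readiness, as plain functions returning (status_code, body).
model_ready = {"loaded": False}  # hypothetical flag set after model load completes

def health() -> tuple[int, dict]:
    # /health: process is up -- never blocks on model loading
    return 200, {"status": "ok"}

def ready() -> tuple[int, dict]:
    # /ready: 503 until the model has finished loading
    if not model_ready["loaded"]:
        return 503, {"status": "loading"}
    return 200, {"status": "ready"}

print(health()[0], ready()[0])   # 200 503
model_ready["loaded"] = True
print(ready()[0])                # 200
```

Keeping `/health` independent of model state is what lets the compose healthcheck pass during Devstral's long (multi-minute) load instead of restart-looping the container.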

## Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |

## Hardware Requirements Summary

| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |