# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)
> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.
## Overview
Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and add Docker deployment infrastructure for running the backend on the DGX Spark.
## Devstral Model Specifications
| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |
*VRAM is a planning guide. Actual usage varies with max context, KV cache, batch size, and attention implementation.
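The VRAM row can be reproduced with back-of-envelope math. A sketch, assuming BF16 at 2 bytes per value and `head_dim = hidden_size / num_heads`; real allocators add overhead on top of these figures:

```python
# Rough memory math for the spec table above (BF16 = 2 bytes/value).

def weight_bytes_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Raw weight memory in GB."""
    return params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_val: int = 2) -> int:
    """K and V, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_val

# Devstral: 40 layers, 8 KV heads (GQA), head_dim = 5120 // 32 = 160
gqa_per_token = kv_cache_bytes_per_token(40, 8, 160)   # 204,800 B ~ 200 KiB
mha_per_token = kv_cache_bytes_per_token(40, 32, 160)  # 4x larger without GQA

print(f"weights (24B, BF16): ~{weight_bytes_gb(24):.0f} GB")          # ~48 GB
print(f"KV cache @ 8K ctx:   ~{gqa_per_token * 8192 / 1e9:.1f} GB")   # ~1.7 GB
print(f"KV cache @ 8K, MHA:  ~{mha_per_token * 8192 / 1e9:.1f} GB")   # ~6.7 GB
```

This is why the GQA row in the table matters: the 8 KV heads cut the KV cache to a quarter of what full MHA would need at the same depth.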
## Deployment Environment Summary
| Environment | Backend Location | Frontend | Use Case |
|-------------|-----------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |
---
## Backend Model Support
### Add Devstral to Model Registry
**File:** `backend/model_config.py`
```python
"devstral-small": {
"hf_path": "mistralai/Devstral-Small-2507",
"display_name": "Devstral Small 24B",
"architecture": "mistral",
"size": "24B",
"num_layers": 40,
"num_heads": 32,
"num_kv_heads": 8, # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
"vocab_size": 131072,
"context_length": 131072,
"attention_type": "grouped_query",
"requires_gpu": True,
"min_vram_gb": 48.0,
"min_ram_gb": 96.0
}
```
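A malformed registry entry otherwise only surfaces at load time, so a small validation pass at import is cheap insurance. A sketch under assumptions: the `validate_entry` helper is illustrative, not existing code in `model_config.py`:

```python
# Illustrative sanity check for registry entries (helper name is hypothetical).

def validate_entry(name: str, cfg: dict) -> None:
    """Fail fast on missing keys or inconsistent GQA head counts."""
    required = {"hf_path", "num_layers", "num_heads", "context_length"}
    missing = required - cfg.keys()
    if missing:
        raise ValueError(f"{name}: missing keys {sorted(missing)}")
    kv = cfg.get("num_kv_heads")
    if kv is not None and cfg["num_heads"] % kv != 0:
        raise ValueError(f"{name}: num_heads must be divisible by num_kv_heads")

devstral = {
    "hf_path": "mistralai/Devstral-Small-2507",
    "num_layers": 40, "num_heads": 32, "num_kv_heads": 8,
    "context_length": 131072,
}
validate_entry("devstral-small", devstral)  # passes: 32 / 8 = 4 Q heads per KV head
```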
### Create MistralAdapter
**File:** `backend/model_adapter.py`
```python
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently"""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
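The projections returned by `get_qkv_projections` will not all have the same width: under GQA, K and V are narrower than Q. A quick cross-check using only numbers from the spec table (with `head_dim = hidden_size // num_heads`, the standard HF convention):

```python
# Expected Q/K/V projection widths for Devstral under GQA
# (hidden 5120, 32 Q heads, 8 KV heads, all from the spec table).
hidden, n_heads, n_kv = 5120, 32, 8

head_dim = hidden // n_heads  # 160
q_out = n_heads * head_dim    # 5120: q_proj keeps the full hidden width
kv_out = n_kv * head_dim      # 1280: k_proj/v_proj are 4x narrower

print(head_dim, q_out, kv_out)  # 160 5120 1280
```

Any attention-visualization code that assumes `k_proj.out_features == q_proj.out_features` (true for CodeGen's MHA) will break on Devstral, which is exactly what the adapter layer is meant to absorb.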
### Fix Hardcoded Layer Classification
**File:** `backend/model_service.py` (lines ~1505-1514)
```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
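Applied to both model depths, the percentage-based thresholds partition layers as follows (the `classify` helper mirrors the logic above, with the payload dicts reduced to labels):

```python
from collections import Counter

def classify(layer_idx: int, n_layers: int) -> str:
    """Same thresholds as model_service.py, returning just the pattern label."""
    frac = (layer_idx + 1) / n_layers
    if layer_idx == 0:
        return "positional"
    if frac <= 0.25:
        return "previous_token"
    if frac <= 0.75:
        return "induction"
    return "semantic"

for n in (20, 40):  # CodeGen depth vs Devstral depth
    print(n, Counter(classify(i, n) for i in range(n)))
# 20 layers -> 1 / 4 / 10 / 5 across positional/previous_token/induction/semantic
# 40 layers -> 1 / 9 / 20 / 10 for the same buckets
```

The point of the fix: the old hardcoded boundaries only made sense at 20 layers, while the fraction-based version scales the same proportions to any depth.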
---
## Frontend Dynamic Layer Handling
### Fix Hardcoded Layer Boundaries
**File:** `components/research/VerticalPipeline.tsx`
```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  // Use the same 1-indexed fraction as the backend so stage boundaries agree
  const fraction = (layerIdx + 1) / totalLayers;
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```
---
## DGX Spark Docker Deployment
### Dockerfile
```dockerfile
# Bump with care, retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3
WORKDIR /app
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose for Spark
**File:** `docker/compose.spark.yml`
```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```
**Notes:**
- `/health` MUST return immediately (process up), not wait for model load
- Add `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`
---
## Files to Modify/Create
| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |
---
## Hardware Requirements Summary
| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |