# Adding Devstral Model Support + DGX Spark Deployment (Full Plan)

> This is the complete reference plan. See `devstral-spark-plan-phased.md` for the incremental implementation approach.

## Overview

Add support for `mistralai/Devstral-Small-2507` (a 24B-parameter Mistral-based code model) to the Research Attention Analyzer, and create Docker deployment infrastructure for running the backend on the DGX Spark.

## Devstral Model Specifications

| Parameter | Devstral | CodeGen (current) | Code Llama |
|-----------|----------|-------------------|------------|
| Parameters | 24B | 350M | 7B |
| Layers | 40 | 20 | 32 |
| Attention Heads | 32 | 16 | 32 |
| KV Heads (GQA) | 8 | N/A (MHA) | 32 |
| Hidden Size | 5120 | 1024 | 4096 |
| Vocab Size | 131,072 | 51,200 | 32,000 |
| Context Length | 128K | 2K | 16K |
| Min VRAM (BF16)* | ~48GB | 2GB | 14GB |
| Architecture | `mistral` | `gpt_neox` | `llama` |

*VRAM is a planning guide. Actual usage varies with max context, KV cache, batch size, and attention implementation.
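
The BF16 figure can be sanity-checked from first principles: weights take roughly params × 2 bytes, and the KV cache adds 2 × layers × kv_heads × head_dim × seq_len × 2 bytes per sequence. A rough sketch (the head_dim of 128 is an assumption; read the real value from the model's `config.json` rather than trusting this number):

```python
def estimate_vram_gb(params_b, n_layers, n_kv_heads, head_dim, seq_len,
                     bytes_per_param=2, batch=1):
    """Back-of-envelope BF16 memory estimate: weights + KV cache.

    Ignores activations and framework overhead, so treat the result
    as a floor, not a budget.
    """
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, one per KV head
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param * batch
    return (weights + kv_cache) / 1e9

# Devstral at an 8K working context (head_dim assumed 128)
print(round(estimate_vram_gb(24, 40, 8, 128, 8192), 1))  # → 49.3
```

At an 8K context this lands just above the 48GB weight floor, which is why the table treats VRAM as a planning guide rather than a budget.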

## Deployment Environment Summary

| Environment | Backend Location | Frontend | Use Case |
|-------------|-----------------|----------|----------|
| Local Dev (Mac) | localhost:8000 | localhost:3000 | CodeGen only (350M) |
| DGX Spark | spark-c691.local:8000 | localhost:3000 or Vercel | Devstral/larger models |
| Production | HuggingFace Spaces | Vercel | Public access (CodeGen) |

---

## Backend Model Support

### Add Devstral to Model Registry

**File:** `backend/model_config.py`

```python
"devstral-small": {
    "hf_path": "mistralai/Devstral-Small-2507",
    "display_name": "Devstral Small 24B",
    "architecture": "mistral",
    "size": "24B",
    "num_layers": 40,
    "num_heads": 32,
    "num_kv_heads": 8,  # GQA: 32 Q heads, 8 KV heads (4:1 ratio)
    "vocab_size": 131072,
    "context_length": 131072,
    "attention_type": "grouped_query",
    "requires_gpu": True,
    "min_vram_gb": 48.0,
    "min_ram_gb": 96.0
}
```
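
Registry entries are easy to mistype, and a bad head count only surfaces halfway through a 48GB model load. A small validation pass at startup can catch that earlier; a minimal sketch (`validate_model_entry` is a hypothetical helper, not an existing function in `model_config.py`):

```python
def validate_model_entry(name: str, cfg: dict) -> None:
    """Sanity-check a registry entry before attempting a model load."""
    heads, kv_heads = cfg["num_heads"], cfg.get("num_kv_heads")
    if kv_heads is not None and heads % kv_heads != 0:
        raise ValueError(f"{name}: {heads} Q heads not divisible by {kv_heads} KV heads")
    if cfg.get("requires_gpu") and cfg.get("min_vram_gb", 0) <= 0:
        raise ValueError(f"{name}: GPU required but min_vram_gb not set")

devstral = {
    "num_heads": 32, "num_kv_heads": 8,
    "requires_gpu": True, "min_vram_gb": 48.0,
}
validate_model_entry("devstral-small", devstral)  # passes silently
```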

### Create MistralAdapter

**File:** `backend/model_adapter.py`

```python
class MistralAdapter(ModelAdapter):
    """Adapter for Mistral-based models (Devstral, Mistral, etc.)"""

    def _get_layers(self):
        """Defensive access: Mistral layers may be nested differently"""
        if hasattr(self.model, 'model') and hasattr(self.model.model, 'layers'):
            return self.model.model.layers  # MistralForCausalLM wrapper
        elif hasattr(self.model, 'layers'):
            return self.model.layers  # Direct model access
        raise AttributeError("Cannot find transformer layers in Mistral model")

    def get_num_layers(self) -> int:
        return self.model.config.num_hidden_layers

    def get_num_heads(self) -> int:
        return self.model.config.num_attention_heads

    def get_num_kv_heads(self) -> Optional[int]:
        return getattr(self.model.config, 'num_key_value_heads', None)

    def get_layer_module(self, layer_idx: int):
        return self._get_layers()[layer_idx]

    def get_attention_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].self_attn

    def get_mlp_module(self, layer_idx: int):
        return self._get_layers()[layer_idx].mlp

    def get_qkv_projections(self, layer_idx: int):
        attn = self.get_attention_module(layer_idx)
        return attn.q_proj, attn.k_proj, attn.v_proj
```
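
One GQA wrinkle for attention visualization: Devstral's 32 query heads share only 8 KV heads, so per-head views need a query-to-KV mapping. A sketch of that mapping (a hypothetical helper, not part of the adapter above; the grouping assumes HF transformers' consecutive-group `repeat_kv` layout):

```python
def kv_head_for_query_head(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """Map a query head index to the KV head it attends with under GQA.

    Consecutive groups of (num_heads // num_kv_heads) query heads share
    one KV head, matching the expansion order used by repeat_kv.
    """
    group_size = num_heads // num_kv_heads
    return q_head // group_size

# Devstral: 32 Q heads over 8 KV heads -> groups of 4
assert kv_head_for_query_head(0, 32, 8) == 0
assert kv_head_for_query_head(7, 32, 8) == 1
assert kv_head_for_query_head(31, 32, 8) == 7
```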

### Fix Hardcoded Layer Classification

**File:** `backend/model_service.py` (lines ~1505-1514)

```python
# Fixed (percentage-based, 1-indexed fraction for transformer blocks):
layer_fraction = (layer_idx + 1) / n_layers
if layer_idx == 0:
    layer_pattern = {"type": "positional", ...}
elif layer_fraction <= 0.25:
    layer_pattern = {"type": "previous_token", ...}
elif layer_fraction <= 0.75:
    layer_pattern = {"type": "induction", ...}
else:
    layer_pattern = {"type": "semantic", ...}
```
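
Applied to Devstral's 40 layers, these thresholds give 1 positional, 9 previous_token, 20 induction, and 10 semantic layers. A standalone check mirroring the logic above:

```python
from collections import Counter

def classify_layer(layer_idx: int, n_layers: int) -> str:
    """Mirror of the percentage-based classification above (type field only)."""
    if layer_idx == 0:
        return "positional"
    fraction = (layer_idx + 1) / n_layers
    if fraction <= 0.25:
        return "previous_token"
    if fraction <= 0.75:
        return "induction"
    return "semantic"

counts = Counter(classify_layer(i, 40) for i in range(40))
print(dict(counts))
# → {'positional': 1, 'previous_token': 9, 'induction': 20, 'semantic': 10}
```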

---

## Frontend Dynamic Layer Handling

### Fix Hardcoded Layer Boundaries

**File:** `components/research/VerticalPipeline.tsx`

```typescript
const getStageInfo = (layerIdx: number, totalLayers: number) => {
  if (layerIdx === 0) return { color: 'yellow', label: 'EMBEDDING' };
  const fraction = (layerIdx + 1) / totalLayers;  // 1-indexed fraction, matching the backend
  if (fraction <= 0.25) return { color: 'green', label: 'EARLY' };
  if (fraction <= 0.75) return { color: 'blue', label: 'MIDDLE' };
  return { color: 'purple', label: 'LATE' };
};
```

---

## DGX Spark Docker Deployment

### Dockerfile

```dockerfile
# Bump with care, retest CUDA + torch compatibility
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "backend.model_service:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Docker Compose for Spark

**File:** `docker/compose.spark.yml`

```yaml
services:
  visualisable-ai-backend:
    build:
      context: ..
      dockerfile: Dockerfile
    ports:
      - "${PORT:-8000}:8000"
    shm_size: "8gb"
    volumes:
      - ..:/app
      - /srv/models:/srv/models:ro
      - /srv/models-cache/huggingface:/srv/models-cache/huggingface:rw
      - ../runs:/app/runs
    environment:
      - HF_HOME=/srv/models-cache/huggingface
      - TRANSFORMERS_CACHE=/srv/models-cache/huggingface
      - DEFAULT_MODEL=${DEFAULT_MODEL:-devstral-small}
      - API_KEY=${API_KEY}
      - HF_TOKEN=${HF_TOKEN}
      - HUGGINGFACE_HUB_TOKEN=${HF_TOKEN}
      - MAX_CONTEXT=${MAX_CONTEXT:-8192}
      - BATCH_SIZE=${BATCH_SIZE:-1}
      - TORCH_DTYPE=${TORCH_DTYPE:-bf16}
    gpus: all
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 5
    restart: unless-stopped
```

**Notes:**
- `/health` MUST return immediately (process up), not wait for model load
- Add `/ready` endpoint for model readiness
- Multiple branches: use `PORT=8001 docker compose -p visai-branch-a ...`

---

## Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| `backend/model_config.py` | MODIFY | Add Devstral entry |
| `backend/model_adapter.py` | MODIFY | Add MistralAdapter |
| `backend/model_service.py` | MODIFY | Fix hardcoded layer thresholds |
| `Dockerfile` | CREATE | Docker image |
| `docker/compose.spark.yml` | CREATE | Spark compose config |
| `.env.spark.example` | CREATE | Environment template |
| `components/research/VerticalPipeline.tsx` | MODIFY | Dynamic layer boundaries |

---

## Hardware Requirements Summary

| Model | Deployment | Hardware |
|-------|------------|----------|
| CodeGen 350M | Mac Studio / HuggingFace | CPU or any GPU |
| Code Llama 7B | Mac Studio (MPS) / HuggingFace | 14GB+ VRAM |
| Devstral 24B | Mac Studio (CPU) / DGX Spark | 96GB+ RAM or 48GB+ VRAM |