gary-boon (Claude Opus 4.5) committed
Commit: decb5ab · Parent(s): 9056859
Limit QKV matrices to top 5 heads per layer to reduce response size
The QKV visualization fix caused response sizes to explode because we were
sending Q/K/V matrices for all 32 heads × 40 layers = 1280 sets of matrices.
For a 50-token sequence with head_dim=128, that's ~25 million floats per
token step, causing 504 Gateway Timeouts during streaming.
Now only the top 5 heads (by attention weight) per layer retain their QKV
matrices. This reduces the QKV data by ~85% while still providing meaningful
visualization data for the most important attention patterns.
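The numbers above can be checked with some quick arithmetic. This is a back-of-envelope sketch using only the shapes stated in the commit message (40 layers, 32 heads, head_dim=128, a 50-token sequence, Q/K/V per head); the variable names are illustrative, not from the codebase.

```python
# Payload size before and after limiting QKV to the top 5 heads per layer.
n_layers, n_heads, head_dim, seq_len = 40, 32, 128, 50
floats_per_head = 3 * seq_len * head_dim          # one Q, K, and V matrix

all_heads = n_layers * n_heads                    # 1280 head/layer pairs
full_payload = all_heads * floats_per_head        # ~24.6M floats per token step

top_k = 5                                         # heads that keep QKV per layer
pruned_payload = n_layers * top_k * floats_per_head

reduction = 1 - pruned_payload / full_payload
print(f"{full_payload:,} -> {pruned_payload:,} floats ({reduction:.0%} smaller)")
```

This reproduces the figures in the message: ~24.6M floats per step before pruning, and an ~84% reduction (the "~85%" quoted above) after it.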
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- backend/model_service.py +15 -0
backend/model_service.py CHANGED
@@ -1882,6 +1882,14 @@ async def analyze_research_attention(request: Dict[str, Any], authenticated: boo
     # Sort by max_weight (return all heads, frontend will decide how many to display)
     critical_heads.sort(key=lambda h: h["max_weight"], reverse=True)
 
+    # Only keep QKV matrices for top 5 heads to avoid massive response sizes
+    # (40 layers × 32 heads × 3 matrices × seq_len × head_dim is too much data)
+    for i, head in enumerate(critical_heads):
+        if i >= 5:  # Keep QKV only for top 5 heads
+            head["q_matrix"] = None
+            head["k_matrix"] = None
+            head["v_matrix"] = None
+
     # Detect layer-level pattern (percentage-based for any layer count)
     layer_pattern = None
     layer_fraction = (layer_idx + 1) / n_layers  # 1-indexed fraction
@@ -2423,6 +2431,13 @@ async def analyze_research_attention_stream(request: Dict[str, Any], authenticat
 
     critical_heads.sort(key=lambda h: h["max_weight"], reverse=True)
 
+    # Only keep QKV matrices for top 5 heads to avoid massive response sizes
+    for i, head in enumerate(critical_heads):
+        if i >= 5:
+            head["q_matrix"] = None
+            head["k_matrix"] = None
+            head["v_matrix"] = None
+
     layer_pattern = None
     layer_fraction = (layer_idx + 1) / n_layers
     if layer_idx == 0:
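The pruning logic in the patch can be exercised in isolation. This is a standalone sketch on dummy data: the head dicts are simplified stand-ins for the entries built in analyze_research_attention, and the "head" index field is invented here for readability; only "max_weight" and the q/k/v matrix keys come from the diff.

```python
# Dummy critical_heads list: one dict per head, with a placeholder matrix
# standing in for the real seq_len × head_dim Q/K/V arrays.
critical_heads = [
    {"head": i, "max_weight": w,
     "q_matrix": [[0.0]], "k_matrix": [[0.0]], "v_matrix": [[0.0]]}
    for i, w in enumerate([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2])
]

# Same two steps as the patch: sort by attention weight, then drop QKV
# for everything past the top 5 heads.
critical_heads.sort(key=lambda h: h["max_weight"], reverse=True)
for i, head in enumerate(critical_heads):
    if i >= 5:
        head["q_matrix"] = None
        head["k_matrix"] = None
        head["v_matrix"] = None

kept = [h["head"] for h in critical_heads if h["q_matrix"] is not None]
print(kept)  # heads 0, 5, 3, 2, 4 keep their matrices; 6 and 1 are pruned
```

Note that the pruned heads stay in the response with their max_weight intact, so the frontend can still rank all heads; only the bulky matrices are nulled out.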