Spaces:
Sleeping
Sleeping
gary-boon
Claude Opus 4.5
committed on
Commit
·
a94eb19
1
Parent(s):
959074d
Add per-step memory cleanup for large model support
Browse files
- Delete outputs, logits, probs tensors after each generation step
- Run garbage collection every 8 steps for large models
- Clear MPS cache on Apple Silicon to release GPU memory
This prevents memory accumulation during generation with models
like Devstral that have 40 layers × 32 heads, which previously
caused RAM exhaustion on longer token sequences (32+).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- backend/model_service.py +32 -0
backend/model_service.py
CHANGED
|
@@ -1990,6 +1990,22 @@ async def analyze_research_attention(request: Dict[str, Any], authenticated: boo
|
|
| 1990 |
if next_token_id == manager.tokenizer.eos_token_id:
|
| 1991 |
break
|
| 1992 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1993 |
# Clean up hooks after generation
|
| 1994 |
for hook in hooks:
|
| 1995 |
hook.remove()
|
|
@@ -2523,6 +2539,22 @@ async def analyze_research_attention_stream(request: Dict[str, Any], authenticat
|
|
| 2523 |
if next_token_id == manager.tokenizer.eos_token_id:
|
| 2524 |
break
|
| 2525 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2526 |
# Clean up hooks
|
| 2527 |
for hook in hooks:
|
| 2528 |
hook.remove()
|
|
|
|
| 1990 |
if next_token_id == manager.tokenizer.eos_token_id:
|
| 1991 |
break
|
| 1992 |
|
| 1993 |
+
# Free memory from this step's outputs to prevent accumulation
|
| 1994 |
+
# This is critical for large models like Devstral (40 layers, 32 heads)
|
| 1995 |
+
del outputs
|
| 1996 |
+
del logits
|
| 1997 |
+
del probs
|
| 1998 |
+
if 'layer_attn' in dir():
|
| 1999 |
+
del layer_attn
|
| 2000 |
+
if 'current_hidden' in dir():
|
| 2001 |
+
del current_hidden
|
| 2002 |
+
|
| 2003 |
+
# Periodic garbage collection for large models (every 8 steps)
|
| 2004 |
+
if (step + 1) % 8 == 0:
|
| 2005 |
+
gc.collect()
|
| 2006 |
+
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
|
| 2007 |
+
torch.mps.empty_cache() if hasattr(torch.mps, 'empty_cache') else None
|
| 2008 |
+
|
| 2009 |
# Clean up hooks after generation
|
| 2010 |
for hook in hooks:
|
| 2011 |
hook.remove()
|
|
|
|
| 2539 |
if next_token_id == manager.tokenizer.eos_token_id:
|
| 2540 |
break
|
| 2541 |
|
| 2542 |
+
# Free memory from this step's outputs to prevent accumulation
|
| 2543 |
+
# This is critical for large models like Devstral (40 layers, 32 heads)
|
| 2544 |
+
del outputs
|
| 2545 |
+
del logits
|
| 2546 |
+
del probs
|
| 2547 |
+
if 'layer_attn' in dir():
|
| 2548 |
+
del layer_attn
|
| 2549 |
+
if 'current_hidden' in dir():
|
| 2550 |
+
del current_hidden
|
| 2551 |
+
|
| 2552 |
+
# Periodic garbage collection for large models (every 8 steps)
|
| 2553 |
+
if (step + 1) % 8 == 0:
|
| 2554 |
+
gc.collect()
|
| 2555 |
+
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
|
| 2556 |
+
torch.mps.empty_cache() if hasattr(torch.mps, 'empty_cache') else None
|
| 2557 |
+
|
| 2558 |
# Clean up hooks
|
| 2559 |
for hook in hooks:
|
| 2560 |
hook.remove()
|