KinetoLabs Claude Opus 4.5 committed on
Commit b85b1e0 · 1 Parent(s): 6fc2368

Reduce vLLM memory for A100 24GB compatibility


- gpu_memory_utilization: 0.80 → 0.55 (leaves ~10GB for embedding + reranker)
- max_model_len: 16384 → 8192 (reduces KV cache memory)

Fixes OOM error when loading all 3 models on 22.3GB usable VRAM.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
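
As a sanity check on the numbers in the message above, here is a minimal sketch of the VRAM budget arithmetic. It is illustrative only and assumes gpu_memory_utilization is applied against the ~22.3GB of usable VRAM the OOM report cites:

    # Illustrative budget check, not code from the repo.
    usable_vram_gb = 22.3            # usable VRAM from the OOM error above
    vllm_fraction = 0.55             # new gpu_memory_utilization value
    vllm_budget_gb = usable_vram_gb * vllm_fraction   # ~12.3 GB for vision model + KV cache
    leftover_gb = usable_vram_gb - vllm_budget_gb     # ~10.0 GB for embedding + reranker
    print(f"vLLM budget: {vllm_budget_gb:.1f} GB, leftover: {leftover_gb:.1f} GB")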

Files changed (1)
  1. models/real.py +2 -2
models/real.py CHANGED
@@ -82,8 +82,8 @@ class RealModelStack:
             model=settings.vision_model,
             tensor_parallel_size=settings.vllm_tensor_parallel_size,  # 1 for single GPU
             trust_remote_code=True,
-            gpu_memory_utilization=0.80,  # Can use more on single GPU
-            max_model_len=settings.vllm_max_model_len,
+            gpu_memory_utilization=0.55,  # Leave ~10GB for embedding + reranker
+            max_model_len=8192,  # Reduced to save KV cache memory
         )

         # Load processor for chat template formatting
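
For context, the keyword arguments in this hunk belong to vLLM's LLM constructor. Below is a self-contained sketch of the resulting call; the model id is a placeholder standing in for settings.vision_model, whose actual value is not in the diff:

    from vllm import LLM

    # Standalone sketch of the call this hunk edits. The model id is a
    # placeholder assumption, not the project's real settings.vision_model.
    llm = LLM(
        model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder vision model id (assumed)
        tensor_parallel_size=1,             # single GPU
        trust_remote_code=True,
        gpu_memory_utilization=0.55,        # cap vLLM's share of VRAM
        max_model_len=8192,                 # shorter context -> smaller KV cache
    )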