Commit b85b1e0 · Parent: 6fc2368
Reduce vLLM memory for A100 24GB compatibility
- gpu_memory_utilization: 0.80 → 0.55 (leaves ~10 GB for embedding + reranker)
- max_model_len: 16384 → 8192 (reduces KV cache memory)
Fixes OOM error when loading all 3 models on 22.3 GB usable VRAM.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
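
A quick back-of-envelope check of the split the message describes (the 22.3 GB figure is from this commit message; the variable names below are ours, for illustration only):

    usable_vram_gb = 22.3          # usable VRAM on the card, per the commit message
    vllm_fraction = 0.55           # new gpu_memory_utilization value

    vllm_budget_gb = usable_vram_gb * vllm_fraction    # ~12.3 GB pre-allocated by vLLM
    headroom_gb = usable_vram_gb - vllm_budget_gb      # ~10.0 GB left for embedding + reranker

    print(f"vLLM budget: {vllm_budget_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")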
Files changed:
models/real.py · +2 -2
@@ -82,8 +82,8 @@ class RealModelStack:
             model=settings.vision_model,
             tensor_parallel_size=settings.vllm_tensor_parallel_size,  # 1 for single GPU
             trust_remote_code=True,
-            gpu_memory_utilization=0.80,
-            max_model_len=16384,
+            gpu_memory_utilization=0.55,  # Leave ~10GB for embedding + reranker
+            max_model_len=8192,  # Reduced to save KV cache memory
         )

         # Load processor for chat template formatting
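
For context, a minimal self-contained sketch of the call being patched, using vLLM's offline LLM entry point; the model name is a hypothetical stand-in for settings.vision_model, which the repo reads from its own config:

    from vllm import LLM

    # Hypothetical stand-in values; in the repo these come from `settings`.
    llm = LLM(
        model="some-org/vision-model",   # placeholder for settings.vision_model
        tensor_parallel_size=1,          # single GPU
        trust_remote_code=True,
        gpu_memory_utilization=0.55,     # fraction of VRAM vLLM pre-allocates
        max_model_len=8192,              # shorter max context -> smaller KV cache
    )

Lowering gpu_memory_utilization shrinks the pool vLLM reserves up front, and lowering max_model_len shrinks the KV cache that must fit inside that pool, which is why the two changes together resolve the OOM.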