KinetoLabs Claude Opus 4.5 committed on
Commit b85b1e0 · 1 Parent(s): 6fc2368

Reduce vLLM memory for A100 24GB compatibility


- gpu_memory_utilization: 0.80 → 0.55 (leaves ~10GB for embedding + reranker)
- max_model_len: 16384 → 8192 (reduces KV cache memory)

Fixes OOM error when loading all 3 models on 22.3GB usable VRAM.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
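
As a sanity check on the numbers in the message above, here is a minimal sketch of the VRAM budget arithmetic. It is illustrative only and assumes gpu_memory_utilization is applied against the ~22.3GB of usable VRAM the OOM report cites:

    # Illustrative budget check, not code from the repo.
    usable_vram_gb = 22.3            # usable VRAM from the OOM error above
    vllm_fraction = 0.55             # new gpu_memory_utilization value
    vllm_budget_gb = usable_vram_gb * vllm_fraction   # ~12.3 GB for vision model + KV cache
    leftover_gb = usable_vram_gb - vllm_budget_gb     # ~10.0 GB for embedding + reranker
    print(f"vLLM budget: {vllm_budget_gb:.1f} GB, leftover: {leftover_gb:.1f} GB")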

Files changed (1)
  1. models/real.py +2 -2
models/real.py CHANGED
@@ -82,8 +82,8 @@ class RealModelStack:
             model=settings.vision_model,
             tensor_parallel_size=settings.vllm_tensor_parallel_size,  # 1 for single GPU
             trust_remote_code=True,
-            gpu_memory_utilization=0.80,  # Can use more on single GPU
-            max_model_len=settings.vllm_max_model_len,
+            gpu_memory_utilization=0.55,  # Leave ~10GB for embedding + reranker
+            max_model_len=8192,  # Reduced to save KV cache memory
         )

         # Load processor for chat template formatting
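
For context, the keyword arguments in this hunk belong to vLLM's LLM constructor. Below is a self-contained sketch of the resulting call; the model id is a placeholder standing in for settings.vision_model, whose actual value is not in the diff:

    from vllm import LLM

    # Standalone sketch of the call this hunk edits. The model id is a
    # placeholder assumption, not the project's real settings.vision_model.
    llm = LLM(
        model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder vision model id (assumed)
        tensor_parallel_size=1,             # single GPU
        trust_remote_code=True,
        gpu_memory_utilization=0.55,        # cap vLLM's share of VRAM
        max_model_len=8192,                 # shorter context -> smaller KV cache
    )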