KinetoLabs Claude Opus 4.5 committed
Commit 7d5c713 · 1 Parent(s): b2fe3f4

Reduce context/memory to minimize NCCL overhead on L4s

Changes:
- max_model_len: 16384 → 8192 (halves the context length and the per-sequence KV cache)
- gpu_memory_utilization: 0.70 → 0.50 (eases memory pressure)

L4 GPUs lack NVLink, so tensor-parallel communication runs over PCIe,
which is comparatively slow. Per-step NCCL all-reduce volume grows with
sequence length, so halving the context bounds the worst-case traffic,
while the lower memory target leaves VRAM headroom outside vLLM's
allocator for NCCL buffers (a sizing sketch follows below).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
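For scale, here is a back-of-envelope sketch of what halving max_model_len does to the per-sequence KV cache. The layer count, KV-head count, and head dimension below are illustrative assumptions, not the deployed model's actual shape:

# Rough KV-cache sizing; model dims are hypothetical placeholders
# (32 layers, 8 KV heads of dim 128, FP8 = 1 byte/element).
def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 1) -> int:
    # Two tensors (K and V) per layer, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len

for n in (16384, 8192):
    print(f"{n:>6} tokens -> {kv_cache_bytes(n) / 2**20:.0f} MiB per sequence")

Under these assumed dimensions, the cap drops from 1024 MiB to 512 MiB per maximum-length sequence.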

Files changed (2):
  1. config/settings.py +1 -1
  2. models/real.py +1 -1
config/settings.py CHANGED
@@ -25,7 +25,7 @@ class Settings(BaseSettings):
 
     # vLLM configuration
     vllm_tensor_parallel_size: int = 4  # Use all 4 L4 GPUs
-    vllm_max_model_len: int = 16384  # Reduced from 32768 for memory safety
+    vllm_max_model_len: int = 8192  # Reduced to minimize NCCL overhead on L4s
 
     # ChromaDB
     chroma_persist_dir: str = "./chroma_db"
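Since Settings extends BaseSettings, these defaults can be overridden per host through environment variables without touching code; a minimal sketch, assuming pydantic v2's pydantic-settings package and its default case-insensitive field-name-to-env-var mapping:

import os
from pydantic_settings import BaseSettings  # assumes the pydantic v2 layout

class Settings(BaseSettings):
    # Field names map case-insensitively to env vars by default
    vllm_tensor_parallel_size: int = 4
    vllm_max_model_len: int = 8192

os.environ["VLLM_MAX_MODEL_LEN"] = "16384"  # e.g. on an NVLink-equipped host
print(Settings().vllm_max_model_len)  # -> 16384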
models/real.py CHANGED
@@ -92,7 +92,7 @@ class RealModelStack:
             tensor_parallel_size=settings.vllm_tensor_parallel_size,
             trust_remote_code=True,
             # dtype removed - FP8 model auto-detects native quantization
-            gpu_memory_utilization=0.70,  # Official model card recommendation
+            gpu_memory_utilization=0.50,  # Reduced to minimize NCCL overhead on L4s
             max_model_len=settings.vllm_max_model_len,
             # enforce_eager removed - let vLLM default (False) per official
         )
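Put together, the engine construction in RealModelStack after this commit reduces to roughly the sketch below; the model id is a placeholder (the checkpoint name is not in the diff) and the Settings values are inlined for self-containment:

from vllm import LLM

llm = LLM(
    model="FP8-MODEL-ID",          # placeholder; actual checkpoint not shown here
    tensor_parallel_size=4,        # settings.vllm_tensor_parallel_size: all 4 L4s
    trust_remote_code=True,
    # dtype omitted: the FP8 checkpoint carries its own quantization config
    gpu_memory_utilization=0.50,   # vLLM claims half of each GPU's VRAM
    max_model_len=8192,            # settings.vllm_max_model_len after this commit
)

Note that gpu_memory_utilization is the fraction of each GPU's memory vLLM pre-allocates for weights plus KV cache; whatever remains is available to NCCL buffers and other CUDA allocations.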