[node] estimation
README.md
CHANGED

@@ -19,8 +19,12 @@ An interactive Gradio application for estimating H100 GPU node requirements and
 
 ## Features
 
-- **Model Support**: Supports
+- **Comprehensive Model Support**: Supports 30+ models including:
+  - **Text Models**: LLaMA-2/3/3.1, Nemotron-4, Qwen2/2.5
+  - **Vision-Language**: Qwen-VL, Qwen2-VL, NVIDIA VILA series
+  - **Audio Models**: Qwen-Audio, Qwen2-Audio
 - **Smart Estimation**: Calculates memory requirements including model weights, KV cache, and operational overhead
+- **Multimodal Support**: Handles vision-language and audio-language models with specialized memory calculations
 - **Use Case Optimization**: Provides different estimates for inference, training, and fine-tuning scenarios
 - **Precision Support**: Handles different precision formats (FP32, FP16, BF16, INT8, INT4)
 - **Interactive Visualizations**: Memory breakdown charts and node utilization graphs

@@ -81,6 +85,9 @@ python app.py
 | LLaMA-3-70B | 4096/1024 | 4 | Inference | FP16 | 1 |
 | Qwen2.5-72B | 8192/2048 | 2 | Fine-tuning | BF16 | 1 |
 | Nemotron-4-340B | 2048/1024 | 1 | Inference | INT8 | 1-2 |
+| Qwen2-VL-7B | 1024/256 | 1 | Inference | FP16 | 1 |
+| VILA-1.5-13B | 2048/512 | 2 | Inference | BF16 | 1 |
+| Qwen2-Audio-7B | 1024/256 | 1 | Inference | FP16 | 1 |
 
 ## CUDA Recommendations
 
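The node counts in the table above can be sanity-checked by hand from the bytes-per-parameter figures in the README's Precision Impact section. The sketch below reproduces the LLaMA-3-70B row; the 8×80 GB node size, the 20% overhead factor, and LLaMA-3-70B's published config (80 layers, 8 KV heads, head dim 128) are assumptions for illustration, not necessarily the exact formula app.py uses.

```python
# Back-of-the-envelope check of the LLaMA-3-70B row (FP16, batch 4, 4096/1024 tokens).
# Assumptions (not taken from app.py): 8x H100 80 GB per node, LLaMA-3-70B's
# published config (80 layers, 8 KV heads, head dim 128), 20% operational overhead.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

def estimate_nodes(params, precision, seq_len, batch,
                   n_layers=80, n_kv_heads=8, head_dim=128,
                   gpu_mem_gb=80, gpus_per_node=8, overhead=0.20):
    weights_gb = params * BYTES_PER_PARAM[precision] / 1e9
    # K and V caches: 2 tensors per layer, one head_dim vector per token per KV head.
    kv_gb = (2 * n_layers * n_kv_heads * head_dim * seq_len * batch
             * BYTES_PER_PARAM[precision]) / 1e9
    total_gb = (weights_gb + kv_gb) * (1 + overhead)
    nodes = -(-total_gb // (gpu_mem_gb * gpus_per_node))  # ceiling division
    return total_gb, int(nodes)

total_gb, nodes = estimate_nodes(70e9, "FP16", seq_len=4096 + 1024, batch=4)
print(f"{total_gb:.0f} GB -> {nodes} node(s)")  # ~176 GB -> 1 node, matching the table
```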
@@ -110,17 +117,33 @@ The application provides tailored CUDA version recommendations:
 ## Technical Details
 
 ### Supported Models
+#### Text Models
 - **LLaMA**: 2-7B, 2-13B, 2-70B, 3-8B, 3-70B, 3.1-8B, 3.1-70B, 3.1-405B
 - **Nemotron**: 4-15B, 4-340B
 - **Qwen2**: 0.5B, 1.5B, 7B, 72B
 - **Qwen2.5**: 0.5B, 1.5B, 7B, 14B, 32B, 72B
 
+#### Vision-Language Models
+- **Qwen-VL**: Base, Chat, Plus, Max variants
+- **Qwen2-VL**: 2B, 7B, 72B
+- **NVIDIA VILA**: 1.5-3B, 1.5-8B, 1.5-13B, 1.5-40B
+
+#### Audio Models
+- **Qwen-Audio**: Base, Chat variants
+- **Qwen2-Audio**: 7B
+
 ### Precision Impact
 - **FP32**: Full precision (4 bytes per parameter)
 - **FP16/BF16**: Half precision (2 bytes per parameter)
 - **INT8**: 8-bit quantization (1 byte per parameter)
 - **INT4**: 4-bit quantization (0.5 bytes per parameter)
 
+### Multimodal Considerations
+- **Vision Models**: Process images as token sequences (typically 256-1024 tokens per image)
+- **Audio Models**: Handle audio segments with frame-based tokenization
+- **Memory Overhead**: Additional memory for vision/audio encoders and cross-modal attention
+- **Token Estimation**: Consider multimodal inputs when calculating token counts
+
 ## Limitations
 
 - Estimates are approximate and may vary based on:
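To make the new Token Estimation bullet concrete: for KV-cache sizing, multimodal inputs behave like extra prompt tokens. The helper below is hypothetical (it is not part of app.py) and uses the 256-1024 tokens-per-image range quoted in the hunk above.

```python
# Hypothetical helper (not in app.py): fold image inputs into the effective
# prompt length, using the 256-1024 tokens-per-image range from this README.

def effective_input_tokens(text_tokens: int, num_images: int = 0,
                           tokens_per_image: int = 576) -> int:
    """tokens_per_image varies with model and image resolution; 256-1024 is typical."""
    return text_tokens + num_images * tokens_per_image

# A 512-token prompt with two images sizes the KV cache like a ~1664-token prompt.
print(effective_input_tokens(512, num_images=2))  # 1664
```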
app.py
CHANGED

@@ -27,6 +27,23 @@ MODEL_SPECS = {
     "Qwen2.5-14B": {"params": 14e9, "base_memory_gb": 28},
     "Qwen2.5-32B": {"params": 32e9, "base_memory_gb": 64},
     "Qwen2.5-72B": {"params": 72e9, "base_memory_gb": 144},
+    # Qwen Vision Language Models
+    "Qwen-VL": {"params": 9.6e9, "base_memory_gb": 20},
+    "Qwen-VL-Chat": {"params": 9.6e9, "base_memory_gb": 20},
+    "Qwen-VL-Plus": {"params": 12e9, "base_memory_gb": 25},
+    "Qwen-VL-Max": {"params": 30e9, "base_memory_gb": 65},
+    "Qwen2-VL-2B": {"params": 2e9, "base_memory_gb": 5},
+    "Qwen2-VL-7B": {"params": 8e9, "base_memory_gb": 18},
+    "Qwen2-VL-72B": {"params": 72e9, "base_memory_gb": 150},
+    # NVIDIA VILA Series
+    "VILA-1.5-3B": {"params": 3e9, "base_memory_gb": 7},
+    "VILA-1.5-8B": {"params": 8e9, "base_memory_gb": 18},
+    "VILA-1.5-13B": {"params": 13e9, "base_memory_gb": 28},
+    "VILA-1.5-40B": {"params": 40e9, "base_memory_gb": 85},
+    # Qwen Audio Models
+    "Qwen-Audio": {"params": 8e9, "base_memory_gb": 18},
+    "Qwen-Audio-Chat": {"params": 8e9, "base_memory_gb": 18},
+    "Qwen2-Audio-7B": {"params": 8e9, "base_memory_gb": 18},
 }
 
 # H100 specifications

@@ -265,6 +282,7 @@ def create_interface():
     with gr.Blocks(title="H100 Node Estimator", theme=gr.themes.Soft()) as demo:
         gr.Markdown("# 🚀 H100 Node & CUDA Version Estimator")
         gr.Markdown("Get recommendations for H100 node count and CUDA version based on your model and workload requirements.")
+        gr.Markdown("**Now supports multimodal models**: LLaMA, Nemotron, Qwen2/2.5, Qwen-VL, VILA, and Qwen-Audio series!")
 
         with gr.Row():
             with gr.Column(scale=1):

@@ -274,7 +292,7 @@ def create_interface():
                 choices=list(MODEL_SPECS.keys()),
                 value="LLaMA-3-8B",
                 label="Model",
-                info="Select the model you want to run"
+                info="Select the model you want to run (includes text, vision-language, and audio models)"
             )
 
             input_tokens = gr.Number(

@@ -340,6 +358,9 @@ def create_interface():
         ["LLaMA-3-70B", 4096, 1024, 4, "inference", "FP16"],
         ["Qwen2.5-72B", 8192, 2048, 2, "fine_tuning", "BF16"],
         ["Nemotron-4-340B", 2048, 1024, 1, "inference", "INT8"],
+        ["Qwen2-VL-7B", 1024, 256, 1, "inference", "FP16"],
+        ["VILA-1.5-13B", 2048, 512, 2, "inference", "BF16"],
+        ["Qwen2-Audio-7B", 1024, 256, 1, "inference", "FP16"],
     ]
 
     gr.Examples(

@@ -352,6 +373,8 @@ def create_interface():
 
     gr.Markdown("""
     ## ℹ️ Notes
+    - **Multimodal Models**: Vision-language and audio models may require additional memory for image/audio processing
+    - **Token Estimation**: For multimodal models, consider image patches (~256-1024 tokens per image) and audio frames
    - Estimates are approximate and may vary based on actual implementation details
    - Memory calculations include model weights, KV cache, and operational overhead
    - Consider network bandwidth and storage requirements for multi-node setups
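One pattern worth noting in the expanded MODEL_SPECS: base_memory_gb tracks roughly 2 bytes per parameter (an FP16 baseline), with the multimodal entries padded by a few extra GB for their vision/audio encoders. The snippet below checks that reading against values copied from this diff; it is an observation about the data, not documented behavior of the app.

```python
# Consistency check of the convention described above; values copied from the
# MODEL_SPECS entries in this diff.
SPECS = {
    "Qwen2.5-72B": {"params": 72e9, "base_memory_gb": 144},
    "Qwen2-VL-7B": {"params": 8e9, "base_memory_gb": 18},
    "VILA-1.5-13B": {"params": 13e9, "base_memory_gb": 28},
    "Qwen2-Audio-7B": {"params": 8e9, "base_memory_gb": 18},
}

for name, spec in SPECS.items():
    fp16_gb = spec["params"] * 2 / 1e9  # 2 bytes per parameter
    print(f"{name}: {fp16_gb:.0f} GB FP16 weights vs "
          f"{spec['base_memory_gb']} GB base_memory_gb")
# Qwen2.5-72B: 144 GB FP16 weights vs 144 GB base_memory_gb
# Qwen2-VL-7B: 16 GB FP16 weights vs 18 GB base_memory_gb
# VILA-1.5-13B: 26 GB FP16 weights vs 28 GB base_memory_gb
# Qwen2-Audio-7B: 16 GB FP16 weights vs 18 GB base_memory_gb
```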