[node] estimation
README.md
CHANGED

@@ -19,8 +19,12 @@ An interactive Gradio application for estimating H100 GPU node requirements and
 
 ## Features
 
-- **Model Support**: Supports
+- **Comprehensive Model Support**: Supports 30+ models including:
+  - **Text Models**: LLaMA-2/3/3.1, Nemotron-4, Qwen2/2.5
+  - **Vision-Language**: Qwen-VL, Qwen2-VL, NVIDIA VILA series
+  - **Audio Models**: Qwen-Audio, Qwen2-Audio
 - **Smart Estimation**: Calculates memory requirements including model weights, KV cache, and operational overhead
+- **Multimodal Support**: Handles vision-language and audio-language models with specialized memory calculations
 - **Use Case Optimization**: Provides different estimates for inference, training, and fine-tuning scenarios
 - **Precision Support**: Handles different precision formats (FP32, FP16, BF16, INT8, INT4)
 - **Interactive Visualizations**: Memory breakdown charts and node utilization graphs

@@ -81,6 +85,9 @@ python app.py
 | LLaMA-3-70B | 4096/1024 | 4 | Inference | FP16 | 1 |
 | Qwen2.5-72B | 8192/2048 | 2 | Fine-tuning | BF16 | 1 |
 | Nemotron-4-340B | 2048/1024 | 1 | Inference | INT8 | 1-2 |
+| Qwen2-VL-7B | 1024/256 | 1 | Inference | FP16 | 1 |
+| VILA-1.5-13B | 2048/512 | 2 | Inference | BF16 | 1 |
+| Qwen2-Audio-7B | 1024/256 | 1 | Inference | FP16 | 1 |
 
 ## CUDA Recommendations
 
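The node counts in the table above can be sanity-checked by hand from the bytes-per-parameter figures in the README's Precision Impact section. The sketch below reproduces the LLaMA-3-70B row; the 8×80 GB node size, the 20% overhead factor, and LLaMA-3-70B's published config (80 layers, 8 KV heads, head dim 128) are assumptions for illustration, not necessarily the exact formula app.py uses.

```python
# Back-of-the-envelope check of the LLaMA-3-70B row (FP16, batch 4, 4096/1024 tokens).
# Assumptions (not taken from app.py): 8x H100 80 GB per node, LLaMA-3-70B's
# published config (80 layers, 8 KV heads, head dim 128), 20% operational overhead.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

def estimate_nodes(params, precision, seq_len, batch,
                   n_layers=80, n_kv_heads=8, head_dim=128,
                   gpu_mem_gb=80, gpus_per_node=8, overhead=0.20):
    weights_gb = params * BYTES_PER_PARAM[precision] / 1e9
    # K and V caches: 2 tensors per layer, one head_dim vector per token per KV head.
    kv_gb = (2 * n_layers * n_kv_heads * head_dim * seq_len * batch
             * BYTES_PER_PARAM[precision]) / 1e9
    total_gb = (weights_gb + kv_gb) * (1 + overhead)
    nodes = -(-total_gb // (gpu_mem_gb * gpus_per_node))  # ceiling division
    return total_gb, int(nodes)

total_gb, nodes = estimate_nodes(70e9, "FP16", seq_len=4096 + 1024, batch=4)
print(f"{total_gb:.0f} GB -> {nodes} node(s)")  # ~176 GB -> 1 node, matching the table
```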
@@ -110,17 +117,33 @@ The application provides tailored CUDA version recommendations:
 ## Technical Details
 
 ### Supported Models
+#### Text Models
 - **LLaMA**: 2-7B, 2-13B, 2-70B, 3-8B, 3-70B, 3.1-8B, 3.1-70B, 3.1-405B
 - **Nemotron**: 4-15B, 4-340B
 - **Qwen2**: 0.5B, 1.5B, 7B, 72B
 - **Qwen2.5**: 0.5B, 1.5B, 7B, 14B, 32B, 72B
 
+#### Vision-Language Models
+- **Qwen-VL**: Base, Chat, Plus, Max variants
+- **Qwen2-VL**: 2B, 7B, 72B
+- **NVIDIA VILA**: 1.5-3B, 1.5-8B, 1.5-13B, 1.5-40B
+
+#### Audio Models
+- **Qwen-Audio**: Base, Chat variants
+- **Qwen2-Audio**: 7B
+
 ### Precision Impact
 - **FP32**: Full precision (4 bytes per parameter)
 - **FP16/BF16**: Half precision (2 bytes per parameter)
 - **INT8**: 8-bit quantization (1 byte per parameter)
 - **INT4**: 4-bit quantization (0.5 bytes per parameter)
 
+### Multimodal Considerations
+- **Vision Models**: Process images as token sequences (typically 256-1024 tokens per image)
+- **Audio Models**: Handle audio segments with frame-based tokenization
+- **Memory Overhead**: Additional memory for vision/audio encoders and cross-modal attention
+- **Token Estimation**: Consider multimodal inputs when calculating token counts
+
 ## Limitations
 
 - Estimates are approximate and may vary based on:
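To make the new Token Estimation bullet concrete: for KV-cache sizing, multimodal inputs behave like extra prompt tokens. The helper below is hypothetical (it is not part of app.py) and uses the 256-1024 tokens-per-image range quoted in the hunk above.

```python
# Hypothetical helper (not in app.py): fold image inputs into the effective
# prompt length, using the 256-1024 tokens-per-image range from this README.

def effective_input_tokens(text_tokens: int, num_images: int = 0,
                           tokens_per_image: int = 576) -> int:
    """tokens_per_image varies with model and image resolution; 256-1024 is typical."""
    return text_tokens + num_images * tokens_per_image

# A 512-token prompt with two images sizes the KV cache like a ~1664-token prompt.
print(effective_input_tokens(512, num_images=2))  # 1664
```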
app.py
CHANGED

@@ -27,6 +27,23 @@ MODEL_SPECS = {
     "Qwen2.5-14B": {"params": 14e9, "base_memory_gb": 28},
     "Qwen2.5-32B": {"params": 32e9, "base_memory_gb": 64},
     "Qwen2.5-72B": {"params": 72e9, "base_memory_gb": 144},
+    # Qwen Vision Language Models
+    "Qwen-VL": {"params": 9.6e9, "base_memory_gb": 20},
+    "Qwen-VL-Chat": {"params": 9.6e9, "base_memory_gb": 20},
+    "Qwen-VL-Plus": {"params": 12e9, "base_memory_gb": 25},
+    "Qwen-VL-Max": {"params": 30e9, "base_memory_gb": 65},
+    "Qwen2-VL-2B": {"params": 2e9, "base_memory_gb": 5},
+    "Qwen2-VL-7B": {"params": 8e9, "base_memory_gb": 18},
+    "Qwen2-VL-72B": {"params": 72e9, "base_memory_gb": 150},
+    # NVIDIA VILA Series
+    "VILA-1.5-3B": {"params": 3e9, "base_memory_gb": 7},
+    "VILA-1.5-8B": {"params": 8e9, "base_memory_gb": 18},
+    "VILA-1.5-13B": {"params": 13e9, "base_memory_gb": 28},
+    "VILA-1.5-40B": {"params": 40e9, "base_memory_gb": 85},
+    # Qwen Audio Models
+    "Qwen-Audio": {"params": 8e9, "base_memory_gb": 18},
+    "Qwen-Audio-Chat": {"params": 8e9, "base_memory_gb": 18},
+    "Qwen2-Audio-7B": {"params": 8e9, "base_memory_gb": 18},
 }
 
 # H100 specifications

@@ -265,6 +282,7 @@ def create_interface():
     with gr.Blocks(title="H100 Node Estimator", theme=gr.themes.Soft()) as demo:
         gr.Markdown("# 🚀 H100 Node & CUDA Version Estimator")
         gr.Markdown("Get recommendations for H100 node count and CUDA version based on your model and workload requirements.")
+        gr.Markdown("**Now supports multimodal models**: LLaMA, Nemotron, Qwen2/2.5, Qwen-VL, VILA, and Qwen-Audio series!")
 
         with gr.Row():
             with gr.Column(scale=1):

@@ -274,7 +292,7 @@ def create_interface():
                 choices=list(MODEL_SPECS.keys()),
                 value="LLaMA-3-8B",
                 label="Model",
-                info="Select the model you want to run"
+                info="Select the model you want to run (includes text, vision-language, and audio models)"
             )
 
             input_tokens = gr.Number(

@@ -340,6 +358,9 @@ def create_interface():
         ["LLaMA-3-70B", 4096, 1024, 4, "inference", "FP16"],
         ["Qwen2.5-72B", 8192, 2048, 2, "fine_tuning", "BF16"],
         ["Nemotron-4-340B", 2048, 1024, 1, "inference", "INT8"],
+        ["Qwen2-VL-7B", 1024, 256, 1, "inference", "FP16"],
+        ["VILA-1.5-13B", 2048, 512, 2, "inference", "BF16"],
+        ["Qwen2-Audio-7B", 1024, 256, 1, "inference", "FP16"],
     ]
 
     gr.Examples(

@@ -352,6 +373,8 @@ def create_interface():
 
     gr.Markdown("""
     ## ℹ️ Notes
+    - **Multimodal Models**: Vision-language and audio models may require additional memory for image/audio processing
+    - **Token Estimation**: For multimodal models, consider image patches (~256-1024 tokens per image) and audio frames
    - Estimates are approximate and may vary based on actual implementation details
    - Memory calculations include model weights, KV cache, and operational overhead
    - Consider network bandwidth and storage requirements for multi-node setups
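One pattern worth noting in the expanded MODEL_SPECS: base_memory_gb tracks roughly 2 bytes per parameter (an FP16 baseline), with the multimodal entries padded by a few extra GB for their vision/audio encoders. The snippet below checks that reading against values copied from this diff; it is an observation about the data, not documented behavior of the app.

```python
# Consistency check of the convention described above; values copied from the
# MODEL_SPECS entries in this diff.
SPECS = {
    "Qwen2.5-72B": {"params": 72e9, "base_memory_gb": 144},
    "Qwen2-VL-7B": {"params": 8e9, "base_memory_gb": 18},
    "VILA-1.5-13B": {"params": 13e9, "base_memory_gb": 28},
    "Qwen2-Audio-7B": {"params": 8e9, "base_memory_gb": 18},
}

for name, spec in SPECS.items():
    fp16_gb = spec["params"] * 2 / 1e9  # 2 bytes per parameter
    print(f"{name}: {fp16_gb:.0f} GB FP16 weights vs "
          f"{spec['base_memory_gb']} GB base_memory_gb")
# Qwen2.5-72B: 144 GB FP16 weights vs 144 GB base_memory_gb
# Qwen2-VL-7B: 16 GB FP16 weights vs 18 GB base_memory_gb
# VILA-1.5-13B: 26 GB FP16 weights vs 28 GB base_memory_gb
# Qwen2-Audio-7B: 16 GB FP16 weights vs 18 GB base_memory_gb
```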