Spaces: Running

vllm #33
by zhiminy - opened
- Dockerfile +8 -16
- README.md +51 -74
- app.py +463 -494
- backend-cli.py +1 -3
- moe-cap-results +0 -0
- requirements.txt +35 -5
- src/backend/run_eval_suite.py +8 -22
- src/backend/tasks/arena_hard/task.py +1 -1
- src/backend/tasks/measurement_task_utils.py +8 -21
- src/display/about.py +4 -7
- src/display/utils.py +40 -64
- src/leaderboard/read_evals.py +10 -11
- src/populate.py +1 -1
- src/utils.py +32 -826
Dockerfile
CHANGED
```diff
@@ -1,16 +1,8 @@
-        huggingface_hub>=0.20.0 \
-        uvicorn>=0.23.0 \
-        fastapi \
-        spaces
-
-COPY app.py /app/app.py
-
-CMD ["python", "/app/app.py"]
+# Use specific python image
+FROM registry.hf.space/sparse-generative-ai-open-moe-llm-leaderboard:latest
+
+RUN pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity --no-cache-dir
+# To fix pydantic version
+RUN pip install pydantic==2.6.4 --no-cache-dir
+# To fix selfcheck (selfchatgpt) dataset missing
+RUN python -m spacy download en
```
README.md
CHANGED
```diff
@@ -1,108 +1,85 @@
 ---
-title:
+title: OPEN-MOE-LLM-LEADERBOARD
 emoji: 🔥
 colorFrom: green
 colorTo: indigo
 sdk: gradio
-sdk_version:
+sdk_version: 4.26.0
 app_file: app.py
 pinned: true
 license: apache-2.0
 fullWidth: true
-allow_embedding: true
 tags:
   - leaderboard
 ---
 
-- MoE-CAP has been accepted to NeurIPS 2025 Dataset and Benchmark Track 🎉 See you in San Diego, US.
-
-## Installation
-Python: >= 3.9
-
-```bash
-git clone https://github.com/sparse-generative-ai/MoE-CAP.git
-cd MoE-CAP
-pip install -e .
-```
-Then you can import `moe_cap` directly.
-
-### SGLang
-1. Launch our sglang custom server (e.g. H100)
-```bash
-python -m moe_cap.systems.sglang \
-    --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
-    --port 30000 \
-    --expert-distribution-recorder-mode stat \
-    --tp-size 8 \
-    --reasoning-parser deepseek-r1
-```
-
-2. Run the profiler:
-```bash
-python -m moe_cap.runner.sglang_profile \
-    --config-file configs/gsm8k_qwen3_235b_a22b.yaml \
-    --output_dir outputs/sglang/
-```
-
-### vLLM
-```bash
-    --port 8000 \
-    --host 0.0.0.0 \
-    --tensor-parallel-size 8 \
-    --reasoning-parser deepseek_r1 \
-    --max-num-batched-tokens 131072 # Set max-num-batched-tokens large referring to vLLM tuning guide.
-# V1's mixed prefill-decode batching makes separate profiling difficult.
-```
-
-## Cite our paper
-```bibtex
-@misc{jiang2025moecapbenchmarkingcostaccuracy,
-      title={MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems},
-      author={Yinsicheng Jiang and Yao Fu and Yeqi Huang and Ping Nie and Zhan Lu and Leyang Xue and Congjie He and Man-Kit Sit and Jilong Xue and Li Dong and Ziming Miao and Dayou Du and Tairan Xu and Kai Zou and Edoardo Ponti and Luo Mai},
-      year={2025},
-      eprint={2412.07067},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG},
-      url={https://arxiv.org/abs/2412.07067},
-}
-```
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+# Contributing to Open-MOE-LLM-Leaderboard
+
+Thank you for your interest in contributing to the Open-MOE-LLM-Leaderboard project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yao Fu via email at [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk).
+
+## What We're Looking For in Contributions
+
+We are looking for contributions in several key areas to enhance the Open-MOE-LLM-Leaderboard project:
+
+1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
+
+2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
+
+3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
+
+4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
+
+Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
+
+Your contributions are crucial to the success and improvement of the Open-MOE-LLM-Leaderboard project. We look forward to collaborating with you.
+
+## Development Setup
+
+To start contributing, set up your development environment as follows:
+
+```bash
+conda create -n leaderboard python=3.10
+conda activate leaderboard
+pip install -r requirements.txt
+pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
+pip install pydantic==2.6.4 # Resolves a dependency conflict with moe-infinity
+python -m spacy download en # Required for selfcheckgpt
+```
+
+## Architecture Overview
+
+The Open-MOE-LLM-Leaderboard project uses the following architecture:
+
+- **User Interface (Gradio)** ->upload-> **HuggingFace Dataset (Request)** ->download-> **Backend GPU Server** ->upload-> **HuggingFace Dataset (Result)** ->download-> **User Interface (Gradio)**
+
+In brief:
+1. Users submit model benchmarking requests through the Gradio interface ([app.py](./app.py)). These requests are then recorded in a HuggingFace dataset ([sparse-generative-ai/requests](https://huggingface.co/datasets/sparse-generative-ai/requests)).
+2. The backend ([backend-cli.py](./backend-cli.py)), running on a GPU server, processes these requests, performs the benchmarking tasks, and uploads the results to another HuggingFace dataset ([sparse-generative-ai/results](https://huggingface.co/datasets/sparse-generative-ai/results)).
+3. Finally, the Gradio interface retrieves and displays these results to the users.
+
+## Running the Gradio Interface
+
+To launch the Gradio interface, execute:
+
+```bash
+python app.py
+```
+
+Then, open your browser and navigate to http://127.0.0.1:7860.
+
+## Running the Backend
+
+To start the backend process, use:
+
+```bash
+python backend-cli.py --debug
+```
+
+For additional details, please consult the [backend-cli.py](./backend-cli.py) script.
+
+---
+
+We look forward to your contributions and are here to help guide you through the process. Thank you for supporting the Open-MOE-LLM-Leaderboard project!
```
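The architecture the new README describes is a round trip through two HuggingFace datasets. As a rough sketch of the upload half of that flow, the snippet below records a request file in the requests dataset. This is not code from this PR: the JSON field names and file-naming scheme are assumptions for illustration only; the only real API used is `huggingface_hub.HfApi.upload_file`.

```python
# Hypothetical sketch of the request-upload step in the README's architecture.
# Field names and the file layout are illustrative assumptions, not the
# project's actual schema. Requires a write token in practice (token=...).
import json
import tempfile

from huggingface_hub import HfApi


def submit_request(model: str, precision: str, inference_framework: str) -> None:
    """Record a benchmarking request as a JSON file in the requests dataset."""
    request = {
        "model": model,
        "precision": precision,
        "inference_framework": inference_framework,
        "status": "PENDING",  # the backend would later mark it RUNNING/FINISHED
    }
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(request, f)
        path = f.name
    HfApi().upload_file(
        path_or_fileobj=path,
        path_in_repo=f"{model.replace('/', '_')}_eval_request.json",  # assumed naming
        repo_id="sparse-generative-ai/requests",
        repo_type="dataset",
    )
```

The backend side of the loop would then be the mirror image: download pending request files, run the evaluation, and upload a result JSON to `sparse-generative-ai/results`.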
app.py
CHANGED
```diff
@@ -1,527 +1,496 @@
 #!/usr/bin/env python
 import os
-import
-
-RESULT_DIR = os.environ.get("MOECAP_RESULT_DIR")
-if not RESULT_DIR:
-    # For testing purposes, you can uncomment the line below:
-    # RESULT_DIR = "generic_result_dir"
-    # If you are running locally without this env var,
-    # ensure you handle this error or set the var.
-    pass
 
 import gradio as gr
 import pandas as pd
-
-def
-    if max_tick == 0:
-        return baseline + 40
-    return baseline + (max_tick - min(val, max_tick)) / max_tick * (100 - baseline)
-
-
-def
-    layout_settings = dict(
-        height=750,
-        autosize=True,
-        margin=dict(t=80, b=100, l=80, r=80),
-        paper_bgcolor='white',
-        plot_bgcolor='white',
-    )
-
-    if
-            xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
-            font=dict(size=16, color="black"),  # Ensure text is black
-            xanchor='center', yanchor='middle'
-        )
-        fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
-        return fig
-
-    if len(selected_rows_data) > 3:
-        fig = go.Figure()
-        fig.add_annotation(
-            text="Error: Please select no more than 3 rows!",
-            xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
-            font=dict(size=18, color="red"),
-            xanchor='center', yanchor='middle'
-        )
-
-    datasets = [row.get('Dataset', '') for row in selected_rows_data]
-    unique_datasets = set(datasets)
-    if len(unique_datasets) > 1:
-        fig = go.Figure()
-        fig.add_annotation(
-            text="Error: Please select rows from the same dataset!",
-            xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
-            font=dict(size=18, color="red"),
-            xanchor='center', yanchor='middle'
-        )
-
-        model_name = row.get('Model', 'Unknown')
-        if isinstance(model_name, str) and 'href' in model_name:
-            try:
-                model_name = model_name.split('>', 1)[1].split('<', 1)[0]
-            except:
-                pass
-
-        method = row.get('Method', '')
-        if isinstance(model_name, str) and '/' in model_name:
-            legend_name = model_name.split('/')[-1]
-        else:
-            legend_name = str(model_name)
-
-        if method and method not in ['Unknown', '-', '']:
-            legend_name = f"{legend_name}-{method}"
-
-        acc = row.get('Accuracy(%)', 0)
-        cost = row.get('Cost($)', 0)
-        throughput = row.get('Decoding T/s', 0)
-
-        try:
-            acc = float(acc) if acc not in [None, '-', ''] else 0
-            cost = float(cost) if cost not in [None, '-', ''] else 0
-            throughput = float(throughput) if throughput not in [None, '-', ''] else 0
-        except:
-            acc, cost, throughput = 0, 0, 0
-
-        data[legend_name] = {
-            'accuracy': acc / 100.0 if acc > 1 else acc,
-            'cost': cost,
-            'throughput': throughput
-        }
-
-    throughputs = [v['throughput'] for v in data.values()]
-    costs = [v['cost'] for v in data.values()]
-    accs = [v['accuracy'] for v in data.values()]
-
-    tp_min, tp_max = (min(throughputs), max(throughputs)) if throughputs else (0, 1)
-    cost_max = max(costs) if costs else 1
-    acc_min, acc_max = (min(accs), 1.0) if accs else (0, 1)
-
-    baseline = 20
-    categories = ['Throughput (T/s)', 'Cost ($)', 'Accuracy', 'Throughput (T/s)']
-
-    fig = go.Figure()
-
-        hovertext = [
-            f"Throughput: {raw_vals[0]:.2f} T/s",
-            f"Cost: ${raw_vals[1]:.2f}",
-            f"Accuracy: {raw_vals[2]*100:.2f}%",
-            f"Throughput: {raw_vals[0]:.2f} T/s"
-        ]
-
-        fig.add_trace(go.Scatterpolar(
-            r=norm_vals,
-            theta=categories,
-            fill='toself',
-            name=system,
-            text=hovertext,
-            hoverinfo='text+name',
-            line=dict(width=2)
-        ))
-
-    fig.update_layout(
-        title=dict(text=f"CAP Radar Plot: {dataset_name}", x=0.5, xanchor='center', font=dict(size=20, color="black")),
-        polar=dict(
-            radialaxis=dict(
-                visible=True,
-                range=[0, 100],
-                tickfont=dict(size=12, color="black"),
-                gridcolor='lightgray',  # Add this
-                linecolor='gray',  # Add this
-                showline=True  # Add this
-            ),
-            angularaxis=dict(
-                tickfont=dict(size=14, color="black"),
-                rotation=90,
-                direction='clockwise',
-                gridcolor='lightgray',  # Add this
-                linecolor='gray',  # Add this
-                showline=True  # Add this
-            ),
-            bgcolor="white"
-        ),
-        legend=dict(orientation='h', yanchor='bottom', y=-0.15, xanchor='center', x=0.5, font=dict(size=13, color="black")),
-        **layout_settings
-    )
-
-    return fig
-
-
-def json_to_row(path: str, metrics: dict) -> dict:
-    model_name = metrics.get("model_name")
-    if not model_name:
-        model_name = "unknown-model"
-
-    dataset = metrics.get("dataset", "Unknown")
-    method = metrics.get("method", "Unknown")
-    precision = metrics.get("precision", "Unknown")
-    model_type = metrics.get("model_type", "Unknown")
-    e2e_s = metrics.get("e2e_s", None)
-    batch_size = metrics.get("batch_size", None)
-    gpu_type = metrics.get("gpu_type", "")
-    cost = metrics.get("cost", None)
-
-    em = metrics.get("exact_match")
-    correct = metrics.get("correct")
-    total = metrics.get("total")
-    if isinstance(correct, (int, float)) and isinstance(total, (int, float)) and total > 0:
-        acc = correct / total
-    else:
-        acc = em
-
-    def pct(x):
-        return round(x * 100, 2) if isinstance(x, (int, float)) else None
-
-    if isinstance(model_name, str) and "/" in model_name:
-        hf_url = f"https://huggingface.co/{model_name}"
-        model_cell = f"<a href='{hf_url}' target='_blank' style='color: #0366d6; text-decoration: none;'>{model_name}</a>"
-    else:
-        model_cell = model_name
-
-    row = {
-        "Model": model_cell,
-        "Dataset": dataset,
-        "Method": method,
-        "Model type": model_type,
-        "Precision": precision,
-        "E2E(s)": f2(e2e_s),
-        "GPU": gpu_type,
-        "Accuracy(%)": pct(acc),
-        "Cost($)": cost,
-        "Decoding T/s": f2(metrics.get("decoding_throughput")),
-        "Prefill T/s": f2(metrics.get("prefill_tp")),
-        "Prefill<br>S-MBU(%)": pct(metrics.get("prefill_smbu")),
-        "Prefill<br>S-MFU(%)": pct(metrics.get("prefill_smfu")),
-        "Decoding<br>S-MBU(%)": pct(metrics.get("decoding_smbu")),
-        "Decoding<br>S-MFU(%)": pct(metrics.get("decoding_smfu")),
-        "TTFT(s)": f2(metrics.get("ttft")),
-        "TPOT(s)": f2(metrics.get("tpot")),
-        "Batch size": batch_size,
-    }
-    return row
-
-
-def load_from_dir(dir_path: str, selected_tasks=None, selected_frameworks=None, selected_model_types=None, selected_precisions=None, search_keyword="", force_refresh=False):
-    if not dir_path:
-        return "<p style='color:black'>Result Directory not set.</p>", []
-
-        print(f"Fetching from {pattern} (mode={dl_mode})...")
-        ds = load_dataset("json", data_files={"train": pattern}, split="train", download_mode=dl_mode)
-    except Exception as e:
-        print(f"Error loading dataset: {e}")
-        return "<p style='color:black'>No files loaded or Dataset not found.</p>", []
-
-        rows.append(json_to_row(f"{dir_path}#{i}", metrics))
-
-    if selected_tasks:
-        df = df[df["Dataset"].astype(str).str.lower().isin([x.lower() for x in selected_tasks])]
-    if selected_frameworks:
-        df = df[df["Method"].astype(str).str.lower().isin([str(x).lower() for x in selected_frameworks])]
-    if selected_model_types:
-        df = df[df["Model type"].astype(str).str.lower().isin([str(x).lower() for x in selected_model_types])]
-    if selected_precisions:
-        df = df[df["Precision"].astype(str).str.lower().isin([str(x).lower() for x in selected_precisions])]
-    if search_keyword and search_keyword.strip():
-        df = df[df.astype(str).apply(lambda row: row.str.lower().str.contains(search_keyword.strip().lower()).any(), axis=1)]
-
-    if df.empty:
-        return "<p style='color:black'>No records found.</p>", []
-
-    df = df.fillna("-")
-    df.insert(0, 'Row #', range(len(df)))
-
-    table_html = f'<div class="table-container">{df.to_html(escape=False, index=False, classes="metrics-table")}</div>'
-    df_without_rownum = df.drop('Row #', axis=1)
-    return table_html, df_without_rownum.to_dict('records')
-
-def
-}
-
-/* Remove background color from text elements to prevent "dark blocks" */
-.filter-section label,
-.filter-section span,
-.filter-section p {
-    background-color: transparent !important;
-}
-
-/* 4. BUTTON FIXES - TARGET BY ID FOR SPECIFICITY */
-#gen_btn {
-    background-color: #0366d6 !important;
-    color: white !important;
-    border: none !important;
-}
-#gen_btn:hover {
-    opacity: 0.9;
-}
-
-/* 5. INPUTS & CHECKBOXES */
-/* Re-apply white background to inputs specifically */
-.filter-section input,
-.filter-section textarea,
-.filter-section select {
-    background-color: #ffffff !important;
-    border: 1px solid #d1d5da !important;
-    color: #24292e !important;
-}
-
-/* --- FIX FOR CHECKBOXES --- */
-/* Use explicit styling for the checked state to ensure visibility */
-.filter-section input[type="checkbox"] {
-    appearance: none !important;
-    -webkit-appearance: none !important;
-    width: 16px !important;
-    height: 16px !important;
-    background-color: white !important;
-    border: 1px solid #d1d5da !important;
-    border-radius: 3px !important;
-    position: relative !important;
-    cursor: pointer !important;
-}
-
-.filter-section input[type="checkbox"]:checked {
-    background-color: #0366d6 !important;
-    border-color: #0366d6 !important;
-    /* Draw the checkmark using an SVG data URI */
-    background-image: url("data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M12.207 4.793a1 1 0 010 1.414l-5 5a1 1 0 01-1.414 0l-2-2a1 1 0 011.414-1.414L6.5 9.086l4.293-4.293a1 1 0 011.414 0z'/%3e%3c/svg%3e") !important;
-    background-size: 100% 100% !important;
-    background-position: center !important;
-    background-repeat: no-repeat !important;
-}
-
-.filter-section label span {
-    color: #24292e !important;
-}
-
-/* 6. SEARCH BOX */
-.search-box {
-    background: white !important;
-    padding: 16px !important;
-    border-radius: 6px;
-    border: 2px solid #e1e4e8 !important;
-    margin-bottom: 16px;
-}
-
-/* 7. TABLE STYLING */
-.table-container {
-    overflow-x: auto;
-    max-height: 75vh;
-    border: 2px solid #e1e4e8;
-    border-radius: 6px;
-    background: white !important;
-}
-table.metrics-table {
-    width: 100%; border-collapse: collapse; background: white !important;
-}
-table.metrics-table th, table.metrics-table td {
-    padding: 10px 14px; border: 1px solid #e1e4e8;
-    white-space: nowrap; font-size: 13px; color: #24292e !important;
-}
-table.metrics-table th {
-    background: #f6f8fa !important; font-weight: 600; position: sticky; top: 0;
-}
-.metrics-table th:first-child, .metrics-table td:first-child {
-    background-color: #f0f0f0 !important; text-align: center;
-}
-
-/* 8. PLOT CONTAINER - FORCE WHITE BACKGROUND */
-.plot-container {
-    width: 100% !important;
-    background-color: white !important;
-}
-.plot-container > div, .plot-container .plotly {
-    background-color: white !important;
-}
-
-/* 9. LINKS */
-a { color: #0366d6 !important; text-decoration: none; }
-a:hover { text-decoration: underline; }
-"""
-
-with gr.Blocks(title="MoE-CAP Dashboard", css=row_css, theme=gr.themes.Default()) as demo:
-    gr.Markdown("# MoE-CAP Dashboard")
-
-    with gr.Row():
-        # Left Sidebar
-        with gr.Column(scale=2):
-            with gr.Group(elem_classes="search-box"):
-                search_input = gr.Textbox(label="🔍 Search", placeholder="Search...", lines=1)
-
-            with gr.Group(elem_classes="filter-section"):
-                gr.Markdown("### 🎛️ Filters")
-                dir_path = gr.State(RESULT_DIR)
-
-                task_filter = gr.CheckboxGroup(
-                    label="📊 Tasks",
-                    choices=[("GSM8K", "gsm8k"), ("LongBench", "longbench"), ("MMLU", "mmlu"), ("NuminaMath", "numinamath"), ("RULER", "ruler")],
-                    value=["gsm8k", "longbench", "mmlu", "numinamath", "ruler"]
-                )
-
-                "### Metrics\n- **E2E(s)**: Latency | **Cost($)** | **T/s**: Throughput | **S-MBU/MFU**: Utilization | **TPOT**, **TTFT**",
-                elem_classes="info-section"
-            )
-
-        precision_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
-
-        generate_btn.click(fn=parse_and_generate_plot, inputs=[df_data_state, row_indices_input], outputs=[radar_plot])
-
-        gr.Timer(60.0).tick(fn=auto_refresh_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
+import datetime
+import socket
+import base64
+from threading import Thread
+import time
+from apscheduler.schedulers.background import BackgroundScheduler
+
+from huggingface_hub import snapshot_download
+
+from src.display.about import (
+    CITATION_BUTTON_LABEL,
+    CITATION_BUTTON_TEXT,
+    EVALUATION_QUEUE_TEXT,
+    INTRODUCTION_TEXT,
+    LLM_BENCHMARKS_TEXT,
+    LLM_BENCHMARKS_DETAILS,
+    FAQ_TEXT,
+    TITLE,
+    ACKNOWLEDGEMENT_TEXT,
+)
+
+from src.display.css_html_js import custom_css
+
+from src.display.utils import (
+    BENCHMARK_COLS,
+    COLS,
+    EVAL_COLS,
+    EVAL_TYPES,
+    TYPES,
+    AutoEvalColumn,
+    ModelType,
+    InferenceFramework,
+    fields,
+    WeightType,
+    Precision,
+    GPUType
+)
+
+from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, H4_TOKEN, IS_PUBLIC, \
+    QUEUE_REPO, REPO_ID, RESULTS_REPO, DEBUG_QUEUE_REPO, DEBUG_RESULTS_REPO
+from src.populate import get_evaluation_queue_df, get_leaderboard_df
+from src.submission.submit import add_new_eval
+from src.utils import get_dataset_summary_table
+
+def get_args():
+    import argparse
+
+    parser = argparse.ArgumentParser(description="Run the LLM Leaderboard")
+    parser.add_argument("--debug", action="store_true", help="Run in debug mode")
+    return parser.parse_args()
+
+args = get_args()
+if args.debug:
+    print("Running in debug mode")
+    QUEUE_REPO = DEBUG_QUEUE_REPO
+    RESULTS_REPO = DEBUG_RESULTS_REPO
+
+def ui_snapshot_download(repo_id, local_dir, repo_type, tqdm_class, etag_timeout):
+    try:
+        print(local_dir)
+        snapshot_download(
+            repo_id=repo_id, local_dir=local_dir, repo_type=repo_type, tqdm_class=tqdm_class, etag_timeout=etag_timeout
+        )
+    except Exception as e:
+        restart_space()
+
+
+def restart_space():
+    API.restart_space(repo_id=REPO_ID, token=H4_TOKEN)
+
+
+def init_space():
+    # dataset_df = get_dataset_summary_table(file_path="blog/Hallucination-Leaderboard-Summary.csv")
+
+    if socket.gethostname() not in {"neuromancer"}:
+        # sync model_type with open-llm-leaderboard
+        ui_snapshot_download(
+            repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30
+        )
+        ui_snapshot_download(
+            repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30
+        )
+    raw_data, original_df = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, "", COLS, BENCHMARK_COLS)
+
+    finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df = get_evaluation_queue_df(
+        EVAL_REQUESTS_PATH, EVAL_COLS
+    )
+    # return dataset_df, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df
+    return None, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df
+
+
+def add_benchmark_columns(shown_columns):
+    benchmark_columns = []
+    for benchmark in BENCHMARK_COLS:
+        if benchmark in shown_columns:
+            for c in COLS:
+                if benchmark in c and benchmark != c:
+                    benchmark_columns.append(c)
+    return benchmark_columns
+
+
+# Searching and filtering
+def update_table(
+    hidden_df: pd.DataFrame, columns: list, type_query: list, precision_query: list, size_query: list, query: str
+):
+    filtered_df = filter_models(hidden_df, type_query, size_query, precision_query)
+    filtered_df = filter_queries(query, filtered_df)
+    benchmark_columns = add_benchmark_columns(columns)
+    df = select_columns(filtered_df, columns + benchmark_columns)
+    return df
+
+
+def search_table(df: pd.DataFrame, query: str) -> pd.DataFrame:
+    return df[(df[AutoEvalColumn.dummy.name].str.contains(query, case=False))]
+
+
+def select_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
+    # always_here_cols = [AutoEvalColumn.model_type_symbol.name, AutoEvalColumn.model.name]
+
+    always_here_cols = [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
+    dummy_col = [AutoEvalColumn.dummy.name]
+
+    # We use COLS to maintain sorting
+    filtered_df = df[
+        # always_here_cols + [c for c in COLS if c in df.columns and c in columns] + [AutoEvalColumn.dummy.name]
+        always_here_cols
+        + [c for c in COLS if c in df.columns and c in columns]
+        + dummy_col
+    ]
+    return filtered_df
+
+
+def filter_queries(query: str, filtered_df: pd.DataFrame):
+    final_df = []
+    if query != "":
+        queries = [q.strip() for q in query.split(";")]
+        for _q in queries:
+            _q = _q.strip()
+            if _q != "":
+                temp_filtered_df = search_table(filtered_df, _q)
+                if len(temp_filtered_df) > 0:
+                    final_df.append(temp_filtered_df)
+        if len(final_df) > 0:
+            filtered_df = pd.concat(final_df)
+            subset = [AutoEvalColumn.model.name, AutoEvalColumn.precision.name, AutoEvalColumn.revision.name]
+            filtered_df = filtered_df.drop_duplicates(subset=subset)
+    return filtered_df
+
+
+def filter_models(df: pd.DataFrame, type_query: list, size_query: list, precision_query: list) -> pd.DataFrame:
+    # Show all models
+    filtered_df = df
+
+    type_emoji = [t[0] for t in type_query]
+    filtered_df = filtered_df.loc[df[AutoEvalColumn.model_type_symbol.name].isin(type_emoji)]
+    filtered_df = filtered_df.loc[df[AutoEvalColumn.precision.name].isin(precision_query + ["None"])]
+
+    # numeric_interval = pd.IntervalIndex(sorted([NUMERIC_INTERVALS[s] for s in size_query]))
+    # params_column = pd.to_numeric(df[AutoEvalColumn.params.name], errors="coerce")
+    # mask = params_column.apply(lambda x: any(numeric_interval.contains(x)))
+    # filtered_df = filtered_df.loc[mask]
+
+    return filtered_df
+
+shown_columns = None
+dataset_df, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df = init_space()
+leaderboard_df = original_df.copy()
+
+# def update_leaderboard_table():
+#     global leaderboard_df, shown_columns
+#     print("Updating leaderboard table")
+#     return leaderboard_df[
+#         [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
+#         + shown_columns.value
+#         + [AutoEvalColumn.dummy.name]
+#     ] if not leaderboard_df.empty else leaderboard_df
+
+# def update_hidden_leaderboard_table():
+#     global original_df
+#     return original_df[COLS] if original_df.empty is False else original_df
+
+# def update_dataset_table():
+#     global dataset_df
+#     return dataset_df
+
+# def update_finish_table():
+#     global finished_eval_queue_df
+#     return finished_eval_queue_df
+
+# def update_running_table():
+#     global running_eval_queue_df
+#     return running_eval_queue_df
+
+# def update_pending_table():
+#     global pending_eval_queue_df
+#     return pending_eval_queue_df
+
+# def update_finish_num():
+#     global finished_eval_queue_df
+#     return len(finished_eval_queue_df)
+
+# def update_running_num():
+#     global running_eval_queue_df
+#     return len(running_eval_queue_df)
+
+# def update_pending_num():
+#     global pending_eval_queue_df
+#     return len(pending_eval_queue_df)
+
+# triggered only once at startup => read query parameter if it exists
+def load_query(request: gr.Request):
+    query = request.query_params.get("query") or ""
+    return query
+
+
+def get_image_html(url, image_path):
+    with open(image_path, "rb") as image_file:
+        encoded_string = base64.b64encode(image_file.read()).decode()
+    return f'<a href="{url}" target="_blank"><img src="data:image/jpg;base64,{encoded_string}" alt="NetMind.AI Logo" style="width:100pt;"></a>'
+
+
+# Prepare the HTML content with the image
+image_html = get_image_html("https://netmind.ai/home", "./src/display/imgs/Netmind.AI_LOGO.jpg")
+
+
+demo = gr.Blocks(css=custom_css)
+with demo:
+    gr.HTML(TITLE)
+    gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
+    gr.HTML(ACKNOWLEDGEMENT_TEXT.format(image_html=image_html))
+
+    with gr.Tabs(elem_classes="tab-buttons") as tabs:
+        with gr.TabItem("open-moe-llm-leaderboard", elem_id="llm-benchmark-tab-table", id=0):
+            with gr.Row():
+                with gr.Column():
+                    with gr.Row():
+                        search_bar = gr.Textbox(
+                            placeholder=" 🔍 Model search (separate multiple queries with `;`)",
+                            show_label=False,
+                            elem_id="search-bar"
+                        )
+                    with gr.Row():
+                        shown_columns = gr.CheckboxGroup(
+                            choices=[
+                                c.name
+                                for c in fields(AutoEvalColumn)
+                                if not c.hidden and not c.never_hidden and not c.dummy
+                            ],
+                            value=[
+                                c.name
+                                for c in fields(AutoEvalColumn)
+                                if c.displayed_by_default and not c.hidden and not c.never_hidden
+                            ],
+                            label="Select columns to show",
+                            elem_id="column-select",
+                            interactive=True,
+                        )
+
+                with gr.Column(min_width=320):
+                    filter_columns_size = gr.CheckboxGroup(
+                        label="Inference frameworks",
+                        choices=[t.to_str() for t in InferenceFramework],
+                        value=[t.to_str() for t in InferenceFramework],
+                        interactive=True,
+                        elem_id="filter-columns-size",
+                    )
+
+                    filter_columns_type = gr.CheckboxGroup(
+                        label="Model types",
+                        choices=[t.to_str() for t in ModelType],
+                        value=[t.to_str() for t in ModelType],
+                        interactive=True,
+                        elem_id="filter-columns-type",
+                    )
+
+                    filter_columns_precision = gr.CheckboxGroup(
+                        label="Precision",
+                        choices=[i.value.name for i in Precision],
+                        value=[i.value.name for i in Precision],
+                        interactive=True,
+                        elem_id="filter-columns-precision",
+                    )
+
+                    # filter_columns_size = gr.CheckboxGroup(
+                    #     label="Model sizes (in billions of parameters)",
+                    #     choices=list(NUMERIC_INTERVALS.keys()),
+                    #     value=list(NUMERIC_INTERVALS.keys()),
+                    #     interactive=True,
+                    #     elem_id="filter-columns-size",
+                    # )
+
+            # breakpoint()
+            benchmark_columns = add_benchmark_columns(shown_columns.value)
+            leaderboard_table = gr.components.Dataframe(
+                value=(
+                    leaderboard_df[
+                        [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
+                        + shown_columns.value
+                        + benchmark_columns
+                        + [AutoEvalColumn.dummy.name]
+                    ]
+                    if leaderboard_df.empty is False
+                    else leaderboard_df
+                ),
+                headers=[c.name for c in fields(AutoEvalColumn) if c.never_hidden] + shown_columns.value + benchmark_columns,
+                datatype=TYPES,
+                elem_id="leaderboard-table",
+                interactive=False,
+                visible=True,
+            )  # column_widths=["2%", "20%"]
+
+            # Dummy leaderboard for handling the case when the user uses backspace key
+            hidden_leaderboard_table_for_search = gr.components.Dataframe(
+                value=original_df[COLS] if original_df.empty is False else original_df,
+                headers=COLS,
+                datatype=TYPES,
+                visible=False,
+            )
+
+            search_bar.submit(
+                update_table,
+                [
+                    hidden_leaderboard_table_for_search,
+                    shown_columns,
+                    filter_columns_type,
+                    filter_columns_precision,
+                    filter_columns_size,
+                    search_bar,
+                ],
+                leaderboard_table
+            )
+
+            # Check query parameter once at startup and update search bar
+            demo.load(load_query, inputs=[], outputs=[search_bar])
+
+            for selector in [shown_columns, filter_columns_type, filter_columns_precision, filter_columns_size]:
+                selector.change(
+                    update_table,
+                    [
+                        hidden_leaderboard_table_for_search,
+                        shown_columns,
+                        filter_columns_type,
+                        filter_columns_precision,
+                        filter_columns_size,
+                        search_bar,
+                    ],
+                    leaderboard_table,
+                    queue=True,
+                )
+
+        # with gr.TabItem("About", elem_id="llm-benchmark-tab-table", id=2):
+        #     gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
+
+        #     dataset_table = gr.components.Dataframe(
+        #         value=dataset_df,
+        #         headers=list(dataset_df.columns),
+        #         datatype=["str", "markdown", "str", "str", "str"],
+        #         elem_id="dataset-table",
+        #         interactive=False,
+        #         visible=True,
+        #         column_widths=["15%", "20%"],
+        #     )
+
+        #     gr.Markdown(LLM_BENCHMARKS_DETAILS, elem_classes="markdown-text")
+        #     gr.Markdown(FAQ_TEXT, elem_classes="markdown-text")
+
+        with gr.TabItem("Submit a model ", elem_id="llm-benchmark-tab-table", id=3):
+            with gr.Column():
+                with gr.Row():
+                    gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
+
+                with gr.Column():
+                    with gr.Accordion(f"✅ Finished Evaluations ({len(finished_eval_queue_df)})", open=False):
+                        with gr.Row():
+                            finished_eval_table = gr.components.Dataframe(
+                                value=finished_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
+                            )
+
+                    with gr.Accordion(f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})", open=False):
+                        with gr.Row():
+                            running_eval_table = gr.components.Dataframe(
+                                value=running_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
+                            )
+
+                    with gr.Accordion(f"⏳ Scheduled Evaluation Queue ({len(pending_eval_queue_df)})", open=False):
+                        with gr.Row():
+                            pending_eval_table = gr.components.Dataframe(
+                                value=pending_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
+                            )
+
+            with gr.Row():
+                gr.Markdown("# Submit your model here", elem_classes="markdown-text")
+
+            with gr.Row():
+                inference_framework = gr.Dropdown(
+                    choices=[t.to_str() for t in InferenceFramework],
+                    label="Inference framework",
+                    multiselect=False,
+                    value=None,
+                    interactive=True,
+                )
+
+                gpu_type = gr.Dropdown(
+                    choices=[t.to_str() for t in GPUType],
+                    label="GPU type",
+                    multiselect=False,
+                    value="NVIDIA-A100-PCIe-80GB",
+                    interactive=True,
+                )
+
+
+            with gr.Row():
+                with gr.Column():
+                    model_name_textbox = gr.Textbox(label="Model name")
+                    revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
+                    private = gr.Checkbox(False, label="Private", visible=not IS_PUBLIC)
+                    model_type = gr.Dropdown(
+                        choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
+                        label="Model type",
+                        multiselect=False,
+                        value=None,
+                        interactive=True,
+                    )
+
+                with gr.Column():
+                    precision = gr.Dropdown(
+                        choices=[i.value.name for i in Precision if i != Precision.Unknown],
+                        label="Precision",
+                        multiselect=False,
+                        value="float32",
+                        interactive=True,
+                    )
+
+                    weight_type = gr.Dropdown(
+                        choices=[i.value.name for i in WeightType],
+                        label="Weights type",
+                        multiselect=False,
+                        value="Original",
+                        interactive=True,
+                    )
+
+                    base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
+
+            submit_button = gr.Button("Submit Eval")
+            submission_result = gr.Markdown()
+            debug = gr.Checkbox(value=args.debug, label="Debug", visible=False)
+            submit_button.click(
+                add_new_eval,
+                [
+                    model_name_textbox,
+                    base_model_name_textbox,
+                    revision_name_textbox,
+                    precision,
+                    private,
+                    weight_type,
+                    model_type,
+                    inference_framework,
+                    debug,
+                    gpu_type
+                ],
+                submission_result,
+            )
+
+    with gr.Row():
+        with gr.Accordion("Citing this leaderboard", open=False):
+            citation_button = gr.Textbox(
+                value=CITATION_BUTTON_TEXT,
+                label=CITATION_BUTTON_LABEL,
+                lines=20,
+                elem_id="citation-button",
+                show_copy_button=True,
+            )
+
+scheduler = BackgroundScheduler()
+
+scheduler.add_job(restart_space, "interval", hours=6)
+
+def launch_backend():
+    import subprocess
+    from src.backend.envs import DEVICE
+
+    if DEVICE not in {"cpu"}:
+        _ = subprocess.run(["python", "backend-cli.py"])
+
+# Thread(target=periodic_init, daemon=True).start()
+# scheduler.add_job(launch_backend, "interval", seconds=120)
 if __name__ == "__main__":
+    scheduler.start()
+    demo.queue(default_concurrency_limit=40).launch()
```
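The new `search_table`/`filter_queries` pair in app.py above implements the `;`-separated model search advertised by the search-bar placeholder. Below is a toy, self-contained demonstration of that behavior; it substitutes a plain `model` column for `AutoEvalColumn.dummy.name` and is only a sketch of the intended semantics, not the Space's code.

```python
# Toy demonstration of the `;`-separated search implemented in app.py above.
# A plain "model" column stands in for AutoEvalColumn.dummy.name.
import pandas as pd

df = pd.DataFrame({"model": ["mistralai/Mixtral-8x7B", "Qwen/Qwen1.5-MoE", "google/switch-base"]})

def search_table(df: pd.DataFrame, query: str) -> pd.DataFrame:
    # Case-insensitive substring match, as in app.py's search_table.
    return df[df["model"].str.contains(query, case=False)]

def filter_queries(query: str, filtered_df: pd.DataFrame) -> pd.DataFrame:
    final_df = []
    for _q in (q.strip() for q in query.split(";")):
        if _q:
            hits = search_table(filtered_df, _q)
            if len(hits) > 0:
                final_df.append(hits)
    if final_df:
        # Concatenating per-query hits can duplicate rows; deduplicate as app.py does.
        filtered_df = pd.concat(final_df).drop_duplicates(subset=["model"])
    return filtered_df

print(filter_queries("mixtral; qwen", df)["model"].tolist())
# -> ['mistralai/Mixtral-8x7B', 'Qwen/Qwen1.5-MoE']
```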
backend-cli.py
CHANGED
```diff
@@ -458,7 +458,6 @@ def get_args():
     parser.add_argument("--gpu-type", type=str, default="NVIDIA-A100-PCIe-80GB",
                         help="GPU type. NVIDIA-A100-PCIe-80GB; NVIDIA-RTX-A5000-24GB; NVIDIA-H100-PCIe-80GB")
     parser.add_argument("--debug_repo", action="store_true", help="Use debug repo")
-    parser.add_argument("--model_type", type=str, default="chat", help="Model type")
     return parser.parse_args()
 
 
@@ -489,8 +488,7 @@ if __name__ == "__main__":
         json_filepath="",
         precision=precision,  # Use precision from arguments
         inference_framework=args.inference_framework,  # Use inference framework from arguments
-        gpu_type=args.gpu_type
-        model_type=args.model_type,
+        gpu_type=args.gpu_type
     )
     curr_gpu_type = get_gpu_details()
     if eval_request.gpu_type != curr_gpu_type:
```
moe-cap-results
DELETED
File without changes
requirements.txt
CHANGED
```diff
@@ -1,6 +1,36 @@
+torch
+colorama
+APScheduler
+black
+click
 datasets
+gradio==4.26.0
+gradio_client
+huggingface-hub
+matplotlib
+numpy
+pandas
+plotly
+python-dateutil
+requests
+semantic-version
+tqdm
+wandb
+transformers
+tokenizers>=0.15.0
+lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@v0.4.2
+accelerate
+sentencepiece
+langdetect
+sacrebleu
+cchardet
+rouge_score
+bert-score
+evaluate
+spacy==3.7.4
+selfcheckgpt
+immutabledict
+gputil
+bitsandbytes
+openai
+scikit-learn
```
src/backend/run_eval_suite.py
CHANGED
|
@@ -17,22 +17,16 @@ def process_results_decorator(func):
|
|
| 17 |
end_to_end_time = sum([r[1] for r in results]) / len(results)
|
| 18 |
prefilling_time = sum([r[2] for r in results]) / len(results)
|
| 19 |
decoding_throughput = sum([r[3] for r in results]) / len(results)
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
prefill_throughput = sum([r[6] for r in results]) / len(results)
|
| 23 |
-
prefill_mfu = sum([r[7] for r in results]) / len(results)
|
| 24 |
-
prefill_mbu = sum([r[8] for r in results]) / len(results)
|
| 25 |
# print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
|
| 26 |
|
| 27 |
result_dict = func(self, doc, processed_results, *args, **kwargs)
|
| 28 |
result_dict["end_to_end_time"] = end_to_end_time
|
| 29 |
result_dict["prefilling_time"] = prefilling_time
|
| 30 |
result_dict["decoding_throughput"] = decoding_throughput
|
| 31 |
-
result_dict["
|
| 32 |
-
result_dict["
|
| 33 |
-
result_dict["prefill_throughput"] = prefill_throughput
|
| 34 |
-
result_dict["prefill_mfu"] = prefill_mfu
|
| 35 |
-
result_dict["prefill_mbu"] = prefill_mbu
|
| 36 |
return result_dict
|
| 37 |
return wrapper
|
| 38 |
ConfigurableTask.process_results = process_results_decorator(orig_process_results)
|
|
@@ -43,11 +37,8 @@ def aggregation_decorator(func):
|
|
| 43 |
aggregation_list["end_to_end_time"] = mean
|
| 44 |
aggregation_list["prefilling_time"] = mean
|
| 45 |
aggregation_list["decoding_throughput"] = mean
|
| 46 |
-
aggregation_list["
|
| 47 |
-
aggregation_list["
|
| 48 |
-
aggregation_list["prefill_throughput"] = mean
|
| 49 |
-
aggregation_list["prefill_mfu"] = mean
|
| 50 |
-
aggregation_list["prefill_mbu"] = mean
|
| 51 |
return aggregation_list
|
| 52 |
return wrapper
|
| 53 |
ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
|
|
@@ -58,11 +49,8 @@ def higher_is_better_decorator(func):
|
|
| 58 |
higher_is_better_dict["end_to_end_time"] = False
|
| 59 |
higher_is_better_dict["prefilling_time"] = False
|
| 60 |
higher_is_better_dict["decoding_throughput"] = True
|
| 61 |
-
higher_is_better_dict["
|
| 62 |
-
higher_is_better_dict["
|
| 63 |
-
higher_is_better_dict["prefill_throughput"] = True
|
| 64 |
-
higher_is_better_dict["prefill_mfu"] = True
|
| 65 |
-
higher_is_better_dict["prefill_mbu"] = True
|
| 66 |
return higher_is_better_dict
|
| 67 |
return wrapper
|
| 68 |
ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
|
|
@@ -77,8 +65,6 @@ from src.backend.tasks.selfcheckgpt.task import SelfCheckGPT
|
|
| 77 |
|
| 78 |
from src.backend.huggingface_generate_until import HFLMwithChatTemplate
|
| 79 |
from src.backend.moe_infinity import MoEHFLM
|
| 80 |
-
from src.backend.vllm import VLLM_MOE
|
| 81 |
-
from src.backend.sglang import SGLangMoE
|
| 82 |
|
| 83 |
def run_evaluation(
|
| 84 |
eval_request: EvalRequest,
|
|
|
|
| 17 |
end_to_end_time = sum([r[1] for r in results]) / len(results)
|
| 18 |
prefilling_time = sum([r[2] for r in results]) / len(results)
|
| 19 |
decoding_throughput = sum([r[3] for r in results]) / len(results)
|
| 20 |
+
mfu = sum([r[4] for r in results]) / len(results)
|
| 21 |
+
mbu = sum([r[5] for r in results]) / len(results)
|
|
|
|
|
|
|
|
|
|
| 22 |
# print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
|
| 23 |
|
| 24 |
result_dict = func(self, doc, processed_results, *args, **kwargs)
|
| 25 |
result_dict["end_to_end_time"] = end_to_end_time
|
| 26 |
result_dict["prefilling_time"] = prefilling_time
|
| 27 |
result_dict["decoding_throughput"] = decoding_throughput
|
| 28 |
+
result_dict["mfu"] = mfu
|
| 29 |
+
result_dict["mbu"] = mbu
|
|
|
|
|
|
|
|
|
|
| 30 |
return result_dict
|
| 31 |
return wrapper
|
| 32 |
ConfigurableTask.process_results = process_results_decorator(orig_process_results)
|
|
|
|
| 37 |
aggregation_list["end_to_end_time"] = mean
|
| 38 |
aggregation_list["prefilling_time"] = mean
|
| 39 |
aggregation_list["decoding_throughput"] = mean
|
| 40 |
+
aggregation_list["mfu"] = mean
|
| 41 |
+
aggregation_list["mbu"] = mean
|
|
|
|
|
|
|
|
|
|
| 42 |
return aggregation_list
|
| 43 |
return wrapper
|
| 44 |
ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
|
|
|
|
| 49 |
higher_is_better_dict["end_to_end_time"] = False
|
| 50 |
higher_is_better_dict["prefilling_time"] = False
|
| 51 |
higher_is_better_dict["decoding_throughput"] = True
|
| 52 |
+
higher_is_better_dict["mfu"] = True
|
| 53 |
+
higher_is_better_dict["mbu"] = True
|
|
|
|
|
|
|
|
|
|
| 54 |
return higher_is_better_dict
|
| 55 |
return wrapper
|
| 56 |
ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
|
|
|
|
| 65 |
|
| 66 |
from src.backend.huggingface_generate_until import HFLMwithChatTemplate
|
| 67 |
from src.backend.moe_infinity import MoEHFLM
|
|
|
|
|
|
|
| 68 |
|
| 69 |
def run_evaluation(
|
| 70 |
eval_request: EvalRequest,
|
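For context on the hunks above: each entry of `results` is one measured request, and the decorators collapse those per-request tuples into run-level means before attaching them to the task's metric dict. Below is a minimal, self-contained sketch of that reduction; the tuple layout (score, end_to_end_time, prefilling_time, decoding_throughput, mfu, mbu) is an assumption inferred from the indices r[1]..r[5] used in the diff, not a documented interface.

# Hedged sketch of the averaging performed inside process_results_decorator.
def average_system_metrics(results):
    n = len(results)
    return {
        "end_to_end_time": sum(r[1] for r in results) / n,
        "prefilling_time": sum(r[2] for r in results) / n,
        "decoding_throughput": sum(r[3] for r in results) / n,
        "mfu": sum(r[4] for r in results) / n,
        "mbu": sum(r[5] for r in results) / n,
    }

# Two fabricated requests -> one averaged metrics dict.
print(average_system_metrics([
    (1.0, 2.0, 0.5, 40.0, 0.31, 0.62),
    (1.0, 3.0, 0.7, 36.0, 0.29, 0.58),
]))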
src/backend/tasks/arena_hard/task.py
CHANGED

@@ -72,7 +72,7 @@ class ArenaHard(ConfigurableTask):
         super().__init__(config={"metadata": {"version": self.VERSION}})
         # these end tokens are hard coded because of the current limitation of the llm-eval.
         # self.generation_kwargs = {"until": ["\n\n", "<unk>", "<|im_end|>", "</s>", "<|endoftext|>"], "max_length": 512}
-        self.generation_kwargs = {"until": ["</s>", "<|im_end|>"], "…
+        self.generation_kwargs = {"until": ["</s>", "<|im_end|>"], "max_length": 4096}
         # self.generation_kwargs_sampling_number = 5  # the number of samples for self-consistency
         # self.generation_kwargs_sampling = {
         #     "temperature": 0.99,
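The hunk above raises the generation budget to 4096 tokens while keeping the two hard-coded stop strings. A hedged sketch of how such kwargs are typically consumed by a generate-until task; the helper below is illustrative, not the llm-eval implementation:

# Truncate a completion at the first occurrence of any stop string.
def truncate_at_stops(text, generation_kwargs):
    for stop in generation_kwargs.get("until", []):
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text

kwargs = {"until": ["</s>", "<|im_end|>"], "max_length": 4096}
print(truncate_at_stops("The answer is 42.</s>extra tokens", kwargs))  # -> "The answer is 42."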
src/backend/tasks/measurement_task_utils.py
CHANGED

@@ -12,12 +12,8 @@ def process_results_decorator(func):
         end_to_end_time = sum([r[1] for r in results]) / len(results)
         prefilling_time = sum([r[2] for r in results]) / len(results)
         decoding_throughput = sum([r[3] for r in results]) / len(results)
-        decoding_mbu = sum([r[4] for r in results]) / len(results)
-        decoding_mfu = sum([r[5] for r in results]) / len(results)
-        prefill_throughput = sum([r[6] for r in results]) / len(results)
-        prefill_mfu = sum([r[7] for r in results]) / len(results)
-        prefill_mbu = sum([r[8] for r in results]) / len(results)
+        mfu = sum([r[4] for r in results]) / len(results)
+        mbu = sum([r[5] for r in results]) / len(results)
 
         # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
 
@@ -26,11 +22,8 @@ def process_results_decorator(func):
         result_dict["end_to_end_time"] = end_to_end_time
         result_dict["prefilling_time"] = prefilling_time
         result_dict["decoding_throughput"] = decoding_throughput
-        result_dict["decoding_mbu"] = decoding_mbu
-        result_dict["decoding_mfu"] = decoding_mfu
-        result_dict["prefill_throughput"] = prefill_throughput
-        result_dict["prefill_mfu"] = prefill_mfu
-        result_dict["prefill_mbu"] = prefill_mbu
+        result_dict["mfu"] = mfu
+        result_dict["mbu"] = mbu
         return result_dict
     return wrapper
 
@@ -42,11 +35,8 @@ def aggregation_decorator(func):
         aggregation_list["end_to_end_time"] = mean
         aggregation_list["prefilling_time"] = mean
         aggregation_list["decoding_throughput"] = mean
-        aggregation_list["decoding_mbu"] = mean
-        aggregation_list["decoding_mfu"] = mean
-        aggregation_list["prefill_throughput"] = mean
-        aggregation_list["prefill_mfu"] = mean
-        aggregation_list["prefill_mbu"] = mean
+        aggregation_list["mfu"] = mean
+        aggregation_list["mbu"] = mean
         return aggregation_list
     return wrapper
 
@@ -58,11 +48,8 @@ def higher_is_better_decorator(func):
         higher_is_better_dict["end_to_end_time"] = False
         higher_is_better_dict["prefilling_time"] = False
         higher_is_better_dict["decoding_throughput"] = True
-        higher_is_better_dict["decoding_mbu"] = True
-        higher_is_better_dict["decoding_mfu"] = True
-        higher_is_better_dict["prefill_throughput"] = True
-        higher_is_better_dict["prefill_mfu"] = True
-        higher_is_better_dict["prefill_mbu"] = True
+        higher_is_better_dict["mfu"] = True
+        higher_is_better_dict["mbu"] = True
         return higher_is_better_dict
    return wrapper
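These utilities attach system metrics by monkey-patching methods on the task class rather than subclassing it. A self-contained sketch of that pattern, using a stand-in `ConfigurableTask` (not the lm-eval original) and a single illustrative timing metric:

import time

class ConfigurableTask:
    def process_results(self, doc, results):
        return {"em": 1.0}

def process_results_decorator(func):
    # Wrap the original method so every call also reports a timing metric.
    def wrapper(self, doc, results, *args, **kwargs):
        start = time.perf_counter()
        result_dict = func(self, doc, results, *args, **kwargs)
        result_dict["wall_time"] = time.perf_counter() - start
        return result_dict
    return wrapper

# Rebind the class attribute, exactly as the module above does.
ConfigurableTask.process_results = process_results_decorator(ConfigurableTask.process_results)
print(ConfigurableTask().process_results(None, []))  # {'em': 1.0, 'wall_time': ...}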
src/display/about.py
CHANGED

@@ -18,13 +18,10 @@ Columns and Metrics:
 - Method: The MoE LLM inference framework.
 - E2E(s): Average end-to-end generation time in seconds.
 - PRE(s): Prefilling time of the input prompt in seconds.
-- …
-- …
-- …
-- …
-- Prefill S-MBU(%): Sparse Model Bandwidth Utilization for prefilling.
-- Prefill S-MFU(%): Sparse Model FLOPs Utilization for prefilling.
+- T/s: Token throughput per second.
+- MBU(%): Model Bandwidth Utilization.
+- MFU(%): Model FLOPs Utilization.
 - Precision: The precision of the used model.
 
 """
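The two utilization columns reduce to simple ratios against hardware peaks. A worked example in Python; the achieved numbers are fabricated, the 624e12 FLOP/s peak matches the float16 A100 entry in PEAK_FLOPS_DICT later in this diff, and the 2 TB/s bandwidth is an assumption (the MEM_BW_DICT values are truncated here):

def mfu(achieved_flops_per_s, peak_flops_per_s):
    # Model FLOPs Utilization: achieved compute over peak compute, in percent.
    return 100 * achieved_flops_per_s / peak_flops_per_s

def mbu(achieved_bytes_per_s, peak_bytes_per_s):
    # Model Bandwidth Utilization: achieved memory traffic over peak bandwidth.
    return 100 * achieved_bytes_per_s / peak_bytes_per_s

print(f"MFU = {mfu(180e12, 624e12):.1f}%")  # ~28.8%
print(f"MBU = {mbu(1.2e12, 2.0e12):.1f}%")  # 60.0%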
src/display/utils.py
CHANGED

@@ -9,32 +9,25 @@ def fields(raw_class):
 
 E2Es = "E2E(s)"  #"End-to-end time (s)"
 PREs = "PRE(s)"  #"Prefilling time (s)"
-TS = "…
-PTS = "Prefill T/s"  #Prefill throughput (tok/s)
+TS = "T/s"  #Decoding throughput (tok/s)
 InFrame = "Method"  #"Inference framework"
 MULTIPLE_CHOICEs = ["mmlu"]
 
-
 GPU_TEMP = 'Temp(C)'
 GPU_Power = 'Power(W)'
 GPU_Mem = 'Mem(G)'
 GPU_Name = "GPU"
 GPU_Util = 'Util(%)'
-DSMFU = 'Decoding S-MFU(%)'
-DSMBU = 'Decoding S-MBU(%)'
-PSMFU = 'Prefill S-MFU(%)'
-PSMBU = 'Prefill S-MBU(%)'
+MFU = 'MFU(%)'
+MBU = 'MBU(%)'
 BATCH_SIZE = 'bs'
 PRECISION = "Precision"
 system_metrics_to_name_map = {
     "end_to_end_time": f"{E2Es}",
     "prefilling_time": f"{PREs}",
     "decoding_throughput": f"{TS}",
-    "decoding_mbu": f"{DSMBU}",
-    "decoding_mfu": f"{DSMFU}",
-    "prefill_throughput": f"{PTS}",
-    "prefill_mfu": f"{PSMFU}",
-    "prefill_mbu": f"{PSMBU}",
+    "mfu": f"{MFU}",
+    "mbu": f"{MBU}"
 }
 
 gpu_metrics_to_name_map = {

@@ -85,11 +78,10 @@ class Tasks(Enum):
 
     # # XXX include me back at some point
     # selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
-
+    mmlu = Task("mmlu", "acc", "MMLU")  #MMLU/Acc (5-shot)
     gsm8k = Task("gsm8k_custom", "em", "GSM8K")  #GSM8K/EM (5-shot)
     # gsm8k_cot = Task("gsm8k_cot", "em", "GSM8K COT")  #GSM8K COT/EM (5-shot)
     arena_hard = Task("arena_hard", "score", "Arena Hard")  #Arena Hard/Score
-    mmlu = Task("mmlu", "acc", "MMLU")  #MMLU/Acc (5-shot)
 
 
 # These classes are for user facing column names,

@@ -114,7 +106,7 @@ auto_eval_column_dict.append(["model", ColumnContent, ColumnContent("Model", "ma…
 # # auto_eval_column_dict.append(["average", ColumnContent, ColumnContent("Avg", "number", True)])
 
 # Inference framework
-auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True…
+auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True)])
 
 for task in Tasks:
     auto_eval_column_dict.append([task.name, ColumnContent, ColumnContent(task.value.col_name, "number", True)])

@@ -125,30 +117,24 @@ for task in Tasks:
     # auto_eval_column_dict.append([f"{task.name}_gpu_mem", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Mem}", "number", True, hidden=True)])
     auto_eval_column_dict.append([f"{task.name}_gpu", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Name}", "str", True, hidden=True)])
     # auto_eval_column_dict.append([f"{task.name}_gpu_util", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Util}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} {PREs}", "number", False, hidden=True)])
     if task.value.benchmark in MULTIPLE_CHOICEs:
         continue
+    # auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} {PREs}", "number", False, hidden=True)])
     auto_eval_column_dict.append([f"{task.name}_decoding_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} {TS}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_decoding_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {DSMBU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_decoding_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {DSMFU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} {PTS}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {PSMBU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {PSMFU}", "number", True, hidden=True)])
+    auto_eval_column_dict.append([f"{task.name}_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {MBU}", "number", True, hidden=True)])
+    auto_eval_column_dict.append([f"{task.name}_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {MFU}", "number", True, hidden=True)])
 
 
 # Model information
-auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False…
-…
-…
-auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True…
-…
-…
-…
-…
-…
+auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False)])
+auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
+auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
+auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True)])
+auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
+auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
+auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
+auto_eval_column_dict.append(["still_on_hub", ColumnContent, ColumnContent("Available on the hub", "bool", False)])
+auto_eval_column_dict.append(["revision", ColumnContent, ColumnContent("Model sha", "str", False, False)])
 # Dummy column for the search bar (hidden by the custom CSS)
 auto_eval_column_dict.append(["dummy", ColumnContent, ColumnContent("model_name_for_query", "str", False, dummy=True)])

@@ -174,10 +160,10 @@ class ModelDetails:
 
 
 class ModelType(Enum):
-    …
-    …
+    PT = ModelDetails(name="pretrained", symbol="🟢")
+    FT = ModelDetails(name="fine-tuned on domain-specific datasets", symbol="🔶")
     chat = ModelDetails(name="chat models (RLHF, DPO, IFT, ...)", symbol="💬")
-    …
+    merges = ModelDetails(name="base merges and moerges", symbol="🤝")
     Unknown = ModelDetails(name="", symbol="?")
 
     def to_str(self, separator=" "):

@@ -185,25 +171,22 @@ class ModelType(Enum):
 
     @staticmethod
     def from_str(type):
-        …
-        …
-        …
-        …
+        if "fine-tuned" in type or "🔶" in type:
+            return ModelType.FT
+        if "pretrained" in type or "🟢" in type:
+            return ModelType.PT
         if any([k in type for k in ["instruction-tuned", "RL-tuned", "chat", "🟦", "⭕", "💬"]]):
             return ModelType.chat
-        …
-        …
+        if "merge" in type or "🤝" in type:
+            return ModelType.merges
         return ModelType.Unknown
 
 
 class InferenceFramework(Enum):
     # "moe-infinity", hf-chat
-    …
+    MoE_Infinity = ModelDetails("moe-infinity")
     HF_Chat = ModelDetails("hf-chat")
     VLLM = ModelDetails("vllm_moe")
-    VLLM_FIX = ModelDetails("vllm_moe_fixbs")
-    TRTLLM = ModelDetails("tensorrt_llm")
-    SGLANG = ModelDetails("sglang")
     Unknown = ModelDetails("?")
 
     def to_str(self):

@@ -211,23 +194,16 @@ class InferenceFramework(Enum):
 
     @staticmethod
     def from_str(inference_framework: str):
-        …
-        …
-        if inference_framework in ["tensorrt_llm"]:
-            return InferenceFramework.TRTLLM
+        if inference_framework in ["moe-infinity"]:
+            return InferenceFramework.MoE_Infinity
         if inference_framework in ["hf-chat"]:
             return InferenceFramework.HF_Chat
         if inference_framework in ["vllm_moe"]:
             return InferenceFramework.VLLM
-        if inference_framework in ["vllm_moe_fixbs"]:
-            return InferenceFramework.VLLM_FIX
-        if inference_framework in ["sglang"]:
-            return InferenceFramework.SGLANG
         return InferenceFramework.Unknown
 
 class GPUType(Enum):
     A100_sxm = ModelDetails("NVIDIA-A100-SXM4-80GB")
-    A100_sxm4 = ModelDetails("NVIDIA-A100-SMX4-80GB")
     A100_pcie = ModelDetails("NVIDIA-A100-PCIe-80GB")
     Unknown = ModelDetails("?")
 

@@ -249,28 +225,28 @@ class WeightType(Enum):
 
 
 class Precision(Enum):
-    …
-    …
+    float32 = ModelDetails("float32")
+    float16 = ModelDetails("float16")
     bfloat16 = ModelDetails("bfloat16")
     qt_8bit = ModelDetails("8bit")
     qt_4bit = ModelDetails("4bit")
-    …
+    qt_GPTQ = ModelDetails("GPTQ")
     Unknown = ModelDetails("?")
 
     @staticmethod
     def from_str(precision: str):
-        …
-        …
-        …
-        …
+        if precision in ["torch.float32", "float32"]:
+            return Precision.float32
+        if precision in ["torch.float16", "float16"]:
+            return Precision.float16
         if precision in ["torch.bfloat16", "bfloat16"]:
            return Precision.bfloat16
        if precision in ["8bit"]:
            return Precision.qt_8bit
        if precision in ["4bit"]:
            return Precision.qt_4bit
+        if precision in ["GPTQ", "None"]:
+            return Precision.qt_GPTQ
        return Precision.Unknown
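A design note on the parsers above: every `from_str` is total, falling through to an `Unknown` member instead of raising, so malformed request files cannot crash leaderboard population. A self-contained replica of the pattern (a stand-in class, not the project's):

from enum import Enum

class Precision(Enum):
    float32 = "float32"
    float16 = "float16"
    bfloat16 = "bfloat16"
    Unknown = "?"

    @staticmethod
    def from_str(precision: str) -> "Precision":
        # Accept both torch-qualified and bare names, as the diff above does.
        aliases = {
            "torch.float32": Precision.float32, "float32": Precision.float32,
            "torch.float16": Precision.float16, "float16": Precision.float16,
            "torch.bfloat16": Precision.bfloat16, "bfloat16": Precision.bfloat16,
        }
        return aliases.get(precision, Precision.Unknown)

print(Precision.from_str("torch.bfloat16"))  # Precision.bfloat16
print(Precision.from_str("fp8"))             # Precision.Unknown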
src/leaderboard/read_evals.py
CHANGED

@@ -140,7 +140,6 @@ class EvalResult:
             revision=config.get("model_sha", ""),
             still_on_hub=still_on_hub,
             architecture=architecture,
-            model_type=ModelType.from_str(config.get("model_type", "")),
             inference_framework=inference_framework,
         )

@@ -175,22 +174,22 @@ class EvalResult:
 
         # breakpoint()
         # average = sum([v for v in self.results.values() if v is not None]) / len(Tasks)
 
         data_dict = {
             "eval_name": self.eval_name,  # not a column, just a save name,
             AutoEvalColumn.precision.name: self.precision.value.name,
+            AutoEvalColumn.model_type.name: self.model_type.value.name,
             AutoEvalColumn.model_type_symbol.name: self.model_type.value.symbol,
-            …
-            …
+            AutoEvalColumn.weight_type.name: self.weight_type.value.name,
+            AutoEvalColumn.architecture.name: self.architecture,
             AutoEvalColumn.model.name: make_clickable_model(self.full_model),
             AutoEvalColumn.dummy.name: self.full_model,
-            …
-            # …
-            …
-            …
-            …
-            …
+            AutoEvalColumn.revision.name: self.revision,
+            # AutoEvalColumn.average.name: average,
+            AutoEvalColumn.license.name: self.license,
+            AutoEvalColumn.likes.name: self.likes,
+            AutoEvalColumn.params.name: self.num_params,
+            AutoEvalColumn.still_on_hub.name: self.still_on_hub,
             AutoEvalColumn.inference_framework.name: self.inference_framework,
         }
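For orientation, this is the approximate shape of the leaderboard row that `to_dict()` now emits, with the dict keys resolved through the AutoEvalColumn names added in src/display/utils.py above; all values here are fabricated for illustration:

example_row = {
    "eval_name": "mistralai_Mixtral-8x7B-Instruct-v0.1_bfloat16_vllm_moe",
    "Precision": "bfloat16",
    "Type": "chat models (RLHF, DPO, IFT, ...)",
    "Model sha": "abc123",
    "Hub License": "apache-2.0",
    "#Params (B)": 46.7,
    "Available on the hub": True,
    "Method": "vllm_moe",
}
print(sorted(example_row))  # the column names as they appear in the table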
src/populate.py
CHANGED

@@ -75,7 +75,7 @@ def get_leaderboard_df(
         df[col] = np.nan
 
     if not df.empty:
-        df = df.…
+        df = df.round(decimals=2)
 
     # filter out if any of the benchmarks have not been produced
     # df = df[has_no_nan_values(df, benchmark_cols)]
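A quick demonstration of the rounding applied above; this assumes a pandas DataFrame of leaderboard metrics (pandas is already in use in populate.py):

import pandas as pd

df = pd.DataFrame({"E2E(s)": [12.3456, 7.8912], "MFU(%)": [28.8461, 31.0075]})
print(df.round(decimals=2))  # every float column rounded to 2 decimals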
src/utils.py
CHANGED

@@ -4,8 +4,6 @@ import subprocess
 import re
 import os
 import GPUtil
-from transformers import AutoConfig
-from typing import List
 
 try:
     from src.display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name

@@ -14,63 +12,44 @@ except:
     from display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
 
 MEM_BW_DICT = {
-    "NVIDIA-A100-PCIe-80GB": …
-    "NVIDIA-A100-…
-    "NVIDIA-H100-PCIe-80GB": …
-    "NVIDIA-RTX-A5000-24GB": …
-    "NVIDIA-RTX-A6000-48GB": 768e9,
 }
 
 PEAK_FLOPS_DICT = {
     "float32": {
         "NVIDIA-A100-PCIe-80GB": 312e12,
-        "NVIDIA-A100-…
         "NVIDIA-H100-PCIe-80GB": 756e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12…
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
     "float16": {
         "NVIDIA-A100-PCIe-80GB": 624e12,
-        "NVIDIA-A100-…
         "NVIDIA-H100-PCIe-80GB": 1513e12,
-        "NVIDIA-RTX-A5000-24GB": …
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
     "bfloat16": {
         "NVIDIA-A100-PCIe-80GB": 624e12,
-        "NVIDIA-A100-…
         "NVIDIA-H100-PCIe-80GB": 1513e12,
-        "NVIDIA-RTX-A5000-24GB": …
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
-    "…
         "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-…
         "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": …
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
-    "…
-        "NVIDIA-A100-PCIe-80GB": …
-        "NVIDIA-A100-…
-        "NVIDIA-H100-PCIe-80GB": …
-        "NVIDIA-RTX-A5000-24GB": …
-        "NVIDIA-RTX-A6000-48GB": 0
-    },
-    "fp4": {
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 0,
-        "NVIDIA-RTX-A6000-48GB": 0
-    },
-    "int4": {
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     }
 }
 
 def my_snapshot_download(repo_id, revision, local_dir, repo_type, max_workers):
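These tables map a GPU name (and, for FLOPs, a dtype) to hardware peaks in bytes/s and FLOP/s; the `get_peak_bw`/`get_peak_flops` helpers later in this file are plain dict lookups over them. A hedged usage sketch; the A6000 bandwidth is the only MEM_BW_DICT value left intact in this diff, so the example sticks to it:

MEM_BW_DICT = {"NVIDIA-RTX-A6000-48GB": 768e9}                        # bytes/s
PEAK_FLOPS_DICT = {"float16": {"NVIDIA-RTX-A6000-48GB": 309.7e12}}    # FLOP/s

def get_peak_bw(gpu_name):
    return MEM_BW_DICT[gpu_name]

def get_peak_flops(gpu_name, precision):
    return PEAK_FLOPS_DICT[precision][gpu_name]

print(get_peak_bw("NVIDIA-RTX-A6000-48GB") / 1e9, "GB/s")                 # 768.0 GB/s
print(get_peak_flops("NVIDIA-RTX-A6000-48GB", "float16") / 1e12, "TFLOP/s")  # 309.7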
@@ -118,7 +97,7 @@ def parse_nvidia_smi():
     # print(f"gpu_indices: {gpu_indices}")
     gpu_stats = []
 
-    gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W…
     # gpu_name_pattern = re.compile(r'NVIDIA\s+([\w\s]+\d+(?:\s*GB)?)')
     gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')
 
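A usage sketch for the GPU-name regex kept above; the sample line mimics a typical nvidia-smi banner row and is not captured output from this Space:

import re

gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')
line = "|   0  NVIDIA A100-PCIE-80GB     Off  | 00000000:3B:00.0 Off |"
m = gpu_name_pattern.search(line)
if m:
    print(m.group(2))  # -> A100 (the optional group(1) would capture "RTX ")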
@@ -216,790 +195,17 @@ def get_peak_bw(gpu_name):
|
|
| 216 |
def get_peak_flops(gpu_name, precision):
|
| 217 |
return PEAK_FLOPS_DICT[precision][gpu_name]
|
| 218 |
|
| 219 |
-
def
|
| 220 |
-
|
| 221 |
-
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
}
|
| 229 |
-
kvs = []
|
| 230 |
-
true_kvs = []
|
| 231 |
-
attn_score = []
|
| 232 |
-
|
| 233 |
-
# Calculate KV sizes
|
| 234 |
-
per_token_kv_size = 2 * n_layers * d_head * n_kv_heads # Default calculation
|
| 235 |
-
|
| 236 |
-
if "DeepSeek" in model_name:
|
| 237 |
-
if hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
|
| 238 |
-
per_token_kv_size = n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
|
| 239 |
-
|
| 240 |
-
# Process each output
|
| 241 |
-
for x in outputs:
|
| 242 |
-
output_len = len(x.outputs[0].token_ids)
|
| 243 |
-
context_prefill_size = len(x.prompt_token_ids)
|
| 244 |
-
|
| 245 |
-
# Calculate attention scores
|
| 246 |
-
if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim"):
|
| 247 |
-
q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
|
| 248 |
-
origin_per_token_k_state_size = n_layers * n_attn_heads * q_head_dim
|
| 249 |
-
origin_per_token_v_state_size = n_layers * n_attn_heads * hf_config.v_head_dim
|
| 250 |
-
attention_score = context_prefill_size * origin_per_token_k_state_size + (output_len - 1) * origin_per_token_k_state_size / 2
|
| 251 |
-
attention_score += context_prefill_size * origin_per_token_v_state_size + (output_len - 1) * origin_per_token_v_state_size / 2
|
| 252 |
-
attention_score = attention_score / 1e12
|
| 253 |
-
else:
|
| 254 |
-
origin_per_token_kv_states_size = n_layers * n_attn_heads * d_head
|
| 255 |
-
attention_score = context_prefill_size * origin_per_token_kv_states_size + (output_len - 1) * origin_per_token_kv_states_size / 2
|
| 256 |
-
attention_score = attention_score * 2 / 1e12
|
| 257 |
-
|
| 258 |
-
# Store attention scores and KV sizes
|
| 259 |
-
attn_score.append(attention_score)
|
| 260 |
-
kv_size = context_prefill_size * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2
|
| 261 |
-
kv_size = kv_size / 1e12
|
| 262 |
-
true_kv = (context_prefill_size * per_token_kv_size + output_len * per_token_kv_size) / 1e12 * 1e3
|
| 263 |
-
kvs.append(kv_size)
|
| 264 |
-
true_kvs.append(true_kv)
|
| 265 |
-
|
| 266 |
-
# Calculate aggregate values
|
| 267 |
-
kv_size = sum(kvs)
|
| 268 |
-
true_kv_size = sum(true_kvs) * 1e3
|
| 269 |
-
attention_score = sum(attn_score) / len(attn_score)
|
| 270 |
-
|
| 271 |
-
# Calculate attention size per token
|
| 272 |
-
if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim") and hasattr(hf_config, "kv_lora_rank"):
|
| 273 |
-
q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
|
| 274 |
-
if not hasattr(hf_config, "q_lora_rank") or not hf_config.q_lora_rank:
|
| 275 |
-
attention_size_per_token = (d_model * n_attn_heads * q_head_dim) + \
|
| 276 |
-
(d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
|
| 277 |
-
(hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
|
| 278 |
-
(hf_config.v_head_dim * n_attn_heads * d_model)
|
| 279 |
-
attention_size_per_token = attention_size_per_token / 1e12
|
| 280 |
-
else:
|
| 281 |
-
attention_size_per_token = (d_model * hf_config.q_lora_rank) + \
|
| 282 |
-
(hf_config.q_lora_rank * n_attn_heads * q_head_dim) + \
|
| 283 |
-
(d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
|
| 284 |
-
(hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
|
| 285 |
-
(hf_config.v_head_dim * n_attn_heads * d_model)
|
| 286 |
-
attention_size_per_token = attention_size_per_token / 1e12
|
| 287 |
-
else:
|
| 288 |
-
attention_size_per_token = d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) + n_attn_heads * d_head * d_model
|
| 289 |
-
attention_size_per_token = attention_size_per_token / 1e12
|
| 290 |
-
|
| 291 |
-
# Calculate expert sizes
|
| 292 |
-
expert_size = d_ff * 3 * d_model / 1e12
|
| 293 |
-
shared_experts_size_total = 0
|
| 294 |
-
deepseek_dense_ffn_size = 0
|
| 295 |
-
deepseek_sparse_layer_num = 0
|
| 296 |
-
|
| 297 |
-
if "Qwen" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "shared_expert_intermediate_size"):
|
| 298 |
-
d_ff = hf_config.moe_intermediate_size
|
| 299 |
-
d_ff_share = hf_config.shared_expert_intermediate_size
|
| 300 |
-
shared_experts_size = d_ff_share * 3 * d_model
|
| 301 |
-
expert_size = d_ff * 3 * d_model
|
| 302 |
-
shared_experts_size_total = shared_experts_size / 1e12
|
| 303 |
-
expert_size = expert_size / 1e12
|
| 304 |
-
elif "Qwen3" in model_name and hasattr(hf_config, "moe_intermediate_size"):
|
| 305 |
-
d_ff = hf_config.moe_intermediate_size
|
| 306 |
-
expert_size = d_ff * 3 * d_model
|
| 307 |
-
expert_size = expert_size / 1e12
|
| 308 |
-
elif "DeepSeek" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "intermediate_size") and hasattr(hf_config, "first_k_dense_replace"):
|
| 309 |
-
d_ff = hf_config.moe_intermediate_size
|
| 310 |
-
d_ff_dense = hf_config.intermediate_size
|
| 311 |
-
deepseek_num_dense_layer = hf_config.first_k_dense_replace
|
| 312 |
-
shared_experts_size = d_ff * 3 * d_model
|
| 313 |
-
expert_size = d_ff * 3 * d_model
|
| 314 |
-
shared_experts = 2
|
| 315 |
-
shared_experts_size_total = shared_experts_size * shared_experts / 1e12
|
| 316 |
-
expert_size = expert_size / 1e12
|
| 317 |
-
deepseek_sparse_layer_num = n_layers - deepseek_num_dense_layer
|
| 318 |
-
deepseek_dense_ffn_size = d_ff_dense * 3 * d_model / 1e12
|
| 319 |
-
|
| 320 |
-
# Calculate S-MBU and S-MFU
|
| 321 |
-
if "Qwen" in model_name and not "Qwen3" in model_name:
|
| 322 |
-
smbu = ((n_layers*(avg_activated_experts * expert_size + shared_experts_size_total + attention_size_per_token) +
|
| 323 |
-
kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 324 |
-
smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size + shared_experts_size_total) + attention_score) \
|
| 325 |
-
* 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 326 |
-
elif "Qwen3" in model_name:
|
| 327 |
-
smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
|
| 328 |
-
kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 329 |
-
smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
|
| 330 |
-
* 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 331 |
-
elif "DeepSeek" in model_name:
|
| 332 |
-
smbu = ((n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
|
| 333 |
-
(avg_activated_experts * expert_size + shared_experts_size_total) + \
|
| 334 |
-
deepseek_num_dense_layer * deepseek_dense_ffn_size + \
|
| 335 |
-
kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 336 |
-
smfu = (n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
|
| 337 |
-
(n_experts_per_tok * expert_size + shared_experts_size_total) + \
|
| 338 |
-
deepseek_num_dense_layer * deepseek_dense_ffn_size + attention_score) \
|
| 339 |
-
* 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 340 |
-
else:
|
| 341 |
-
smbu = ((n_layers*(avg_activated_experts * expert_size + attention_size_per_token) +
|
| 342 |
-
kv_size) * precision/ (batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 343 |
-
smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
|
| 344 |
-
* 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 345 |
-
|
| 346 |
-
return {
|
| 347 |
-
'smbu': smbu,
|
| 348 |
-
'smfu': smfu,
|
| 349 |
-
'kv_size': true_kv_size,
|
| 350 |
-
'decoding_throughput': decoding_tp
|
| 351 |
-
}
|
| 352 |
-
|
| 353 |
-
def _calculate_batch_metrics_sglang(outputs, decoding_tp, n_layers, d_model,
|
| 354 |
-
n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
|
| 355 |
-
avg_activated_experts, hf_config, num_gpus, model_name,
|
| 356 |
-
used_dtype, batch_size, precision, ttft=None, prefill_tp=None):
|
| 357 |
-
"""Calculate metrics for a batch of outputs"""
|
| 358 |
-
# Initialize hardware specs and output lists
|
| 359 |
-
hardware_specs = _get_hardware_specs(used_dtype)
|
| 360 |
-
output_data = _extract_output_data(outputs)
|
| 361 |
-
|
| 362 |
-
# Calculate model-specific sizes
|
| 363 |
-
per_token_kv_size = _calculate_kv_size(model_name, hf_config, n_layers, d_head, n_kv_heads)
|
| 364 |
-
attention_size_per_token = _calculate_attention_size(model_name, hf_config, d_model, n_attn_heads, d_head, n_kv_heads)
|
| 365 |
-
expert_config = _calculate_expert_config(model_name, hf_config, d_ff, d_model, n_layers)
|
| 366 |
-
|
| 367 |
-
# Process outputs and calculate metrics
|
| 368 |
-
metrics_data = _process_outputs(output_data, per_token_kv_size, attention_size_per_token,
|
| 369 |
-
model_name, hf_config, n_layers, n_attn_heads, d_head)
|
| 370 |
-
|
| 371 |
-
# Calculate throughput metrics
|
| 372 |
-
if ttft is None or prefill_tp is None:
|
| 373 |
-
ttft, prefill_tp = _calculate_throughput_metrics(batch_size, output_data['prefill_lengths'],
|
| 374 |
-
output_data['max_duration'])
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
# Calculate S-MBU and S-MFU
|
| 378 |
-
smbu_smfu_metrics = _calculate_smbu_smfu(model_name, n_layers, attention_size_per_token,
|
| 379 |
-
expert_config, avg_activated_experts, metrics_data,
|
| 380 |
-
hardware_specs, num_gpus, precision, ttft, prefill_tp,
|
| 381 |
-
batch_size, decoding_tp)
|
| 382 |
-
|
| 383 |
-
return {
|
| 384 |
-
'prefill_smbu': smbu_smfu_metrics['prefill_smbu'],
|
| 385 |
-
'prefill_smfu': smbu_smfu_metrics['prefill_smfu'],
|
| 386 |
-
'decoding_smbu': smbu_smfu_metrics['decoding_smbu'],
|
| 387 |
-
'decoding_smfu': smbu_smfu_metrics['decoding_smfu'],
|
| 388 |
-
'kv_size': metrics_data['true_kv_size'],
|
| 389 |
-
'decoding_throughput': decoding_tp,
|
| 390 |
-
'prefill_tp': prefill_tp,
|
| 391 |
-
'ttft': ttft
|
| 392 |
-
}
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
def _get_hardware_specs(used_dtype):
|
| 396 |
-
"""Get hardware specifications"""
|
| 397 |
-
gpu_type = get_gpu_details()
|
| 398 |
-
return {
|
| 399 |
-
"peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
|
| 400 |
-
"peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
|
| 401 |
-
}
|
| 402 |
-
|
| 403 |
-
|
| 404 |
-
def _extract_output_data(outputs):
|
| 405 |
-
"""Extract relevant data from outputs"""
|
| 406 |
-
prefill_lengths = []
|
| 407 |
-
output_lengths = []
|
| 408 |
-
max_duration = 0.0
|
| 409 |
-
|
| 410 |
-
for x in outputs:
|
| 411 |
-
output_lengths.append(x['meta_info']['completion_tokens'])
|
| 412 |
-
prefill_lengths.append(x['meta_info']['prompt_tokens'])
|
| 413 |
-
max_duration = max(max_duration, x['meta_info']['e2e_latency'])
|
| 414 |
-
|
| 415 |
-
return {
|
| 416 |
-
'prefill_lengths': prefill_lengths,
|
| 417 |
-
'output_lengths': output_lengths,
|
| 418 |
-
'max_duration': max_duration
|
| 419 |
-
}
|
| 420 |
-
|
| 421 |
-
|
| 422 |
-
def _calculate_kv_size(model_name, hf_config, n_layers, d_head, n_kv_heads):
|
| 423 |
-
"""Calculate per-token KV size based on model type"""
|
| 424 |
-
if "DeepSeek" in model_name and hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
|
| 425 |
-
return n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
|
| 426 |
-
return 2 * n_layers * d_head * n_kv_heads
|
| 427 |
-
|
| 428 |
-
|
| 429 |
-
def _calculate_attention_size(model_name, hf_config, d_model, n_attn_heads, d_head, n_kv_heads):
|
| 430 |
-
"""Calculate attention size per token based on model type"""
|
| 431 |
-
if ("DeepSeek" in model_name and
|
| 432 |
-
hasattr(hf_config, "qk_rope_head_dim") and
|
| 433 |
-
hasattr(hf_config, "qk_nope_head_dim") and
|
| 434 |
-
hasattr(hf_config, "v_head_dim") and
|
| 435 |
-
hasattr(hf_config, "kv_lora_rank")):
|
| 436 |
-
|
| 437 |
-
return _calculate_deepseek_attention_size(hf_config, d_model, n_attn_heads)
|
| 438 |
-
|
| 439 |
-
return (d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) +
|
| 440 |
-
n_attn_heads * d_head * d_model) / 1e12
|
| 441 |
-
|
| 442 |
-
|
| 443 |
-
def _calculate_deepseek_attention_size(hf_config, d_model, n_attn_heads):
|
| 444 |
-
"""Calculate DeepSeek-specific attention size"""
|
| 445 |
-
q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
|
| 446 |
-
|
| 447 |
-
base_size = ((d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) +
|
| 448 |
-
(hf_config.kv_lora_rank * n_attn_heads *
|
| 449 |
-
(q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) +
|
| 450 |
-
(hf_config.v_head_dim * n_attn_heads * d_model))
|
| 451 |
-
|
| 452 |
-
if hasattr(hf_config, "q_lora_rank") and hf_config.q_lora_rank:
|
| 453 |
-
q_size = (d_model * hf_config.q_lora_rank +
|
| 454 |
-
hf_config.q_lora_rank * n_attn_heads * q_head_dim)
|
| 455 |
-
else:
|
| 456 |
-
q_size = d_model * n_attn_heads * q_head_dim
|
| 457 |
-
|
| 458 |
-
return (base_size + q_size) / 1e12
|
| 459 |
-
|
| 460 |
-
|
| 461 |
-
def _calculate_expert_config(model_name, hf_config, d_ff, d_model, n_layers):
|
| 462 |
-
"""Calculate expert configuration based on model type"""
|
| 463 |
-
config = {
|
| 464 |
-
'expert_size': d_ff * 3 * d_model / 1e12,
|
| 465 |
-
'shared_experts_size_total': 0,
|
| 466 |
-
'deepseek_dense_ffn_size': 0,
|
| 467 |
-
'deepseek_sparse_layer_num': 0,
|
| 468 |
-
'deepseek_num_dense_layer': 0
|
| 469 |
-
}
|
| 470 |
-
|
| 471 |
-
if "Qwen" in model_name and not "Qwen3" in model_name:
|
| 472 |
-
config.update(_get_qwen_expert_config(hf_config, d_model))
|
| 473 |
-
elif "Qwen3" in model_name:
|
| 474 |
-
config.update(_get_qwen3_expert_config(hf_config, d_model))
|
| 475 |
-
elif "DeepSeek" in model_name:
|
| 476 |
-
config.update(_get_deepseek_expert_config(hf_config, d_model, n_layers))
|
| 477 |
-
|
| 478 |
-
return config
|
| 479 |
-
|
| 480 |
-
|
| 481 |
-
def _get_qwen_expert_config(hf_config, d_model):
|
| 482 |
-
"""Get Qwen-specific expert configuration"""
|
| 483 |
-
if (hasattr(hf_config, "moe_intermediate_size") and
|
| 484 |
-
hasattr(hf_config, "shared_expert_intermediate_size")):
|
| 485 |
-
|
| 486 |
-
return {
|
| 487 |
-
'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12,
|
| 488 |
-
'shared_experts_size_total': hf_config.shared_expert_intermediate_size * 3 * d_model / 1e12
|
| 489 |
-
}
|
| 490 |
-
return {}
|
| 491 |
-
|
| 492 |
-
|
| 493 |
-
def _get_qwen3_expert_config(hf_config, d_model):
|
| 494 |
-
"""Get Qwen3-specific expert configuration"""
|
| 495 |
-
if hasattr(hf_config, "moe_intermediate_size"):
|
| 496 |
-
return {
|
| 497 |
-
'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12
|
| 498 |
-
}
|
| 499 |
-
return {}
|
| 500 |
-
|
| 501 |
-
|
| 502 |
-
def _get_deepseek_expert_config(hf_config, d_model, n_layers):
|
| 503 |
-
"""Get DeepSeek-specific expert configuration"""
|
| 504 |
-
if (hasattr(hf_config, "moe_intermediate_size") and
|
| 505 |
-
hasattr(hf_config, "intermediate_size") and
|
| 506 |
-
hasattr(hf_config, "first_k_dense_replace")):
|
| 507 |
-
|
| 508 |
-
deepseek_num_dense_layer = hf_config.first_k_dense_replace
|
| 509 |
-
return {
|
| 510 |
-
'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12,
|
| 511 |
-
'shared_experts_size_total': hf_config.moe_intermediate_size * 3 * d_model * 2 / 1e12,
|
| 512 |
-
'deepseek_dense_ffn_size': hf_config.intermediate_size * 3 * d_model / 1e12,
|
| 513 |
-
'deepseek_sparse_layer_num': n_layers - deepseek_num_dense_layer,
|
| 514 |
-
'deepseek_num_dense_layer': deepseek_num_dense_layer
|
| 515 |
-
}
|
| 516 |
-
return {}
|
| 517 |
-
|
| 518 |
-
|
| 519 |
-
def _process_outputs(output_data, per_token_kv_size, attention_size_per_token,
|
| 520 |
-
model_name, hf_config, n_layers, n_attn_heads, d_head):
|
| 521 |
-
"""Process outputs to calculate KV sizes and attention scores"""
|
| 522 |
-
kvs = []
|
| 523 |
-
true_kvs = []
|
| 524 |
-
attn_scores = []
|
| 525 |
-
|
| 526 |
-
for prefill_len, output_len in zip(output_data['prefill_lengths'], output_data['output_lengths']):
|
| 527 |
-
# Calculate attention score
|
| 528 |
-
attn_score = _calculate_attention_score(model_name, hf_config, prefill_len, output_len,
|
| 529 |
-
n_layers, n_attn_heads, d_head)
|
| 530 |
-
attn_scores.append(attn_score)
|
| 531 |
-
|
| 532 |
-
# Calculate KV sizes
|
| 533 |
-
kv_size = (prefill_len * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2) / 1e12
|
| 534 |
-
true_kv = (prefill_len * per_token_kv_size + output_len * per_token_kv_size) / 1e9
|
| 535 |
-
|
| 536 |
-
kvs.append(kv_size)
|
| 537 |
-
true_kvs.append(true_kv)
|
| 538 |
-
|
| 539 |
-
return {
|
| 540 |
-
'kv_size': sum(kvs),
|
| 541 |
-
'true_kv_size': sum(true_kvs) * 1e3,
|
| 542 |
-
'attention_score': sum(attn_scores) / len(attn_scores)
|
| 543 |
-
}
|
| 544 |
-
|
| 545 |
-
|
| 546 |
-
def _calculate_attention_score(model_name, hf_config, prefill_len, output_len,
|
| 547 |
-
n_layers, n_attn_heads, d_head):
|
| 548 |
-
"""Calculate attention score for a single output"""
|
| 549 |
-
if ("DeepSeek" in model_name and
|
| 550 |
-
hasattr(hf_config, "qk_rope_head_dim") and
|
| 551 |
-
hasattr(hf_config, "qk_nope_head_dim") and
|
| 552 |
-
hasattr(hf_config, "v_head_dim")):
|
| 553 |
-
|
| 554 |
-
q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
|
| 555 |
-
k_size = n_layers * n_attn_heads * q_head_dim
|
| 556 |
-
v_size = n_layers * n_attn_heads * hf_config.v_head_dim
|
| 557 |
-
|
| 558 |
-
score = (prefill_len * k_size + (output_len - 1) * k_size / 2 +
|
| 559 |
-
prefill_len * v_size + (output_len - 1) * v_size / 2)
|
| 560 |
-
else:
|
| 561 |
-
kv_size = n_layers * n_attn_heads * d_head
|
| 562 |
-
score = (prefill_len * kv_size + (output_len - 1) * kv_size / 2) * 2
|
| 563 |
-
|
| 564 |
-
return score / 1e12
|
| 565 |
-
|
| 566 |
-
|
| 567 |
-
def _calculate_throughput_metrics(batch_size, prefill_lengths, max_duration):
|
| 568 |
-
"""Calculate throughput metrics"""
|
| 569 |
-
total_prefill = sum(prefill_lengths)
|
| 570 |
-
prefill_tp = total_prefill / (max_duration)
|
| 571 |
-
ttft = max_duration / batch_size
|
| 572 |
-
return ttft, prefill_tp
|
| 573 |
-
|
| 574 |
-
|
| 575 |
-
def _calculate_smbu_smfu(model_name, n_layers, attention_size_per_token, expert_config,
|
| 576 |
-
avg_activated_experts, metrics_data, hardware_specs, num_gpus,
|
| 577 |
-
precision, ttft, prefill_tp, batch_size, decoding_tp):
|
| 578 |
-
"""Calculate S-MBU and S-MFU metrics"""
|
| 579 |
-
prefill_activation = avg_activated_experts[1]
|
| 580 |
-
decode_steps_activation = avg_activated_experts[2:]
|
| 581 |
-
|
| 582 |
-
# Calculate prefill metrics
|
| 583 |
-
prefill_smbu, prefill_smfu = _calculate_prefill_metrics(
|
| 584 |
-
model_name, n_layers, attention_size_per_token, expert_config,
|
| 585 |
-
prefill_activation, metrics_data['attention_score'], hardware_specs,
|
| 586 |
-
num_gpus, precision, ttft, prefill_tp
|
| 587 |
-
)
|
| 588 |
-
|
| 589 |
-
# Calculate decoding metrics
|
| 590 |
-
decoding_smbu, decoding_smfu = _calculate_decoding_metrics(
|
| 591 |
-
model_name, n_layers, attention_size_per_token, expert_config,
|
| 592 |
-
decode_steps_activation, metrics_data, hardware_specs,
|
| 593 |
-
num_gpus, precision, batch_size, decoding_tp
|
| 594 |
-
)
|
| 595 |
-
|
| 596 |
-
return {
|
| 597 |
-
'prefill_smbu': prefill_smbu,
|
| 598 |
-
'prefill_smfu': prefill_smfu,
|
| 599 |
-
'decoding_smbu': decoding_smbu,
|
| 600 |
-
'decoding_smfu': decoding_smfu
|
| 601 |
-
}
|
| 602 |
-
|
| 603 |
-
|
| 604 |
-
def _calculate_prefill_metrics(model_name, n_layers, attention_size_per_token, expert_config,
|
| 605 |
-
prefill_activation, attention_score, hardware_specs,
|
| 606 |
-
num_gpus, precision, ttft, prefill_tp):
|
| 607 |
-
"""Calculate prefill S-MBU and S-MFU"""
|
| 608 |
-
model_calculators = {
|
| 609 |
-
'Qwen': _calculate_qwen_prefill,
|
| 610 |
-
'Qwen3': _calculate_qwen3_prefill,
|
| 611 |
-
'DeepSeek': _calculate_deepseek_prefill
|
| 612 |
-
}
|
| 613 |
-
|
| 614 |
-
for model_type, calculator in model_calculators.items():
|
| 615 |
-
if model_type in model_name and (model_type != 'Qwen' or 'Qwen3' not in model_name):
|
| 616 |
-
return calculator(n_layers, attention_size_per_token, expert_config,
|
| 617 |
-
prefill_activation, attention_score, hardware_specs,
|
| 618 |
-
num_gpus, precision, ttft, prefill_tp)
|
| 619 |
-
|
| 620 |
-
# Default case
|
| 621 |
-
return _calculate_default_prefill(n_layers, attention_size_per_token, expert_config,
|
| 622 |
-
prefill_activation, attention_score, hardware_specs,
|
| 623 |
-
num_gpus, precision, ttft, prefill_tp)
|
| 624 |
-
|
| 625 |
-
|
| 626 |
-
def _calculate_decoding_metrics(model_name, n_layers, attention_size_per_token, expert_config,
|
| 627 |
-
decode_steps_activation, metrics_data, hardware_specs,
|
| 628 |
-
num_gpus, precision, batch_size, decoding_tp):
|
| 629 |
-
"""Calculate decoding S-MBU and S-MFU"""
|
| 630 |
-
decoding_smbus = []
|
| 631 |
-
|
| 632 |
-
for activation in decode_steps_activation:
|
| 633 |
-
if "Qwen" in model_name and "Qwen3" not in model_name:
|
| 634 |
-
smbu, smfu = _calculate_qwen_decoding(n_layers, attention_size_per_token, expert_config,
|
| 635 |
-
activation, metrics_data, hardware_specs, num_gpus,
|
| 636 |
-
precision, batch_size, decoding_tp)
|
| 637 |
-
elif "Qwen3" in model_name:
|
| 638 |
-
smbu, smfu = _calculate_qwen3_decoding(n_layers, attention_size_per_token, expert_config,
|
| 639 |
-
activation, metrics_data, hardware_specs, num_gpus,
|
| 640 |
-
precision, batch_size, decoding_tp)
|
| 641 |
-
elif "DeepSeek" in model_name:
|
| 642 |
-
smbu, smfu = _calculate_deepseek_decoding(n_layers, attention_size_per_token, expert_config,
|
| 643 |
-
activation, metrics_data, hardware_specs, num_gpus,
|
| 644 |
-
precision, batch_size, decoding_tp)
|
| 645 |
-
else:
|
| 646 |
-
smbu, smfu = _calculate_default_decoding(n_layers, attention_size_per_token, expert_config,
|
| 647 |
-
activation, metrics_data, hardware_specs, num_gpus,
|
| 648 |
-
precision, batch_size, decoding_tp)
|
| 649 |
-
decoding_smbus.append(smbu)
|
| 650 |
-
|
| 651 |
-
return sum(decoding_smbus) / len(decoding_smbus), smfu
|
| 652 |
-
|
| 653 |
-
|
| 654 |
-
# Helper functions for specific model calculations
|
| 655 |
-
def _calculate_qwen_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
|
| 656 |
-
attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
|
| 657 |
-
smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
|
| 658 |
-
expert_config['shared_experts_size_total'] +
|
| 659 |
-
attention_size_per_token)) * precision / ttft
|
| 660 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 661 |
-
|
| 662 |
-
smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size'] +
|
| 663 |
-
expert_config['shared_experts_size_total']) + attention_score) * 2 * prefill_tp
|
| 664 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 665 |
-
|
| 666 |
-
return smbu, smfu
|
| 667 |
-
|
| 668 |
-
|
| 669 |
-
def _calculate_qwen3_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
|
| 670 |
-
attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
|
| 671 |
-
smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
|
| 672 |
-
attention_size_per_token)) * precision / ttft
|
| 673 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 674 |
-
|
| 675 |
-
smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size']) +
|
| 676 |
-
attention_score) * 2 * prefill_tp
|
| 677 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 678 |
-
|
| 679 |
-
return smbu, smfu
|
| 680 |
-
|
| 681 |
-
|
| 682 |
-
def _calculate_deepseek_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
|
| 683 |
-
attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
|
| 684 |
-
smbu_numerator = ((n_layers * attention_size_per_token +
|
| 685 |
-
expert_config['deepseek_sparse_layer_num'] *
|
| 686 |
-
(prefill_activation * expert_config['expert_size'] +
|
| 687 |
-
expert_config['shared_experts_size_total']) +
|
| 688 |
-
expert_config['deepseek_num_dense_layer'] *
|
| 689 |
-
expert_config['deepseek_dense_ffn_size']) * precision / ttft)
|
| 690 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 691 |
-
|
| 692 |
-
smfu_numerator = ((n_layers * attention_size_per_token +
|
| 693 |
-
expert_config['deepseek_sparse_layer_num'] *
|
| 694 |
-
(expert_config['expert_size'] + expert_config['shared_experts_size_total']) +
|
| 695 |
-
expert_config['deepseek_num_dense_layer'] *
|
| 696 |
-
expert_config['deepseek_dense_ffn_size'] + attention_score) * 2 * prefill_tp)
|
| 697 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 698 |
-
|
| 699 |
-
return smbu, smfu
|
| 700 |
-
|
| 701 |
-
|
| 702 |
-
def _calculate_default_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
|
| 703 |
-
attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
|
| 704 |
-
# Default implementation
|
| 705 |
-
smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
|
| 706 |
-
attention_size_per_token)) * precision / ttft
|
| 707 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 708 |
-
|
| 709 |
-
smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size']) +
|
| 710 |
-
attention_score) * 2 * prefill_tp
|
| 711 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 712 |
-
|
| 713 |
-
return smbu, smfu
|
| 714 |
-
|
| 715 |
-
|
| 716 |
-
def _calculate_qwen_decoding(n_layers, attention_size_per_token, expert_config, activation,
|
| 717 |
-
metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
|
| 718 |
-
smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
|
| 719 |
-
expert_config['shared_experts_size_total'] +
|
| 720 |
-
attention_size_per_token) +
|
| 721 |
-
metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
|
| 722 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 723 |
-
|
| 724 |
-
smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size'] +
|
| 725 |
-
expert_config['shared_experts_size_total']) +
|
| 726 |
-
metrics_data['attention_score']) * 2 * decoding_tp)
|
| 727 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 728 |
-
|
| 729 |
-
return smbu, smfu
|
| 730 |
-
|
| 731 |
-
|
| 732 |
-
def _calculate_qwen3_decoding(n_layers, attention_size_per_token, expert_config, activation,
|
| 733 |
-
metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
|
| 734 |
-
smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
|
| 735 |
-
attention_size_per_token) +
|
| 736 |
-
metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
|
| 737 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 738 |
-
|
| 739 |
-
smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size']) +
|
| 740 |
-
metrics_data['attention_score']) * 2 * decoding_tp)
|
| 741 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 742 |
-
|
| 743 |
-
return smbu, smfu
|
| 744 |
-
|
| 745 |
-
|
| 746 |
-
def _calculate_deepseek_decoding(n_layers, attention_size_per_token, expert_config, activation,
|
| 747 |
-
metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
|
| 748 |
-
smbu_numerator = ((n_layers * attention_size_per_token +
|
| 749 |
-
expert_config['deepseek_sparse_layer_num'] *
|
| 750 |
-
(activation * expert_config['expert_size'] +
|
| 751 |
-
expert_config['shared_experts_size_total']) +
|
| 752 |
-
expert_config['deepseek_num_dense_layer'] *
|
| 753 |
-
expert_config['deepseek_dense_ffn_size'] +
|
| 754 |
-
metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
|
| 755 |
-
smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
|
| 756 |
-
|
| 757 |
-
smfu_numerator = ((n_layers * attention_size_per_token +
|
| 758 |
-
expert_config['deepseek_sparse_layer_num'] *
|
| 759 |
-
(expert_config['expert_size'] + expert_config['shared_experts_size_total']) +
|
| 760 |
-
expert_config['deepseek_num_dense_layer'] *
|
| 761 |
-
expert_config['deepseek_dense_ffn_size'] +
|
| 762 |
-
metrics_data['attention_score']) * 2 * decoding_tp)
|
| 763 |
-
smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
|
| 764 |
-
|
| 765 |
-
return smbu, smfu
|
| 766 |
-
|
| 767 |
-
|
-def _calculate_default_decoding(n_layers, attention_size_per_token, expert_config, activation,
-                                metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
-    smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
-                                   attention_size_per_token) +
-                       metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
-    smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
-
-    smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size']) +
-                       metrics_data['attention_score']) * 2 * decoding_tp)
-    smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
-
-    return smbu, smfu
-
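All four removed helpers instantiate the same roofline pattern: S-MBU compares the bytes a decode step actually moves (activated experts, attention weights, KV cache) against aggregate peak memory bandwidth, and S-MFU compares the FLOPs actually spent per second against aggregate peak compute. A self-contained sketch of the default variant, with hypothetical Mixtral-like numbers (all sizes pre-scaled to TB / TFLOP units, as in the code above):

```python
# Illustrative only -- hypothetical values, not leaderboard measurements.
n_layers = 32
attention_size_per_token = 4.2e-5    # TB of attention weights read per token
expert_size = 1.76e-4                # TB per expert
avg_activated_experts = 2.0          # measured average ("activation" above)
kv_size = 5.0e-4                     # TB of KV-cache traffic per step (batch total)
attention_score = 1.0e-3             # TFLOPs of attention compute
precision = 2                        # bytes per parameter (float16)
batch_size, decoding_tp = 32, 450.0  # requests, aggregate tokens/s
num_gpus, peak_bw_tb, peak_flops_tf = 1, 2.039, 1513.0

smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token)
         + kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * peak_bw_tb)
smfu = ((n_layers * (attention_size_per_token + expert_size)
         + attention_score) * 2 * decoding_tp) / (num_gpus * peak_flops_tf / 2)
print(f"S-MBU = {smbu:.1%}, S-MFU = {smfu:.1%}")
```

With these made-up numbers S-MBU lands far above S-MFU, which is the expected shape for memory-bound decoding.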
-def _calculate_batch_metrics_hflm(output_len, context_prefill_size, decoding_tp, n_layers, d_model,
-                                  n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
-                                  avg_activated_experts, hf_config, num_gpus, model_name,
-                                  used_dtype, batch_size, precision):
-    """Calculate metrics for a batch of outputs"""
-    gpu_type = get_gpu_details()
-    hardware_specs = {
-        "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
-        "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
-    }
-
-    # Calculate KV sizes
-    per_token_kv_size = 2 * n_layers * d_head * n_kv_heads  # Default calculation
-
-    if "DeepSeek" in model_name:
-        if hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
-            per_token_kv_size = n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
-
-    # Calculate attention scores
-    if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim"):
-        q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
-        origin_per_token_k_state_size = n_layers * n_attn_heads * q_head_dim
-        origin_per_token_v_state_size = n_layers * n_attn_heads * hf_config.v_head_dim
-        attention_score = context_prefill_size * origin_per_token_k_state_size + (output_len - 1) * origin_per_token_k_state_size / 2
-        attention_score += context_prefill_size * origin_per_token_v_state_size + (output_len - 1) * origin_per_token_v_state_size / 2
-        attention_score = attention_score / 1e12
     else:
-
-        attention_score = context_prefill_size * origin_per_token_kv_states_size + (output_len - 1) * origin_per_token_kv_states_size / 2
-        attention_score = attention_score * 2 / 1e12
-
-    # Store attention scores and KV sizes
-    kv_size = context_prefill_size * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2
-    kv_size = kv_size / 1e12
-    true_kv = (context_prefill_size * per_token_kv_size + output_len * per_token_kv_size) / 1e12 * 1e3
-
-    # Calculate aggregate values
-    kv_size = kv_size * batch_size
-    true_kv_size = true_kv * batch_size * 1e3
-    # Calculate attention size per token
-    if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim") and hasattr(hf_config, "kv_lora_rank"):
-        q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
-        if not hasattr(hf_config, "q_lora_rank") or not hf_config.q_lora_rank:
-            attention_size_per_token = (d_model * n_attn_heads * q_head_dim) + \
-                (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
-                (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
-                (hf_config.v_head_dim * n_attn_heads * d_model)
-            attention_size_per_token = attention_size_per_token / 1e12
-        else:
-            attention_size_per_token = (d_model * hf_config.q_lora_rank) + \
-                (hf_config.q_lora_rank * n_attn_heads * q_head_dim) + \
-                (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
-                (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
-                (hf_config.v_head_dim * n_attn_heads * d_model)
-            attention_size_per_token = attention_size_per_token / 1e12
-    else:
-        attention_size_per_token = d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) + n_attn_heads * d_head * d_model
-        attention_size_per_token = attention_size_per_token / 1e12
-
-    # Calculate expert sizes
-    expert_size = d_ff * 3 * d_model / 1e12
-    shared_experts_size_total = 0
-    deepseek_dense_ffn_size = 0
-    deepseek_sparse_layer_num = 0
-
-    if "Qwen" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "shared_expert_intermediate_size"):
-        d_ff = hf_config.moe_intermediate_size
-        d_ff_share = hf_config.shared_expert_intermediate_size
-        shared_experts_size = d_ff_share * 3 * d_model
-        expert_size = d_ff * 3 * d_model
-        shared_experts_size_total = shared_experts_size / 1e12
-        expert_size = expert_size / 1e12
-    elif "Qwen3" in model_name and hasattr(hf_config, "moe_intermediate_size"):
-        d_ff = hf_config.moe_intermediate_size
-        expert_size = d_ff * 3 * d_model
-        expert_size = expert_size / 1e12
-    elif "DeepSeek" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "intermediate_size") and hasattr(hf_config, "first_k_dense_replace"):
-        d_ff = hf_config.moe_intermediate_size
-        d_ff_dense = hf_config.intermediate_size
-        deepseek_num_dense_layer = hf_config.first_k_dense_replace
-        shared_experts_size = d_ff * 3 * d_model
-        expert_size = d_ff * 3 * d_model
-        shared_experts = 2
-        shared_experts_size_total = shared_experts_size * shared_experts / 1e12
-        expert_size = expert_size / 1e12
-        deepseek_sparse_layer_num = n_layers - deepseek_num_dense_layer
-        deepseek_dense_ffn_size = d_ff_dense * 3 * d_model / 1e12
-
-    # Calculate S-MBU and S-MFU
-    if "Qwen" in model_name:
-        smbu = ((n_layers * (avg_activated_experts * expert_size + shared_experts_size_total + attention_size_per_token) +
-                 kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
-        smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size + shared_experts_size_total) + attention_score) \
-            * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
-    elif "Qwen3" in model_name:
-        smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
-                 kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
-        smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
-            * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
-    elif "DeepSeek" in model_name:
-        smbu = ((n_layers * attention_size_per_token + deepseek_sparse_layer_num *
-                 (avg_activated_experts * expert_size + shared_experts_size_total) +
-                 deepseek_num_dense_layer * deepseek_dense_ffn_size +
-                 kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
-        smfu = (n_layers * attention_size_per_token + deepseek_sparse_layer_num *
-                (n_experts_per_tok * expert_size + shared_experts_size_total) +
-                deepseek_num_dense_layer * deepseek_dense_ffn_size + attention_score) \
-            * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
-    else:
-        smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
-                 kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
-        smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
-            * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
-
-    return {
-        'smbu': smbu,
-        'smfu': smfu,
-        'kv_size': true_kv_size,
-        'decoding_throughput': decoding_tp,
-        'ttft': 0
-    }
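The `(output_len - 1) / 2` factor above is the one non-obvious term: the KV cache grows by one entry per generated token, so averaged over a full generation each decode step reads the prefill portion plus half of the (eventually complete) generated portion. A numeric illustration with hypothetical GQA shapes:

```python
# Hypothetical shapes, not taken from any leaderboard run.
n_layers, d_head, n_kv_heads = 32, 128, 8
per_token_kv = 2 * n_layers * d_head * n_kv_heads      # K and V entries per token
context, out = 1024, 256                               # prefill and generated tokens
avg_kv_read_per_step = context * per_token_kv + (out - 1) * per_token_kv / 2
print(avg_kv_read_per_step)  # elements read by the average decode step
```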
-class ModelInfoRetriever:
-    def __init__(self, model_name: str, precision: str = 'float16'):
-        if precision not in ['float32', 'float16', 'bfloat16', 'int8', 'int4', 'awq', 'gptq', 'fp8', 'fp4']:
-            raise ValueError("Precision must be one of ['float32', 'float16', 'bfloat16', 'int8', 'int4', 'awq', 'gptq', 'fp8', 'fp4']")
-        self.model_name = model_name
-        self.precision = precision
-        self.config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
-        self.model_type = self.config.model_type
-
-    def get_model_precision_bits(self):
-        """Returns bit width used by the given quantization format."""
-        if self.precision == 'float32':
-            return 4
-        if self.precision in ['float16', 'bfloat16']:
-            return 2
-        if self.precision in ['int8', 'fp8']:
-            return 1
-        if self.precision in ['int4', 'fp4', 'gptq', 'awq']:
-            return 0.5
-        raise ValueError(f"Unsupported precision: {self.precision}")
-
-    def get_attention_info(self):
-        """Returns attention-related info"""
-        return {
-            'num_attention_heads': getattr(self.config, "num_attention_heads", None),
-            'num_key_value_heads': getattr(self.config, "num_key_value_heads", getattr(self.config, "num_kv_heads", None)),
-            'head_dim': getattr(self.config, "head_dim", getattr(self.config, "hidden_size", None) // getattr(self.config, "num_attention_heads", 1))
-        }
-
-    def get_rope_info(self):
-        """Returns RoPE (rotary embedding) info if available"""
-        if hasattr(self.config, "rope_scaling"):
-            return {
-                "type": self.config.rope_scaling.get("type"),
-                "factor": self.config.rope_scaling.get("factor")
-            }
-        elif hasattr(self.config, "use_alibi"):
-            return {"type": "alibi", "enabled": self.config.use_alibi}
-        else:
-            return {"type": "none"}
-
-    def get_moe_info(self, d_model=None):
-        """Returns MoE configuration such as number of experts and FFN dim"""
-        if d_model is None:
-            d_model = getattr(self.config, "hidden_size", None)
-
-        num_experts = (
-            getattr(self.config, "num_local_experts", None) or
-            getattr(self.config, "num_experts", None) or
-            getattr(self.config, "n_routed_experts", None) or
-            getattr(getattr(self.config, "ffn_config", {}), "moe_num_experts", None) or
-            1
-        )
-        n_experts_per_tok = (
-            getattr(self.config, "num_experts_per_tok", None) or
-            getattr(self.config, "num_selected_experts", None) or
-            getattr(getattr(self.config, "ffn_config", {}), "moe_top_k", None) or
-            1
-        )
-        d_ff = (
-            getattr(self.config, "ffn_dim", None) or
-            getattr(self.config, "intermediate_size", None) or
-            getattr(self.config, "d_ff", None) or
-            (d_model * getattr(self.config, "ff_ratio", 4)) or
-            getattr(getattr(self.config, "ffn_config", {}), "ffn_hidden_size", None) or
-            (4 * d_model)
-        )
-
-        return {
-            "num_experts": num_experts,
-            "experts_per_token": n_experts_per_tok,
-            "ffn_dim": d_ff
-        }
-
-    def get_architecture_info(self):
-        """Returns model-wide architecture info"""
-        return {
-            "model_type": self.model_type,
-            "hidden_size": getattr(self.config, "hidden_size", None),
-            "num_hidden_layers": getattr(self.config, "num_hidden_layers", None),
-            "max_position_embeddings": getattr(self.config, "max_position_embeddings", None),
-            "vocab_size": getattr(self.config, "vocab_size", None),
-            "architectures": getattr(self.config, "architectures", []),
-        }
-
-    def summarize(self):
-        """Aggregate all extracted info in a dictionary"""
-        d_model = getattr(self.config, "hidden_size", None)
-        return {
-            "model_name": self.model_name,
-            "model_type": self.model_type,
-            "precision_bits": self.get_model_precision_bits(),
-            "architecture": self.get_architecture_info(),
-            "attention": self.get_attention_info(),
-            "rope": self.get_rope_info(),
-            "moe": self.get_moe_info(d_model)
-        }
-
-
-
-
-# print(get_gpu_details())
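The removed `ModelInfoRetriever` wrapped `transformers.AutoConfig` to expose attention, RoPE, and MoE fields through one dictionary. A hypothetical usage sketch, assuming the class is still importable (it fetches the public config from the Hugging Face Hub, so network access is required; the example output is what Mixtral's config should yield):

```python
# Illustrative usage of the removed helper.
info = ModelInfoRetriever("mistralai/Mixtral-8x7B-Instruct-v0.1", precision="float16")
print(info.get_moe_info())
# e.g. {'num_experts': 8, 'experts_per_token': 2, 'ffn_dim': 14336}
print(info.summarize()["attention"])
```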
...

 import re
 import os
 import GPUtil

 try:
     from src.display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
...
     from display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name

 MEM_BW_DICT ={
+    "NVIDIA-A100-PCIe-80GB": 1935,
+    "NVIDIA-A100-SXM-80GB": 2039,
+    "NVIDIA-H100-PCIe-80GB": 2039,
+    "NVIDIA-RTX-A5000-24GB": 768
 }

 PEAK_FLOPS_DICT = {
     "float32":{
         "NVIDIA-A100-PCIe-80GB": 312e12,
+        "NVIDIA-A100-SXM-80GB": 312e12,
         "NVIDIA-H100-PCIe-80GB": 756e12,
+        "NVIDIA-RTX-A5000-24GB": 222.2e12
     },
     "float16":{
         "NVIDIA-A100-PCIe-80GB": 624e12,
+        "NVIDIA-A100-SXM-80GB": 624e12,
         "NVIDIA-H100-PCIe-80GB": 1513e12,
+        "NVIDIA-RTX-A5000-24GB": 444.4e12
     },
     "bfloat16":{
         "NVIDIA-A100-PCIe-80GB": 624e12,
+        "NVIDIA-A100-SXM-80GB": 624e12,
         "NVIDIA-H100-PCIe-80GB": 1513e12,
+        "NVIDIA-RTX-A5000-24GB": 444.4e12
     },
+    "8bit":{
         "NVIDIA-A100-PCIe-80GB": 1248e12,
+        "NVIDIA-A100-SXM-80GB": 1248e12,
         "NVIDIA-H100-PCIe-80GB": 3026e12,
+        "NVIDIA-RTX-A5000-24GB": 889e12
     },
+    "4bit": {
+        "NVIDIA-A100-PCIe-80GB": 2496e12,
+        "NVIDIA-A100-SXM-80GB": 2496e12,
+        "NVIDIA-H100-PCIe-80GB": 6052e12,
+        "NVIDIA-RTX-A5000-24GB": 1778e12
     }
+
 }
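These two tables supply the denominators of the utilisation metrics; the bandwidth entries appear to be GB/s and the FLOPS entries FLOP/s, matching NVIDIA's published A100/H100 figures. A small sketch of how they are read (`get_peak_flops` is defined later in this file, and the removed code also called a `get_peak_bw` counterpart):

```python
gpu = "NVIDIA-A100-SXM-80GB"                # one of the newly added entries
print(MEM_BW_DICT[gpu])                     # 2039 -- GB/s on the A100 SXM spec sheet
print(PEAK_FLOPS_DICT["bfloat16"][gpu])     # 624e12 FLOP/s
print(PEAK_FLOPS_DICT["4bit"][gpu])         # 2496e12 FLOP/s
```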

 def my_snapshot_download(repo_id, revision, local_dir, repo_type, max_workers):
...
     # print(f"gpu_indices: {gpu_indices}")
     gpu_stats = []

+    gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W / \d+W\s+\|\s+(\d+)MiB / \d+MiB\s+\|\s+(\d+)%')
     # gpu_name_pattern = re.compile(r'NVIDIA\s+([\w\s]+\d+(?:\s*GB)?)')
     gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')

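The updated `gpu_info_pattern` pulls temperature, power draw, memory use, and utilisation out of a raw `nvidia-smi` table row. A quick check against a made-up row (illustrative only):

```python
import re

gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W / \d+W\s+\|\s+(\d+)MiB / \d+MiB\s+\|\s+(\d+)%')
row = "| 30%   45C    P2   150W / 300W |   8000MiB / 24564MiB |     60%      Default |"
m = gpu_info_pattern.search(row)
print(m.groups())  # ('45', '150', '8000', '60') -> temp C, power W, memory MiB, util %
```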
...
 def get_peak_flops(gpu_name, precision):
     return PEAK_FLOPS_DICT[precision][gpu_name]

+def transfer_precision2bytes(precision):
+    if precision == "float32":
+        return 4
+    elif precision in ["float16", "bfloat16"]:
+        return 2
+    elif precision == "8bit":
+        return 1
+    elif precision == "4bit":
+        return 0.5
     else:
+        raise ValueError(f"Unsupported precision: {precision}")
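The new `transfer_precision2bytes` maps a precision label to bytes per parameter, the factor the S-MBU formulas multiply by:

```python
# Assuming transfer_precision2bytes from this module is in scope.
assert transfer_precision2bytes("float32") == 4
assert transfer_precision2bytes("bfloat16") == 2
assert transfer_precision2bytes("4bit") == 0.5   # half a byte per weight
```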

+if __name__ == "__main__":
+    print(analyze_gpu_stats(parse_nvidia_smi()))