Dockerfile CHANGED
@@ -1,16 +1,8 @@
1
- FROM python:3.10
2
-
3
- RUN apt-get update && apt-get install -y git
4
-
5
- RUN pip install --no-cache-dir \
6
- gradio>=4.44.0 \
7
- pandas>=2.0.0 \
8
- datasets>=2.16.0 \
9
- huggingface_hub>=0.20.0 \
10
- uvicorn>=0.23.0 \
11
- fastapi \
12
- spaces
13
-
14
- COPY app.py /app/app.py
15
-
16
- CMD ["python", "/app/app.py"]
 
1
+ # Build on the prebuilt leaderboard Space image
2
+ FROM registry.hf.space/sparse-generative-ai-open-moe-llm-leaderboard:latest
3
+
4
+ RUN pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity --no-cache-dir
5
+ # Pin pydantic to resolve a dependency conflict with moe-infinity
6
+ RUN pip install pydantic==2.6.4 --no-cache-dir
7
+ # Download the spaCy English model (en_core_web_sm) required by the selfcheckgpt task
8
+ RUN python -m spacy download en_core_web_sm
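For local testing, the image can be built and run like any other container (a sketch; the image tag is illustrative, and the app is assumed to listen on Gradio's default port 7860):

```bash
# Build the Space image locally and smoke-test it
docker build -t open-moe-llm-leaderboard .
docker run --rm -p 7860:7860 open-moe-llm-leaderboard
```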
 
README.md CHANGED
@@ -1,108 +1,85 @@
1
  ---
2
- title: MoE-CAP Dashboard
3
  emoji: 🔥
4
  colorFrom: green
5
  colorTo: indigo
6
  sdk: gradio
7
- sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: true
10
  license: apache-2.0
11
  fullWidth: true
12
- allow_embedding: true
13
  tags:
14
- - leaderboard
15
  ---
16
 
17
- # MoE-CAP
18
 
19
- MoE-CAP is a benchmarking method for sparse MoE systems that evaluates them jointly across three dimensions: Cost, Accuracy, and Performance.
20
 
21
- ## News
22
- - MoE-CAP has been accepted to the NeurIPS 2025 Datasets and Benchmarks Track 🎉 See you in San Diego, USA.
23
 
24
- ## Requirements
25
- Python: >= 3.9
26
 
27
- ## Installation
28
- ```bash
29
- git clone https://github.com/sparse-generative-ai/MoE-CAP.git
30
- cd MoE-CAP
31
- pip install -e .
32
- ```
33
- Then you can import `moe_cap` directly.
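As a quick sanity check that the editable install worked (a minimal sketch; only the top-level `moe_cap` package name is confirmed above):

```python
# Confirm the editable install resolves to the cloned checkout
import moe_cap
print(moe_cap.__file__)
```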
34
 
35
- ## Quick Example
36
- ### SGLang
37
- 1. Launch our custom SGLang server (e.g., on H100 GPUs):
38
- ```bash
39
- python -m moe_cap.systems.sglang \
40
- --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
41
- --port 30000 \
42
- --expert-distribution-recorder-mode stat \
43
- --tp-size 8 \
44
- --reasoning-parser deepseek-r1
45
- ```
46
 
47
- 2. Run our benchmark
48
- ```bash
49
- python -m moe_cap.runner.sglang_profile \
50
- --config-file configs/gsm8k_qwen3_235b_a22b.yaml \
51
- --output_dir outputs/sglang/
52
- ```
53
 
54
- The results will be stored under `outputs/sglang/`.
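Each result is a JSON record of metrics; the dashboard loader (`json_to_row` in app.py) reads fields like the ones below (a hypothetical record with illustrative values, not real measurements):

```python
# Shape of one result record, keyed as json_to_row expects
example_metrics = {
    "model_name": "Qwen/Qwen3-235B-A22B-Thinking-2507",
    "dataset": "gsm8k",
    "method": "sglang",
    "precision": "bfloat16",
    "gpu_type": "H100",
    "e2e_s": 12.3,                # end-to-end latency in seconds
    "cost": 0.42,                 # dollars per run
    "decoding_throughput": 95.0,  # tokens per second
    "ttft": 0.8,                  # time to first token (s)
    "tpot": 0.01,                 # time per output token (s)
}
```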
 
55
 
56
- ### vLLM
57
- ```bash
58
- python -m moe_cap.systems.vllm \
59
- --model Qwen/Qwen3-235B-A22B-Thinking-2507 \
60
- --port 8000 \
61
- --host 0.0.0.0 \
62
- --tensor-parallel-size 8 \
63
- --reasoning-parser deepseek_r1 \
64
- --max-num-batched-tokens 131072 # Set max-num-batched-tokens high, per the vLLM tuning guide.
65
- # V1's mixed prefill-decode batching makes separate profiling difficult.
66
- ```
67
 
68
  ```bash
69
- python -m moe_cap.runner.openai_api_profile \
70
- --config-file configs/gsm8k_qwen3_235b_a22b.yaml \
71
- --output_dir outputs/vllm/ \
72
- --api-url http://0.0.0.0:8000/v1/completions
 
 
73
  ```
74
 
75
- The results will be stored under `outputs/vllm/`.
76
 
77
- ## Contributing to MoE-CAP
78
 
79
- Thank you for your interest in contributing to the MoE-CAP project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yinsicheng Jiang, Yao Fu or Yeqi Huang via email at [ysc.jiang@ed.ac.uk](mailto:ysc.jiang@ed.ac.uk), [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk) or [yeqi.huang@ed.ac.uk](mailto:yeqi.huang@ed.ac.uk).
80
 
81
- ### What We're Looking For in Contributions
 
 
 
82
 
83
- We are looking for contributions in several key areas to enhance the MoE-CAP project:
84
 
85
- 1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
86
 
87
- 2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
 
 
88
 
89
- 3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
90
 
91
- 4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
92
 
93
- Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
 
94
 
95
- Your contributions are crucial to the success and improvement of the MoE-CAP project. We look forward to collaborating with you.
96
-
97
- ## Cite our paper
98
- ```bibtex
99
- @misc{jiang2025moecapbenchmarkingcostaccuracy,
100
- title={MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems},
101
- author={Yinsicheng Jiang and Yao Fu and Yeqi Huang and Ping Nie and Zhan Lu and Leyang Xue and Congjie He and Man-Kit Sit and Jilong Xue and Li Dong and Ziming Miao and Dayou Du and Tairan Xu and Kai Zou and Edoardo Ponti and Luo Mai},
102
- year={2025},
103
- eprint={2412.07067},
104
- archivePrefix={arXiv},
105
- primaryClass={cs.LG},
106
- url={https://arxiv.org/abs/2412.07067},
107
- }
108
- ```
 
1
  ---
2
+ title: OPEN-MOE-LLM-LEADERBOARD
3
  emoji: 🔥
4
  colorFrom: green
5
  colorTo: indigo
6
  sdk: gradio
7
+ sdk_version: 4.26.0
8
  app_file: app.py
9
  pinned: true
10
  license: apache-2.0
11
  fullWidth: true
 
12
  tags:
13
+ - leaderboard
14
  ---
15
 
16
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
17
 
18
+ # Contributing to Open-MOE-LLM-Leaderboard
19
 
20
+ Thank you for your interest in contributing to the Open-MOE-LLM-Leaderboard project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yao Fu via email at [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk).
 
21
 
22
+ ## What We're Looking For in Contributions
 
23
 
24
+ We are looking for contributions in several key areas to enhance the Open-MOE-LLM-Leaderboard project:
 
25
 
26
+ 1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
 
27
 
28
+ 2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
 
29
 
30
+ 3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
31
+
32
+ 4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
33
+
34
+ Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
35
+
36
+ Your contributions are crucial to the success and improvement of the Open-MOE-LLM-Leaderboard project. We look forward to collaborating with you.
37
 
38
+
39
+ ## Development Setup
40
+
41
+ To start contributing, set up your development environment as follows:
 
42
 
43
  ```bash
44
+ conda create -n leaderboard python=3.10
45
+ conda activate leaderboard
46
+ pip install -r requirements.txt
47
+ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
48
+ pip install pydantic==2.6.4 # Resolves a dependency conflict with moe-infinity
49
+ python -m spacy download en_core_web_sm # Required for selfcheckgpt; the bare "en" shortcut is deprecated in spaCy 3
50
  ```
51
 
52
+ ## Architecture Overview
53
 
54
+ The Open-MOE-LLM-Leaderboard project uses the following architecture:
55
 
56
+ - **User Interface (Gradio)** --upload--> **HuggingFace Dataset (Request)** --download--> **Backend GPU Server** --upload--> **HuggingFace Dataset (Result)** --download--> **User Interface (Gradio)**
57
 
58
+ In brief:
59
+ 1. Users submit model benchmarking requests through the Gradio interface ([app.py](./app.py)). These requests are then recorded in a HuggingFace dataset ([sparse-generative-ai/requests](https://huggingface.co/datasets/sparse-generative-ai/requests)).
60
+ 2. The backend ([backend-cli.py](./backend-cli.py)), running on a GPU server, processes these requests, performs the benchmarking tasks, and uploads the results to another HuggingFace dataset ([sparse-generative-ai/results](https://huggingface.co/datasets/sparse-generative-ai/results)); a code sketch of this step follows after the list.
61
+ 3. Finally, the Gradio interface retrieves and displays these results to the users.
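As a rough sketch of step 2, the backend loop amounts to syncing the requests dataset, evaluating, and pushing results (the repo IDs follow the datasets linked above; `run_eval` is a placeholder for the real benchmarking entry point):

```python
from huggingface_hub import snapshot_download, HfApi

def poll_once(run_eval):
    # Pull pending request files from the requests dataset
    snapshot_download(repo_id="sparse-generative-ai/requests",
                      repo_type="dataset", local_dir="eval-queue")
    # Benchmark on the GPU server (placeholder entry point)
    result_path = run_eval("eval-queue")
    # Push the produced result JSON to the results dataset
    HfApi().upload_file(path_or_fileobj=result_path,
                        path_in_repo=result_path,
                        repo_id="sparse-generative-ai/results",
                        repo_type="dataset")
```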
62
 
63
+ ## Running the Gradio Interface
64
 
65
+ To launch the Gradio interface, execute:
66
 
67
+ ```bash
68
+ python app.py
69
+ ```
70
 
71
+ Then, open your browser and navigate to http://127.0.0.1:7860.
72
 
73
+ ## Running the Backend
74
 
75
+ To start the backend process, use:
76
+
77
+ ```bash
78
+ python backend-cli.py --debug
79
+ ```
80
+
81
+ For additional details, please consult the [backend-cli.py](./backend-cli.py) script.
82
+
83
+ ---
84
 
85
+ We look forward to your contributions and are here to help guide you through the process. Thank you for supporting the Open-MOE-LLM-Leaderboard project!
 
app.py CHANGED
@@ -1,527 +1,496 @@
1
  #!/usr/bin/env python
2
  import os
3
- import json
4
- from typing import List, Tuple
5
-
6
- os.environ["GRADIO_LANGUAGE"] = "en"
7
-
8
- RESULT_DIR = os.environ.get("MOECAP_RESULT_DIR")
9
- if not RESULT_DIR:
10
- # For testing purposes, you can uncomment the line below:
11
- # RESULT_DIR = "generic_result_dir"
12
- # If you are running locally without this env var,
13
- # ensure you handle this error or set the var.
14
- pass
15
 
16
  import gradio as gr
17
  import pandas as pd
18
- from datasets import load_dataset
19
- import plotly.graph_objects as go
20
-
21
-
22
- def f2(x):
23
- """Format to 2 decimal places if number, else return as-is."""
24
- if isinstance(x, (int, float)):
25
- return round(float(x), 2)
26
- return x
27
-
28
-
29
- def normalize(val, vmin, vmax, baseline=20):
30
- """Normalize value to baseline-100 range."""
31
- if vmax == vmin:
32
- return baseline + 40
33
- return baseline + (val - vmin) / (vmax - vmin) * (100 - baseline)
 
 
34
 
35
 
36
- def normalize_cost(val, max_tick, baseline=20):
37
- """Normalize cost (lower is better)."""
38
- if max_tick == 0:
39
- return baseline + 40
40
- return baseline + (max_tick - min(val, max_tick)) / max_tick * (100 - baseline)
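For concreteness, with the default `baseline=20` these helpers map raw values into a 20-100 band (a worked example against the definitions above):

```python
# normalize: linear rescale, minimum -> 20, maximum -> 100
assert normalize(0.5, vmin=0.0, vmax=1.0) == 60.0   # midpoint of the band
# normalize_cost: inverted, since lower cost is better
assert normalize_cost(0.0, max_tick=10.0) == 100.0  # zero cost scores best
assert normalize_cost(10.0, max_tick=10.0) == 20.0  # max cost sits at the baseline
```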
41
 
42
 
43
- def generate_radar_plot(selected_rows_data: List[dict]) -> go.Figure:
44
- """Generate a CAP radar plot from selected rows."""
45
-
46
- layout_settings = dict(
47
- height=750,
48
- autosize=True,
49
- margin=dict(t=80, b=100, l=80, r=80),
50
- paper_bgcolor='white',
51
- plot_bgcolor='white',
52
- )
53
 
54
- if not selected_rows_data or len(selected_rows_data) == 0:
55
- fig = go.Figure()
56
- fig.add_annotation(
57
- text="Please select 1-3 rows from the table to generate radar plot",
58
- xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
59
- font=dict(size=16, color="black"), # Ensure text is black
60
- xanchor='center', yanchor='middle'
61
- )
62
- fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
63
- return fig
64
-
65
- if len(selected_rows_data) > 3:
66
- fig = go.Figure()
67
- fig.add_annotation(
68
- text="Error: Please select no more than 3 rows!",
69
- xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
70
- font=dict(size=18, color="red"),
71
- xanchor='center', yanchor='middle'
72
  )
73
- fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
74
- return fig
75
-
76
- datasets = [row.get('Dataset', '') for row in selected_rows_data]
77
- unique_datasets = set(datasets)
78
- if len(unique_datasets) > 1:
79
- fig = go.Figure()
80
- fig.add_annotation(
81
- text="Error: Please select rows from the same dataset!",
82
- xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
83
- font=dict(size=18, color="red"),
84
- xanchor='center', yanchor='middle'
85
  )
86
- fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
87
- return fig
88
-
89
- dataset_name = datasets[0] if datasets else "Unknown"
90
-
91
- data = {}
92
- for row in selected_rows_data:
93
- model_name = row.get('Model', 'Unknown')
94
- if isinstance(model_name, str) and 'href' in model_name:
95
- try:
96
- model_name = model_name.split('>', 1)[1].split('<', 1)[0]
97
- except:
98
- pass
99
-
100
- method = row.get('Method', '')
101
- if isinstance(model_name, str) and '/' in model_name:
102
- legend_name = model_name.split('/')[-1]
103
- else:
104
- legend_name = str(model_name)
105
-
106
- if method and method not in ['Unknown', '-', '']:
107
- legend_name = f"{legend_name}-{method}"
108
-
109
- acc = row.get('Accuracy(%)', 0)
110
- cost = row.get('Cost($)', 0)
111
- throughput = row.get('Decoding T/s', 0)
112
-
113
- try:
114
- acc = float(acc) if acc not in [None, '-', ''] else 0
115
- cost = float(cost) if cost not in [None, '-', ''] else 0
116
- throughput = float(throughput) if throughput not in [None, '-', ''] else 0
117
- except:
118
- acc, cost, throughput = 0, 0, 0
119
-
120
- data[legend_name] = {
121
- 'accuracy': acc / 100.0 if acc > 1 else acc,
122
- 'cost': cost,
123
- 'throughput': throughput
124
- }
125
-
126
- throughputs = [v['throughput'] for v in data.values()]
127
- costs = [v['cost'] for v in data.values()]
128
- accs = [v['accuracy'] for v in data.values()]
129
-
130
- tp_min, tp_max = (min(throughputs), max(throughputs)) if throughputs else (0, 1)
131
- cost_max = max(costs) if costs else 1
132
- acc_min, acc_max = (min(accs), 1.0) if accs else (0, 1)
133
-
134
- baseline = 20
135
- categories = ['Throughput (T/s)', 'Cost ($)', 'Accuracy', 'Throughput (T/s)']
136
 
137
- fig = go.Figure()
138
 
139
- for system, values in data.items():
140
- raw_vals = [values['throughput'], values['cost'], values['accuracy']]
141
- norm_vals = [
142
- normalize(values['throughput'], tp_min, tp_max, baseline),
143
- normalize_cost(values['cost'], cost_max, baseline),
144
- normalize(values['accuracy'], acc_min, acc_max, baseline)
145
- ]
146
- norm_vals += [norm_vals[0]]
 
147
 
148
- hovertext = [
149
- f"Throughput: {raw_vals[0]:.2f} T/s",
150
- f"Cost: ${raw_vals[1]:.2f}",
151
- f"Accuracy: {raw_vals[2]*100:.2f}%",
152
- f"Throughput: {raw_vals[0]:.2f} T/s"
153
- ]
154
-
155
- fig.add_trace(go.Scatterpolar(
156
- r=norm_vals,
157
- theta=categories,
158
- fill='toself',
159
- name=system,
160
- text=hovertext,
161
- hoverinfo='text+name',
162
- line=dict(width=2)
163
- ))
164
-
165
- fig.update_layout(
166
- title=dict(text=f"CAP Radar Plot: {dataset_name}", x=0.5, xanchor='center', font=dict(size=20, color="black")),
167
- polar=dict(
168
- radialaxis=dict(
169
- visible=True,
170
- range=[0, 100],
171
- tickfont=dict(size=12, color="black"),
172
- gridcolor='lightgray', # Add this
173
- linecolor='gray', # Add this
174
- showline=True # Add this
175
- ),
176
- angularaxis=dict(
177
- tickfont=dict(size=14, color="black"),
178
- rotation=90,
179
- direction='clockwise',
180
- gridcolor='lightgray', # Add this
181
- linecolor='gray', # Add this
182
- showline=True # Add this
183
- ),
184
- bgcolor="white"
185
- ),
186
- legend=dict(orientation='h', yanchor='bottom', y=-0.15, xanchor='center', x=0.5, font=dict(size=13, color="black")),
187
- **layout_settings
188
- )
189
-
190
- return fig
191
-
192
-
193
- def json_to_row(path: str, metrics: dict) -> dict:
194
- model_name = metrics.get("model_name")
195
- if not model_name:
196
- model_name = "unknown-model"
197
-
198
- dataset = metrics.get("dataset", "Unknown")
199
- method = metrics.get("method", "Unknown")
200
- precision = metrics.get("precision", "Unknown")
201
- model_type = metrics.get("model_type", "Unknown")
202
- e2e_s = metrics.get("e2e_s", None)
203
- batch_size = metrics.get("batch_size", None)
204
- gpu_type = metrics.get("gpu_type", "")
205
- cost = metrics.get("cost", None)
206
-
207
- em = metrics.get("exact_match")
208
- correct = metrics.get("correct")
209
- total = metrics.get("total")
210
- if isinstance(correct, (int, float)) and isinstance(total, (int, float)) and total > 0:
211
- acc = correct / total
212
- else:
213
- acc = em
214
-
215
- def pct(x):
216
- return round(x * 100, 2) if isinstance(x, (int, float)) else None
217
-
218
- if isinstance(model_name, str) and "/" in model_name:
219
- hf_url = f"https://huggingface.co/{model_name}"
220
- model_cell = f"<a href='{hf_url}' target='_blank' style='color: #0366d6; text-decoration: none;'>{model_name}</a>"
221
- else:
222
- model_cell = model_name
223
-
224
- row = {
225
- "Model": model_cell,
226
- "Dataset": dataset,
227
- "Method": method,
228
- "Model type": model_type,
229
- "Precision": precision,
230
- "E2E(s)": f2(e2e_s),
231
- "GPU": gpu_type,
232
- "Accuracy(%)": pct(acc),
233
- "Cost($)": cost,
234
- "Decoding T/s": f2(metrics.get("decoding_throughput")),
235
- "Prefill T/s": f2(metrics.get("prefill_tp")),
236
- "Prefill<br>S-MBU(%)": pct(metrics.get("prefill_smbu")),
237
- "Prefill<br>S-MFU(%)": pct(metrics.get("prefill_smfu")),
238
- "Decoding<br>S-MBU(%)": pct(metrics.get("decoding_smbu")),
239
- "Decoding<br>S-MFU(%)": pct(metrics.get("decoding_smfu")),
240
- "TTFT(s)": f2(metrics.get("ttft")),
241
- "TPOT(s)": f2(metrics.get("tpot")),
242
- "Batch size": batch_size,
243
- }
244
- return row
245
-
246
-
247
- def load_from_dir(dir_path: str, selected_tasks=None, selected_frameworks=None, selected_model_types=None, selected_precisions=None, search_keyword="", force_refresh=False):
248
- if not dir_path:
249
- return "<p style='color:black'>Result Directory not set.</p>", []
250
 
251
- try:
252
- pattern = f"hf://datasets/{dir_path}/**/*.json"
253
- dl_mode = "force_redownload" if force_refresh else None
254
- print(f"Fetching from {pattern} (mode={dl_mode})...")
255
- ds = load_dataset("json", data_files={"train": pattern}, split="train", download_mode=dl_mode)
256
- except Exception as e:
257
- print(f"Error loading dataset: {e}")
258
- return "<p style='color:black'>No files loaded or Dataset not found.</p>", []
259
 
260
- rows = []
261
- for i, example in enumerate(ds):
262
- metrics = example.get("metrics") or example.get("json") or example
263
- rows.append(json_to_row(f"{dir_path}#{i}", metrics))
264
 
265
- if not rows:
266
- return "<p style='color:black'>No records found.</p>", []
 
267
 
268
- df = pd.DataFrame(rows)
 
 
269
 
270
- # --- Filtering Logic ---
271
- # This logic is consistent: if a filter is provided, we ONLY keep rows
272
- # where the column value is inside the selected list.
273
-
274
- if selected_tasks:
275
- df = df[df["Dataset"].astype(str).str.lower().isin([x.lower() for x in selected_tasks])]
276
- if selected_frameworks:
277
- df = df[df["Method"].astype(str).str.lower().isin([str(x).lower() for x in selected_frameworks])]
278
- if selected_model_types:
279
- df = df[df["Model type"].astype(str).str.lower().isin([str(x).lower() for x in selected_model_types])]
280
- if selected_precisions:
281
- df = df[df["Precision"].astype(str).str.lower().isin([str(x).lower() for x in selected_precisions])]
282
- if search_keyword and search_keyword.strip():
283
- df = df[df.astype(str).apply(lambda row: row.str.lower().str.contains(search_keyword.strip().lower()).any(), axis=1)]
284
-
285
- if df.empty:
286
- return "<p style='color:black'>No records found.</p>", []
287
-
288
- df = df.fillna("-")
289
- df.insert(0, 'Row #', range(len(df)))
290
-
291
- table_html = f'<div class="table-container">{df.to_html(escape=False, index=False, classes="metrics-table")}</div>'
292
- df_without_rownum = df.drop('Row #', axis=1)
293
- return table_html, df_without_rownum.to_dict('records')
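Note that `load_from_dir` above leans on `datasets` resolving `hf://datasets/<repo_id>/...` globs through the Hub's fsspec filesystem, so the same pattern works standalone (a sketch; the repo ID is a placeholder):

```python
from datasets import load_dataset

# Load every JSON result file in a Hub dataset repo via an hf:// glob
ds = load_dataset(
    "json",
    data_files={"train": "hf://datasets/some-org/some-results/**/*.json"},
    split="train",
)
```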
294
 
 
 
 
295
 
296
- def auto_refresh_from_dir(dir_path, tasks, frameworks, types, precisions, search):
297
- return load_from_dir(dir_path, tasks, frameworks, types, precisions, search, force_refresh=True)
 
298
 
 
 
 
299
 
300
- def parse_and_generate_plot(df_data, indices_str):
301
- if not indices_str or not indices_str.strip():
302
- return generate_radar_plot([])
303
- try:
304
- indices = [int(idx.strip()) for idx in indices_str.split(',') if idx.strip()][:3]
305
- selected_rows = [df_data[i] for i in indices if 0 <= i < len(df_data)]
306
- return generate_radar_plot(selected_rows)
307
- except:
308
- return generate_radar_plot([])
309
-
310
-
311
- def initial_load(dir_path, tasks, frameworks, types, precisions, search):
312
- """Load data and generate initial radar plot with rows 0,1,2."""
313
- table_html, df_data = auto_refresh_from_dir(dir_path, tasks, frameworks, types, precisions, search)
314
- plot = parse_and_generate_plot(df_data, "0,1,2")
315
- return table_html, df_data, plot
316
-
317
-
318
- def build_app() -> gr.Blocks:
319
- # NUCLEAR CSS FIX: Overwrite all generic Gradio variables to force light mode
320
- row_css = """
321
- /* 1. FORCE LIGHT VARIABLES GLOBALLY */
322
- :root, .gradio-container, body {
323
- --body-background-fill: #f5f7fa !important;
324
- --body-text-color: #374151 !important;
325
- --background-fill-primary: #ffffff !important;
326
- --background-fill-secondary: #f3f4f6 !important;
327
- --border-color-primary: #e5e7eb !important;
328
- --block-background-fill: #ffffff !important;
329
- --block-label-text-color: #374151 !important;
330
- --block-title-text-color: #1f2937 !important;
331
- --input-background-fill: #ffffff !important;
332
- --color-accent: #0366d6 !important;
333
-
334
- /* Reset dark mode specific variables to light values */
335
- --neutral-50: #f9fafb; --neutral-100: #f3f4f6; --neutral-200: #e5e7eb;
336
- --neutral-300: #d1d5da; --neutral-400: #9ca3af; --neutral-500: #6b7280;
337
- --neutral-600: #4b5563; --neutral-700: #374151; --neutral-800: #1f2937;
338
- }
339
-
340
- /* 2. RESET STANDARD CONTAINERS */
341
- .gradio-container .block,
342
- .gradio-container .panel,
343
- .gradio-container .form {
344
- background-color: white !important;
345
- border-color: #e1e4e8 !important;
346
- }
347
-
348
- /* 3. SPECIFIC FIX FOR THE DARK "FILTERS" and "RADAR" SECTIONS */
349
- .filter-section {
350
- background-color: #ffffff !important;
351
- border: 2px solid #e1e4e8 !important;
352
- border-radius: 8px !important;
353
- padding: 16px !important;
354
- box-shadow: 0 2px 4px rgba(0,0,0,0.05) !important;
355
- color: #24292e !important; /* Set default text color for the section */
356
- }
357
-
358
- /* Remove background color from text elements to prevent "dark blocks" */
359
- .filter-section label,
360
- .filter-section span,
361
- .filter-section p {
362
- background-color: transparent !important;
363
- }
364
-
365
- /* 4. BUTTON FIXES - TARGET BY ID FOR SPECIFICITY */
366
- #gen_btn {
367
- background-color: #0366d6 !important;
368
- color: white !important;
369
- border: none !important;
370
- }
371
- #gen_btn:hover {
372
- opacity: 0.9;
373
- }
374
-
375
- /* 5. INPUTS & CHECKBOXES */
376
- /* Re-apply white background to inputs specifically */
377
- .filter-section input,
378
- .filter-section textarea,
379
- .filter-section select {
380
- background-color: #ffffff !important;
381
- border: 1px solid #d1d5da !important;
382
- color: #24292e !important;
383
- }
384
-
385
- /* --- FIX FOR CHECKBOXES --- */
386
- /* Use explicit styling for the checked state to ensure visibility */
387
- .filter-section input[type="checkbox"] {
388
- appearance: none !important;
389
- -webkit-appearance: none !important;
390
- width: 16px !important;
391
- height: 16px !important;
392
- background-color: white !important;
393
- border: 1px solid #d1d5da !important;
394
- border-radius: 3px !important;
395
- position: relative !important;
396
- cursor: pointer !important;
397
- }
398
-
399
- .filter-section input[type="checkbox"]:checked {
400
- background-color: #0366d6 !important;
401
- border-color: #0366d6 !important;
402
- /* Draw the checkmark using an SVG data URI */
403
- background-image: url("data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M12.207 4.793a1 1 0 010 1.414l-5 5a1 1 0 01-1.414 0l-2-2a1 1 0 011.414-1.414L6.5 9.086l4.293-4.293a1 1 0 011.414 0z'/%3e%3c/svg%3e") !important;
404
- background-size: 100% 100% !important;
405
- background-position: center !important;
406
- background-repeat: no-repeat !important;
407
- }
408
-
409
- .filter-section label span {
410
- color: #24292e !important;
411
- }
412
-
413
- /* 6. SEARCH BOX */
414
- .search-box {
415
- background: white !important;
416
- padding: 16px !important;
417
- border-radius: 6px;
418
- border: 2px solid #e1e4e8 !important;
419
- margin-bottom: 16px;
420
- }
421
-
422
- /* 7. TABLE STYLING */
423
- .table-container {
424
- overflow-x: auto;
425
- max-height: 75vh;
426
- border: 2px solid #e1e4e8;
427
- border-radius: 6px;
428
- background: white !important;
429
- }
430
- table.metrics-table {
431
- width: 100%; border-collapse: collapse; background: white !important;
432
- }
433
- table.metrics-table th, table.metrics-table td {
434
- padding: 10px 14px; border: 1px solid #e1e4e8;
435
- white-space: nowrap; font-size: 13px; color: #24292e !important;
436
- }
437
- table.metrics-table th {
438
- background: #f6f8fa !important; font-weight: 600; position: sticky; top: 0;
439
- }
440
- .metrics-table th:first-child, .metrics-table td:first-child {
441
- background-color: #f0f0f0 !important; text-align: center;
442
- }
443
-
444
- /* 8. PLOT CONTAINER - FORCE WHITE BACKGROUND */
445
- .plot-container {
446
- width: 100% !important;
447
- background-color: white !important;
448
- }
449
- .plot-container > div, .plot-container .plotly {
450
- background-color: white !important;
451
- }
452
-
453
- /* 9. LINKS */
454
- a { color: #0366d6 !important; text-decoration: none; }
455
- a:hover { text-decoration: underline; }
456
- """
457
-
458
- with gr.Blocks(title="MoE-CAP Dashboard", css=row_css, theme=gr.themes.Default()) as demo:
459
- gr.Markdown("# MoE-CAP Dashboard")
460
-
461
- with gr.Row():
462
- # Left Sidebar
463
- with gr.Column(scale=2):
464
- with gr.Group(elem_classes="search-box"):
465
- search_input = gr.Textbox(label="🔍 Search", placeholder="Search...", lines=1)
466
-
467
- with gr.Group(elem_classes="filter-section"):
468
- gr.Markdown("### 🎛️ Filters")
469
- dir_path = gr.State(RESULT_DIR)
470
-
471
- task_filter = gr.CheckboxGroup(
472
- label="📊 Tasks",
473
- choices=[("GSM8K", "gsm8k"), ("LongBench", "longbench"), ("MMLU", "mmlu"), ("NuminaMath", "numinamath"), ("RULER", "ruler")],
474
- value=["gsm8k", "longbench", "mmlu", "numinamath", "ruler"]
475
  )
476
- framework_filter = gr.CheckboxGroup(label="⚙️ Frameworks", choices=["sglang", "vllm"], value=["sglang", "vllm"])
477
- model_type_filter = gr.CheckboxGroup(label="🤖 Model Types", choices=["instruct", "thinking"], value=["instruct", "thinking"])
478
- precision_filter = gr.CheckboxGroup(label="🎯 Precision", choices=["bfloat16", "fp8"], value=["bfloat16", "fp8"])
479
-
480
- with gr.Accordion("📖 About Tasks & Metrics", open=True):
481
- gr.Markdown(
482
- "### Tasks\n- **GSM8K**, **LongBench**, **MMLU**, **NuminaMath**, **RULER**\n\n"
483
- "### Metrics\n- **E2E(s)**: Latency | **Cost($)** | **T/s**: Throughput | **S-MBU/MFU**: Utilization | **TPOT**, **TTFT**",
484
- elem_classes="info-section"
485
  )
486
 
487
- gr.Markdown(
488
- "Github Repo: [https://github.com/Auto-CAP/MoE-CAP](https://github.com/Auto-CAP/MoE-CAP)",
489
- elem_classes="info-section"
 
 
 
490
  )
491
 
492
- # Right Main Content
493
- with gr.Column(scale=5):
494
- leaderboard_output = gr.HTML(label="📈 Results")
 
495
 
496
- with gr.Group(elem_classes="filter-section"):
497
- gr.Markdown("### 📊 CAP Radar Plot")
498
- gr.Markdown("**How to use:** Look at the 'Row #' column in the table. Enter row numbers (e.g., 0,1,2) and click Generate.")
499
-
500
- with gr.Row():
501
- row_indices_input = gr.Textbox(label="Row Numbers", placeholder="0,1,2", value="0,1,2", scale=3)
502
- generate_btn = gr.Button("🎯 Generate", variant="primary", scale=1, elem_id="gen_btn")
503
-
504
- radar_plot = gr.Plot(value=generate_radar_plot([]), elem_classes="plot-container")
505
-
506
- # State & Events
507
- df_data_state = gr.State([])
508
- inputs = [dir_path, task_filter, framework_filter, model_type_filter, precision_filter, search_input]
509
-
510
- # Load data and generate initial plot on page load
511
- demo.load(fn=initial_load, inputs=inputs, outputs=[leaderboard_output, df_data_state, radar_plot])
512
-
513
- search_input.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
514
- task_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
515
- framework_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
516
- model_type_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
517
- precision_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
518
-
519
- generate_btn.click(fn=parse_and_generate_plot, inputs=[df_data_state, row_indices_input], outputs=[radar_plot])
520
-
521
- gr.Timer(60.0).tick(fn=auto_refresh_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
522
 
523
- return demo
 
524
 
525
  if __name__ == "__main__":
526
- app = build_app()
527
- app.launch()
 
 
1
  #!/usr/bin/env python
2
  import os
3
+ import datetime
4
+ import socket
5
+ import base64
6
+ from threading import Thread
 
7
 
8
  import gradio as gr
9
  import pandas as pd
10
+ import time
11
+ from apscheduler.schedulers.background import BackgroundScheduler
12
+
13
+ from huggingface_hub import snapshot_download
14
+
15
+ from src.display.about import (
16
+ CITATION_BUTTON_LABEL,
17
+ CITATION_BUTTON_TEXT,
18
+ EVALUATION_QUEUE_TEXT,
19
+ INTRODUCTION_TEXT,
20
+ LLM_BENCHMARKS_TEXT,
21
+ LLM_BENCHMARKS_DETAILS,
22
+ FAQ_TEXT,
23
+ TITLE,
24
+ ACKNOWLEDGEMENT_TEXT,
25
+ )
26
+
27
+ from src.display.css_html_js import custom_css
28
+
29
+ from src.display.utils import (
30
+ BENCHMARK_COLS,
31
+ COLS,
32
+ EVAL_COLS,
33
+ EVAL_TYPES,
34
+ TYPES,
35
+ AutoEvalColumn,
36
+ ModelType,
37
+ InferenceFramework,
38
+ fields,
39
+ WeightType,
40
+ Precision,
41
+ GPUType
42
+ )
43
+
44
+ from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, H4_TOKEN, IS_PUBLIC, \
45
+ QUEUE_REPO, REPO_ID, RESULTS_REPO, DEBUG_QUEUE_REPO, DEBUG_RESULTS_REPO
46
+ from src.populate import get_evaluation_queue_df, get_leaderboard_df
47
+ from src.submission.submit import add_new_eval
48
+ from src.utils import get_dataset_summary_table
49
+
50
+ def get_args():
51
+ import argparse
52
+
53
+ parser = argparse.ArgumentParser(description="Run the LLM Leaderboard")
54
+ parser.add_argument("--debug", action="store_true", help="Run in debug mode")
55
+ return parser.parse_args()
56
+
57
+ args = get_args()
58
+ if args.debug:
59
+ print("Running in debug mode")
60
+ QUEUE_REPO = DEBUG_QUEUE_REPO
61
+ RESULTS_REPO = DEBUG_RESULTS_REPO
62
+
63
+ def ui_snapshot_download(repo_id, local_dir, repo_type, tqdm_class, etag_timeout):
64
+ try:
65
+ print(local_dir)
66
+ snapshot_download(
67
+ repo_id=repo_id, local_dir=local_dir, repo_type=repo_type, tqdm_class=tqdm_class, etag_timeout=etag_timeout
68
+ )
69
+ except Exception as e:
70
+ restart_space()
71
 
72
 
73
+ def restart_space():
74
+ API.restart_space(repo_id=REPO_ID, token=H4_TOKEN)
 
 
 
75
 
76
 
77
+ def init_space():
78
+ # dataset_df = get_dataset_summary_table(file_path="blog/Hallucination-Leaderboard-Summary.csv")
 
79
 
80
+ if socket.gethostname() not in {"neuromancer"}:
81
+ # sync model_type with open-llm-leaderboard
82
+ ui_snapshot_download(
83
+ repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30
 
84
  )
85
+ ui_snapshot_download(
86
+ repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30
 
87
  )
88
+ raw_data, original_df = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, "", COLS, BENCHMARK_COLS)
89
+
90
+ finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df = get_evaluation_queue_df(
91
+ EVAL_REQUESTS_PATH, EVAL_COLS
92
+ )
93
+ # return dataset_df, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df
94
+ return None, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df
 
95
 
 
96
 
97
+ def add_benchmark_columns(shown_columns):
98
+ benchmark_columns = []
99
+ for benchmark in BENCHMARK_COLS:
100
+ if benchmark in shown_columns:
101
+ for c in COLS:
102
+ if benchmark in c and benchmark != c:
103
+ benchmark_columns.append(c)
104
+ return benchmark_columns
105
+
106
+
107
+ # Searching and filtering
108
+ def update_table(
109
+ hidden_df: pd.DataFrame, columns: list, type_query: list, precision_query: list, size_query: list, query: str
110
+ ):
111
+ filtered_df = filter_models(hidden_df, type_query, size_query, precision_query)
112
+ filtered_df = filter_queries(query, filtered_df)
113
+ benchmark_columns = add_benchmark_columns(columns)
114
+ df = select_columns(filtered_df, columns + benchmark_columns)
115
+ return df
116
+
117
+
118
+ def search_table(df: pd.DataFrame, query: str) -> pd.DataFrame:
119
+ return df[(df[AutoEvalColumn.dummy.name].str.contains(query, case=False))]
120
+
121
+
122
+ def select_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
123
+ # always_here_cols = [AutoEvalColumn.model_type_symbol.name, AutoEvalColumn.model.name]
124
+
125
+ always_here_cols = [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
126
+ dummy_col = [AutoEvalColumn.dummy.name]
127
+
128
+ # We use COLS to maintain sorting
129
+ filtered_df = df[
130
+ # always_here_cols + [c for c in COLS if c in df.columns and c in columns] + [AutoEvalColumn.dummy.name]
131
+ always_here_cols
132
+ + [c for c in COLS if c in df.columns and c in columns]
133
+ + dummy_col
134
+ ]
135
+ return filtered_df
136
+
137
+
138
+ def filter_queries(query: str, filtered_df: pd.DataFrame):
139
+ final_df = []
140
+ if query != "":
141
+ queries = [q.strip() for q in query.split(";")]
142
+ for _q in queries:
143
+ _q = _q.strip()
144
+ if _q != "":
145
+ temp_filtered_df = search_table(filtered_df, _q)
146
+ if len(temp_filtered_df) > 0:
147
+ final_df.append(temp_filtered_df)
148
+ if len(final_df) > 0:
149
+ filtered_df = pd.concat(final_df)
150
+ subset = [AutoEvalColumn.model.name, AutoEvalColumn.precision.name, AutoEvalColumn.revision.name]
151
+ filtered_df = filtered_df.drop_duplicates(subset=subset)
152
+ return filtered_df
153
+
154
+
155
+ def filter_models(df: pd.DataFrame, type_query: list, size_query: list, precision_query: list) -> pd.DataFrame:
156
+ # Show all models
157
+ filtered_df = df
158
+
159
+ type_emoji = [t[0] for t in type_query]
160
+ filtered_df = filtered_df.loc[df[AutoEvalColumn.model_type_symbol.name].isin(type_emoji)]
161
+ filtered_df = filtered_df.loc[df[AutoEvalColumn.precision.name].isin(precision_query + ["None"])]
162
+
163
+ # numeric_interval = pd.IntervalIndex(sorted([NUMERIC_INTERVALS[s] for s in size_query]))
164
+ # params_column = pd.to_numeric(df[AutoEvalColumn.params.name], errors="coerce")
165
+ # mask = params_column.apply(lambda x: any(numeric_interval.contains(x)))
166
+ # filtered_df = filtered_df.loc[mask]
167
+
168
+ return filtered_df
169
+
170
+ shown_columns = None
171
+ dataset_df, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df = init_space()
172
+ leaderboard_df = original_df.copy()
173
+
174
+ # def update_leaderboard_table():
175
+ # global leaderboard_df, shown_columns
176
+ # print("Updating leaderboard table")
177
+ # return leaderboard_df[
178
+ # [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
179
+ # + shown_columns.value
180
+ # + [AutoEvalColumn.dummy.name]
181
+ # ] if not leaderboard_df.empty else leaderboard_df
182
 
183
 
184
+ # def update_hidden_leaderboard_table():
185
+ # global original_df
186
+ # return original_df[COLS] if original_df.empty is False else original_df
 
187
 
188
+ # def update_dataset_table():
189
+ # global dataset_df
190
+ # return dataset_df
 
191
 
192
+ # def update_finish_table():
193
+ # global finished_eval_queue_df
194
+ # return finished_eval_queue_df
195
 
196
+ # def update_running_table():
197
+ # global running_eval_queue_df
198
+ # return running_eval_queue_df
199
 
200
+ # def update_pending_table():
201
+ # global pending_eval_queue_df
202
+ # return pending_eval_queue_df
 
203
 
204
+ # def update_finish_num():
205
+ # global finished_eval_queue_df
206
+ # return len(finished_eval_queue_df)
207
 
208
+ # def update_running_num():
209
+ # global running_eval_queue_df
210
+ # return len(running_eval_queue_df)
211
 
212
+ # def update_pending_num():
213
+ # global pending_eval_queue_df
214
+ # return len(pending_eval_queue_df)
215
 
216
+ # triggered only once at startup => read query parameter if it exists
217
+ def load_query(request: gr.Request):
218
+ query = request.query_params.get("query") or ""
219
+ return query
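For example, opening the Space with `?query=Mixtral` appended to its URL pre-fills the search bar on load (the query value here is illustrative).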
220
+
221
+
222
+ def get_image_html(url, image_path):
223
+ with open(image_path, "rb") as image_file:
224
+ encoded_string = base64.b64encode(image_file.read()).decode()
225
+ return f'<a href="{url}" target="_blank"><img src="data:image/jpeg;base64,{encoded_string}" alt="NetMind.AI Logo" style="width:100pt;"></a>'
226
+
227
+
228
+ # Prepare the HTML content with the image
229
+ image_html = get_image_html("https://netmind.ai/home", "./src/display/imgs/Netmind.AI_LOGO.jpg")
230
+
231
+
232
+ demo = gr.Blocks(css=custom_css)
233
+ with demo:
234
+ gr.HTML(TITLE)
235
+ gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
236
+ gr.HTML(ACKNOWLEDGEMENT_TEXT.format(image_html=image_html))
237
+
238
+ with gr.Tabs(elem_classes="tab-buttons") as tabs:
239
+ with gr.TabItem("open-moe-llm-leaderboard", elem_id="llm-benchmark-tab-table", id=0):
240
+ with gr.Row():
241
+ with gr.Column():
242
+ with gr.Row():
243
+ search_bar = gr.Textbox(
244
+ placeholder=" 🔍 Model search (separate multiple queries with `;`)",
245
+ show_label=False,
246
+ elem_id="search-bar"
247
+ )
248
+ with gr.Row():
249
+ shown_columns = gr.CheckboxGroup(
250
+ choices=[
251
+ c.name
252
+ for c in fields(AutoEvalColumn)
253
+ if not c.hidden and not c.never_hidden and not c.dummy
254
+ ],
255
+ value=[
256
+ c.name
257
+ for c in fields(AutoEvalColumn)
258
+ if c.displayed_by_default and not c.hidden and not c.never_hidden
259
+ ],
260
+ label="Select columns to show",
261
+ elem_id="column-select",
262
+ interactive=True,
263
+ )
264
+
265
+ with gr.Column(min_width=320):
266
+ filter_columns_size = gr.CheckboxGroup(
267
+ label="Inference frameworks",
268
+ choices=[t.to_str() for t in InferenceFramework],
269
+ value=[t.to_str() for t in InferenceFramework],
270
+ interactive=True,
271
+ elem_id="filter-columns-size",
 
272
  )
273
+
274
+ filter_columns_type = gr.CheckboxGroup(
275
+ label="Model types",
276
+ choices=[t.to_str() for t in ModelType],
277
+ value=[t.to_str() for t in ModelType],
278
+ interactive=True,
279
+ elem_id="filter-columns-type",
 
 
280
  )
281
 
282
+ filter_columns_precision = gr.CheckboxGroup(
283
+ label="Precision",
284
+ choices=[i.value.name for i in Precision],
285
+ value=[i.value.name for i in Precision],
286
+ interactive=True,
287
+ elem_id="filter-columns-precision",
288
  )
289
 
290
+ # filter_columns_size = gr.CheckboxGroup(
291
+ # label="Model sizes (in billions of parameters)",
292
+ # choices=list(NUMERIC_INTERVALS.keys()),
293
+ # value=list(NUMERIC_INTERVALS.keys()),
294
+ # interactive=True,
295
+ # elem_id="filter-columns-size",
296
+ # )
297
+
298
+ # breakpoint()
299
+ benchmark_columns = add_benchmark_columns(shown_columns.value)
300
+ leaderboard_table = gr.components.Dataframe(
301
+ value=(
302
+ leaderboard_df[
303
+ [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
304
+ + shown_columns.value
305
+ + benchmark_columns
306
+ + [AutoEvalColumn.dummy.name]
307
+ ]
308
+ if leaderboard_df.empty is False
309
+ else leaderboard_df
310
+ ),
311
+ headers=[c.name for c in fields(AutoEvalColumn) if c.never_hidden] + shown_columns.value + benchmark_columns,
312
+ datatype=TYPES,
313
+ elem_id="leaderboard-table",
314
+ interactive=False,
315
+ visible=True,
316
+ ) # column_widths=["2%", "20%"]
317
+
318
+ # Dummy leaderboard for handling the case when the user uses backspace key
319
+ hidden_leaderboard_table_for_search = gr.components.Dataframe(
320
+ value=original_df[COLS] if original_df.empty is False else original_df,
321
+ headers=COLS,
322
+ datatype=TYPES,
323
+ visible=False,
324
+ )
325
+
326
+ search_bar.submit(
327
+ update_table,
328
+ [
329
+ hidden_leaderboard_table_for_search,
330
+ shown_columns,
331
+ filter_columns_type,
332
+ filter_columns_precision,
333
+ filter_columns_size,
334
+ search_bar,
335
+ ],
336
+ leaderboard_table
337
+ )
338
+
339
+ # Check query parameter once at startup and update search bar
340
+ demo.load(load_query, inputs=[], outputs=[search_bar])
341
+
342
+ for selector in [shown_columns, filter_columns_type, filter_columns_precision, filter_columns_size]:
343
+ selector.change(
344
+ update_table,
345
+ [
346
+ hidden_leaderboard_table_for_search,
347
+ shown_columns,
348
+ filter_columns_type,
349
+ filter_columns_precision,
350
+ filter_columns_size,
351
+ search_bar,
352
+ ],
353
+ leaderboard_table,
354
+ queue=True,
355
+ )
356
+
357
+ # with gr.TabItem("About", elem_id="llm-benchmark-tab-table", id=2):
358
+ # gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
359
+
360
+ # dataset_table = gr.components.Dataframe(
361
+ # value=dataset_df,
362
+ # headers=list(dataset_df.columns),
363
+ # datatype=["str", "markdown", "str", "str", "str"],
364
+ # elem_id="dataset-table",
365
+ # interactive=False,
366
+ # visible=True,
367
+ # column_widths=["15%", "20%"],
368
+ # )
369
+
370
+ # gr.Markdown(LLM_BENCHMARKS_DETAILS, elem_classes="markdown-text")
371
+ # gr.Markdown(FAQ_TEXT, elem_classes="markdown-text")
372
+
373
+ with gr.TabItem("Submit a model", elem_id="llm-benchmark-tab-table", id=3):
374
+ with gr.Column():
375
+ with gr.Row():
376
+ gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
377
+
378
+ with gr.Column():
379
+ with gr.Accordion(f"✅ Finished Evaluations ({len(finished_eval_queue_df)})", open=False):
380
+ with gr.Row():
381
+ finished_eval_table = gr.components.Dataframe(
382
+ value=finished_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
383
+ )
384
+
385
+ with gr.Accordion(f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})", open=False):
386
+ with gr.Row():
387
+ running_eval_table = gr.components.Dataframe(
388
+ value=running_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
389
+ )
390
+
391
+ with gr.Accordion(f"⏳ Scheduled Evaluation Queue ({len(pending_eval_queue_df)})", open=False):
392
+ with gr.Row():
393
+ pending_eval_table = gr.components.Dataframe(
394
+ value=pending_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
395
+ )
396
+
397
+ with gr.Row():
398
+ gr.Markdown("# Submit your model here", elem_classes="markdown-text")
399
+
400
+ with gr.Row():
401
+ inference_framework = gr.Dropdown(
402
+ choices=[t.to_str() for t in InferenceFramework],
403
+ label="Inference framework",
404
+ multiselect=False,
405
+ value=None,
406
+ interactive=True,
407
+ )
408
 
409
+ gpu_type = gr.Dropdown(
410
+ choices=[t.to_str() for t in GPUType],
411
+ label="GPU type",
412
+ multiselect=False,
413
+ value="NVIDIA-A100-PCIe-80GB",
414
+ interactive=True,
415
+ )
416
+
417
+
418
+ with gr.Row():
419
+ with gr.Column():
420
+ model_name_textbox = gr.Textbox(label="Model name")
421
+ revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
422
+ private = gr.Checkbox(False, label="Private", visible=not IS_PUBLIC)
423
+ model_type = gr.Dropdown(
424
+ choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
425
+ label="Model type",
426
+ multiselect=False,
427
+ value=None,
428
+ interactive=True,
429
+ )
 
430
 
431
+ with gr.Column():
432
+ precision = gr.Dropdown(
433
+ choices=[i.value.name for i in Precision if i != Precision.Unknown],
434
+ label="Precision",
435
+ multiselect=False,
436
+ value="float32",
437
+ interactive=True,
438
+ )
439
 
440
+ weight_type = gr.Dropdown(
441
+ choices=[i.value.name for i in WeightType],
442
+ label="Weights type",
443
+ multiselect=False,
444
+ value="Original",
445
+ interactive=True,
446
+ )
447
+
448
+ base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
449
+
450
+ submit_button = gr.Button("Submit Eval")
451
+ submission_result = gr.Markdown()
452
+ debug = gr.Checkbox(value=args.debug, label="Debug", visible=False)
453
+ submit_button.click(
454
+ add_new_eval,
455
+ [
456
+ model_name_textbox,
457
+ base_model_name_textbox,
458
+ revision_name_textbox,
459
+ precision,
460
+ private,
461
+ weight_type,
462
+ model_type,
463
+ inference_framework,
464
+ debug,
465
+ gpu_type
466
+ ],
467
+ submission_result,
468
+ )
469
+
470
+ with gr.Row():
471
+ with gr.Accordion("Citing this leaderboard", open=False):
472
+ citation_button = gr.Textbox(
473
+ value=CITATION_BUTTON_TEXT,
474
+ label=CITATION_BUTTON_LABEL,
475
+ lines=20,
476
+ elem_id="citation-button",
477
+ show_copy_button=True,
478
+ )
479
+
480
+ scheduler = BackgroundScheduler()
481
+
482
+ scheduler.add_job(restart_space, "interval", hours=6)
483
+
484
+ def launch_backend():
485
+ import subprocess
486
+ from src.backend.envs import DEVICE
487
+
488
+ if DEVICE not in {"cpu"}:
489
+ _ = subprocess.run(["python", "backend-cli.py"])
490
+
491
+ # Thread(target=periodic_init, daemon=True).start()
492
+ # scheduler.add_job(launch_backend, "interval", seconds=120)
493
  if __name__ == "__main__":
494
+ scheduler.start()
495
+ demo.queue(default_concurrency_limit=40).launch()
496
+
backend-cli.py CHANGED
@@ -458,7 +458,6 @@ def get_args():
458
  parser.add_argument("--gpu-type", type=str, default="NVIDIA-A100-PCIe-80GB",
459
  help="GPU type. NVIDIA-A100-PCIe-80GB; NVIDIA-RTX-A5000-24GB; NVIDIA-H100-PCIe-80GB")
460
  parser.add_argument("--debug_repo", action="store_true", help="Use debug repo")
461
- parser.add_argument("--model_type", type=str, default="chat", help="Model type")
462
  return parser.parse_args()
463
 
464
 
@@ -489,8 +488,7 @@ if __name__ == "__main__":
489
  json_filepath="",
490
  precision=precision, # Use precision from arguments
491
  inference_framework=args.inference_framework, # Use inference framework from arguments
492
- gpu_type=args.gpu_type,
493
- model_type=args.model_type,
494
  )
495
  curr_gpu_type = get_gpu_details()
496
  if eval_request.gpu_type != curr_gpu_type:
 
458
  parser.add_argument("--gpu-type", type=str, default="NVIDIA-A100-PCIe-80GB",
459
  help="GPU type. NVIDIA-A100-PCIe-80GB; NVIDIA-RTX-A5000-24GB; NVIDIA-H100-PCIe-80GB")
460
  parser.add_argument("--debug_repo", action="store_true", help="Use debug repo")
 
461
  return parser.parse_args()
462
 
463
 
 
488
  json_filepath="",
489
  precision=precision, # Use precision from arguments
490
  inference_framework=args.inference_framework, # Use inference framework from arguments
491
+ gpu_type=args.gpu_type
 
492
  )
493
  curr_gpu_type = get_gpu_details()
494
  if eval_request.gpu_type != curr_gpu_type:
moe-cap-results DELETED
File without changes
requirements.txt CHANGED
@@ -1,6 +1,36 @@
1
- gradio>=4.44.0
2
- pandas
 
 
 
3
  datasets
4
- huggingface_hub<0.25.0
5
- plotly>=5.0.0
6
- kaleido>=0.2.1
 
1
+ torch
2
+ colorama
3
+ APScheduler
4
+ black
5
+ click
6
  datasets
7
+ gradio==4.26.0
8
+ gradio_client
9
+ huggingface-hub
10
+ matplotlib
11
+ numpy
12
+ pandas
13
+ plotly
14
+ python-dateutil
15
+ requests
16
+ semantic-version
17
+ tqdm
18
+ wandb
19
+ transformers
20
+ tokenizers>=0.15.0
21
+ lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@v0.4.2
22
+ accelerate
23
+ sentencepiece
24
+ langdetect
25
+ sacrebleu
26
+ cchardet
27
+ rouge_score
28
+ bert-score
29
+ evaluate
30
+ spacy==3.7.4
31
+ selfcheckgpt
32
+ immutabledict
33
+ gputil
34
+ bitsandbytes
35
+ openai
36
+ scikit-learn
src/backend/run_eval_suite.py CHANGED
@@ -17,22 +17,16 @@ def process_results_decorator(func):
17
  end_to_end_time = sum([r[1] for r in results]) / len(results)
18
  prefilling_time = sum([r[2] for r in results]) / len(results)
19
  decoding_throughput = sum([r[3] for r in results]) / len(results)
20
- decoding_mfu = sum([r[4] for r in results]) / len(results)
21
- decoding_mbu = sum([r[5] for r in results]) / len(results)
22
- prefill_throughput = sum([r[6] for r in results]) / len(results)
23
- prefill_mfu = sum([r[7] for r in results]) / len(results)
24
- prefill_mbu = sum([r[8] for r in results]) / len(results)
25
  # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
26
 
27
  result_dict = func(self, doc, processed_results, *args, **kwargs)
28
  result_dict["end_to_end_time"] = end_to_end_time
29
  result_dict["prefilling_time"] = prefilling_time
30
  result_dict["decoding_throughput"] = decoding_throughput
31
- result_dict["decoding_mfu"] = decoding_mfu
32
- result_dict["decoding_mbu"] = decoding_mbu
33
- result_dict["prefill_throughput"] = prefill_throughput
34
- result_dict["prefill_mfu"] = prefill_mfu
35
- result_dict["prefill_mbu"] = prefill_mbu
36
  return result_dict
37
  return wrapper
38
  ConfigurableTask.process_results = process_results_decorator(orig_process_results)
@@ -43,11 +37,8 @@ def aggregation_decorator(func):
43
  aggregation_list["end_to_end_time"] = mean
44
  aggregation_list["prefilling_time"] = mean
45
  aggregation_list["decoding_throughput"] = mean
46
- aggregation_list["decoding_mfu"] = mean
47
- aggregation_list["decoding_mbu"] = mean
48
- aggregation_list["prefill_throughput"] = mean
49
- aggregation_list["prefill_mfu"] = mean
50
- aggregation_list["prefill_mbu"] = mean
51
  return aggregation_list
52
  return wrapper
53
  ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
@@ -58,11 +49,8 @@ def higher_is_better_decorator(func):
58
  higher_is_better_dict["end_to_end_time"] = False
59
  higher_is_better_dict["prefilling_time"] = False
60
  higher_is_better_dict["decoding_throughput"] = True
61
- higher_is_better_dict["decoding_mfu"] = True
62
- higher_is_better_dict["decoding_mbu"] = True
63
- higher_is_better_dict["prefill_throughput"] = True
64
- higher_is_better_dict["prefill_mfu"] = True
65
- higher_is_better_dict["prefill_mbu"] = True
66
  return higher_is_better_dict
67
  return wrapper
68
  ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
@@ -77,8 +65,6 @@ from src.backend.tasks.selfcheckgpt.task import SelfCheckGPT
77
 
78
  from src.backend.huggingface_generate_until import HFLMwithChatTemplate
79
  from src.backend.moe_infinity import MoEHFLM
80
- from src.backend.vllm import VLLM_MOE
81
- from src.backend.sglang import SGLangMoE
82
 
83
  def run_evaluation(
84
  eval_request: EvalRequest,
 
17
  end_to_end_time = sum([r[1] for r in results]) / len(results)
18
  prefilling_time = sum([r[2] for r in results]) / len(results)
19
  decoding_throughput = sum([r[3] for r in results]) / len(results)
20
+ mfu = sum([r[4] for r in results]) / len(results)
21
+ mbu = sum([r[5] for r in results]) / len(results)
 
 
 
22
  # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
23
 
24
  result_dict = func(self, doc, processed_results, *args, **kwargs)
25
  result_dict["end_to_end_time"] = end_to_end_time
26
  result_dict["prefilling_time"] = prefilling_time
27
  result_dict["decoding_throughput"] = decoding_throughput
28
+ result_dict["mfu"] = mfu
29
+ result_dict["mbu"] = mbu
 
 
 
30
  return result_dict
31
  return wrapper
32
  ConfigurableTask.process_results = process_results_decorator(orig_process_results)
 
37
  aggregation_list["end_to_end_time"] = mean
38
  aggregation_list["prefilling_time"] = mean
39
  aggregation_list["decoding_throughput"] = mean
40
+ aggregation_list["mfu"] = mean
41
+ aggregation_list["mbu"] = mean
 
 
 
42
  return aggregation_list
43
  return wrapper
44
  ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
 
49
  higher_is_better_dict["end_to_end_time"] = False
50
  higher_is_better_dict["prefilling_time"] = False
51
  higher_is_better_dict["decoding_throughput"] = True
52
+ higher_is_better_dict["mfu"] = True
53
+ higher_is_better_dict["mbu"] = True
 
 
 
54
  return higher_is_better_dict
55
  return wrapper
56
  ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
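The pattern used throughout this file (wrap the original method, extend its return value, then reassign the wrapper on the class) is plain monkey-patching of `lm_eval`'s `ConfigurableTask`. A self-contained illustration of the same idea:

```python
class Task:
    def metrics(self):
        return {"accuracy": 0.9}

def with_timing(func):
    def wrapper(self, *args, **kwargs):
        out = func(self, *args, **kwargs)
        out["end_to_end_time"] = 1.23  # illustrative measurement
        return out
    return wrapper

# Reassigning on the class patches every future call site
Task.metrics = with_timing(Task.metrics)
print(Task().metrics())  # {'accuracy': 0.9, 'end_to_end_time': 1.23}
```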
 
65
 
66
  from src.backend.huggingface_generate_until import HFLMwithChatTemplate
67
  from src.backend.moe_infinity import MoEHFLM
 
 
68
 
69
  def run_evaluation(
70
  eval_request: EvalRequest,
src/backend/tasks/arena_hard/task.py CHANGED
@@ -72,7 +72,7 @@ class ArenaHard(ConfigurableTask):
72
  super().__init__(config={"metadata": {"version": self.VERSION}})
73
  # these end tokens are hard-coded because of a current limitation of llm-eval.
74
  # self.generation_kwargs = {"until": ["\n\n", "<unk>", "<|im_end|>", "</s>", "<|endoftext|>"], "max_length": 512}
75
- self.generation_kwargs = {"until": ["</s>", "<|im_end|>"], "max_gen_toks": 4096}
76
  # self.generation_kwargs_sampling_number = 5 # the number of sampling for self-consistence
77
  # self.generation_kwargs_sampling = {
78
  # "temperature": 0.99,
 
72
  super().__init__(config={"metadata": {"version": self.VERSION}})
73
  # these end tokens are hard-coded because of a current limitation of llm-eval.
74
  # self.generation_kwargs = {"until": ["\n\n", "<unk>", "<|im_end|>", "</s>", "<|endoftext|>"], "max_length": 512}
75
+ self.generation_kwargs = {"until": ["</s>", "<|im_end|>"], "max_length": 4096}
76
  # self.generation_kwargs_sampling_number = 5 # the number of sampling for self-consistence
77
  # self.generation_kwargs_sampling = {
78
  # "temperature": 0.99,
src/backend/tasks/measurement_task_utils.py CHANGED
@@ -12,12 +12,8 @@ def process_results_decorator(func):
     end_to_end_time = sum([r[1] for r in results]) / len(results)
     prefilling_time = sum([r[2] for r in results]) / len(results)
     decoding_throughput = sum([r[3] for r in results]) / len(results)
-    decoding_mfu = sum([r[4] for r in results]) / len(results)
-    decoding_mbu = sum([r[5] for r in results]) / len(results)
-    prefill_throughput = sum([r[6] for r in results]) / len(results)
-    prefill_mfu = sum([r[7] for r in results]) / len(results)
-    prefill_mbu = sum([r[8] for r in results]) / len(results)
-
+    mfu = sum([r[4] for r in results]) / len(results)
+    mbu = sum([r[5] for r in results]) / len(results)

     # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")

@@ -26,11 +22,8 @@ def process_results_decorator(func):
     result_dict["end_to_end_time"] = end_to_end_time
     result_dict["prefilling_time"] = prefilling_time
     result_dict["decoding_throughput"] = decoding_throughput
-    result_dict["decoding_mfu"] = decoding_mfu
-    result_dict["decoding_mbu"] = decoding_mbu
-    result_dict["prefill_throughput"] = prefill_throughput
-    result_dict["prefill_mfu"] = prefill_mfu
-    result_dict["prefill_mbu"] = prefill_mbu
+    result_dict["mfu"] = mfu
+    result_dict["mbu"] = mbu
     return result_dict
 return wrapper

@@ -42,11 +35,8 @@ def aggregation_decorator(func):
     aggregation_list["end_to_end_time"] = mean
     aggregation_list["prefilling_time"] = mean
     aggregation_list["decoding_throughput"] = mean
-    aggregation_list["decoding_mfu"] = mean
-    aggregation_list["decoding_mbu"] = mean
-    aggregation_list["prefill_throughput"] = mean
-    aggregation_list["prefill_mfu"] = mean
-    aggregation_list["prefill_mbu"] = mean
+    aggregation_list["mfu"] = mean
+    aggregation_list["mbu"] = mean
     return aggregation_list
 return wrapper

@@ -58,11 +48,8 @@ def higher_is_better_decorator(func):
     higher_is_better_dict["end_to_end_time"] = False
     higher_is_better_dict["prefilling_time"] = False
     higher_is_better_dict["decoding_throughput"] = True
-    higher_is_better_dict["decoding_mfu"] = True
-    higher_is_better_dict["decoding_mbu"] = True
-    higher_is_better_dict["prefill_throughput"] = True
-    higher_is_better_dict["prefill_mfu"] = True
-    higher_is_better_dict["prefill_mbu"] = True
+    higher_is_better_dict["mfu"] = True
+    higher_is_better_dict["mbu"] = True
     return higher_is_better_dict
 return wrapper

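The three hunks above all follow the same monkey-patching pattern: keep a reference to the original lm-eval method, wrap it, and reassign the wrapper onto `ConfigurableTask`. A self-contained sketch of that pattern (the class and the metric values below are stand-ins, not the real harness objects):

```python
class ConfigurableTask:
    def process_results(self, doc, results):
        return {"em": 1.0}

def process_results_decorator(func):
    def wrapper(self, doc, results, *args, **kwargs):
        result_dict = func(self, doc, results, *args, **kwargs)
        # In the real code these values come from the timed generation results.
        result_dict["end_to_end_time"] = 1.23
        result_dict["mfu"] = 0.31
        result_dict["mbu"] = 0.54
        return result_dict
    return wrapper

orig_process_results = ConfigurableTask.process_results
ConfigurableTask.process_results = process_results_decorator(orig_process_results)

print(ConfigurableTask().process_results(None, []))
# {'em': 1.0, 'end_to_end_time': 1.23, 'mfu': 0.31, 'mbu': 0.54}
```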
src/display/about.py CHANGED
@@ -18,13 +18,10 @@ Columns and Metrics:
     - Method: The MoE LLM inference framework.
     - E2E(s): Average end-to-end generation time in seconds.
     - PRE(s): Prefilling time of the input prompt in seconds.
-    - Decoding T/s: Tokens throughout per second for decoding.
-    - Decoding S-MBU(%): Sparse Model Bandwidth Utilization for decoding.
-    - Decoding S-MFU(%): Sparse Model FLOPs Utilization for decoding.
-    - Prefill T/s: Tokens throughout per second for Prefilling.
-    - Prefill S-MBU(%): Sparse Model Bandwidth Utilization for Prefilling.
-    - Prefill S-MFU(%): Sparse Model FLOPs Utilization for Prefilling.
-    - Precision: The precision of used model.
+    - T/s: Token throughput per second.
+    - MBU(%): Model Bandwidth Utilization.
+    - MFU(%): Model FLOPs Utilization.
+    - Precision: The precision of the model used.

 """

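For reference, MBU and MFU are both utilization ratios: achieved memory traffic (or compute) per second divided by the hardware peak. A back-of-the-envelope sketch using the A100 peaks from `MEM_BW_DICT`/`PEAK_FLOPS_DICT` in `src/utils.py` (the achieved numbers are made up for illustration):

```python
PEAK_BW = 1935e9       # NVIDIA-A100-PCIe-80GB memory bandwidth, bytes/s
PEAK_FLOPS = 624e12    # NVIDIA-A100-PCIe-80GB bfloat16 peak, FLOP/s

achieved_bytes_per_s = 1.1e12   # weights + KV cache actually read per second
achieved_flops_per_s = 95e12    # FLOPs actually executed per second

mbu = 100 * achieved_bytes_per_s / PEAK_BW
mfu = 100 * achieved_flops_per_s / PEAK_FLOPS
print(f"MBU {mbu:.1f}%  MFU {mfu:.1f}%")  # MBU 56.8%  MFU 15.2%
```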
src/display/utils.py CHANGED
@@ -9,32 +9,25 @@ def fields(raw_class):

 E2Es = "E2E(s)" #"End-to-end time (s)"
 PREs = "PRE(s)" #"Prefilling time (s)"
-TS = "Decoding T/s" #Decoding throughput (tok/s)
-PTS = "Prefill T/s" #Prefill throughput (tok/s)
+TS = "T/s" #Decoding throughput (tok/s)
 InFrame = "Method" #"Inference framework"
 MULTIPLE_CHOICEs = ["mmlu"]

-
 GPU_TEMP = 'Temp(C)'
 GPU_Power = 'Power(W)'
 GPU_Mem = 'Mem(G)'
 GPU_Name = "GPU"
 GPU_Util = 'Util(%)'
-DSMFU = 'Decoding S-MFU(%)'
-DSMBU = 'Decoding S-MBU(%)'
-PSMFU = 'Prefill S-MFU(%)'
-PSMBU = 'Prefill S-MBU(%)'
+MFU = 'MFU(%)'
+MBU = 'MBU(%)'
 BATCH_SIZE = 'bs'
 PRECISION = "Precision"
 system_metrics_to_name_map = {
     "end_to_end_time": f"{E2Es}",
     "prefilling_time": f"{PREs}",
     "decoding_throughput": f"{TS}",
-    "decoding_mfu": f"{DSMFU}",
-    "decoding_mbu": f"{DSMBU}",
-    "prefill_throughput": f"{PTS}",
-    "prefill_mfu": f"{PSMFU}",
-    "prefill_mbu": f"{PSMBU}",
+    "mfu": f"{MFU}",
+    "mbu": f"{MBU}"
 }

 gpu_metrics_to_name_map = {

@@ -85,11 +78,10 @@ class Tasks(Enum):

     # # XXX include me back at some point
     # selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
-    # selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
+    mmlu = Task("mmlu", "acc", "MMLU") #MMLU/Acc (5-shot)
     gsm8k = Task("gsm8k_custom", "em", "GSM8K") #GSM8K/EM (5-shot)
     # gsm8k_cot = Task("gsm8k_cot", "em", "GSM8K COT") #GSM8K COT/EM (5-shot)
     arena_hard = Task("arena_hard", "score", "Arena Hard") #Arena Hard/Score
-    mmlu = Task("mmlu", "acc", "MMLU") #MMLU/Acc (5-shot)


 # These classes are for user facing column names,

@@ -114,7 +106,7 @@ auto_eval_column_dict.append(["model", ColumnContent, ColumnContent("Model", "ma
 # # auto_eval_column_dict.append(["average", ColumnContent, ColumnContent("Avg", "number", True)])

 # Inference framework
-auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True, dummy=True)])
+auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True)])

 for task in Tasks:
     auto_eval_column_dict.append([task.name, ColumnContent, ColumnContent(task.value.col_name, "number", True)])

@@ -125,30 +117,24 @@ for task in Tasks:
     # auto_eval_column_dict.append([f"{task.name}_gpu_mem", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Mem}", "number", True, hidden=True)])
     auto_eval_column_dict.append([f"{task.name}_gpu", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Name}", "str", True, hidden=True)])
     # auto_eval_column_dict.append([f"{task.name}_gpu_util", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Util}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} {PREs}", "number", False, hidden=True)])
     if task.value.benchmark in MULTIPLE_CHOICEs:
         continue
+    # auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} {PREs}", "number", False, hidden=True)])
     auto_eval_column_dict.append([f"{task.name}_decoding_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} {TS}", "number", True, hidden=True)])
-    # if task.value.benchmark != "gsm8k_custom":
-    #     continue
-    auto_eval_column_dict.append([f"{task.name}_decoding_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {DSMBU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_decoding_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {DSMFU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} {PTS}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {PSMBU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {PSMFU}", "number", True, hidden=True)])
-
+    auto_eval_column_dict.append([f"{task.name}_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {MBU}", "number", True, hidden=True)])
+    auto_eval_column_dict.append([f"{task.name}_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {MFU}", "number", True, hidden=True)])


 # Model information
-auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False, dummy=True)])
-# auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
-# auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
-auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True, dummy=True)])
-# auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
-# auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
-# auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
-# auto_eval_column_dict.append(["still_on_hub", ColumnContent, ColumnContent("Available on the hub", "bool", False)])
-# auto_eval_column_dict.append(["revision", ColumnContent, ColumnContent("Model sha", "str", False, False)])
+auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False)])
+auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
+auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
+auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True)])
+auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
+auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
+auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
+auto_eval_column_dict.append(["still_on_hub", ColumnContent, ColumnContent("Available on the hub", "bool", False)])
+auto_eval_column_dict.append(["revision", ColumnContent, ColumnContent("Model sha", "str", False, False)])
 # Dummy column for the search bar (hidden by the custom CSS)
 auto_eval_column_dict.append(["dummy", ColumnContent, ColumnContent("model_name_for_query", "str", False, dummy=True)])

@@ -174,10 +160,10 @@ class ModelDetails:


 class ModelType(Enum):
-    # PT = ModelDetails(name="pretrained", symbol="🟢")
-    # FT = ModelDetails(name="fine-tuned on domain-specific datasets", symbol="🔶")
+    PT = ModelDetails(name="pretrained", symbol="🟢")
+    FT = ModelDetails(name="fine-tuned on domain-specific datasets", symbol="🔶")
     chat = ModelDetails(name="chat models (RLHF, DPO, IFT, ...)", symbol="💬")
-    # merges = ModelDetails(name="base merges and moerges", symbol="🤝")
+    merges = ModelDetails(name="base merges and moerges", symbol="🤝")
     Unknown = ModelDetails(name="", symbol="?")

     def to_str(self, separator=" "):

@@ -185,25 +171,22 @@ class ModelType(Enum):

     @staticmethod
     def from_str(type):
-        # if "fine-tuned" in type or "🔶" in type:
-        #     return ModelType.FT
-        # if "pretrained" in type or "🟢" in type:
-        #     return ModelType.PT
+        if "fine-tuned" in type or "🔶" in type:
+            return ModelType.FT
+        if "pretrained" in type or "🟢" in type:
+            return ModelType.PT
         if any([k in type for k in ["instruction-tuned", "RL-tuned", "chat", "🟦", "⭕", "💬"]]):
             return ModelType.chat
-        # if "merge" in type or "🤝" in type:
-        #     return ModelType.merges
+        if "merge" in type or "🤝" in type:
+            return ModelType.merges
         return ModelType.Unknown


 class InferenceFramework(Enum):
     # "moe-infinity", hf-chat
-    # MoE_Infinity = ModelDetails("moe-infinity")
+    MoE_Infinity = ModelDetails("moe-infinity")
     HF_Chat = ModelDetails("hf-chat")
     VLLM = ModelDetails("vllm_moe")
-    VLLM_FIX = ModelDetails("vllm_moe_fixbs")
-    TRTLLM = ModelDetails("tensorrt_llm")
-    SGLANG = ModelDetails("sglang")
     Unknown = ModelDetails("?")

     def to_str(self):

@@ -211,23 +194,16 @@ class InferenceFramework(Enum):

     @staticmethod
     def from_str(inference_framework: str):
-        # if inference_framework in ["moe-infinity"]:
-        #     return InferenceFramework.MoE_Infinity
-        if inference_framework in ["tensorrt_llm"]:
-            return InferenceFramework.TRTLLM
+        if inference_framework in ["moe-infinity"]:
+            return InferenceFramework.MoE_Infinity
         if inference_framework in ["hf-chat"]:
            return InferenceFramework.HF_Chat
         if inference_framework in ["vllm_moe"]:
            return InferenceFramework.VLLM
-        if inference_framework in ["vllm_moe_fixbs"]:
-            return InferenceFramework.VLLM_FIX
-        if inference_framework in ["sglang"]:
-            return InferenceFramework.SGLANG
         return InferenceFramework.Unknown

 class GPUType(Enum):
     A100_sxm = ModelDetails("NVIDIA-A100-SXM4-80GB")
-    A100_sxm4 = ModelDetails("NVIDIA-A100-SMX4-80GB")
     A100_pcie = ModelDetails("NVIDIA-A100-PCIe-80GB")
     Unknown = ModelDetails("?")

@@ -249,28 +225,28 @@ class WeightType(Enum):


 class Precision(Enum):
-    # float32 = ModelDetails("float32")
-    # float16 = ModelDetails("float16")
+    float32 = ModelDetails("float32")
+    float16 = ModelDetails("float16")
     bfloat16 = ModelDetails("bfloat16")
     qt_8bit = ModelDetails("8bit")
     qt_4bit = ModelDetails("4bit")
-    # qt_GPTQ = ModelDetails("GPTQ")
+    qt_GPTQ = ModelDetails("GPTQ")
     Unknown = ModelDetails("?")

     @staticmethod
     def from_str(precision: str):
-        # if precision in ["torch.float32", "float32"]:
-        #     return Precision.float32
-        # if precision in ["torch.float16", "float16"]:
-        #     return Precision.float16
+        if precision in ["torch.float32", "float32"]:
+            return Precision.float32
+        if precision in ["torch.float16", "float16"]:
+            return Precision.float16
         if precision in ["torch.bfloat16", "bfloat16"]:
             return Precision.bfloat16
         if precision in ["8bit"]:
             return Precision.qt_8bit
         if precision in ["4bit"]:
             return Precision.qt_4bit
-        # if precision in ["GPTQ", "None"]:
-        #     return Precision.qt_GPTQ
+        if precision in ["GPTQ", "None"]:
+            return Precision.qt_GPTQ
         return Precision.Unknown

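The `auto_eval_column_dict` triples above are ultimately turned into a dataclass whose class attributes describe each leaderboard column. A runnable sketch of that construction (assumed to mirror the repo's `AutoEvalColumn`; the field names here are illustrative):

```python
from dataclasses import dataclass, make_dataclass

@dataclass(frozen=True)  # frozen, so instances are hashable and valid as defaults
class ColumnContent:
    name: str
    type: str
    displayed_by_default: bool
    hidden: bool = False
    dummy: bool = False

auto_eval_column_dict = [
    ["model", ColumnContent, ColumnContent("Model", "markdown", True)],
    ["inference_framework", ColumnContent, ColumnContent("Method", "str", True)],
    ["precision", ColumnContent, ColumnContent("Precision", "str", True)],
]
AutoEvalColumn = make_dataclass("AutoEvalColumn", auto_eval_column_dict, frozen=True)

print(AutoEvalColumn.precision.name)  # Precision
```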
@@ -140,7 +140,6 @@ class EvalResult:
140
  revision=config.get("model_sha", ""),
141
  still_on_hub=still_on_hub,
142
  architecture=architecture,
143
- model_type=ModelType.from_str(config.get("model_type", "")),
144
  inference_framework=inference_framework,
145
  )
146
 
@@ -175,22 +174,22 @@ class EvalResult:
175
 
176
  # breakpoint()
177
  # average = sum([v for v in self.results.values() if v is not None]) / len(Tasks)
178
-
179
  data_dict = {
180
  "eval_name": self.eval_name, # not a column, just a save name,
181
  AutoEvalColumn.precision.name: self.precision.value.name,
182
- # AutoEvalColumn.model_type.name: self.model_type.value.name,
183
  AutoEvalColumn.model_type_symbol.name: self.model_type.value.symbol,
184
- # AutoEvalColumn.weight_type.name: self.weight_type.value.name,
185
- # AutoEvalColumn.architecture.name: self.architecture,
186
  AutoEvalColumn.model.name: make_clickable_model(self.full_model),
187
  AutoEvalColumn.dummy.name: self.full_model,
188
- # AutoEvalColumn.revision.name: self.revision,
189
- # # AutoEvalColumn.average.name: average,
190
- # AutoEvalColumn.license.name: self.license,
191
- # AutoEvalColumn.likes.name: self.likes,
192
- # AutoEvalColumn.params.name: self.num_params,
193
- # AutoEvalColumn.still_on_hub.name: self.still_on_hub,
194
  AutoEvalColumn.inference_framework.name: self.inference_framework,
195
  }
196
 
 
140
  revision=config.get("model_sha", ""),
141
  still_on_hub=still_on_hub,
142
  architecture=architecture,
 
143
  inference_framework=inference_framework,
144
  )
145
 
 
174
 
175
  # breakpoint()
176
  # average = sum([v for v in self.results.values() if v is not None]) / len(Tasks)
177
+
178
  data_dict = {
179
  "eval_name": self.eval_name, # not a column, just a save name,
180
  AutoEvalColumn.precision.name: self.precision.value.name,
181
+ AutoEvalColumn.model_type.name: self.model_type.value.name,
182
  AutoEvalColumn.model_type_symbol.name: self.model_type.value.symbol,
183
+ AutoEvalColumn.weight_type.name: self.weight_type.value.name,
184
+ AutoEvalColumn.architecture.name: self.architecture,
185
  AutoEvalColumn.model.name: make_clickable_model(self.full_model),
186
  AutoEvalColumn.dummy.name: self.full_model,
187
+ AutoEvalColumn.revision.name: self.revision,
188
+ # AutoEvalColumn.average.name: average,
189
+ AutoEvalColumn.license.name: self.license,
190
+ AutoEvalColumn.likes.name: self.likes,
191
+ AutoEvalColumn.params.name: self.num_params,
192
+ AutoEvalColumn.still_on_hub.name: self.still_on_hub,
193
  AutoEvalColumn.inference_framework.name: self.inference_framework,
194
  }
195
 
src/populate.py CHANGED
@@ -75,7 +75,7 @@ def get_leaderboard_df(
         df[col] = np.nan

     if not df.empty:
-        df = df.map(lambda x: round(x, 2) if isinstance(x, (int, float)) else x)
+        df = df.round(decimals=2)

         # filter out if any of the benchmarks have not been produced
         # df = df[has_no_nan_values(df, benchmark_cols)]
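
`DataFrame.round` only touches numeric columns and leaves object/string columns alone, so it is both safer and faster here than mapping `round` over every cell. A quick comparison (`DataFrame.map` needs pandas >= 2.1; it was `applymap` before that):

```python
import pandas as pd

df = pd.DataFrame({"model": ["m1", "m2"], "em": [0.41237, 0.98765]})

print(df.round(decimals=2))  # rounds the numeric "em" column, skips "model"
# The elementwise version yields the same frame here, but visits every cell
# (including all the string cells) one by one:
print(df.map(lambda x: round(x, 2) if isinstance(x, (int, float)) else x))
```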
src/utils.py CHANGED
@@ -4,8 +4,6 @@ import subprocess
 import re
 import os
 import GPUtil
-from transformers import AutoConfig
-from typing import List

 try:
     from src.display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
@@ -14,63 +12,44 @@ except:
     from display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name

 MEM_BW_DICT ={
-    "NVIDIA-A100-PCIe-80GB": 1935e9,
-    "NVIDIA-A100-SXM4-80GB": 2039e9,
-    "NVIDIA-H100-PCIe-80GB": 2039e9,
-    "NVIDIA-RTX-A5000-24GB": 768e9,
-    "NVIDIA-RTX-A6000-48GB": 768e9,
 }

 PEAK_FLOPS_DICT = {
     "float32":{
         "NVIDIA-A100-PCIe-80GB": 312e12,
-        "NVIDIA-A100-SXM4-80GB": 312e12,
         "NVIDIA-H100-PCIe-80GB": 756e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
     "float16":{
         "NVIDIA-A100-PCIe-80GB": 624e12,
-        "NVIDIA-A100-SXM4-80GB": 624e12,
         "NVIDIA-H100-PCIe-80GB": 1513e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
     "bfloat16":{
         "NVIDIA-A100-PCIe-80GB": 624e12,
-        "NVIDIA-A100-SXM4-80GB": 624e12,
         "NVIDIA-H100-PCIe-80GB": 1513e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
-    "int8":{
         "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
         "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
-    "fp8":{
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 0,
-        "NVIDIA-RTX-A6000-48GB": 0
-    },
-    "fp4": {
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 0,
-        "NVIDIA-RTX-A6000-48GB": 0
-    },
-    "int4": {
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     }

 }

 def my_snapshot_download(repo_id, revision, local_dir, repo_type, max_workers):
@@ -118,7 +97,7 @@ def parse_nvidia_smi():
     # print(f"gpu_indices: {gpu_indices}")
     gpu_stats = []

-    gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W\s*/\s*\d+W\s*\|\s*(\d+)MiB\s*/\s*\d+MiB\s*\|\s*(\d+)%')
     # gpu_name_pattern = re.compile(r'NVIDIA\s+([\w\s]+\d+(?:\s*GB)?)')
     gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')

@@ -216,790 +195,17 @@ def get_peak_bw(gpu_name):
216
  def get_peak_flops(gpu_name, precision):
217
  return PEAK_FLOPS_DICT[precision][gpu_name]
218
 
219
- def _calculate_batch_metrics(outputs, decoding_tp, n_layers, d_model,
220
- n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
221
- avg_activated_experts, hf_config, num_gpus, model_name,
222
- used_dtype, batch_size, precision):
223
- """Calculate metrics for a batch of outputs"""
224
- gpu_type = get_gpu_details()
225
- hardware_specs = {
226
- "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
227
- "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
228
- }
229
- kvs = []
230
- true_kvs = []
231
- attn_score = []
232
-
233
- # Calculate KV sizes
234
- per_token_kv_size = 2 * n_layers * d_head * n_kv_heads # Default calculation
235
-
236
- if "DeepSeek" in model_name:
237
- if hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
238
- per_token_kv_size = n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
239
-
240
- # Process each output
241
- for x in outputs:
242
- output_len = len(x.outputs[0].token_ids)
243
- context_prefill_size = len(x.prompt_token_ids)
244
-
245
- # Calculate attention scores
246
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim"):
247
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
248
- origin_per_token_k_state_size = n_layers * n_attn_heads * q_head_dim
249
- origin_per_token_v_state_size = n_layers * n_attn_heads * hf_config.v_head_dim
250
- attention_score = context_prefill_size * origin_per_token_k_state_size + (output_len - 1) * origin_per_token_k_state_size / 2
251
- attention_score += context_prefill_size * origin_per_token_v_state_size + (output_len - 1) * origin_per_token_v_state_size / 2
252
- attention_score = attention_score / 1e12
253
- else:
254
- origin_per_token_kv_states_size = n_layers * n_attn_heads * d_head
255
- attention_score = context_prefill_size * origin_per_token_kv_states_size + (output_len - 1) * origin_per_token_kv_states_size / 2
256
- attention_score = attention_score * 2 / 1e12
257
-
258
- # Store attention scores and KV sizes
259
- attn_score.append(attention_score)
260
- kv_size = context_prefill_size * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2
261
- kv_size = kv_size / 1e12
262
- true_kv = (context_prefill_size * per_token_kv_size + output_len * per_token_kv_size) / 1e12 * 1e3
263
- kvs.append(kv_size)
264
- true_kvs.append(true_kv)
265
-
266
- # Calculate aggregate values
267
- kv_size = sum(kvs)
268
- true_kv_size = sum(true_kvs) * 1e3
269
- attention_score = sum(attn_score) / len(attn_score)
270
-
271
- # Calculate attention size per token
272
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim") and hasattr(hf_config, "kv_lora_rank"):
273
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
274
- if not hasattr(hf_config, "q_lora_rank") or not hf_config.q_lora_rank:
275
- attention_size_per_token = (d_model * n_attn_heads * q_head_dim) + \
276
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
277
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
278
- (hf_config.v_head_dim * n_attn_heads * d_model)
279
- attention_size_per_token = attention_size_per_token / 1e12
280
- else:
281
- attention_size_per_token = (d_model * hf_config.q_lora_rank) + \
282
- (hf_config.q_lora_rank * n_attn_heads * q_head_dim) + \
283
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
284
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
285
- (hf_config.v_head_dim * n_attn_heads * d_model)
286
- attention_size_per_token = attention_size_per_token / 1e12
287
- else:
288
- attention_size_per_token = d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) + n_attn_heads * d_head * d_model
289
- attention_size_per_token = attention_size_per_token / 1e12
290
-
291
- # Calculate expert sizes
292
- expert_size = d_ff * 3 * d_model / 1e12
293
- shared_experts_size_total = 0
294
- deepseek_dense_ffn_size = 0
295
- deepseek_sparse_layer_num = 0
296
-
297
- if "Qwen" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "shared_expert_intermediate_size"):
298
- d_ff = hf_config.moe_intermediate_size
299
- d_ff_share = hf_config.shared_expert_intermediate_size
300
- shared_experts_size = d_ff_share * 3 * d_model
301
- expert_size = d_ff * 3 * d_model
302
- shared_experts_size_total = shared_experts_size / 1e12
303
- expert_size = expert_size / 1e12
304
- elif "Qwen3" in model_name and hasattr(hf_config, "moe_intermediate_size"):
305
- d_ff = hf_config.moe_intermediate_size
306
- expert_size = d_ff * 3 * d_model
307
- expert_size = expert_size / 1e12
308
- elif "DeepSeek" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "intermediate_size") and hasattr(hf_config, "first_k_dense_replace"):
309
- d_ff = hf_config.moe_intermediate_size
310
- d_ff_dense = hf_config.intermediate_size
311
- deepseek_num_dense_layer = hf_config.first_k_dense_replace
312
- shared_experts_size = d_ff * 3 * d_model
313
- expert_size = d_ff * 3 * d_model
314
- shared_experts = 2
315
- shared_experts_size_total = shared_experts_size * shared_experts / 1e12
316
- expert_size = expert_size / 1e12
317
- deepseek_sparse_layer_num = n_layers - deepseek_num_dense_layer
318
- deepseek_dense_ffn_size = d_ff_dense * 3 * d_model / 1e12
319
-
320
- # Calculate S-MBU and S-MFU
321
- if "Qwen" in model_name and not "Qwen3" in model_name:
322
- smbu = ((n_layers*(avg_activated_experts * expert_size + shared_experts_size_total + attention_size_per_token) +
323
- kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
324
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size + shared_experts_size_total) + attention_score) \
325
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
326
- elif "Qwen3" in model_name:
327
- smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
328
- kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
329
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
330
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
331
- elif "DeepSeek" in model_name:
332
- smbu = ((n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
333
- (avg_activated_experts * expert_size + shared_experts_size_total) + \
334
- deepseek_num_dense_layer * deepseek_dense_ffn_size + \
335
- kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
336
- smfu = (n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
337
- (n_experts_per_tok * expert_size + shared_experts_size_total) + \
338
- deepseek_num_dense_layer * deepseek_dense_ffn_size + attention_score) \
339
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
340
- else:
341
- smbu = ((n_layers*(avg_activated_experts * expert_size + attention_size_per_token) +
342
- kv_size) * precision/ (batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
343
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
344
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
345
-
346
- return {
347
- 'smbu': smbu,
348
- 'smfu': smfu,
349
- 'kv_size': true_kv_size,
350
- 'decoding_throughput': decoding_tp
351
- }
352
-
353
- def _calculate_batch_metrics_sglang(outputs, decoding_tp, n_layers, d_model,
354
- n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
355
- avg_activated_experts, hf_config, num_gpus, model_name,
356
- used_dtype, batch_size, precision, ttft=None, prefill_tp=None):
357
- """Calculate metrics for a batch of outputs"""
358
- # Initialize hardware specs and output lists
359
- hardware_specs = _get_hardware_specs(used_dtype)
360
- output_data = _extract_output_data(outputs)
361
-
362
- # Calculate model-specific sizes
363
- per_token_kv_size = _calculate_kv_size(model_name, hf_config, n_layers, d_head, n_kv_heads)
364
- attention_size_per_token = _calculate_attention_size(model_name, hf_config, d_model, n_attn_heads, d_head, n_kv_heads)
365
- expert_config = _calculate_expert_config(model_name, hf_config, d_ff, d_model, n_layers)
366
-
367
- # Process outputs and calculate metrics
368
- metrics_data = _process_outputs(output_data, per_token_kv_size, attention_size_per_token,
369
- model_name, hf_config, n_layers, n_attn_heads, d_head)
370
-
371
- # Calculate throughput metrics
372
- if ttft is None or prefill_tp is None:
373
- ttft, prefill_tp = _calculate_throughput_metrics(batch_size, output_data['prefill_lengths'],
374
- output_data['max_duration'])
375
-
376
-
377
- # Calculate S-MBU and S-MFU
378
- smbu_smfu_metrics = _calculate_smbu_smfu(model_name, n_layers, attention_size_per_token,
379
- expert_config, avg_activated_experts, metrics_data,
380
- hardware_specs, num_gpus, precision, ttft, prefill_tp,
381
- batch_size, decoding_tp)
382
-
383
- return {
384
- 'prefill_smbu': smbu_smfu_metrics['prefill_smbu'],
385
- 'prefill_smfu': smbu_smfu_metrics['prefill_smfu'],
386
- 'decoding_smbu': smbu_smfu_metrics['decoding_smbu'],
387
- 'decoding_smfu': smbu_smfu_metrics['decoding_smfu'],
388
- 'kv_size': metrics_data['true_kv_size'],
389
- 'decoding_throughput': decoding_tp,
390
- 'prefill_tp': prefill_tp,
391
- 'ttft': ttft
392
- }
393
-
394
-
395
- def _get_hardware_specs(used_dtype):
396
- """Get hardware specifications"""
397
- gpu_type = get_gpu_details()
398
- return {
399
- "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
400
- "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
401
- }
402
-
403
-
404
- def _extract_output_data(outputs):
405
- """Extract relevant data from outputs"""
406
- prefill_lengths = []
407
- output_lengths = []
408
- max_duration = 0.0
409
-
410
- for x in outputs:
411
- output_lengths.append(x['meta_info']['completion_tokens'])
412
- prefill_lengths.append(x['meta_info']['prompt_tokens'])
413
- max_duration = max(max_duration, x['meta_info']['e2e_latency'])
414
-
415
- return {
416
- 'prefill_lengths': prefill_lengths,
417
- 'output_lengths': output_lengths,
418
- 'max_duration': max_duration
419
- }
420
-
421
-
422
- def _calculate_kv_size(model_name, hf_config, n_layers, d_head, n_kv_heads):
423
- """Calculate per-token KV size based on model type"""
424
- if "DeepSeek" in model_name and hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
425
- return n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
426
- return 2 * n_layers * d_head * n_kv_heads
427
-
428
-
429
- def _calculate_attention_size(model_name, hf_config, d_model, n_attn_heads, d_head, n_kv_heads):
430
- """Calculate attention size per token based on model type"""
431
- if ("DeepSeek" in model_name and
432
- hasattr(hf_config, "qk_rope_head_dim") and
433
- hasattr(hf_config, "qk_nope_head_dim") and
434
- hasattr(hf_config, "v_head_dim") and
435
- hasattr(hf_config, "kv_lora_rank")):
436
-
437
- return _calculate_deepseek_attention_size(hf_config, d_model, n_attn_heads)
438
-
439
- return (d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) +
440
- n_attn_heads * d_head * d_model) / 1e12
441
-
442
-
443
- def _calculate_deepseek_attention_size(hf_config, d_model, n_attn_heads):
444
- """Calculate DeepSeek-specific attention size"""
445
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
446
-
447
- base_size = ((d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) +
448
- (hf_config.kv_lora_rank * n_attn_heads *
449
- (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) +
450
- (hf_config.v_head_dim * n_attn_heads * d_model))
451
-
452
- if hasattr(hf_config, "q_lora_rank") and hf_config.q_lora_rank:
453
- q_size = (d_model * hf_config.q_lora_rank +
454
- hf_config.q_lora_rank * n_attn_heads * q_head_dim)
455
- else:
456
- q_size = d_model * n_attn_heads * q_head_dim
457
-
458
- return (base_size + q_size) / 1e12
459
-
460
-
461
- def _calculate_expert_config(model_name, hf_config, d_ff, d_model, n_layers):
462
- """Calculate expert configuration based on model type"""
463
- config = {
464
- 'expert_size': d_ff * 3 * d_model / 1e12,
465
- 'shared_experts_size_total': 0,
466
- 'deepseek_dense_ffn_size': 0,
467
- 'deepseek_sparse_layer_num': 0,
468
- 'deepseek_num_dense_layer': 0
469
- }
470
-
471
- if "Qwen" in model_name and not "Qwen3" in model_name:
472
- config.update(_get_qwen_expert_config(hf_config, d_model))
473
- elif "Qwen3" in model_name:
474
- config.update(_get_qwen3_expert_config(hf_config, d_model))
475
- elif "DeepSeek" in model_name:
476
- config.update(_get_deepseek_expert_config(hf_config, d_model, n_layers))
477
-
478
- return config
479
-
480
-
481
- def _get_qwen_expert_config(hf_config, d_model):
482
- """Get Qwen-specific expert configuration"""
483
- if (hasattr(hf_config, "moe_intermediate_size") and
484
- hasattr(hf_config, "shared_expert_intermediate_size")):
485
-
486
- return {
487
- 'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12,
488
- 'shared_experts_size_total': hf_config.shared_expert_intermediate_size * 3 * d_model / 1e12
489
- }
490
- return {}
491
-
492
-
493
- def _get_qwen3_expert_config(hf_config, d_model):
494
- """Get Qwen3-specific expert configuration"""
495
- if hasattr(hf_config, "moe_intermediate_size"):
496
- return {
497
- 'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12
498
- }
499
- return {}
500
-
501
-
502
- def _get_deepseek_expert_config(hf_config, d_model, n_layers):
503
- """Get DeepSeek-specific expert configuration"""
504
- if (hasattr(hf_config, "moe_intermediate_size") and
505
- hasattr(hf_config, "intermediate_size") and
506
- hasattr(hf_config, "first_k_dense_replace")):
507
-
508
- deepseek_num_dense_layer = hf_config.first_k_dense_replace
509
- return {
510
- 'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12,
511
- 'shared_experts_size_total': hf_config.moe_intermediate_size * 3 * d_model * 2 / 1e12,
512
- 'deepseek_dense_ffn_size': hf_config.intermediate_size * 3 * d_model / 1e12,
513
- 'deepseek_sparse_layer_num': n_layers - deepseek_num_dense_layer,
514
- 'deepseek_num_dense_layer': deepseek_num_dense_layer
515
- }
516
- return {}
517
-
518
-
519
- def _process_outputs(output_data, per_token_kv_size, attention_size_per_token,
520
- model_name, hf_config, n_layers, n_attn_heads, d_head):
521
- """Process outputs to calculate KV sizes and attention scores"""
522
- kvs = []
523
- true_kvs = []
524
- attn_scores = []
525
-
526
- for prefill_len, output_len in zip(output_data['prefill_lengths'], output_data['output_lengths']):
527
- # Calculate attention score
528
- attn_score = _calculate_attention_score(model_name, hf_config, prefill_len, output_len,
529
- n_layers, n_attn_heads, d_head)
530
- attn_scores.append(attn_score)
531
-
532
- # Calculate KV sizes
533
- kv_size = (prefill_len * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2) / 1e12
534
- true_kv = (prefill_len * per_token_kv_size + output_len * per_token_kv_size) / 1e9
535
-
536
- kvs.append(kv_size)
537
- true_kvs.append(true_kv)
538
-
539
- return {
540
- 'kv_size': sum(kvs),
541
- 'true_kv_size': sum(true_kvs) * 1e3,
542
- 'attention_score': sum(attn_scores) / len(attn_scores)
543
- }
544
-
545
-
546
- def _calculate_attention_score(model_name, hf_config, prefill_len, output_len,
547
- n_layers, n_attn_heads, d_head):
548
- """Calculate attention score for a single output"""
549
- if ("DeepSeek" in model_name and
550
- hasattr(hf_config, "qk_rope_head_dim") and
551
- hasattr(hf_config, "qk_nope_head_dim") and
552
- hasattr(hf_config, "v_head_dim")):
553
-
554
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
555
- k_size = n_layers * n_attn_heads * q_head_dim
556
- v_size = n_layers * n_attn_heads * hf_config.v_head_dim
557
-
558
- score = (prefill_len * k_size + (output_len - 1) * k_size / 2 +
559
- prefill_len * v_size + (output_len - 1) * v_size / 2)
560
- else:
561
- kv_size = n_layers * n_attn_heads * d_head
562
- score = (prefill_len * kv_size + (output_len - 1) * kv_size / 2) * 2
563
-
564
- return score / 1e12
565
-
566
-
567
- def _calculate_throughput_metrics(batch_size, prefill_lengths, max_duration):
568
- """Calculate throughput metrics"""
569
- total_prefill = sum(prefill_lengths)
570
- prefill_tp = total_prefill / (max_duration)
571
- ttft = max_duration / batch_size
572
- return ttft, prefill_tp
573
-
574
-
575
- def _calculate_smbu_smfu(model_name, n_layers, attention_size_per_token, expert_config,
576
- avg_activated_experts, metrics_data, hardware_specs, num_gpus,
577
- precision, ttft, prefill_tp, batch_size, decoding_tp):
578
- """Calculate S-MBU and S-MFU metrics"""
579
- prefill_activation = avg_activated_experts[1]
580
- decode_steps_activation = avg_activated_experts[2:]
581
-
582
- # Calculate prefill metrics
583
- prefill_smbu, prefill_smfu = _calculate_prefill_metrics(
584
- model_name, n_layers, attention_size_per_token, expert_config,
585
- prefill_activation, metrics_data['attention_score'], hardware_specs,
586
- num_gpus, precision, ttft, prefill_tp
587
- )
588
-
589
- # Calculate decoding metrics
590
- decoding_smbu, decoding_smfu = _calculate_decoding_metrics(
591
- model_name, n_layers, attention_size_per_token, expert_config,
592
- decode_steps_activation, metrics_data, hardware_specs,
593
- num_gpus, precision, batch_size, decoding_tp
594
- )
595
-
596
- return {
597
- 'prefill_smbu': prefill_smbu,
598
- 'prefill_smfu': prefill_smfu,
599
- 'decoding_smbu': decoding_smbu,
600
- 'decoding_smfu': decoding_smfu
601
- }
602
-
603
-
604
- def _calculate_prefill_metrics(model_name, n_layers, attention_size_per_token, expert_config,
605
- prefill_activation, attention_score, hardware_specs,
606
- num_gpus, precision, ttft, prefill_tp):
607
- """Calculate prefill S-MBU and S-MFU"""
608
- model_calculators = {
609
- 'Qwen': _calculate_qwen_prefill,
610
- 'Qwen3': _calculate_qwen3_prefill,
611
- 'DeepSeek': _calculate_deepseek_prefill
612
- }
613
-
614
- for model_type, calculator in model_calculators.items():
615
- if model_type in model_name and (model_type != 'Qwen' or 'Qwen3' not in model_name):
616
- return calculator(n_layers, attention_size_per_token, expert_config,
617
- prefill_activation, attention_score, hardware_specs,
618
- num_gpus, precision, ttft, prefill_tp)
619
-
620
- # Default case
621
- return _calculate_default_prefill(n_layers, attention_size_per_token, expert_config,
622
- prefill_activation, attention_score, hardware_specs,
623
- num_gpus, precision, ttft, prefill_tp)
624
-
625
-
626
- def _calculate_decoding_metrics(model_name, n_layers, attention_size_per_token, expert_config,
627
- decode_steps_activation, metrics_data, hardware_specs,
628
- num_gpus, precision, batch_size, decoding_tp):
629
- """Calculate decoding S-MBU and S-MFU"""
630
- decoding_smbus = []
631
-
632
- for activation in decode_steps_activation:
633
- if "Qwen" in model_name and "Qwen3" not in model_name:
634
- smbu, smfu = _calculate_qwen_decoding(n_layers, attention_size_per_token, expert_config,
635
- activation, metrics_data, hardware_specs, num_gpus,
636
- precision, batch_size, decoding_tp)
637
- elif "Qwen3" in model_name:
638
- smbu, smfu = _calculate_qwen3_decoding(n_layers, attention_size_per_token, expert_config,
639
- activation, metrics_data, hardware_specs, num_gpus,
640
- precision, batch_size, decoding_tp)
641
- elif "DeepSeek" in model_name:
642
- smbu, smfu = _calculate_deepseek_decoding(n_layers, attention_size_per_token, expert_config,
643
- activation, metrics_data, hardware_specs, num_gpus,
644
- precision, batch_size, decoding_tp)
645
- else:
646
- smbu, smfu = _calculate_default_decoding(n_layers, attention_size_per_token, expert_config,
647
- activation, metrics_data, hardware_specs, num_gpus,
648
- precision, batch_size, decoding_tp)
649
- decoding_smbus.append(smbu)
650
-
651
- return sum(decoding_smbus) / len(decoding_smbus), smfu
652
-
653
-
654
- # Helper functions for specific model calculations
655
- def _calculate_qwen_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
656
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
657
- smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
658
- expert_config['shared_experts_size_total'] +
659
- attention_size_per_token)) * precision / ttft
660
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
661
-
662
- smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size'] +
663
- expert_config['shared_experts_size_total']) + attention_score) * 2 * prefill_tp
664
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
665
-
666
- return smbu, smfu
667
-
668
-
669
- def _calculate_qwen3_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
670
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
671
- smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
672
- attention_size_per_token)) * precision / ttft
673
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
674
-
675
- smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size']) +
676
- attention_score) * 2 * prefill_tp
677
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
678
-
679
- return smbu, smfu
680
-
681
-
682
- def _calculate_deepseek_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
683
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
684
- smbu_numerator = ((n_layers * attention_size_per_token +
685
- expert_config['deepseek_sparse_layer_num'] *
686
- (prefill_activation * expert_config['expert_size'] +
687
- expert_config['shared_experts_size_total']) +
688
- expert_config['deepseek_num_dense_layer'] *
689
- expert_config['deepseek_dense_ffn_size']) * precision / ttft)
690
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
691
-
692
- smfu_numerator = ((n_layers * attention_size_per_token +
693
- expert_config['deepseek_sparse_layer_num'] *
694
- (expert_config['expert_size'] + expert_config['shared_experts_size_total']) +
695
- expert_config['deepseek_num_dense_layer'] *
696
- expert_config['deepseek_dense_ffn_size'] + attention_score) * 2 * prefill_tp)
697
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
698
-
699
- return smbu, smfu
700
-
701
-
702
- def _calculate_default_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
703
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
704
- # Default implementation
705
- smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
706
- attention_size_per_token)) * precision / ttft
707
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
708
-
709
- smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size']) +
710
- attention_score) * 2 * prefill_tp
711
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
712
-
713
- return smbu, smfu
714
-
715
-
716
- def _calculate_qwen_decoding(n_layers, attention_size_per_token, expert_config, activation,
717
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
718
- smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
719
- expert_config['shared_experts_size_total'] +
720
- attention_size_per_token) +
721
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
722
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
723
-
724
- smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size'] +
725
- expert_config['shared_experts_size_total']) +
726
- metrics_data['attention_score']) * 2 * decoding_tp)
727
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
728
-
729
- return smbu, smfu
730
-
731
-
732
- def _calculate_qwen3_decoding(n_layers, attention_size_per_token, expert_config, activation,
733
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
734
- smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
735
- attention_size_per_token) +
736
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
737
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
738
-
739
- smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size']) +
740
- metrics_data['attention_score']) * 2 * decoding_tp)
741
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
742
-
743
- return smbu, smfu
744
-
745
-
746
- def _calculate_deepseek_decoding(n_layers, attention_size_per_token, expert_config, activation,
747
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
748
- smbu_numerator = ((n_layers * attention_size_per_token +
749
- expert_config['deepseek_sparse_layer_num'] *
750
- (activation * expert_config['expert_size'] +
751
- expert_config['shared_experts_size_total']) +
752
- expert_config['deepseek_num_dense_layer'] *
753
- expert_config['deepseek_dense_ffn_size'] +
754
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
755
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
756
-
757
- smfu_numerator = ((n_layers * attention_size_per_token +
758
- expert_config['deepseek_sparse_layer_num'] *
759
- (expert_config['expert_size'] + expert_config['shared_experts_size_total']) +
760
- expert_config['deepseek_num_dense_layer'] *
761
- expert_config['deepseek_dense_ffn_size'] +
762
- metrics_data['attention_score']) * 2 * decoding_tp)
763
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
764
-
765
- return smbu, smfu
766
-
767
-
768
- def _calculate_default_decoding(n_layers, attention_size_per_token, expert_config, activation,
769
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
770
- smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
771
- attention_size_per_token) +
772
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
773
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
774
-
775
- smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size']) +
776
- metrics_data['attention_score']) * 2 * decoding_tp)
777
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
778
-
779
- return smbu, smfu
780
-
781
- def _calculate_batch_metrics_hflm(output_len, context_prefill_size, decoding_tp, n_layers, d_model,
782
- n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
783
- avg_activated_experts, hf_config, num_gpus, model_name,
784
- used_dtype, batch_size, precision):
785
- """Calculate metrics for a batch of outputs"""
786
- gpu_type = get_gpu_details()
787
- hardware_specs = {
788
- "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
789
- "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
790
- }
791
-
792
- # Calculate KV sizes
793
- per_token_kv_size = 2 * n_layers * d_head * n_kv_heads # Default calculation
794
-
795
- if "DeepSeek" in model_name:
796
- if hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
797
- per_token_kv_size = n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
798
-
799
-
800
- # Calculate attention scores
801
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim"):
802
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
803
- origin_per_token_k_state_size = n_layers * n_attn_heads * q_head_dim
804
- origin_per_token_v_state_size = n_layers * n_attn_heads * hf_config.v_head_dim
805
- attention_score = context_prefill_size * origin_per_token_k_state_size + (output_len - 1) * origin_per_token_k_state_size / 2
806
- attention_score += context_prefill_size * origin_per_token_v_state_size + (output_len - 1) * origin_per_token_v_state_size / 2
807
- attention_score = attention_score / 1e12
808
  else:
809
- origin_per_token_kv_states_size = n_layers * n_attn_heads * d_head
810
- attention_score = context_prefill_size * origin_per_token_kv_states_size + (output_len - 1) * origin_per_token_kv_states_size / 2
811
- attention_score = attention_score * 2 / 1e12
812
-
813
- # Store attention scores and KV sizes
814
- kv_size = context_prefill_size * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2
815
- kv_size = kv_size / 1e12
816
- true_kv = (context_prefill_size * per_token_kv_size + output_len * per_token_kv_size) / 1e12 * 1e3
817
-
818
- # Calculate aggregate values
819
- kv_size = kv_size * batch_size
820
- true_kv_size = true_kv * batch_size * 1e3
821
- # Calculate attention size per token
822
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim") and hasattr(hf_config, "kv_lora_rank"):
823
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
824
- if not hasattr(hf_config, "q_lora_rank") or not hf_config.q_lora_rank:
825
- attention_size_per_token = (d_model * n_attn_heads * q_head_dim) + \
826
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
827
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
828
- (hf_config.v_head_dim * n_attn_heads * d_model)
829
- attention_size_per_token = attention_size_per_token / 1e12
830
- else:
831
- attention_size_per_token = (d_model * hf_config.q_lora_rank) + \
832
- (hf_config.q_lora_rank * n_attn_heads * q_head_dim) + \
833
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
834
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
835
- (hf_config.v_head_dim * n_attn_heads * d_model)
836
- attention_size_per_token = attention_size_per_token / 1e12
837
- else:
838
- attention_size_per_token = d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) + n_attn_heads * d_head * d_model
839
- attention_size_per_token = attention_size_per_token / 1e12
840
-
841
- # Calculate expert sizes
842
- expert_size = d_ff * 3 * d_model / 1e12
843
- shared_experts_size_total = 0
844
- deepseek_dense_ffn_size = 0
845
- deepseek_sparse_layer_num = 0
846
-
847
- if "Qwen" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "shared_expert_intermediate_size"):
848
- d_ff = hf_config.moe_intermediate_size
849
- d_ff_share = hf_config.shared_expert_intermediate_size
850
- shared_experts_size = d_ff_share * 3 * d_model
851
- expert_size = d_ff * 3 * d_model
852
- shared_experts_size_total = shared_experts_size / 1e12
853
- expert_size = expert_size / 1e12
854
- elif "Qwen3" in model_name and hasattr(hf_config, "moe_intermediate_size"):
855
- d_ff = hf_config.moe_intermediate_size
856
- expert_size = d_ff * 3 * d_model
857
- expert_size = expert_size / 1e12
858
- elif "DeepSeek" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "intermediate_size") and hasattr(hf_config, "first_k_dense_replace"):
859
- d_ff = hf_config.moe_intermediate_size
860
- d_ff_dense = hf_config.intermediate_size
861
- deepseek_num_dense_layer = hf_config.first_k_dense_replace
862
- shared_experts_size = d_ff * 3 * d_model
863
- expert_size = d_ff * 3 * d_model
864
- shared_experts = 2
865
- shared_experts_size_total = shared_experts_size * shared_experts / 1e12
866
- expert_size = expert_size / 1e12
867
- deepseek_sparse_layer_num = n_layers - deepseek_num_dense_layer
868
- deepseek_dense_ffn_size = d_ff_dense * 3 * d_model / 1e12
869
-
870
- # Calculate S-MBU and S-MFU
871
- if "Qwen" in model_name:
872
- smbu = ((n_layers*(avg_activated_experts * expert_size + shared_experts_size_total + attention_size_per_token) +
873
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
874
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size + shared_experts_size_total) + attention_score) \
875
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
876
- elif "Qwen3" in model_name:
877
- smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
878
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
879
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
880
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
881
- elif "DeepSeek" in model_name:
882
- smbu = ((n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
883
- (avg_activated_experts * expert_size + shared_experts_size_total) + \
884
- deepseek_num_dense_layer * deepseek_dense_ffn_size + \
885
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
886
- smfu = (n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
887
- (n_experts_per_tok * expert_size + shared_experts_size_total) + \
888
- deepseek_num_dense_layer * deepseek_dense_ffn_size + attention_score) \
889
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
890
- else:
891
- smbu = ((n_layers*(avg_activated_experts * expert_size + attention_size_per_token) +
892
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
893
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
894
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
895
-
896
- return {
897
- 'smbu': smbu,
898
- 'smfu': smfu,
899
- 'kv_size': true_kv_size,
900
- 'decoding_throughput': decoding_tp,
901
- 'ttft': 0
902
- }
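For readers tracing the cost model being deleted here: S-MBU divides the bytes a decode step actually touches (activated expert weights, attention weights, KV cache) by aggregate peak memory bandwidth, and S-MFU divides the FLOPs actually executed by aggregate peak compute. Below is a minimal, self-contained sketch of the generic (non-Qwen, non-DeepSeek) branch. Every number is an illustrative assumption, not benchmark output; halving `peak_flops_tf` mirrors the convention in the code above, since the FLOPS table further down appears to store NVIDIA's structured-sparsity peaks.

```python
# Toy re-derivation of the generic S-MBU / S-MFU branch above.
# All values are illustrative assumptions, not measured data.

n_layers, d_model = 32, 4096
n_attn_heads, n_kv_heads, d_head = 32, 8, 128
d_ff = 14336                          # expert FFN hidden size
n_experts_per_tok = 2                 # experts routed per token
avg_activated_experts = 6.5           # distinct experts hit per layer per batch
batch_size, decoding_tp = 64, 1500.0  # batch size, decode throughput (tokens/s)
precision = 2                         # bytes per parameter (float16)
num_gpus = 8
peak_bandwidth_tb = 2.0               # TB/s per GPU
peak_flops_tf = 1513.0                # sparse-peak TFLOP/s per GPU

# Per-layer parameter counts, scaled by 1e12 as in the code above
attention_size_per_token = (d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2)
                            + n_attn_heads * d_head * d_model) / 1e12
expert_size = d_ff * 3 * d_model / 1e12
kv_size = 0.01                        # TB of KV cache read per step (placeholder)
attention_score = 0.005               # TFLOPs of attention math (placeholder)

# S-MBU: TB actually moved per decode step, over seconds per step and peak TB/s
smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token)
         + kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * peak_bandwidth_tb)

# S-MFU: 2 FLOPs per touched weight per token, over the dense (halved) peak
smfu = ((n_layers * (attention_size_per_token + n_experts_per_tok * expert_size)
         + attention_score) * 2 * decoding_tp / (num_gpus * peak_flops_tf / 2))

print(f"S-MBU ~ {smbu:.2%}, S-MFU ~ {smfu:.2%}")
```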
903
- class ModelInfoRetriever:
904
- def __init__(self, model_name: str, precision: str = 'float16'):
905
- if precision not in ['float32', 'float16', 'bfloat16', 'int8', 'int4', 'awq', 'gptq', 'fp8', 'fp4']:
906
- raise ValueError("Precision must be one of ['float32', 'float16', 'bfloat16', 'int8', 'int4', 'awq', 'gptq', 'fp8', 'fp4']")
907
- self.model_name = model_name
908
- self.precision = precision
909
- self.config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
910
- self.model_type = self.config.model_type
911
-
912
- def get_model_precision_bits(self):
913
- """Returns bit width used by the given quantization format."""
914
- if self.precision == 'float32':
915
- return 4
916
- if self.precision in ['float16', 'bfloat16']:
917
- return 2
918
- if self.precision in ['int8', 'fp8']:
919
- return 1
920
- if self.precision in ['int4', 'fp4', 'gptq', 'awq']:
921
- return 0.5
922
- raise ValueError(f"Unsupported precision: {self.precision}")
923
-
924
- def get_attention_info(self):
925
- """Returns attention-related info"""
926
- return {
927
- 'num_attention_heads': getattr(self.config, "num_attention_heads", None),
928
- 'num_key_value_heads': getattr(self.config, "num_key_value_heads", getattr(self.config, "num_kv_heads", None)),
929
- 'head_dim': getattr(self.config, "head_dim", getattr(self.config, "hidden_size", None) // getattr(self.config, "num_attention_heads", 1))
930
- }
931
-
932
- def get_rope_info(self):
933
- """Returns RoPE (rotary embedding) info if available"""
934
- if hasattr(self.config, "rope_scaling"):
935
- return {
936
- "type": self.config.rope_scaling.get("type"),
937
- "factor": self.config.rope_scaling.get("factor")
938
- }
939
- elif hasattr(self.config, "use_alibi"):
940
- return {"type": "alibi", "enabled": self.config.use_alibi}
941
- else:
942
- return {"type": "none"}
943
-
944
- def get_moe_info(self, d_model=None):
945
- """Returns MoE configuration such as number of experts and FFN dim"""
946
- if d_model is None:
947
- d_model = getattr(self.config, "hidden_size", None)
948
-
949
- num_experts = (
950
- getattr(self.config, "num_local_experts", None) or
951
- getattr(self.config, "num_experts", None) or
952
- getattr(self.config, "n_routed_experts", None) or
953
- getattr(getattr(self.config, "ffn_config", {}), "moe_num_experts", None) or
954
- 1
955
- )
956
- n_experts_per_tok = (
957
- getattr(self.config, "num_experts_per_tok", None) or
958
- getattr(self.config, "num_selected_experts", None) or
959
- getattr(getattr(self.config, "ffn_config", {}), "moe_top_k", None) or
960
- 1
961
- )
962
- d_ff = (
963
- getattr(self.config, "ffn_dim", None) or
964
- getattr(self.config, "intermediate_size", None) or
965
- getattr(self.config, "d_ff", None) or
966
- (d_model * getattr(self.config, "ff_ratio", 4)) or
967
- getattr(getattr(self.config, "ffn_config", {}), "ffn_hidden_size", None) or
968
- (4 * d_model)
969
- )
970
-
971
- return {
972
- "num_experts": num_experts,
973
- "experts_per_token": n_experts_per_tok,
974
- "ffn_dim": d_ff
975
- }
976
-
977
- def get_architecture_info(self):
978
- """Returns model-wide architecture info"""
979
- return {
980
- "model_type": self.model_type,
981
- "hidden_size": getattr(self.config, "hidden_size", None),
982
- "num_hidden_layers": getattr(self.config, "num_hidden_layers", None),
983
- "max_position_embeddings": getattr(self.config, "max_position_embeddings", None),
984
- "vocab_size": getattr(self.config, "vocab_size", None),
985
- "architectures": getattr(self.config, "architectures", []),
986
- }
987
-
988
- def summarize(self):
989
- """Aggregate all extracted info in a dictionary"""
990
- d_model = getattr(self.config, "hidden_size", None)
991
- return {
992
- "model_name": self.model_name,
993
- "model_type": self.model_type,
994
- "precision_bits": self.get_model_precision_bits(),
995
- "architecture": self.get_architecture_info(),
996
- "attention": self.get_attention_info(),
997
- "rope": self.get_rope_info(),
998
- "moe": self.get_moe_info(d_model)
999
- }
1000
-
1001
-
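The `ModelInfoRetriever` class removed above was the config-introspection entry point for these calculations; note that despite its name, `get_model_precision_bits` returns bytes per parameter (2 for float16), not bits. A sketch of how it was typically driven, assuming the class is in scope, `transformers` is installed, and the Hub config is reachable (the checkpoint name is illustrative):

```python
# Hypothetical usage of the removed helper; any MoE checkpoint works here.
info = ModelInfoRetriever("mistralai/Mixtral-8x7B-Instruct-v0.1", precision="bfloat16")

moe = info.get_moe_info()
print(info.get_model_precision_bits())  # 2 -- bytes per parameter, despite the name
print(moe["num_experts"], moe["experts_per_token"], moe["ffn_dim"])
```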
1002
 
1003
- # if __name__ == "__main__":
1004
- # print(analyze_gpu_stats(parse_nvidia_smi()))
1005
- # print(get_gpu_details())
 
4
  import re
5
  import os
6
  import GPUtil
7
 
8
  try:
9
  from src.display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
 
12
  from display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
13
 
14
  MEM_BW_DICT ={
15
+ "NVIDIA-A100-PCIe-80GB": 1935,
16
+ "NVIDIA-A100-SXM-80GB": 2039,
17
+ "NVIDIA-H100-PCIe-80GB": 2039,
18
+ "NVIDIA-RTX-A5000-24GB": 768
 
19
  }
20
 
21
  PEAK_FLOPS_DICT = {
22
  "float32":{
23
  "NVIDIA-A100-PCIe-80GB": 312e12,
24
+ "NVIDIA-A100-SXM-80GB": 312e12,
25
  "NVIDIA-H100-PCIe-80GB": 756e12,
26
+ "NVIDIA-RTX-A5000-24GB": 222.2e12
 
27
  },
28
  "float16":{
29
  "NVIDIA-A100-PCIe-80GB": 624e12,
30
+ "NVIDIA-A100-SXM-80GB": 624e12,
31
  "NVIDIA-H100-PCIe-80GB": 1513e12,
32
+ "NVIDIA-RTX-A5000-24GB": 444.4e12
 
33
  },
34
  "bfloat16":{
35
  "NVIDIA-A100-PCIe-80GB": 624e12,
36
+ "NVIDIA-A100-SXM-80GB": 624e12,
37
  "NVIDIA-H100-PCIe-80GB": 1513e12,
38
+ "NVIDIA-RTX-A5000-24GB": 444.4e12
 
39
  },
40
+ "8bit":{
41
  "NVIDIA-A100-PCIe-80GB": 1248e12,
42
+ "NVIDIA-A100-SXM-80GB": 1248e12,
43
  "NVIDIA-H100-PCIe-80GB": 3026e12,
44
+ "NVIDIA-RTX-A5000-24GB": 889e12
 
45
  },
46
+ "4bit": {
47
+ "NVIDIA-A100-PCIe-80GB": 2496e12,
48
+ "NVIDIA-A100-SXM-80GB": 2496e12,
49
+ "NVIDIA-H100-PCIe-80GB": 6052e12,
50
+ "NVIDIA-RTX-A5000-24GB": 1778e12
 
51
  }
52
+
53
  }
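These tables key hardware peaks off the normalized GPU name: `MEM_BW_DICT` holds memory bandwidth in GB/s, and `PEAK_FLOPS_DICT` holds peak FLOP/s per precision (its `8bit`/`4bit` keys line up with `transfer_precision2bytes` below). A quick lookup sketch using one of the names added above:

```python
gpu = "NVIDIA-A100-SXM-80GB"

peak_bw_gbs = MEM_BW_DICT[gpu]               # 2039 GB/s
peak_fp16 = PEAK_FLOPS_DICT["float16"][gpu]  # 624e12 FLOP/s

print(f"{gpu}: {peak_bw_gbs} GB/s, {peak_fp16 / 1e12:.0f} TFLOP/s at fp16")
```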
54
 
55
  def my_snapshot_download(repo_id, revision, local_dir, repo_type, max_workers):
 
97
  # print(f"gpu_indices: {gpu_indices}")
98
  gpu_stats = []
99
 
100
+ gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W / \d+W\s+\|\s+(\d+)MiB / \d+MiB\s+\|\s+(\d+)%')
101
  # gpu_name_pattern = re.compile(r'NVIDIA\s+([\w\s]+\d+(?:\s*GB)?)')
102
  gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')
103
 
 
195
  def get_peak_flops(gpu_name, precision):
196
  return PEAK_FLOPS_DICT[precision][gpu_name]
197
 
198
+ def transfer_precision2bytes(precision):
199
+ if precision == "float32":
200
+ return 4
201
+ elif precision in ["float16", "bfloat16"]:
202
+ return 2
203
+ elif precision == "8bit":
204
+ return 1
205
+ elif precision == "4bit":
206
+ return 0.5
207
  else:
208
+ raise ValueError(f"Unsupported precision: {precision}")
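The new `transfer_precision2bytes` helper maps the leaderboard's precision strings to bytes per parameter and raises on unknown strings. For example:

```python
assert transfer_precision2bytes("float32") == 4
assert transfer_precision2bytes("bfloat16") == 2
assert transfer_precision2bytes("4bit") == 0.5  # packed 4-bit weights: half a byte each
```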
209
 
210
+ if __name__ == "__main__":
211
+ print(analyze_gpu_stats(parse_nvidia_smi()))