Grigori Fursin committed
Commit 4d2d78e · unverified · 1 parent: 37e9e3e

first commit

Files changed (10)
  1. .python-version +1 -0
  2. README.md +31 -2
  3. __init__.py +0 -0
  4. app.py +1445 -0
  5. cost_calculator.py +137 -0
  6. data.json +0 -0
  7. predictor.py +900 -0
  8. recommender.py +97 -0
  9. requirements.txt +11 -0
  10. utils.py +115 -0
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
  title: FlexBoard
- emoji: 🏢
+ emoji: 🐢
  colorFrom: blue
+ colorTo: indigo
- colorTo: yellow
  sdk: gradio
  sdk_version: 5.30.0
  app_file: app.py
@@ -12,3 +12,32 @@ short_description: FlexBoard to analyze FlexBench and MLPerf results
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # FlexBoard
+
+ ## Installation
+
+ ```bash
+ # Create a virtual environment
+ python -m venv .venv
+
+ # Activate the virtual environment
+ source .venv/bin/activate
+
+ # Install the required packages
+ pip install -r requirements.txt
+
+ # Run the application
+ python -m app
+ ```
+
+ ## License and Copyright
+
+ This project is licensed under the [Apache License 2.0](LICENSE.md).
+
+ © 2025 FlexAI
+
+ ## Authors and maintainers
+
+ [Daniel Altunay](https://www.linkedin.com/in/daltunay) and [Grigori Fursin](https://cKnowledge.org/gfursin) (FCS Labs)
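The app added in this commit matches continuous features (model size, memory capacity, etc.) within a ±10% tolerance rather than requiring exact values. As a minimal, self-contained sketch of that filtering idea (`within_tolerance` is an illustrative name, not the repository's API):

```python
import pandas as pd


def within_tolerance(
    df: pd.DataFrame, column: str, value: float, tolerance: float = 0.1
) -> pd.DataFrame:
    """Keep rows whose `column` lies within ±tolerance of `value`."""
    lower, upper = value * (1 - tolerance), value * (1 + tolerance)
    return df[(df[column] >= lower) & (df[column] <= upper)]


# Hypothetical data: a 70B query matches 65B-72B configs but not 7B or 180B.
configs = pd.DataFrame({"model.number_of_parameters": [7.0, 65.0, 70.0, 72.0, 180.0]})
matched = within_tolerance(configs, "model.number_of_parameters", 70.0)
print(matched["model.number_of_parameters"].tolist())  # [65.0, 70.0, 72.0]
```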
__init__.py ADDED
File without changes
app.py ADDED
@@ -0,0 +1,1445 @@
+ """MLPerf Hardware Configuration Finder application."""
+
+ import logging
+ import os
+
+ import gradio as gr
+ import pandas as pd
+ import plotly.graph_objects as go
+ import polars as pl
+ from cost_calculator import (
+     calculate_costs,
+     get_device_costs,
+     initialize_device_costs,
+     update_device_costs,
+ )
+ from plotly.subplots import make_subplots
+ from predictor import PerformancePredictor
+ from recommender import ConfigurationFinder
+
+ from utils import get_feature_type, load_data
+
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ logger.info("Loading benchmark data...")
+ df = load_data()
+ pd_df = df.to_pandas() if not df.is_empty() else pd.DataFrame()
+ logger.info(f"Loaded {len(pd_df)} benchmark records total")
+
+ initialize_device_costs(pd_df)
+
+ predictor = PerformancePredictor(pd_df) if not pd_df.empty else None
+ config_finder = ConfigurationFinder(pd_df) if not pd_df.empty else None
+
+
+ def extract_metadata(df: pl.DataFrame) -> dict:
+     """Extract metadata for UI filters from dataset."""
+     metadata = {}
+     if df.is_empty():
+         return metadata
+
+     metadata["architectures"] = sorted(
+         df.filter(pl.col("model.architecture").is_not_null())
+         .get_column("model.architecture")
+         .unique()
+         .to_list()
+     )
+
+     model_sizes = sorted(
+         df.filter(pl.col("model.number_of_parameters").is_not_null())
+         .get_column("model.number_of_parameters")
+         .unique()
+         .to_list()
+     )
+     if model_sizes:
+         metadata["model_sizes"] = model_sizes
+         metadata["model_size_min"] = min(model_sizes)
+         metadata["model_size_max"] = max(model_sizes)
+         metadata["model_size_values"] = sorted(model_sizes)
+
+     metadata["weight_data_types"] = sorted(
+         df.filter(pl.col("model.weight_data_types").is_not_null())
+         .get_column("model.weight_data_types")
+         .unique()
+         .to_list()
+     )
+
+     metadata["accelerator_vendors"] = sorted(
+         df.filter(pl.col("system.accelerator.vendor").is_not_null())
+         .get_column("system.accelerator.vendor")
+         .unique()
+         .to_list()
+     )
+
+     metadata["cpu_vendors"] = sorted(
+         df.filter(pl.col("system.cpu.vendor").is_not_null())
+         .get_column("system.cpu.vendor")
+         .unique()
+         .to_list()
+     )
+
+     metadata["accelerator_models"] = sorted(
+         df.filter(pl.col("system.accelerator.name").is_not_null())
+         .get_column("system.accelerator.name")
+         .unique()
+         .to_list()
+     )
+
+     metadata["cpu_models"] = sorted(
+         df.filter(pl.col("system.cpu.model").is_not_null())
+         .get_column("system.cpu.model")
+         .unique()
+         .to_list()
+     )
+
+     memory_values = df.filter(
+         pl.col("system.accelerator.memory_capacity").is_not_null()
+     )
+     metadata["gpu_memory_min"] = max(
+         1,
+         round(
+             float(memory_values.get_column("system.accelerator.memory_capacity").min())
+         ),
+     )
+     metadata["gpu_memory_max"] = min(
+         1024,
+         round(
+             float(memory_values.get_column("system.accelerator.memory_capacity").max())
+         ),
+     )
+
+     memory_values = df.filter(pl.col("system.memory.capacity").is_not_null())
+     metadata["cpu_memory_min"] = max(
+         1, round(float(memory_values.get_column("system.memory.capacity").min()))
+     )
+     metadata["cpu_memory_max"] = min(
+         16384, round(float(memory_values.get_column("system.memory.capacity").max()))
+     )
+
+     metadata["interconnect_types"] = sorted(
+         df.filter(pl.col("system.interconnect.accelerator").is_not_null())
+         .get_column("system.interconnect.accelerator")
+         .unique()
+         .to_list()
+     )
+
+     acc_counts = sorted(
+         df.filter(pl.col("system.accelerator.total_count").is_not_null())
+         .get_column("system.accelerator.total_count")
+         .unique()
+         .cast(pl.Int64)
+         .to_list()
+     )
+     metadata["accelerator_counts"] = acc_counts
+     metadata["min_accelerators"] = min(acc_counts)
+     metadata["max_accelerators"] = max(acc_counts)
+
+     metadata["node_counts"] = sorted(
+         df.filter(pl.col("system.number_of_nodes").is_not_null())
+         .get_column("system.number_of_nodes")
+         .unique()
+         .cast(pl.Int64)
+         .to_list()
+     )
+
+     frameworks = []
+     for col in df.columns:
+         if col.startswith("software.framework.") and col != "software.framework":
+             framework_name = col.replace("software.framework.", "")
+             frameworks.append(framework_name)
+             versions = (
+                 df.filter(pl.col(col).is_not_null()).get_column(col).unique().to_list()
+             )
+             if versions:
+                 metadata[f"{framework_name}_versions"] = sorted(versions)
+
+     metadata["frameworks"] = sorted(frameworks)
+
+     metadata["operating_systems"] = sorted(
+         df.filter(pl.col("software.operating_system").is_not_null())
+         .get_column("software.operating_system")
+         .unique()
+         .to_list()
+     )
+
+     result_per_acc = df.filter(pl.col("metrics.result_per_accelerator").is_not_null())
+     metadata["result_per_accelerator_ranges"] = {
+         "min": float(result_per_acc.get_column("metrics.result_per_accelerator").min()),
+         "max": float(result_per_acc.get_column("metrics.result_per_accelerator").max()),
+         "median": float(
+             result_per_acc.get_column("metrics.result_per_accelerator").median()
+         ),
+     }
+
+     return metadata
+
+
+ metadata = extract_metadata(df)
+
+
+ def apply_continuous_feature_tolerance(
+     df: pd.DataFrame, feature: str, value: float, tolerance: float = 0.1
+ ) -> pd.DataFrame:
+     """Apply tolerance for continuous feature searches."""
+     lower_bound = value * (1 - tolerance)
+     upper_bound = value * (1 + tolerance)
+     return df[(df[feature] >= lower_bound) & (df[feature] <= upper_bound)]
+
+
+ def find_best_configs(
+     workload_specs: dict,
+     constraints: dict,
+     include_predictions: bool = True,
+     optimization_metric: str = "performance",
+ ) -> pd.DataFrame:
+     """Find best hardware configurations for workload."""
+     if pd_df.empty:
+         return pd.DataFrame()
+
+     filtered_df = pd_df.copy()
+
+     if workload_specs.get("model_size") is not None:
+         filtered_df = apply_continuous_feature_tolerance(
+             filtered_df,
+             "model.number_of_parameters",
+             float(workload_specs["model_size"]),
+         )
+
+     if (
+         workload_specs.get("weight_data_type")
+         and workload_specs["weight_data_type"] != "Any"
+     ):
+         filtered_df = filtered_df[
+             filtered_df["model.weight_data_types"] == workload_specs["weight_data_type"]
+         ]
+
+     if workload_specs.get("architecture") and workload_specs["architecture"] != "Any":
+         filtered_df = filtered_df[
+             filtered_df["model.architecture"] == workload_specs["architecture"]
+         ]
+
+     clean_constraints = {k: v for k, v in constraints.items() if v and v != "Any"}
+
+     for feature, value in clean_constraints.items():
+         if feature in filtered_df.columns:
+             if get_feature_type(feature) == "continuous":
+                 filtered_df = apply_continuous_feature_tolerance(
+                     filtered_df, feature, float(value)
+                 )
+             else:
+                 filtered_df = filtered_df[filtered_df[feature] == value]
+
+     if constraints.get("min_gpu_memory") is not None:
+         filtered_df = filtered_df[
+             filtered_df["system.accelerator.memory_capacity"]
+             >= constraints["min_gpu_memory"]
+         ]
+
+     if constraints.get("max_gpu_memory") is not None:
+         filtered_df = filtered_df[
+             filtered_df["system.accelerator.memory_capacity"]
+             <= constraints["max_gpu_memory"]
+         ]
+
+     if constraints.get("min_cpu_memory") is not None:
+         filtered_df = filtered_df[
+             filtered_df["system.memory.capacity"] >= constraints["min_cpu_memory"]
+         ]
+
+     if constraints.get("max_cpu_memory") is not None:
+         filtered_df = filtered_df[
+             filtered_df["system.memory.capacity"] <= constraints["max_cpu_memory"]
+         ]
+
+     if constraints.get("min_accelerators") is not None:
+         filtered_df = filtered_df[
+             filtered_df["system.accelerator.total_count"]
+             >= constraints["min_accelerators"]
+         ]
+
+     if constraints.get("max_accelerators") is not None:
+         filtered_df = filtered_df[
+             filtered_df["system.accelerator.total_count"]
+             <= constraints["max_accelerators"]
+         ]
+
+     if (
+         include_predictions
+         and predictor
+         and workload_specs.get("model_size")
+         and workload_specs.get("architecture")
+     ):
+         predicted_df = predictor.generate_predictions(
+             architecture=workload_specs["architecture"],
+             parameters=float(workload_specs["model_size"]),
+             constraints=clean_constraints,
+             num_configs=20,
+         )
+
+         if not predicted_df.empty:
+             predicted_df = calculate_costs(predicted_df)
+
+             if not filtered_df.empty:
+                 filtered_df = calculate_costs(filtered_df)
+                 filtered_df["predicted"] = False
+                 combined_df = pd.concat([filtered_df, predicted_df], ignore_index=True)
+             else:
+                 combined_df = predicted_df
+
+             sort_col = (
+                 "cost_per_million_tokens"
+                 if optimization_metric == "cost"
+                 else "metrics.result_per_accelerator"
+             )
+             asc = optimization_metric == "cost"
+             return combined_df.sort_values(by=sort_col, ascending=asc)
+
+     if not filtered_df.empty:
+         filtered_df = calculate_costs(filtered_df)
+         filtered_df["predicted"] = False
+
+         sort_col = (
+             "cost_per_million_tokens"
+             if optimization_metric == "cost"
+             else "metrics.result_per_accelerator"
+         )
+         asc = optimization_metric == "cost"
+         return filtered_df.sort_values(by=sort_col, ascending=asc)
+
+     return pd.DataFrame()
+
+
+ def format_recommendations(configs_df: pd.DataFrame) -> pd.DataFrame:
+     """Format recommendations for display."""
+     if configs_df.empty:
+         return pd.DataFrame(
+             columns=[
+                 "System",
+                 "Accelerator",
+                 "Count",
+                 "Nodes",
+                 "GPU Memory (GB)",
+                 "Model",
+                 "Architecture",
+                 "Parameters (B)",
+                 "Weight Data Type",
+                 "Total Performance (Tokens/s)",
+                 "Per-GPU Performance (Tokens/s)",
+                 "Hourly Cost ($)",
+                 "Cost/Million Tokens",
+                 "Predicted",
+             ]
+         )
+
+     display_columns = {
+         "system.name": "System",
+         "system.accelerator.name": "Accelerator",
+         "system.accelerator.total_count": "Count",
+         "system.number_of_nodes": "Nodes",
+         "system.accelerator.memory_capacity": "GPU Memory (GB)",
+         "model.name": "Model",
+         "model.architecture": "Architecture",
+         "model.number_of_parameters": "Parameters (B)",
+         "model.weight_data_types": "Weight Data Type",
+         "metrics.result": "Total Performance (Tokens/s)",
+         "metrics.result_per_accelerator": "Per-GPU Performance (Tokens/s)",
+         "hourly_cost": "Hourly Cost ($)",
+         "cost_per_million_tokens": "Cost/Million Tokens",
+         "predicted": "Predicted",
+     }
+
+     result_df = pd.DataFrame()
+     for col_name, display_name in display_columns.items():
+         if col_name in configs_df.columns:
+             result_df[display_name] = configs_df[col_name]
+         else:
+             result_df[display_name] = "N/A" if col_name != "predicted" else "No"
+
+     numeric_columns = [
+         "Count",
+         "Nodes",
+         "GPU Memory (GB)",
+         "Parameters (B)",
+         "Total Performance (Tokens/s)",
+         "Per-GPU Performance (Tokens/s)",
+         "Hourly Cost ($)",
+         "Cost/Million Tokens",
+     ]
+
+     for col in numeric_columns:
+         if col in result_df.columns:
+             result_df[col] = pd.to_numeric(result_df[col], errors="coerce")
+
+     result_df["Total Performance (Tokens/s)"] = result_df[
+         "Total Performance (Tokens/s)"
+     ].round(4)
+     result_df["Per-GPU Performance (Tokens/s)"] = result_df[
+         "Per-GPU Performance (Tokens/s)"
+     ].round(4)
+     result_df["GPU Memory (GB)"] = result_df["GPU Memory (GB)"].round(2)
+     result_df["Cost/Million Tokens"] = result_df["Cost/Million Tokens"].round(4)
+     result_df["Hourly Cost ($)"] = result_df["Hourly Cost ($)"].round(4)
+
+     if "Parameters (B)" in result_df.columns:
+         result_df["Parameters (B)"] = result_df["Parameters (B)"].round(2)
+
+     if "Predicted" in result_df.columns:
+         result_df["Predicted"] = result_df["Predicted"].map(
+             lambda x: "Yes" if x is True else "No"
+         )
+
+     result_df = result_df.drop_duplicates()
+
+     return result_df
+
+
+ def get_top_config_details(configs_df: pd.DataFrame) -> pd.DataFrame:
+     """Extract details for the top recommendation."""
+     if configs_df.empty:
+         return pd.DataFrame(columns=["Feature", "Value"])
+
+     top_config = configs_df.iloc[0]
+     is_predicted = "predicted" in top_config and top_config["predicted"]
+
+     details = {
+         "Feature": [
+             "System",
+             "Accelerator",
+             "Accelerator Count",
+             "Accelerator Vendor",
+             "Memory Capacity",
+             "CPU",
+             "CPU Vendor",
+             "Nodes",
+             "Devices per Node",
+             "Interconnect",
+             "Total Performance (Tokens/s)",
+             "Per-Accelerator Performance (Tokens/s)",
+             "Hourly Cost (estimated)",
+             "Cost per Million Tokens",
+             "Prediction Status",
+         ],
+         "Value": [
+             top_config.get("system.name", "N/A"),
+             top_config.get("system.accelerator.name", "N/A"),
+             top_config.get("system.accelerator.total_count", "N/A"),
+             top_config.get("system.accelerator.vendor", "N/A"),
+             (
+                 f"{float(top_config.get('system.accelerator.memory_capacity', 0)):.1f}GB"
+                 if top_config.get("system.accelerator.memory_capacity") is not None
+                 else "N/A"
+             ),
+             top_config.get("system.cpu.model", "N/A"),
+             top_config.get("system.cpu.vendor", "N/A"),
+             top_config.get("system.number_of_nodes", "N/A"),
+             top_config.get("system.accelerator.count_per_node", "N/A"),
+             top_config.get("system.interconnect.accelerator", "N/A"),
+             (
+                 f"{float(top_config.get('metrics.result', 0)):.4f}"
+                 if top_config.get("metrics.result") is not None
+                 else "N/A"
+             ),
+             (
+                 f"{float(top_config.get('metrics.result_per_accelerator', 0)):.4f}"
+                 if top_config.get("metrics.result_per_accelerator") is not None
+                 else "N/A"
+             ),
+             (
+                 f"${float(top_config.get('hourly_cost', 0)):.4f}"
+                 if top_config.get("hourly_cost") is not None
+                 else "N/A"
+             ),
+             (
+                 f"${float(top_config.get('cost_per_million_tokens', 0)):.4f}"
+                 if top_config.get("cost_per_million_tokens") is not None
+                 else "N/A"
+             ),
+             "Predicted" if is_predicted else "Actual data",
+         ],
+     }
+
+     return pd.DataFrame(details)
+
+
+ def create_top_configs_plot(
+     configs_df: pd.DataFrame, optimization_metric: str = "performance", top_n: int = 10
+ ) -> go.Figure:
+     """Create a bar plot of top configurations based on the optimization metric."""
+     if configs_df.empty:
+         fig = go.Figure()
+         fig.update_layout(
+             title="No configurations found",
+             xaxis_title="Value",
+             yaxis_title="Rank",
+             template="plotly_white",
+             height=600,
+         )
+         return fig
+
+     if optimization_metric == "cost":
+         sort_col = "cost_per_million_tokens"
+         display_col = "Cost/Million Tokens ($)"
+         configs_df = configs_df.sort_values(by=sort_col, ascending=True)
+     else:
+         sort_col = "metrics.result_per_accelerator"
+         display_col = "Performance (Tokens/s per device)"
+         configs_df = configs_df.sort_values(by=sort_col, ascending=False)
+
+     top_configs = configs_df.head(top_n)
+
+     ranks = [f"#{i + 1}" for i in range(len(top_configs))]
+
+     if optimization_metric == "cost":
+         x_values = top_configs["cost_per_million_tokens"]
+         color = "crimson"
+     else:
+         x_values = top_configs["metrics.result_per_accelerator"]
+         color = "royalblue"
+
+     hover_text = []
+     for _, row in top_configs.iterrows():
+         system = row.get("system.name", "Unknown")
+         acc_name = row.get("system.accelerator.name", "Unknown")
+         acc_count = row.get("system.accelerator.total_count", "?")
+         total_perf = row.get("metrics.result", 0)
+         per_acc_perf = row.get("metrics.result_per_accelerator", 0)
+         cost = row.get("hourly_cost", 0)
+         cost_per_million = row.get("cost_per_million_tokens", 0) or 0
+         predicted = "Yes" if row.get("predicted", False) else "No"
+
+         info = f"System: {system}<br>"
+         info += f"Config: {acc_count}× {acc_name}<br>"
+         info += f"Tokens/s (total): {total_perf:.4f}<br>"
+         info += f"Tokens/s (per device): {per_acc_perf:.4f}<br>"
+         info += f"Hourly cost: ${cost:.4f}<br>"
+         info += f"Cost per million tokens: ${cost_per_million:.4f}<br>"
+         info += f"Predicted: {predicted}"
+         hover_text.append(info)
+
+     fig = go.Figure()
+     fig.add_trace(
+         go.Bar(
+             y=ranks,
+             x=x_values,
+             text=x_values.apply(lambda x: f"{x:.4f}"),
+             textposition="auto",
+             marker=dict(color=color),
+             hovertext=hover_text,
+             hoverinfo="text",
+             orientation="h",
+         )
+     )
+
+     title = f"Top {len(ranks)} Configurations by {'Cost' if optimization_metric == 'cost' else 'Performance'}"
+     fig.update_layout(
+         title=title,
+         xaxis_title=display_col,
+         yaxis_title="Rank",
+         template="plotly_white",
+         height=max(400, min(20 * len(ranks), 800)),
+         margin=dict(l=50),
+     )
+
+     return fig
+
+
+ def recommend_hardware(
+     model_size: float,
+     weight_data_type: str,
+     architecture: str,
+     accelerator_vendor: str,
+     accelerator_model: str,
+     min_gpu_memory: float | None,
+     max_gpu_memory: float | None,
+     interconnect: str,
+     min_accelerators: int | None,
+     max_accelerators: int | None,
+     cpu_vendor: str,
+     cpu_model: str,
+     nodes: str,
+     min_cpu_memory: float | None,
+     max_cpu_memory: float | None,
+     os: str,
+     include_predictions: bool = True,
+     optimization_metric: str = "performance",
+     top_n_configs: int = 10,
+     **framework_versions,
+ ) -> tuple[pd.DataFrame, pd.DataFrame, str, go.Figure]:
+     """Find hardware configurations matching requirements."""
+     workload_specs = {
+         "model_size": model_size,
+         "weight_data_type": weight_data_type,
+         "architecture": architecture,
+     }
+
+     constraints = {
+         "system.accelerator.vendor": accelerator_vendor,
+         "system.accelerator.name": accelerator_model,
+         "system.interconnect.accelerator": interconnect,
+         "system.cpu.vendor": cpu_vendor,
+         "system.cpu.model": cpu_model,
+         "system.number_of_nodes": nodes if nodes != "Any" else None,
+         "software.operating_system": os,
+         "min_gpu_memory": min_gpu_memory,
+         "max_gpu_memory": max_gpu_memory,
+         "min_cpu_memory": min_cpu_memory,
+         "max_cpu_memory": max_cpu_memory,
+         "min_accelerators": min_accelerators,
+         "max_accelerators": max_accelerators,
+     }
+
+     for fw_name, version in framework_versions.items():
+         if version != "Any":
+             constraints[f"software.framework.{fw_name}"] = version
+
+     best_configs = find_best_configs(
+         workload_specs, constraints, include_predictions, optimization_metric
+     )
+     recommendations_df = format_recommendations(best_configs)
+     details_df = get_top_config_details(best_configs)
+
+     top_configs_chart = create_top_configs_plot(
+         best_configs, optimization_metric, top_n_configs
+     )
+
+     if best_configs.empty:
+         summary = "No matching configurations found. Try relaxing some constraints or changing the model parameters."
+     else:
+         actual_count = (
+             sum(~best_configs["predicted"])
+             if "predicted" in best_configs.columns
+             else len(best_configs)
+         )
+         predicted_count = (
+             sum(best_configs["predicted"]) if "predicted" in best_configs.columns else 0
+         )
+
+         top_config = best_configs.iloc[0]
+         is_predicted = "predicted" in top_config and top_config["predicted"]
+
+         if optimization_metric == "cost":
+             metric_value = f"${float(top_config.get('cost_per_million_tokens', 0)):.4f} per million tokens"
+             metric_name = "cost"
+         else:
+             metric_value = f"{float(top_config.get('metrics.result_per_accelerator', 0)):.4f} tokens/s per device"
+             metric_name = "performance"
+
+         acc = top_config.get("system.accelerator.name", "Unknown")
+         count = top_config.get("system.accelerator.total_count", "Unknown")
+
+         summary = f"Found {actual_count} actual and {predicted_count} predicted configurations. "
+         summary += f"\nTop recommendation optimized for {metric_name}: {count}× {acc} with {metric_value}"
+         if is_predicted:
+             summary += " (Predicted)"
+
+     return recommendations_df, details_df, summary, top_configs_chart
+
+
+ def create_model_performance_plot(
+     predictor: PerformancePredictor,
+ ) -> tuple[go.Figure, dict, pd.DataFrame]:
+     """Create performance visualization for the ML model using Plotly."""
+     logger.info("Starting to create model performance plot")
+
+     empty_metrics = {"rmse": 0, "mae": 0, "r2": 0, "mape": 0}
+     empty_df = pd.DataFrame(columns=["Feature", "Importance"])
+
+     empty_fig = make_subplots(
+         rows=2,
+         cols=2,
+         subplot_titles=(
+             "Predicted vs Actual Performance",
+             "Residual Plot (% Error)",
+             "Distribution of Prediction Errors",
+             "Top 10 Feature Importance",
+         ),
+     )
+     empty_fig.update_layout(
+         height=800,
+         width=1200,
+         showlegend=False,
+         title_text="No Model Evaluation Data Available",
+         annotations=[
+             dict(
+                 text="Train the model with test data to see evaluation metrics",
+                 showarrow=False,
+                 xref="paper",
+                 yref="paper",
+                 x=0.5,
+                 y=0.5,
+             )
+         ],
+     )
+
+     if predictor is None:
+         logger.warning("No predictor available for performance plot")
+         return empty_fig, empty_metrics, empty_df
+
+     if (
+         not hasattr(predictor, "evaluation_data")
+         or predictor.evaluation_data is None
+         or predictor.evaluation_data.empty
+     ):
+         logger.warning("Evaluation data not found, attempting to re-train model")
+         try:
+             predictor._train_model()
+         except Exception as e:
+             logger.error(f"Error re-training model: {e}")
+
+     eval_data = predictor.get_evaluation_data()
+     metrics = predictor.get_evaluation_metrics()
+     feature_importance = predictor.get_feature_importance()
+
+     logger.info(f"Retrieved evaluation data: {type(eval_data)}")
+     if eval_data is not None:
+         logger.info(
+             f"Evaluation data shape: {eval_data.shape if not eval_data.empty else 'empty'}"
+         )
+
+     if eval_data is None or eval_data.empty:
+         logger.warning("Evaluation data is not available")
+         return (
+             empty_fig,
+             empty_metrics,
+             feature_importance if feature_importance is not None else empty_df,
+         )
+
+     logger.info(f"First few rows of evaluation data: {eval_data.head(3).to_dict()}")
+
+     fig = make_subplots(
+         rows=2,
+         cols=2,
+         subplot_titles=(
+             "Predicted vs Actual Performance",
+             "Residual Plot (% Error)",
+             "Distribution of Prediction Errors",
+             "Top 10 Feature Importance",
+         ),
+     )
+
+     hover_text = [
+         f"Accelerator: {acc}<br>"
+         f"Vendor: {vendor}<br>"
+         f"Count: {count}<br>"
+         f"Actual: {actual:.4f}<br>"
+         f"Predicted: {pred:.4f}<br>"
+         f"Error: {error:.2f} ({err_pct:.2f}%)"
+         for acc, vendor, count, actual, pred, error, err_pct in zip(
+             eval_data["system.accelerator.name"],
+             eval_data["system.accelerator.vendor"],
+             eval_data["system.accelerator.total_count"],
+             eval_data["actual"],
+             eval_data["predicted"],
+             eval_data["error"],
+             eval_data["error_percent"],
+         )
+     ]
+
+     fig.add_trace(
+         go.Scatter(
+             x=eval_data["actual"],
+             y=eval_data["predicted"],
+             mode="markers",
+             marker=dict(
+                 opacity=0.6,
+                 color=eval_data["error_percent"],
+                 colorscale="RdBu_r",
+                 colorbar=dict(title="Error %"),
+                 cmin=-30,
+                 cmax=30,
+             ),
+             text=hover_text,
+             hoverinfo="text",
+             name="Predictions",
+         ),
+         row=1,
+         col=1,
+     )
+
+     max_val = max(eval_data["actual"].max(), eval_data["predicted"].max())
+     min_val = min(eval_data["actual"].min(), eval_data["predicted"].min())
+
+     fig.add_trace(
+         go.Scatter(
+             x=[min_val, max_val],
+             y=[min_val, max_val],
+             mode="lines",
+             line=dict(color="red", dash="dash"),
+             name="Perfect Prediction",
+             hoverinfo="none",
+         ),
+         row=1,
+         col=1,
+     )
+
+     fig.add_trace(
+         go.Scatter(
+             x=eval_data["predicted"],
+             y=eval_data["error_percent"],
+             mode="markers",
+             marker=dict(
+                 opacity=0.6,
+                 color=eval_data["error_percent"],
+                 colorscale="RdBu_r",
+                 colorbar=dict(title="Error %"),
+                 showscale=False,
+                 cmin=-30,
+                 cmax=30,
+             ),
+             text=hover_text,
+             hoverinfo="text",
+             name="Errors",
+         ),
+         row=1,
+         col=2,
+     )
+
+     fig.add_trace(
+         go.Histogram(
+             x=eval_data["error_percent"],
+             nbinsx=20,
+             marker=dict(color="blue", opacity=0.7, line=dict(color="black", width=1)),
+             name="Error Distribution",
+         ),
+         row=2,
+         col=1,
+     )
+
+     fig.add_vline(x=0, line_dash="dash", line_color="red", row=2, col=1)
+
+     top_features = feature_importance.head(10).sort_values("Importance")
+
+     fig.add_trace(
+         go.Bar(
+             y=top_features["Feature"],
+             x=top_features["Importance"],
+             orientation="h",
+             marker=dict(color="blue"),
+             name="Feature Importance",
+         ),
+         row=2,
+         col=2,
+     )
+
+     fig.update_xaxes(title_text="Actual Performance (tokens/s)", row=1, col=1)
+     fig.update_yaxes(title_text="Predicted Performance (tokens/s)", row=1, col=1)
+
+     fig.update_xaxes(title_text="Predicted Value", row=1, col=2)
+     fig.update_yaxes(title_text="Error (%)", row=1, col=2)
+
+     fig.update_xaxes(title_text="Prediction Error (%)", row=2, col=1)
+     fig.update_yaxes(title_text="Frequency", row=2, col=1)
+
+     fig.update_xaxes(title_text="Importance", row=2, col=2)
+
+     fig.update_layout(
+         height=800,
+         width=1200,
+         autosize=True,
+         showlegend=False,
+         title_text="Model Performance Analysis",
+     )
+
+     logger.info("Successfully created model performance plot")
+     return fig, metrics, feature_importance.head(10)
+
+
+ with gr.Blocks(title="MLPerf Configuration Finder") as interface:
+     gr.Markdown(
+         """
+         # 🔍 MLPerf Configuration Finder (ongoing preliminary work)
+
+         Find the optimal configurations for your AI workloads by specifying your model and constraints.
+         Results are ranked by performance and include both real benchmark data and AI-generated predictions.
+
+         *All configurations include a ±10% tolerance for continuous features like model size, memory capacity, etc.*
+         """
+     )
+
+     with gr.Row():
+         status_msg = gr.Markdown(
+             "*Ready to search. Enter your criteria and click 'Search Configurations'.*"
+         )
+
+     with gr.Tabs():
+         with gr.TabItem("Workload Specifications"):
+             with gr.Accordion("Model Specifications", open=True):
+                 with gr.Row():
+                     architecture = gr.Dropdown(
+                         choices=["Any"] + metadata.get("architectures", []),
+                         label="Architecture",
+                         value="LLM",
+                         info="Model architecture type",
+                     )
+                     weight_data_type = gr.Dropdown(
+                         choices=["Any"] + metadata.get("weight_data_types", []),
+                         label="Weight Data Type",
+                         value="Any",
+                         info="Precision format for model weights",
+                     )
+
+                 model_size = gr.Slider(
+                     minimum=metadata.get("model_size_min"),
+                     maximum=metadata.get("model_size_max"),
+                     value=70,
+                     step=1,
+                     label="Model Size (billions of parameters)",
+                     info="Number of parameters in billions",
+                 )
+
+             with gr.Accordion("Accelerator (GPU/TPU) Specifications", open=False):
+                 with gr.Row():
+                     accelerator_vendor = gr.Dropdown(
+                         choices=["Any"] + metadata.get("accelerator_vendors", []),
+                         label="Vendor",
+                         value="Any",
+                         info="Hardware manufacturer",
+                     )
+                     accelerator_model = gr.Dropdown(
+                         choices=["Any"] + metadata.get("accelerator_models", []),
+                         label="Model",
+                         value="Any",
+                         info="Specific accelerator model",
+                     )
+
+                 with gr.Row():
+                     min_gpu_memory = gr.Slider(
+                         minimum=metadata.get("gpu_memory_min"),
+                         maximum=metadata.get("gpu_memory_max"),
+                         value=metadata.get("gpu_memory_min"),
911
+ step=1,
912
+ label="Min GPU Memory (GB)",
913
+ info="Minimum GPU memory capacity needed",
914
+ )
915
+ max_gpu_memory = gr.Slider(
916
+ minimum=metadata.get("gpu_memory_min"),
917
+ maximum=metadata.get("gpu_memory_max"),
918
+ value=metadata.get("gpu_memory_max"),
919
+ step=1,
920
+ label="Max GPU Memory (GB)",
921
+ info="Maximum GPU memory capacity to consider",
922
+ )
923
+
924
+ with gr.Row():
925
+ interconnect = gr.Dropdown(
926
+ choices=["Any"] + metadata.get("interconnect_types", []),
927
+ label="Interconnect",
928
+ value="Any",
929
+ info="GPU-to-GPU connection type",
930
+ )
931
+
932
+ with gr.Row():
933
+ min_accelerators = gr.Slider(
934
+ minimum=metadata.get("min_accelerators"),
935
+ maximum=metadata.get("max_accelerators"),
936
+ value=metadata.get("min_accelerators"),
937
+ step=1,
938
+ label="Minimum Accelerators",
939
+ info="Minimum number of accelerators needed",
940
+ )
941
+ max_accelerators = gr.Slider(
942
+ minimum=metadata.get("min_accelerators"),
943
+ maximum=metadata.get("max_accelerators"),
944
+ value=metadata.get("max_accelerators"),
945
+ step=1,
946
+ label="Maximum Accelerators",
947
+ info="Maximum number of accelerators to consider",
948
+ )
949
+
950
+ with gr.Accordion("CPU & System Specifications", open=False):
951
+ with gr.Row():
952
+ cpu_vendor = gr.Dropdown(
953
+ choices=["Any"] + metadata.get("cpu_vendors", []),
954
+ label="CPU Vendor",
955
+ value="Any",
956
+ info="CPU manufacturer",
957
+ )
958
+ cpu_model = gr.Dropdown(
959
+ choices=["Any"] + metadata.get("cpu_models", []),
960
+ label="CPU Model",
961
+ value="Any",
962
+ info="Specific CPU model",
963
+ )
964
+
965
+ nodes = gr.Dropdown(
966
+ choices=["Any"] + [str(n) for n in metadata.get("node_counts", [])],
967
+ label="Number of Nodes",
968
+ value="Any",
969
+ info="Number of physical servers in the system",
970
+ )
971
+
972
+ with gr.Row():
973
+ min_cpu_memory = gr.Slider(
974
+ minimum=metadata.get("cpu_memory_min"),
975
+ maximum=metadata.get("cpu_memory_max"),
976
+ value=metadata.get("cpu_memory_min"),
977
+ step=1,
978
+ label="Min System Memory (GB)",
979
+ info="Minimum system RAM needed",
980
+ )
981
+ max_cpu_memory = gr.Slider(
982
+ minimum=metadata.get("cpu_memory_min"),
983
+ maximum=metadata.get("cpu_memory_max"),
984
+ value=metadata.get("cpu_memory_max"),
985
+ step=1,
986
+ label="Max System Memory (GB)",
987
+ info="Maximum system RAM to consider",
988
+ )
989
+
990
+ with gr.Accordion("Software Environment", open=False):
991
+ os = gr.Dropdown(
992
+ choices=["Any"] + metadata.get("operating_systems", []),
993
+ label="Operating System",
994
+ value="Any",
995
+ info="Host operating system",
996
+ )
997
+
998
+ frameworks = [
999
+ fw
1000
+ for fw in metadata.get("frameworks", [])
1001
+ if f"{fw}_versions" in metadata
1002
+ ]
1003
+ n_frameworks = len(frameworks)
1004
+ column_size = (n_frameworks + 1) // 2
1005
+
1006
+ framework_dropdowns = []
1007
+ with gr.Row():
1008
+ for i in range(0, 2):
1009
+ with gr.Column():
1010
+ start_idx = i * column_size
1011
+ end_idx = min((i + 1) * column_size, n_frameworks)
1012
+
1013
+ if start_idx < n_frameworks:
1014
+ column_frameworks = frameworks[start_idx:end_idx]
1015
+ for fw in column_frameworks:
1016
+ version_key = f"{fw}_versions"
1017
+ dropdown = gr.Dropdown(
1018
+ choices=["Any"] + metadata.get(version_key),
1019
+ label=fw,
1020
+ value="Any",
1021
+ info=f"Select {fw} framework version",
1022
+ )
1023
+ framework_dropdowns.append((fw, dropdown))
1024
+
1025
+ with gr.TabItem("Device Cost Settings 💰"):
1026
+ gr.Markdown(
1027
+ """
1028
+ ## Configure Device Hourly Costs
1029
+
1030
+ Customize the hourly cost (in USD) for each accelerator type. These values will be used to
1031
+ calculate the cost metrics for hardware configurations.
1032
+
1033
+ Default values may not reflect actual current market prices. Please adjust them according to your needs.
1034
+ """
1035
+ )
1036
+
1037
+ with gr.Column():
1038
+ with gr.Row():
1039
+ save_costs_button = gr.Button(
1040
+ "💾 Save Cost Settings", variant="primary"
1041
+ )
1042
+ reset_costs_button = gr.Button("↻ Reset to Defaults")
1043
+
1044
+ current_costs = get_device_costs()
1045
+ cost_data = pd.DataFrame(
1046
+ {
1047
+ "Device": list(current_costs.keys()),
1048
+ "Hourly Cost ($)": list(current_costs.values()),
1049
+ }
1050
+ ).sort_values("Device")
1051
+
1052
+ device_costs_df = gr.DataFrame(
1053
+ value=cost_data,
1054
+ datatype=["str", "number"],
1055
+ col_count=(2, "fixed"),
1056
+ interactive=True,
1057
+ wrap=True,
1058
+ show_copy_button=True,
1059
+ show_search="filter",
1060
+ )
1061
+
1062
+ costs_status = gr.Markdown("*Device costs ready for customization*")
1063
+
1064
+ def update_costs_callback(df):
1065
+ """Update device costs with values from dataframe."""
1066
+ if isinstance(df, list):
1067
+ new_costs = {
1068
+ row[0]: float(row[1]) for row in df if len(row) >= 2
1069
+ }
1070
+ else:
1071
+ new_costs = {
1072
+ df.loc[i, "Device"]: float(df.loc[i, "Hourly Cost ($)"])
1073
+ for i in range(len(df))
1074
+ }
1075
+
1076
+ update_device_costs(new_costs)
1077
+ return "*Device costs successfully updated!*"
1078
+
1079
+ def reset_costs_callback():
1080
+ """Reset all costs to defaults."""
1081
+ initialize_device_costs(pd_df)
1082
+ current_costs = get_device_costs()
1083
+ cost_data = pd.DataFrame(
1084
+ {
1085
+ "Device": list(current_costs.keys()),
1086
+ "Hourly Cost ($)": list(current_costs.values()),
1087
+ }
1088
+ ).sort_values("Device")
1089
+ return cost_data, "*Device costs reset to defaults*"
1090
+
1091
+ save_costs_button.click(
1092
+ fn=update_costs_callback,
1093
+ inputs=device_costs_df,
1094
+ outputs=costs_status,
1095
+ )
1096
+
1097
+ reset_costs_button.click(
1098
+ fn=reset_costs_callback,
1099
+ inputs=[],
1100
+ outputs=[device_costs_df, costs_status],
1101
+ )
1102
+
1103
+ with gr.Row():
1104
+ with gr.Accordion("Options", open=True):
1105
+ with gr.Row():
1106
+ include_predictions = gr.Checkbox(
1107
+ label="Include AI-generated predictions",
1108
+ value=True,
1109
+ info="When enabled, AI will predict performance for configurations not in the benchmark database",
1110
+ )
1111
+ optimization_metric = gr.Radio(
1112
+ choices=["performance", "cost"],
1113
+ label="Optimization Target",
1114
+ value="performance",
1115
+ info="Choose whether to optimize for highest performance or lowest cost per token",
1116
+ )
1117
+
1118
+ with gr.Row():
1119
+ search_button = gr.Button(
1120
+ "🔍 Search Configurations", variant="primary", scale=3
1121
+ )
1122
+
1123
+ with gr.Group():
1124
+ summary = gr.Markdown(
1125
+ "Enter your requirements and click 'Search Configurations' to find suitable hardware.",
1126
+ label="Summary",
1127
+ )
1128
+
1129
+ with gr.Tabs():
1130
+ with gr.TabItem("Top Configuration Details 🏆"):
1131
+ details = gr.DataFrame(
1132
+ headers=["Feature", "Value"],
1133
+ datatype=["str", "str"],
1134
+ label="Configuration Details",
1135
+ )
1136
+
1137
+ with gr.TabItem("All Matching Configurations 📊"):
1138
+ recommendations = gr.DataFrame(
1139
+ headers=[
1140
+ "System",
1141
+ "Accelerator",
1142
+ "Count",
1143
+ "Nodes",
1144
+ "GPU Memory (GB)",
1145
+ "Model",
1146
+ "Architecture",
1147
+ "Parameters (B)",
1148
+ "Weight Data Type",
1149
+ "Total Performance (Tokens/s)",
1150
+ "Per-GPU Performance (Tokens/s)",
1151
+ "Hourly Cost ($)",
1152
+ "Cost/Million Tokens",
1153
+ "Predicted",
1154
+ ],
1155
+ datatype=[
1156
+ "str",
1157
+ "str",
1158
+ "number",
1159
+ "number",
1160
+ "number",
1161
+ "str",
1162
+ "str",
1163
+ "number",
1164
+ "str",
1165
+ "number",
1166
+ "number",
1167
+ "number",
1168
+ "number",
1169
+ "str",
1170
+ ],
1171
+ label="Hardware Configurations",
1172
+ )
1173
+
1174
+ with gr.TabItem("ML Model Performance 📈"):
1175
+ gr.Markdown(
1176
+ """
1177
+ ## Model Performance Analysis
1178
+ This tab shows how well our machine learning model can predict performance for unseen hardware configurations.
1179
+ The evaluation is based on a test set that was not used to train the model.
1180
+
1181
+ **Hover over data points in the plots to see detailed information about each prediction.**
1182
+ """
1183
+ )
1184
+
1185
+ model_metrics = gr.Dataframe(
1186
+ headers=["Metric", "Value"],
1187
+ value=[
1188
+ ["Root Mean Squared Error (RMSE)", 0],
1189
+ ["Mean Absolute Error (MAE)", 0],
1190
+ ["R² Score", 0],
1191
+ ["Mean Absolute Percentage Error (MAPE)", 0],
1192
+ ],
1193
+ label="Model Performance Metrics",
1194
+ )
1195
+
1196
+ feature_importance_df = gr.Dataframe(
1197
+ headers=["Feature", "Importance"], label="Feature Importance"
1198
+ )
1199
+
1200
+ performance_plot = gr.Plot(
1201
+ label="Performance Visualization", elem_id="performance_plot"
1202
+ )
1203
+
1204
+ with gr.Row():
1205
+ gr.Markdown("## Top Configurations Comparison")
1206
+
1207
+ with gr.Row():
1208
+ top_n_configs = gr.Slider(
1209
+ minimum=1,
1210
+ maximum=100,
1211
+ value=10,
1212
+ step=1,
1213
+ label="Number of configurations to show",
1214
+ info="Adjust to see more or fewer configurations in the chart",
1215
+ )
1216
+
1217
+ with gr.Row():
1218
+ top_configs_chart = gr.Plot(label="")
1219
+
1220
+ current_configs_state = gr.State(pd.DataFrame())
1221
+
1222
+ all_inputs = [
1223
+ model_size,
1224
+ weight_data_type,
1225
+ architecture,
1226
+ accelerator_vendor,
1227
+ accelerator_model,
1228
+ min_gpu_memory,
1229
+ max_gpu_memory,
1230
+ interconnect,
1231
+ min_accelerators,
1232
+ max_accelerators,
1233
+ cpu_vendor,
1234
+ cpu_model,
1235
+ nodes,
1236
+ min_cpu_memory,
1237
+ max_cpu_memory,
1238
+ os,
1239
+ include_predictions,
1240
+ optimization_metric,
1241
+ top_n_configs,
1242
+ ]
1243
+
1244
+ framework_input_components = [dropdown for _, dropdown in framework_dropdowns]
1245
+
1246
+ def process_framework_inputs(*args):
1247
+ base_args = args[: -len(framework_dropdowns)]
1248
+ framework_args = args[-len(framework_dropdowns) :]
1249
+
1250
+ framework_versions = {}
1251
+ for (framework_name, _), version in zip(framework_dropdowns, framework_args):
1252
+ if version != "Any":
1253
+ framework_versions[framework_name] = version
1254
+
1255
+         opt_metric = base_args[17]  # optimization_metric is at index 17 of all_inputs
+
+         results = recommend_hardware(*base_args, **framework_versions)
+         recommendations_df, details_df, summary, top_chart = results
+
+         best_configs = find_best_configs(
+             {
+                 "model_size": base_args[0],
+                 "weight_data_type": base_args[1],
+                 "architecture": base_args[2],
+             },
+             constraints=get_constraints_from_args(*base_args),
+             include_predictions=base_args[16],  # include_predictions is at index 16 of all_inputs
+             optimization_metric=opt_metric,
+         )
+
+         return (
+             recommendations_df,
+             details_df,
+             summary,
+             top_chart,
+             best_configs,
+         )
+
+     def get_constraints_from_args(*args):
+         """Helper function to convert args to constraints dict."""
+         return {
+             "system.accelerator.vendor": args[3],
+             "system.accelerator.name": args[4],
+             "system.interconnect.accelerator": args[7],
+             "system.cpu.vendor": args[10],
+             "system.cpu.model": args[11],
+             "system.number_of_nodes": args[12] if args[12] != "Any" else None,
+             "software.operating_system": args[15],
+             "min_gpu_memory": args[5],
+             "max_gpu_memory": args[6],
+             "min_cpu_memory": args[13],
+             "max_cpu_memory": args[14],
+             "min_accelerators": args[8],
+             "max_accelerators": args[9],
+         }
+
+     def update_chart(n: int, configs_df: pd.DataFrame, metric: str) -> go.Figure:
+         """Update the configurations chart based on the slider value."""
+         return create_top_configs_plot(configs_df, metric, n)
+
+     search_button.click(
+         fn=process_framework_inputs,
+         inputs=all_inputs + framework_input_components,
+         outputs=[
+             recommendations,
+             details,
+             summary,
+             top_configs_chart,
+             current_configs_state,
+         ],
+         show_progress="full",
+     )
+
+     top_n_configs.change(
+         fn=update_chart,
+         inputs=[top_n_configs, current_configs_state, optimization_metric],
+         outputs=top_configs_chart,
+     )
+
+     def initial_load():
+         logger.info("Starting initial load of app")
+         default_values = []
+         for input_component in all_inputs:
+             default_values.append(input_component.value)
+
+         for _, dropdown in framework_dropdowns:
+             default_values.append(dropdown.value)
+
+         (
+             recommendations_df,
+             details_df,
+             summary_text,
+             top_chart,
+             best_configs,
+         ) = process_framework_inputs(*default_values)
+
+         if not recommendations_df.empty:
+             top_n_configs.maximum = min(100, len(recommendations_df))
+
+         if predictor:
+             logger.info("Predictor available, generating performance visualization")
+             try:
+                 plot_fig, metrics, feature_importance = create_model_performance_plot(
+                     predictor
+                 )
+
+                 metrics_df = pd.DataFrame(
+                     {
+                         "Metric": [
+                             "Root Mean Squared Error (RMSE)",
+                             "Mean Absolute Error (MAE)",
+                             "R² Score",
+                             "Mean Absolute Percentage Error (MAPE)",
+                         ],
+                         "Value": [
+                             f"{metrics.get('rmse', 0):.4f}",
+                             f"{metrics.get('mae', 0):.4f}",
+                             f"{metrics.get('r2', 0):.4f}",
+                             f"{metrics.get('mape', 0):.2f}%",
+                         ],
+                     }
+                 )
+                 logger.info(f"Created metrics_df with values: {metrics_df.to_dict()}")
+             except Exception as e:
+                 logger.error(f"Error creating performance plot: {e}", exc_info=True)
+                 plot_fig = go.Figure()
+                 metrics_df = pd.DataFrame(
+                     {
+                         "Metric": [
+                             "Root Mean Squared Error (RMSE)",
+                             "Mean Absolute Error (MAE)",
+                             "R² Score",
+                             "Mean Absolute Percentage Error (MAPE)",
+                         ],
+                         "Value": ["N/A", "N/A", "N/A", "N/A"],
+                     }
+                 )
+                 feature_importance = pd.DataFrame(columns=["Feature", "Importance"])
+         else:
+             logger.warning("No predictor available for initial load")
+             plot_fig = go.Figure()
+             plot_fig.update_layout(
+                 title="No model available",
+                 annotations=[
+                     dict(
+                         text="No prediction model available",
+                         showarrow=False,
+                         xref="paper",
+                         yref="paper",
+                         x=0.5,
+                         y=0.5,
+                     )
+                 ],
+             )
+
+             metrics_df = pd.DataFrame(
+                 {
+                     "Metric": [
+                         "Root Mean Squared Error (RMSE)",
+                         "Mean Absolute Error (MAE)",
+                         "R² Score",
+                         "Mean Absolute Percentage Error (MAPE)",
+                     ],
+                     "Value": ["N/A", "N/A", "N/A", "N/A"],
+                 }
+             )
+             feature_importance = pd.DataFrame(columns=["Feature", "Importance"])
+
+         logger.info("Completed initial load")
+         return (
+             recommendations_df,
+             details_df,
+             summary_text,
+             plot_fig,
+             metrics_df,
+             feature_importance,
+             top_chart,
+             best_configs,
+         )
+
+     interface.load(
+         fn=initial_load,
+         outputs=[
+             recommendations,
+             details,
+             summary,
+             performance_plot,
+             model_metrics,
+             feature_importance_df,
+             top_configs_chart,
+             current_configs_state,
+         ],
+         api_name=False,
+     )
+
+     gr.Markdown("---")
+     gr.HTML("""
+     <div style="text-align: center;">
+         Authors: <a href="https://www.linkedin.com/in/daltunay">Daniel Altunay</a> and
+         <a href="https://cKnowledge.org/gfursin">Grigori Fursin</a> (FCS Labs)
+     </div>
+     """)
+
+ if __name__ == "__main__":
+     interface.launch(share=False)
cost_calculator.py ADDED
@@ -0,0 +1,137 @@
+ """
+ Cost calculation module for MLPerf configurations.
+ """
+
+ import logging
+
+ import pandas as pd
+
+ logger = logging.getLogger(__name__)
+
+ DEFAULT_HOURLY_COST = 1.0
+
+ DEFAULT_DEVICE_COSTS = {
+     "NVIDIA H100": 3.00,
+     "NVIDIA H200": 4.00,
+     "NVIDIA GH200": 5.00,
+     "NVIDIA B200/GB200": 7.00,
+     "AMD MI300X": 3.50,
+     "AMD MI325X": 4.50,
+     "NVIDIA RTX 4090": 1.20,
+     "NVIDIA L40S": 1.80,
+     "NVIDIA Jetson AGX": 0.30,
+ }
+
+ device_costs = {}
+
+
+ def normalize_gpu_name(name: str) -> str:
+     """Normalize GPU names by identifying common patterns for the same device families."""
+     if not name:
+         return name
+
+     name_upper = name.upper()
+
+     gpu_families = {
+         "GH200": "NVIDIA GH200",  # checked before "H200", which is a substring of "GH200"
+         "GRACE HOPPER": "NVIDIA GH200",
+         "H100": "NVIDIA H100",
+         "H200": "NVIDIA H200",
+         "B200": "NVIDIA B200/GB200",
+         "GB200": "NVIDIA B200/GB200",
+         "MI300X": "AMD MI300X",
+         "MI325X": "AMD MI325X",
+         "RTX 4090": "NVIDIA RTX 4090",
+         "L40S": "NVIDIA L40S",
+     }
+
+     if "JETSON" in name_upper and ("ORIN" in name_upper or "THOR" in name_upper):
+         return "NVIDIA Jetson AGX"
+
+     for keyword, normalized_name in gpu_families.items():
+         if keyword in name_upper:
+             return normalized_name
+
+     return name
+
+
+ def initialize_device_costs(df: pd.DataFrame) -> None:
+     """Initialize device costs from dataset with default values."""
+     global device_costs
+
+     accelerators = set()
+
+     if df is not None and not df.empty and "system.accelerator.name" in df.columns:
+         for acc in df["system.accelerator.name"].dropna().unique():
+             normalized_name = normalize_gpu_name(acc)
+             accelerators.add(normalized_name)
+
+     device_costs = {}
+     for device in accelerators:
+         if device in DEFAULT_DEVICE_COSTS:
+             device_costs[device] = DEFAULT_DEVICE_COSTS[device]
+         else:
+             device_costs[device] = DEFAULT_HOURLY_COST
+
+     logger.info(f"Initialized costs for {len(device_costs)} unique device families")
+
+
+ def get_device_costs() -> dict[str, float]:
+     """Return a copy of the current device costs."""
+     return device_costs.copy()
+
+
+ def update_device_costs(new_costs: dict[str, float]) -> None:
+     """Update device costs with new values."""
+     global device_costs
+     device_costs.update(new_costs)
+     logger.info(f"Updated costs for {len(new_costs)} devices")
+
+
+ def calculate_costs(df: pd.DataFrame) -> pd.DataFrame:
+     """Add cost metrics to the DataFrame."""
+     if df is None or df.empty:
+         return df
+
+     result_df = df.copy()
+
+     result_df["hourly_cost"] = None
+     result_df["cost_per_million_tokens"] = None
+
+     for idx, row in result_df.iterrows():
+         hourly_cost = estimate_hourly_cost(row)
+         result_df.at[idx, "hourly_cost"] = hourly_cost
+
+         if hourly_cost and "metrics.result" in row and row["metrics.result"]:
+             tokens_per_hour = row["metrics.result"] * 3600
+             if tokens_per_hour > 0:
+                 cost_per_million = (hourly_cost / tokens_per_hour) * 1000000
+                 result_df.at[idx, "cost_per_million_tokens"] = cost_per_million
+
+     return result_df
+
+
+ def estimate_hourly_cost(row: pd.Series) -> float | None:
+ """Estimate hourly cost for a single configuration."""
116
+ try:
117
+ acc_name = row.get("system.accelerator.name")
118
+ acc_vendor = row.get("system.accelerator.vendor")
119
+ acc_count = row.get("system.accelerator.total_count")
120
+
121
+         if not acc_count or pd.isna(acc_count):  # guard against missing or NaN counts
+             return None
+
+         base_cost = DEFAULT_HOURLY_COST
+
+         if acc_name:
+             normalized_name = normalize_gpu_name(acc_name)
+             if normalized_name in device_costs:
+                 base_cost = device_costs[normalized_name]
+         elif acc_vendor and acc_vendor in device_costs:
+             base_cost = device_costs[acc_vendor]
+
+         return base_cost * acc_count
+
+     except Exception as e:
+         logger.warning(f"Error calculating cost: {e}")
+         return None
data.json ADDED
The diff for this file is too large to render. See raw diff
 
predictor.py ADDED
@@ -0,0 +1,900 @@
1
+ """Simplified performance predictor for MLPerf configurations using XGBoost."""
2
+
3
+ import logging
4
+ import random
5
+ from collections import Counter, defaultdict
6
+
7
+ import numpy as np
8
+ import pandas as pd
9
+ import xgboost as xgb
10
+ from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
11
+ from sklearn.model_selection import train_test_split
12
+ from utils import FEATURE_TYPES
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+
17
+ class PerformancePredictor:
18
+ """Predicts performance for hardware configurations."""
19
+
20
+ def __init__(self, dataset: pd.DataFrame, test_size: float = 0.2):
21
+ """Initialize with benchmark dataset."""
22
+ self.df = dataset
23
+ self.model = None
24
+ self.target = "metrics.result_per_accelerator"
25
+ self.features = []
26
+ self.test_size = test_size
27
+
28
+ self.evaluation_data = pd.DataFrame()
29
+ self.evaluation_metrics = {}
30
+ self.feature_importance = pd.DataFrame(columns=["Feature", "Importance"])
31
+
32
+ self.excluded_features = {
33
+ "model.name",
34
+ "model.mlperf_name",
35
+ "software.framework",
36
+ "system.name",
37
+ }
38
+
39
+ self.excluded_features.update(
40
+ {
41
+ col
42
+ for col in dataset.columns
43
+ if col.startswith("submission.") or col.startswith("metrics.")
44
+ }
45
+ )
46
+
47
+ self.distributions = {}
48
+
49
+ self.max_accelerators = int(dataset["system.accelerator.total_count"].max())
50
+ self.max_gpu_memory = float(dataset["system.accelerator.memory_capacity"].max())
51
+ self.max_cpu_memory = float(dataset["system.memory.capacity"].max())
52
+
53
+ self.frameworks = sorted(
54
+ list(
55
+ set(
56
+ col.replace("software.framework.", "")
57
+ for col in dataset.columns
58
+ if col.startswith("software.framework.")
59
+ and col != "software.framework"
60
+ )
61
+ )
62
+ )
63
+ logger.info(
64
+ f"Found {len(self.frameworks)} unique frameworks: {', '.join(self.frameworks)}"
65
+ )
66
+
67
+ self._identify_features()
68
+ self._analyze_data_distributions()
69
+ self._train_model()
70
+
71
+ def _identify_features(self):
72
+ """Identify features for model training."""
73
+ all_columns = set(self.df.columns)
74
+ available_features = all_columns - self.excluded_features - {self.target}
75
+ self.features = [f for f in available_features if not self.df[f].isna().all()]
76
+ logger.info(f"Identified {len(self.features)} features for model training")
77
+
78
+ def _analyze_data_distributions(self):
79
+ """Analyze feature distributions for realistic data generation."""
80
+ categorical_features = {
81
+ col
82
+ for col in self.df.columns
83
+ if self.df[col].dtype == "object"
84
+ or col in FEATURE_TYPES.get("categorical", [])
85
+ }
86
+
87
+ for feature in categorical_features:
88
+ values = self.df[feature].dropna().tolist()
89
+ if values:
90
+ counter = Counter(values)
91
+ total = sum(counter.values())
92
+ self.distributions[feature] = {
93
+ value: count / total for value, count in counter.items()
94
+ }
95
+
96
+ continuous_features = {
97
+ col
98
+ for col in self.df.columns
99
+ if col in FEATURE_TYPES.get("continuous", [])
100
+ or pd.api.types.is_numeric_dtype(self.df[col].dtype)
101
+ if col not in categorical_features and not col.startswith("metrics.")
102
+ }
103
+
104
+ for feature in continuous_features:
105
+ values = self.df[feature].dropna()
106
+ if len(values) > 0:
107
+ self.distributions[feature] = {
108
+ "min": float(values.min()),
109
+ "max": float(values.max()),
110
+ "mean": float(values.mean()),
111
+ "std": float(values.std()),
112
+ "median": float(values.median()),
113
+ "values": sorted(values.unique().tolist()),
114
+ }
115
+
116
+ self._analyze_feature_relationships()
117
+ logger.info(f"Analyzed distributions for {len(self.distributions)} features")
118
+
119
+ def _analyze_feature_relationships(self):
120
+ """Analyze relationships between related features."""
121
+ self._analyze_vendor_accelerator_relations()
122
+ self._analyze_vendor_cpu_relations()
123
+ self._analyze_accelerator_memory_relations()
124
+ self._analyze_interconnect_relations()
125
+ self._analyze_software_relations()
126
+ self._analyze_device_counts()
127
+
128
+ def _analyze_vendor_accelerator_relations(self):
129
+ """Map vendors to their accelerators."""
130
+ vendor_accelerators = defaultdict(list)
131
+ for _, row in self.df.iterrows():
132
+ vendor = row.get("system.accelerator.vendor")
133
+ acc = row.get("system.accelerator.name")
134
+ if vendor and acc:
135
+ vendor_accelerators[vendor].append(acc)
136
+
137
+ self.distributions["vendor_accelerators"] = {}
138
+ for vendor, accelerators in vendor_accelerators.items():
139
+ counter = Counter(accelerators)
140
+ total = sum(counter.values())
141
+ self.distributions["vendor_accelerators"][vendor] = {
142
+ acc: count / total for acc, count in counter.items()
143
+ }
144
+
145
+ def _analyze_vendor_cpu_relations(self):
146
+ """Map CPU vendors to their models."""
147
+ vendor_cpus = defaultdict(list)
148
+ for _, row in self.df.iterrows():
149
+ vendor = row.get("system.cpu.vendor")
150
+ model = row.get("system.cpu.model")
151
+ if vendor and model:
152
+ vendor_cpus[vendor].append(model)
153
+
154
+ self.distributions["vendor_cpus"] = {}
155
+ for vendor, models in vendor_cpus.items():
156
+ counter = Counter(models)
157
+ total = sum(counter.values())
158
+ self.distributions["vendor_cpus"][vendor] = {
159
+ model: count / total for model, count in counter.items()
160
+ }
161
+
162
+ def _analyze_accelerator_memory_relations(self):
163
+ """Map accelerator models to memory capacities."""
164
+ acc_memory = defaultdict(list)
165
+ for _, row in self.df.iterrows():
166
+ acc = row.get("system.accelerator.name")
167
+ memory = row.get("system.accelerator.memory_capacity")
168
+ if acc and memory:
169
+ acc_memory[acc].append(memory)
170
+
171
+ self.distributions["accelerator_memory"] = {}
172
+ for acc, memories in acc_memory.items():
173
+ if memories:
174
+             counter = Counter(memories)
+             most_common = counter.most_common(1)[0][0] if counter else None
+             self.distributions["accelerator_memory"][acc] = {
+                 "min": min(memories),
+                 "max": max(memories),
+                 "mean": sum(memories) / len(memories),
+                 "most_common": most_common,
+                 "values": sorted(set(memories)),
+             }
+
+     def _analyze_interconnect_relations(self):
+         """Map vendors to interconnect types."""
+         vendor_interconnects = defaultdict(list)
+         for _, row in self.df.iterrows():
+             vendor = row.get("system.accelerator.vendor")
+             interconnect = row.get("system.interconnect.accelerator")
+             if vendor and interconnect:
+                 vendor_interconnects[vendor].append(interconnect)
+
+         self.distributions["vendor_interconnects"] = {}
+         for vendor, interconnects in vendor_interconnects.items():
+             counter = Counter(interconnects)
+             total = sum(counter.values())
+             self.distributions["vendor_interconnects"][vendor] = {
+                 ic: count / total for ic, count in counter.items()
+             }
+
+     def _analyze_software_relations(self):
+         """Map vendors to software stacks."""
+         vendor_software = defaultdict(lambda: defaultdict(list))
+         for _, row in self.df.iterrows():
+             vendor = row.get("system.accelerator.vendor")
+             if not vendor:
+                 continue
+
+             os = row.get("software.operating_system")
+             if os:
+                 vendor_software[vendor]["os"].append(os)
+
+             for col in self.df.columns:
+                 if (
+                     col.startswith("software.framework.")
+                     and col != "software.framework"
+                 ):
+                     framework = col.replace("software.framework.", "")
+                     version = row.get(col)
+                     if version:
+                         vendor_software[vendor][framework].append(version)
+
+         self.distributions["vendor_software"] = {}
+         for vendor, software_dict in vendor_software.items():
+             self.distributions["vendor_software"][vendor] = {}
+             for software_type, values in software_dict.items():
+                 counter = Counter(values)
+                 total = sum(counter.values())
+                 self.distributions["vendor_software"][vendor][software_type] = {
+                     value: count / total for value, count in counter.items()
+                 }
+
+     def _analyze_device_counts(self):
+         """Analyze distribution of device counts."""
+         counts = self.df["system.accelerator.total_count"].dropna().astype(int).tolist()
+         if counts:
+             counter = Counter(counts)
+             total = sum(counter.values())
+             self.distributions["device_count"] = {
+                 count: freq / total for count, freq in counter.items()
+             }
+             self.distributions["device_count_values"] = sorted(set(counts))
+
+         node_counts = self.df["system.number_of_nodes"].dropna().astype(int).tolist()
+         if node_counts:
+             counter = Counter(node_counts)
+             total = sum(counter.values())
+             self.distributions["node_count"] = {
+                 count: freq / total for count, freq in counter.items()
+             }
+             self.distributions["node_count_values"] = sorted(set(node_counts))
+
+         device_node_pairs = [
+             (
+                 int(row["system.number_of_nodes"]),
+                 int(row["system.accelerator.total_count"]),
+             )
+             for _, row in self.df.iterrows()
+             if row.get("system.number_of_nodes")
+             and row.get("system.accelerator.total_count")
+         ]
+
+         node_to_devices = defaultdict(list)
+         for nodes, devices in device_node_pairs:
+             node_to_devices[nodes].append(devices)
+
+         self.distributions["node_device_relation"] = {}
+         for node_count, device_counts in node_to_devices.items():
+             counter = Counter(device_counts)
+             total = sum(counter.values())
+             self.distributions["node_device_relation"][node_count] = {
+                 count: freq / total for count, freq in counter.items()
+             }
+
+     def _train_model(self):
+         """Train XGBoost model on available data with train/test split."""
+         df_clean = self.df.dropna(subset=[self.target])
+
+         X = df_clean[self.features]
+         y = df_clean[self.target]
+
+         for col in X.select_dtypes(include=["object"]).columns:
+             with pd.option_context("mode.chained_assignment", None):
+                 X[col] = X[col].astype("category")
+
+         try:
+             strat_column = df_clean["system.accelerator.name"].fillna("unknown")
+             X_train, X_test, y_train, y_test = train_test_split(
+                 X, y, test_size=self.test_size, stratify=strat_column, random_state=42
+             )
+             logger.info(
+                 f"Created stratified train/test split ({100 - self.test_size * 100:.0f}%/{self.test_size * 100:.0f}%) with {len(X_train)} training and {len(X_test)} test samples"
+             )
+         except ValueError:
+             X_train, X_test, y_train, y_test = train_test_split(
+                 X, y, test_size=self.test_size, random_state=42
+             )
+             logger.info(
+                 f"Created regular train/test split with {len(X_train)} training and {len(X_test)} test samples"
+             )
+
+         self.model = xgb.XGBRegressor(
+             objective="reg:squarederror",
+             n_estimators=100,
+             max_depth=6,
+             learning_rate=0.1,
+             subsample=0.8,
+             enable_categorical=True,
+         )
+
+         self.model.fit(X_train, y_train)
+         logger.info(f"Trained XGBoost model on {len(X_train)} rows")
+
+         self._evaluate_model(X_test, y_test, df_clean.loc[X_test.index])
+
+     def _evaluate_model(self, X_test, y_test, test_df):
+         """Evaluate model performance on test set."""
+         if X_test.empty:
+             logger.warning("No test data available for evaluation")
+             return
+
+         y_pred = self.model.predict(X_test)
+
+         rmse = np.sqrt(mean_squared_error(y_test, y_pred))
+         mae = mean_absolute_error(y_test, y_pred)
+         r2 = r2_score(y_test, y_pred)
+
+         mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
+
+         self.evaluation_metrics = {
+             "rmse": rmse,
+             "mae": mae,
+             "r2": r2,
+             "mape": mape,
+             "test_size": len(y_test),
+             "training_size": len(self.df) - len(y_test),
+         }
+
+         eval_data = test_df[
+             [
+                 "system.accelerator.name",
+                 "system.accelerator.vendor",
+                 "system.accelerator.total_count",
+             ]
+         ].copy()
+         eval_data["actual"] = y_test
+         eval_data["predicted"] = y_pred
+         eval_data["error"] = y_pred - y_test
+         eval_data["error_percent"] = (y_pred - y_test) / y_test * 100
+
+         self.evaluation_data = eval_data.copy()
+
+         logger.info(
+             f"Model evaluation - RMSE: {rmse:.2f}, MAE: {mae:.2f}, R²: {r2:.3f}, MAPE: {mape:.2f}%"
+         )
+         logger.info(
+             f"Evaluation data shape: {eval_data.shape}, with columns: {list(eval_data.columns)}"
+         )
+         logger.info(f"Evaluation data sample: {eval_data.head(2).to_dict()}")
+         logger.info(
+             f"Evaluation data stored as class attribute with shape: {self.evaluation_data.shape}"
+         )
+
+         importance = self.model.feature_importances_
+         feature_importance = pd.DataFrame(
+             {"Feature": self.model.feature_names_in_, "Importance": importance}
+         ).sort_values("Importance", ascending=False)
+         self.feature_importance = feature_importance.head(10).copy()
+
+         logger.info(
+             f"Top 5 important features: {', '.join(self.feature_importance['Feature'].head(5).tolist())}"
+         )
+
+     def get_evaluation_metrics(self) -> dict:
+         """Return model evaluation metrics."""
+         logger.info(f"Getting evaluation metrics: {self.evaluation_metrics}")
+         return self.evaluation_metrics.copy() if self.evaluation_metrics else {}
+
+     def get_evaluation_data(self) -> pd.DataFrame | None:
+         """Return evaluation data for visualization, or None if empty."""
+         data_shape = (
+             "empty" if self.evaluation_data.empty else self.evaluation_data.shape
+         )
+         logger.info(f"Getting evaluation data with shape: {data_shape}")
+         return self.evaluation_data.copy() if not self.evaluation_data.empty else None
+
+     def get_feature_importance(self) -> pd.DataFrame:
+         """Return feature importance data."""
+         logger.info(
+             f"Getting feature importance with shape: {self.feature_importance.shape}"
+         )
+         return (
+             self.feature_importance.copy()
+             if not self.feature_importance.empty
+             else pd.DataFrame(columns=["Feature", "Importance"])
+         )
+
+     def generate_predictions(
+         self,
+         architecture: str,
+         parameters: float,
+         constraints: dict = None,
+         num_configs: int = 10,
+     ) -> pd.DataFrame:
+         """Generate and predict performance for hardware configurations."""
+         constraints = constraints or {}
+         logger.info(
+             f"Generating {num_configs} predictions for {architecture} model with {parameters}B parameters"
+         )
+
+         configs = self._generate_configs(
+             architecture, parameters, constraints, num_configs
+         )
+         if not configs:
+             return pd.DataFrame()
+
+         configs_df = pd.DataFrame(configs)
+         model_features = getattr(self.model, "feature_names_in_", self.features)
+
+         for feature in model_features:
+             if feature not in configs_df.columns:
+                 configs_df[feature] = None
+
+         X_pred = configs_df[model_features]
+         for col in X_pred.select_dtypes(include=["object"]).columns:
+             with pd.option_context("mode.chained_assignment", None):
+                 X_pred[col] = X_pred[col].astype("category")
+
+         configs_df[self.target] = self.model.predict(X_pred)
+         configs_df["predicted"] = True
+         configs_df["metrics.result"] = (
+             configs_df[self.target] * configs_df["system.accelerator.total_count"]
+         )
+         configs_df["system.name"] = "Hypothetical system - ongoing work"
+
+         logger.info(
+             f"Performance range: {configs_df[self.target].min():.2f} - {configs_df[self.target].max():.2f} tokens/s per accelerator"
+         )
+         return configs_df
+
+     def _sample_from_distribution(self, distribution: dict) -> any:
+         """Sample a value from a categorical distribution."""
+         items = list(distribution.keys())
+         probabilities = list(distribution.values())
+         return np.random.choice(items, p=probabilities)
+
+     def _sample_continuous_value(self, feature: str) -> float:
+         """Sample a continuous value from the feature distribution."""
+         dist = self.distributions[feature]
+
+         if "values" in dist and dist["values"]:
+             if len(dist["values"]) > 3:
+                 value = np.random.normal(dist["mean"], max(dist["std"], 1.0))
+                 value = max(dist["min"], min(dist["max"], value))
+                 closest_idx = min(
+                     range(len(dist["values"])),
+                     key=lambda i: abs(dist["values"][i] - value),
+                 )
+                 return dist["values"][closest_idx]
+             else:
+                 return random.choice(dist["values"])
+
+         elif all(k in dist for k in ["min", "max", "mean", "std"]):
+             value = np.random.normal(dist["mean"], max(dist["std"], 1.0))
+             return max(dist["min"], min(dist["max"], value))
+
+         return np.random.uniform(dist["min"], dist["max"])
+
+     def _get_device_count(self, min_devices=None, max_devices=None) -> int:
+         """Get a realistic device count based on distribution and constraints."""
+         valid_counts = [
+             count
+             for count in self.distributions["device_count_values"]
+             if (min_devices is None or count >= min_devices)
+             and (max_devices is None or count <= max_devices)
+         ]
+
+         if valid_counts:
+             probs = {
+                 count: self.distributions["device_count"][count]
+                 for count in valid_counts
+                 if count in self.distributions["device_count"]
+             }
+
+             if probs:
+                 total = sum(probs.values())
+                 items = list(probs.keys())
+                 weights = [probs[item] / total for item in items]
+                 return np.random.choice(items, p=weights)
+
+             return random.choice(valid_counts)
+
+         if min_devices is not None and max_devices is not None:
+             valid_powers = [
+                 2**i for i in range(10) if min_devices <= 2**i <= max_devices
+             ]
+             if valid_powers:
+                 return random.choice(valid_powers)
+             return random.randint(min_devices, max_devices)
+
+         return random.choice([1, 2, 4, 8, 16])
+
+     def _get_vendor_accelerator(self, vendor=None) -> tuple:
+         """Get a vendor and accelerator pair."""
+         if vendor is None or vendor == "Any":
+             vendor = self._sample_from_distribution(
+                 self.distributions["system.accelerator.vendor"]
+             )
+
+         if vendor in self.distributions["vendor_accelerators"]:
+             accelerator = self._sample_from_distribution(
+                 self.distributions["vendor_accelerators"][vendor]
+             )
+         else:
+             accelerator = self._sample_from_distribution(
+                 self.distributions["system.accelerator.name"]
+             )
+
+         return vendor, accelerator
+
+     def _get_memory_for_accelerator(
+         self, vendor: str, accelerator: str, min_memory=None, max_memory=None
+     ) -> float:
+         """Get appropriate memory capacity for a given accelerator."""
+         if accelerator in self.distributions["accelerator_memory"]:
+             mem_dist = self.distributions["accelerator_memory"][accelerator]
+
+             if "values" in mem_dist:
+                 valid_values = [
+                     m
+                     for m in mem_dist["values"]
+                     if (min_memory is None or m >= min_memory)
+                     and (max_memory is None or m <= max_memory)
+                 ]
+                 if valid_values:
+                     return random.choice(valid_values)
+
+             if "most_common" in mem_dist:
+                 most_common = mem_dist["most_common"]
+                 if (min_memory is None or most_common >= min_memory) and (
+                     max_memory is None or most_common <= max_memory
+                 ):
+                     return most_common
+
+         dist = self.distributions["system.accelerator.memory_capacity"]
+         valid_values = [
+             m
+             for m in dist["values"]
+             if (min_memory is None or m >= min_memory)
+             and (max_memory is None or m <= max_memory)
+         ]
+
+         if valid_values:
+             return random.choice(valid_values)
+
+         min_val = max(dist["min"], min_memory or dist["min"])
+         max_val = min(dist["max"], max_memory or dist["max"])
+
+         if min_val <= max_val:
+             mean = min(max(dist["mean"], min_val), max_val)
+             std = max(dist["std"], 1.0)
+
+             for _ in range(5):
+                 value = np.random.normal(mean, std)
+                 if min_val <= value <= max_val:
+                     return value
+
+             return np.random.uniform(min_val, max_val)
+
+         return None
+
+     def _get_node_config(self, total_devices: int) -> tuple:
+         """Determine number of nodes and devices per node."""
+         VALID_GPUS_PER_NODE = [1, 2, 4, 8]
+
+         for gpus_per_node in sorted(VALID_GPUS_PER_NODE, reverse=True):
+             if total_devices % gpus_per_node == 0:
+                 return total_devices // gpus_per_node, gpus_per_node
+
+         for gpus_per_node in sorted(VALID_GPUS_PER_NODE, reverse=True):
+             if gpus_per_node <= total_devices:
+                 nodes = total_devices // gpus_per_node
+                 return nodes, gpus_per_node
+
+         return 1, 1
+
+     def _get_cpu_config(self) -> dict:
+         """Generate a CPU configuration."""
+         cpu_config = {}
+         cpu_config["system.cpu.vendor"] = self._sample_from_distribution(
+             self.distributions["system.cpu.vendor"]
+         )
+
+         cpu_vendor = cpu_config["system.cpu.vendor"]
+         if cpu_vendor in self.distributions["vendor_cpus"]:
+             cpu_config["system.cpu.model"] = self._sample_from_distribution(
+                 self.distributions["vendor_cpus"][cpu_vendor]
+             )
+         else:
+             cpu_config["system.cpu.model"] = self._sample_from_distribution(
+                 self.distributions["system.cpu.model"]
+             )
+
+         for feature in [
+             "system.cpu.core_count",
+             "system.cpu.count_per_node",
+             "system.cpu.frequency",
+         ]:
+             value = self._sample_continuous_value(feature)
+             if value is not None:
+                 if feature in ["system.cpu.core_count", "system.cpu.count_per_node"]:
+                     value = int(value)
+                 cpu_config[feature] = value
+
+         if "system.cpu.caches" in self.distributions:
+             cpu_config["system.cpu.caches"] = self._sample_from_distribution(
+                 self.distributions["system.cpu.caches"]
+             )
+
+         return cpu_config
+
+     def _get_software_config(self, vendor: str, constraints=None) -> dict:
+         """Generate a software configuration based on hardware vendor."""
+         constraints = constraints or {}
+         software_config = {}
+
+         if vendor in self.distributions["vendor_software"]:
+             vendor_sw = self.distributions["vendor_software"][vendor]
+
+             if "os" in vendor_sw:
+                 os_constraint = constraints.get("software.operating_system")
+                 if os_constraint and os_constraint != "Any":
+                     software_config["software.operating_system"] = os_constraint
+                 else:
+                     software_config["software.operating_system"] = (
+                         self._sample_from_distribution(vendor_sw["os"])
+                     )
+
+             for framework, versions in vendor_sw.items():
+                 if framework != "os":
+                     framework_key = f"software.framework.{framework}"
+                     version_constraint = constraints.get(framework_key)
+                     if version_constraint and version_constraint != "Any":
+                         software_config[framework_key] = version_constraint
+                     else:
+                         software_config[framework_key] = self._sample_from_distribution(
+                             versions
+                         )
+
+         if (
+             "software.operating_system" not in software_config
+             and "software.operating_system" in self.distributions
+         ):
+             os_constraint = constraints.get("software.operating_system")
+             if os_constraint and os_constraint != "Any":
+                 software_config["software.operating_system"] = os_constraint
+             else:
+                 software_config["software.operating_system"] = (
+                     self._sample_from_distribution(
+                         self.distributions["software.operating_system"]
+                     )
+                 )
+
+         for framework in self.frameworks:
+             framework_key = f"software.framework.{framework}"
+             if (
+                 framework_key not in software_config
+                 and framework_key in self.distributions
+             ):
+                 version_constraint = constraints.get(framework_key)
+                 if version_constraint and version_constraint != "Any":
+                     software_config[framework_key] = version_constraint
+                 else:
+                     software_config[framework_key] = self._sample_from_distribution(
+                         self.distributions[framework_key]
+                     )
+
+         return software_config
+
+     def _get_memory_config(self, min_memory=None, max_memory=None) -> dict:
+         """Generate a memory configuration."""
+         memory_config = {}
+         dist = self.distributions["system.memory.capacity"]
+
+         if "values" in dist:
+             valid_values = [
+                 m
+                 for m in dist["values"]
+                 if (min_memory is None or m >= min_memory)
+                 and (max_memory is None or m <= max_memory)
+             ]
+             if valid_values:
+                 memory_config["system.memory.capacity"] = random.choice(valid_values)
+
+         if "system.memory.capacity" not in memory_config:
+             min_val = max(dist["min"], min_memory or dist["min"])
+             max_val = min(dist["max"], max_memory or dist["max"])
+
+             if min_val <= max_val:
+                 mean = min(max(dist["mean"], min_val), max_val)
+                 std = max(dist["std"], (max_val - min_val) / 6.0)
+
+                 value = np.random.normal(mean, std)
+                 if min_val <= value <= max_val:
+                     memory_config["system.memory.capacity"] = value
+                 else:
+                     memory_config["system.memory.capacity"] = np.random.uniform(
+                         min_val, max_val
+                     )
+
+         if "system.memory.configuration" in self.distributions:
+             memory_config["system.memory.configuration"] = (
+                 self._sample_from_distribution(
+                     self.distributions["system.memory.configuration"]
+                 )
+             )
+
+         return memory_config
+
+     def _get_interconnect_config(self, vendor: str) -> dict:
+         """Generate interconnect configuration based on vendor."""
+         interconnect_config = {}
+
+         if vendor in self.distributions["vendor_interconnects"]:
+             interconnect_config["system.interconnect.accelerator"] = (
+                 self._sample_from_distribution(
+                     self.distributions["vendor_interconnects"][vendor]
+                 )
+             )
+         elif "system.interconnect.accelerator" in self.distributions:
+             interconnect_config["system.interconnect.accelerator"] = (
+                 self._sample_from_distribution(
+                     self.distributions["system.interconnect.accelerator"]
+                 )
+             )
+
+         if "system.interconnect.accelerator_host" in self.distributions:
+             interconnect_config["system.interconnect.accelerator_host"] = (
+                 self._sample_from_distribution(
+                     self.distributions["system.interconnect.accelerator_host"]
+                 )
+             )
+
+         return interconnect_config
+
+     def _generate_configs(
+         self, architecture: str, parameters: float, constraints=None, count: int = 10
+     ) -> list:
+         """Generate configurations respecting user constraints."""
+         constraints = constraints or {}
+         configs = []
+
+         vendor = constraints.get("system.accelerator.vendor")
+         acc_name = constraints.get("system.accelerator.name")
+
+         def apply_margin(value, is_min=True):
+             if value is None or not isinstance(value, (int, float)) or value == "Any":
+                 return None
+             return value * (0.9 if is_min else 1.1)
+
+         min_gpu_memory = apply_margin(constraints.get("min_gpu_memory"), is_min=True)
+         max_gpu_memory = apply_margin(
+             constraints.get("max_gpu_memory"), is_min=False
+         ) or (self.max_gpu_memory * 1.1)
+
+         min_cpu_memory = apply_margin(constraints.get("min_cpu_memory"), is_min=True)
+         max_cpu_memory = apply_margin(
+             constraints.get("max_cpu_memory"), is_min=False
+         ) or (self.max_cpu_memory * 1.1)
+
+         min_devices = apply_margin(constraints.get("min_accelerators"), is_min=True)
+         max_devices = (
+             apply_margin(constraints.get("max_accelerators"), is_min=False)
+             or self.max_accelerators
+         )
+
+         interconnect = constraints.get("system.interconnect.accelerator")
+         nodes = constraints.get("system.number_of_nodes")
+
+         VALID_GPUS_PER_NODE = [1, 2, 4, 8]
+
+         for _ in range(count * 3):
+             if len(configs) >= count:
+                 break
+
+             device_count = self._get_device_count(min_devices, max_devices)
+             acc_vendor, acc_model = self._get_vendor_accelerator(vendor)
+
+             if acc_name and acc_name != "Any":
+                 acc_model = acc_name
+
+             if nodes and nodes != "Any":
+                 node_count = int(nodes)
+                 valid_device_counts = []
+                 for gpus in VALID_GPUS_PER_NODE:
+                     if node_count * gpus >= (
+                         min_devices or 1
+                     ) and node_count * gpus <= (max_devices or float("inf")):
+                         valid_device_counts.append(gpus)
+
+                 if not valid_device_counts:
+                     continue
+
+                 devices_per_node = random.choice(valid_device_counts)
+                 device_count = node_count * devices_per_node
+             else:
+                 valid_count = False
+                 for gpus_per_node in sorted(VALID_GPUS_PER_NODE, reverse=True):
+                     if device_count % gpus_per_node == 0:
+                         valid_count = True
+                         break
+
+                 if not valid_count:
+                     node_count, devices_per_node = self._get_node_config(device_count)
+                     device_count = node_count * devices_per_node
+                 else:
+                     node_count, devices_per_node = (
+                         device_count // gpus_per_node,
+                         gpus_per_node,
+                     )
+
+             config = {
+                 "model.architecture": architecture,
+                 "model.number_of_parameters": parameters,
+                 "system.accelerator.vendor": acc_vendor,
+                 "system.accelerator.name": acc_model,
+                 "system.accelerator.total_count": device_count,
+                 "system.number_of_nodes": node_count,
+                 "system.accelerator.count_per_node": devices_per_node,
+             }
+
+             gpu_memory = self._get_memory_for_accelerator(
+                 acc_vendor,
+                 acc_model,
+                 min_memory=min_gpu_memory,
+                 max_memory=max_gpu_memory,
+             )
+
+             if gpu_memory is None:
+                 continue
+
+             config["system.accelerator.memory_capacity"] = gpu_memory
+
+             if "system.accelerator.memory_config" in self.distributions:
+                 config["system.accelerator.memory_config"] = (
+                     self._sample_from_distribution(
+                         self.distributions["system.accelerator.memory_config"]
+                     )
+                 )
+
+             interconnect_config = self._get_interconnect_config(acc_vendor)
+
+             if interconnect and interconnect != "Any":
+                 interconnect_config["system.interconnect.accelerator"] = interconnect
+
+             config.update(interconnect_config)
+             config.update(self._get_cpu_config())
+
+             memory_config = self._get_memory_config(
+                 min_memory=min_cpu_memory, max_memory=max_cpu_memory
+             )
+             if "system.memory.capacity" not in memory_config:
+                 continue
+
+             config.update(memory_config)
+
+             for feature_name in [
+                 "system.type",
+                 "system.cooling",
+                 "model.weight_data_types",
+             ]:
+                 if feature_name in self.distributions:
+                     config[feature_name] = self._sample_from_distribution(
+                         self.distributions[feature_name]
+                     )
+
+             config.update(self._get_software_config(acc_vendor, constraints))
+
+             for key, value in constraints.items():
+                 if (
+                     not key.startswith("software.framework.")
+                     and key != "software.operating_system"
+                     and key
+                     not in [
+                         "min_gpu_memory",
+                         "max_gpu_memory",
+                         "min_cpu_memory",
+                         "max_cpu_memory",
+                         "min_accelerators",
+                         "max_accelerators",
+                     ]
+                     and key not in config
+                     and value != "Any"
+                     and value is not None
+                 ):
+                     config[key] = value
+
+             configs.append(config)
+
+         return configs[:count]
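The predictor samples categorical features (vendors, accelerator names, OS versions) from empirical frequency tables via `np.random.choice`. A stdlib-only sketch of the same idea, using `random.choices` instead of NumPy (the function name and example values here are illustrative, not part of the repository):

```python
import random


def sample_from_distribution(distribution: dict) -> object:
    """Draw one key from a {value: probability} mapping,
    mirroring the weighted sampling in _sample_from_distribution."""
    items = list(distribution.keys())
    weights = list(distribution.values())
    return random.choices(items, weights=weights, k=1)[0]


# A degenerate distribution always returns its only key.
print(sample_from_distribution({"NVIDIA H100": 1.0}))
```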
recommender.py ADDED
@@ -0,0 +1,97 @@
+ """Configuration recommendation module for MLPerf benchmarks."""
+
+ import logging
+
+ import pandas as pd
+ from utils import get_feature_type
+
+ logger = logging.getLogger(__name__)
+
+
+ class ConfigurationFinder:
+     """Finds optimal hardware configurations based on user requirements."""
+
+     def __init__(self, dataset: pd.DataFrame):
+         """Initialize with benchmark dataset."""
+         self.df = dataset
+         self.perf_metric = "metrics.result_per_accelerator"
+         self.cost_metric = "cost_per_million_tokens"
+         self.total_perf_metric = "metrics.result"
+
+     def is_within_tolerance(
+         self, value1: float, value2: float, tolerance: float = 0.1
+     ) -> bool:
+         """Check if two values are within a specified percentage tolerance."""
+         if value1 is None or value2 is None:
+             return False
+
+         try:
+             if value1 == 0 or value2 == 0:
+                 return value1 == value2
+             percentage_diff = abs(value1 - value2) / max(abs(value1), abs(value2))
+             return percentage_diff <= tolerance
+         except (TypeError, ValueError):
+             return False
+
+     def find_configurations(
+         self, constraints: dict, tolerance: float = 0.1
+     ) -> pd.DataFrame:
+         """Find configurations matching the given constraints."""
+         if self.df.empty:
+             return pd.DataFrame()
+
+         filtered_df = self.df.copy()
+
+         for feature, value in constraints.items():
+             if feature not in filtered_df.columns or value is None or value == "Any":
+                 continue
+
+             if get_feature_type(feature) == "continuous":
+                 try:
+                     target_value = float(value)
+                     lower_bound = target_value * (1 - tolerance)
+                     upper_bound = target_value * (1 + tolerance)
+                     filtered_df = filtered_df[
+                         (filtered_df[feature] >= lower_bound)
+                         & (filtered_df[feature] <= upper_bound)
+                     ]
+                 except (TypeError, ValueError):
+                     filtered_df = filtered_df[filtered_df[feature] == value]
+             else:
+                 filtered_df = filtered_df[filtered_df[feature] == value]
+
+         if "min_accelerators" in constraints and constraints["min_accelerators"]:
+             min_acc = constraints["min_accelerators"]
+             filtered_df = filtered_df[
+                 filtered_df["system.accelerator.total_count"] >= min_acc
+             ]
+
+         if "max_accelerators" in constraints and constraints["max_accelerators"]:
+             max_acc = constraints["max_accelerators"]
+             filtered_df = filtered_df[
+                 filtered_df["system.accelerator.total_count"] <= max_acc
+             ]
+
+         return filtered_df
+
+     def rank_configurations(
+         self,
+         df: pd.DataFrame,
+         metric: str = "metrics.result_per_accelerator",
+         ascending: bool = False,
+     ) -> pd.DataFrame:
+         """Rank configurations by the specified metric."""
+         if df.empty or metric not in df.columns:
+             return df
+         return df.sort_values(by=metric, ascending=ascending)
+
+     def recommend(self, constraints: dict, top_n: int = 10) -> pd.DataFrame:
+         """Find and rank configurations based on constraints."""
+         filtered_configs = self.find_configurations(constraints)
+         ranked_configs = self.rank_configurations(
+             filtered_configs, metric=self.perf_metric, ascending=False
+         )
+
+         if len(ranked_configs) > top_n:
+             return ranked_configs.head(top_n)
+         return ranked_configs
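`ConfigurationFinder.is_within_tolerance` treats two values as matching when their relative difference, measured against the larger magnitude, stays within the tolerance. A standalone sketch of that check (this re-states the logic above outside the class, purely for illustration):

```python
def is_within_tolerance(value1, value2, tolerance=0.1):
    """Relative-difference check mirroring ConfigurationFinder.is_within_tolerance."""
    if value1 is None or value2 is None:
        return False
    if value1 == 0 or value2 == 0:
        # With a zero on either side, only an exact match counts.
        return value1 == value2
    return abs(value1 - value2) / max(abs(value1), abs(value2)) <= tolerance


print(is_within_tolerance(100, 105))  # ~4.8% apart -> True
print(is_within_tolerance(100, 150))  # ~33% apart -> False
```

Dividing by the larger of the two magnitudes keeps the check symmetric: `is_within_tolerance(a, b)` and `is_within_tolerance(b, a)` always agree.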
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ datasets
+ gradio
+ nbformat
+ numpy
+ pandas
+ plotly
+ polars
+ pyarrow
+ scikit-learn
+ xgboost
+ cmind
utils.py ADDED
@@ -0,0 +1,115 @@
+ import json
+ import logging
+
+ import polars as pl
+
+ logger = logging.getLogger(__name__)
+
+
+ FEATURES = {
+     "Performance": {
+         "metrics.result": "continuous",
+         "metrics.result_per_accelerator": "continuous",
+         "metrics.accuracy": "continuous",
+     },
+     "Model": {
+         "model.name": "categorical",
+         "model.mlperf_name": "categorical",
+         "model.architecture": "categorical",
+         "model.number_of_parameters": "continuous",
+         "model.weight_data_types": "categorical",
+     },
+     "Accelerator": {
+         "system.accelerator.vendor": "categorical",
+         "system.accelerator.name": "categorical",
+         "system.accelerator.count_per_node": "continuous",
+         "system.accelerator.total_count": "continuous",
+         "system.accelerator.memory_capacity": "continuous",
+         "system.accelerator.memory_config": "text",
+         "system.interconnect.accelerator": "categorical",
+     },
+     "CPU": {
+         "system.cpu.vendor": "categorical",
+         "system.cpu.model": "categorical",
+         "system.cpu.core_count": "continuous",
+         "system.cpu.count_per_node": "continuous",
+         "system.cpu.frequency": "continuous",
+         "system.cpu.caches": "text",
+         "system.cpu.vcpu_count": "continuous",
+     },
+     "System": {
+         "system.name": "text",
+         "system.type": "categorical",
+         "system.cooling": "categorical",
+         "system.number_of_nodes": "continuous",
+         "system.memory.capacity": "continuous",
+         "system.memory.configuration": "text",
+         "system.interconnect.accelerator_host": "categorical",
+     },
+     "Software": {
+         "software.framework": "categorical",
+         "software.version": "categorical",
+         "software.operating_system": "categorical",
+     },
+     "Submission": {
+         "submission.organization": "categorical",
+         "submission.division": "categorical",
+         "submission.scenario": "categorical",
+         "submission.availability": "boolean",
+     },
+ }
+
+
+ def get_features_by_type(feature_type: str) -> list[str]:
+     """Get all features of a specific type."""
+     result = []
+     for group in FEATURES.values():
+         for feature, typ in group.items():
+             if typ == feature_type:
+                 result.append(feature)
+     return result
+
+
+ FEATURE_TYPES = {
+     "continuous": get_features_by_type("continuous"),
+     "categorical": get_features_by_type("categorical"),
+     "boolean": get_features_by_type("boolean"),
+     "text": get_features_by_type("text"),
+ }
+
+ UI_FEATURE_GROUPS = {
+     group: list(features.keys()) for group, features in FEATURES.items()
+ }
+
+
+ def get_feature_type(feature_name: str) -> str:
+     """Get the type of a feature from the FEATURES dictionary."""
+     for group in FEATURES.values():
+         if feature_name in group:
+             return group[feature_name]
+     return "categorical"
+
+
+ def load_data(file_path: str = "data.json") -> pl.DataFrame:
+     """Load processed benchmark data."""
+     logger.info(f"Loading processed data from {file_path}")
+
+     try:
+         with open(file_path, "r") as f:
+             data = json.load(f)
+
+         for item in data:
+             for key, value in item.items():
+                 if isinstance(value, str):
+                     if value.isdigit():
+                         item[key] = int(value)
+                     elif value.replace(".", "", 1).isdigit():
+                         item[key] = float(value)
+
+         df = pl.DataFrame(data, infer_schema_length=None)
+         logger.info(f"Loaded {len(df)} benchmark results")
+         return df
+
+     except Exception as e:
+         logger.error(f"Error loading data: {e}")
+         return pl.DataFrame()
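`load_data` coerces digit-like strings into numbers before building the Polars frame: pure digits become `int`, and strings with at most one decimal point become `float`; everything else is left untouched. The same rule as a standalone sketch (the `coerce_numeric` name is illustrative, not from the repository):

```python
def coerce_numeric(value):
    """Convert digit-like strings to int/float, as load_data does per JSON field."""
    if isinstance(value, str):
        if value.isdigit():
            return int(value)
        # Removing one "." and re-testing accepts "3.5" but rejects "1.2.3".
        if value.replace(".", "", 1).isdigit():
            return float(value)
    return value


print(coerce_numeric("42"))     # 42
print(coerce_numeric("3.5"))    # 3.5
print(coerce_numeric("1.2.3"))  # stays the string "1.2.3"
```

Note that `str.isdigit()` is False for a leading minus sign, so negative numbers encoded as strings (e.g. `"-7"`) are left as strings by this rule.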