Dockerfile CHANGED
@@ -1,16 +1,8 @@
1
- FROM python:3.10
2
-
3
- RUN apt-get update && apt-get install -y git
4
-
5
- RUN pip install --no-cache-dir \
6
- gradio>=4.44.0 \
7
- pandas>=2.0.0 \
8
- datasets>=2.16.0 \
9
- huggingface_hub>=0.20.0 \
10
- uvicorn>=0.23.0 \
11
- fastapi \
12
- spaces
13
-
14
- COPY app.py /app/app.py
15
-
16
- CMD ["python", "/app/app.py"]
 
1
+ # Build on the prebuilt leaderboard Space image
2
+ FROM registry.hf.space/sparse-generative-ai-open-moe-llm-leaderboard:latest
3
+
4
+ RUN pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity --no-cache-dir
5
+ # Pin pydantic to resolve a dependency conflict with moe-infinity
6
+ RUN pip install pydantic==2.6.4 --no-cache-dir
7
+ # Download the spaCy English model (en_core_web_sm) required by the selfcheckgpt task
8
+ RUN python -m spacy download en_core_web_sm
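For local testing, the image can be built and run like any other container (a sketch; the image tag is illustrative, and the app is assumed to listen on Gradio's default port 7860):

```bash
# Build the Space image locally and smoke-test it
docker build -t open-moe-llm-leaderboard .
docker run --rm -p 7860:7860 open-moe-llm-leaderboard
```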
 
README.md CHANGED
@@ -1,108 +1,85 @@
1
  ---
2
- title: MoE-CAP Dashboard
3
  emoji: 🔥
4
  colorFrom: green
5
  colorTo: indigo
6
  sdk: gradio
7
- sdk_version: 5.49.1
8
  app_file: app.py
9
  pinned: true
10
  license: apache-2.0
11
  fullWidth: true
12
- allow_embedding: true
13
  tags:
14
- - leaderboard
15
  ---
16
 
17
- # MoE-CAP
18
 
19
- MoE-CAP is a benchmarking method for sparse MoE systems that evaluates them jointly across three dimensions: Cost, Accuracy, and Performance.
20
 
21
- ## News
22
- - MoE-CAP has been accepted to the NeurIPS 2025 Datasets and Benchmarks Track 🎉 See you in San Diego, USA.
23
 
24
- ## Requirements
25
- Python: >= 3.9
26
 
27
- ## Installation
28
- ```bash
29
- git clone https://github.com/sparse-generative-ai/MoE-CAP.git
30
- cd MoE-CAP
31
- pip install -e .
32
- ```
33
- Then you can import `moe_cap` directly.
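As a quick sanity check that the editable install worked (a minimal sketch; only the top-level `moe_cap` package name is confirmed above):

```python
# Confirm the editable install resolves to the cloned checkout
import moe_cap
print(moe_cap.__file__)
```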
34
 
35
- ## Quick Example
36
- ### SGLang
37
- 1. Launch our custom SGLang server (e.g., on H100 GPUs):
38
- ```bash
39
- python -m moe_cap.systems.sglang \
40
- --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 \
41
- --port 30000 \
42
- --expert-distribution-recorder-mode stat \
43
- --tp-size 8 \
44
- --reasoning-parser deepseek-r1
45
- ```
46
 
47
- 2. Run our benchmark
48
- ```bash
49
- python -m moe_cap.runner.sglang_profile \
50
- --config-file configs/gsm8k_qwen3_235b_a22b.yaml \
51
- --output_dir outputs/sglang/
52
- ```
53
 
54
- The results will be stored under `outputs/sglang/`.
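Each result is a JSON record of metrics; the dashboard loader (`json_to_row` in app.py) reads fields like the ones below (a hypothetical record with illustrative values, not real measurements):

```python
# Shape of one result record, keyed as json_to_row expects
example_metrics = {
    "model_name": "Qwen/Qwen3-235B-A22B-Thinking-2507",
    "dataset": "gsm8k",
    "method": "sglang",
    "precision": "bfloat16",
    "gpu_type": "H100",
    "e2e_s": 12.3,                # end-to-end latency in seconds
    "cost": 0.42,                 # dollars per run
    "decoding_throughput": 95.0,  # tokens per second
    "ttft": 0.8,                  # time to first token (s)
    "tpot": 0.01,                 # time per output token (s)
}
```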
 
55
 
56
- ### vLLM
57
- ```bash
58
- python -m moe_cap.systems.vllm \
59
- --model Qwen/Qwen3-235B-A22B-Thinking-2507 \
60
- --port 8000 \
61
- --host 0.0.0.0 \
62
- --tensor-parallel-size 8 \
63
- --reasoning-parser deepseek_r1 \
64
- --max-num-batched-tokens 131072 # Set max-num-batched-tokens high, per the vLLM tuning guide.
65
- # V1's mixed prefill-decode batching makes separate profiling difficult.
66
- ```
67
 
68
  ```bash
69
- python -m moe_cap.runner.openai_api_profile \
70
- --config-file configs/gsm8k_qwen3_235b_a22b.yaml \
71
- --output_dir outputs/vllm/ \
72
- --api-url http://0.0.0.0:8000/v1/completions
 
 
73
  ```
74
 
75
- The results will be stored under `outputs/vllm/`.
76
 
77
- ## Contributing to MoE-CAP
78
 
79
- Thank you for your interest in contributing to the MoE-CAP project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yinsicheng Jiang, Yao Fu or Yeqi Huang via email at [ysc.jiang@ed.ac.uk](mailto:ysc.jiang@ed.ac.uk), [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk) or [yeqi.huang@ed.ac.uk](mailto:yeqi.huang@ed.ac.uk).
80
 
81
- ### What We're Looking For in Contributions
 
 
 
82
 
83
- We are looking for contributions in several key areas to enhance the MoE-CAP project:
84
 
85
- 1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
86
 
87
- 2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
 
 
88
 
89
- 3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
90
 
91
- 4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
92
 
93
- Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
 
94
 
95
- Your contributions are crucial to the success and improvement of the MoE-CAP project. We look forward to collaborating with you.
96
-
97
- ## Cite our paper
98
- ```bibtex
99
- @misc{jiang2025moecapbenchmarkingcostaccuracy,
100
- title={MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems},
101
- author={Yinsicheng Jiang and Yao Fu and Yeqi Huang and Ping Nie and Zhan Lu and Leyang Xue and Congjie He and Man-Kit Sit and Jilong Xue and Li Dong and Ziming Miao and Dayou Du and Tairan Xu and Kai Zou and Edoardo Ponti and Luo Mai},
102
- year={2025},
103
- eprint={2412.07067},
104
- archivePrefix={arXiv},
105
- primaryClass={cs.LG},
106
- url={https://arxiv.org/abs/2412.07067},
107
- }
108
- ```
 
1
  ---
2
+ title: OPEN-MOE-LLM-LEADERBOARD
3
  emoji: 🔥
4
  colorFrom: green
5
  colorTo: indigo
6
  sdk: gradio
7
+ sdk_version: 4.26.0
8
  app_file: app.py
9
  pinned: true
10
  license: apache-2.0
11
  fullWidth: true
 
12
  tags:
13
+ - leaderboard
14
  ---
15
 
16
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
17
 
18
+ # Contributing to Open-MOE-LLM-Leaderboard
19
 
20
+ Thank you for your interest in contributing to the Open-MOE-LLM-Leaderboard project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yao Fu via email at [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk).
 
21
 
22
+ ## What We're Looking For in Contributions
 
23
 
24
+ We are looking for contributions in several key areas to enhance the Open-MOE-LLM-Leaderboard project:
 
25
 
26
+ 1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
 
27
 
28
+ 2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
 
29
 
30
+ 3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
31
+
32
+ 4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
33
+
34
+ Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
35
+
36
+ Your contributions are crucial to the success and improvement of the Open-MOE-LLM-Leaderboard project. We look forward to collaborating with you.
37
 
38
+
39
+ ## Development Setup
40
+
41
+ To start contributing, set up your development environment as follows:
 
42
 
43
  ```bash
44
+ conda create -n leaderboard python=3.10
45
+ conda activate leaderboard
46
+ pip install -r requirements.txt
47
+ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
48
+ pip install pydantic==2.6.4 # Resolves a dependency conflict with moe-infinity
49
+ python -m spacy download en_core_web_sm # Required for selfcheckgpt; the bare "en" shortcut is deprecated in spaCy 3
50
  ```
51
 
52
+ ## Architecture Overview
53
 
54
+ The Open-MOE-LLM-Leaderboard project uses the following architecture:
55
 
56
+ - **User Interface (Gradio)** --upload--> **HuggingFace Dataset (Request)** --download--> **Backend GPU Server** --upload--> **HuggingFace Dataset (Result)** --download--> **User Interface (Gradio)**
57
 
58
+ In brief:
59
+ 1. Users submit model benchmarking requests through the Gradio interface ([app.py](./app.py)). These requests are then recorded in a HuggingFace dataset ([sparse-generative-ai/requests](https://huggingface.co/datasets/sparse-generative-ai/requests)).
60
+ 2. The backend ([backend-cli.py](./backend-cli.py)), running on a GPU server, processes these requests, performs the benchmarking tasks, and uploads the results to another HuggingFace dataset ([sparse-generative-ai/results](https://huggingface.co/datasets/sparse-generative-ai/results)); a code sketch of this step follows after the list.
61
+ 3. Finally, the Gradio interface retrieves and displays these results to the users.
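As a rough sketch of step 2, the backend loop amounts to syncing the requests dataset, evaluating, and pushing results (the repo IDs follow the datasets linked above; `run_eval` is a placeholder for the real benchmarking entry point):

```python
from huggingface_hub import snapshot_download, HfApi

def poll_once(run_eval):
    # Pull pending request files from the requests dataset
    snapshot_download(repo_id="sparse-generative-ai/requests",
                      repo_type="dataset", local_dir="eval-queue")
    # Benchmark on the GPU server (placeholder entry point)
    result_path = run_eval("eval-queue")
    # Push the produced result JSON to the results dataset
    HfApi().upload_file(path_or_fileobj=result_path,
                        path_in_repo=result_path,
                        repo_id="sparse-generative-ai/results",
                        repo_type="dataset")
```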
62
 
63
+ ## Running the Gradio Interface
64
 
65
+ To launch the Gradio interface, execute:
66
 
67
+ ```bash
68
+ python app.py
69
+ ```
70
 
71
+ Then, open your browser and navigate to http://127.0.0.1:7860.
72
 
73
+ ## Running the Backend
74
 
75
+ To start the backend process, use:
76
+
77
+ ```bash
78
+ python backend-cli.py --debug
79
+ ```
80
+
81
+ For additional details, please consult the [backend-cli.py](./backend-cli.py) script.
82
+
83
+ ---
84
 
85
+ We look forward to your contributions and are here to help guide you through the process. Thank you for supporting the Open-MOE-LLM-Leaderboard project!
 
app.py CHANGED
@@ -1,527 +1,496 @@
1
  #!/usr/bin/env python
2
  import os
3
- import json
4
- from typing import List, Tuple
5
-
6
- os.environ["GRADIO_LANGUAGE"] = "en"
7
-
8
- RESULT_DIR = os.environ.get("MOECAP_RESULT_DIR")
9
- if not RESULT_DIR:
10
- # For testing purposes, you can uncomment the line below:
11
- # RESULT_DIR = "generic_result_dir"
12
- # If you are running locally without this env var,
13
- # ensure you handle this error or set the var.
14
- pass
15
 
16
  import gradio as gr
17
  import pandas as pd
18
- from datasets import load_dataset
19
- import plotly.graph_objects as go
20
-
21
-
22
- def f2(x):
23
- """Format to 2 decimal places if number, else return as-is."""
24
- if isinstance(x, (int, float)):
25
- return round(float(x), 2)
26
- return x
27
-
28
-
29
- def normalize(val, vmin, vmax, baseline=20):
30
- """Normalize value to baseline-100 range."""
31
- if vmax == vmin:
32
- return baseline + 40
33
- return baseline + (val - vmin) / (vmax - vmin) * (100 - baseline)
 
 
34
 
35
 
36
- def normalize_cost(val, max_tick, baseline=20):
37
- """Normalize cost (lower is better)."""
38
- if max_tick == 0:
39
- return baseline + 40
40
- return baseline + (max_tick - min(val, max_tick)) / max_tick * (100 - baseline)
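For concreteness, with the default `baseline=20` these helpers map raw values into a 20-100 band (a worked example against the definitions above):

```python
# normalize: linear rescale, minimum -> 20, maximum -> 100
assert normalize(0.5, vmin=0.0, vmax=1.0) == 60.0   # midpoint of the band
# normalize_cost: inverted, since lower cost is better
assert normalize_cost(0.0, max_tick=10.0) == 100.0  # zero cost scores best
assert normalize_cost(10.0, max_tick=10.0) == 20.0  # max cost sits at the baseline
```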
41
 
42
 
43
- def generate_radar_plot(selected_rows_data: List[dict]) -> go.Figure:
44
- """Generate a CAP radar plot from selected rows."""
45
-
46
- layout_settings = dict(
47
- height=750,
48
- autosize=True,
49
- margin=dict(t=80, b=100, l=80, r=80),
50
- paper_bgcolor='white',
51
- plot_bgcolor='white',
52
- )
53
 
54
- if not selected_rows_data or len(selected_rows_data) == 0:
55
- fig = go.Figure()
56
- fig.add_annotation(
57
- text="Please select 1-3 rows from the table to generate radar plot",
58
- xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
59
- font=dict(size=16, color="black"), # Ensure text is black
60
- xanchor='center', yanchor='middle'
61
- )
62
- fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
63
- return fig
64
-
65
- if len(selected_rows_data) > 3:
66
- fig = go.Figure()
67
- fig.add_annotation(
68
- text="Error: Please select no more than 3 rows!",
69
- xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
70
- font=dict(size=18, color="red"),
71
- xanchor='center', yanchor='middle'
72
  )
73
- fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
74
- return fig
75
-
76
- datasets = [row.get('Dataset', '') for row in selected_rows_data]
77
- unique_datasets = set(datasets)
78
- if len(unique_datasets) > 1:
79
- fig = go.Figure()
80
- fig.add_annotation(
81
- text="Error: Please select rows from the same dataset!",
82
- xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
83
- font=dict(size=18, color="red"),
84
- xanchor='center', yanchor='middle'
85
  )
86
- fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False), **layout_settings)
87
- return fig
88
-
89
- dataset_name = datasets[0] if datasets else "Unknown"
90
-
91
- data = {}
92
- for row in selected_rows_data:
93
- model_name = row.get('Model', 'Unknown')
94
- if isinstance(model_name, str) and 'href' in model_name:
95
- try:
96
- model_name = model_name.split('>', 1)[1].split('<', 1)[0]
97
- except:
98
- pass
99
-
100
- method = row.get('Method', '')
101
- if isinstance(model_name, str) and '/' in model_name:
102
- legend_name = model_name.split('/')[-1]
103
- else:
104
- legend_name = str(model_name)
105
-
106
- if method and method not in ['Unknown', '-', '']:
107
- legend_name = f"{legend_name}-{method}"
108
-
109
- acc = row.get('Accuracy(%)', 0)
110
- cost = row.get('Cost($)', 0)
111
- throughput = row.get('Decoding T/s', 0)
112
-
113
- try:
114
- acc = float(acc) if acc not in [None, '-', ''] else 0
115
- cost = float(cost) if cost not in [None, '-', ''] else 0
116
- throughput = float(throughput) if throughput not in [None, '-', ''] else 0
117
- except:
118
- acc, cost, throughput = 0, 0, 0
119
-
120
- data[legend_name] = {
121
- 'accuracy': acc / 100.0 if acc > 1 else acc,
122
- 'cost': cost,
123
- 'throughput': throughput
124
- }
125
-
126
- throughputs = [v['throughput'] for v in data.values()]
127
- costs = [v['cost'] for v in data.values()]
128
- accs = [v['accuracy'] for v in data.values()]
129
-
130
- tp_min, tp_max = (min(throughputs), max(throughputs)) if throughputs else (0, 1)
131
- cost_max = max(costs) if costs else 1
132
- acc_min, acc_max = (min(accs), 1.0) if accs else (0, 1)
133
-
134
- baseline = 20
135
- categories = ['Throughput (T/s)', 'Cost ($)', 'Accuracy', 'Throughput (T/s)']
136
 
137
- fig = go.Figure()
138
 
139
- for system, values in data.items():
140
- raw_vals = [values['throughput'], values['cost'], values['accuracy']]
141
- norm_vals = [
142
- normalize(values['throughput'], tp_min, tp_max, baseline),
143
- normalize_cost(values['cost'], cost_max, baseline),
144
- normalize(values['accuracy'], acc_min, acc_max, baseline)
145
- ]
146
- norm_vals += [norm_vals[0]]
 
147
 
148
- hovertext = [
149
- f"Throughput: {raw_vals[0]:.2f} T/s",
150
- f"Cost: ${raw_vals[1]:.2f}",
151
- f"Accuracy: {raw_vals[2]*100:.2f}%",
152
- f"Throughput: {raw_vals[0]:.2f} T/s"
153
- ]
154
-
155
- fig.add_trace(go.Scatterpolar(
156
- r=norm_vals,
157
- theta=categories,
158
- fill='toself',
159
- name=system,
160
- text=hovertext,
161
- hoverinfo='text+name',
162
- line=dict(width=2)
163
- ))
164
-
165
- fig.update_layout(
166
- title=dict(text=f"CAP Radar Plot: {dataset_name}", x=0.5, xanchor='center', font=dict(size=20, color="black")),
167
- polar=dict(
168
- radialaxis=dict(
169
- visible=True,
170
- range=[0, 100],
171
- tickfont=dict(size=12, color="black"),
172
- gridcolor='lightgray', # Add this
173
- linecolor='gray', # Add this
174
- showline=True # Add this
175
- ),
176
- angularaxis=dict(
177
- tickfont=dict(size=14, color="black"),
178
- rotation=90,
179
- direction='clockwise',
180
- gridcolor='lightgray', # Add this
181
- linecolor='gray', # Add this
182
- showline=True # Add this
183
- ),
184
- bgcolor="white"
185
- ),
186
- legend=dict(orientation='h', yanchor='bottom', y=-0.15, xanchor='center', x=0.5, font=dict(size=13, color="black")),
187
- **layout_settings
188
- )
189
-
190
- return fig
191
-
192
-
193
- def json_to_row(path: str, metrics: dict) -> dict:
194
- model_name = metrics.get("model_name")
195
- if not model_name:
196
- model_name = "unknown-model"
197
-
198
- dataset = metrics.get("dataset", "Unknown")
199
- method = metrics.get("method", "Unknown")
200
- precision = metrics.get("precision", "Unknown")
201
- model_type = metrics.get("model_type", "Unknown")
202
- e2e_s = metrics.get("e2e_s", None)
203
- batch_size = metrics.get("batch_size", None)
204
- gpu_type = metrics.get("gpu_type", "")
205
- cost = metrics.get("cost", None)
206
-
207
- em = metrics.get("exact_match")
208
- correct = metrics.get("correct")
209
- total = metrics.get("total")
210
- if isinstance(correct, (int, float)) and isinstance(total, (int, float)) and total > 0:
211
- acc = correct / total
212
- else:
213
- acc = em
214
-
215
- def pct(x):
216
- return round(x * 100, 2) if isinstance(x, (int, float)) else None
217
-
218
- if isinstance(model_name, str) and "/" in model_name:
219
- hf_url = f"https://huggingface.co/{model_name}"
220
- model_cell = f"<a href='{hf_url}' target='_blank' style='color: #0366d6; text-decoration: none;'>{model_name}</a>"
221
- else:
222
- model_cell = model_name
223
-
224
- row = {
225
- "Model": model_cell,
226
- "Dataset": dataset,
227
- "Method": method,
228
- "Model type": model_type,
229
- "Precision": precision,
230
- "E2E(s)": f2(e2e_s),
231
- "GPU": gpu_type,
232
- "Accuracy(%)": pct(acc),
233
- "Cost($)": cost,
234
- "Decoding T/s": f2(metrics.get("decoding_throughput")),
235
- "Prefill T/s": f2(metrics.get("prefill_tp")),
236
- "Prefill<br>S-MBU(%)": pct(metrics.get("prefill_smbu")),
237
- "Prefill<br>S-MFU(%)": pct(metrics.get("prefill_smfu")),
238
- "Decoding<br>S-MBU(%)": pct(metrics.get("decoding_smbu")),
239
- "Decoding<br>S-MFU(%)": pct(metrics.get("decoding_smfu")),
240
- "TTFT(s)": f2(metrics.get("ttft")),
241
- "TPOT(s)": f2(metrics.get("tpot")),
242
- "Batch size": batch_size,
243
- }
244
- return row
245
-
246
-
247
- def load_from_dir(dir_path: str, selected_tasks=None, selected_frameworks=None, selected_model_types=None, selected_precisions=None, search_keyword="", force_refresh=False):
248
- if not dir_path:
249
- return "<p style='color:black'>Result Directory not set.</p>", []
250
 
251
- try:
252
- pattern = f"hf://datasets/{dir_path}/**/*.json"
253
- dl_mode = "force_redownload" if force_refresh else None
254
- print(f"Fetching from {pattern} (mode={dl_mode})...")
255
- ds = load_dataset("json", data_files={"train": pattern}, split="train", download_mode=dl_mode)
256
- except Exception as e:
257
- print(f"Error loading dataset: {e}")
258
- return "<p style='color:black'>No files loaded or Dataset not found.</p>", []
259
 
260
- rows = []
261
- for i, example in enumerate(ds):
262
- metrics = example.get("metrics") or example.get("json") or example
263
- rows.append(json_to_row(f"{dir_path}#{i}", metrics))
264
 
265
- if not rows:
266
- return "<p style='color:black'>No records found.</p>", []
 
267
 
268
- df = pd.DataFrame(rows)
 
 
269
 
270
- # --- Filtering Logic ---
271
- # This logic is consistent: if a filter is provided, we ONLY keep rows
272
- # where the column value is inside the selected list.
273
-
274
- if selected_tasks:
275
- df = df[df["Dataset"].astype(str).str.lower().isin([x.lower() for x in selected_tasks])]
276
- if selected_frameworks:
277
- df = df[df["Method"].astype(str).str.lower().isin([str(x).lower() for x in selected_frameworks])]
278
- if selected_model_types:
279
- df = df[df["Model type"].astype(str).str.lower().isin([str(x).lower() for x in selected_model_types])]
280
- if selected_precisions:
281
- df = df[df["Precision"].astype(str).str.lower().isin([str(x).lower() for x in selected_precisions])]
282
- if search_keyword and search_keyword.strip():
283
- df = df[df.astype(str).apply(lambda row: row.str.lower().str.contains(search_keyword.strip().lower()).any(), axis=1)]
284
-
285
- if df.empty:
286
- return "<p style='color:black'>No records found.</p>", []
287
-
288
- df = df.fillna("-")
289
- df.insert(0, 'Row #', range(len(df)))
290
-
291
- table_html = f'<div class="table-container">{df.to_html(escape=False, index=False, classes="metrics-table")}</div>'
292
- df_without_rownum = df.drop('Row #', axis=1)
293
- return table_html, df_without_rownum.to_dict('records')
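Note that `load_from_dir` above leans on `datasets` resolving `hf://datasets/<repo_id>/...` globs through the Hub's fsspec filesystem, so the same pattern works standalone (a sketch; the repo ID is a placeholder):

```python
from datasets import load_dataset

# Load every JSON result file in a Hub dataset repo via an hf:// glob
ds = load_dataset(
    "json",
    data_files={"train": "hf://datasets/some-org/some-results/**/*.json"},
    split="train",
)
```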
294
 
 
 
 
295
 
296
- def auto_refresh_from_dir(dir_path, tasks, frameworks, types, precisions, search):
297
- return load_from_dir(dir_path, tasks, frameworks, types, precisions, search, force_refresh=True)
 
298
 
 
 
 
299
 
300
- def parse_and_generate_plot(df_data, indices_str):
301
- if not indices_str or not indices_str.strip():
302
- return generate_radar_plot([])
303
- try:
304
- indices = [int(idx.strip()) for idx in indices_str.split(',') if idx.strip()][:3]
305
- selected_rows = [df_data[i] for i in indices if 0 <= i < len(df_data)]
306
- return generate_radar_plot(selected_rows)
307
- except:
308
- return generate_radar_plot([])
309
-
310
-
311
- def initial_load(dir_path, tasks, frameworks, types, precisions, search):
312
- """Load data and generate initial radar plot with rows 0,1,2."""
313
- table_html, df_data = auto_refresh_from_dir(dir_path, tasks, frameworks, types, precisions, search)
314
- plot = parse_and_generate_plot(df_data, "0,1,2")
315
- return table_html, df_data, plot
316
-
317
-
318
- def build_app() -> gr.Blocks:
319
- # NUCLEAR CSS FIX: Overwrite all generic Gradio variables to force light mode
320
- row_css = """
321
- /* 1. FORCE LIGHT VARIABLES GLOBALLY */
322
- :root, .gradio-container, body {
323
- --body-background-fill: #f5f7fa !important;
324
- --body-text-color: #374151 !important;
325
- --background-fill-primary: #ffffff !important;
326
- --background-fill-secondary: #f3f4f6 !important;
327
- --border-color-primary: #e5e7eb !important;
328
- --block-background-fill: #ffffff !important;
329
- --block-label-text-color: #374151 !important;
330
- --block-title-text-color: #1f2937 !important;
331
- --input-background-fill: #ffffff !important;
332
- --color-accent: #0366d6 !important;
333
-
334
- /* Reset dark mode specific variables to light values */
335
- --neutral-50: #f9fafb; --neutral-100: #f3f4f6; --neutral-200: #e5e7eb;
336
- --neutral-300: #d1d5da; --neutral-400: #9ca3af; --neutral-500: #6b7280;
337
- --neutral-600: #4b5563; --neutral-700: #374151; --neutral-800: #1f2937;
338
- }
339
-
340
- /* 2. RESET STANDARD CONTAINERS */
341
- .gradio-container .block,
342
- .gradio-container .panel,
343
- .gradio-container .form {
344
- background-color: white !important;
345
- border-color: #e1e4e8 !important;
346
- }
347
-
348
- /* 3. SPECIFIC FIX FOR THE DARK "FILTERS" and "RADAR" SECTIONS */
349
- .filter-section {
350
- background-color: #ffffff !important;
351
- border: 2px solid #e1e4e8 !important;
352
- border-radius: 8px !important;
353
- padding: 16px !important;
354
- box-shadow: 0 2px 4px rgba(0,0,0,0.05) !important;
355
- color: #24292e !important; /* Set default text color for the section */
356
- }
357
-
358
- /* Remove background color from text elements to prevent "dark blocks" */
359
- .filter-section label,
360
- .filter-section span,
361
- .filter-section p {
362
- background-color: transparent !important;
363
- }
364
-
365
- /* 4. BUTTON FIXES - TARGET BY ID FOR SPECIFICITY */
366
- #gen_btn {
367
- background-color: #0366d6 !important;
368
- color: white !important;
369
- border: none !important;
370
- }
371
- #gen_btn:hover {
372
- opacity: 0.9;
373
- }
374
-
375
- /* 5. INPUTS & CHECKBOXES */
376
- /* Re-apply white background to inputs specifically */
377
- .filter-section input,
378
- .filter-section textarea,
379
- .filter-section select {
380
- background-color: #ffffff !important;
381
- border: 1px solid #d1d5da !important;
382
- color: #24292e !important;
383
- }
384
-
385
- /* --- FIX FOR CHECKBOXES --- */
386
- /* Use explicit styling for the checked state to ensure visibility */
387
- .filter-section input[type="checkbox"] {
388
- appearance: none !important;
389
- -webkit-appearance: none !important;
390
- width: 16px !important;
391
- height: 16px !important;
392
- background-color: white !important;
393
- border: 1px solid #d1d5da !important;
394
- border-radius: 3px !important;
395
- position: relative !important;
396
- cursor: pointer !important;
397
- }
398
-
399
- .filter-section input[type="checkbox"]:checked {
400
- background-color: #0366d6 !important;
401
- border-color: #0366d6 !important;
402
- /* Draw the checkmark using an SVG data URI */
403
- background-image: url("data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M12.207 4.793a1 1 0 010 1.414l-5 5a1 1 0 01-1.414 0l-2-2a1 1 0 011.414-1.414L6.5 9.086l4.293-4.293a1 1 0 011.414 0z'/%3e%3c/svg%3e") !important;
404
- background-size: 100% 100% !important;
405
- background-position: center !important;
406
- background-repeat: no-repeat !important;
407
- }
408
-
409
- .filter-section label span {
410
- color: #24292e !important;
411
- }
412
-
413
- /* 6. SEARCH BOX */
414
- .search-box {
415
- background: white !important;
416
- padding: 16px !important;
417
- border-radius: 6px;
418
- border: 2px solid #e1e4e8 !important;
419
- margin-bottom: 16px;
420
- }
421
-
422
- /* 7. TABLE STYLING */
423
- .table-container {
424
- overflow-x: auto;
425
- max-height: 75vh;
426
- border: 2px solid #e1e4e8;
427
- border-radius: 6px;
428
- background: white !important;
429
- }
430
- table.metrics-table {
431
- width: 100%; border-collapse: collapse; background: white !important;
432
- }
433
- table.metrics-table th, table.metrics-table td {
434
- padding: 10px 14px; border: 1px solid #e1e4e8;
435
- white-space: nowrap; font-size: 13px; color: #24292e !important;
436
- }
437
- table.metrics-table th {
438
- background: #f6f8fa !important; font-weight: 600; position: sticky; top: 0;
439
- }
440
- .metrics-table th:first-child, .metrics-table td:first-child {
441
- background-color: #f0f0f0 !important; text-align: center;
442
- }
443
-
444
- /* 8. PLOT CONTAINER - FORCE WHITE BACKGROUND */
445
- .plot-container {
446
- width: 100% !important;
447
- background-color: white !important;
448
- }
449
- .plot-container > div, .plot-container .plotly {
450
- background-color: white !important;
451
- }
452
-
453
- /* 9. LINKS */
454
- a { color: #0366d6 !important; text-decoration: none; }
455
- a:hover { text-decoration: underline; }
456
- """
457
-
458
- with gr.Blocks(title="MoE-CAP Dashboard", css=row_css, theme=gr.themes.Default()) as demo:
459
- gr.Markdown("# MoE-CAP Dashboard")
460
-
461
- with gr.Row():
462
- # Left Sidebar
463
- with gr.Column(scale=2):
464
- with gr.Group(elem_classes="search-box"):
465
- search_input = gr.Textbox(label="🔍 Search", placeholder="Search...", lines=1)
466
-
467
- with gr.Group(elem_classes="filter-section"):
468
- gr.Markdown("### 🎛️ Filters")
469
- dir_path = gr.State(RESULT_DIR)
470
-
471
- task_filter = gr.CheckboxGroup(
472
- label="📊 Tasks",
473
- choices=[("GSM8K", "gsm8k"), ("LongBench", "longbench"), ("MMLU", "mmlu"), ("NuminaMath", "numinamath"), ("RULER", "ruler")],
474
- value=["gsm8k", "longbench", "mmlu", "numinamath", "ruler"]
475
  )
476
- framework_filter = gr.CheckboxGroup(label="⚙️ Frameworks", choices=["sglang", "vllm"], value=["sglang", "vllm"])
477
- model_type_filter = gr.CheckboxGroup(label="🤖 Model Types", choices=["instruct", "thinking"], value=["instruct", "thinking"])
478
- precision_filter = gr.CheckboxGroup(label="🎯 Precision", choices=["bfloat16", "fp8"], value=["bfloat16", "fp8"])
479
-
480
- with gr.Accordion("📖 About Tasks & Metrics", open=True):
481
- gr.Markdown(
482
- "### Tasks\n- **GSM8K**, **LongBench**, **MMLU**, **NuminaMath**, **RULER**\n\n"
483
- "### Metrics\n- **E2E(s)**: Latency | **Cost($)** | **T/s**: Throughput | **S-MBU/MFU**: Utilization | **TPOT**, **TTFT**",
484
- elem_classes="info-section"
485
  )
486
 
487
- gr.Markdown(
488
- "Github Repo: [https://github.com/Auto-CAP/MoE-CAP](https://github.com/Auto-CAP/MoE-CAP)",
489
- elem_classes="info-section"
 
 
 
490
  )
491
 
492
- # Right Main Content
493
- with gr.Column(scale=5):
494
- leaderboard_output = gr.HTML(label="📈 Results")
 
495
 
496
- with gr.Group(elem_classes="filter-section"):
497
- gr.Markdown("### 📊 CAP Radar Plot")
498
- gr.Markdown("**How to use:** Look at the 'Row #' column in the table. Enter row numbers (e.g., 0,1,2) and click Generate.")
499
-
500
- with gr.Row():
501
- row_indices_input = gr.Textbox(label="Row Numbers", placeholder="0,1,2", value="0,1,2", scale=3)
502
- generate_btn = gr.Button("🎯 Generate", variant="primary", scale=1, elem_id="gen_btn")
503
-
504
- radar_plot = gr.Plot(value=generate_radar_plot([]), elem_classes="plot-container")
505
-
506
- # State & Events
507
- df_data_state = gr.State([])
508
- inputs = [dir_path, task_filter, framework_filter, model_type_filter, precision_filter, search_input]
509
-
510
- # Load data and generate initial plot on page load
511
- demo.load(fn=initial_load, inputs=inputs, outputs=[leaderboard_output, df_data_state, radar_plot])
512
-
513
- search_input.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
514
- task_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
515
- framework_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
516
- model_type_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
517
- precision_filter.change(fn=load_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
518
-
519
- generate_btn.click(fn=parse_and_generate_plot, inputs=[df_data_state, row_indices_input], outputs=[radar_plot])
520
-
521
- gr.Timer(60.0).tick(fn=auto_refresh_from_dir, inputs=inputs, outputs=[leaderboard_output, df_data_state])
522
 
523
- return demo
 
524
 
525
  if __name__ == "__main__":
526
- app = build_app()
527
- app.launch()
 
 
1
  #!/usr/bin/env python
2
  import os
3
+ import datetime
4
+ import socket
5
+ import base64
6
+ from threading import Thread
 
7
 
8
  import gradio as gr
9
  import pandas as pd
10
+ import time
11
+ from apscheduler.schedulers.background import BackgroundScheduler
12
+
13
+ from huggingface_hub import snapshot_download
14
+
15
+ from src.display.about import (
16
+ CITATION_BUTTON_LABEL,
17
+ CITATION_BUTTON_TEXT,
18
+ EVALUATION_QUEUE_TEXT,
19
+ INTRODUCTION_TEXT,
20
+ LLM_BENCHMARKS_TEXT,
21
+ LLM_BENCHMARKS_DETAILS,
22
+ FAQ_TEXT,
23
+ TITLE,
24
+ ACKNOWLEDGEMENT_TEXT,
25
+ )
26
+
27
+ from src.display.css_html_js import custom_css
28
+
29
+ from src.display.utils import (
30
+ BENCHMARK_COLS,
31
+ COLS,
32
+ EVAL_COLS,
33
+ EVAL_TYPES,
34
+ TYPES,
35
+ AutoEvalColumn,
36
+ ModelType,
37
+ InferenceFramework,
38
+ fields,
39
+ WeightType,
40
+ Precision,
41
+ GPUType
42
+ )
43
+
44
+ from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, H4_TOKEN, IS_PUBLIC, \
45
+ QUEUE_REPO, REPO_ID, RESULTS_REPO, DEBUG_QUEUE_REPO, DEBUG_RESULTS_REPO
46
+ from src.populate import get_evaluation_queue_df, get_leaderboard_df
47
+ from src.submission.submit import add_new_eval
48
+ from src.utils import get_dataset_summary_table
49
+
50
+ def get_args():
51
+ import argparse
52
+
53
+ parser = argparse.ArgumentParser(description="Run the LLM Leaderboard")
54
+ parser.add_argument("--debug", action="store_true", help="Run in debug mode")
55
+ return parser.parse_args()
56
+
57
+ args = get_args()
58
+ if args.debug:
59
+ print("Running in debug mode")
60
+ QUEUE_REPO = DEBUG_QUEUE_REPO
61
+ RESULTS_REPO = DEBUG_RESULTS_REPO
62
+
63
+ def ui_snapshot_download(repo_id, local_dir, repo_type, tqdm_class, etag_timeout):
64
+ try:
65
+ print(local_dir)
66
+ snapshot_download(
67
+ repo_id=repo_id, local_dir=local_dir, repo_type=repo_type, tqdm_class=tqdm_class, etag_timeout=etag_timeout
68
+ )
69
+ except Exception as e:
70
+ restart_space()
71
 
72
 
73
+ def restart_space():
74
+ API.restart_space(repo_id=REPO_ID, token=H4_TOKEN)
 
 
 
75
 
76
 
77
+ def init_space():
78
+ # dataset_df = get_dataset_summary_table(file_path="blog/Hallucination-Leaderboard-Summary.csv")
 
79
 
80
+ if socket.gethostname() not in {"neuromancer"}:
81
+ # sync model_type with open-llm-leaderboard
82
+ ui_snapshot_download(
83
+ repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30
 
84
  )
85
+ ui_snapshot_download(
86
+ repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30
 
87
  )
88
+ raw_data, original_df = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, "", COLS, BENCHMARK_COLS)
89
+
90
+ finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df = get_evaluation_queue_df(
91
+ EVAL_REQUESTS_PATH, EVAL_COLS
92
+ )
93
+ # return dataset_df, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df
94
+ return None, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df
 
95
 
 
96
 
97
+ def add_benchmark_columns(shown_columns):
98
+ benchmark_columns = []
99
+ for benchmark in BENCHMARK_COLS:
100
+ if benchmark in shown_columns:
101
+ for c in COLS:
102
+ if benchmark in c and benchmark != c:
103
+ benchmark_columns.append(c)
104
+ return benchmark_columns
105
+
106
+
107
+ # Searching and filtering
108
+ def update_table(
109
+ hidden_df: pd.DataFrame, columns: list, type_query: list, precision_query: list, size_query: list, query: str
110
+ ):
111
+ filtered_df = filter_models(hidden_df, type_query, size_query, precision_query)
112
+ filtered_df = filter_queries(query, filtered_df)
113
+ benchmark_columns = add_benchmark_columns(columns)
114
+ df = select_columns(filtered_df, columns + benchmark_columns)
115
+ return df
116
+
117
+
118
+ def search_table(df: pd.DataFrame, query: str) -> pd.DataFrame:
119
+ return df[(df[AutoEvalColumn.dummy.name].str.contains(query, case=False))]
120
+
121
+
122
+ def select_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
123
+ # always_here_cols = [AutoEvalColumn.model_type_symbol.name, AutoEvalColumn.model.name]
124
+
125
+ always_here_cols = [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
126
+ dummy_col = [AutoEvalColumn.dummy.name]
127
+
128
+ # We use COLS to maintain sorting
129
+ filtered_df = df[
130
+ # always_here_cols + [c for c in COLS if c in df.columns and c in columns] + [AutoEvalColumn.dummy.name]
131
+ always_here_cols
132
+ + [c for c in COLS if c in df.columns and c in columns]
133
+ + dummy_col
134
+ ]
135
+ return filtered_df
136
+
137
+
138
+ def filter_queries(query: str, filtered_df: pd.DataFrame):
139
+ final_df = []
140
+ if query != "":
141
+ queries = [q.strip() for q in query.split(";")]
142
+ for _q in queries:
143
+ _q = _q.strip()
144
+ if _q != "":
145
+ temp_filtered_df = search_table(filtered_df, _q)
146
+ if len(temp_filtered_df) > 0:
147
+ final_df.append(temp_filtered_df)
148
+ if len(final_df) > 0:
149
+ filtered_df = pd.concat(final_df)
150
+ subset = [AutoEvalColumn.model.name, AutoEvalColumn.precision.name, AutoEvalColumn.revision.name]
151
+ filtered_df = filtered_df.drop_duplicates(subset=subset)
152
+ return filtered_df
153
+
154
+
155
+ def filter_models(df: pd.DataFrame, type_query: list, size_query: list, precision_query: list) -> pd.DataFrame:
156
+ # Show all models
157
+ filtered_df = df
158
+
159
+ type_emoji = [t[0] for t in type_query]
160
+ filtered_df = filtered_df.loc[df[AutoEvalColumn.model_type_symbol.name].isin(type_emoji)]
161
+ filtered_df = filtered_df.loc[df[AutoEvalColumn.precision.name].isin(precision_query + ["None"])]
162
+
163
+ # numeric_interval = pd.IntervalIndex(sorted([NUMERIC_INTERVALS[s] for s in size_query]))
164
+ # params_column = pd.to_numeric(df[AutoEvalColumn.params.name], errors="coerce")
165
+ # mask = params_column.apply(lambda x: any(numeric_interval.contains(x)))
166
+ # filtered_df = filtered_df.loc[mask]
167
+
168
+ return filtered_df
169
+
170
+ shown_columns = None
171
+ dataset_df, original_df, finished_eval_queue_df, running_eval_queue_df, pending_eval_queue_df = init_space()
172
+ leaderboard_df = original_df.copy()
173
+
174
+ # def update_leaderboard_table():
175
+ # global leaderboard_df, shown_columns
176
+ # print("Updating leaderboard table")
177
+ # return leaderboard_df[
178
+ # [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
179
+ # + shown_columns.value
180
+ # + [AutoEvalColumn.dummy.name]
181
+ # ] if not leaderboard_df.empty else leaderboard_df
182
 
183
 
184
+ # def update_hidden_leaderboard_table():
185
+ # global original_df
186
+ # return original_df[COLS] if original_df.empty is False else original_df
 
187
 
188
+ # def update_dataset_table():
189
+ # global dataset_df
190
+ # return dataset_df
 
191
 
192
+ # def update_finish_table():
193
+ # global finished_eval_queue_df
194
+ # return finished_eval_queue_df
195
 
196
+ # def update_running_table():
197
+ # global running_eval_queue_df
198
+ # return running_eval_queue_df
199
 
200
+ # def update_pending_table():
201
+ # global pending_eval_queue_df
202
+ # return pending_eval_queue_df
 
203
 
204
+ # def update_finish_num():
205
+ # global finished_eval_queue_df
206
+ # return len(finished_eval_queue_df)
207
 
208
+ # def update_running_num():
209
+ # global running_eval_queue_df
210
+ # return len(running_eval_queue_df)
211
 
212
+ # def update_pending_num():
213
+ # global pending_eval_queue_df
214
+ # return len(pending_eval_queue_df)
215
 
216
+ # triggered only once at startup => read query parameter if it exists
217
+ def load_query(request: gr.Request):
218
+ query = request.query_params.get("query") or ""
219
+ return query
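For example, opening the Space with `?query=Mixtral` appended to its URL pre-fills the search bar on load (the query value here is illustrative).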
220
+
221
+
222
+ def get_image_html(url, image_path):
223
+ with open(image_path, "rb") as image_file:
224
+ encoded_string = base64.b64encode(image_file.read()).decode()
225
+ return f'<a href="{url}" target="_blank"><img src="data:image/jpeg;base64,{encoded_string}" alt="NetMind.AI Logo" style="width:100pt;"></a>'
226
+
227
+
228
+ # Prepare the HTML content with the image
229
+ image_html = get_image_html("https://netmind.ai/home", "./src/display/imgs/Netmind.AI_LOGO.jpg")
230
+
231
+
232
+ demo = gr.Blocks(css=custom_css)
233
+ with demo:
234
+ gr.HTML(TITLE)
235
+ gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
236
+ gr.HTML(ACKNOWLEDGEMENT_TEXT.format(image_html=image_html))
237
+
238
+ with gr.Tabs(elem_classes="tab-buttons") as tabs:
239
+ with gr.TabItem("open-moe-llm-leaderboard", elem_id="llm-benchmark-tab-table", id=0):
240
+ with gr.Row():
241
+ with gr.Column():
242
+ with gr.Row():
243
+ search_bar = gr.Textbox(
244
+ placeholder=" 🔍 Model search (separate multiple queries with `;`)",
245
+ show_label=False,
246
+ elem_id="search-bar"
247
+ )
248
+ with gr.Row():
249
+ shown_columns = gr.CheckboxGroup(
250
+ choices=[
251
+ c.name
252
+ for c in fields(AutoEvalColumn)
253
+ if not c.hidden and not c.never_hidden and not c.dummy
254
+ ],
255
+ value=[
256
+ c.name
257
+ for c in fields(AutoEvalColumn)
258
+ if c.displayed_by_default and not c.hidden and not c.never_hidden
259
+ ],
260
+ label="Select columns to show",
261
+ elem_id="column-select",
262
+ interactive=True,
263
+ )
264
+
265
+ with gr.Column(min_width=320):
266
+ filter_columns_size = gr.CheckboxGroup(
267
+ label="Inference frameworks",
268
+ choices=[t.to_str() for t in InferenceFramework],
269
+ value=[t.to_str() for t in InferenceFramework],
270
+ interactive=True,
271
+ elem_id="filter-columns-size",
 
272
  )
273
+
274
+ filter_columns_type = gr.CheckboxGroup(
275
+ label="Model types",
276
+ choices=[t.to_str() for t in ModelType],
277
+ value=[t.to_str() for t in ModelType],
278
+ interactive=True,
279
+ elem_id="filter-columns-type",
 
 
280
  )
281
 
282
+ filter_columns_precision = gr.CheckboxGroup(
283
+ label="Precision",
284
+ choices=[i.value.name for i in Precision],
285
+ value=[i.value.name for i in Precision],
286
+ interactive=True,
287
+ elem_id="filter-columns-precision",
288
  )
289
 
290
+ # filter_columns_size = gr.CheckboxGroup(
291
+ # label="Model sizes (in billions of parameters)",
292
+ # choices=list(NUMERIC_INTERVALS.keys()),
293
+ # value=list(NUMERIC_INTERVALS.keys()),
294
+ # interactive=True,
295
+ # elem_id="filter-columns-size",
296
+ # )
297
+
298
+ # breakpoint()
299
+ benchmark_columns = add_benchmark_columns(shown_columns.value)
300
+ leaderboard_table = gr.components.Dataframe(
301
+ value=(
302
+ leaderboard_df[
303
+ [c.name for c in fields(AutoEvalColumn) if c.never_hidden]
304
+ + shown_columns.value
305
+ + benchmark_columns
306
+ + [AutoEvalColumn.dummy.name]
307
+ ]
308
+ if leaderboard_df.empty is False
309
+ else leaderboard_df
310
+ ),
311
+ headers=[c.name for c in fields(AutoEvalColumn) if c.never_hidden] + shown_columns.value + benchmark_columns,
312
+ datatype=TYPES,
313
+ elem_id="leaderboard-table",
314
+ interactive=False,
315
+ visible=True,
316
+ ) # column_widths=["2%", "20%"]
317
+
318
+ # Dummy leaderboard for handling the case when the user uses backspace key
319
+ hidden_leaderboard_table_for_search = gr.components.Dataframe(
320
+ value=original_df[COLS] if original_df.empty is False else original_df,
321
+ headers=COLS,
322
+ datatype=TYPES,
323
+ visible=False,
324
+ )
325
+
326
+ search_bar.submit(
327
+ update_table,
328
+ [
329
+ hidden_leaderboard_table_for_search,
330
+ shown_columns,
331
+ filter_columns_type,
332
+ filter_columns_precision,
333
+ filter_columns_size,
334
+ search_bar,
335
+ ],
336
+ leaderboard_table
337
+ )
338
+
339
+ # Check query parameter once at startup and update search bar
340
+ demo.load(load_query, inputs=[], outputs=[search_bar])
341
+
342
+ for selector in [shown_columns, filter_columns_type, filter_columns_precision, filter_columns_size]:
343
+ selector.change(
344
+ update_table,
345
+ [
346
+ hidden_leaderboard_table_for_search,
347
+ shown_columns,
348
+ filter_columns_type,
349
+ filter_columns_precision,
350
+ filter_columns_size,
351
+ search_bar,
352
+ ],
353
+ leaderboard_table,
354
+ queue=True,
355
+ )
356
+
357
+ # with gr.TabItem("About", elem_id="llm-benchmark-tab-table", id=2):
358
+ # gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
359
+
360
+ # dataset_table = gr.components.Dataframe(
361
+ # value=dataset_df,
362
+ # headers=list(dataset_df.columns),
363
+ # datatype=["str", "markdown", "str", "str", "str"],
364
+ # elem_id="dataset-table",
365
+ # interactive=False,
366
+ # visible=True,
367
+ # column_widths=["15%", "20%"],
368
+ # )
369
+
370
+ # gr.Markdown(LLM_BENCHMARKS_DETAILS, elem_classes="markdown-text")
371
+ # gr.Markdown(FAQ_TEXT, elem_classes="markdown-text")
372
+
373
+ with gr.TabItem("Submit a model", elem_id="llm-benchmark-tab-table", id=3):
374
+ with gr.Column():
375
+ with gr.Row():
376
+ gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
377
+
378
+ with gr.Column():
379
+ with gr.Accordion(f"✅ Finished Evaluations ({len(finished_eval_queue_df)})", open=False):
380
+ with gr.Row():
381
+ finished_eval_table = gr.components.Dataframe(
382
+ value=finished_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
383
+ )
384
+
385
+ with gr.Accordion(f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})", open=False):
386
+ with gr.Row():
387
+ running_eval_table = gr.components.Dataframe(
388
+ value=running_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
389
+ )
390
+
391
+ with gr.Accordion(f"⏳ Scheduled Evaluation Queue ({len(pending_eval_queue_df)})", open=False):
392
+ with gr.Row():
393
+ pending_eval_table = gr.components.Dataframe(
394
+ value=pending_eval_queue_df, headers=EVAL_COLS, datatype=EVAL_TYPES, row_count=5
395
+ )
396
+
397
+ with gr.Row():
398
+ gr.Markdown("# Submit your model here", elem_classes="markdown-text")
399
+
400
+ with gr.Row():
401
+ inference_framework = gr.Dropdown(
402
+ choices=[t.to_str() for t in InferenceFramework],
403
+ label="Inference framework",
404
+ multiselect=False,
405
+ value=None,
406
+ interactive=True,
407
+ )
408
 
409
+ gpu_type = gr.Dropdown(
410
+ choices=[t.to_str() for t in GPUType],
411
+ label="GPU type",
412
+ multiselect=False,
413
+ value="NVIDIA-A100-PCIe-80GB",
414
+ interactive=True,
415
+ )
416
+
417
+
418
+ with gr.Row():
419
+ with gr.Column():
420
+ model_name_textbox = gr.Textbox(label="Model name")
421
+ revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
422
+ private = gr.Checkbox(False, label="Private", visible=not IS_PUBLIC)
423
+ model_type = gr.Dropdown(
424
+ choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
425
+ label="Model type",
426
+ multiselect=False,
427
+ value=None,
428
+ interactive=True,
429
+ )
 
430
 
431
+ with gr.Column():
432
+ precision = gr.Dropdown(
433
+ choices=[i.value.name for i in Precision if i != Precision.Unknown],
434
+ label="Precision",
435
+ multiselect=False,
436
+ value="float32",
437
+ interactive=True,
438
+ )
439
 
440
+ weight_type = gr.Dropdown(
441
+ choices=[i.value.name for i in WeightType],
442
+ label="Weights type",
443
+ multiselect=False,
444
+ value="Original",
445
+ interactive=True,
446
+ )
447
+
448
+ base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
449
+
450
+ submit_button = gr.Button("Submit Eval")
451
+ submission_result = gr.Markdown()
452
+ debug = gr.Checkbox(value=args.debug, label="Debug", visible=False)
453
+ submit_button.click(
454
+ add_new_eval,
455
+ [
456
+ model_name_textbox,
457
+ base_model_name_textbox,
458
+ revision_name_textbox,
459
+ precision,
460
+ private,
461
+ weight_type,
462
+ model_type,
463
+ inference_framework,
464
+ debug,
465
+ gpu_type
466
+ ],
467
+ submission_result,
468
+ )
469
+
470
+ with gr.Row():
471
+ with gr.Accordion("Citing this leaderboard", open=False):
472
+ citation_button = gr.Textbox(
473
+ value=CITATION_BUTTON_TEXT,
474
+ label=CITATION_BUTTON_LABEL,
475
+ lines=20,
476
+ elem_id="citation-button",
477
+ show_copy_button=True,
478
+ )
479
+
480
+ scheduler = BackgroundScheduler()
481
+
482
+ scheduler.add_job(restart_space, "interval", hours=6)
483
+
484
+ def launch_backend():
485
+ import subprocess
486
+ from src.backend.envs import DEVICE
487
+
488
+ if DEVICE not in {"cpu"}:
489
+ _ = subprocess.run(["python", "backend-cli.py"])
490
+
491
+ # Thread(target=periodic_init, daemon=True).start()
492
+ # scheduler.add_job(launch_backend, "interval", seconds=120)
493
  if __name__ == "__main__":
494
+ scheduler.start()
495
+ demo.queue(default_concurrency_limit=40).launch()
496
+
backend-cli.py CHANGED
@@ -458,7 +458,6 @@ def get_args():
458
  parser.add_argument("--gpu-type", type=str, default="NVIDIA-A100-PCIe-80GB",
459
  help="GPU type. NVIDIA-A100-PCIe-80GB; NVIDIA-RTX-A5000-24GB; NVIDIA-H100-PCIe-80GB")
460
  parser.add_argument("--debug_repo", action="store_true", help="Use debug repo")
461
- parser.add_argument("--model_type", type=str, default="chat", help="Model type")
462
  return parser.parse_args()
463
 
464
 
@@ -489,8 +488,7 @@ if __name__ == "__main__":
489
  json_filepath="",
490
  precision=precision, # Use precision from arguments
491
  inference_framework=args.inference_framework, # Use inference framework from arguments
492
- gpu_type=args.gpu_type,
493
- model_type=args.model_type,
494
  )
495
  curr_gpu_type = get_gpu_details()
496
  if eval_request.gpu_type != curr_gpu_type:
 
458
  parser.add_argument("--gpu-type", type=str, default="NVIDIA-A100-PCIe-80GB",
459
  help="GPU type. NVIDIA-A100-PCIe-80GB; NVIDIA-RTX-A5000-24GB; NVIDIA-H100-PCIe-80GB")
460
  parser.add_argument("--debug_repo", action="store_true", help="Use debug repo")
 
461
  return parser.parse_args()
462
 
463
 
 
488
  json_filepath="",
489
  precision=precision, # Use precision from arguments
490
  inference_framework=args.inference_framework, # Use inference framework from arguments
491
+ gpu_type=args.gpu_type
 
492
  )
493
  curr_gpu_type = get_gpu_details()
494
  if eval_request.gpu_type != curr_gpu_type:
moe-cap-results DELETED
File without changes
requirements.txt CHANGED
@@ -1,6 +1,36 @@
1
- gradio>=4.44.0
2
- pandas
 
 
 
3
  datasets
4
- huggingface_hub<0.25.0
5
- plotly>=5.0.0
6
- kaleido>=0.2.1
 
1
+ torch
2
+ colorama
3
+ APScheduler
4
+ black
5
+ click
6
  datasets
7
+ gradio==4.26.0
8
+ gradio_client
9
+ huggingface-hub
10
+ matplotlib
11
+ numpy
12
+ pandas
13
+ plotly
14
+ python-dateutil
15
+ requests
16
+ semantic-version
17
+ tqdm
18
+ wandb
19
+ transformers
20
+ tokenizers>=0.15.0
21
+ lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@v0.4.2
22
+ accelerate
23
+ sentencepiece
24
+ langdetect
25
+ sacrebleu
26
+ cchardet
27
+ rouge_score
28
+ bert-score
29
+ evaluate
30
+ spacy==3.7.4
31
+ selfcheckgpt
32
+ immutabledict
33
+ gputil
34
+ bitsandbytes
35
+ openai
36
+ scikit-learn
src/backend/run_eval_suite.py CHANGED
@@ -17,22 +17,16 @@ def process_results_decorator(func):
17
  end_to_end_time = sum([r[1] for r in results]) / len(results)
18
  prefilling_time = sum([r[2] for r in results]) / len(results)
19
  decoding_throughput = sum([r[3] for r in results]) / len(results)
20
- decoding_mfu = sum([r[4] for r in results]) / len(results)
21
- decoding_mbu = sum([r[5] for r in results]) / len(results)
22
- prefill_throughput = sum([r[6] for r in results]) / len(results)
23
- prefill_mfu = sum([r[7] for r in results]) / len(results)
24
- prefill_mbu = sum([r[8] for r in results]) / len(results)
25
  # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
26
 
27
  result_dict = func(self, doc, processed_results, *args, **kwargs)
28
  result_dict["end_to_end_time"] = end_to_end_time
29
  result_dict["prefilling_time"] = prefilling_time
30
  result_dict["decoding_throughput"] = decoding_throughput
31
- result_dict["decoding_mfu"] = decoding_mfu
32
- result_dict["decoding_mbu"] = decoding_mbu
33
- result_dict["prefill_throughput"] = prefill_throughput
34
- result_dict["prefill_mfu"] = prefill_mfu
35
- result_dict["prefill_mbu"] = prefill_mbu
36
  return result_dict
37
  return wrapper
38
  ConfigurableTask.process_results = process_results_decorator(orig_process_results)
@@ -43,11 +37,8 @@ def aggregation_decorator(func):
43
  aggregation_list["end_to_end_time"] = mean
44
  aggregation_list["prefilling_time"] = mean
45
  aggregation_list["decoding_throughput"] = mean
46
- aggregation_list["decoding_mfu"] = mean
47
- aggregation_list["decoding_mbu"] = mean
48
- aggregation_list["prefill_throughput"] = mean
49
- aggregation_list["prefill_mfu"] = mean
50
- aggregation_list["prefill_mbu"] = mean
51
  return aggregation_list
52
  return wrapper
53
  ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
@@ -58,11 +49,8 @@ def higher_is_better_decorator(func):
58
  higher_is_better_dict["end_to_end_time"] = False
59
  higher_is_better_dict["prefilling_time"] = False
60
  higher_is_better_dict["decoding_throughput"] = True
61
- higher_is_better_dict["decoding_mfu"] = True
62
- higher_is_better_dict["decoding_mbu"] = True
63
- higher_is_better_dict["prefill_throughput"] = True
64
- higher_is_better_dict["prefill_mfu"] = True
65
- higher_is_better_dict["prefill_mbu"] = True
66
  return higher_is_better_dict
67
  return wrapper
68
  ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
@@ -77,8 +65,6 @@ from src.backend.tasks.selfcheckgpt.task import SelfCheckGPT
77
 
78
  from src.backend.huggingface_generate_until import HFLMwithChatTemplate
79
  from src.backend.moe_infinity import MoEHFLM
80
- from src.backend.vllm import VLLM_MOE
81
- from src.backend.sglang import SGLangMoE
82
 
83
  def run_evaluation(
84
  eval_request: EvalRequest,
 
17
  end_to_end_time = sum([r[1] for r in results]) / len(results)
18
  prefilling_time = sum([r[2] for r in results]) / len(results)
19
  decoding_throughput = sum([r[3] for r in results]) / len(results)
20
+ mfu = sum([r[4] for r in results]) / len(results)
21
+ mbu = sum([r[5] for r in results]) / len(results)
 
 
 
22
  # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
23
 
24
  result_dict = func(self, doc, processed_results, *args, **kwargs)
25
  result_dict["end_to_end_time"] = end_to_end_time
26
  result_dict["prefilling_time"] = prefilling_time
27
  result_dict["decoding_throughput"] = decoding_throughput
28
+ result_dict["mfu"] = mfu
29
+ result_dict["mbu"] = mbu
 
 
 
30
  return result_dict
31
  return wrapper
32
  ConfigurableTask.process_results = process_results_decorator(orig_process_results)
 
37
  aggregation_list["end_to_end_time"] = mean
38
  aggregation_list["prefilling_time"] = mean
39
  aggregation_list["decoding_throughput"] = mean
40
+ aggregation_list["mfu"] = mean
41
+ aggregation_list["mbu"] = mean
 
 
 
42
  return aggregation_list
43
  return wrapper
44
  ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
 
49
  higher_is_better_dict["end_to_end_time"] = False
50
  higher_is_better_dict["prefilling_time"] = False
51
  higher_is_better_dict["decoding_throughput"] = True
52
+ higher_is_better_dict["mfu"] = True
53
+ higher_is_better_dict["mbu"] = True
 
 
 
54
  return higher_is_better_dict
55
  return wrapper
56
  ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
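The pattern used throughout this file (wrap the original method, extend its return value, then reassign the wrapper on the class) is plain monkey-patching of `lm_eval`'s `ConfigurableTask`. A self-contained illustration of the same idea:

```python
class Task:
    def metrics(self):
        return {"accuracy": 0.9}

def with_timing(func):
    def wrapper(self, *args, **kwargs):
        out = func(self, *args, **kwargs)
        out["end_to_end_time"] = 1.23  # illustrative measurement
        return out
    return wrapper

# Reassigning on the class patches every future call site
Task.metrics = with_timing(Task.metrics)
print(Task().metrics())  # {'accuracy': 0.9, 'end_to_end_time': 1.23}
```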
 
65
 
66
  from src.backend.huggingface_generate_until import HFLMwithChatTemplate
67
  from src.backend.moe_infinity import MoEHFLM
 
 
68
 
69
  def run_evaluation(
70
  eval_request: EvalRequest,
src/backend/tasks/arena_hard/task.py CHANGED
@@ -72,7 +72,7 @@ class ArenaHard(ConfigurableTask):
72
  super().__init__(config={"metadata": {"version": self.VERSION}})
73
  # these end tokens are hard-coded because of a current limitation of llm-eval.
74
  # self.generation_kwargs = {"until": ["\n\n", "<unk>", "<|im_end|>", "</s>", "<|endoftext|>"], "max_length": 512}
75
- self.generation_kwargs = {"until": ["</s>", "<|im_end|>"], "max_gen_toks": 4096}
76
  # self.generation_kwargs_sampling_number = 5 # the number of sampling for self-consistence
77
  # self.generation_kwargs_sampling = {
78
  # "temperature": 0.99,
 
72
  super().__init__(config={"metadata": {"version": self.VERSION}})
73
  # these end tokens are hard-coded because of a current limitation of llm-eval.
74
  # self.generation_kwargs = {"until": ["\n\n", "<unk>", "<|im_end|>", "</s>", "<|endoftext|>"], "max_length": 512}
75
+ self.generation_kwargs = {"until": ["</s>", "<|im_end|>"], "max_length": 4096}
76
  # self.generation_kwargs_sampling_number = 5 # the number of sampling for self-consistence
77
  # self.generation_kwargs_sampling = {
78
  # "temperature": 0.99,
src/backend/tasks/measurement_task_utils.py CHANGED
@@ -12,12 +12,8 @@ def process_results_decorator(func):
     end_to_end_time = sum([r[1] for r in results]) / len(results)
     prefilling_time = sum([r[2] for r in results]) / len(results)
     decoding_throughput = sum([r[3] for r in results]) / len(results)
-    decoding_mfu = sum([r[4] for r in results]) / len(results)
-    decoding_mbu = sum([r[5] for r in results]) / len(results)
-    prefill_throughput = sum([r[6] for r in results]) / len(results)
-    prefill_mfu = sum([r[7] for r in results]) / len(results)
-    prefill_mbu = sum([r[8] for r in results]) / len(results)
-
+    mfu = sum([r[4] for r in results]) / len(results)
+    mbu = sum([r[5] for r in results]) / len(results)

     # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")

@@ -26,11 +22,8 @@ def process_results_decorator(func):
     result_dict["end_to_end_time"] = end_to_end_time
     result_dict["prefilling_time"] = prefilling_time
     result_dict["decoding_throughput"] = decoding_throughput
-    result_dict["decoding_mfu"] = decoding_mfu
-    result_dict["decoding_mbu"] = decoding_mbu
-    result_dict["prefill_throughput"] = prefill_throughput
-    result_dict["prefill_mfu"] = prefill_mfu
-    result_dict["prefill_mbu"] = prefill_mbu
+    result_dict["mfu"] = mfu
+    result_dict["mbu"] = mbu
     return result_dict
 return wrapper

@@ -42,11 +35,8 @@ def aggregation_decorator(func):
     aggregation_list["end_to_end_time"] = mean
     aggregation_list["prefilling_time"] = mean
     aggregation_list["decoding_throughput"] = mean
-    aggregation_list["decoding_mfu"] = mean
-    aggregation_list["decoding_mbu"] = mean
-    aggregation_list["prefill_throughput"] = mean
-    aggregation_list["prefill_mfu"] = mean
-    aggregation_list["prefill_mbu"] = mean
+    aggregation_list["mfu"] = mean
+    aggregation_list["mbu"] = mean
     return aggregation_list
 return wrapper

@@ -58,11 +48,8 @@ def higher_is_better_decorator(func):
     higher_is_better_dict["end_to_end_time"] = False
     higher_is_better_dict["prefilling_time"] = False
     higher_is_better_dict["decoding_throughput"] = True
-    higher_is_better_dict["decoding_mfu"] = True
-    higher_is_better_dict["decoding_mbu"] = True
-    higher_is_better_dict["prefill_throughput"] = True
-    higher_is_better_dict["prefill_mfu"] = True
-    higher_is_better_dict["prefill_mbu"] = True
+    higher_is_better_dict["mfu"] = True
+    higher_is_better_dict["mbu"] = True
     return higher_is_better_dict
 return wrapper

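The three hunks above all follow the same monkey-patching pattern: keep a reference to the original lm-eval method, wrap it, and reassign the wrapper onto `ConfigurableTask`. A self-contained sketch of that pattern (the class and the metric values below are stand-ins, not the real harness objects):

```python
class ConfigurableTask:
    def process_results(self, doc, results):
        return {"em": 1.0}

def process_results_decorator(func):
    def wrapper(self, doc, results, *args, **kwargs):
        result_dict = func(self, doc, results, *args, **kwargs)
        # In the real code these values come from the timed generation results.
        result_dict["end_to_end_time"] = 1.23
        result_dict["mfu"] = 0.31
        result_dict["mbu"] = 0.54
        return result_dict
    return wrapper

orig_process_results = ConfigurableTask.process_results
ConfigurableTask.process_results = process_results_decorator(orig_process_results)

print(ConfigurableTask().process_results(None, []))
# {'em': 1.0, 'end_to_end_time': 1.23, 'mfu': 0.31, 'mbu': 0.54}
```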
src/display/about.py CHANGED
@@ -18,13 +18,10 @@ Columns and Metrics:
     - Method: The MoE LLM inference framework.
     - E2E(s): Average end-to-end generation time in seconds.
     - PRE(s): Prefilling time of the input prompt in seconds.
-    - Decoding T/s: Tokens throughout per second for decoding.
-    - Decoding S-MBU(%): Sparse Model Bandwidth Utilization for decoding.
-    - Decoding S-MFU(%): Sparse Model FLOPs Utilization for decoding.
-    - Prefill T/s: Tokens throughout per second for Prefilling.
-    - Prefill S-MBU(%): Sparse Model Bandwidth Utilization for Prefilling.
-    - Prefill S-MFU(%): Sparse Model FLOPs Utilization for Prefilling.
-    - Precision: The precision of used model.
+    - T/s: Token throughput per second.
+    - MBU(%): Model Bandwidth Utilization.
+    - MFU(%): Model FLOPs Utilization.
+    - Precision: The precision of the model used.

 """

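For reference, MBU and MFU are both utilization ratios: achieved memory traffic (or compute) per second divided by the hardware peak. A back-of-the-envelope sketch using the A100 peaks from `MEM_BW_DICT`/`PEAK_FLOPS_DICT` in `src/utils.py` (the achieved numbers are made up for illustration):

```python
PEAK_BW = 1935e9       # NVIDIA-A100-PCIe-80GB memory bandwidth, bytes/s
PEAK_FLOPS = 624e12    # NVIDIA-A100-PCIe-80GB bfloat16 peak, FLOP/s

achieved_bytes_per_s = 1.1e12   # weights + KV cache actually read per second
achieved_flops_per_s = 95e12    # FLOPs actually executed per second

mbu = 100 * achieved_bytes_per_s / PEAK_BW
mfu = 100 * achieved_flops_per_s / PEAK_FLOPS
print(f"MBU {mbu:.1f}%  MFU {mfu:.1f}%")  # MBU 56.8%  MFU 15.2%
```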
src/display/utils.py CHANGED
@@ -9,32 +9,25 @@ def fields(raw_class):

 E2Es = "E2E(s)" #"End-to-end time (s)"
 PREs = "PRE(s)" #"Prefilling time (s)"
-TS = "Decoding T/s" #Decoding throughput (tok/s)
-PTS = "Prefill T/s" #Prefill throughput (tok/s)
+TS = "T/s" #Decoding throughput (tok/s)
 InFrame = "Method" #"Inference framework"
 MULTIPLE_CHOICEs = ["mmlu"]

-
 GPU_TEMP = 'Temp(C)'
 GPU_Power = 'Power(W)'
 GPU_Mem = 'Mem(G)'
 GPU_Name = "GPU"
 GPU_Util = 'Util(%)'
-DSMFU = 'Decoding S-MFU(%)'
-DSMBU = 'Decoding S-MBU(%)'
-PSMFU = 'Prefill S-MFU(%)'
-PSMBU = 'Prefill S-MBU(%)'
+MFU = 'MFU(%)'
+MBU = 'MBU(%)'
 BATCH_SIZE = 'bs'
 PRECISION = "Precision"
 system_metrics_to_name_map = {
     "end_to_end_time": f"{E2Es}",
     "prefilling_time": f"{PREs}",
     "decoding_throughput": f"{TS}",
-    "decoding_mfu": f"{DSMFU}",
-    "decoding_mbu": f"{DSMBU}",
-    "prefill_throughput": f"{PTS}",
-    "prefill_mfu": f"{PSMFU}",
-    "prefill_mbu": f"{PSMBU}",
+    "mfu": f"{MFU}",
+    "mbu": f"{MBU}"
 }

 gpu_metrics_to_name_map = {

@@ -85,11 +78,10 @@ class Tasks(Enum):

     # # XXX include me back at some point
     # selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
-    # selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
+    mmlu = Task("mmlu", "acc", "MMLU") #MMLU/Acc (5-shot)
     gsm8k = Task("gsm8k_custom", "em", "GSM8K") #GSM8K/EM (5-shot)
     # gsm8k_cot = Task("gsm8k_cot", "em", "GSM8K COT") #GSM8K COT/EM (5-shot)
     arena_hard = Task("arena_hard", "score", "Arena Hard") #Arena Hard/Score
-    mmlu = Task("mmlu", "acc", "MMLU") #MMLU/Acc (5-shot)


 # These classes are for user facing column names,

@@ -114,7 +106,7 @@ auto_eval_column_dict.append(["model", ColumnContent, ColumnContent("Model", "ma
 # # auto_eval_column_dict.append(["average", ColumnContent, ColumnContent("Avg", "number", True)])

 # Inference framework
-auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True, dummy=True)])
+auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True)])

 for task in Tasks:
     auto_eval_column_dict.append([task.name, ColumnContent, ColumnContent(task.value.col_name, "number", True)])

@@ -125,30 +117,24 @@ for task in Tasks:
     # auto_eval_column_dict.append([f"{task.name}_gpu_mem", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Mem}", "number", True, hidden=True)])
     auto_eval_column_dict.append([f"{task.name}_gpu", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Name}", "str", True, hidden=True)])
     # auto_eval_column_dict.append([f"{task.name}_gpu_util", ColumnContent, ColumnContent(f"{task.value.col_name} {GPU_Util}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} {PREs}", "number", False, hidden=True)])
     if task.value.benchmark in MULTIPLE_CHOICEs:
         continue
+    # auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} {PREs}", "number", False, hidden=True)])
     auto_eval_column_dict.append([f"{task.name}_decoding_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} {TS}", "number", True, hidden=True)])
-    # if task.value.benchmark != "gsm8k_custom":
-    #     continue
-    auto_eval_column_dict.append([f"{task.name}_decoding_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {DSMBU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_decoding_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {DSMFU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} {PTS}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {PSMBU}", "number", True, hidden=True)])
-    auto_eval_column_dict.append([f"{task.name}_prefill_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {PSMFU}", "number", True, hidden=True)])
-
+    auto_eval_column_dict.append([f"{task.name}_mbu", ColumnContent, ColumnContent(f"{task.value.col_name} {MBU}", "number", True, hidden=True)])
+    auto_eval_column_dict.append([f"{task.name}_mfu", ColumnContent, ColumnContent(f"{task.value.col_name} {MFU}", "number", True, hidden=True)])


 # Model information
-auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False, dummy=True)])
-# auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
-# auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
-auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True, dummy=True)])
-# auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
-# auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
-# auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
-# auto_eval_column_dict.append(["still_on_hub", ColumnContent, ColumnContent("Available on the hub", "bool", False)])
-# auto_eval_column_dict.append(["revision", ColumnContent, ColumnContent("Model sha", "str", False, False)])
+auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False)])
+auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
+auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
+auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True)])
+auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
+auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
+auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
+auto_eval_column_dict.append(["still_on_hub", ColumnContent, ColumnContent("Available on the hub", "bool", False)])
+auto_eval_column_dict.append(["revision", ColumnContent, ColumnContent("Model sha", "str", False, False)])
 # Dummy column for the search bar (hidden by the custom CSS)
 auto_eval_column_dict.append(["dummy", ColumnContent, ColumnContent("model_name_for_query", "str", False, dummy=True)])

@@ -174,10 +160,10 @@ class ModelDetails:


 class ModelType(Enum):
-    # PT = ModelDetails(name="pretrained", symbol="🟢")
-    # FT = ModelDetails(name="fine-tuned on domain-specific datasets", symbol="🔶")
+    PT = ModelDetails(name="pretrained", symbol="🟢")
+    FT = ModelDetails(name="fine-tuned on domain-specific datasets", symbol="🔶")
     chat = ModelDetails(name="chat models (RLHF, DPO, IFT, ...)", symbol="💬")
-    # merges = ModelDetails(name="base merges and moerges", symbol="🤝")
+    merges = ModelDetails(name="base merges and moerges", symbol="🤝")
     Unknown = ModelDetails(name="", symbol="?")

     def to_str(self, separator=" "):

@@ -185,25 +171,22 @@ class ModelType(Enum):

     @staticmethod
     def from_str(type):
-        # if "fine-tuned" in type or "🔶" in type:
-        #     return ModelType.FT
-        # if "pretrained" in type or "🟢" in type:
-        #     return ModelType.PT
+        if "fine-tuned" in type or "🔶" in type:
+            return ModelType.FT
+        if "pretrained" in type or "🟢" in type:
+            return ModelType.PT
         if any([k in type for k in ["instruction-tuned", "RL-tuned", "chat", "🟦", "⭕", "💬"]]):
             return ModelType.chat
-        # if "merge" in type or "🤝" in type:
-        #     return ModelType.merges
+        if "merge" in type or "🤝" in type:
+            return ModelType.merges
         return ModelType.Unknown


 class InferenceFramework(Enum):
     # "moe-infinity", hf-chat
-    # MoE_Infinity = ModelDetails("moe-infinity")
+    MoE_Infinity = ModelDetails("moe-infinity")
     HF_Chat = ModelDetails("hf-chat")
     VLLM = ModelDetails("vllm_moe")
-    VLLM_FIX = ModelDetails("vllm_moe_fixbs")
-    TRTLLM = ModelDetails("tensorrt_llm")
-    SGLANG = ModelDetails("sglang")
     Unknown = ModelDetails("?")

     def to_str(self):

@@ -211,23 +194,16 @@ class InferenceFramework(Enum):

     @staticmethod
     def from_str(inference_framework: str):
-        # if inference_framework in ["moe-infinity"]:
-        #     return InferenceFramework.MoE_Infinity
-        if inference_framework in ["tensorrt_llm"]:
-            return InferenceFramework.TRTLLM
+        if inference_framework in ["moe-infinity"]:
+            return InferenceFramework.MoE_Infinity
         if inference_framework in ["hf-chat"]:
            return InferenceFramework.HF_Chat
         if inference_framework in ["vllm_moe"]:
            return InferenceFramework.VLLM
-        if inference_framework in ["vllm_moe_fixbs"]:
-            return InferenceFramework.VLLM_FIX
-        if inference_framework in ["sglang"]:
-            return InferenceFramework.SGLANG
         return InferenceFramework.Unknown

 class GPUType(Enum):
     A100_sxm = ModelDetails("NVIDIA-A100-SXM4-80GB")
-    A100_sxm4 = ModelDetails("NVIDIA-A100-SMX4-80GB")
     A100_pcie = ModelDetails("NVIDIA-A100-PCIe-80GB")
     Unknown = ModelDetails("?")

@@ -249,28 +225,28 @@ class WeightType(Enum):


 class Precision(Enum):
-    # float32 = ModelDetails("float32")
-    # float16 = ModelDetails("float16")
+    float32 = ModelDetails("float32")
+    float16 = ModelDetails("float16")
     bfloat16 = ModelDetails("bfloat16")
     qt_8bit = ModelDetails("8bit")
     qt_4bit = ModelDetails("4bit")
-    # qt_GPTQ = ModelDetails("GPTQ")
+    qt_GPTQ = ModelDetails("GPTQ")
     Unknown = ModelDetails("?")

     @staticmethod
     def from_str(precision: str):
-        # if precision in ["torch.float32", "float32"]:
-        #     return Precision.float32
-        # if precision in ["torch.float16", "float16"]:
-        #     return Precision.float16
+        if precision in ["torch.float32", "float32"]:
+            return Precision.float32
+        if precision in ["torch.float16", "float16"]:
+            return Precision.float16
         if precision in ["torch.bfloat16", "bfloat16"]:
             return Precision.bfloat16
         if precision in ["8bit"]:
             return Precision.qt_8bit
         if precision in ["4bit"]:
             return Precision.qt_4bit
-        # if precision in ["GPTQ", "None"]:
-        #     return Precision.qt_GPTQ
+        if precision in ["GPTQ", "None"]:
+            return Precision.qt_GPTQ
         return Precision.Unknown

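The `auto_eval_column_dict` triples above are ultimately turned into a dataclass whose class attributes describe each leaderboard column. A runnable sketch of that construction (assumed to mirror the repo's `AutoEvalColumn`; the field names here are illustrative):

```python
from dataclasses import dataclass, make_dataclass

@dataclass(frozen=True)  # frozen, so instances are hashable and valid as defaults
class ColumnContent:
    name: str
    type: str
    displayed_by_default: bool
    hidden: bool = False
    dummy: bool = False

auto_eval_column_dict = [
    ["model", ColumnContent, ColumnContent("Model", "markdown", True)],
    ["inference_framework", ColumnContent, ColumnContent("Method", "str", True)],
    ["precision", ColumnContent, ColumnContent("Precision", "str", True)],
]
AutoEvalColumn = make_dataclass("AutoEvalColumn", auto_eval_column_dict, frozen=True)

print(AutoEvalColumn.precision.name)  # Precision
```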
@@ -140,7 +140,6 @@ class EvalResult:
140
  revision=config.get("model_sha", ""),
141
  still_on_hub=still_on_hub,
142
  architecture=architecture,
143
- model_type=ModelType.from_str(config.get("model_type", "")),
144
  inference_framework=inference_framework,
145
  )
146
 
@@ -175,22 +174,22 @@ class EvalResult:
175
 
176
  # breakpoint()
177
  # average = sum([v for v in self.results.values() if v is not None]) / len(Tasks)
178
-
179
  data_dict = {
180
  "eval_name": self.eval_name, # not a column, just a save name,
181
  AutoEvalColumn.precision.name: self.precision.value.name,
182
- # AutoEvalColumn.model_type.name: self.model_type.value.name,
183
  AutoEvalColumn.model_type_symbol.name: self.model_type.value.symbol,
184
- # AutoEvalColumn.weight_type.name: self.weight_type.value.name,
185
- # AutoEvalColumn.architecture.name: self.architecture,
186
  AutoEvalColumn.model.name: make_clickable_model(self.full_model),
187
  AutoEvalColumn.dummy.name: self.full_model,
188
- # AutoEvalColumn.revision.name: self.revision,
189
- # # AutoEvalColumn.average.name: average,
190
- # AutoEvalColumn.license.name: self.license,
191
- # AutoEvalColumn.likes.name: self.likes,
192
- # AutoEvalColumn.params.name: self.num_params,
193
- # AutoEvalColumn.still_on_hub.name: self.still_on_hub,
194
  AutoEvalColumn.inference_framework.name: self.inference_framework,
195
  }
196
 
 
140
  revision=config.get("model_sha", ""),
141
  still_on_hub=still_on_hub,
142
  architecture=architecture,
 
143
  inference_framework=inference_framework,
144
  )
145
 
 
174
 
175
  # breakpoint()
176
  # average = sum([v for v in self.results.values() if v is not None]) / len(Tasks)
177
+
178
  data_dict = {
179
  "eval_name": self.eval_name, # not a column, just a save name,
180
  AutoEvalColumn.precision.name: self.precision.value.name,
181
+ AutoEvalColumn.model_type.name: self.model_type.value.name,
182
  AutoEvalColumn.model_type_symbol.name: self.model_type.value.symbol,
183
+ AutoEvalColumn.weight_type.name: self.weight_type.value.name,
184
+ AutoEvalColumn.architecture.name: self.architecture,
185
  AutoEvalColumn.model.name: make_clickable_model(self.full_model),
186
  AutoEvalColumn.dummy.name: self.full_model,
187
+ AutoEvalColumn.revision.name: self.revision,
188
+ # AutoEvalColumn.average.name: average,
189
+ AutoEvalColumn.license.name: self.license,
190
+ AutoEvalColumn.likes.name: self.likes,
191
+ AutoEvalColumn.params.name: self.num_params,
192
+ AutoEvalColumn.still_on_hub.name: self.still_on_hub,
193
  AutoEvalColumn.inference_framework.name: self.inference_framework,
194
  }
195
 
src/populate.py CHANGED
@@ -75,7 +75,7 @@ def get_leaderboard_df(
         df[col] = np.nan

     if not df.empty:
-        df = df.map(lambda x: round(x, 2) if isinstance(x, (int, float)) else x)
+        df = df.round(decimals=2)

         # filter out if any of the benchmarks have not been produced
         # df = df[has_no_nan_values(df, benchmark_cols)]
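
`DataFrame.round` only touches numeric columns and leaves object/string columns alone, so it is both safer and faster here than mapping `round` over every cell. A quick comparison (`DataFrame.map` needs pandas >= 2.1; it was `applymap` before that):

```python
import pandas as pd

df = pd.DataFrame({"model": ["m1", "m2"], "em": [0.41237, 0.98765]})

print(df.round(decimals=2))  # rounds the numeric "em" column, skips "model"
# The elementwise version yields the same frame here, but visits every cell
# (including all the string cells) one by one:
print(df.map(lambda x: round(x, 2) if isinstance(x, (int, float)) else x))
```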
src/utils.py CHANGED
@@ -4,8 +4,6 @@ import subprocess
 import re
 import os
 import GPUtil
-from transformers import AutoConfig
-from typing import List

 try:
     from src.display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
@@ -14,63 +12,44 @@ except:
     from display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name

 MEM_BW_DICT ={
-    "NVIDIA-A100-PCIe-80GB": 1935e9,
-    "NVIDIA-A100-SXM4-80GB": 2039e9,
-    "NVIDIA-H100-PCIe-80GB": 2039e9,
-    "NVIDIA-RTX-A5000-24GB": 768e9,
-    "NVIDIA-RTX-A6000-48GB": 768e9,
 }

 PEAK_FLOPS_DICT = {
     "float32":{
         "NVIDIA-A100-PCIe-80GB": 312e12,
-        "NVIDIA-A100-SXM4-80GB": 312e12,
         "NVIDIA-H100-PCIe-80GB": 756e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
     "float16":{
         "NVIDIA-A100-PCIe-80GB": 624e12,
-        "NVIDIA-A100-SXM4-80GB": 624e12,
         "NVIDIA-H100-PCIe-80GB": 1513e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
     "bfloat16":{
         "NVIDIA-A100-PCIe-80GB": 624e12,
-        "NVIDIA-A100-SXM4-80GB": 624e12,
         "NVIDIA-H100-PCIe-80GB": 1513e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
-    "int8":{
         "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
         "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     },
-    "fp8":{
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 0,
-        "NVIDIA-RTX-A6000-48GB": 0
-    },
-    "fp4": {
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 0,
-        "NVIDIA-RTX-A6000-48GB": 0
-    },
-    "int4": {
-        "NVIDIA-A100-PCIe-80GB": 1248e12,
-        "NVIDIA-A100-SXM4-80GB": 1248e12,
-        "NVIDIA-H100-PCIe-80GB": 3026e12,
-        "NVIDIA-RTX-A5000-24GB": 222.2e12,
-        "NVIDIA-RTX-A6000-48GB": 309.7e12
     }

 }

 def my_snapshot_download(repo_id, revision, local_dir, repo_type, max_workers):
@@ -118,7 +97,7 @@ def parse_nvidia_smi():
     # print(f"gpu_indices: {gpu_indices}")
     gpu_stats = []

-    gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W\s*/\s*\d+W\s*\|\s*(\d+)MiB\s*/\s*\d+MiB\s*\|\s*(\d+)%')
     # gpu_name_pattern = re.compile(r'NVIDIA\s+([\w\s]+\d+(?:\s*GB)?)')
     gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')

@@ -216,790 +195,17 @@ def get_peak_bw(gpu_name):
216
  def get_peak_flops(gpu_name, precision):
217
  return PEAK_FLOPS_DICT[precision][gpu_name]
218
 
219
- def _calculate_batch_metrics(outputs, decoding_tp, n_layers, d_model,
220
- n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
221
- avg_activated_experts, hf_config, num_gpus, model_name,
222
- used_dtype, batch_size, precision):
223
- """Calculate metrics for a batch of outputs"""
224
- gpu_type = get_gpu_details()
225
- hardware_specs = {
226
- "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
227
- "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
228
- }
229
- kvs = []
230
- true_kvs = []
231
- attn_score = []
232
-
233
- # Calculate KV sizes
234
- per_token_kv_size = 2 * n_layers * d_head * n_kv_heads # Default calculation
235
-
236
- if "DeepSeek" in model_name:
237
- if hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
238
- per_token_kv_size = n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
239
-
240
- # Process each output
241
- for x in outputs:
242
- output_len = len(x.outputs[0].token_ids)
243
- context_prefill_size = len(x.prompt_token_ids)
244
-
245
- # Calculate attention scores
246
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim"):
247
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
248
- origin_per_token_k_state_size = n_layers * n_attn_heads * q_head_dim
249
- origin_per_token_v_state_size = n_layers * n_attn_heads * hf_config.v_head_dim
250
- attention_score = context_prefill_size * origin_per_token_k_state_size + (output_len - 1) * origin_per_token_k_state_size / 2
251
- attention_score += context_prefill_size * origin_per_token_v_state_size + (output_len - 1) * origin_per_token_v_state_size / 2
252
- attention_score = attention_score / 1e12
253
- else:
254
- origin_per_token_kv_states_size = n_layers * n_attn_heads * d_head
255
- attention_score = context_prefill_size * origin_per_token_kv_states_size + (output_len - 1) * origin_per_token_kv_states_size / 2
256
- attention_score = attention_score * 2 / 1e12
257
-
258
- # Store attention scores and KV sizes
259
- attn_score.append(attention_score)
260
- kv_size = context_prefill_size * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2
261
- kv_size = kv_size / 1e12
262
- true_kv = (context_prefill_size * per_token_kv_size + output_len * per_token_kv_size) / 1e12 * 1e3
263
- kvs.append(kv_size)
264
- true_kvs.append(true_kv)
265
-
266
- # Calculate aggregate values
267
- kv_size = sum(kvs)
268
- true_kv_size = sum(true_kvs) * 1e3
269
- attention_score = sum(attn_score) / len(attn_score)
270
-
271
- # Calculate attention size per token
272
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim") and hasattr(hf_config, "kv_lora_rank"):
273
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
274
- if not hasattr(hf_config, "q_lora_rank") or not hf_config.q_lora_rank:
275
- attention_size_per_token = (d_model * n_attn_heads * q_head_dim) + \
276
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
277
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
278
- (hf_config.v_head_dim * n_attn_heads * d_model)
279
- attention_size_per_token = attention_size_per_token / 1e12
280
- else:
281
- attention_size_per_token = (d_model * hf_config.q_lora_rank) + \
282
- (hf_config.q_lora_rank * n_attn_heads * q_head_dim) + \
283
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
284
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
285
- (hf_config.v_head_dim * n_attn_heads * d_model)
286
- attention_size_per_token = attention_size_per_token / 1e12
287
- else:
288
- attention_size_per_token = d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) + n_attn_heads * d_head * d_model
289
- attention_size_per_token = attention_size_per_token / 1e12
290
-
291
- # Calculate expert sizes
292
- expert_size = d_ff * 3 * d_model / 1e12
293
- shared_experts_size_total = 0
294
- deepseek_dense_ffn_size = 0
295
- deepseek_sparse_layer_num = 0
296
-
297
- if "Qwen" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "shared_expert_intermediate_size"):
298
- d_ff = hf_config.moe_intermediate_size
299
- d_ff_share = hf_config.shared_expert_intermediate_size
300
- shared_experts_size = d_ff_share * 3 * d_model
301
- expert_size = d_ff * 3 * d_model
302
- shared_experts_size_total = shared_experts_size / 1e12
303
- expert_size = expert_size / 1e12
304
- elif "Qwen3" in model_name and hasattr(hf_config, "moe_intermediate_size"):
305
- d_ff = hf_config.moe_intermediate_size
306
- expert_size = d_ff * 3 * d_model
307
- expert_size = expert_size / 1e12
308
- elif "DeepSeek" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "intermediate_size") and hasattr(hf_config, "first_k_dense_replace"):
309
- d_ff = hf_config.moe_intermediate_size
310
- d_ff_dense = hf_config.intermediate_size
311
- deepseek_num_dense_layer = hf_config.first_k_dense_replace
312
- shared_experts_size = d_ff * 3 * d_model
313
- expert_size = d_ff * 3 * d_model
314
- shared_experts = 2
315
- shared_experts_size_total = shared_experts_size * shared_experts / 1e12
316
- expert_size = expert_size / 1e12
317
- deepseek_sparse_layer_num = n_layers - deepseek_num_dense_layer
318
- deepseek_dense_ffn_size = d_ff_dense * 3 * d_model / 1e12
319
-
320
- # Calculate S-MBU and S-MFU
321
- if "Qwen" in model_name and not "Qwen3" in model_name:
322
- smbu = ((n_layers*(avg_activated_experts * expert_size + shared_experts_size_total + attention_size_per_token) +
323
- kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
324
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size + shared_experts_size_total) + attention_score) \
325
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
326
- elif "Qwen3" in model_name:
327
- smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
328
- kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
329
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
330
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
331
- elif "DeepSeek" in model_name:
332
- smbu = ((n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
333
- (avg_activated_experts * expert_size + shared_experts_size_total) + \
334
- deepseek_num_dense_layer * deepseek_dense_ffn_size + \
335
- kv_size) * precision/ (batch_size / decoding_tp)) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
336
- smfu = (n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
337
- (n_experts_per_tok * expert_size + shared_experts_size_total) + \
338
- deepseek_num_dense_layer * deepseek_dense_ffn_size + attention_score) \
339
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
340
- else:
341
- smbu = ((n_layers*(avg_activated_experts * expert_size + attention_size_per_token) +
342
- kv_size) * precision/ (batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
343
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
344
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
345
-
346
- return {
347
- 'smbu': smbu,
348
- 'smfu': smfu,
349
- 'kv_size': true_kv_size,
350
- 'decoding_throughput': decoding_tp
351
- }
352
-
353
- def _calculate_batch_metrics_sglang(outputs, decoding_tp, n_layers, d_model,
354
- n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
355
- avg_activated_experts, hf_config, num_gpus, model_name,
356
- used_dtype, batch_size, precision, ttft=None, prefill_tp=None):
357
- """Calculate metrics for a batch of outputs"""
358
- # Initialize hardware specs and output lists
359
- hardware_specs = _get_hardware_specs(used_dtype)
360
- output_data = _extract_output_data(outputs)
361
-
362
- # Calculate model-specific sizes
363
- per_token_kv_size = _calculate_kv_size(model_name, hf_config, n_layers, d_head, n_kv_heads)
364
- attention_size_per_token = _calculate_attention_size(model_name, hf_config, d_model, n_attn_heads, d_head, n_kv_heads)
365
- expert_config = _calculate_expert_config(model_name, hf_config, d_ff, d_model, n_layers)
366
-
367
- # Process outputs and calculate metrics
368
- metrics_data = _process_outputs(output_data, per_token_kv_size, attention_size_per_token,
369
- model_name, hf_config, n_layers, n_attn_heads, d_head)
370
-
371
- # Calculate throughput metrics
372
- if ttft is None or prefill_tp is None:
373
- ttft, prefill_tp = _calculate_throughput_metrics(batch_size, output_data['prefill_lengths'],
374
- output_data['max_duration'])
375
-
376
-
377
- # Calculate S-MBU and S-MFU
378
- smbu_smfu_metrics = _calculate_smbu_smfu(model_name, n_layers, attention_size_per_token,
379
- expert_config, avg_activated_experts, metrics_data,
380
- hardware_specs, num_gpus, precision, ttft, prefill_tp,
381
- batch_size, decoding_tp)
382
-
383
- return {
384
- 'prefill_smbu': smbu_smfu_metrics['prefill_smbu'],
385
- 'prefill_smfu': smbu_smfu_metrics['prefill_smfu'],
386
- 'decoding_smbu': smbu_smfu_metrics['decoding_smbu'],
387
- 'decoding_smfu': smbu_smfu_metrics['decoding_smfu'],
388
- 'kv_size': metrics_data['true_kv_size'],
389
- 'decoding_throughput': decoding_tp,
390
- 'prefill_tp': prefill_tp,
391
- 'ttft': ttft
392
- }
393
-
394
-
395
- def _get_hardware_specs(used_dtype):
396
- """Get hardware specifications"""
397
- gpu_type = get_gpu_details()
398
- return {
399
- "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
400
- "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
401
- }
402
-
403
-
404
- def _extract_output_data(outputs):
405
- """Extract relevant data from outputs"""
406
- prefill_lengths = []
407
- output_lengths = []
408
- max_duration = 0.0
409
-
410
- for x in outputs:
411
- output_lengths.append(x['meta_info']['completion_tokens'])
412
- prefill_lengths.append(x['meta_info']['prompt_tokens'])
413
- max_duration = max(max_duration, x['meta_info']['e2e_latency'])
414
-
415
- return {
416
- 'prefill_lengths': prefill_lengths,
417
- 'output_lengths': output_lengths,
418
- 'max_duration': max_duration
419
- }
420
-
421
-
422
- def _calculate_kv_size(model_name, hf_config, n_layers, d_head, n_kv_heads):
423
- """Calculate per-token KV size based on model type"""
424
- if "DeepSeek" in model_name and hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
425
- return n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
426
- return 2 * n_layers * d_head * n_kv_heads
427
-
428
-
429
- def _calculate_attention_size(model_name, hf_config, d_model, n_attn_heads, d_head, n_kv_heads):
430
- """Calculate attention size per token based on model type"""
431
- if ("DeepSeek" in model_name and
432
- hasattr(hf_config, "qk_rope_head_dim") and
433
- hasattr(hf_config, "qk_nope_head_dim") and
434
- hasattr(hf_config, "v_head_dim") and
435
- hasattr(hf_config, "kv_lora_rank")):
436
-
437
- return _calculate_deepseek_attention_size(hf_config, d_model, n_attn_heads)
438
-
439
- return (d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) +
440
- n_attn_heads * d_head * d_model) / 1e12
441
-
442
-
443
- def _calculate_deepseek_attention_size(hf_config, d_model, n_attn_heads):
444
- """Calculate DeepSeek-specific attention size"""
445
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
446
-
447
- base_size = ((d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) +
448
- (hf_config.kv_lora_rank * n_attn_heads *
449
- (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) +
450
- (hf_config.v_head_dim * n_attn_heads * d_model))
451
-
452
- if hasattr(hf_config, "q_lora_rank") and hf_config.q_lora_rank:
453
- q_size = (d_model * hf_config.q_lora_rank +
454
- hf_config.q_lora_rank * n_attn_heads * q_head_dim)
455
- else:
456
- q_size = d_model * n_attn_heads * q_head_dim
457
-
458
- return (base_size + q_size) / 1e12
459
-
460
-
461
- def _calculate_expert_config(model_name, hf_config, d_ff, d_model, n_layers):
462
- """Calculate expert configuration based on model type"""
463
- config = {
464
- 'expert_size': d_ff * 3 * d_model / 1e12,
465
- 'shared_experts_size_total': 0,
466
- 'deepseek_dense_ffn_size': 0,
467
- 'deepseek_sparse_layer_num': 0,
468
- 'deepseek_num_dense_layer': 0
469
- }
470
-
471
- if "Qwen" in model_name and not "Qwen3" in model_name:
472
- config.update(_get_qwen_expert_config(hf_config, d_model))
473
- elif "Qwen3" in model_name:
474
- config.update(_get_qwen3_expert_config(hf_config, d_model))
475
- elif "DeepSeek" in model_name:
476
- config.update(_get_deepseek_expert_config(hf_config, d_model, n_layers))
477
-
478
- return config
479
-
480
-
481
- def _get_qwen_expert_config(hf_config, d_model):
482
- """Get Qwen-specific expert configuration"""
483
- if (hasattr(hf_config, "moe_intermediate_size") and
484
- hasattr(hf_config, "shared_expert_intermediate_size")):
485
-
486
- return {
487
- 'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12,
488
- 'shared_experts_size_total': hf_config.shared_expert_intermediate_size * 3 * d_model / 1e12
489
- }
490
- return {}
491
-
492
-
493
- def _get_qwen3_expert_config(hf_config, d_model):
494
- """Get Qwen3-specific expert configuration"""
495
- if hasattr(hf_config, "moe_intermediate_size"):
496
- return {
497
- 'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12
498
- }
499
- return {}
500
-
501
-
502
- def _get_deepseek_expert_config(hf_config, d_model, n_layers):
503
- """Get DeepSeek-specific expert configuration"""
504
- if (hasattr(hf_config, "moe_intermediate_size") and
505
- hasattr(hf_config, "intermediate_size") and
506
- hasattr(hf_config, "first_k_dense_replace")):
507
-
508
- deepseek_num_dense_layer = hf_config.first_k_dense_replace
509
- return {
510
- 'expert_size': hf_config.moe_intermediate_size * 3 * d_model / 1e12,
511
- 'shared_experts_size_total': hf_config.moe_intermediate_size * 3 * d_model * 2 / 1e12,
512
- 'deepseek_dense_ffn_size': hf_config.intermediate_size * 3 * d_model / 1e12,
513
- 'deepseek_sparse_layer_num': n_layers - deepseek_num_dense_layer,
514
- 'deepseek_num_dense_layer': deepseek_num_dense_layer
515
- }
516
- return {}
517
-
518
-
519
- def _process_outputs(output_data, per_token_kv_size, attention_size_per_token,
520
- model_name, hf_config, n_layers, n_attn_heads, d_head):
521
- """Process outputs to calculate KV sizes and attention scores"""
522
- kvs = []
523
- true_kvs = []
524
- attn_scores = []
525
-
526
- for prefill_len, output_len in zip(output_data['prefill_lengths'], output_data['output_lengths']):
527
- # Calculate attention score
528
- attn_score = _calculate_attention_score(model_name, hf_config, prefill_len, output_len,
529
- n_layers, n_attn_heads, d_head)
530
- attn_scores.append(attn_score)
531
-
532
- # Calculate KV sizes
533
- kv_size = (prefill_len * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2) / 1e12
534
- true_kv = (prefill_len * per_token_kv_size + output_len * per_token_kv_size) / 1e9
535
-
536
- kvs.append(kv_size)
537
- true_kvs.append(true_kv)
538
-
539
- return {
540
- 'kv_size': sum(kvs),
541
- 'true_kv_size': sum(true_kvs) * 1e3,
542
- 'attention_score': sum(attn_scores) / len(attn_scores)
543
- }
544
-
545
-
546
- def _calculate_attention_score(model_name, hf_config, prefill_len, output_len,
547
- n_layers, n_attn_heads, d_head):
548
- """Calculate attention score for a single output"""
549
- if ("DeepSeek" in model_name and
550
- hasattr(hf_config, "qk_rope_head_dim") and
551
- hasattr(hf_config, "qk_nope_head_dim") and
552
- hasattr(hf_config, "v_head_dim")):
553
-
554
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
555
- k_size = n_layers * n_attn_heads * q_head_dim
556
- v_size = n_layers * n_attn_heads * hf_config.v_head_dim
557
-
558
- score = (prefill_len * k_size + (output_len - 1) * k_size / 2 +
559
- prefill_len * v_size + (output_len - 1) * v_size / 2)
560
- else:
561
- kv_size = n_layers * n_attn_heads * d_head
562
- score = (prefill_len * kv_size + (output_len - 1) * kv_size / 2) * 2
563
-
564
- return score / 1e12
565
-
566
-
567
- def _calculate_throughput_metrics(batch_size, prefill_lengths, max_duration):
568
- """Calculate throughput metrics"""
569
- total_prefill = sum(prefill_lengths)
570
- prefill_tp = total_prefill / (max_duration)
571
- ttft = max_duration / batch_size
572
- return ttft, prefill_tp
573
-
574
-
575
- def _calculate_smbu_smfu(model_name, n_layers, attention_size_per_token, expert_config,
576
- avg_activated_experts, metrics_data, hardware_specs, num_gpus,
577
- precision, ttft, prefill_tp, batch_size, decoding_tp):
578
- """Calculate S-MBU and S-MFU metrics"""
579
- prefill_activation = avg_activated_experts[1]
580
- decode_steps_activation = avg_activated_experts[2:]
581
-
582
- # Calculate prefill metrics
583
- prefill_smbu, prefill_smfu = _calculate_prefill_metrics(
584
- model_name, n_layers, attention_size_per_token, expert_config,
585
- prefill_activation, metrics_data['attention_score'], hardware_specs,
586
- num_gpus, precision, ttft, prefill_tp
587
- )
588
-
589
- # Calculate decoding metrics
590
- decoding_smbu, decoding_smfu = _calculate_decoding_metrics(
591
- model_name, n_layers, attention_size_per_token, expert_config,
592
- decode_steps_activation, metrics_data, hardware_specs,
593
- num_gpus, precision, batch_size, decoding_tp
594
- )
595
-
596
- return {
597
- 'prefill_smbu': prefill_smbu,
598
- 'prefill_smfu': prefill_smfu,
599
- 'decoding_smbu': decoding_smbu,
600
- 'decoding_smfu': decoding_smfu
601
- }
602
-
603
-
604
- def _calculate_prefill_metrics(model_name, n_layers, attention_size_per_token, expert_config,
605
- prefill_activation, attention_score, hardware_specs,
606
- num_gpus, precision, ttft, prefill_tp):
607
- """Calculate prefill S-MBU and S-MFU"""
608
- model_calculators = {
609
- 'Qwen': _calculate_qwen_prefill,
610
- 'Qwen3': _calculate_qwen3_prefill,
611
- 'DeepSeek': _calculate_deepseek_prefill
612
- }
613
-
614
- for model_type, calculator in model_calculators.items():
615
- if model_type in model_name and (model_type != 'Qwen' or 'Qwen3' not in model_name):
616
- return calculator(n_layers, attention_size_per_token, expert_config,
617
- prefill_activation, attention_score, hardware_specs,
618
- num_gpus, precision, ttft, prefill_tp)
619
-
620
- # Default case
621
- return _calculate_default_prefill(n_layers, attention_size_per_token, expert_config,
622
- prefill_activation, attention_score, hardware_specs,
623
- num_gpus, precision, ttft, prefill_tp)
624
-
625
-
626
- def _calculate_decoding_metrics(model_name, n_layers, attention_size_per_token, expert_config,
627
- decode_steps_activation, metrics_data, hardware_specs,
628
- num_gpus, precision, batch_size, decoding_tp):
629
- """Calculate decoding S-MBU and S-MFU"""
630
- decoding_smbus = []
631
-
632
- for activation in decode_steps_activation:
633
- if "Qwen" in model_name and "Qwen3" not in model_name:
634
- smbu, smfu = _calculate_qwen_decoding(n_layers, attention_size_per_token, expert_config,
635
- activation, metrics_data, hardware_specs, num_gpus,
636
- precision, batch_size, decoding_tp)
637
- elif "Qwen3" in model_name:
638
- smbu, smfu = _calculate_qwen3_decoding(n_layers, attention_size_per_token, expert_config,
639
- activation, metrics_data, hardware_specs, num_gpus,
640
- precision, batch_size, decoding_tp)
641
- elif "DeepSeek" in model_name:
642
- smbu, smfu = _calculate_deepseek_decoding(n_layers, attention_size_per_token, expert_config,
643
- activation, metrics_data, hardware_specs, num_gpus,
644
- precision, batch_size, decoding_tp)
645
- else:
646
- smbu, smfu = _calculate_default_decoding(n_layers, attention_size_per_token, expert_config,
647
- activation, metrics_data, hardware_specs, num_gpus,
648
- precision, batch_size, decoding_tp)
649
- decoding_smbus.append(smbu)
650
-
651
- return sum(decoding_smbus) / len(decoding_smbus), smfu
652
-
653
-
654
- # Helper functions for specific model calculations
655
- def _calculate_qwen_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
656
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
657
- smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
658
- expert_config['shared_experts_size_total'] +
659
- attention_size_per_token)) * precision / ttft
660
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
661
-
662
- smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size'] +
663
- expert_config['shared_experts_size_total']) + attention_score) * 2 * prefill_tp
664
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
665
-
666
- return smbu, smfu
667
-
668
-
669
- def _calculate_qwen3_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
670
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
671
- smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
672
- attention_size_per_token)) * precision / ttft
673
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
674
-
675
- smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size']) +
676
- attention_score) * 2 * prefill_tp
677
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
678
-
679
- return smbu, smfu
680
-
681
-
682
- def _calculate_deepseek_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
683
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
684
- smbu_numerator = ((n_layers * attention_size_per_token +
685
- expert_config['deepseek_sparse_layer_num'] *
686
- (prefill_activation * expert_config['expert_size'] +
687
- expert_config['shared_experts_size_total']) +
688
- expert_config['deepseek_num_dense_layer'] *
689
- expert_config['deepseek_dense_ffn_size']) * precision / ttft)
690
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
691
-
692
- smfu_numerator = ((n_layers * attention_size_per_token +
693
- expert_config['deepseek_sparse_layer_num'] *
694
- (expert_config['expert_size'] + expert_config['shared_experts_size_total']) +
695
- expert_config['deepseek_num_dense_layer'] *
696
- expert_config['deepseek_dense_ffn_size'] + attention_score) * 2 * prefill_tp)
697
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
698
-
699
- return smbu, smfu
700
-
701
-
702
- def _calculate_default_prefill(n_layers, attention_size_per_token, expert_config, prefill_activation,
703
- attention_score, hardware_specs, num_gpus, precision, ttft, prefill_tp):
704
- # Default implementation
705
- smbu_numerator = (n_layers * (prefill_activation * expert_config['expert_size'] +
706
- attention_size_per_token)) * precision / ttft
707
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
708
-
709
- smfu_numerator = (n_layers * (attention_size_per_token + expert_config['expert_size']) +
710
- attention_score) * 2 * prefill_tp
711
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
712
-
713
- return smbu, smfu
714
-
715
-
716
- def _calculate_qwen_decoding(n_layers, attention_size_per_token, expert_config, activation,
717
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
718
- smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
719
- expert_config['shared_experts_size_total'] +
720
- attention_size_per_token) +
721
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
722
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
723
-
724
- smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size'] +
725
- expert_config['shared_experts_size_total']) +
726
- metrics_data['attention_score']) * 2 * decoding_tp)
727
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
728
-
729
- return smbu, smfu
730
-
731
-
732
- def _calculate_qwen3_decoding(n_layers, attention_size_per_token, expert_config, activation,
733
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
734
- smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
735
- attention_size_per_token) +
736
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
737
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
738
-
739
- smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size']) +
740
- metrics_data['attention_score']) * 2 * decoding_tp)
741
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
742
-
743
- return smbu, smfu
744
-
745
-
746
- def _calculate_deepseek_decoding(n_layers, attention_size_per_token, expert_config, activation,
747
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
748
- smbu_numerator = ((n_layers * attention_size_per_token +
749
- expert_config['deepseek_sparse_layer_num'] *
750
- (activation * expert_config['expert_size'] +
751
- expert_config['shared_experts_size_total']) +
752
- expert_config['deepseek_num_dense_layer'] *
753
- expert_config['deepseek_dense_ffn_size'] +
754
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
755
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
756
-
757
- smfu_numerator = ((n_layers * attention_size_per_token +
758
- expert_config['deepseek_sparse_layer_num'] *
759
- (expert_config['expert_size'] + expert_config['shared_experts_size_total']) +
760
- expert_config['deepseek_num_dense_layer'] *
761
- expert_config['deepseek_dense_ffn_size'] +
762
- metrics_data['attention_score']) * 2 * decoding_tp)
763
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
764
-
765
- return smbu, smfu
766
-
767
-
768
- def _calculate_default_decoding(n_layers, attention_size_per_token, expert_config, activation,
769
- metrics_data, hardware_specs, num_gpus, precision, batch_size, decoding_tp):
770
- smbu_numerator = ((n_layers * (activation * expert_config['expert_size'] +
771
- attention_size_per_token) +
772
- metrics_data['kv_size']) * precision / (batch_size / decoding_tp))
773
- smbu = smbu_numerator / (num_gpus * hardware_specs['peak_bandwidth_tb'])
774
-
775
- smfu_numerator = ((n_layers * (attention_size_per_token + expert_config['expert_size']) +
776
- metrics_data['attention_score']) * 2 * decoding_tp)
777
- smfu = smfu_numerator / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
778
-
779
- return smbu, smfu
780
-
781
- def _calculate_batch_metrics_hflm(output_len, context_prefill_size, decoding_tp, n_layers, d_model,
782
- n_attn_heads, d_head, n_kv_heads, n_experts_per_tok, d_ff,
783
- avg_activated_experts, hf_config, num_gpus, model_name,
784
- used_dtype, batch_size, precision):
785
- """Calculate metrics for a batch of outputs"""
786
- gpu_type = get_gpu_details()
787
- hardware_specs = {
788
- "peak_bandwidth_tb": get_peak_bw(gpu_type) / 1e12,
789
- "peak_flops_tf": get_peak_flops(gpu_type, precision=used_dtype) / 1e12,
790
- }
791
-
792
- # Calculate KV sizes
793
- per_token_kv_size = 2 * n_layers * d_head * n_kv_heads # Default calculation
794
-
795
- if "DeepSeek" in model_name:
796
- if hasattr(hf_config, "kv_lora_rank") and hasattr(hf_config, "qk_rope_head_dim"):
797
- per_token_kv_size = n_layers * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)
798
-
799
-
800
- # Calculate attention scores
801
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim"):
802
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
803
- origin_per_token_k_state_size = n_layers * n_attn_heads * q_head_dim
804
- origin_per_token_v_state_size = n_layers * n_attn_heads * hf_config.v_head_dim
805
- attention_score = context_prefill_size * origin_per_token_k_state_size + (output_len - 1) * origin_per_token_k_state_size / 2
806
- attention_score += context_prefill_size * origin_per_token_v_state_size + (output_len - 1) * origin_per_token_v_state_size / 2
807
- attention_score = attention_score / 1e12
808
  else:
809
- origin_per_token_kv_states_size = n_layers * n_attn_heads * d_head
810
- attention_score = context_prefill_size * origin_per_token_kv_states_size + (output_len - 1) * origin_per_token_kv_states_size / 2
811
- attention_score = attention_score * 2 / 1e12
812
-
813
- # Store attention scores and KV sizes
814
- kv_size = context_prefill_size * per_token_kv_size + (output_len - 1) * per_token_kv_size / 2
815
- kv_size = kv_size / 1e12
816
- true_kv = (context_prefill_size * per_token_kv_size + output_len * per_token_kv_size) / 1e12 * 1e3
817
-
818
- # Calculate aggregate values
819
- kv_size = kv_size * batch_size
820
- true_kv_size = true_kv * batch_size * 1e3
821
- # Calculate attention size per token
822
- if "DeepSeek" in model_name and hasattr(hf_config, "qk_rope_head_dim") and hasattr(hf_config, "qk_nope_head_dim") and hasattr(hf_config, "v_head_dim") and hasattr(hf_config, "kv_lora_rank"):
823
- q_head_dim = hf_config.qk_rope_head_dim + hf_config.qk_nope_head_dim
824
- if not hasattr(hf_config, "q_lora_rank") or not hf_config.q_lora_rank:
825
- attention_size_per_token = (d_model * n_attn_heads * q_head_dim) + \
826
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
827
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
828
- (hf_config.v_head_dim * n_attn_heads * d_model)
829
- attention_size_per_token = attention_size_per_token / 1e12
830
- else:
831
- attention_size_per_token = (d_model * hf_config.q_lora_rank) + \
832
- (hf_config.q_lora_rank * n_attn_heads * q_head_dim) + \
833
- (d_model * (hf_config.kv_lora_rank + hf_config.qk_rope_head_dim)) + \
834
- (hf_config.kv_lora_rank * n_attn_heads * (q_head_dim - hf_config.qk_rope_head_dim + hf_config.v_head_dim)) + \
835
- (hf_config.v_head_dim * n_attn_heads * d_model)
836
- attention_size_per_token = attention_size_per_token / 1e12
837
- else:
838
- attention_size_per_token = d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2) + n_attn_heads * d_head * d_model
839
- attention_size_per_token = attention_size_per_token / 1e12
840
-
841
- # Calculate expert sizes
842
- expert_size = d_ff * 3 * d_model / 1e12
843
- shared_experts_size_total = 0
844
- deepseek_dense_ffn_size = 0
845
- deepseek_sparse_layer_num = 0
846
-
847
- if "Qwen" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "shared_expert_intermediate_size"):
848
- d_ff = hf_config.moe_intermediate_size
849
- d_ff_share = hf_config.shared_expert_intermediate_size
850
- shared_experts_size = d_ff_share * 3 * d_model
851
- expert_size = d_ff * 3 * d_model
852
- shared_experts_size_total = shared_experts_size / 1e12
853
- expert_size = expert_size / 1e12
854
- elif "Qwen3" in model_name and hasattr(hf_config, "moe_intermediate_size"):
855
- d_ff = hf_config.moe_intermediate_size
856
- expert_size = d_ff * 3 * d_model
857
- expert_size = expert_size / 1e12
858
- elif "DeepSeek" in model_name and hasattr(hf_config, "moe_intermediate_size") and hasattr(hf_config, "intermediate_size") and hasattr(hf_config, "first_k_dense_replace"):
859
- d_ff = hf_config.moe_intermediate_size
860
- d_ff_dense = hf_config.intermediate_size
861
- deepseek_num_dense_layer = hf_config.first_k_dense_replace
862
- shared_experts_size = d_ff * 3 * d_model
863
- expert_size = d_ff * 3 * d_model
864
- shared_experts = 2
865
- shared_experts_size_total = shared_experts_size * shared_experts / 1e12
866
- expert_size = expert_size / 1e12
867
- deepseek_sparse_layer_num = n_layers - deepseek_num_dense_layer
868
- deepseek_dense_ffn_size = d_ff_dense * 3 * d_model / 1e12
869
-
870
- # Calculate S-MBU and S-MFU
871
- if "Qwen" in model_name:
872
- smbu = ((n_layers*(avg_activated_experts * expert_size + shared_experts_size_total + attention_size_per_token) +
873
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
874
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size + shared_experts_size_total) + attention_score) \
875
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
876
- elif "Qwen3" in model_name:
877
- smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token) +
878
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
879
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
880
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
881
- elif "DeepSeek" in model_name:
882
- smbu = ((n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
883
- (avg_activated_experts * expert_size + shared_experts_size_total) + \
884
- deepseek_num_dense_layer * deepseek_dense_ffn_size + \
885
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
886
- smfu = (n_layers * attention_size_per_token + deepseek_sparse_layer_num * \
887
- (n_experts_per_tok * expert_size + shared_experts_size_total) + \
888
- deepseek_num_dense_layer * deepseek_dense_ffn_size + attention_score) \
889
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
890
- else:
891
- smbu = ((n_layers*(avg_activated_experts * expert_size + attention_size_per_token) +
892
- kv_size) * precision/(batch_size / decoding_tp) ) / (num_gpus * hardware_specs['peak_bandwidth_tb'])
893
- smfu = (n_layers * (attention_size_per_token + n_experts_per_tok * expert_size) + attention_score) \
894
- * 2 * decoding_tp / (num_gpus * hardware_specs['peak_flops_tf'] / 2)
895
-
896
- return {
897
- 'smbu': smbu,
898
- 'smfu': smfu,
899
- 'kv_size': true_kv_size,
900
- 'decoding_throughput': decoding_tp,
901
- 'ttft': 0
902
- }
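For readers tracing the cost model being deleted here: S-MBU divides the bytes a decode step actually touches (activated expert weights, attention weights, KV cache) by aggregate peak memory bandwidth, and S-MFU divides the FLOPs actually executed by aggregate peak compute. Below is a minimal, self-contained sketch of the generic (non-Qwen, non-DeepSeek) branch. Every number is an illustrative assumption, not benchmark output; halving `peak_flops_tf` mirrors the convention in the code above, since the FLOPS table further down appears to store NVIDIA's structured-sparsity peaks.

```python
# Toy re-derivation of the generic S-MBU / S-MFU branch above.
# All values are illustrative assumptions, not measured data.

n_layers, d_model = 32, 4096
n_attn_heads, n_kv_heads, d_head = 32, 8, 128
d_ff = 14336                          # expert FFN hidden size
n_experts_per_tok = 2                 # experts routed per token
avg_activated_experts = 6.5           # distinct experts hit per layer per batch
batch_size, decoding_tp = 64, 1500.0  # batch size, decode throughput (tokens/s)
precision = 2                         # bytes per parameter (float16)
num_gpus = 8
peak_bandwidth_tb = 2.0               # TB/s per GPU
peak_flops_tf = 1513.0                # sparse-peak TFLOP/s per GPU

# Per-layer parameter counts, scaled by 1e12 as in the code above
attention_size_per_token = (d_model * (n_attn_heads * d_head + n_kv_heads * d_head * 2)
                            + n_attn_heads * d_head * d_model) / 1e12
expert_size = d_ff * 3 * d_model / 1e12
kv_size = 0.01                        # TB of KV cache read per step (placeholder)
attention_score = 0.005               # TFLOPs of attention math (placeholder)

# S-MBU: TB actually moved per decode step, over seconds per step and peak TB/s
smbu = ((n_layers * (avg_activated_experts * expert_size + attention_size_per_token)
         + kv_size) * precision / (batch_size / decoding_tp)) / (num_gpus * peak_bandwidth_tb)

# S-MFU: 2 FLOPs per touched weight per token, over the dense (halved) peak
smfu = ((n_layers * (attention_size_per_token + n_experts_per_tok * expert_size)
         + attention_score) * 2 * decoding_tp / (num_gpus * peak_flops_tf / 2))

print(f"S-MBU ~ {smbu:.2%}, S-MFU ~ {smfu:.2%}")
```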
903
- class ModelInfoRetriever:
904
- def __init__(self, model_name: str, precision: str = 'float16'):
905
- if precision not in ['float32', 'float16', 'bfloat16', 'int8', 'int4', 'awq', 'gptq', 'fp8', 'fp4']:
906
- raise ValueError("Precision must be one of ['float32', 'float16', 'bfloat16', 'int8', 'int4', 'awq', 'gptq', 'fp8', 'fp4']")
907
- self.model_name = model_name
908
- self.precision = precision
909
- self.config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
910
- self.model_type = self.config.model_type
911
-
912
- def get_model_precision_bits(self):
913
- """Returns bit width used by the given quantization format."""
914
- if self.precision == 'float32':
915
- return 4
916
- if self.precision in ['float16', 'bfloat16']:
917
- return 2
918
- if self.precision in ['int8', 'fp8']:
919
- return 1
920
- if self.precision in ['int4', 'fp4', 'gptq', 'awq']:
921
- return 0.5
922
- raise ValueError(f"Unsupported precision: {self.precision}")
923
-
924
- def get_attention_info(self):
925
- """Returns attention-related info"""
926
- return {
927
- 'num_attention_heads': getattr(self.config, "num_attention_heads", None),
928
- 'num_key_value_heads': getattr(self.config, "num_key_value_heads", getattr(self.config, "num_kv_heads", None)),
929
- 'head_dim': getattr(self.config, "head_dim", getattr(self.config, "hidden_size", None) // getattr(self.config, "num_attention_heads", 1))
930
- }
931
-
932
- def get_rope_info(self):
933
- """Returns RoPE (rotary embedding) info if available"""
934
- if hasattr(self.config, "rope_scaling"):
935
- return {
936
- "type": self.config.rope_scaling.get("type"),
937
- "factor": self.config.rope_scaling.get("factor")
938
- }
939
- elif hasattr(self.config, "use_alibi"):
940
- return {"type": "alibi", "enabled": self.config.use_alibi}
941
- else:
942
- return {"type": "none"}
943
-
944
- def get_moe_info(self, d_model=None):
945
- """Returns MoE configuration such as number of experts and FFN dim"""
946
- if d_model is None:
947
- d_model = getattr(self.config, "hidden_size", None)
948
-
949
- num_experts = (
950
- getattr(self.config, "num_local_experts", None) or
951
- getattr(self.config, "num_experts", None) or
952
- getattr(self.config, "n_routed_experts", None) or
953
- getattr(getattr(self.config, "ffn_config", {}), "moe_num_experts", None) or
954
- 1
955
- )
956
- n_experts_per_tok = (
957
- getattr(self.config, "num_experts_per_tok", None) or
958
- getattr(self.config, "num_selected_experts", None) or
959
- getattr(getattr(self.config, "ffn_config", {}), "moe_top_k", None) or
960
- 1
961
- )
962
- d_ff = (
963
- getattr(self.config, "ffn_dim", None) or
964
- getattr(self.config, "intermediate_size", None) or
965
- getattr(self.config, "d_ff", None) or
966
- (d_model * getattr(self.config, "ff_ratio", 4)) or
967
- getattr(getattr(self.config, "ffn_config", {}), "ffn_hidden_size", None) or
968
- (4 * d_model)
969
- )
970
-
971
- return {
972
- "num_experts": num_experts,
973
- "experts_per_token": n_experts_per_tok,
974
- "ffn_dim": d_ff
975
- }
976
-
977
- def get_architecture_info(self):
978
- """Returns model-wide architecture info"""
979
- return {
980
- "model_type": self.model_type,
981
- "hidden_size": getattr(self.config, "hidden_size", None),
982
- "num_hidden_layers": getattr(self.config, "num_hidden_layers", None),
983
- "max_position_embeddings": getattr(self.config, "max_position_embeddings", None),
984
- "vocab_size": getattr(self.config, "vocab_size", None),
985
- "architectures": getattr(self.config, "architectures", []),
986
- }
987
-
988
- def summarize(self):
989
- """Aggregate all extracted info in a dictionary"""
990
- d_model = getattr(self.config, "hidden_size", None)
991
- return {
992
- "model_name": self.model_name,
993
- "model_type": self.model_type,
994
- "precision_bits": self.get_model_precision_bits(),
995
- "architecture": self.get_architecture_info(),
996
- "attention": self.get_attention_info(),
997
- "rope": self.get_rope_info(),
998
- "moe": self.get_moe_info(d_model)
999
- }
1000
-
1001
-
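The `ModelInfoRetriever` class removed above was the config-introspection entry point for these calculations; note that despite its name, `get_model_precision_bits` returns bytes per parameter (2 for float16), not bits. A sketch of how it was typically driven, assuming the class is in scope, `transformers` is installed, and the Hub config is reachable (the checkpoint name is illustrative):

```python
# Hypothetical usage of the removed helper; any MoE checkpoint works here.
info = ModelInfoRetriever("mistralai/Mixtral-8x7B-Instruct-v0.1", precision="bfloat16")

moe = info.get_moe_info()
print(info.get_model_precision_bits())  # 2 -- bytes per parameter, despite the name
print(moe["num_experts"], moe["experts_per_token"], moe["ffn_dim"])
```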
1002
 
1003
- # if __name__ == "__main__":
1004
- # print(analyze_gpu_stats(parse_nvidia_smi()))
1005
- # print(get_gpu_details())
 
4
  import re
5
  import os
6
  import GPUtil
7
 
8
  try:
9
  from src.display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
 
12
  from display.utils import GPU_TEMP, GPU_Mem, GPU_Power, GPU_Util, GPU_Name
13
 
14
  MEM_BW_DICT ={
15
+ "NVIDIA-A100-PCIe-80GB": 1935,
16
+ "NVIDIA-A100-SXM-80GB": 2039,
17
+ "NVIDIA-H100-PCIe-80GB": 2039,
18
+ "NVIDIA-RTX-A5000-24GB": 768
 
19
  }
20
 
21
  PEAK_FLOPS_DICT = {
22
  "float32":{
23
  "NVIDIA-A100-PCIe-80GB": 312e12,
24
+ "NVIDIA-A100-SXM-80GB": 312e12,
25
  "NVIDIA-H100-PCIe-80GB": 756e12,
26
+ "NVIDIA-RTX-A5000-24GB": 222.2e12
 
27
  },
28
  "float16":{
29
  "NVIDIA-A100-PCIe-80GB": 624e12,
30
+ "NVIDIA-A100-SXM-80GB": 624e12,
31
  "NVIDIA-H100-PCIe-80GB": 1513e12,
32
+ "NVIDIA-RTX-A5000-24GB": 444.4e12
 
33
  },
34
  "bfloat16":{
35
  "NVIDIA-A100-PCIe-80GB": 624e12,
36
+ "NVIDIA-A100-SXM-80GB": 624e12,
37
  "NVIDIA-H100-PCIe-80GB": 1513e12,
38
+ "NVIDIA-RTX-A5000-24GB": 444.4e12
 
39
  },
40
+ "8bit":{
41
  "NVIDIA-A100-PCIe-80GB": 1248e12,
42
+ "NVIDIA-A100-SXM-80GB": 1248e12,
43
  "NVIDIA-H100-PCIe-80GB": 3026e12,
44
+ "NVIDIA-RTX-A5000-24GB": 889e12
 
45
  },
46
+ "4bit": {
47
+ "NVIDIA-A100-PCIe-80GB": 2496e12,
48
+ "NVIDIA-A100-SXM-80GB": 2496e12,
49
+ "NVIDIA-H100-PCIe-80GB": 6052e12,
50
+ "NVIDIA-RTX-A5000-24GB": 1778e12
 
51
  }
52
+
53
  }
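These tables key hardware peaks off the normalized GPU name: `MEM_BW_DICT` holds memory bandwidth in GB/s, and `PEAK_FLOPS_DICT` holds peak FLOP/s per precision (its `8bit`/`4bit` keys line up with `transfer_precision2bytes` below). A quick lookup sketch using one of the names added above:

```python
gpu = "NVIDIA-A100-SXM-80GB"

peak_bw_gbs = MEM_BW_DICT[gpu]               # 2039 GB/s
peak_fp16 = PEAK_FLOPS_DICT["float16"][gpu]  # 624e12 FLOP/s

print(f"{gpu}: {peak_bw_gbs} GB/s, {peak_fp16 / 1e12:.0f} TFLOP/s at fp16")
```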
54
 
55
  def my_snapshot_download(repo_id, revision, local_dir, repo_type, max_workers):
 
97
  # print(f"gpu_indices: {gpu_indices}")
98
  gpu_stats = []
99
 
100
+ gpu_info_pattern = re.compile(r'(\d+)C\s+P\d+\s+(\d+)W / \d+W\s+\|\s+(\d+)MiB / \d+MiB\s+\|\s+(\d+)%')
101
  # gpu_name_pattern = re.compile(r'NVIDIA\s+([\w\s]+\d+(?:\s*GB)?)')
102
  gpu_name_pattern = re.compile(r'NVIDIA\s+(RTX\s+)?([A-Z0-9]+)')
103
 
 
195
  def get_peak_flops(gpu_name, precision):
196
  return PEAK_FLOPS_DICT[precision][gpu_name]
197
 
198
+ def transfer_precision2bytes(precision):
199
+ if precision == "float32":
200
+ return 4
201
+ elif precision in ["float16", "bfloat16"]:
202
+ return 2
203
+ elif precision == "8bit":
204
+ return 1
205
+ elif precision == "4bit":
206
+ return 0.5
207
  else:
208
+ raise ValueError(f"Unsupported precision: {precision}")
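The new `transfer_precision2bytes` helper maps the leaderboard's precision strings to bytes per parameter and raises on unknown strings. For example:

```python
assert transfer_precision2bytes("float32") == 4
assert transfer_precision2bytes("bfloat16") == 2
assert transfer_precision2bytes("4bit") == 0.5  # packed 4-bit weights: half a byte each
```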
209
 
210
+ if __name__ == "__main__":
211
+ print(analyze_gpu_stats(parse_nvidia_smi()))