Spaces: whats2000/tw-eval-analyzer

lianghsun committed on Aug 28, 2025

Commit

740e5d3

0 Parent(s):

docs: add project README and MIT license section

Files changed (8)

.gitignore +83 -0
CODE_OF_CONDUCT.md +18 -0
CONTRIBUTING.md +15 -0
LICENSE +21 -0
README.md +81 -0
STYLEGUIDE.md +22 -0
app.py +155 -0
requirements.txt +5 -0

.gitignore ADDED

	@@ -0,0 +1,83 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# Virtual environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# PyInstaller
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Jupyter Notebook
+.ipynb_checkpoints
+# PyCharm / VS Code
+.idea/
+.vscode/
+*.swp
+# mypy / pytype / pyright
+.mypy_cache/
+.dmypy.json
+dmypy.json
+.pytype/
+.pyright/
+# Cython debug symbols
+cython_debug/
+# Local configs
+*.env
+*.local
+install_ngork.sh

CODE_OF_CONDUCT.md ADDED

	@@ -0,0 +1,18 @@

+# CODE OF CONDUCT
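+This project follows a simplified version of the Contributor Covenant code of conduct.
+## Our Pledge
+- Interact with respect and inclusiveness.
+- Offer feedback constructively and avoid personal attacks.
+- Keep the community safe and welcoming.
+## Unacceptable Behavior
+- Discrimination, harassment, or offensive language or conduct.
+- Deliberately spreading misinformation or undermining collaboration.
+## Responsibilities
+Project maintainers may remove or reject inappropriate contributions and, when necessary, ban violators from participating.
+## Contact
+If you have any concerns, please contact the administrators through the **official Twinkle AI community channels**.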

CONTRIBUTING.md ADDED

	@@ -0,0 +1,15 @@

+# CONTRIBUTING
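+Thank you for your willingness to contribute to this project!
+## Basic Workflow
+1. Fork the project and create a branch (e.g. `feat/add-lesson-xyz`).
+2. Create a new course folder under `courses/` (format: `YYYY-MM-course-slug/`).
+3. Each course must include at least one `README.md` and a set of `.ipynb` files (notebook-first, fully executable).
+4. Confirm that the notebooks run from scratch to completion on **Google Colab**.
+5. Open a Pull Request with a brief summary of the changes and how they were tested.
+## Notes
+- Prioritize keeping the tutorial steps complete inside the notebook; avoid extracting code into external modules.
+- Data must not contain personal or sensitive information.
+- Including a Colab test link in the PR is recommended.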

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2020 Hugging Face
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,81 @@

+# 🌟 Eval Analyzer
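+An interactive tool built on 🎈 **Streamlit** for analyzing evaluation files (`.json` / `.jsonl`) in the **[Twinkle Eval](https://github.com/ai-twinkle/Eval)** format.
+## 📌 Features
+<p align="center">
+  <img src="https://github.com/ai-twinkle/llm-lab/blob/main/courses/2025-0827-llm-eval-with-twinkle/assets/gpt-oss-120b-mmlu-eval-report.png?raw=1" width="100%"/><br/>
+  <em>Figure: preview of gpt-oss-120b's scores on a subset of MMLU</em>
+</p>
+- Upload multiple **Twinkle Eval files** (`json` / `jsonl`).
+- Automatically parses the evaluation results and extracts:
+  - `dataset`
+  - `category`
+  - `file`
+  - `accuracy_mean`
+  - `source_label` (model name + timestamp)
+- Computes the overall average and fills it in automatically when the file omits it.
+- Visualization:
+  - Bar chart per category (grouped by model for comparison).
+  - Selectable sort order (mean high→low, mean low→high, alphabetical).
+  - Pagination (configurable number of categories per page).
+  - Metric can be shown as raw values or on a 0–100 scale.
+- **CSV export** (download the results of each page).
+## 🚀 Usage
+### 1. Set up the environment
+Using a virtual environment (such as `venv` or `conda`) is recommended: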
+```bash
+pip install -r requirements.txt
+```
+### 2. Launch the app
+```bash
+streamlit run app.py
+```
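+### 3. Workflow
+1. Upload one or more **Twinkle Eval files** in the left sidebar.
+2. Select the dataset to view.
+3. Set the sort order, page size, and display scale (0–1 or 0–100).
+4. View the charts and tables, and optionally download the CSV.
+## 📂 File Format Requirements
+Each json / jsonl file must follow the Twinkle Eval format and contain at least the following fields: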
+```json
+{
+  "timestamp": "2025-08-20T10:00:00",
+  "config": {
+    "model": { "name": "my-model" }
+  },
+  "dataset_results": {
+    "datasets/my_dataset": {
+      "average_accuracy": 0.85,
+      "results": [
+        {
+          "file": "category1.json",
+          "accuracy_mean": 0.9
+        },
+        {
+          "file": "category2.json",
+          "accuracy_mean": 0.8
+        }
+      ]
+    }
+  }
+}
+```
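+Alternatively, download samples from the Twinkle AI [Eval logs](https://huggingface.co/collections/twinkle-ai/eval-logs-6811a657da5ce4cbd75dbf50) collection.
+## 📊 Sample Output
+- **Chart**: compares `accuracy_mean` across models for each category.
+- **Table**: a pivot table with categories as rows, models as columns, and accuracy as values.
+- **Download**: each page of results can be exported as CSV.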
+## 📄 License
+MIT
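
The file format described in the README can be flattened into one row per result file in a few lines. The following is a minimal sketch rather than the project's code: `flatten_eval_doc` is a hypothetical helper name, it uses only the standard library, and it assumes the category is the result file's name without its extension.

```python
from pathlib import PurePosixPath
from statistics import mean

# Sample document copied from the README's format example.
doc = {
    "timestamp": "2025-08-20T10:00:00",
    "config": {"model": {"name": "my-model"}},
    "dataset_results": {
        "datasets/my_dataset": {
            "average_accuracy": 0.85,
            "results": [
                {"file": "category1.json", "accuracy_mean": 0.9},
                {"file": "category2.json", "accuracy_mean": 0.8},
            ],
        }
    },
}


def flatten_eval_doc(doc):
    """Hypothetical helper: return (rows, averages) for one Twinkle Eval document."""
    label = f'{doc["config"]["model"]["name"]} @ {doc["timestamp"]}'
    rows, averages = [], {}
    for ds_path, payload in doc["dataset_results"].items():
        # Drop the leading "datasets/" prefix when present.
        prefix = "datasets/"
        dataset = ds_path[len(prefix):] if ds_path.startswith(prefix) else ds_path
        scores = []
        for item in payload.get("results", []):
            name = PurePosixPath(item["file"]).name
            score = float(item["accuracy_mean"])
            rows.append({
                "dataset": dataset,
                "category": name.rsplit(".", 1)[0],  # file name without extension
                "file": name,
                "accuracy_mean": score,
                "source_label": label,
            })
            scores.append(score)
        # Prefer the reported average; otherwise use the mean of the per-file scores.
        avg = payload.get("average_accuracy")
        if avg is None and scores:
            avg = mean(scores)
        if avg is not None:
            averages[dataset] = avg
    return rows, averages


rows, averages = flatten_eval_doc(doc)
for row in rows:
    print(row["category"], row["accuracy_mean"])
```

Each row carries a `source_label`, so rows from several uploaded files can be concatenated and later compared per model.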

STYLEGUIDE.md ADDED

	@@ -0,0 +1,22 @@

+# STYLEGUIDE
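+## Language and Wording
+- Course content and comments are written primarily in **Traditional Chinese**.
+- Avoid the phrase 「台灣地區」 ("Taiwan region"); use 「台灣」 ("Taiwan") directly.
+## Notebook Structure
+Each notebook should include the following sections:
+1. Learning objectives
+2. Key points
+3. Hands-on steps (with complete, executable code cells)
+4. Exercises or assignments
+5. Further reading
+## Code Style
+- Prioritize **teaching clarity**; repeating code within a notebook is allowed (it makes review easier for learners).
+- Use meaningful variable names and add comments for context where needed.
+- Follow [PEP8](https://peps.python.org/pep-0008/) as the baseline Python convention.
+## File Naming
+- Notebook file names start with `00_`, `01_`, `02_`…, keeping the execution order consistent.
+- snake_case is recommended for dataset files, e.g. `dialogues_raw.jsonl`, `sft.jsonl`.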

app.py ADDED

	@@ -0,0 +1,155 @@

+import json
+from typing import Dict, Tuple
+import pandas as pd
+import numpy as np
+import altair as alt
+import streamlit as st
+from pathlib import PurePosixPath
+st.set_page_config(page_title="Twinkle Eval Analyzer", page_icon=":star2:", layout="wide")
+st.title("✨ Twinkle Eval Analyzer (.json / .jsonl)")
+# ----------------- Helpers -----------------
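+def _decode_bytes_to_text(b: bytes) -> str:
+    """Decode bytes by trying common encodings in order; the first that succeeds wins."""
+    for enc in ("utf-8", "utf-16", "utf-16le", "utf-16be", "big5", "cp950"):
+        try:
+            return b.decode(enc)
+        except UnicodeDecodeError:
+            continue
+    # Last resort: drop undecodable bytes rather than fail.
+    return b.decode("utf-8", errors="ignore")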
+def read_twinkle_doc(file) -> Dict:
+    """Read an uploaded file and return the first parseable Twinkle Eval JSON object."""
+    raw = file.read()
+    if isinstance(raw, bytes):
+        text = _decode_bytes_to_text(raw)
+    else:
+        text = raw
+    text = text.strip()
+    obj = None  # stays None if neither the whole text nor any single line parses
+    try:
+        obj = json.loads(text)
+    except Exception:
+        # Fall back to JSONL: use the first line that parses as JSON.
+        for line in text.splitlines():
+            line = line.strip().rstrip(",")
+            if not line:
+                continue
+            try:
+                obj = json.loads(line)
+                break
+            except Exception:
+                continue
+    if not isinstance(obj, dict):
+        raise ValueError("The file is not a valid Twinkle Eval JSON object.")
+    if "timestamp" not in obj or "config" not in obj or "dataset_results" not in obj:
+        raise ValueError("Missing required fields")
+    return obj
+def extract_records(doc: Dict) -> Tuple[pd.DataFrame, Dict[str, float]]:
+    """Flatten one document into per-file rows plus a dataset -> average accuracy map."""
+    model = doc.get("config", {}).get("model", {}).get("name", "<unknown>")
+    timestamp = doc.get("timestamp", "<no-ts>")
+    source_label = f"{model} @ {timestamp}"
+    rows = []
+    avg_map = {}
+    for ds_path, ds_payload in doc.get("dataset_results", {}).items():
+        ds_name = ds_path.split("datasets/")[-1].strip("/") if ds_path.startswith("datasets/") else ds_path
+        avg_meta = ds_payload.get("average_accuracy") if isinstance(ds_payload, dict) else None
+        results = ds_payload.get("results", []) if isinstance(ds_payload, dict) else []
+        for item in results:
+            if not isinstance(item, dict):
+                continue
+            file_path = item.get("file")
+            acc_mean = item.get("accuracy_mean")
+            if file_path is None or acc_mean is None:
+                continue
+            fname = PurePosixPath(file_path).name
+            category = fname.rsplit(".", 1)[0]
+            rows.append({
+                "dataset": ds_name,
+                "category": category,
+                "file": fname,
+                "accuracy_mean": float(acc_mean),
+                "source_label": source_label
+            })
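+        if avg_meta is None and results:
+            # Fall back to the mean of the per-file accuracies, skipping malformed entries.
+            vals = [float(it["accuracy_mean"]) for it in results
+                    if isinstance(it, dict) and it.get("accuracy_mean") is not None]
+            if vals:
+                avg_meta = float(np.mean(vals))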
+        if avg_meta is not None:
+            avg_map[ds_name] = avg_meta
+    return pd.DataFrame(rows), avg_map
+def load_all(files) -> Tuple[pd.DataFrame, Dict[str, Dict[str, float]]]:
+    """Parse every uploaded file; return the combined rows and per-source dataset averages."""
+    frames = []
+    meta = {}
+    for f in files or []:
+        try:
+            doc = read_twinkle_doc(f)
+        except Exception as e:
+            st.error(f"❌ Could not read {getattr(f, 'name', 'file')}: {e}")
+            continue
+        df, avg_map = extract_records(doc)
+        if not df.empty:
+            frames.append(df)
+            src = df["source_label"].iloc[0]
+            meta[src] = avg_map
+    if not frames:
+        return pd.DataFrame(columns=["dataset", "category", "file", "accuracy_mean", "source_label"]), {}
+    return pd.concat(frames, ignore_index=True), meta
+# ----------------- Sidebar -----------------
+with st.sidebar:
+    files = st.file_uploader("Select Twinkle Eval files", type=["json", "jsonl"], accept_multiple_files=True)
+    df_all, meta_all = load_all(files)
+    normalize_0_100 = st.checkbox("Show on a 0–100 scale", value=False)
+    page_size = st.selectbox("Categories per chart", [10, 20, 30, 50, 100], index=1)
+    sort_mode = st.selectbox("Sort order", ["Overall mean, high to low", "Overall mean, low to high", "Alphabetical"])
+if df_all.empty:
+    st.info("Please upload Twinkle Eval files")
+    st.stop()
+all_datasets = sorted(df_all["dataset"].unique().tolist())
+selected_dataset = st.selectbox("Select dataset", options=all_datasets)
+work = df_all[df_all["dataset"] == selected_dataset].copy()
+metric_plot = "accuracy_mean" + (" (x100)" if normalize_0_100 else "")
+work[metric_plot] = work["accuracy_mean"] * (100.0 if normalize_0_100 else 1.0)
+order_df = work.groupby("category")[metric_plot].mean().reset_index()
+if sort_mode == "Overall mean, high to low":
+    order_df = order_df.sort_values(metric_plot, ascending=False)
+elif sort_mode == "Overall mean, low to high":
+    order_df = order_df.sort_values(metric_plot, ascending=True)
+else:
+    order_df = order_df.sort_values("category", ascending=True)
+cat_order = order_df["category"].tolist()
+work["category"] = pd.Categorical(work["category"], categories=cat_order, ordered=True)
+n = len(cat_order)
+pages = int(np.ceil(n / page_size))
+for p in range(pages):
+    start, end = p * page_size, min((p + 1) * page_size, n)
+    subset_cats = cat_order[start:end]
+    sub = work[work["category"].isin(subset_cats)]
+    st.subheader(f"📊 {selected_dataset}｜categories {start+1}-{end} / {n}")
+    base = alt.Chart(sub).encode(
+        x=alt.X("category:N", sort=subset_cats),
+        y=alt.Y(f"{metric_plot}:Q"),
+        color=alt.Color("source_label:N"),
+        tooltip=["source_label", "file", alt.Tooltip(metric_plot, format=".3f")]
+    )
+    bars = base.mark_bar().encode(xOffset="source_label")
+    st.altair_chart(bars.properties(height=420), use_container_width=True)
+    pivot = sub.pivot_table(index="category", columns="source_label", values=metric_plot)
+    st.dataframe(pivot, use_container_width=True)
+    st.download_button(
+        label=f"Download this page as CSV ({start+1}-{end})",
+        data=pivot.reset_index().to_csv(index=False).encode("utf-8"),
+        file_name=f"twinkle_{selected_dataset}_{start+1}_{end}.csv",
+        mime="text/csv"
+    )
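
The sorting and paging logic at the end of `app.py` can be exercised without Streamlit or pandas. The following is a minimal standard-library sketch under stated assumptions: `order_categories` and `paginate` are hypothetical helper names, the scores are illustrative data grouped per category, and the sort-mode values `"desc"`, `"asc"`, and `"alpha"` are stand-ins for the app's select-box options.

```python
from math import ceil
from statistics import mean


def order_categories(scores_by_category, sort_mode="desc"):
    """Hypothetical helper: order category names by mean score or alphabetically."""
    if sort_mode == "alpha":
        return sorted(scores_by_category)
    means = {cat: mean(vals) for cat, vals in scores_by_category.items()}
    return sorted(means, key=means.get, reverse=(sort_mode == "desc"))


def paginate(items, page_size):
    """Hypothetical helper: split items into consecutive pages of at most page_size."""
    pages = ceil(len(items) / page_size)
    return [items[p * page_size:(p + 1) * page_size] for p in range(pages)]


# Illustrative data: each category maps to one score per uploaded model.
scores = {
    "algebra": [0.9, 0.7],
    "biology": [0.6, 0.5],
    "chemistry": [0.95, 0.85],
}

ordered = order_categories(scores, "desc")
pages = paginate(ordered, page_size=2)
print(pages)
```

Ordering the categories once and slicing the ordered list keeps every page consistent with the chosen sort mode, which mirrors how the app derives `subset_cats` from `cat_order`.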

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+pandas
+altair
+streamlit