Spaces: lllouo/BD_framework_test

lllouo committed on Feb 2

Commit 28e23fd, 1 parent: ba32277

English Version

Files changed (3)

README.md +48 -48
app.py +195 -229
leaderboard.json +42 -42

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: BD Framework Test
 emoji: 🔥
 colorFrom: blue
 colorTo: gray
@@ -13,80 +13,80 @@ short_description: Benchmark-Denoising (BD) framework
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-# 数据集清洗框架展示系统
-基于LLM的智能数据集质量提升框架 - 研究生毕业论文成果展示
-## 部署到 Hugging Face Spaces
-### 步骤1: 创建Space
-1. 访问 https://huggingface.co/spaces
-2. 点击 "Create new Space"
-3. 选择 **Gradio** SDK (或Docker)
-4. Space名称: `dataset-cleaning-demo`
-### 步骤2: 上传文件
-将以下文件上传到Space:
-- `app.py` - 主应用程序
-- `requirements.txt` - Python依赖
-- `README.md` - 本文件
-### 步骤3: 配置环境变量
-在Space设置中添加:
-- `DEEPSEEK_API_KEY`: 你的DeepSeek API密钥
-### 步骤4: 等待构建
-HF Spaces会自动构建并部署你的应用。
-## 本地运行
 ```bash
-# 安装依赖
 pip install -r requirements.txt
-# 设置环境变量
 export DEEPSEEK_API_KEY="your-api-key"
-# 运行应用
 python app.py
 ```
-访问 http://localhost:7860
-## 功能特性
-✅ 数据集上传 (JSON/JSONL格式)
-✅ 基于DeepSeek API的智能清洗
-✅ 19个主流benchmark的清洗效果展示
-✅ 交互式Leaderboard
-✅ 清洗结果下载
-## 技术栈
-- **前端**: React + Tailwind CSS
-- **后端**: FastAPI
 - **LLM**: DeepSeek API
-- **部署**: Hugging Face Spaces
-## 清洗流程
-1. **错误检测**: 识别数据质量问题
-2. **质量评估**: 对样本进行评分
-3. **智能修正**: LLM生成高质量版本
-4. **一致性验证**: 确保逻辑一致性
-## 注意事项
-- Demo版本限制每次处理10个样本
-- 需要有效的DeepSeek API密钥
-- Leaderboard数据为预置结果
-## 后续完善计划
-- [ ] 连接学校服务器LLaMA3模型
-- [ ] 支持大规模数据集处理
-- [ ] 添加更多评估指标
-- [ ] 实时处理进度反馈

 ---
+title: BD Framework
 emoji: 🔥
 colorFrom: blue
 colorTo: gray
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Dataset Denoising Framework Demo System
+LLM-based Intelligent Dataset Quality Enhancement Framework - Graduate Thesis Research Showcase
+## Deploy to Hugging Face Spaces
+### Step 1: Create Space
+1. Visit https://huggingface.co/spaces
+2. Click "Create new Space"
+3. Select **Gradio** SDK (or Docker)
+4. Space name: `dataset-cleaning-demo`
+### Step 2: Upload Files
+Upload the following files to the Space:
+- `app.py` - Main application
+- `requirements.txt` - Python dependencies
+- `README.md` - This file
+### Step 3: Configure Environment Variables
+Add in Space settings:
+- `DEEPSEEK_API_KEY`: Your DeepSeek API key
+### Step 4: Wait for Build
+HF Spaces will automatically build and deploy your application.
+## Local Development
 ```bash
+# Install dependencies
 pip install -r requirements.txt
+# Set environment variable
 export DEEPSEEK_API_KEY="your-api-key"
+# Run application
 python app.py
 ```
+Visit http://localhost:7860
+## Features
+✅ Dataset upload (JSON/JSONL format)
+✅ Intelligent denoising via DeepSeek API
+✅ Showcase denoising effects on 19 mainstream benchmarks
+✅ Interactive Leaderboard
+✅ Download denoised results
+## Tech Stack
+- **Frontend**: React + Tailwind CSS
+- **Backend**: FastAPI
 - **LLM**: DeepSeek API
+- **Deployment**: Hugging Face Spaces
+## Denoising Workflow
+1. **Error Detection**: Identify data quality issues
+2. **Quality Assessment**: Score samples
+3. **Intelligent Correction**: LLM generates high-quality versions
+4. **Consistency Validation**: Ensure logical consistency
+## Notes
+- Demo version limits processing to 10 samples per batch
+- Requires valid DeepSeek API key
+- Leaderboard data is pre-configured results
+## Future Enhancements
+- [ ] Connect to university server LLaMA3 model
+- [ ] Support large-scale dataset processing
+- [ ] Add more evaluation metrics
+- [ ] Real-time processing progress feedback
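
Step 3 and the Local Development commands in this README both depend on `DEEPSEEK_API_KEY` being set. The following is a minimal preflight sketch that mirrors the condition used by `check_api_key` in app.py. The helper name `missing_api_key` and the injectable `env` mapping are illustrative assumptions, not part of this commit.

```python
import os

# Model identifier as it appears in app.py.
DEEPSEEK_MODEL = "deepseek-r1-distill-llama-8b"


def missing_api_key(model_choice, env=None):
    """Return True when the chosen model needs DEEPSEEK_API_KEY but it is unset or empty.

    Hypothetical helper: `env` defaults to os.environ and is injectable for testing.
    """
    env = os.environ if env is None else env
    return model_choice == DEEPSEEK_MODEL and not env.get("DEEPSEEK_API_KEY", "")


if __name__ == "__main__":
    # Report the key status for both model choices offered in the demo dropdown.
    for model in ("WAC-GEC", DEEPSEEK_MODEL):
        status = "missing key" if missing_api_key(model) else "ok"
        print(f"{model}: {status}")
```

Only the DeepSeek option is gated here, since app.py describes WAC-GEC as running locally without an API key.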

app.py CHANGED

@@ -14,27 +14,27 @@ import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import hashlib
-# ======================== 新增：WAC-GEC导入 ========================
 try:
     from whitespace_correction import WhitespaceCorrector
     WAC_GEC_AVAILABLE = True
-    # 初始化WAC-GEC模型（延迟加载）
     wac_corrector = None
 except ImportError:
     WAC_GEC_AVAILABLE = False
     wac_corrector = None
-    print("⚠️ whitespace_correction未安装，WAC-GEC功能将不可用")
-# 初始化GEC模型（延迟加载）
 gec_tokenizer = None
 gec_model = None
-GEC_MODEL_NAME = "lllouo/gec_Chat-LLaMa-2-7B-FT"  # 你的HF模型地址
-# ======================== API配置 ========================
 DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY", "")
 DEEPSEEK_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"
-# ======================== NLP工具初始化 ========================
 try:
     nlp = spacy.load("en_core_web_sm")
 except OSError:
@@ -51,7 +51,7 @@ WHITESPACE_PATTERNS = [
     re.compile(r'([.,!?;:])\s{2,}'),
 ]
-# ======================== Prompt模板 ========================
 PROMPT_TEMPLATE = """## Positioning
 You are a **LANGUAGE grammatical error correction tool** that can identify and correct grammatical errors in a text.
 Reply with a corrected version of the input sentence with all **grammatical**, **spelling** and **whitespace errors** fixed, making only necessary changes.
@@ -79,14 +79,14 @@ Next, please correct the following sentence according to the above requirements.
 [input]: """
-# ======================== 新增：初始化函数（WAC + GEC） ========================
 def initialize_wac_gec():
-    """延迟初始化WAC-GEC模型（空白符纠正 + 语法纠错）"""
     global wac_corrector, gec_tokenizer, gec_model
-    # 1. 初始化WAC（空白符纠正）
     if not WAC_GEC_AVAILABLE:
-        print("❌ WAC模块未安装")
         return False
     if wac_corrector is None:
@@ -97,17 +97,17 @@ def initialize_wac_gec():
                 device=device,
                 download_dir="./models"
             )
-            print(f"✅ WAC空白符纠正模型已加载 (设备: {device})")
         except Exception as e:
-            print(f"❌ WAC模型加载失败: {e}")
             return False
-    # 2. 初始化GEC（语法纠错）
     if gec_model is None or gec_tokenizer is None:
         try:
             device = "cuda" if torch.cuda.is_available() else "cpu"
-            print(f"📥 正在从HuggingFace下载GEC模型: {GEC_MODEL_NAME}")
             gec_tokenizer = AutoTokenizer.from_pretrained(
                 GEC_MODEL_NAME,
                 trust_remote_code=True
@@ -119,51 +119,45 @@ def initialize_wac_gec():
                 trust_remote_code=True
             )
-            # 如果是CPU模式，手动移动模型
             if device == "cpu":
                 gec_model = gec_model.to(device)
-            # 设置tokenizer的pad_token和padding_side
             gec_tokenizer.pad_token_id = gec_tokenizer.eos_token_id
             gec_tokenizer.padding_side = "left"
-            print(f"✅ GEC语法纠错模型已加载 (设备: {device})")
         except Exception as e:
-            print(f"❌ GEC模型加载失败: {e}")
             return False
     return True
-# ======================== 新增：GEC语法纠错函数 ========================
 def correct_sentence_gec(input_sentence):
     """
-    使用GEC模型进行语法纠错
-    参数:
-        input_sentence (str): 需要纠正的句子
-    返回:
-        str: 纠正后的句子
     """
     if gec_model is None or gec_tokenizer is None:
-        raise ValueError("GEC模型未初始化")
-    # 构建提示词
     prompt = f"""Rewrite the following sentence to correct grammatical errors. Return ONLY the corrected sentence.
 Original: {input_sentence}
 Corrected:"""
-    # 生成修正
     inputs = gec_tokenizer(prompt, return_tensors="pt").to(gec_model.device)
-    # 检测设备类型以优化参数
     is_cpu = str(gec_model.device) == "cpu" or not torch.cuda.is_available()
-    # CPU优化参数：减少beam search和token长度
     if is_cpu:
-        max_tokens = 256  # CPU模式减半
-        beams = 2         # 减少beam数量加速
     else:
-        max_tokens = 512  # GPU模式保持
         beams = 4
     with torch.no_grad():
@@ -176,55 +170,50 @@ Corrected:"""
             top_p=None
         )
-    # 提取并清理输出
     full_output = gec_tokenizer.decode(outputs[0], skip_special_tokens=True)
     corrected_text = full_output.replace(prompt, "").strip()
-    # 进一步清理可能的前缀
     if corrected_text.startswith("Corrected:"):
         corrected_text = corrected_text[len("Corrected:"):].strip()
     return corrected_text
-# ======================== 新增：WAC-GEC组合处理函数 ========================
 def call_wac_gec(text):
     """
-    使用WAC-GEC两步纠正：
-    1. GEC模型进行语法和拼写纠正
-    2. WAC模型进行空白符纠正
     """
     if not initialize_wac_gec():
-        raise ValueError("⚠️ WAC-GEC模型未安装或加载失败")
     try:
-        # Step 1: 使用GEC模型进行语法纠错
-        print(f"🔍 GEC处理: {text[:50]}...")
         gec_corrected = correct_sentence_gec(text)
-        print(f"✅ GEC结果: {gec_corrected[:50]}...")
-        # Step 2: 使用WAC模型进行空白符纠正
-        print(f"🔍 WAC处理: {gec_corrected[:50]}...")
         final_corrected = wac_corrector.correct_text(gec_corrected)
-        print(f"✅ WAC结果: {final_corrected[:50]}...")
-        # 格式化输出以匹配DeepSeek的格式
         return f"[output]: {final_corrected}"
     except Exception as e:
-        raise Exception(f"WAC-GEC处理错误: {str(e)}")
-# ======================== 新增：颜色对比函数 ========================
 def generate_colored_diff(original, cleaned):
     """
-    生成带颜色标注的HTML差异对比
-    原始文本中的错误：红色
-    去噪后的修正：绿色
     """
-    # 分词处理
     original_words = original.split()
     cleaned_words = cleaned.split()
-    # 使用difflib进行序列匹配
     matcher = difflib.SequenceMatcher(None, original_words, cleaned_words)
     original_html = []
@@ -232,21 +221,17 @@ def generate_colored_diff(original, cleaned):
     for tag, i1, i2, j1, j2 in matcher.get_opcodes():
         if tag == 'equal':
-            # 相同部分保持黑色
             original_html.extend(original_words[i1:i2])
             cleaned_html.extend(cleaned_words[j1:j2])
         elif tag == 'replace':
-            # 替换部分：原文红色，新文绿色
             original_html.extend([f'<span style="color: #dc3545; font-weight: bold;">{w}</span>'
                                  for w in original_words[i1:i2]])
             cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
                                 for w in cleaned_words[j1:j2]])
         elif tag == 'delete':
-            # 删除部分：原文红色带删除线
             original_html.extend([f'<span style="color: #dc3545; text-decoration: line-through;">{w}</span>'
                                  for w in original_words[i1:i2]])
         elif tag == 'insert':
-            # 插入部分：新文绿色
             cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
                                 for w in cleaned_words[j1:j2]])
@@ -254,7 +239,7 @@ def generate_colored_diff(original, cleaned):
 def create_comparison_html(original_list, cleaned_list):
     """
-    创建HTML表格展示对比 - 样式匹配Leaderboard表格
     """
     html = """
     <div style="font-family: 'Times New Roman', serif; max-width: 100%; overflow-x: auto;">
@@ -290,8 +275,8 @@ def create_comparison_html(original_list, cleaned_list):
             <thead>
                 <tr>
                     <th class="index-col">#</th>
-                    <th>原始问题</th>
-                    <th>去噪后问题</th>
                 </tr>
             </thead>
             <tbody>
@@ -315,11 +300,11 @@ def create_comparison_html(original_list, cleaned_list):
     return html
-# ======================== 工具函数 ========================
 def check_api_key(model_choice):
-    """检查API密钥（仅DeepSeek需要）"""
     if model_choice == "deepseek-r1-distill-llama-8b" and not DEEPSEEK_API_KEY:
-        raise ValueError("⚠️ 请在 Space Settings 中配置 DEEPSEEK_API_KEY！")
 def call_deepseek_api(prompt, model="deepseek-r1-distill-llama-8b", temperature=0.1, stream=True):
     check_api_key(model)
@@ -418,18 +403,17 @@ def calculate_spelling_error_density(sentences):
         return 0.0
     return total_errors / total_words * 100
-# ======================== Leaderboard数据处理 ========================
 def load_leaderboard_data():
     json_path = "leaderboard.json"
     try:
         with open(json_path, 'r', encoding='utf-8') as f:
             data = json.load(f)
-        # Replace ID with hash based on Benchmark
         for item in data:
             benchmark = item['Benchmark']
             hash_object = hashlib.md5(benchmark.encode())
-            item['ID'] = hash_object.hexdigest()[:8]  # Use first 8 hex digits for brevity
         return pd.DataFrame(data)
     except Exception as e:
@@ -438,15 +422,13 @@ def load_leaderboard_data():
 def filter_leaderboard(df, category_query, version_query):
     """
-    同时按类别和版本筛选
     """
     result = df.copy()
-    # 按类别筛选
     if category_query != "all":
         result = result[result['Category'] == category_query]
-    # 按版本筛选
     if version_query != "all":
         if version_query == "original":
             result = result[result['Benchmark'].str.contains('_original', case=False, na=False)]
@@ -462,38 +444,35 @@ def search_leaderboard(df, query):
         return df
     return df[df['Benchmark'].str.contains(query, case=False, na=False)]
-# ======================== 数据去噪函数（修改版：支持双模型）========================
 def clean_dataset(file_path, question_column, model_choice, temperature, max_samples, progress=gr.Progress()):
     try:
-        # 检查API密钥（仅DeepSeek需要）
         try:
             check_api_key(model_choice)
         except ValueError as e:
             if model_choice == "deepseek-r1-distill-llama-8b":
                 return str(e), None, ""
-        # 检查WAC-GEC可用性
         if model_choice == "WAC-GEC" and not WAC_GEC_AVAILABLE:
-            return "❌ WAC-GEC模型未安装！请安装 whitespace_correction 包。", None, ""
-        progress(0.05, desc="📁 读取数据文件...")
         df = pd.read_parquet(file_path)
         if question_column not in df.columns:
             available_columns = ", ".join(df.columns.tolist())
-            return f"❌ 列名 '{question_column}' 不存在！\n可用列名: {available_columns}", None, ""
         data_ori = df[question_column].tolist()[:int(max_samples)]
         total = len(data_ori)
-        progress(0.08, desc="📊 计算原始指标...")
         original_sentences = [str(item) for item in data_ori]
         war_original = calculate_whitespace_anomaly_rate(original_sentences)
         sed_original = calculate_spelling_error_density(original_sentences)
-        progress(0.1, desc=f"🚀 开始去噪 {total} 个样本 (模型: {model_choice})...")
-        # WAC-GEC不需要添加___标记
         if model_choice == "WAC-GEC":
             data_corrupt = [str(item) for item in data_ori]
         else:
@@ -501,11 +480,11 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
         results = []
         max_retries = 5 if model_choice == "deepseek-r1-distill-llama-8b" else 3
-        log_text = f"🚀 开始处理 {total} 个样本...\n"
-        log_text += f"📌 使用模型: {model_choice}\n\n"
         for idx in range(total):
-            progress((0.1 + 0.7 * idx / total), desc=f"处理中: {idx+1}/{total}")
             unprocess_text = str(data_ori[idx])
             original_text = data_corrupt[idx]
@@ -514,7 +493,6 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
             while retry_count < max_retries:
                 try:
-                    # 根据模型选择调用不同的API
                     if model_choice == "WAC-GEC":
                         response_content = call_wac_gec(original_text)
                     else:
@@ -524,7 +502,6 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
                             temperature=float(temperature)
                         )
-                    # WAC-GEC的输出格式简单，无需复杂验证
                     if model_choice == "WAC-GEC":
                         if response_content.startswith('[output]:'):
                             results.append(response_content)
@@ -540,12 +517,12 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
                 except Exception as e:
                     retry_count += 1
-                    log_text += f"⚠️ 样本 {idx+1} 处理错误，重试 {retry_count}/{max_retries}: {str(e)}\n"
             else:
                 results.append(f"[ERROR] Failed to process: {original_text}")
-                log_text += f"❌ 样本 {idx+1} 处理失败\n"
-        progress(0.85, desc="📊 后处理中...")
         lst_extracted = []
         error_count = 0
@@ -571,7 +548,7 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
             else:
                 lst_final.append(lst_extracted[i])
-        progress(0.90, desc="📊 计算去噪后指标...")
         cleaned_sentences = [str(item) for item in lst_final]
         war_cleaned = calculate_whitespace_anomaly_rate(cleaned_sentences)
         sed_cleaned = calculate_spelling_error_density(cleaned_sentences)
@@ -579,7 +556,7 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
         delta_war = war_cleaned - war_original
         delta_sed = sed_cleaned - sed_original
-        progress(0.95, desc="💾 保存结果...")
         df_cleaned = df.copy()
         df_cleaned[question_column + '_cleaned'] = lst_final[:len(df)]
@@ -592,144 +569,143 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
         df_cleaned.to_parquet(output_path, index=False)
-        log_text += f"\n\n📊 处理完成！\n"
         log_text += f"{'='*50}\n"
-        log_text += f"【基础统计】\n"
-        log_text += f"- 使用模型: {model_choice}\n"
-        log_text += f"- 总样本数: {total}\n"
-        log_text += f"- 成功处理: {total - error_count - unknown_count}\n"
-        log_text += f"- 失败样本: {error_count}\n"
-        log_text += f"- 未知格式: {unknown_count}\n"
-        log_text += f"- 输出文件: {output_filename}\n\n"
-        log_text += f"【质量指标】\n"
-        log_text += f"📍 空白符异常率（WAR）:\n"
-        log_text += f"   原始: {war_original:.2f}% → 去噪后: {war_cleaned:.2f}%\n"
-        log_text += f"   变化: {delta_war:+.2f}% {'✅ 改善' if delta_war < 0 else '⚠️ 增加'}\n\n"
-        log_text += f"📍 拼写错误密度（SED）:\n"
-        log_text += f"   原始: {sed_original:.2f}% → 去噪后: {sed_cleaned:.2f}%\n"
-        log_text += f"   变化: {delta_sed:+.2f}% {'✅ 改善' if delta_sed < 0 else '⚠️ 增加'}\n"
         if model_choice == "WAC-GEC":
-            log_text += f"\n💡 注意: WAC-GEC使用两步纠正（GEC语法纠错 + WAC空白符纠正）\n"
         log_text += f"{'='*50}\n"
-        # 生成带颜色的对比HTML
         preview_html = create_comparison_html(data_ori[:5], lst_final[:5])
-        progress(1.0, desc="✅ 完成！")
         return log_text, output_path, preview_html
     except Exception as e:
         import traceback
         error_detail = traceback.format_exc()
-        return f"❌ 处理出错: {str(e)}\n\n详细错误:\n{error_detail}", None, ""
-# ======================== 文本内容 ========================
 ABOUT_TEXT = """
-## 去噪流程说明
-### 支持的模型
 #### 1. DeepSeek-R1 (deepseek-r1-distill-llama-8b)
-- **功能**: 全面的语法、拼写、空格错误修正
-- **优势**: 综合性强，能处理多种类型的错误
-- **配置**: 需要在Space Settings中配置DEEPSEEK_API_KEY
 #### 2. WAC-GEC (Whitespace + Grammar Error Correction)
-- **功能**: 两步纠正流程
-  - **Step 1 (GEC)**: 使用LLaMA-2-7B微调模型进行语法和拼写纠错
-  - **Step 2 (WAC)**: 使用空白符纠正模型修正空格问题
-- **优势**:
-  - 完全本地化，无需API密钥
-  - 组合两个专门模型，各司其职
-  - 适合离线环境和预算有限的场景
-- **模型来源**:
   - GEC: [lllouo/gec_Chat-LLaMa-2-7B-FT](https://huggingface.co/lllouo/gec_Chat-LLaMa-2-7B-FT)
-  - WAC: whitespace_correction库
-### 核心算法
-1. **预处理 (process_sentence)**
-   - 检测句子完整性
-   - 为不完整的句子添加标记 `___` (仅DeepSeek)
-   - 保留多行文本格式
-2. **模型去噪**
-   - **DeepSeek**: 使用API进行全面错误修正，重试机制最多5次
    - **WAC-GEC**:
-     - 先使用GEC模型进行语法和拼写纠正
-     - 再使用WAC模型进行空白符纠正
-     - 重试机制最多3次
-3. **格式验证**
-   - 验证输出格式正确性
-   - 检查标记保留情况
-   - 长度合理性检查
-4. **后处理**
-   - 提取去噪后的内容
-   - 恢复原始多行格式
-   - 生成带模型标识的Parquet文件
-### 支持的数据集
-- **MMLU**: 57个学科的多选题
-- **GSM8K**: 数学推理题
-- **ARC-Challenge**: 科学问答
-- **MedMCQA**: 医学选择题
-- **CoQA**: 对话问答
-- 以及更多...
-### 颜色标注说明
-- 🔴 **红色**: 原始文本中的错误（拼写、语法、空格等）
-- 🟢 **绿色**: 去噪后的修正内容
-- ⚫ **黑色**: 未修改的正确部分
-### 技术栈
 - **LLM**: DeepSeek API (deepseek-r1-distill-llama-8b)
-- **本地模型**:
-  - GEC: LLaMA-2-7B (微调于语法纠错任务)
   - WAC: Whitespace Correction Model
-- **前端**: Gradio 4.16.0
-- **数据处理**: Pandas + PyArrow (Parquet)
-- **差异对比**: Python difflib
-- **NLP工具**: spaCy, pyspellchecker
-- **API调用**: OpenAI SDK
-- **部署**: Hugging Face Spaces
-### 质量指标
-- **WAR (Whitespace Anomaly Rate)**: 空白符异常率
-- **SED (Spelling Error Density)**: 拼写错误密度
-### 模型选择建议
-- **需要全面去噪 + 有API预算**: 选择 DeepSeek-R1
-- **本地化部署 + 完整纠错**: 选择 WAC-GEC（推荐）
-- **仅需修正空格**: 单独使用WAC模块
-- **追求最快速度**: 使用GPU加速的WAC-GEC
 ---
-**研究生毕业论文成果展示** | Powered by DeepSeek API & WAC-GEC
 """
-# ======================== Gradio界面 ========================
-demo = gr.Blocks(title="数据集去噪框架展示系统", css="""
     .markdown-text { font-size: 16px; line-height: 1.6; }
 """)
 with demo:
     gr.Markdown(
-        """<div style="text-align: center;"><h1>⭐ 基于基准去噪框架的 <span style='color: #e6b800;'>去噪工厂</span> 展示系统</h1></div>
         <br>
-        <p>本系统展示了基于<a href="https://github.com/LLLoUo/bd-toolkit" target="_blank">BD-toolkit</a>的DeepSeek-R1和WAC-GEC两种方法对主流benchmark数据集的去噪效果。通过WAR(空白符异常率)和SED(拼写错误密度)两个指标评估去噪质量。</p>
         """,
         elem_classes="markdown-text"
     )
@@ -739,27 +715,27 @@ with demo:
     with gr.Tabs(elem_classes="tab-buttons") as tabs:
         with gr.TabItem("📊 BD-benchmarks Leaderboard", id=0):
             with gr.Column():
-                gr.Markdown("### BD去噪后主流基准排行榜")
                 with gr.Row():
                     search_bar = gr.Textbox(
-                        placeholder="🔍 搜索Benchmark名称并按ENTER...",
                         show_label=False,
                         elem_id="search-bar",
                     )
                     filter_categories = gr.Radio(
-                        label="📂 筛选Benchmark类别",
                         choices=["all", "BT", "RA", "TG", "SU", "ME", "GR"],
                         value="all",
                         elem_id="filter-columns",
                     )
                     filter_versions = gr.Radio(
-                        label="🔖 筛选数据集版本",
                         choices=[
-                            ("全部版本", "all"),
-                            ("原始数据集", "original"),
-                            ("DeepSeek-R1去噪", "deepseek"),
-                            ("WAC-GEC去噪", "wac_gec")
                         ],
                         value="all",
                         elem_id="filter-versions",
@@ -767,7 +743,7 @@ with demo:
                 leaderboard_table = gr.Dataframe(
                     value=leaderboard_data[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
-                    headers=['ID', 'Category', 'Benchmark', 'WAR (%)', 'SED', '下载'],
                     datatype=['number', 'str', 'str', 'number', 'number', 'markdown'],
                     elem_id="leaderboard-table",
                     interactive=False,
@@ -778,14 +754,12 @@ with demo:
                     visible=False
                 )
-                # 搜索功能
                 search_bar.submit(
                     lambda df, query: search_leaderboard(df, query)[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
                     [hidden_leaderboard, search_bar],
                     leaderboard_table
                 )
-                # 类别筛选功能（需要考虑版本筛选）
                 def combined_filter(df, category, version):
                     filtered = filter_leaderboard(df, category, version)
                     return filtered[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']]
@@ -796,7 +770,6 @@ with demo:
                     leaderboard_table
                 )
-                # 版本筛选功能（需要考虑类别筛选）
                 filter_versions.change(
                     combined_filter,
                     [hidden_leaderboard, filter_categories, filter_versions],
@@ -804,46 +777,40 @@ with demo:
                 )
                 gr.Markdown("""
-                **说明:**
-                - **Category**: BT=基础任务, RA=推理能力, TG=文本生成, SU=语音理解, ME=医学领域, GR=语法领域
-                - **Version**: 原始=未处理数据集, DeepSeek-R1=DeepSeek去噪版本, WAC-GEC=WAC-GEC去噪版本
-                - **WAR**: 空白符异常率（越低越好）
-                - **SED**: 拼写错误密度（越低越好）
                 """, elem_classes="markdown-text")
-        with gr.TabItem("📈 Performance Plot", id=1):
-            gr.Markdown("### 性能可视化分析")
-            gr.Markdown("**注意**: 性能图表功能开发中,敬请期待。")
-        with gr.TabItem("📝 About", id=2):
-            gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")
-        with gr.TabItem("🚀 BD-toolkit Demo", id=3):
-            gr.Markdown("## BD-toolkit轻量化Demo展示")
-            # 模型可用性提示
-            model_status = "✅ WAC-GEC: " + ("可用" if WAC_GEC_AVAILABLE else "未安装")
-            model_status += " | ✅ DeepSeek-R1: " + ("已配置" if DEEPSEEK_API_KEY else "未配置API密钥")
-            gr.Markdown(f"**模型状态**: {model_status}")
             with gr.Row():
                 with gr.Column():
                     file_input = gr.File(
-                        label="📁 上传 Parquet 文件",
                         file_types=[".parquet"]
                     )
                     question_column = gr.Textbox(
-                        label="📝 问题列名",
                         value="question",
-                        placeholder="例如: question, input_text, prompt"
                     )
                     model_choice = gr.Dropdown(
                         choices=["WAC-GEC", "deepseek-r1-distill-llama-8b"],
                         value="WAC-GEC",
-                        label="🤖 选择模型",
-                        info="DeepSeek: 全面纠错 | WAC-GEC: 语法+空白符纠正(本地模型)"
                     )
                     temperature = gr.Slider(
@@ -852,8 +819,8 @@ with demo:
                         value=0.1,
                         step=0.1,
                         label="🌡️ Temperature",
-                        info="仅对DeepSeek生效",
-                        interactive=False  # 默认不可交互（因为默认选择WAC-GEC）
                     )
                     max_samples = gr.Slider(
@@ -861,26 +828,25 @@ with demo:
                         maximum=100,
                         value=5,
                         step=1,
-                        label="📊 处理样本数 (Demo限制)"
                     )
-                    clean_btn = gr.Button("🚀 开始去噪", variant="primary", size="lg")
                 with gr.Column():
                     output_text = gr.Textbox(
-                        label="⏳ 处理进度",
                         lines=10,
                         max_lines=15
                     )
-                    download_file = gr.File(label="📥 下载去噪后的数据集")
-            # 添加交互逻辑：根据模型选择动态启用/禁用temperature滑块
             def update_temperature_interactive(model):
                 if model == "deepseek-r1-distill-llama-8b":
-                    return gr.update(interactive=True, info="调整生成的随机性")
                 else:
-                    return gr.update(interactive=False, info="WAC-GEC模型不支持temperature参数")
             model_choice.change(
                 fn=update_temperature_interactive,
@@ -888,13 +854,12 @@ with demo:
                 outputs=[temperature]
             )
-            # 颜色对比预览区域
-            gr.Markdown("### 🎨 去噪效果对比预览")
             gr.Markdown("""
-            **颜色说明**:
-            - 🔴 <span style="color: #dc3545;">红色</span> = 原始文本中的错误
-            - 🟢 <span style="color: #28a745;">绿色</span> = 去噪后的修正
-            - ⚫ 黑色 = 未修改的正确部分
             """)
             colored_preview = gr.HTML(label="")
@@ -905,10 +870,11 @@ with demo:
                 outputs=[output_text, download_file, colored_preview]
             )
 if __name__ == "__main__":
-    # 可选：预加载模型（会增加启动时间）
-    # 如果想要预加载,取消下面两行的注释
-    print("🚀 预加载WAC-GEC模型...")
     initialize_wac_gec()
     demo.launch(
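
The word-level coloring in `generate_colored_diff` above can be sketched as a standalone function. This is a sketch under stated assumptions, not the committed code: the return statement of the original lies outside the hunks shown, so returning two space-joined strings is an assumption. The `html.escape` call is also an addition, so that dataset text containing `<` or `&` is not interpreted as markup. The names `colored_diff` and `span` are hypothetical.

```python
import difflib
import html

# Inline styles copied from the diff above.
RED = "color: #dc3545; font-weight: bold;"
GREEN = "color: #28a745; font-weight: bold;"
STRIKE = "color: #dc3545; text-decoration: line-through;"


def span(word, style):
    """Wrap one escaped word in a styled <span>."""
    return f'<span style="{style}">{html.escape(word)}</span>'


def colored_diff(original, cleaned):
    """Return (original_html, cleaned_html) with word-level highlights.

    Assumption: both sides are returned as space-joined strings.
    """
    a, b = original.split(), cleaned.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    left, right = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words stay unstyled on both sides.
            left.extend(html.escape(w) for w in a[i1:i2])
            right.extend(html.escape(w) for w in b[j1:j2])
        elif tag == "replace":
            # Replaced words: red in the original, green in the cleaned text.
            left.extend(span(w, RED) for w in a[i1:i2])
            right.extend(span(w, GREEN) for w in b[j1:j2])
        elif tag == "delete":
            # Words dropped by the cleaner: red with strikethrough.
            left.extend(span(w, STRIKE) for w in a[i1:i2])
        elif tag == "insert":
            # Words added by the cleaner: green.
            right.extend(span(w, GREEN) for w in b[j1:j2])
    return " ".join(left), " ".join(right)


if __name__ == "__main__":
    before, after = colored_diff("Teh cat sat", "The cat sat")
    print(before)
    print(after)
```

Splitting on whitespace keeps the comparison at word granularity, which matches the original. It also means pure whitespace fixes (the WAC step) do not show up as colored changes.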

 from transformers import AutoTokenizer, AutoModelForCausalLM
 import hashlib
+# ======================== WAC-GEC Import ========================
 try:
     from whitespace_correction import WhitespaceCorrector
     WAC_GEC_AVAILABLE = True
+    # Initialize WAC-GEC model (lazy loading)
     wac_corrector = None
 except ImportError:
     WAC_GEC_AVAILABLE = False
     wac_corrector = None
+    print("⚠️ whitespace_correction not installed, WAC-GEC functionality unavailable")
+# Initialize GEC model (lazy loading)
 gec_tokenizer = None
 gec_model = None
+GEC_MODEL_NAME = "lllouo/gec_Chat-LLaMa-2-7B-FT"
+# ======================== API Configuration ========================
 DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY", "")
 DEEPSEEK_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"
+# ======================== NLP Tools Initialization ========================
 try:
     nlp = spacy.load("en_core_web_sm")
 except OSError:
     re.compile(r'([.,!?;:])\s{2,}'),
 ]
+# ======================== Prompt Template ========================
 PROMPT_TEMPLATE = """## Positioning
 You are a **LANGUAGE grammatical error correction tool** that can identify and correct grammatical errors in a text.
 Reply with a corrected version of the input sentence with all **grammatical**, **spelling** and **whitespace errors** fixed, making only necessary changes.
 [input]: """
+# ======================== Initialize WAC + GEC ========================
 def initialize_wac_gec():
+    """Lazy initialization of WAC-GEC models (Whitespace + Grammar Error Correction)"""
     global wac_corrector, gec_tokenizer, gec_model
+    # 1. Initialize WAC (Whitespace Correction)
     if not WAC_GEC_AVAILABLE:
+        print("❌ WAC module not installed")
         return False
     if wac_corrector is None:
                 device=device,
                 download_dir="./models"
             )
+            print(f"✅ WAC whitespace correction model loaded (device: {device})")
         except Exception as e:
+            print(f"❌ WAC model loading failed: {e}")
             return False
+    # 2. Initialize GEC (Grammar Error Correction)
     if gec_model is None or gec_tokenizer is None:
         try:
             device = "cuda" if torch.cuda.is_available() else "cpu"
+            print(f"📥 Downloading GEC model from HuggingFace: {GEC_MODEL_NAME}")
             gec_tokenizer = AutoTokenizer.from_pretrained(
                 GEC_MODEL_NAME,
                 trust_remote_code=True
                 trust_remote_code=True
             )
             if device == "cpu":
                 gec_model = gec_model.to(device)
             gec_tokenizer.pad_token_id = gec_tokenizer.eos_token_id
             gec_tokenizer.padding_side = "left"
+            print(f"✅ GEC grammar correction model loaded (device: {device})")
         except Exception as e:
+            print(f"❌ GEC model loading failed: {e}")
             return False
     return True
+# ======================== GEC Grammar Correction Function ========================
 def correct_sentence_gec(input_sentence):
     """
+    Use GEC model for grammar correction
+    Args:
+        input_sentence (str): Sentence to be corrected
+    Returns:
+        str: Corrected sentence
     """
     if gec_model is None or gec_tokenizer is None:
+        raise ValueError("GEC model not initialized")
     prompt = f"""Rewrite the following sentence to correct grammatical errors. Return ONLY the corrected sentence.
 Original: {input_sentence}
 Corrected:"""
     inputs = gec_tokenizer(prompt, return_tensors="pt").to(gec_model.device)
     is_cpu = str(gec_model.device) == "cpu" or not torch.cuda.is_available()
     if is_cpu:
+        max_tokens = 256
+        beams = 2
     else:
+        max_tokens = 512
         beams = 4
     with torch.no_grad():
             top_p=None
         )
     full_output = gec_tokenizer.decode(outputs[0], skip_special_tokens=True)
     corrected_text = full_output.replace(prompt, "").strip()
     if corrected_text.startswith("Corrected:"):
         corrected_text = corrected_text[len("Corrected:"):].strip()
     return corrected_text
+# ======================== WAC-GEC Combined Processing ========================
 def call_wac_gec(text):
     """
+    Use WAC-GEC two-step correction:
+    1. GEC model for grammar and spelling correction
+    2. WAC model for whitespace correction
     """
     if not initialize_wac_gec():
+        raise ValueError("⚠️ WAC-GEC models not installed or failed to load")
     try:
+        # Step 1: Use GEC model for grammar correction
+        print(f"🔍 GEC processing: {text[:50]}...")
         gec_corrected = correct_sentence_gec(text)
+        print(f"✅ GEC result: {gec_corrected[:50]}...")
+        # Step 2: Use WAC model for whitespace correction
+        print(f"🔍 WAC processing: {gec_corrected[:50]}...")
         final_corrected = wac_corrector.correct_text(gec_corrected)
+        print(f"✅ WAC result: {final_corrected[:50]}...")
         return f"[output]: {final_corrected}"
     except Exception as e:
+        raise RuntimeError(f"WAC-GEC processing error: {e}") from e
+# ======================== Color Diff Functions ========================
 def generate_colored_diff(original, cleaned):
     """
+    Generate HTML diff with color annotations
+    Errors in original text: red
+    Corrections after denoising: green
     """
     original_words = original.split()
     cleaned_words = cleaned.split()
     matcher = difflib.SequenceMatcher(None, original_words, cleaned_words)
     original_html = []
     for tag, i1, i2, j1, j2 in matcher.get_opcodes():
         if tag == 'equal':
             original_html.extend(original_words[i1:i2])
             cleaned_html.extend(cleaned_words[j1:j2])
         elif tag == 'replace':
             original_html.extend([f'<span style="color: #dc3545; font-weight: bold;">{w}</span>'
                                  for w in original_words[i1:i2]])
             cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
                                 for w in cleaned_words[j1:j2]])
         elif tag == 'delete':
             original_html.extend([f'<span style="color: #dc3545; text-decoration: line-through;">{w}</span>'
                                  for w in original_words[i1:i2]])
         elif tag == 'insert':
             cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
                                 for w in cleaned_words[j1:j2]])
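
Reviewer sketch, not part of the commit: the word-level diff above can be exercised in isolation. The standalone version below follows the same `difflib.SequenceMatcher` opcode handling. The function name `colored_diff`, the `html.escape` call, and the space-joined return value are assumptions added for illustration (the code in this hunk interpolates words into the markup unescaped, and its return statement is not visible here). For brevity the sketch styles deleted words like replaced ones instead of using a line-through.

```python
import difflib
import html

RED = '<span style="color: #dc3545; font-weight: bold;">{}</span>'
GREEN = '<span style="color: #28a745; font-weight: bold;">{}</span>'


def colored_diff(original, cleaned):
    """Return (original_html, cleaned_html) with changed words wrapped in spans."""
    a, b = original.split(), cleaned.split()
    left, right = [], []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words are emitted as plain (escaped) text on both sides.
            left.extend(html.escape(w) for w in a[i1:i2])
            right.extend(html.escape(w) for w in b[j1:j2])
        else:
            # 'replace', 'delete' and 'insert': removed words go red on the
            # left, added words go green on the right (either slice may be empty).
            left.extend(RED.format(html.escape(w)) for w in a[i1:i2])
            right.extend(GREEN.format(html.escape(w)) for w in b[j1:j2])
    return " ".join(left), " ".join(right)
```

Escaping matters when benchmark questions contain characters such as `<` or `&`, which would otherwise be interpreted as markup in the preview table.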
 def create_comparison_html(original_list, cleaned_list):
     """
+    Create HTML table for comparison
     """
     html = """
     <div style="font-family: 'Times New Roman', serif; max-width: 100%; overflow-x: auto;">
             <thead>
                 <tr>
                     <th class="index-col">#</th>
+                    <th>Original Question</th>
+                    <th>Denoised Question</th>
                 </tr>
             </thead>
             <tbody>
     return html
+# ======================== Utility Functions ========================
 def check_api_key(model_choice):
+    """Check API key (only required for DeepSeek)"""
     if model_choice == "deepseek-r1-distill-llama-8b" and not DEEPSEEK_API_KEY:
+        raise ValueError("⚠️ Please configure DEEPSEEK_API_KEY in Space Settings!")
 def call_deepseek_api(prompt, model="deepseek-r1-distill-llama-8b", temperature=0.1, stream=True):
     check_api_key(model)
         return 0.0
     return total_errors / total_words * 100
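
Reviewer sketch, not part of the commit: the `total_errors / total_words * 100` ratio above reads as errors per 100 words. The version below is a minimal illustration of that ratio only. The function name, the regex tokenizer, and the plain `vocabulary` set are assumptions; the app lists pyspellchecker in its tech stack, and the actual detection logic is not shown in this hunk.

```python
import re


def spelling_error_density(sentences, vocabulary):
    """Unknown words per 100 words; `vocabulary` is a set of lowercase known words."""
    total_words = 0
    total_errors = 0
    for sentence in sentences:
        # Hypothetical tokenizer: runs of letters and apostrophes, lowercased.
        words = re.findall(r"[a-z']+", sentence.lower())
        total_words += len(words)
        total_errors += sum(1 for w in words if w not in vocabulary)
    if total_words == 0:
        return 0.0  # avoid division by zero on empty input
    return total_errors / total_words * 100
```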
+# ======================== Leaderboard Data Processing ========================
 def load_leaderboard_data():
     json_path = "leaderboard.json"
     try:
         with open(json_path, 'r', encoding='utf-8') as f:
             data = json.load(f)
         for item in data:
             benchmark = item['Benchmark']
             hash_object = hashlib.md5(benchmark.encode())
+            item['ID'] = hash_object.hexdigest()[:8]
         return pd.DataFrame(data)
     except Exception as e:
 def filter_leaderboard(df, category_query, version_query):
     """
+    Filter by both category and version
     """
     result = df.copy()
     if category_query != "all":
         result = result[result['Category'] == category_query]
     if version_query != "all":
         if version_query == "original":
             result = result[result['Benchmark'].str.contains('_original', case=False, na=False)]
         return df
     return df[df['Benchmark'].str.contains(query, case=False, na=False)]
+# ======================== Dataset Denoising Function ========================
 def clean_dataset(file_path, question_column, model_choice, temperature, max_samples, progress=gr.Progress()):
     try:
         try:
             check_api_key(model_choice)
         except ValueError as e:
             if model_choice == "deepseek-r1-distill-llama-8b":
                 return str(e), None, ""
         if model_choice == "WAC-GEC" and not WAC_GEC_AVAILABLE:
+            return "❌ WAC-GEC model not installed! Please install the whitespace_correction package.", None, ""
+        progress(0.05, desc="📁 Reading data file...")
         df = pd.read_parquet(file_path)
         if question_column not in df.columns:
             available_columns = ", ".join(df.columns.tolist())
+            return f"❌ Column '{question_column}' not found!\nAvailable columns: {available_columns}", None, ""
         data_ori = df[question_column].tolist()[:int(max_samples)]
         total = len(data_ori)
+        progress(0.08, desc="📊 Calculating original metrics...")
         original_sentences = [str(item) for item in data_ori]
         war_original = calculate_whitespace_anomaly_rate(original_sentences)
         sed_original = calculate_spelling_error_density(original_sentences)
+        progress(0.1, desc=f"🚀 Starting denoising of {total} samples (model: {model_choice})...")
         if model_choice == "WAC-GEC":
             data_corrupt = [str(item) for item in data_ori]
         else:
         results = []
         max_retries = 5 if model_choice == "deepseek-r1-distill-llama-8b" else 3
+        log_text = f"🚀 Processing {total} samples...\n"
+        log_text += f"📌 Using model: {model_choice}\n\n"
         for idx in range(total):
+            progress((0.1 + 0.7 * idx / total), desc=f"Processing: {idx+1}/{total}")
             unprocess_text = str(data_ori[idx])
             original_text = data_corrupt[idx]
             while retry_count < max_retries:
                 try:
                     if model_choice == "WAC-GEC":
                         response_content = call_wac_gec(original_text)
                     else:
                             temperature=float(temperature)
                         )
                     if model_choice == "WAC-GEC":
                         if response_content.startswith('[output]:'):
                             results.append(response_content)
                 except Exception as e:
                     retry_count += 1
+                    log_text += f"⚠️ Sample {idx+1} error, retry {retry_count}/{max_retries}: {str(e)}\n"
             else:
                 results.append(f"[ERROR] Failed to process: {original_text}")
+                log_text += f"❌ Sample {idx+1} processing failed\n"
+        progress(0.85, desc="📊 Post-processing...")
         lst_extracted = []
         error_count = 0
             else:
                 lst_final.append(lst_extracted[i])
+        progress(0.90, desc="📊 Calculating denoised metrics...")
         cleaned_sentences = [str(item) for item in lst_final]
         war_cleaned = calculate_whitespace_anomaly_rate(cleaned_sentences)
         sed_cleaned = calculate_spelling_error_density(cleaned_sentences)
         delta_war = war_cleaned - war_original
         delta_sed = sed_cleaned - sed_original
+        progress(0.95, desc="💾 Saving results...")
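         # Keep only the rows that were actually processed so the new column matches the frame length
         df_cleaned = df.iloc[:len(lst_final)].copy()
         df_cleaned[question_column + '_cleaned'] = lst_final[:len(df_cleaned)]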
         df_cleaned.to_parquet(output_path, index=False)
+        log_text += f"\n\n📊 Processing Complete!\n"
         log_text += f"{'='*50}\n"
+        log_text += f"【Basic Statistics】\n"
+        log_text += f"- Model used: {model_choice}\n"
+        log_text += f"- Total samples: {total}\n"
+        log_text += f"- Successfully processed: {total - error_count - unknown_count}\n"
+        log_text += f"- Failed samples: {error_count}\n"
+        log_text += f"- Unknown format: {unknown_count}\n"
+        log_text += f"- Output file: {output_filename}\n\n"
+        log_text += f"【Quality Metrics】\n"
+        log_text += f"📍 Whitespace Anomaly Rate (WAR):\n"
+        log_text += f"   Original: {war_original:.2f}% → Denoised: {war_cleaned:.2f}%\n"
+        log_text += f"   Change: {delta_war:+.2f}% {'✅ Improved' if delta_war < 0 else '➖ Unchanged' if delta_war == 0 else '⚠️ Increased'}\n\n"
+        log_text += f"📍 Spelling Error Density (SED):\n"
+        log_text += f"   Original: {sed_original:.2f}% → Denoised: {sed_cleaned:.2f}%\n"
+        log_text += f"   Change: {delta_sed:+.2f}% {'✅ Improved' if delta_sed < 0 else '➖ Unchanged' if delta_sed == 0 else '⚠️ Increased'}\n"
         if model_choice == "WAC-GEC":
+            log_text += f"\n💡 Note: WAC-GEC uses two-step correction (GEC grammar + WAC whitespace)\n"
         log_text += f"{'='*50}\n"
         preview_html = create_comparison_html(data_ori[:5], lst_final[:5])
+        progress(1.0, desc="✅ Complete!")
         return log_text, output_path, preview_html
     except Exception as e:
         import traceback
         error_detail = traceback.format_exc()
+        return f"❌ Processing error: {str(e)}\n\nDetailed error:\n{error_detail}", None, ""
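
Reviewer sketch, not part of the commit: the per-sample retry logic in `clean_dataset` relies on Python's `while ... else`, where the `else` branch runs only if the loop finishes without a `break`. The helper below isolates that pattern. The name `call_with_retries` and the returned log list are assumptions for illustration, and the `break` on success is inferred because those lines are not visible in the diff.

```python
def call_with_retries(fn, text, max_retries=3):
    """Call fn(text) until it succeeds or max_retries attempts have failed."""
    log = []
    retry_count = 0
    while retry_count < max_retries:
        try:
            result = fn(text)
            break  # success: leaving via break skips the else branch
        except Exception as exc:
            retry_count += 1
            log.append(f"retry {retry_count}/{max_retries}: {exc}")
    else:
        # Reached only when every attempt raised and the loop condition failed.
        result = f"[ERROR] Failed to process: {text}"
    return result, log
```

With this shape, a failed sample still yields a placeholder entry, so the results list stays aligned with the input order.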
+# ======================== Text Content ========================
 ABOUT_TEXT = """
+## Denoising Workflow
+### Supported Models
 #### 1. DeepSeek-R1 (deepseek-r1-distill-llama-8b)
+- **Function**: Comprehensive grammar, spelling, and whitespace error correction
+- **Advantages**: Handles multiple error types with a single model
+- **Configuration**: Requires DEEPSEEK_API_KEY in Space Settings
 #### 2. WAC-GEC (Whitespace + Grammar Error Correction)
+- **Function**: Two-step correction workflow
+  - **Step 1 (GEC)**: Uses a fine-tuned LLaMA-2-7B model for grammar and spelling correction
+  - **Step 2 (WAC)**: Uses a whitespace correction model to fix spacing issues
+- **Advantages**:
+  - Fully local, no API key required
+  - Combines two specialized models
+  - Suitable for offline environments and limited budgets
+- **Model Source**:
   - GEC: [lllouo/gec_Chat-LLaMa-2-7B-FT](https://huggingface.co/lllouo/gec_Chat-LLaMa-2-7B-FT)
+  - WAC: whitespace_correction library
+### Core Algorithm
+1. **Preprocessing (process_sentence)**
+   - Detect sentence completeness
+   - Add marker `___` for incomplete sentences (DeepSeek only)
+   - Preserve multi-line text format
+2. **Model Denoising**
+   - **DeepSeek**: Use API for comprehensive error correction, up to 5 retries
    - **WAC-GEC**:
+     - First use GEC model for grammar and spelling correction
+     - Then use WAC model for whitespace correction
+     - Up to 3 retries
+3. **Format Validation**
+   - Verify output format correctness
+   - Check marker preservation
+   - Check that the output length is reasonable
+4. **Post-processing**
+   - Extract denoised content
+   - Restore original multi-line format
+   - Generate Parquet file with model identifier
+### Supported Datasets
+- **MMLU**: Multiple choice questions across 57 subjects
+- **GSM8K**: Math reasoning problems
+- **ARC-Challenge**: Science Q&A
+- **MedMCQA**: Medical multiple choice
+- **CoQA**: Conversational Q&A
+- And more...
+### Color Annotation Legend
+- 🔴 **Red**: Errors in original text (spelling, grammar, spacing, etc.)
+- 🟢 **Green**: Corrections after denoising
+- ⚫ **Black**: Unchanged correct parts
+### Tech Stack
 - **LLM**: DeepSeek API (deepseek-r1-distill-llama-8b)
+- **Local Models**:
+  - GEC: LLaMA-2-7B (fine-tuned for grammar correction)
   - WAC: Whitespace Correction Model
+- **Frontend**: Gradio 4.16.0
+- **Data Processing**: Pandas + PyArrow (Parquet)
+- **Diff Comparison**: Python difflib
+- **NLP Tools**: spaCy, pyspellchecker
+- **API Calls**: OpenAI SDK
+- **Deployment**: Hugging Face Spaces
+### Quality Metrics
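+- **WAR (Whitespace Anomaly Rate)**: Rate of whitespace anomalies in the text, reported as a percentage (lower is better)
+- **SED (Spelling Error Density)**: Spelling errors relative to total words, reported as a percentage (lower is better)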
+### Model Selection Guide
+- **Need comprehensive denoising + API budget**: Choose DeepSeek-R1
+- **Local deployment + complete correction**: Choose WAC-GEC (Recommended)
+- **Only need spacing correction**: Use WAC module alone
+- **Fastest speed**: Use GPU-accelerated WAC-GEC
 ---
+**Graduate Thesis Research Showcase** | Powered by DeepSeek API & WAC-GEC
 """
+# ======================== Gradio Interface ========================
+demo = gr.Blocks(title="Dataset Denoising Framework Demo System", css="""
     .markdown-text { font-size: 16px; line-height: 1.6; }
 """)
 with demo:
     gr.Markdown(
+        """<div style="text-align: center;"><h1>⭐ <span style='color: #e6b800;'>Denoising Factory</span> Based on Benchmark Denoising Framework</h1></div>
         <br>
+        <p>This system, built on <a href="https://github.com/LLLoUo/bd-toolkit" target="_blank">BD-toolkit</a>, demonstrates how the DeepSeek-R1 and WAC-GEC methods denoise mainstream benchmark datasets. Quality is measured with the WAR (Whitespace Anomaly Rate) and SED (Spelling Error Density) metrics.</p>
         """,
         elem_classes="markdown-text"
     )
     with gr.Tabs(elem_classes="tab-buttons") as tabs:
         with gr.TabItem("📊 BD-benchmarks Leaderboard", id=0):
             with gr.Column():
+                gr.Markdown("### Mainstream Benchmark Leaderboard After BD Denoising")
                 with gr.Row():
                     search_bar = gr.Textbox(
+                        placeholder="🔍 Search benchmark name and press ENTER...",
                         show_label=False,
                         elem_id="search-bar",
                     )
                     filter_categories = gr.Radio(
+                        label="📂 Filter by Benchmark Category",
                         choices=["all", "BT", "RA", "TG", "SU", "ME", "GR"],
                         value="all",
                         elem_id="filter-columns",
                     )
                     filter_versions = gr.Radio(
+                        label="🔖 Filter by Dataset Version",
                         choices=[
+                            ("All Versions", "all"),
+                            ("Original Dataset", "original"),
+                            ("DeepSeek-R1 Denoised", "deepseek"),
+                            ("WAC-GEC Denoised", "wac_gec")
                         ],
                         value="all",
                         elem_id="filter-versions",
                 leaderboard_table = gr.Dataframe(
                     value=leaderboard_data[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
+                    headers=['ID', 'Category', 'Benchmark', 'WAR (%)', 'SED', 'Download'],
                     datatype=['str', 'str', 'str', 'number', 'number', 'markdown'],  # ID is an 8-character hex string
                     elem_id="leaderboard-table",
                     interactive=False,
                     visible=False
                 )
                 search_bar.submit(
                     lambda df, query: search_leaderboard(df, query)[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
                     [hidden_leaderboard, search_bar],
                     leaderboard_table
                 )
                 def combined_filter(df, category, version):
                     filtered = filter_leaderboard(df, category, version)
                     return filtered[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']]
                     leaderboard_table
                 )
                 filter_versions.change(
                     combined_filter,
                     [hidden_leaderboard, filter_categories, filter_versions],
                 )
                 gr.Markdown("""
+                **Legend:**
+                - **Category**: BT=Basic Tasks, RA=Reasoning Abilities, TG=Text Generation, SU=Speech Understanding, ME=Medical, GR=Grammar
+                - **Version**: Original=Unprocessed dataset, DeepSeek-R1=DeepSeek denoised version, WAC-GEC=WAC-GEC denoised version
+                - **WAR**: Whitespace Anomaly Rate (lower is better)
+                - **SED**: Spelling Error Density (lower is better)
                 """, elem_classes="markdown-text")
+        with gr.TabItem("🚀 BD-toolkit Demo", id=2):
+            gr.Markdown("## BD-toolkit Lightweight Demo")
+            model_status = ("✅" if WAC_GEC_AVAILABLE else "❌") + " WAC-GEC: " + ("Available" if WAC_GEC_AVAILABLE else "Not Installed")
+            model_status += " | " + ("✅" if DEEPSEEK_API_KEY else "❌") + " DeepSeek-R1: " + ("Configured" if DEEPSEEK_API_KEY else "API Key Not Configured")
+            gr.Markdown(f"**Model Status**: {model_status}")
             with gr.Row():
                 with gr.Column():
                     file_input = gr.File(
+                        label="📁 Upload Parquet File",
                         file_types=[".parquet"]
                     )
                     question_column = gr.Textbox(
+                        label="📝 Question Column Name",
                         value="question",
+                        placeholder="e.g., question, input_text, prompt"
                     )
                     model_choice = gr.Dropdown(
                         choices=["WAC-GEC", "deepseek-r1-distill-llama-8b"],
                         value="WAC-GEC",
+                        label="🤖 Select Model",
+                        info="DeepSeek: Comprehensive correction | WAC-GEC: Grammar + whitespace (local model)"
                     )
                     temperature = gr.Slider(
                         value=0.1,
                         step=0.1,
                         label="🌡️ Temperature",
+                        info="Only effective for DeepSeek",
+                        interactive=False
                     )
                     max_samples = gr.Slider(
                         maximum=100,
                         value=5,
                         step=1,
+                        label="📊 Number of Samples to Process (Demo Limit)"
                     )
+                    clean_btn = gr.Button("🚀 Start Denoising", variant="primary", size="lg")
                 with gr.Column():
                     output_text = gr.Textbox(
+                        label="⏳ Processing Progress",
                         lines=10,
                         max_lines=15
                     )
+                    download_file = gr.File(label="📥 Download Denoised Dataset")
             def update_temperature_interactive(model):
                 if model == "deepseek-r1-distill-llama-8b":
+                    return gr.update(interactive=True, info="Adjust generation randomness")
                 else:
+                    return gr.update(interactive=False, info="WAC-GEC model does not support temperature parameter")
             model_choice.change(
                 fn=update_temperature_interactive,
                 outputs=[temperature]
             )
+            gr.Markdown("### 🎨 Denoising Effect Comparison Preview")
             gr.Markdown("""
+            **Color Legend**:
+            - 🔴 <span style="color: #dc3545;">Red</span> = Errors in original text
+            - 🟢 <span style="color: #28a745;">Green</span> = Corrections after denoising
+            - ⚫ Black = Unchanged correct parts
             """)
             colored_preview = gr.HTML(label="")
                 outputs=[output_text, download_file, colored_preview]
             )
+        with gr.TabItem("📝 About", id=3):
+            gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")
 if __name__ == "__main__":
+    print("🚀 Preloading WAC-GEC models...")
     initialize_wac_gec()
     demo.launch(

leaderboard.json CHANGED

@@ -5,7 +5,7 @@
         "Benchmark": "ARC_original",
         "WAR": 0.11,
         "SED": 0.67,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc)"
     },
     {
         "ID": 2,
@@ -13,7 +13,7 @@
         "Benchmark": "ARC_deepseek_r1_denoising",
         "WAR": 0.00,
         "SED": 0.67,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_deepseek_r1_denoising)"
     },
     {
         "ID": 3,
@@ -21,7 +21,7 @@
         "Benchmark": "ARC_wac_gec",
         "WAR": 0.00,
         "SED": 0.66,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_wac_gec)"
     },
     {
         "ID": 4,
@@ -29,7 +29,7 @@
         "Benchmark": "COQA_original",
         "WAR": 6.79,
         "SED": 2.74,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa)"
     },
     {
         "ID": 5,
@@ -37,7 +37,7 @@
         "Benchmark": "COQA_deepseek_r1_denoising",
         "WAR": 4.18,
         "SED": 2.57,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_deepseek_r1_denoising)"
     },
     {
         "ID": 6,
@@ -45,7 +45,7 @@
         "Benchmark": "COQA_wac_gec",
         "WAR": 4.70,
         "SED": 2.56,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_wac_gec)"
     },
     {
         "ID": 7,
@@ -53,7 +53,7 @@
         "Benchmark": "DROP_original",
         "WAR": 1.50,
         "SED": 3.38,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop)"
     },
     {
         "ID": 8,
@@ -61,7 +61,7 @@
         "Benchmark": "DROP_deepseek_r1_denoising",
         "WAR": 0.02,
         "SED": 3.24,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_deepseek_r1_denoising)"
     },
     {
         "ID": 9,
@@ -69,7 +69,7 @@
         "Benchmark": "DROP_wac_gec",
         "WAR": 0.64,
         "SED": 3.25,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_wac_gec)"
     },
     {
         "ID": 10,
@@ -77,7 +77,7 @@
         "Benchmark": "MRPC_original",
         "WAR": 100.00,
         "SED": 5.65,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/mrpc)"
     },
     {
         "ID": 11,
@@ -85,7 +85,7 @@
         "Benchmark": "MRPC_deepseek_r1_denoising",
         "WAR": 3.80,
         "SED": 4.70,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/mrpc)"
     },
     {
         "ID": 12,
@@ -93,7 +93,7 @@
         "Benchmark": "MRPC_wac_gec",
         "WAR": 1.84,
         "SED": 4.50,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/mrpc)"
     },
     {
         "ID": 13,
@@ -101,7 +101,7 @@
         "Benchmark": "RTE_original",
         "WAR": 2.17,
         "SED": 4.47,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/rte)"
     },
     {
         "ID": 14,
@@ -109,7 +109,7 @@
         "Benchmark": "RTE_deepseek_r1_denoising",
         "WAR": 0.36,
         "SED": 4.50,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/rte)"
     },
     {
         "ID": 15,
@@ -117,7 +117,7 @@
         "Benchmark": "RTE_wac_gec",
         "WAR": 0.72,
         "SED": 4.43,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/rte)"
     },
     {
         "ID": 16,
@@ -125,7 +125,7 @@
         "Benchmark": "SST2_original",
         "WAR": 98.97,
         "SED": 5.42,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/sst2)"
     },
     {
         "ID": 17,
@@ -133,7 +133,7 @@
         "Benchmark": "SST2_deepseek_r1_denoising",
         "WAR": 7.22,
         "SED": 3.66,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/sst2)"
     },
     {
         "ID": 18,
@@ -141,7 +141,7 @@
         "Benchmark": "SST2_wac_gec",
         "WAR": 5.39,
         "SED": 3.52,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/sst2)"
     },
     {
         "ID": 19,
@@ -149,7 +149,7 @@
         "Benchmark": "WNLI_original",
         "WAR": 0.70,
         "SED": 0.64,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/wnli)"
     },
     {
         "ID": 20,
@@ -157,7 +157,7 @@
         "Benchmark": "WNLI_deepseek_r1_denoising",
         "WAR": 0.00,
         "SED": 0.59,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/wnli)"
     },
     {
         "ID": 21,
@@ -165,7 +165,7 @@
         "Benchmark": "WNLI_wac_gec",
         "WAR": 0.00,
         "SED": 0.64,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/wnli)"
     },
     {
         "ID": 22,
@@ -173,7 +173,7 @@
         "Benchmark": "GSM8K_original",
         "WAR": 25.70,
         "SED": 1.11,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k)"
     },
     {
         "ID": 23,
@@ -181,7 +181,7 @@
         "Benchmark": "GSM8K_deepseek_r1_denoising",
         "WAR": 0.30,
         "SED": 1.13,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_deepseek_r1_denoising)"
     },
     {
         "ID": 24,
@@ -189,7 +189,7 @@
         "Benchmark": "GSM8K_wac_gec",
         "WAR": 1.97,
         "SED": 1.11,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_wac_gec)"
     },
     {
         "ID": 25,
@@ -197,7 +197,7 @@
         "Benchmark": "MMLU_original",
         "WAR": 10.06,
         "SED": 2.21,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu)"
     },
     {
         "ID": 26,
@@ -205,7 +205,7 @@
         "Benchmark": "MMLU_deepseek_r1_denoising",
         "WAR": 6.56,
         "SED": 2.15,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_deepseek_r1_denoising)"
     },
     {
         "ID": 27,
@@ -213,7 +213,7 @@
         "Benchmark": "MMLU_wac_gec",
         "WAR": 2.98,
         "SED": 2.08,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_wac_gec)"
     },
     {
         "ID": 28,
@@ -221,7 +221,7 @@
         "Benchmark": "MedMCQA_original",
         "WAR": 6.31,
         "SED": 6.18,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa)"
     },
     {
         "ID": 29,
@@ -229,7 +229,7 @@
         "Benchmark": "MedMCQA_deepseek_r1_denoising",
         "WAR": 3.44,
         "SED": 5.70,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_deepseek_r1_denoising)"
     },
     {
         "ID": 30,
@@ -237,7 +237,7 @@
         "Benchmark": "MedMCQA_wac_gec",
         "WAR": 2.44,
         "SED": 5.91,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_wac_gec)"
     },
     {
         "ID": 31,
@@ -245,7 +245,7 @@
         "Benchmark": "MedQA_original",
         "WAR": 16.97,
         "SED": 6.49,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA-USMLE-4-options)"
     },
     {
         "ID": 32,
@@ -253,7 +253,7 @@
         "Benchmark": "MedQA_deepseek_r1_denoising",
         "WAR": 16.26,
         "SED": 6.49,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_deepseek_r1_denoising)"
     },
     {
         "ID": 33,
@@ -261,7 +261,7 @@
         "Benchmark": "MedQA_wac_gec",
         "WAR": 0.79,
         "SED": 6.51,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_wac_gec)"
     },
     {
         "ID": 34,
@@ -269,7 +269,7 @@
         "Benchmark": "Natural_questions_original",
         "WAR": 0.17,
         "SED": 2.90,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open)"
     },
     {
         "ID": 35,
@@ -277,7 +277,7 @@
         "Benchmark": "Natural_questions_deepseek_r1_denoising",
         "WAR": 0.06,
         "SED": 3.06,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_deepseek_r1_denoising)"
     },
     {
         "ID": 36,
@@ -285,7 +285,7 @@
         "Benchmark": "Natural_questions_wac_gec",
         "WAR": 0.28,
         "SED": 2.93,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_wac_gec)"
     },
     {
         "ID": 37,
@@ -293,7 +293,7 @@
         "Benchmark": "PubMedQA_original",
         "WAR": 0.60,
         "SED": 8.15,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa)"
     },
     {
         "ID": 38,
@@ -301,7 +301,7 @@
         "Benchmark": "PubMedQA_deepseek_r1_denoising",
         "WAR": 0.20,
         "SED": 8.19,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_deepseek_r1_denoising)"
     },
     {
         "ID": 39,
@@ -309,7 +309,7 @@
         "Benchmark": "PubMedQA_wac_gec",
         "WAR": 0.00,
         "SED": 8.10,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_wac_gec)"
     },
     {
         "ID": 40,
@@ -317,7 +317,7 @@
         "Benchmark": "Truthful_QA_original",
         "WAR": 0.00,
         "SED": 1.75,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa)"
     },
     {
         "ID": 41,
@@ -325,7 +325,7 @@
         "Benchmark": "Truthful_QA_deepseek_r1_denoising",
         "WAR": 0.00,
         "SED": 1.73,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_deepseek_r1_denoising)"
     },
     {
         "ID": 42,
@@ -333,6 +333,6 @@
         "Benchmark": "Truthful_QA_wac_gec",
         "WAR": 0.00,
         "SED": 1.53,
-        "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_wac_gec)"
     }
 ]
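
Reviewer sketch, not part of the commit: the leaderboard entries pair each `_original` benchmark with its `_deepseek_r1_denoising` and `_wac_gec` variants, so the change in WAR and SED can be derived directly from the JSON. The example below uses the three MRPC rows from this file. The helper names `split_name` and `deltas` are hypothetical, and only the fields needed for the calculation are reproduced.

```python
entries = [
    {"Benchmark": "MRPC_original", "WAR": 100.00, "SED": 5.65},
    {"Benchmark": "MRPC_deepseek_r1_denoising", "WAR": 3.80, "SED": 4.70},
    {"Benchmark": "MRPC_wac_gec", "WAR": 1.84, "SED": 4.50},
]

SUFFIXES = ("_original", "_deepseek_r1_denoising", "_wac_gec")


def split_name(benchmark):
    """Split 'MRPC_wac_gec' into ('MRPC', 'wac_gec')."""
    for suffix in SUFFIXES:
        if benchmark.endswith(suffix):
            return benchmark[: -len(suffix)], suffix[1:]
    raise ValueError(f"Unrecognized benchmark name: {benchmark}")


def deltas(rows):
    """Map (dataset, version) to (WAR change, SED change) relative to the original."""
    by_key = {split_name(row["Benchmark"]): row for row in rows}
    result = {}
    for (name, version), row in by_key.items():
        if version == "original":
            continue
        base = by_key[(name, "original")]
        result[(name, version)] = (
            round(row["WAR"] - base["WAR"], 2),
            round(row["SED"] - base["SED"], 2),
        )
    return result


for key, (d_war, d_sed) in deltas(entries).items():
    print(key, f"WAR {d_war:+.2f}", f"SED {d_sed:+.2f}")
```

Negative values mean the denoised variant scores lower than the original, which the leaderboard legend treats as better for both metrics.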

         "Benchmark": "ARC_original",
         "WAR": 0.11,
         "SED": 0.67,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc)"
     },
     {
         "ID": 2,
         "Benchmark": "ARC_deepseek_r1_denoising",
         "WAR": 0.00,
         "SED": 0.67,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_deepseek_r1_denoising)"
     },
     {
         "ID": 3,
         "Benchmark": "ARC_wac_gec",
         "WAR": 0.00,
         "SED": 0.66,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_wac_gec)"
     },
     {
         "ID": 4,
         "Benchmark": "COQA_original",
         "WAR": 6.79,
         "SED": 2.74,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa)"
     },
     {
         "ID": 5,
         "Benchmark": "COQA_deepseek_r1_denoising",
         "WAR": 4.18,
         "SED": 2.57,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_deepseek_r1_denoising)"
     },
     {
         "ID": 6,
         "Benchmark": "COQA_wac_gec",
         "WAR": 4.70,
         "SED": 2.56,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_wac_gec)"
     },
     {
         "ID": 7,
         "Benchmark": "DROP_original",
         "WAR": 1.50,
         "SED": 3.38,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop)"
     },
     {
         "ID": 8,
         "Benchmark": "DROP_deepseek_r1_denoising",
         "WAR": 0.02,
         "SED": 3.24,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_deepseek_r1_denoising)"
     },
     {
         "ID": 9,
         "Benchmark": "DROP_wac_gec",
         "WAR": 0.64,
         "SED": 3.25,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_wac_gec)"
     },
     {
         "ID": 10,
         "Benchmark": "MRPC_original",
         "WAR": 100.00,
         "SED": 5.65,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/mrpc)"
     },
     {
         "ID": 11,
         "Benchmark": "MRPC_deepseek_r1_denoising",
         "WAR": 3.80,
         "SED": 4.70,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/mrpc)"
     },
     {
         "ID": 12,
         "Benchmark": "MRPC_wac_gec",
         "WAR": 1.84,
         "SED": 4.50,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/mrpc)"
     },
     {
         "ID": 13,
         "Benchmark": "RTE_original",
         "WAR": 2.17,
         "SED": 4.47,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/rte)"
     },
     {
         "ID": 14,
         "Benchmark": "RTE_deepseek_r1_denoising",
         "WAR": 0.36,
         "SED": 4.50,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/rte)"
     },
     {
         "ID": 15,
         "Benchmark": "RTE_wac_gec",
         "WAR": 0.72,
         "SED": 4.43,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/rte)"
     },
     {
         "ID": 16,
         "Benchmark": "SST2_original",
         "WAR": 98.97,
         "SED": 5.42,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/sst2)"
     },
     {
         "ID": 17,
         "Benchmark": "SST2_deepseek_r1_denoising",
         "WAR": 7.22,
         "SED": 3.66,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/sst2)"
     },
     {
         "ID": 18,
         "Benchmark": "SST2_wac_gec",
         "WAR": 5.39,
         "SED": 3.52,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/sst2)"
     },
     {
         "ID": 19,
         "Benchmark": "WNLI_original",
         "WAR": 0.70,
         "SED": 0.64,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/wnli)"
     },
     {
         "ID": 20,
         "Benchmark": "WNLI_deepseek_r1_denoising",
         "WAR": 0.00,
         "SED": 0.59,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/wnli)"
     },
     {
         "ID": 21,
         "Benchmark": "WNLI_wac_gec",
         "WAR": 0.00,
         "SED": 0.64,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/wnli)"
     },
     {
         "ID": 22,
         "Benchmark": "GSM8K_original",
         "WAR": 25.70,
         "SED": 1.11,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k)"
     },
     {
         "ID": 23,
         "Benchmark": "GSM8K_deepseek_r1_denoising",
         "WAR": 0.30,
         "SED": 1.13,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_deepseek_r1_denoising)"
     },
     {
         "ID": 24,
         "Benchmark": "GSM8K_wac_gec",
         "WAR": 1.97,
         "SED": 1.11,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_wac_gec)"
     },
     {
         "ID": 25,
         "Benchmark": "MMLU_original",
         "WAR": 10.06,
         "SED": 2.21,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu)"
     },
     {
         "ID": 26,
         "Benchmark": "MMLU_deepseek_r1_denoising",
         "WAR": 6.56,
         "SED": 2.15,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_deepseek_r1_denoising)"
     },
     {
         "ID": 27,
         "Benchmark": "MMLU_wac_gec",
         "WAR": 2.98,
         "SED": 2.08,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_wac_gec)"
     },
     {
         "ID": 28,
         "Benchmark": "MedMCQA_original",
         "WAR": 6.31,
         "SED": 6.18,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa)"
     },
     {
         "ID": 29,
         "Benchmark": "MedMCQA_deepseek_r1_denoising",
         "WAR": 3.44,
         "SED": 5.70,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_deepseek_r1_denoising)"
     },
     {
         "ID": 30,
         "Benchmark": "MedMCQA_wac_gec",
         "WAR": 2.44,
         "SED": 5.91,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_wac_gec)"
     },
     {
         "ID": 31,
         "Benchmark": "MedQA_original",
         "WAR": 16.97,
         "SED": 6.49,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA-USMLE-4-options)"
     },
     {
         "ID": 32,
         "Benchmark": "MedQA_deepseek_r1_denoising",
         "WAR": 16.26,
         "SED": 6.49,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_deepseek_r1_denoising)"
     },
     {
         "ID": 33,
         "Benchmark": "MedQA_wac_gec",
         "WAR": 0.79,
         "SED": 6.51,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_wac_gec)"
     },
     {
         "ID": 34,
         "Benchmark": "Natural_questions_original",
         "WAR": 0.17,
         "SED": 2.90,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open)"
     },
     {
         "ID": 35,
         "Benchmark": "Natural_questions_deepseek_r1_denoising",
         "WAR": 0.06,
         "SED": 3.06,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_deepseek_r1_denoising)"
     },
     {
         "ID": 36,
         "Benchmark": "Natural_questions_wac_gec",
         "WAR": 0.28,
         "SED": 2.93,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_wac_gec)"
     },
     {
         "ID": 37,
         "Benchmark": "PubMedQA_original",
         "WAR": 0.60,
         "SED": 8.15,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa)"
     },
     {
         "ID": 38,
         "Benchmark": "PubMedQA_deepseek_r1_denoising",
         "WAR": 0.20,
         "SED": 8.19,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_deepseek_r1_denoising)"
     },
     {
         "ID": 39,
         "Benchmark": "PubMedQA_wac_gec",
         "WAR": 0.00,
         "SED": 8.10,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_wac_gec)"
     },
     {
         "ID": 40,
         "Benchmark": "Truthful_QA_original",
         "WAR": 0.00,
         "SED": 1.75,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa)"
     },
     {
         "ID": 41,
         "Benchmark": "Truthful_QA_deepseek_r1_denoising",
         "WAR": 0.00,
         "SED": 1.73,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_deepseek_r1_denoising)"
     },
     {
         "ID": 42,
         "Benchmark": "Truthful_QA_wac_gec",
         "WAR": 0.00,
         "SED": 1.53,
+        "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_wac_gec)"
     }
 ]
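
The entries above follow a regular pattern: each benchmark appears three times, with the suffixes `_original`, `_deepseek_r1_denoising`, and `_wac_gec`, and the `Download` field is a Markdown link. A minimal sketch of how such a file might be read and compared is given below. It is not taken from `app.py`; the helper names (`split_name`, `link_url`, `group_by_family`, `war_change`) are hypothetical, and the sample embeds three rows copied from the leaderboard rather than loading the real file.

```python
import json
import re

# Three rows copied from the leaderboard above; the full file has 42 entries.
SAMPLE = """
[
    {"ID": 10, "Benchmark": "MRPC_original", "WAR": 100.00, "SED": 5.65,
     "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/mrpc)"},
    {"ID": 11, "Benchmark": "MRPC_deepseek_r1_denoising", "WAR": 3.80, "SED": 4.70,
     "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/mrpc)"},
    {"ID": 12, "Benchmark": "MRPC_wac_gec", "WAR": 1.84, "SED": 4.50,
     "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/mrpc)"}
]
"""

# Variant suffixes as they appear in the "Benchmark" field.
VARIANTS = ("original", "deepseek_r1_denoising", "wac_gec")

# Matches a Markdown link such as "[Download](https://...)".
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)]+)\)")


def split_name(benchmark):
    """Split 'MRPC_wac_gec' into ('MRPC', 'wac_gec')."""
    for variant in VARIANTS:
        suffix = "_" + variant
        if benchmark.endswith(suffix):
            return benchmark[: -len(suffix)], variant
    raise ValueError(f"unrecognised benchmark name: {benchmark}")


def link_url(cell):
    """Return the URL inside a Markdown link, or the cell unchanged."""
    match = LINK_RE.fullmatch(cell)
    return match.group(1) if match else cell


def group_by_family(rows):
    """Map each benchmark family to its rows, keyed by variant."""
    families = {}
    for row in rows:
        family, variant = split_name(row["Benchmark"])
        families.setdefault(family, {})[variant] = row
    return families


def war_change(families):
    """WAR of each processed variant minus the WAR of the original."""
    changes = {}
    for family, variants in families.items():
        base = variants["original"]["WAR"]
        changes[family] = {
            name: round(row["WAR"] - base, 2)
            for name, row in variants.items()
            if name != "original"
        }
    return changes


rows = json.loads(SAMPLE)
families = group_by_family(rows)
changes = war_change(families)

for family, deltas in changes.items():
    for variant, delta in deltas.items():
        url = link_url(families[family][variant]["Download"])
        print(f"{family:<8} {variant:<24} WAR change {delta:+.2f}  {url}")
```

To run this against the real file, replace `json.loads(SAMPLE)` with `json.load(open("leaderboard.json", encoding="utf-8"))`, assuming the file is a plain JSON array as shown in the diff.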