lllouo committed commit 28e23fd · 1 parent: ba32277

English Version

Files changed (3):
  1. README.md +48 -48
  2. app.py +195 -229
  3. leaderboard.json +42 -42
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: BD Framework Test
+ title: BD Framework
  emoji: 🔥
  colorFrom: blue
  colorTo: gray
@@ -13,80 +13,80 @@ short_description: Benchmark-Denoising (BD) framework
 
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
- # 数据集清洗框架展示系统
+ # Dataset Denoising Framework Demo System
 
- 基于LLM的智能数据集质量提升框架 - 研究生毕业论文成果展示
+ LLM-based Intelligent Dataset Quality Enhancement Framework - Graduate Thesis Research Showcase
 
- ## 部署到 Hugging Face Spaces
+ ## Deploy to Hugging Face Spaces
 
- ### 步骤1: 创建Space
+ ### Step 1: Create Space
 
- 1. 访问 https://huggingface.co/spaces
- 2. 点击 "Create new Space"
- 3. 选择 **Gradio** SDK (Docker)
- 4. Space名称: `dataset-cleaning-demo`
+ 1. Visit https://huggingface.co/spaces
+ 2. Click "Create new Space"
+ 3. Select **Gradio** SDK (or Docker)
+ 4. Space name: `dataset-cleaning-demo`
 
- ### 步骤2: 上传文件
+ ### Step 2: Upload Files
 
- 将以下文件上传到Space:
- - `app.py` - 主应用程序
- - `requirements.txt` - Python依赖
- - `README.md` - 本文件
+ Upload the following files to the Space:
+ - `app.py` - Main application
+ - `requirements.txt` - Python dependencies
+ - `README.md` - This file
 
- ### 步骤3: 配置环境变量
+ ### Step 3: Configure Environment Variables
 
- Space设置中添加:
- - `DEEPSEEK_API_KEY`: 你的DeepSeek API密钥
+ Add in Space settings:
+ - `DEEPSEEK_API_KEY`: Your DeepSeek API key
 
- ### 步骤4: 等待构建
+ ### Step 4: Wait for Build
 
- HF Spaces会自动构建并部署你的应用。
+ HF Spaces will automatically build and deploy your application.
 
- ## 本地运行
+ ## Local Development
  ```bash
- # 安装依赖
+ # Install dependencies
  pip install -r requirements.txt
 
- # 设置环境变量
+ # Set environment variable
  export DEEPSEEK_API_KEY="your-api-key"
 
- # 运行应用
+ # Run application
  python app.py
  ```
 
- 访问 http://localhost:7860
+ Visit http://localhost:7860
 
- ## 功能特性
+ ## Features
 
- 数据集上传 (JSON/JSONL格式)
- 基于DeepSeek API的智能清洗
- ✅ 19个主流benchmark的清洗效果展示
- 交互式Leaderboard
- 清洗结果下载
+ Dataset upload (JSON/JSONL format)
+ Intelligent denoising via DeepSeek API
+ Showcase denoising effects on 19 mainstream benchmarks
+ Interactive Leaderboard
+ Download denoised results
 
- ## 技术栈
+ ## Tech Stack
 
- - **前端**: React + Tailwind CSS
- - **后端**: FastAPI
+ - **Frontend**: React + Tailwind CSS
+ - **Backend**: FastAPI
  - **LLM**: DeepSeek API
- - **部署**: Hugging Face Spaces
+ - **Deployment**: Hugging Face Spaces
 
- ## 清洗流程
+ ## Denoising Workflow
 
- 1. **错误检测**: 识别数据质量问题
- 2. **质量评估**: 对样本进行评分
- 3. **智能修正**: LLM生成高质量版本
- 4. **一致性验证**: 确保逻辑一致性
+ 1. **Error Detection**: Identify data quality issues
+ 2. **Quality Assessment**: Score samples
+ 3. **Intelligent Correction**: LLM generates high-quality versions
+ 4. **Consistency Validation**: Ensure logical consistency
 
- ## 注意事项
+ ## Notes
 
- - Demo版本限制每次处理10个样本
- - 需要有效的DeepSeek API密钥
- - Leaderboard数据为预置结果
+ - Demo version limits processing to 10 samples per batch
+ - Requires valid DeepSeek API key
+ - Leaderboard data is pre-configured results
 
- ## 后续完善计划
+ ## Future Enhancements
 
- - [ ] 连接学校服务器LLaMA3模型
- - [ ] 支持大规模数据集处理
- - [ ] 添加更多评估指标
- - [ ] 实时处理进度反馈
+ - [ ] Connect to university server LLaMA3 model
+ - [ ] Support large-scale dataset processing
+ - [ ] Add more evaluation metrics
+ - [ ] Real-time processing progress feedback
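The feature list mentions JSON/JSONL uploads. As a rough illustration only (the helper name and parsing details are my assumption, not code from this commit), a JSONL dataset can be read like this:

```python
import json

def load_jsonl(path):
    """Read a JSON Lines dataset: one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```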
app.py CHANGED
@@ -14,27 +14,27 @@ import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import hashlib
 
- # ======================== 新增:WAC-GEC导入 ========================
  try:
  from whitespace_correction import WhitespaceCorrector
  WAC_GEC_AVAILABLE = True
- # 初始化WAC-GEC模型(延迟加载)
  wac_corrector = None
  except ImportError:
  WAC_GEC_AVAILABLE = False
  wac_corrector = None
- print("⚠️ whitespace_correction未安装,WAC-GEC功能将不可用")
 
- # 初始化GEC模型(延迟加载)
  gec_tokenizer = None
  gec_model = None
- GEC_MODEL_NAME = "lllouo/gec_Chat-LLaMa-2-7B-FT" # 你的HF模型地址
 
- # ======================== API配置 ========================
  DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY", "")
  DEEPSEEK_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"
 
- # ======================== NLP工具初始化 ========================
  try:
  nlp = spacy.load("en_core_web_sm")
  except OSError:
@@ -51,7 +51,7 @@ WHITESPACE_PATTERNS = [
  re.compile(r'([.,!?;:])\s{2,}'),
  ]
 
- # ======================== Prompt模板 ========================
  PROMPT_TEMPLATE = """## Positioning
  You are a **LANGUAGE grammatical error correction tool** that can identify and correct grammatical errors in a text.
  Reply with a corrected version of the input sentence with all **grammatical**, **spelling** and **whitespace errors** fixed, making only necessary changes.
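The pattern in this hunk flags punctuation followed by a run of whitespace. A minimal sketch of how such a pattern feeds a corpus-level count (the helper name and the exact WAR formula are my assumptions; the commit's `calculate_whitespace_anomaly_rate` is not shown in full here):

```python
import re

# One of the WHITESPACE_PATTERNS from app.py: punctuation followed by 2+ whitespace chars
PUNCT_RUN = re.compile(r'([.,!?;:])\s{2,}')

def count_whitespace_anomalies(sentences):
    """Count pattern hits across a list of sentences."""
    return sum(len(PUNCT_RUN.findall(s)) for s in sentences)
```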
@@ -79,14 +79,14 @@ Next, please correct the following sentence according to the above requirements.
 
  [input]: """
 
- # ======================== 新增:初始化函数(WAC + GEC) ========================
  def initialize_wac_gec():
- """延迟初始化WAC-GEC模型(空白符纠正 + 语法纠错)"""
  global wac_corrector, gec_tokenizer, gec_model
 
- # 1. 初始化WAC(空白符纠正)
  if not WAC_GEC_AVAILABLE:
- print("❌ WAC模块未安装")
  return False
 
  if wac_corrector is None:
@@ -97,17 +97,17 @@ def initialize_wac_gec():
  device=device,
  download_dir="./models"
  )
- print(f"✅ WAC空白符纠正模型已加载 (设备: {device})")
  except Exception as e:
- print(f"❌ WAC模型加载失败: {e}")
  return False
 
- # 2. 初始化GEC(语法纠错)
  if gec_model is None or gec_tokenizer is None:
  try:
  device = "cuda" if torch.cuda.is_available() else "cpu"
 
- print(f"📥 正在从HuggingFace下载GEC模型: {GEC_MODEL_NAME}")
  gec_tokenizer = AutoTokenizer.from_pretrained(
  GEC_MODEL_NAME,
  trust_remote_code=True
@@ -119,51 +119,45 @@ def initialize_wac_gec():
  trust_remote_code=True
  )
 
- # 如果是CPU模式,手动移动模型
  if device == "cpu":
  gec_model = gec_model.to(device)
 
- # 设置tokenizer的pad_token和padding_side
  gec_tokenizer.pad_token_id = gec_tokenizer.eos_token_id
  gec_tokenizer.padding_side = "left"
 
- print(f"✅ GEC语法纠错模型已加载 (设备: {device})")
 
  except Exception as e:
- print(f"❌ GEC模型加载失败: {e}")
  return False
 
  return True
 
- # ======================== 新增:GEC语法纠错函数 ========================
  def correct_sentence_gec(input_sentence):
  """
- 使用GEC模型进行语法纠错
- 参数:
- input_sentence (str): 需要纠正的句子
- 返回:
- str: 纠正后的句子
  """
  if gec_model is None or gec_tokenizer is None:
- raise ValueError("GEC模型未初始化")
 
- # 构建提示词
  prompt = f"""Rewrite the following sentence to correct grammatical errors. Return ONLY the corrected sentence.
 Original: {input_sentence}
 Corrected:"""
 
- # 生成修正
  inputs = gec_tokenizer(prompt, return_tensors="pt").to(gec_model.device)
 
- # 检测设备类型以优化参数
  is_cpu = str(gec_model.device) == "cpu" or not torch.cuda.is_available()
 
- # CPU优化参数:减少beam search和token长度
  if is_cpu:
- max_tokens = 256 # CPU模式减半
- beams = 2 # 减少beam数量加速
  else:
- max_tokens = 512 # GPU模式保持
  beams = 4
 
  with torch.no_grad():
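`correct_sentence_gec` halves the token budget and beam count when running on CPU. That device-dependent choice can be isolated as a small helper (the helper name and the `max_new_tokens` key are my assumptions; the hunk does not show how `max_tokens` is passed to `generate()`):

```python
def generation_params(is_cpu: bool) -> dict:
    """Beam-search settings mirroring correct_sentence_gec: smaller budget on CPU."""
    if is_cpu:
        # CPU mode: halve the token budget and use fewer beams for speed
        return {"max_new_tokens": 256, "num_beams": 2}
    return {"max_new_tokens": 512, "num_beams": 4}
```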
@@ -176,55 +170,50 @@ Corrected:"""
  top_p=None
  )
 
- # 提取并清理输出
  full_output = gec_tokenizer.decode(outputs[0], skip_special_tokens=True)
  corrected_text = full_output.replace(prompt, "").strip()
 
- # 进一步清理可能的前缀
  if corrected_text.startswith("Corrected:"):
  corrected_text = corrected_text[len("Corrected:"):].strip()
 
  return corrected_text
 
- # ======================== 新增:WAC-GEC组合处理函数 ========================
  def call_wac_gec(text):
  """
- 使用WAC-GEC两步纠正:
- 1. GEC模型进行语法和拼写纠正
- 2. WAC模型进行空白符纠正
  """
  if not initialize_wac_gec():
- raise ValueError("⚠️ WAC-GEC模型未安装或加载失败")
 
  try:
- # Step 1: 使用GEC模型进行语法纠错
- print(f"🔍 GEC处理: {text[:50]}...")
  gec_corrected = correct_sentence_gec(text)
- print(f"✅ GEC结果: {gec_corrected[:50]}...")
 
- # Step 2: 使用WAC模型进行空白符纠正
- print(f"🔍 WAC处理: {gec_corrected[:50]}...")
  final_corrected = wac_corrector.correct_text(gec_corrected)
- print(f"✅ WAC结果: {final_corrected[:50]}...")
 
- # 格式化输出以匹配DeepSeek的格式
  return f"[output]: {final_corrected}"
 
  except Exception as e:
- raise Exception(f"WAC-GEC处理错误: {str(e)}")
 
- # ======================== 新增:颜色对比函数 ========================
  def generate_colored_diff(original, cleaned):
  """
- 生成带颜色标注的HTML差异对比
- 原始文本中的错误:红色
- 去噪后的修正:绿色
  """
- # 分词处理
  original_words = original.split()
  cleaned_words = cleaned.split()
 
- # 使用difflib进行序列匹配
  matcher = difflib.SequenceMatcher(None, original_words, cleaned_words)
 
  original_html = []
@@ -232,21 +221,17 @@ def generate_colored_diff(original, cleaned):
 
  for tag, i1, i2, j1, j2 in matcher.get_opcodes():
  if tag == 'equal':
- # 相同部分保持黑色
  original_html.extend(original_words[i1:i2])
  cleaned_html.extend(cleaned_words[j1:j2])
  elif tag == 'replace':
- # 替换部分:原文红色,新文绿色
  original_html.extend([f'<span style="color: #dc3545; font-weight: bold;">{w}</span>'
  for w in original_words[i1:i2]])
  cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
  for w in cleaned_words[j1:j2]])
  elif tag == 'delete':
- # 删除部分:原文红色带删除线
  original_html.extend([f'<span style="color: #dc3545; text-decoration: line-through;">{w}</span>'
  for w in original_words[i1:i2]])
  elif tag == 'insert':
- # 插入部分:新文绿色
  cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
  for w in cleaned_words[j1:j2]])
 
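`generate_colored_diff` walks `difflib` opcodes over word lists; the matching step can be demonstrated standalone (helper name is mine):

```python
import difflib

def word_opcodes(original: str, cleaned: str):
    """Word-level opcodes, as consumed by generate_colored_diff."""
    matcher = difflib.SequenceMatcher(None, original.split(), cleaned.split())
    return matcher.get_opcodes()
```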
@@ -254,7 +239,7 @@ def generate_colored_diff(original, cleaned):
 
  def create_comparison_html(original_list, cleaned_list):
  """
- 创建HTML表格展示对比 - 样式匹配Leaderboard表格
  """
  html = """
  <div style="font-family: 'Times New Roman', serif; max-width: 100%; overflow-x: auto;">
@@ -290,8 +275,8 @@ def create_comparison_html(original_list, cleaned_list):
  <thead>
  <tr>
  <th class="index-col">#</th>
- <th>原始问题</th>
- <th>去噪后问题</th>
  </tr>
  </thead>
  <tbody>
@@ -315,11 +300,11 @@ def create_comparison_html(original_list, cleaned_list):
 
  return html
 
- # ======================== 工具函数 ========================
  def check_api_key(model_choice):
- """检查API密钥(仅DeepSeek需要)"""
  if model_choice == "deepseek-r1-distill-llama-8b" and not DEEPSEEK_API_KEY:
- raise ValueError("⚠️ 请在 Space Settings 中配置 DEEPSEEK_API_KEY!")
 
  def call_deepseek_api(prompt, model="deepseek-r1-distill-llama-8b", temperature=0.1, stream=True):
  check_api_key(model)
@@ -418,18 +403,17 @@ def calculate_spelling_error_density(sentences):
  return 0.0
  return total_errors / total_words * 100
 
- # ======================== Leaderboard数据处理 ========================
  def load_leaderboard_data():
  json_path = "leaderboard.json"
  try:
  with open(json_path, 'r', encoding='utf-8') as f:
  data = json.load(f)
 
- # Replace ID with hash based on Benchmark
  for item in data:
  benchmark = item['Benchmark']
  hash_object = hashlib.md5(benchmark.encode())
- item['ID'] = hash_object.hexdigest()[:8] # Use first 8 hex digits for brevity
 
  return pd.DataFrame(data)
  except Exception as e:
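`load_leaderboard_data` derives each row's ID from the first 8 hex digits of the benchmark name's MD5; in isolation (helper name is mine):

```python
import hashlib

def benchmark_id(benchmark: str) -> str:
    """Stable 8-hex-character ID for a benchmark name, as in load_leaderboard_data."""
    return hashlib.md5(benchmark.encode()).hexdigest()[:8]
```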
@@ -438,15 +422,13 @@ def load_leaderboard_data():
 
  def filter_leaderboard(df, category_query, version_query):
  """
- 同时按类别和版本筛选
  """
  result = df.copy()
 
- # 按类别筛选
  if category_query != "all":
  result = result[result['Category'] == category_query]
 
- # 按版本筛选
  if version_query != "all":
  if version_query == "original":
  result = result[result['Benchmark'].str.contains('_original', case=False, na=False)]
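The version filter keys off a `_original` substring in the `Benchmark` column; a small reproducible example (the sample data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Benchmark": ["gsm8k_original", "gsm8k_deepseek", "mmlu_original"]})
# Same substring filter used by filter_leaderboard for the "original" version
originals = df[df["Benchmark"].str.contains("_original", case=False, na=False)]
```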
@@ -462,38 +444,35 @@ def search_leaderboard(df, query):
  return df
  return df[df['Benchmark'].str.contains(query, case=False, na=False)]
 
- # ======================== 数据去噪函数(修改版:支持双模型)========================
  def clean_dataset(file_path, question_column, model_choice, temperature, max_samples, progress=gr.Progress()):
  try:
- # 检查API密钥(仅DeepSeek需要)
  try:
  check_api_key(model_choice)
  except ValueError as e:
  if model_choice == "deepseek-r1-distill-llama-8b":
  return str(e), None, ""
 
- # 检查WAC-GEC可用性
  if model_choice == "WAC-GEC" and not WAC_GEC_AVAILABLE:
- return "❌ WAC-GEC模型未安装!请安装 whitespace_correction 包。", None, ""
 
- progress(0.05, desc="📁 读取数据文件...")
  df = pd.read_parquet(file_path)
 
  if question_column not in df.columns:
  available_columns = ", ".join(df.columns.tolist())
- return f"❌ 列名 '{question_column}' 不存在!\n可用列名: {available_columns}", None, ""
 
  data_ori = df[question_column].tolist()[:int(max_samples)]
  total = len(data_ori)
 
- progress(0.08, desc="📊 计算原始指标...")
  original_sentences = [str(item) for item in data_ori]
  war_original = calculate_whitespace_anomaly_rate(original_sentences)
  sed_original = calculate_spelling_error_density(original_sentences)
 
- progress(0.1, desc=f"🚀 开始去噪 {total} 个样本 (模型: {model_choice})...")
 
- # WAC-GEC不需要添加___标记
  if model_choice == "WAC-GEC":
  data_corrupt = [str(item) for item in data_ori]
  else:
@@ -501,11 +480,11 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
 
  results = []
  max_retries = 5 if model_choice == "deepseek-r1-distill-llama-8b" else 3
- log_text = f"🚀 开始处理 {total} 个样本...\n"
- log_text += f"📌 使用模型: {model_choice}\n\n"
 
  for idx in range(total):
- progress((0.1 + 0.7 * idx / total), desc=f"处理中: {idx+1}/{total}")
 
  unprocess_text = str(data_ori[idx])
  original_text = data_corrupt[idx]
@@ -514,7 +493,6 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
 
  while retry_count < max_retries:
  try:
- # 根据模型选择调用不同的API
  if model_choice == "WAC-GEC":
  response_content = call_wac_gec(original_text)
  else:
@@ -524,7 +502,6 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
  temperature=float(temperature)
  )
 
- # WAC-GEC的输出格式简单,无需复杂验证
  if model_choice == "WAC-GEC":
  if response_content.startswith('[output]:'):
  results.append(response_content)
@@ -540,12 +517,12 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
 
  except Exception as e:
  retry_count += 1
- log_text += f"⚠️ 样本 {idx+1} 处理错误,重试 {retry_count}/{max_retries}: {str(e)}\n"
  else:
  results.append(f"[ERROR] Failed to process: {original_text}")
- log_text += f"❌ 样本 {idx+1} 处理失败\n"
 
- progress(0.85, desc="📊 后处理中...")
 
  lst_extracted = []
  error_count = 0
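`clean_dataset` retries each sample up to `max_retries` (5 for DeepSeek, 3 otherwise) and records an `[ERROR]` marker when retries are exhausted. The control flow in isolation (the wrapper name is mine; the real loop also accumulates a progress log):

```python
def denoise_with_retry(correct_fn, text, max_retries):
    """Retry loop mirroring clean_dataset; falls back to an [ERROR] marker."""
    for _ in range(max_retries):
        try:
            return correct_fn(text)
        except Exception:
            continue  # transient failure: try again
    return f"[ERROR] Failed to process: {text}"
```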
@@ -571,7 +548,7 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
  else:
  lst_final.append(lst_extracted[i])
 
- progress(0.90, desc="📊 计算去噪后指标...")
  cleaned_sentences = [str(item) for item in lst_final]
  war_cleaned = calculate_whitespace_anomaly_rate(cleaned_sentences)
  sed_cleaned = calculate_spelling_error_density(cleaned_sentences)
@@ -579,7 +556,7 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
  delta_war = war_cleaned - war_original
  delta_sed = sed_cleaned - sed_original
 
- progress(0.95, desc="💾 保存结果...")
 
  df_cleaned = df.copy()
  df_cleaned[question_column + '_cleaned'] = lst_final[:len(df)]
@@ -592,144 +569,143 @@ def clean_dataset(file_path, question_column, model_choice, temperature, max_sam
 
  df_cleaned.to_parquet(output_path, index=False)
 
- log_text += f"\n\n📊 处理完成!\n"
  log_text += f"{'='*50}\n"
- log_text += f"【基础统计】\n"
- log_text += f"- 使用模型: {model_choice}\n"
- log_text += f"- 总样本数: {total}\n"
- log_text += f"- 成功处理: {total - error_count - unknown_count}\n"
- log_text += f"- 失败样本: {error_count}\n"
- log_text += f"- 未知格式: {unknown_count}\n"
- log_text += f"- 输出文件: {output_filename}\n\n"
-
- log_text += f"【质量指标】\n"
- log_text += f"📍 空白符异常率(WAR):\n"
- log_text += f" 原始: {war_original:.2f}% → 去噪后: {war_cleaned:.2f}%\n"
- log_text += f" 变化: {delta_war:+.2f}% {'✅ 改善' if delta_war < 0 else '⚠️ 增加'}\n\n"
-
- log_text += f"📍 拼写错误密度(SED):\n"
- log_text += f" 原始: {sed_original:.2f}% → 去噪后: {sed_cleaned:.2f}%\n"
- log_text += f" 变化: {delta_sed:+.2f}% {'✅ 改善' if delta_sed < 0 else '⚠️ 增加'}\n"
 
  if model_choice == "WAC-GEC":
- log_text += f"\n💡 注意: WAC-GEC使用两步纠正(GEC语法纠错 + WAC空白符纠正)\n"
 
  log_text += f"{'='*50}\n"
 
- # 生成带颜色的对比HTML
  preview_html = create_comparison_html(data_ori[:5], lst_final[:5])
 
- progress(1.0, desc="✅ 完成!")
 
  return log_text, output_path, preview_html
 
  except Exception as e:
  import traceback
  error_detail = traceback.format_exc()
- return f"❌ 处理出错: {str(e)}\n\n详细错误:\n{error_detail}", None, ""
 
- # ======================== 文本内容 ========================
  ABOUT_TEXT = """
- ## 去噪流程说明
 
- ### 支持的模型
 
  #### 1. DeepSeek-R1 (deepseek-r1-distill-llama-8b)
- - **功能**: 全面的语法、拼写、空格错误修正
- - **优势**: 综合性强,能处理多种类型的错误
- - **配置**: 需要在Space Settings中配置DEEPSEEK_API_KEY
 
  #### 2. WAC-GEC (Whitespace + Grammar Error Correction)
- - **功能**: 两步纠正流程
- - **Step 1 (GEC)**: 使用LLaMA-2-7B微调模型进行语法和拼写纠错
- - **Step 2 (WAC)**: 使用空白符纠正模型修正空格问题
- - **优势**:
- - 完全本地化,无需API密钥
- - 组合两个专门模型,各司其职
- - 适合离线环境和预算有限的场景
- - **模型来源**:
  - GEC: [lllouo/gec_Chat-LLaMa-2-7B-FT](https://huggingface.co/lllouo/gec_Chat-LLaMa-2-7B-FT)
- - WAC: whitespace_correction
 
- ### 核心算法
 
- 1. **预处理 (process_sentence)**
- - 检测句子完整性
- - 为不完整的句子添加标记 `___` (DeepSeek)
- - 保留多行文本格式
 
- 2. **模型去噪**
- - **DeepSeek**: 使用API进行全面错误修正,重试机制最多5次
  - **WAC-GEC**:
- - 先使用GEC模型进行语法和拼写纠正
- - 再使用WAC模型进行空白符纠正
- - 重试机制最多3次
 
- 3. **格式验证**
- - 验证输出格式正确性
- - 检查标记保留情况
- - 长度合理性检查
 
- 4. **后处理**
- - 提取去噪后的内容
- - 恢复原始多行格式
- - 生成带模型标识的Parquet文件
 
- ### 支持的数据集
 
- - **MMLU**: 57个学科的多选题
- - **GSM8K**: 数学推理题
- - **ARC-Challenge**: 科学问答
- - **MedMCQA**: 医学选择题
- - **CoQA**: 对话问答
- - 以及更多...
 
- ### 颜色标注说明
 
- - 🔴 **红色**: 原始文本中的错误(拼写、语法、空格等)
- - 🟢 **绿色**: 去噪后的修正内容
- - ⚫ **黑色**: 未修改的正确部分
 
- ### 技术栈
 
  - **LLM**: DeepSeek API (deepseek-r1-distill-llama-8b)
- - **本地模型**:
- - GEC: LLaMA-2-7B (微调于语法纠错任务)
  - WAC: Whitespace Correction Model
- - **前端**: Gradio 4.16.0
- - **数据处理**: Pandas + PyArrow (Parquet)
- - **差异对比**: Python difflib
- - **NLP工具**: spaCy, pyspellchecker
- - **API调用**: OpenAI SDK
- - **部署**: Hugging Face Spaces
 
- ### 质量指标
 
- - **WAR (Whitespace Anomaly Rate)**: 空白符异常率
- - **SED (Spelling Error Density)**: 拼写错误密度
 
- ### 模型选择建议
 
- - **需要全面去噪 + API预算**: 选择 DeepSeek-R1
- - **本地化部署 + 完整纠错**: 选择 WAC-GEC(推荐)
- - **仅需修正空格**: 单独使用WAC模块
- - **追求最快速度**: 使用GPU加速的WAC-GEC
 
  ---
 
- **研究生毕业论文成果展示** | Powered by DeepSeek API & WAC-GEC
  """
 
- # ======================== Gradio界面 ========================
- demo = gr.Blocks(title="数据集去噪框架展示系统", css="""
  .markdown-text { font-size: 16px; line-height: 1.6; }
  """)
 
  with demo:
  gr.Markdown(
- """<div style="text-align: center;"><h1>⭐ 基于基准去噪框架的 <span style='color: #e6b800;'>去噪工厂</span> 展示系统</h1></div>
  <br>
- <p>本系统展示了基于<a href="https://github.com/LLLoUo/bd-toolkit" target="_blank">BD-toolkit</a>的DeepSeek-R1和WAC-GEC两种方法对主流benchmark数据集的去噪效果。通过WAR(空白符异常率)和SED(拼写错误密度)两个指标评估去噪质量。</p>
  """,
  elem_classes="markdown-text"
  )
@@ -739,27 +715,27 @@ with demo:
 
  with gr.Tabs(elem_classes="tab-buttons") as tabs:
  with gr.TabItem("📊 BD-benchmarks Leaderboard", id=0):
  with gr.Column():
- gr.Markdown("### BD去噪后主流基准排行榜")
 
  with gr.Row():
  search_bar = gr.Textbox(
- placeholder="🔍 搜索Benchmark名称并按ENTER...",
  show_label=False,
  elem_id="search-bar",
  )
  filter_categories = gr.Radio(
- label="📂 筛选Benchmark类别",
  choices=["all", "BT", "RA", "TG", "SU", "ME", "GR"],
  value="all",
  elem_id="filter-columns",
  )
  filter_versions = gr.Radio(
- label="🔖 筛选数据集版本",
  choices=[
- ("全部版本", "all"),
- ("原始数据集", "original"),
- ("DeepSeek-R1去噪", "deepseek"),
- ("WAC-GEC去噪", "wac_gec")
  ],
  value="all",
  elem_id="filter-versions",
@@ -767,7 +743,7 @@ with demo:
 
  leaderboard_table = gr.Dataframe(
  value=leaderboard_data[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
- headers=['ID', 'Category', 'Benchmark', 'WAR (%)', 'SED', '下载'],
  datatype=['number', 'str', 'str', 'number', 'number', 'markdown'],
  elem_id="leaderboard-table",
  interactive=False,
@@ -778,14 +754,12 @@ with demo:
  visible=False
  )
 
- # 搜索功能
  search_bar.submit(
  lambda df, query: search_leaderboard(df, query)[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
  [hidden_leaderboard, search_bar],
  leaderboard_table
  )
 
- # 类别筛选功能(需要考虑版本筛选)
  def combined_filter(df, category, version):
  filtered = filter_leaderboard(df, category, version)
  return filtered[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']]
@@ -796,7 +770,6 @@ with demo:
  leaderboard_table
  )
 
- # 版本筛选功能(需要考虑类别筛选)
  filter_versions.change(
  combined_filter,
  [hidden_leaderboard, filter_categories, filter_versions],
@@ -804,46 +777,40 @@ with demo:
  )
 
  gr.Markdown("""
- **说明:**
- - **Category**: BT=基础任务, RA=推理能力, TG=文本生成, SU=语音理解, ME=医学领域, GR=语法领域
- - **Version**: 原始=未处理数据集, DeepSeek-R1=DeepSeek去噪版本, WAC-GEC=WAC-GEC去噪版本
- - **WAR**: 空白符异常率(越低越好)
- - **SED**: 拼写错误密度(越低越好)
  """, elem_classes="markdown-text")
 
- with gr.TabItem("📈 Performance Plot", id=1):
- gr.Markdown("### 性能可视化分析")
- gr.Markdown("**注意**: 性能图表功能开发中,敬请期待。")
-
- with gr.TabItem("📝 About", id=2):
- gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")
 
- with gr.TabItem("🚀 BD-toolkit Demo", id=3):
- gr.Markdown("## BD-toolkit轻量化Demo展示")
 
- # 模型可用性提示
- model_status = "✅ WAC-GEC: " + ("可用" if WAC_GEC_AVAILABLE else "未安装")
- model_status += " | ✅ DeepSeek-R1: " + ("已配置" if DEEPSEEK_API_KEY else "未配置API密钥")
- gr.Markdown(f"**模型状态**: {model_status}")
 
  with gr.Row():
  with gr.Column():
  file_input = gr.File(
- label="📁 上传 Parquet 文件",
  file_types=[".parquet"]
  )
 
  question_column = gr.Textbox(
- label="📝 问题列名",
  value="question",
- placeholder="例如: question, input_text, prompt"
  )
 
  model_choice = gr.Dropdown(
  choices=["WAC-GEC", "deepseek-r1-distill-llama-8b"],
  value="WAC-GEC",
- label="🤖 选择模型",
- info="DeepSeek: 全面纠错 | WAC-GEC: 语法+空白符纠正(本地模型)"
  )
 
  temperature = gr.Slider(
@@ -852,8 +819,8 @@ with demo:
  value=0.1,
  step=0.1,
  label="🌡️ Temperature",
- info="仅对DeepSeek生效",
- interactive=False # 默认不可交互(因为默认选择WAC-GEC)
  )
 
  max_samples = gr.Slider(
@@ -861,26 +828,25 @@ with demo:
  maximum=100,
  value=5,
  step=1,
- label="📊 处理样本数 (Demo限制)"
  )
 
- clean_btn = gr.Button("🚀 开始去噪", variant="primary", size="lg")
 
  with gr.Column():
  output_text = gr.Textbox(
- label="⏳ 处理进度",
  lines=10,
  max_lines=15
  )
 
- download_file = gr.File(label="📥 下载去噪后的数据集")
 
- # 添加交互逻辑:根据模型选择动态启用/禁用temperature滑块
  def update_temperature_interactive(model):
  if model == "deepseek-r1-distill-llama-8b":
- return gr.update(interactive=True, info="调整生成的随机性")
  else:
- return gr.update(interactive=False, info="WAC-GEC模型不支持temperature参数")
 
  model_choice.change(
  fn=update_temperature_interactive,
@@ -888,13 +854,12 @@ with demo:
  outputs=[temperature]
  )
 
- # 颜色对比预览区域
- gr.Markdown("### 🎨 去噪效果对比预览")
  gr.Markdown("""
- **颜色说明**:
- - 🔴 <span style="color: #dc3545;">红色</span> = 原始文本中的错误
- - 🟢 <span style="color: #28a745;">绿色</span> = 去噪后的修正
- - ⚫ 黑色 = 未修改的正确部分
  """)
 
  colored_preview = gr.HTML(label="")
@@ -905,10 +870,11 @@ with demo:
  outputs=[output_text, download_file, colored_preview]
  )
 
  if __name__ == "__main__":
- # 可选:预加载模型(会增加启动时间)
- # 如果想要预加载,取消下面两行的注释
- print("🚀 预加载WAC-GEC模型...")
  initialize_wac_gec()
 
  demo.launch(
14
  from transformers import AutoTokenizer, AutoModelForCausalLM
15
  import hashlib
16
 
17
+ # ======================== WAC-GEC Import ========================
18
  try:
19
  from whitespace_correction import WhitespaceCorrector
20
  WAC_GEC_AVAILABLE = True
21
+ # Initialize WAC-GEC model (lazy loading)
22
  wac_corrector = None
23
  except ImportError:
24
  WAC_GEC_AVAILABLE = False
25
  wac_corrector = None
26
+ print("⚠️ whitespace_correction not installed, WAC-GEC functionality unavailable")
27
 
28
+ # Initialize GEC model (lazy loading)
29
  gec_tokenizer = None
30
  gec_model = None
31
+ GEC_MODEL_NAME = "lllouo/gec_Chat-LLaMa-2-7B-FT"
32
 
33
+ # ======================== API Configuration ========================
34
  DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY", "")
35
  DEEPSEEK_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"
36
 
37
+ # ======================== NLP Tools Initialization ========================
38
  try:
39
  nlp = spacy.load("en_core_web_sm")
40
  except OSError:
 
51
  re.compile(r'([.,!?;:])\s{2,}'),
52
  ]
53
 
54
+ # ======================== Prompt Template ========================
55
  PROMPT_TEMPLATE = """## Positioning
56
  You are a **LANGUAGE grammatical error correction tool** that can identify and correct grammatical errors in a text.
57
  Reply with a corrected version of the input sentence with all **grammatical**, **spelling** and **whitespace errors** fixed, making only necessary changes.
 
79
 
80
  [input]: """
81
 
82
+ # ======================== Initialize WAC + GEC ========================
83
  def initialize_wac_gec():
84
+ """Lazy initialization of WAC-GEC models (Whitespace + Grammar Error Correction)"""
85
  global wac_corrector, gec_tokenizer, gec_model
86
 
87
+ # 1. Initialize WAC (Whitespace Correction)
88
  if not WAC_GEC_AVAILABLE:
89
+ print("❌ WAC module not installed")
90
  return False
91
 
92
  if wac_corrector is None:
 
97
  device=device,
98
  download_dir="./models"
99
  )
100
+ print(f"✅ WAC whitespace correction model loaded (device: {device})")
101
  except Exception as e:
102
+ print(f"❌ WAC model loading failed: {e}")
103
  return False
104
 
105
+ # 2. Initialize GEC (Grammar Error Correction)
106
  if gec_model is None or gec_tokenizer is None:
107
  try:
108
  device = "cuda" if torch.cuda.is_available() else "cpu"
109
 
110
+ print(f"📥 Downloading GEC model from HuggingFace: {GEC_MODEL_NAME}")
111
  gec_tokenizer = AutoTokenizer.from_pretrained(
112
  GEC_MODEL_NAME,
113
  trust_remote_code=True
 
119
  trust_remote_code=True
120
  )
121
 
 
122
  if device == "cpu":
123
  gec_model = gec_model.to(device)
124
 
 
125
  gec_tokenizer.pad_token_id = gec_tokenizer.eos_token_id
126
  gec_tokenizer.padding_side = "left"
127
 
128
+ print(f"✅ GEC grammar correction model loaded (device: {device})")
129
 
130
  except Exception as e:
131
+ print(f"❌ GEC model loading failed: {e}")
132
  return False
133
 
134
  return True
135
 
136
+ # ======================== GEC Grammar Correction Function ========================
137
  def correct_sentence_gec(input_sentence):
138
  """
139
+ Use GEC model for grammar correction
140
+ Args:
141
+ input_sentence (str): Sentence to be corrected
142
+ Returns:
143
+ str: Corrected sentence
144
  """
145
  if gec_model is None or gec_tokenizer is None:
146
+ raise ValueError("GEC model not initialized")
147
 
 
148
  prompt = f"""Rewrite the following sentence to correct grammatical errors. Return ONLY the corrected sentence.
149
  Original: {input_sentence}
150
  Corrected:"""
151
 
 
152
  inputs = gec_tokenizer(prompt, return_tensors="pt").to(gec_model.device)
153
 
 
154
  is_cpu = str(gec_model.device) == "cpu" or not torch.cuda.is_available()
155
 
 
156
  if is_cpu:
157
+ max_tokens = 256
158
+ beams = 2
159
  else:
160
+ max_tokens = 512
161
  beams = 4
162
 
163
  with torch.no_grad():
 
170
  top_p=None
171
  )
172
 
 
173
  full_output = gec_tokenizer.decode(outputs[0], skip_special_tokens=True)
174
  corrected_text = full_output.replace(prompt, "").strip()
175
 
 
176
  if corrected_text.startswith("Corrected:"):
177
  corrected_text = corrected_text[len("Corrected:"):].strip()
178
 
179
  return corrected_text
180
 
181
+ # ======================== WAC-GEC Combined Processing ========================
182
  def call_wac_gec(text):
183
  """
184
+ Use WAC-GEC two-step correction:
185
+ 1. GEC model for grammar and spelling correction
186
+ 2. WAC model for whitespace correction
187
  """
188
  if not initialize_wac_gec():
189
+ raise ValueError("⚠️ WAC-GEC models not installed or failed to load")
190
 
191
  try:
192
+ # Step 1: Use GEC model for grammar correction
193
+ print(f"🔍 GEC processing: {text[:50]}...")
194
  gec_corrected = correct_sentence_gec(text)
195
+ print(f"✅ GEC result: {gec_corrected[:50]}...")
196
 
197
+ # Step 2: Use WAC model for whitespace correction
198
+ print(f"🔍 WAC processing: {gec_corrected[:50]}...")
199
  final_corrected = wac_corrector.correct_text(gec_corrected)
200
+ print(f"✅ WAC result: {final_corrected[:50]}...")
201
 
 
202
  return f"[output]: {final_corrected}"
203
 
204
  except Exception as e:
205
+ raise Exception(f"WAC-GEC processing error: {str(e)}")
206
 
207
+ # ======================== Color Diff Functions ========================
  def generate_colored_diff(original, cleaned):
      """
+     Generate an HTML diff with color annotations:
+     errors in the original text are shown in red,
+     corrections after denoising are shown in green.
      """
      original_words = original.split()
      cleaned_words = cleaned.split()

      matcher = difflib.SequenceMatcher(None, original_words, cleaned_words)

      original_html = []
      cleaned_html = []

      for tag, i1, i2, j1, j2 in matcher.get_opcodes():
          if tag == 'equal':
              original_html.extend(original_words[i1:i2])
              cleaned_html.extend(cleaned_words[j1:j2])
          elif tag == 'replace':
              original_html.extend([f'<span style="color: #dc3545; font-weight: bold;">{w}</span>'
                                    for w in original_words[i1:i2]])
              cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
                                   for w in cleaned_words[j1:j2]])
          elif tag == 'delete':
              original_html.extend([f'<span style="color: #dc3545; text-decoration: line-through;">{w}</span>'
                                    for w in original_words[i1:i2]])
          elif tag == 'insert':
              cleaned_html.extend([f'<span style="color: #28a745; font-weight: bold;">{w}</span>'
                                   for w in cleaned_words[j1:j2]])

  def create_comparison_html(original_list, cleaned_list):
      """
+     Create an HTML comparison table
      """
      html = """
      <div style="font-family: 'Times New Roman', serif; max-width: 100%; overflow-x: auto;">

      <thead>
          <tr>
              <th class="index-col">#</th>
+             <th>Original Question</th>
+             <th>Denoised Question</th>
          </tr>
      </thead>
      <tbody>

      return html

+ # ======================== Utility Functions ========================
  def check_api_key(model_choice):
+     """Check API key (only required for DeepSeek)"""
      if model_choice == "deepseek-r1-distill-llama-8b" and not DEEPSEEK_API_KEY:
+         raise ValueError("⚠️ Please configure DEEPSEEK_API_KEY in Space Settings!")

  def call_deepseek_api(prompt, model="deepseek-r1-distill-llama-8b", temperature=0.1, stream=True):
      check_api_key(model)

          return 0.0
      return total_errors / total_words * 100

+ # ======================== Leaderboard Data Processing ========================
  def load_leaderboard_data():
      json_path = "leaderboard.json"
      try:
          with open(json_path, 'r', encoding='utf-8') as f:
              data = json.load(f)

          for item in data:
              benchmark = item['Benchmark']
              hash_object = hashlib.md5(benchmark.encode())
+             item['ID'] = hash_object.hexdigest()[:8]

          return pd.DataFrame(data)
      except Exception as e:

  def filter_leaderboard(df, category_query, version_query):
      """
+     Filter by both category and version
      """
      result = df.copy()

      if category_query != "all":
          result = result[result['Category'] == category_query]

      if version_query != "all":
          if version_query == "original":
              result = result[result['Benchmark'].str.contains('_original', case=False, na=False)]

          return df
      return df[df['Benchmark'].str.contains(query, case=False, na=False)]

+ # ======================== Dataset Denoising Function ========================
  def clean_dataset(file_path, question_column, model_choice, temperature, max_samples, progress=gr.Progress()):
      try:
          try:
              check_api_key(model_choice)
          except ValueError as e:
              if model_choice == "deepseek-r1-distill-llama-8b":
                  return str(e), None, ""

          if model_choice == "WAC-GEC" and not WAC_GEC_AVAILABLE:
+             return "❌ WAC-GEC model not installed! Please install whitespace_correction package.", None, ""

+         progress(0.05, desc="📁 Reading data file...")
          df = pd.read_parquet(file_path)

          if question_column not in df.columns:
              available_columns = ", ".join(df.columns.tolist())
+             return f"❌ Column '{question_column}' not found!\nAvailable columns: {available_columns}", None, ""

          data_ori = df[question_column].tolist()[:int(max_samples)]
          total = len(data_ori)

+         progress(0.08, desc="📊 Calculating original metrics...")
          original_sentences = [str(item) for item in data_ori]
          war_original = calculate_whitespace_anomaly_rate(original_sentences)
          sed_original = calculate_spelling_error_density(original_sentences)

+         progress(0.1, desc=f"🚀 Starting denoising of {total} samples (model: {model_choice})...")

          if model_choice == "WAC-GEC":
              data_corrupt = [str(item) for item in data_ori]
          else:

          results = []
          max_retries = 5 if model_choice == "deepseek-r1-distill-llama-8b" else 3
+         log_text = f"🚀 Processing {total} samples...\n"
+         log_text += f"📌 Using model: {model_choice}\n\n"

          for idx in range(total):
+             progress((0.1 + 0.7 * idx / total), desc=f"Processing: {idx+1}/{total}")

              unprocess_text = str(data_ori[idx])
              original_text = data_corrupt[idx]

              while retry_count < max_retries:
                  try:
                      if model_choice == "WAC-GEC":
                          response_content = call_wac_gec(original_text)
                      else:

                          temperature=float(temperature)
                      )

                      if model_choice == "WAC-GEC":
                          if response_content.startswith('[output]:'):
                              results.append(response_content)

                  except Exception as e:
                      retry_count += 1
+                     log_text += f"⚠️ Sample {idx+1} error, retry {retry_count}/{max_retries}: {str(e)}\n"
              else:
                  results.append(f"[ERROR] Failed to process: {original_text}")
+                 log_text += f"❌ Sample {idx+1} processing failed\n"

+         progress(0.85, desc="📊 Post-processing...")

          lst_extracted = []
          error_count = 0

          else:
              lst_final.append(lst_extracted[i])

+         progress(0.90, desc="📊 Calculating denoised metrics...")
          cleaned_sentences = [str(item) for item in lst_final]
          war_cleaned = calculate_whitespace_anomaly_rate(cleaned_sentences)
          sed_cleaned = calculate_spelling_error_density(cleaned_sentences)

          delta_war = war_cleaned - war_original
          delta_sed = sed_cleaned - sed_original

+         progress(0.95, desc="💾 Saving results...")

          df_cleaned = df.copy()
          df_cleaned[question_column + '_cleaned'] = lst_final[:len(df)]

          df_cleaned.to_parquet(output_path, index=False)

+         log_text += f"\n\n📊 Processing Complete!\n"
          log_text += f"{'='*50}\n"
+         log_text += f"【Basic Statistics】\n"
+         log_text += f"- Model used: {model_choice}\n"
+         log_text += f"- Total samples: {total}\n"
+         log_text += f"- Successfully processed: {total - error_count - unknown_count}\n"
+         log_text += f"- Failed samples: {error_count}\n"
+         log_text += f"- Unknown format: {unknown_count}\n"
+         log_text += f"- Output file: {output_filename}\n\n"
+
+         log_text += f"【Quality Metrics】\n"
+         log_text += f"📍 Whitespace Anomaly Rate (WAR):\n"
+         log_text += f"   Original: {war_original:.2f}% → Denoised: {war_cleaned:.2f}%\n"
+         log_text += f"   Change: {delta_war:+.2f}% {'✅ Improved' if delta_war < 0 else '⚠️ Increased'}\n\n"
+
+         log_text += f"📍 Spelling Error Density (SED):\n"
+         log_text += f"   Original: {sed_original:.2f}% → Denoised: {sed_cleaned:.2f}%\n"
+         log_text += f"   Change: {delta_sed:+.2f}% {'✅ Improved' if delta_sed < 0 else '⚠️ Increased'}\n"

          if model_choice == "WAC-GEC":
+             log_text += f"\n💡 Note: WAC-GEC uses two-step correction (GEC grammar + WAC whitespace)\n"

          log_text += f"{'='*50}\n"

          preview_html = create_comparison_html(data_ori[:5], lst_final[:5])

+         progress(1.0, desc="✅ Complete!")

          return log_text, output_path, preview_html

      except Exception as e:
          import traceback
          error_detail = traceback.format_exc()
+         return f"❌ Processing error: {str(e)}\n\nDetailed error:\n{error_detail}", None, ""

+ # ======================== Text Content ========================
  ABOUT_TEXT = """
+ ## Denoising Workflow

+ ### Supported Models

  #### 1. DeepSeek-R1 (deepseek-r1-distill-llama-8b)
+ - **Function**: Comprehensive grammar, spelling, and whitespace error correction
+ - **Advantages**: Strong all-round capability; handles multiple error types
+ - **Configuration**: Requires DEEPSEEK_API_KEY in Space Settings

  #### 2. WAC-GEC (Whitespace + Grammar Error Correction)
+ - **Function**: Two-step correction workflow
+ - **Step 1 (GEC)**: A fine-tuned LLaMA-2-7B model corrects grammar and spelling
+ - **Step 2 (WAC)**: A whitespace-correction model fixes spacing issues
+ - **Advantages**:
+   - Fully local, no API key required
+   - Combines two specialized models
+   - Suitable for offline environments and limited budgets
+ - **Model Sources**:
    - GEC: [lllouo/gec_Chat-LLaMa-2-7B-FT](https://huggingface.co/lllouo/gec_Chat-LLaMa-2-7B-FT)
+   - WAC: whitespace_correction library

+ ### Core Algorithm

+ 1. **Preprocessing (process_sentence)**
+    - Detect sentence completeness
+    - Add the marker `___` to incomplete sentences (DeepSeek only)
+    - Preserve multi-line text format

+ 2. **Model Denoising**
+    - **DeepSeek**: API-based comprehensive error correction, up to 5 retries
     - **WAC-GEC**:
+      - First the GEC model corrects grammar and spelling
+      - Then the WAC model corrects whitespace
+      - Up to 3 retries

+ 3. **Format Validation**
+    - Verify output format correctness
+    - Check marker preservation
+    - Sanity-check output length

+ 4. **Post-processing**
+    - Extract the denoised content
+    - Restore the original multi-line format
+    - Generate a Parquet file with a model identifier
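
The extraction step above looks for the `[output]:` marker that `call_wac_gec` prepends to its result. A minimal sketch of that step (the helper name `extract_output` is illustrative, not a function from app.py):

```python
def extract_output(response: str, marker: str = "[output]:") -> str:
    # Strip the marker prefix and surrounding whitespace;
    # if the marker is missing, return the response unchanged.
    if response.startswith(marker):
        return response[len(marker):].strip()
    return response
```

Responses that fail this check are what the demo counts as "Unknown format" in its final statistics.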

+ ### Supported Datasets

+ - **MMLU**: Multiple-choice questions across 57 subjects
+ - **GSM8K**: Math reasoning problems
+ - **ARC-Challenge**: Science Q&A
+ - **MedMCQA**: Medical multiple choice
+ - **CoQA**: Conversational Q&A
+ - And more...

+ ### Color Annotation Legend

+ - 🔴 **Red**: Errors in the original text (spelling, grammar, spacing, etc.)
+ - 🟢 **Green**: Corrections after denoising
+ - ⚫ **Black**: Unchanged correct parts

+ ### Tech Stack

  - **LLM**: DeepSeek API (deepseek-r1-distill-llama-8b)
+ - **Local Models**:
+   - GEC: LLaMA-2-7B (fine-tuned for grammar correction)
    - WAC: Whitespace Correction Model
+ - **Frontend**: Gradio 4.16.0
+ - **Data Processing**: Pandas + PyArrow (Parquet)
+ - **Diff Comparison**: Python difflib
+ - **NLP Tools**: spaCy, pyspellchecker
+ - **API Calls**: OpenAI SDK
+ - **Deployment**: Hugging Face Spaces

+ ### Quality Metrics

+ - **WAR (Whitespace Anomaly Rate)**: rate of whitespace anomalies, reported as a percentage (lower is better)
+ - **SED (Spelling Error Density)**: spelling errors per 100 words (lower is better)
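
As a rough illustration of how such corpus-level metrics can be computed, here is a simplified sketch; the actual implementation relies on spaCy and pyspellchecker, and these function names and heuristics are illustrative only:

```python
import re

def whitespace_anomaly_rate(sentences):
    # Percentage of sentences containing an obvious whitespace anomaly:
    # a run of repeated spaces, or punctuation not followed by a space.
    if not sentences:
        return 0.0
    anomalous = sum(1 for s in sentences if re.search(r"  +|[.,;!?](?=\S)", s))
    return anomalous / len(sentences) * 100

def spelling_error_density(sentences, vocabulary):
    # Misspelled words per 100 words, judged against a toy vocabulary.
    words = [w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    if not words:
        return 0.0
    return sum(1 for w in words if w not in vocabulary) / len(words) * 100
```

For example, `whitespace_anomaly_rate(["Hello  world", "All good here."])` flags only the first sentence, because of its doubled space.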

+ ### Model Selection Guide

+ - **Comprehensive denoising, API budget available**: choose DeepSeek-R1
+ - **Local deployment with full correction**: choose WAC-GEC (recommended)
+ - **Only spacing correction needed**: use the WAC module alone
+ - **Fastest speed**: use GPU-accelerated WAC-GEC

  ---

+ **Graduate Thesis Research Showcase** | Powered by DeepSeek API & WAC-GEC
  """

+ # ======================== Gradio Interface ========================
+ demo = gr.Blocks(title="Dataset Denoising Framework Demo System", css="""
  .markdown-text { font-size: 16px; line-height: 1.6; }
  """)

  with demo:
      gr.Markdown(
+         """<div style="text-align: center;"><h1>⭐ <span style='color: #e6b800;'>Denoising Factory</span> Based on the Benchmark Denoising Framework</h1></div>
          <br>
+         <p>This system demonstrates the denoising effects of the DeepSeek-R1 and WAC-GEC methods on mainstream benchmark datasets, built on <a href="https://github.com/LLLoUo/bd-toolkit" target="_blank">BD-toolkit</a>. Quality is evaluated with the WAR (Whitespace Anomaly Rate) and SED (Spelling Error Density) metrics.</p>
          """,
          elem_classes="markdown-text"
      )

      with gr.Tabs(elem_classes="tab-buttons") as tabs:
          with gr.TabItem("📊 BD-benchmarks Leaderboard", id=0):
              with gr.Column():
+                 gr.Markdown("### Mainstream Benchmark Leaderboard After BD Denoising")

                  with gr.Row():
                      search_bar = gr.Textbox(
+                         placeholder="🔍 Search benchmark name and press ENTER...",
                          show_label=False,
                          elem_id="search-bar",
                      )
                  filter_categories = gr.Radio(
+                     label="📂 Filter by Benchmark Category",
                      choices=["all", "BT", "RA", "TG", "SU", "ME", "GR"],
                      value="all",
                      elem_id="filter-columns",
                  )
                  filter_versions = gr.Radio(
+                     label="🔖 Filter by Dataset Version",
                      choices=[
+                         ("All Versions", "all"),
+                         ("Original Dataset", "original"),
+                         ("DeepSeek-R1 Denoised", "deepseek"),
+                         ("WAC-GEC Denoised", "wac_gec")
                      ],
                      value="all",
                      elem_id="filter-versions",

                  leaderboard_table = gr.Dataframe(
                      value=leaderboard_data[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
+                     headers=['ID', 'Category', 'Benchmark', 'WAR (%)', 'SED', 'Download'],
                      datatype=['number', 'str', 'str', 'number', 'number', 'markdown'],
                      elem_id="leaderboard-table",
                      interactive=False,

                      visible=False
                  )

                  search_bar.submit(
                      lambda df, query: search_leaderboard(df, query)[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']],
                      [hidden_leaderboard, search_bar],
                      leaderboard_table
                  )

                  def combined_filter(df, category, version):
                      filtered = filter_leaderboard(df, category, version)
                      return filtered[['ID', 'Category', 'Benchmark', 'WAR', 'SED', 'Download']]

                      leaderboard_table
                  )

                  filter_versions.change(
                      combined_filter,
                      [hidden_leaderboard, filter_categories, filter_versions],

                  )

                  gr.Markdown("""
+                 **Legend:**
+                 - **Category**: BT=Basic Tasks, RA=Reasoning Abilities, TG=Text Generation, SU=Speech Understanding, ME=Medical, GR=Grammar
+                 - **Version**: Original=Unprocessed dataset, DeepSeek-R1=DeepSeek denoised version, WAC-GEC=WAC-GEC denoised version
+                 - **WAR**: Whitespace Anomaly Rate (lower is better)
+                 - **SED**: Spelling Error Density (lower is better)
                  """, elem_classes="markdown-text")

          with gr.TabItem("🚀 BD-toolkit Demo", id=2):
+             gr.Markdown("## BD-toolkit Lightweight Demo")

+             model_status = "✅ WAC-GEC: " + ("Available" if WAC_GEC_AVAILABLE else "Not Installed")
+             model_status += " | DeepSeek-R1: " + ("Configured" if DEEPSEEK_API_KEY else "API Key Not Configured")
+             gr.Markdown(f"**Model Status**: {model_status}")

              with gr.Row():
                  with gr.Column():
                      file_input = gr.File(
+                         label="📁 Upload Parquet File",
                          file_types=[".parquet"]
                      )

                      question_column = gr.Textbox(
+                         label="📝 Question Column Name",
                          value="question",
+                         placeholder="e.g., question, input_text, prompt"
                      )

                      model_choice = gr.Dropdown(
                          choices=["WAC-GEC", "deepseek-r1-distill-llama-8b"],
                          value="WAC-GEC",
+                         label="🤖 Select Model",
+                         info="DeepSeek: Comprehensive correction | WAC-GEC: Grammar + whitespace (local model)"
                      )

                      temperature = gr.Slider(

                          value=0.1,
                          step=0.1,
                          label="🌡️ Temperature",
+                         info="Only effective for DeepSeek",
+                         interactive=False
                      )

                      max_samples = gr.Slider(

                          maximum=100,
                          value=5,
                          step=1,
+                         label="📊 Number of Samples to Process (Demo Limit)"
                      )

+                     clean_btn = gr.Button("🚀 Start Denoising", variant="primary", size="lg")

                  with gr.Column():
                      output_text = gr.Textbox(
+                         label="⏳ Processing Progress",
                          lines=10,
                          max_lines=15
                      )

+                     download_file = gr.File(label="📥 Download Denoised Dataset")

              def update_temperature_interactive(model):
                  if model == "deepseek-r1-distill-llama-8b":
+                     return gr.update(interactive=True, info="Adjust generation randomness")
                  else:
+                     return gr.update(interactive=False, info="The WAC-GEC model does not support a temperature parameter")

              model_choice.change(
                  fn=update_temperature_interactive,

                  outputs=[temperature]
              )

+             gr.Markdown("### 🎨 Denoising Effect Comparison Preview")
              gr.Markdown("""
+             **Color Legend**:
+             - 🔴 <span style="color: #dc3545;">Red</span> = Errors in the original text
+             - 🟢 <span style="color: #28a745;">Green</span> = Corrections after denoising
+             - ⚫ Black = Unchanged correct parts
              """)

              colored_preview = gr.HTML(label="")

                  outputs=[output_text, download_file, colored_preview]
              )

+         with gr.TabItem("📝 About", id=3):
+             gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")

  if __name__ == "__main__":
+     print("🚀 Preloading WAC-GEC models...")
      initialize_wac_gec()

      demo.launch(
leaderboard.json CHANGED
@@ -5,7 +5,7 @@
     "Benchmark": "ARC_original",
     "WAR": 0.11,
     "SED": 0.67,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc)"
   },
   {
     "ID": 2,
@@ -13,7 +13,7 @@
     "Benchmark": "ARC_deepseek_r1_denoising",
     "WAR": 0.00,
     "SED": 0.67,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_deepseek_r1_denoising)"
   },
   {
     "ID": 3,
@@ -21,7 +21,7 @@
     "Benchmark": "ARC_wac_gec",
     "WAR": 0.00,
     "SED": 0.66,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/ARC/arc_wac_gec)"
   },
   {
     "ID": 4,
@@ -29,7 +29,7 @@
     "Benchmark": "COQA_original",
     "WAR": 6.79,
     "SED": 2.74,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa)"
   },
   {
     "ID": 5,
@@ -37,7 +37,7 @@
     "Benchmark": "COQA_deepseek_r1_denoising",
     "WAR": 4.18,
     "SED": 2.57,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_deepseek_r1_denoising)"
   },
   {
     "ID": 6,
@@ -45,7 +45,7 @@
     "Benchmark": "COQA_wac_gec",
     "WAR": 4.70,
     "SED": 2.56,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/COQA/coqa_wac_gec)"
   },
   {
     "ID": 7,
@@ -53,7 +53,7 @@
     "Benchmark": "DROP_original",
     "WAR": 1.50,
     "SED": 3.38,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop)"
   },
   {
     "ID": 8,
@@ -61,7 +61,7 @@
     "Benchmark": "DROP_deepseek_r1_denoising",
     "WAR": 0.02,
     "SED": 3.24,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_deepseek_r1_denoising)"
   },
   {
     "ID": 9,
@@ -69,7 +69,7 @@
     "Benchmark": "DROP_wac_gec",
     "WAR": 0.64,
     "SED": 3.25,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/DROP/drop_wac_gec)"
   },
   {
     "ID": 10,
@@ -77,7 +77,7 @@
     "Benchmark": "MRPC_original",
     "WAR": 100.00,
     "SED": 5.65,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/mrpc)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/mrpc)"
   },
   {
     "ID": 11,
@@ -85,7 +85,7 @@
     "Benchmark": "MRPC_deepseek_r1_denoising",
     "WAR": 3.80,
     "SED": 4.70,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/mrpc)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/mrpc)"
   },
   {
     "ID": 12,
@@ -93,7 +93,7 @@
     "Benchmark": "MRPC_wac_gec",
     "WAR": 1.84,
     "SED": 4.50,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/mrpc)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/mrpc)"
   },
   {
     "ID": 13,
@@ -101,7 +101,7 @@
     "Benchmark": "RTE_original",
     "WAR": 2.17,
     "SED": 4.47,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/rte)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/rte)"
   },
   {
     "ID": 14,
@@ -109,7 +109,7 @@
     "Benchmark": "RTE_deepseek_r1_denoising",
     "WAR": 0.36,
     "SED": 4.50,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/rte)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/rte)"
   },
   {
     "ID": 15,
@@ -117,7 +117,7 @@
     "Benchmark": "RTE_wac_gec",
     "WAR": 0.72,
     "SED": 4.43,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/rte)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/rte)"
   },
   {
     "ID": 16,
@@ -125,7 +125,7 @@
     "Benchmark": "SST2_original",
     "WAR": 98.97,
     "SED": 5.42,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/sst2)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/sst2)"
   },
   {
     "ID": 17,
@@ -133,7 +133,7 @@
     "Benchmark": "SST2_deepseek_r1_denoising",
     "WAR": 7.22,
     "SED": 3.66,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/sst2)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/sst2)"
   },
   {
     "ID": 18,
@@ -141,7 +141,7 @@
     "Benchmark": "SST2_wac_gec",
     "WAR": 5.39,
     "SED": 3.52,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/sst2)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/sst2)"
   },
   {
     "ID": 19,
@@ -149,7 +149,7 @@
     "Benchmark": "WNLI_original",
     "WAR": 0.70,
     "SED": 0.64,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/wnli)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue/wnli)"
   },
   {
     "ID": 20,
@@ -157,7 +157,7 @@
     "Benchmark": "WNLI_deepseek_r1_denoising",
     "WAR": 0.00,
     "SED": 0.59,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/wnli)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_deepseek_r1_denoising/wnli)"
   },
   {
     "ID": 21,
@@ -165,7 +165,7 @@
     "Benchmark": "WNLI_wac_gec",
     "WAR": 0.00,
     "SED": 0.64,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/wnli)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GLUE/glue_wac_gec/wnli)"
   },
   {
     "ID": 22,
@@ -173,7 +173,7 @@
     "Benchmark": "GSM8K_original",
     "WAR": 25.70,
     "SED": 1.11,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k)"
   },
   {
     "ID": 23,
@@ -181,7 +181,7 @@
     "Benchmark": "GSM8K_deepseek_r1_denoising",
     "WAR": 0.30,
     "SED": 1.13,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_deepseek_r1_denoising)"
   },
   {
     "ID": 24,
@@ -189,7 +189,7 @@
     "Benchmark": "GSM8K_wac_gec",
     "WAR": 1.97,
     "SED": 1.11,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/GSM8K/gsm8k_wac_gec)"
   },
   {
     "ID": 25,
@@ -197,7 +197,7 @@
     "Benchmark": "MMLU_original",
     "WAR": 10.06,
     "SED": 2.21,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu)"
   },
   {
     "ID": 26,
@@ -205,7 +205,7 @@
     "Benchmark": "MMLU_deepseek_r1_denoising",
     "WAR": 6.56,
     "SED": 2.15,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_deepseek_r1_denoising)"
   },
   {
     "ID": 27,
@@ -213,7 +213,7 @@
     "Benchmark": "MMLU_wac_gec",
     "WAR": 2.98,
     "SED": 2.08,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MMLU/mmlu_wac_gec)"
   },
   {
     "ID": 28,
@@ -221,7 +221,7 @@
     "Benchmark": "MedMCQA_original",
     "WAR": 6.31,
     "SED": 6.18,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa)"
   },
   {
     "ID": 29,
@@ -229,7 +229,7 @@
     "Benchmark": "MedMCQA_deepseek_r1_denoising",
     "WAR": 3.44,
     "SED": 5.70,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_deepseek_r1_denoising)"
   },
   {
     "ID": 30,
@@ -237,7 +237,7 @@
     "Benchmark": "MedMCQA_wac_gec",
     "WAR": 2.44,
     "SED": 5.91,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedMCQA/medmcqa_wac_gec)"
   },
   {
     "ID": 31,
@@ -245,7 +245,7 @@
     "Benchmark": "MedQA_original",
     "WAR": 16.97,
     "SED": 6.49,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA-USMLE-4-options)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA-USMLE-4-options)"
   },
   {
     "ID": 32,
@@ -253,7 +253,7 @@
     "Benchmark": "MedQA_deepseek_r1_denoising",
     "WAR": 16.26,
     "SED": 6.49,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_deepseek_r1_denoising)"
   },
   {
     "ID": 33,
@@ -261,7 +261,7 @@
     "Benchmark": "MedQA_wac_gec",
     "WAR": 0.79,
     "SED": 6.51,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_wac_gec)"
   },
   {
     "ID": 34,
@@ -269,7 +269,7 @@
     "Benchmark": "Natural_questions_original",
     "WAR": 0.17,
     "SED": 2.90,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open)"
   },
   {
     "ID": 35,
@@ -277,7 +277,7 @@
     "Benchmark": "Natural_questions_deepseek_r1_denoising",
     "WAR": 0.06,
     "SED": 3.06,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_deepseek_r1_denoising)"
   },
   {
     "ID": 36,
@@ -285,7 +285,7 @@
     "Benchmark": "Natural_questions_wac_gec",
     "WAR": 0.28,
     "SED": 2.93,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_wac_gec)"
   },
   {
     "ID": 37,
@@ -293,7 +293,7 @@
     "Benchmark": "PubMedQA_original",
     "WAR": 0.60,
     "SED": 8.15,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa)"
   },
   {
     "ID": 38,
@@ -301,7 +301,7 @@
     "Benchmark": "PubMedQA_deepseek_r1_denoising",
     "WAR": 0.20,
     "SED": 8.19,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_deepseek_r1_denoising)"
   },
   {
     "ID": 39,
@@ -309,7 +309,7 @@
     "Benchmark": "PubMedQA_wac_gec",
     "WAR": 0.00,
     "SED": 8.10,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_wac_gec)"
   },
   {
     "ID": 40,
@@ -317,7 +317,7 @@
     "Benchmark": "Truthful_QA_original",
     "WAR": 0.00,
     "SED": 1.75,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa)"
   },
   {
     "ID": 41,
@@ -325,7 +325,7 @@
     "Benchmark": "Truthful_QA_deepseek_r1_denoising",
     "WAR": 0.00,
     "SED": 1.73,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_deepseek_r1_denoising)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_deepseek_r1_denoising)"
   },
   {
     "ID": 42,
@@ -333,6 +333,6 @@
     "Benchmark": "Truthful_QA_wac_gec",
     "WAR": 0.00,
     "SED": 1.53,
-    "Download": "[下载](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_wac_gec)"
+    "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_wac_gec)"
   }
 ]
 
241
  },
242
  {
243
  "ID": 31,
 
245
  "Benchmark": "MedQA_original",
246
  "WAR": 16.97,
247
  "SED": 6.49,
248
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA-USMLE-4-options)"
249
  },
250
  {
251
  "ID": 32,
 
253
  "Benchmark": "MedQA_deepseek_r1_denoising",
254
  "WAR": 16.26,
255
  "SED": 6.49,
256
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_deepseek_r1_denoising)"
257
  },
258
  {
259
  "ID": 33,
 
261
  "Benchmark": "MedQA_wac_gec",
262
  "WAR": 0.79,
263
  "SED": 6.51,
264
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/MedQA/MedQA_wac_gec)"
265
  },
266
  {
267
  "ID": 34,
 
269
  "Benchmark": "Natural_questions_original",
270
  "WAR": 0.17,
271
  "SED": 2.90,
272
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open)"
273
  },
274
  {
275
  "ID": 35,
 
277
  "Benchmark": "Natural_questions_deepseek_r1_denoising",
278
  "WAR": 0.06,
279
  "SED": 3.06,
280
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_deepseek_r1_denoising)"
281
  },
282
  {
283
  "ID": 36,
 
285
  "Benchmark": "Natural_questions_wac_gec",
286
  "WAR": 0.28,
287
  "SED": 2.93,
288
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Natural_questions/nq_open_wac_gec)"
289
  },
290
  {
291
  "ID": 37,
 
293
  "Benchmark": "PubMedQA_original",
294
  "WAR": 0.60,
295
  "SED": 8.15,
296
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa)"
297
  },
298
  {
299
  "ID": 38,
 
301
  "Benchmark": "PubMedQA_deepseek_r1_denoising",
302
  "WAR": 0.20,
303
  "SED": 8.19,
304
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_deepseek_r1_denoising)"
305
  },
306
  {
307
  "ID": 39,
 
309
  "Benchmark": "PubMedQA_wac_gec",
310
  "WAR": 0.00,
311
  "SED": 8.10,
312
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/PubMedQA/pubmed_qa_wac_gec)"
313
  },
314
  {
315
  "ID": 40,
 
317
  "Benchmark": "Truthful_QA_original",
318
  "WAR": 0.00,
319
  "SED": 1.75,
320
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa)"
321
  },
322
  {
323
  "ID": 41,
 
325
  "Benchmark": "Truthful_QA_deepseek_r1_denoising",
326
  "WAR": 0.00,
327
  "SED": 1.73,
328
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_deepseek_r1_denoising)"
329
  },
330
  {
331
  "ID": 42,
 
333
  "Benchmark": "Truthful_QA_wac_gec",
334
  "WAR": 0.00,
335
  "SED": 1.53,
336
+ "Download": "[Download](https://huggingface.co/datasets/lllouo/BD-benchmarks/tree/main/Truthful_QA/truthful_qa_wac_gec)"
337
  }
338
  ]