ZhouChuYue
committed on
Commit
·
968a16c
1
Parent(s):
d540654
fix: use correct API endpoint and model (GLM_ar7snd)
README.md
CHANGED
|
@@ -1,186 +1,23 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
##
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
### Environment Setup
|
| 26 |
-
|
| 27 |
-
```bash
|
| 28 |
-
# Set the API key
|
| 29 |
-
export OPENAI_API_KEY="your-api-key"
|
| 30 |
-
|
| 31 |
-
# Optional: set a custom API base URL (for OpenAI-compatible APIs)
|
| 32 |
-
export OPENAI_BASE_URL="https://your-api-endpoint/v1"
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
### Basic Usage
|
| 36 |
-
|
| 37 |
-
```bash
|
| 38 |
-
python run_synthesis.py \
|
| 39 |
-
--input data.jsonl \
|
| 40 |
-
--output output.jsonl \
|
| 41 |
-
--task qa \
|
| 42 |
-
--level high_school \
|
| 43 |
-
--model gpt-4o \
|
| 44 |
-
--workers 10
|
| 45 |
-
```
|
| 46 |
-
|
| 47 |
-
## 📋 Task Types
|
| 48 |
-
|
| 49 |
-
### 1. Q&A Pair Synthesis (`qa`)
|
| 50 |
-
|
| 51 |
-
Generates Q&A pairs from mathematical content, graded by education level.
|
| 52 |
-
|
| 53 |
-
**Parameter `--level`:**
|
| 54 |
-
| Value | Description |
|
| 55 |
-
|:---|:---|
|
| 56 |
-
| `grade_school` | Elementary school |
|
| 57 |
-
| `middle_school` | Middle school |
|
| 58 |
-
| `high_school` | High school (default) |
|
| 59 |
-
| `college` | College |
|
| 60 |
-
|
| 61 |
-
```bash
|
| 62 |
-
python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --level high_school
|
| 63 |
-
```
|
| 64 |
-
|
| 65 |
-
### 2. Multi-turn Conversation Synthesis (`conversation`)
|
| 66 |
-
|
| 67 |
-
Converts mathematical content into a multi-turn dialogue format.
|
| 68 |
-
|
| 69 |
-
**Parameter `--style`:**
|
| 70 |
-
| Value | Description |
|
| 71 |
-
|:---|:---|
|
| 72 |
-
| `two_professors` | Dialogue between two professors |
|
| 73 |
-
| `teacher_student` | Teacher-student dialogue (default) |
|
| 74 |
-
| `two_students` | Dialogue between two students |
|
| 75 |
-
| `interview` | Interview style |
|
| 76 |
-
| `problem_solving` | Problem solving |
|
| 77 |
-
| `layman_expert` | Layperson and expert |
|
| 78 |
-
| `debate` | Debate style |
|
| 79 |
-
|
| 80 |
-
```bash
|
| 81 |
-
python run_synthesis.py -i data.jsonl -o output.jsonl -t conversation --style teacher_student
|
| 82 |
-
```
|
| 83 |
-
|
| 84 |
-
### 3. Multi-style Rewrite (`rewrite`)
|
| 85 |
-
|
| 86 |
-
Rewrites mathematical content in different styles.
|
| 87 |
-
|
| 88 |
-
**Parameter `--style`:**
|
| 89 |
-
| Value | Description |
|
| 90 |
-
|:---|:---|
|
| 91 |
-
| `wikipedia` | Wikipedia style |
|
| 92 |
-
| `textbook` | Textbook style (default) |
|
| 93 |
-
| `blog` | Blog style |
|
| 94 |
-
| `popular_science` | Popular science style |
|
| 95 |
-
| `academic_paper` | Academic paper style |
|
| 96 |
-
| `learning_note` | Study notes style |
|
| 97 |
-
| `lecture_note` | Lecture notes style |
|
| 98 |
-
|
| 99 |
-
```bash
|
| 100 |
-
python run_synthesis.py -i data.jsonl -o output.jsonl -t rewrite --style textbook
|
| 101 |
-
```
|
| 102 |
-
|
| 103 |
-
### 4. Knowledge Extraction (`knowledge`)
|
| 104 |
-
|
| 105 |
-
Extracts knowledge points such as definitions, theorems, and properties from mathematical content.
|
| 106 |
-
|
| 107 |
-
```bash
|
| 108 |
-
python run_synthesis.py -i data.jsonl -o knowledge_output.jsonl -t knowledge
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
### 5. Textbook Exercise Generation (`textbook`)
|
| 112 |
-
|
| 113 |
-
Generates textbook-style exercises of varying difficulty from knowledge points.
|
| 114 |
-
|
| 115 |
-
**Parameter `--difficulty`:**
|
| 116 |
-
| Value | Description |
|
| 117 |
-
|:---|:---|
|
| 118 |
-
| `easy` | Easy (default) |
|
| 119 |
-
| `medium` | Medium |
|
| 120 |
-
| `hard` | Hard |
|
| 121 |
-
|
| 122 |
-
```bash
|
| 123 |
-
python run_synthesis.py -i knowledge.jsonl -o output.jsonl -t textbook --difficulty medium
|
| 124 |
-
```
|
| 125 |
-
|
| 126 |
-
**Note:** The input file must contain a `knowledge_point` field (customizable via `--knowledge-field`).
|
| 127 |
-
|
| 128 |
-
## ⚙️ Parameters
|
| 129 |
-
|
| 130 |
-
| Parameter | Description | Default |
|
| 131 |
-
|:---|:---|:---|
|
| 132 |
-
| `-i, --input` | Input JSONL file path | Required |
|
| 133 |
-
| `-o, --output` | Output JSONL file path | Required |
|
| 134 |
-
| `-t, --task` | Task type: `qa`, `conversation`, `rewrite`, `knowledge`, `textbook` | Required |
|
| 135 |
-
| `--level` | Q&A difficulty level | `high_school` |
|
| 136 |
-
| `--style` | Conversation/rewrite style | - |
|
| 137 |
-
| `--difficulty` | Textbook exercise difficulty | `easy` |
|
| 138 |
-
| `--text-field` | Input text field name | `text` |
|
| 139 |
-
| `--knowledge-field` | Knowledge point field name | `knowledge_point` |
|
| 140 |
-
| `--api-key` | OpenAI API Key | Environment variable |
|
| 141 |
-
| `--base-url` | API Base URL | Environment variable |
|
| 142 |
-
| `--model` | Model name | `gpt-4o` |
|
| 143 |
-
| `--temperature` | Sampling temperature | `0.7` |
|
| 144 |
-
| `--max-tokens` | Maximum number of generated tokens | `4096` |
|
| 145 |
-
| `-w, --workers` | Number of concurrent workers | `10` |
|
| 146 |
-
| `--max-retries` | Maximum number of retries | `3` |
|
| 147 |
-
| `--limit` | Limit on the number of samples processed | - |
|
| 148 |
-
| `-q, --quiet` | Quiet mode | `False` |
|
| 149 |
-
|
| 150 |
-
## 📊 Input and Output Format
|
| 151 |
-
|
| 152 |
-
**Input:** JSONL format, one JSON object per line (see `example_data.jsonl`):
|
| 153 |
-
|
| 154 |
-
```jsonl
|
| 155 |
-
{"text": "The quadratic formula states that for any quadratic equation..."}
|
| 156 |
-
{"text": "The Pythagorean theorem is a fundamental relation..."}
|
| 157 |
-
```
|
| 158 |
-
|
| 159 |
-
**Output:** The original record with an added `synthesis_result` field:
|
| 160 |
-
|
| 161 |
-
```json
|
| 162 |
-
{
|
| 163 |
-
"text": "original mathematical content",
|
| 164 |
-
"synthesis_result": {
|
| 165 |
-
"raw": "full response",
|
| 166 |
-
"problem": "generated problem",
|
| 167 |
-
"solution": "detailed solution"
|
| 168 |
-
}
|
| 169 |
-
}
|
| 170 |
-
```
|
| 171 |
-
|
| 172 |
-
## 🔌 Compatibility with Other APIs
|
| 173 |
-
|
| 174 |
-
Works with any OpenAI-compatible API (such as Qwen, DeepSeek, vLLM):
|
| 175 |
-
|
| 176 |
-
```bash
|
| 177 |
-
# Use the Alibaba Cloud Qwen API
|
| 178 |
-
export OPENAI_API_KEY="your-dashscope-api-key"
|
| 179 |
-
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
|
| 180 |
-
|
| 181 |
-
python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --model qwen-plus
|
| 182 |
-
```
|
| 183 |
-
|
| 184 |
-
## 📜 License
|
| 185 |
-
|
| 186 |
-
This project is released under the [Apache 2.0](../LICENSE) license.
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: UltraData-Math L3 Generator
|
| 3 |
+
emoji: 🧮
|
| 4 |
+
colorFrom: purple
|
| 5 |
+
colorTo: blue
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: "5.9.1"
|
| 8 |
+
python_version: "3.10"
|
| 9 |
+
app_file: app.py
|
| 10 |
+
pinned: false
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# UltraData-Math L3 Generator
|
| 14 |
+
|
| 15 |
+
LLM-based Mathematical Data Synthesis Tool - Part of the UltraData-Math Project.
|
| 16 |
+
|
| 17 |
+
## Features
|
| 18 |
+
|
| 19 |
+
- **Q&A Synthesis**: Generate Q&A pairs from mathematical content
|
| 20 |
+
- **Conversation Synthesis**: Convert math content into multi-turn dialogues
|
| 21 |
+
- **Multi-style Rewrite**: Rewrite content in different styles (textbook, blog, etc.)
|
| 22 |
+
- **Knowledge Extraction**: Extract definitions, theorems, and properties
|
| 23 |
+
- **Textbook Exercise**: Generate textbook-style exercises from knowledge points
|
app.py
CHANGED
|
@@ -28,8 +28,8 @@ from run_synthesis import (
|
|
| 28 |
|
| 29 |
# API configuration is read from environment variables (set via HF Secrets)
|
| 30 |
API_KEY = os.getenv("OPENAI_API_KEY")
|
| 31 |
-
BASE_URL = os.getenv("OPENAI_BASE_URL", "https://
|
| 32 |
-
DEFAULT_MODEL = "
|
| 33 |
|
| 34 |
|
| 35 |
async def call_api(prompt: str, model: str = DEFAULT_MODEL, temperature: float = 0.7) -> str:
|
|
@@ -43,9 +43,15 @@ async def call_api(prompt: str, model: str = DEFAULT_MODEL, temperature: float =
|
|
| 43 |
model=model,
|
| 44 |
messages=[{"role": "user", "content": prompt}],
|
| 45 |
temperature=temperature,
|
| 46 |
-
max_tokens=
|
| 47 |
)
|
| 48 |
-
|
| 49 |
except Exception as e:
|
| 50 |
return f"Error: {str(e)}"
|
| 51 |
|
|
@@ -190,8 +196,8 @@ with gr.Blocks(title="UltraData-Math L3 Generator", css=custom_css) as demo:
|
|
| 190 |
|
| 191 |
with gr.Row():
|
| 192 |
model_select = gr.Dropdown(
|
| 193 |
-
choices=["
|
| 194 |
-
value="
|
| 195 |
label="Model",
|
| 196 |
scale=1,
|
| 197 |
)
|
|
|
|
| 28 |
|
| 29 |
# API configuration is read from environment variables (set via HF Secrets)
|
| 30 |
API_KEY = os.getenv("OPENAI_API_KEY")
|
| 31 |
+
BASE_URL = os.getenv("OPENAI_BASE_URL", "https://llm-center.ali.modelbest.cn/llm/openai/v1")
|
| 32 |
+
DEFAULT_MODEL = "GLM_ar7snd"
|
| 33 |
|
| 34 |
|
| 35 |
async def call_api(prompt: str, model: str = DEFAULT_MODEL, temperature: float = 0.7) -> str:
|
|
|
|
| 43 |
model=model,
|
| 44 |
messages=[{"role": "user", "content": prompt}],
|
| 45 |
temperature=temperature,
|
| 46 |
+
max_tokens=8192,
|
| 47 |
)
|
| 48 |
+
# Handle the response format of reasoning models
|
| 49 |
+
message = response.choices[0].message
|
| 50 |
+
content = message.content
|
| 51 |
+
# If content is empty, fall back to reasoning_content
|
| 52 |
+
if not content and hasattr(message, 'reasoning_content') and message.reasoning_content:
|
| 53 |
+
content = message.reasoning_content
|
| 54 |
+
return content or ""
|
| 55 |
except Exception as e:
|
| 56 |
return f"Error: {str(e)}"
|
| 57 |
|
|
|
|
| 196 |
|
| 197 |
with gr.Row():
|
| 198 |
model_select = gr.Dropdown(
|
| 199 |
+
choices=["GLM_ar7snd", "GLM_pq0dvd", "GLM_35a7cn", "QWEN_czrd3t", "DEEPSEEK_5jcwxs"],
|
| 200 |
+
value="GLM_ar7snd",
|
| 201 |
label="Model",
|
| 202 |
scale=1,
|
| 203 |
)
|
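
The `reasoning_content` fallback added to `call_api` in this commit can be exercised in isolation. The following is a minimal sketch, not the code from `app.py`: the helper name `extract_text` and the `SimpleNamespace` stand-ins for the SDK's message object are assumptions made for illustration, and it only mirrors the logic visible in the diff (prefer `content`, fall back to `reasoning_content`, otherwise return an empty string).

```python
from types import SimpleNamespace


def extract_text(message) -> str:
    """Return message.content, falling back to reasoning_content when it is empty.

    `extract_text` is a hypothetical helper; the commit inlines this logic
    inside `call_api` instead of defining a separate function.
    """
    content = getattr(message, "content", None)
    if not content:
        # Some reasoning models leave `content` empty and return their text
        # in `reasoning_content`; use it only when `content` has nothing.
        content = getattr(message, "reasoning_content", None)
    return content or ""


# Hypothetical stand-ins for the message object returned by the API client.
normal = SimpleNamespace(content="x = 2", reasoning_content="thinking...")
reasoning_only = SimpleNamespace(content="", reasoning_content="x = 2 because ...")
empty = SimpleNamespace(content=None)

print(extract_text(normal))
print(extract_text(reasoning_only))
print(repr(extract_text(empty)))
```

Using `getattr` with a default covers the same cases as the `hasattr` check in the diff, including message objects that do not define `reasoning_content` at all.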