ZhouChuYue committed on
Commit
787a7ad
·
0 Parent(s):

Initial commit: UltraData-Math L3 Generator Space

README.md ADDED
@@ -0,0 +1,186 @@
+ # UltraData-Math-L3-Generator
+
+ L3 synthetic data layer: an LLM-based tool for synthesizing mathematical data in multiple formats.
+
+ ## 📂 Directory Structure
+
+ ```
+ UltraData-Math-L3-Generator/
+ ├── run_synthesis.py            # OpenAI API caller script
+ ├── qa_synthesis.py             # Q&A pair synthesis prompts
+ ├── conversation_synthesis.py   # Multi-turn conversation synthesis prompts
+ ├── multistyle_rewrite.py       # Multi-style rewriting prompts
+ ├── knowledge_textbook.py       # Knowledge point extraction + textbook exercise prompts
+ └── README.md
+ ```
+
+ ## 🔧 Installation
+
+ ```bash
+ pip install openai
+ ```
+
+ ## 🚀 Quick Start
+
+ ### Environment Setup
+
+ ```bash
+ # Set the API key
+ export OPENAI_API_KEY="your-api-key"
+
+ # Optional: set a custom API base URL (for OpenAI-compatible APIs)
+ export OPENAI_BASE_URL="https://your-api-endpoint/v1"
+ ```
+
+ ### Basic Usage
+
+ ```bash
+ python run_synthesis.py \
+   --input data.jsonl \
+   --output output.jsonl \
+   --task qa \
+   --level high_school \
+   --model gpt-4o \
+   --workers 10
+ ```
+
+ ## 📋 Task Types
+
+ ### 1. Q&A Pair Synthesis (`qa`)
+
+ Generates question-answer pairs from mathematical content, graded by education level.
+
+ **Parameter `--level`:**
+ | Value | Description |
+ |:---|:---|
+ | `grade_school` | Elementary school |
+ | `middle_school` | Middle school |
+ | `high_school` | High school (default) |
+ | `college` | College |
+
+ ```bash
+ python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --level high_school
+ ```
+
+ ### 2. Multi-turn Conversation Synthesis (`conversation`)
+
+ Converts mathematical content into a multi-turn conversation.
+
+ **Parameter `--style`:**
+ | Value | Description |
+ |:---|:---|
+ | `two_professors` | Discussion between two professors |
+ | `teacher_student` | Teacher-student dialogue (default) |
+ | `two_students` | Discussion between two students |
+ | `interview` | Interview style |
+ | `problem_solving` | Problem solving |
+ | `layman_expert` | Layman and expert |
+ | `debate` | Debate style |
+
+ ```bash
+ python run_synthesis.py -i data.jsonl -o output.jsonl -t conversation --style teacher_student
+ ```
+
+ ### 3. Multi-style Rewriting (`rewrite`)
+
+ Rewrites mathematical content in different styles.
+
+ **Parameter `--style`:**
+ | Value | Description |
+ |:---|:---|
+ | `wikipedia` | Wikipedia style |
+ | `textbook` | Textbook style (default) |
+ | `blog` | Blog style |
+ | `popular_science` | Popular science style |
+ | `academic_paper` | Academic paper style |
+ | `learning_note` | Study notes style |
+ | `lecture_note` | Lecture notes style |
+
+ ```bash
+ python run_synthesis.py -i data.jsonl -o output.jsonl -t rewrite --style textbook
+ ```
+
+ ### 4. Knowledge Point Extraction (`knowledge`)
+
+ Extracts knowledge points such as definitions, theorems, and properties from mathematical content.
+
+ ```bash
+ python run_synthesis.py -i data.jsonl -o knowledge_output.jsonl -t knowledge
+ ```
+
+ ### 5. Textbook Exercise Generation (`textbook`)
+
+ Generates textbook-style exercises at different difficulty levels from knowledge points.
+
+ **Parameter `--difficulty`:**
+ | Value | Description |
+ |:---|:---|
+ | `easy` | Easy (default) |
+ | `medium` | Medium |
+ | `hard` | Hard |
+
+ ```bash
+ python run_synthesis.py -i knowledge.jsonl -o output.jsonl -t textbook --difficulty medium
+ ```
+
+ **Note:** The input file must contain a `knowledge_point` field (the field name can be changed with `--knowledge-field`).
+
+ ## ⚙️ Parameters
+
+ | Parameter | Description | Default |
+ |:---|:---|:---|
+ | `-i, --input` | Path to the input JSONL file | Required |
+ | `-o, --output` | Path to the output JSONL file | Required |
+ | `-t, --task` | Task type: `qa`, `conversation`, `rewrite`, `knowledge`, `textbook` | Required |
+ | `--level` | Q&A difficulty level | `high_school` |
+ | `--style` | Conversation/rewrite style | - |
+ | `--difficulty` | Textbook exercise difficulty | `easy` |
+ | `--text-field` | Name of the input text field | `text` |
+ | `--knowledge-field` | Name of the knowledge point field | `knowledge_point` |
+ | `--api-key` | OpenAI API key | Environment variable |
+ | `--base-url` | API base URL | Environment variable |
+ | `--model` | Model name | `gpt-4o` |
+ | `--temperature` | Sampling temperature | `0.7` |
+ | `--max-tokens` | Maximum number of generated tokens | `4096` |
+ | `-w, --workers` | Number of concurrent workers | `10` |
+ | `--max-retries` | Maximum number of retries | `3` |
+ | `--limit` | Limit on the number of samples processed | - |
+ | `-q, --quiet` | Quiet mode | `False` |
+
+ ## 📊 Input and Output Formats
+
+ **Input:** JSONL, one JSON object per line (see `example_data.jsonl`):
+
+ ```jsonl
+ {"text": "The quadratic formula states that for any quadratic equation..."}
+ {"text": "The Pythagorean theorem is a fundamental relation..."}
+ ```
+
+ **Output:** the original record with an added `synthesis_result` field:
+
+ ```json
+ {
+   "text": "original math content",
+   "synthesis_result": {
+     "raw": "full response",
+     "problem": "generated problem",
+     "solution": "detailed solution"
+   }
+ }
+ ```
+
+ ## 🔌 Using Other APIs
+
+ Any OpenAI-compatible API is supported (e.g., Qwen, DeepSeek, vLLM):
+
+ ```bash
+ # Use the Alibaba Cloud Qwen API
+ export OPENAI_API_KEY="your-dashscope-api-key"
+ export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
+
+ python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --model qwen-plus
+ ```
+
+ ## 📜 License
+
+ This project is released under the [Apache 2.0](../LICENSE) license.
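The input/output contract described in the README can be illustrated with a small sketch. This is not the code in `run_synthesis.py` (which is not shown in this commit view); `attach_results` and `fake_synthesize` are hypothetical names, and the model call is replaced by a stub so the example runs offline.

```python
import json
from typing import Callable, Iterable, Iterator


def attach_results(
    lines: Iterable[str],
    synthesize: Callable[[str], dict],
    text_field: str = "text",
) -> Iterator[str]:
    """Yield output JSONL lines: each input record plus a `synthesis_result` field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input file
        record = json.loads(line)
        record["synthesis_result"] = synthesize(record[text_field])
        # ensure_ascii=False keeps math symbols such as "²" readable in the output
        yield json.dumps(record, ensure_ascii=False)


def fake_synthesize(text: str) -> dict:
    """Hypothetical stand-in for the model call; returns the README's result shape."""
    return {"raw": text.upper(), "problem": "", "solution": ""}


if __name__ == "__main__":
    sample = ['{"text": "a^2 + b^2 = c^2"}']
    for out_line in attach_results(sample, fake_synthesize):
        print(out_line)
```

Keeping the original keys and only adding `synthesis_result` means the output file can be fed back in as input for a later task, for example `knowledge` output into `textbook`, provided the expected field is present.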
app.py ADDED
@@ -0,0 +1,340 @@
+ # -*- coding: utf-8 -*-
+ """
+ UltraData-Math L3 Generator - Hugging Face Space Demo
+ """
+
+ import os
+ import asyncio
+ import json
+ import gradio as gr
+
+ from openai import AsyncOpenAI
+
+ from qa_synthesis import QA_PROMPTS, get_qa_prompt
+ from conversation_synthesis import CONVERSATION_PROMPTS, get_conversation_prompt
+ from multistyle_rewrite import MULTISTYLE_PROMPTS, get_multistyle_prompt
+ from knowledge_textbook import (
+     get_knowledge_extraction_prompt,
+     get_textbook_exercise_prompt,
+     TEXTBOOK_EXERCISE_PROMPTS,
+ )
+ from run_synthesis import (
+     parse_qa_output,
+     parse_conversation_output,
+     parse_rewrite_output,
+     parse_knowledge_output,
+     parse_textbook_output,
+ )
+
+ # API configuration is read from environment variables (set via HF Secrets)
+ API_KEY = os.getenv("OPENAI_API_KEY")
+ BASE_URL = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
+ DEFAULT_MODEL = "gpt-4o"
+
+
+ async def call_api(prompt: str, model: str = DEFAULT_MODEL, temperature: float = 0.7) -> str:
+     """Call the API and return the generated content (or an error message)."""
+     if not API_KEY:
+         return "Error: API Key not configured. Please contact administrator."
+
+     client = AsyncOpenAI(api_key=API_KEY, base_url=BASE_URL)
+     try:
+         response = await client.chat.completions.create(
+             model=model,
+             messages=[{"role": "user", "content": prompt}],
+             temperature=temperature,
+             max_tokens=4096,
+         )
+         # content can be None (e.g. an empty completion); always return a str
+         return response.choices[0].message.content or ""
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def run_async(coro):
+     """Run a coroutine to completion from synchronous code."""
+     try:
+         loop = asyncio.get_event_loop()
+     except RuntimeError:
+         # No event loop in this thread (e.g. a worker thread): create one
+         loop = asyncio.new_event_loop()
+         asyncio.set_event_loop(loop)
+     return loop.run_until_complete(coro)
+
+
+ # ============================================================================
+ # Task Handlers
+ # ============================================================================
+
+ def qa_synthesis(text: str, level: str, model: str, temperature: float):
+     """Synthesize a Q&A pair; returns (problem, solution, raw response)."""
+     if not text.strip():
+         return "", "", ""
+
+     prompt_template = get_qa_prompt(level)
+     prompt = prompt_template.format(text=text)
+
+     response = run_async(call_api(prompt, model, temperature))
+     parsed = parse_qa_output(response)
+
+     return (
+         parsed.get("problem", ""),
+         parsed.get("solution", ""),
+         response
+     )
+
+
+ def conversation_synthesis(text: str, style: str, model: str, temperature: float):
+     """Synthesize a multi-turn conversation; returns (conversation, raw response)."""
+     if not text.strip():
+         return "", ""
+
+     prompt_template = get_conversation_prompt(style)
+     prompt = prompt_template.format(text=text)
+
+     response = run_async(call_api(prompt, model, temperature))
+     parsed = parse_conversation_output(response)
+
+     return parsed.get("content", response), response
+
+
+ def rewrite_synthesis(text: str, style: str, model: str, temperature: float):
+     """Rewrite content in the given style; returns (rewritten text, raw response)."""
+     if not text.strip():
+         return "", ""
+
+     prompt_template = get_multistyle_prompt(style)
+     prompt = prompt_template.format(text=text)
+
+     response = run_async(call_api(prompt, model, temperature))
+     parsed = parse_rewrite_output(response)
+
+     return parsed.get("rewritten", response), response
+
+
+ def knowledge_extraction(text: str, model: str, temperature: float):
+     """Extract knowledge points; returns (formatted points, raw response)."""
+     if not text.strip():
+         return "", ""
+
+     prompt_template = get_knowledge_extraction_prompt()
+     prompt = prompt_template.format(text=text)
+
+     response = run_async(call_api(prompt, model, temperature))
+     parsed = parse_knowledge_output(response)
+
+     knowledge_points = parsed.get("knowledge_points", [])
+     formatted = "\n\n---\n\n".join(knowledge_points) if knowledge_points else "No knowledge points extracted."
+
+     return formatted, response
+
+
+ def textbook_exercise(knowledge_point: str, difficulty: str, model: str, temperature: float):
+     """Generate a textbook exercise; returns (material, raw response)."""
+     if not knowledge_point.strip():
+         return "", ""
+
+     prompt_template = get_textbook_exercise_prompt(difficulty)
+     prompt = prompt_template.format(mathematical_knowledge_point=knowledge_point)
+
+     response = run_async(call_api(prompt, model, temperature))
+     parsed = parse_textbook_output(response)
+
+     return parsed.get("material", response), response
+
+
+ # ============================================================================
+ # Gradio UI
+ # ============================================================================
+
+ custom_css = """
+ .gradio-container {
+     font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif !important;
+     background: linear-gradient(135deg, #1a1a2e 0%, #16213e 50%, #0f3460 100%) !important;
+ }
+
+ .main-title {
+     font-weight: 700 !important;
+     font-size: 2.2rem !important;
+     background: linear-gradient(90deg, #e94560, #f39c12, #00d9ff) !important;
+     -webkit-background-clip: text !important;
+     -webkit-text-fill-color: transparent !important;
+     background-clip: text !important;
+     text-align: center !important;
+ }
+
+ .subtitle {
+     text-align: center !important;
+     color: #94a3b8 !important;
+     font-size: 1rem !important;
+     margin-bottom: 1.5rem !important;
+ }
+
+ .gr-button-primary {
+     background: linear-gradient(135deg, #e94560 0%, #f39c12 100%) !important;
+     border: none !important;
+     font-weight: 600 !important;
+ }
+
+ .gr-button-primary:hover {
+     transform: translateY(-2px) !important;
+     box-shadow: 0 8px 25px rgba(233, 69, 96, 0.4) !important;
+ }
+
+ footer {
+     display: none !important;
+ }
+ """
+
+ with gr.Blocks(title="UltraData-Math L3 Generator", css=custom_css) as demo:
+     gr.HTML('<h1 class="main-title">🧮 UltraData-Math L3 Generator</h1>')
+     gr.HTML('<p class="subtitle">LLM-based Mathematical Data Synthesis Tool</p>')
+
+     with gr.Row():
+         model_select = gr.Dropdown(
+             choices=["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"],
+             value="gpt-4o",
+             label="Model",
+             scale=1,
+         )
+         temperature = gr.Slider(
+             minimum=0.0, maximum=1.5, value=0.7, step=0.1,
+             label="Temperature",
+             scale=1,
+         )
+
+     with gr.Tabs():
+         # Q&A Synthesis Tab
+         with gr.TabItem("📝 Q&A Synthesis"):
+             gr.Markdown("Generates question-answer pairs from mathematical content, graded by education level.")
+             with gr.Row():
+                 with gr.Column():
+                     qa_input = gr.Textbox(
+                         label="Input Mathematical Content",
+                         placeholder="Enter mathematical content here...",
+                         lines=8,
+                     )
+                     qa_level = gr.Radio(
+                         choices=list(QA_PROMPTS.keys()),
+                         value="high_school",
+                         label="Difficulty Level",
+                     )
+                     qa_btn = gr.Button("🚀 Generate Q&A", variant="primary")
+                 with gr.Column():
+                     qa_problem = gr.Textbox(label="Generated Problem", lines=4)
+                     qa_solution = gr.Textbox(label="Generated Solution", lines=8)
+                     qa_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)
+
+             qa_btn.click(
+                 qa_synthesis,
+                 inputs=[qa_input, qa_level, model_select, temperature],
+                 outputs=[qa_problem, qa_solution, qa_raw],
+             )
+
+         # Conversation Synthesis Tab
+         with gr.TabItem("💬 Conversation Synthesis"):
+             gr.Markdown("Converts mathematical content into a multi-turn conversation.")
+             with gr.Row():
+                 with gr.Column():
+                     conv_input = gr.Textbox(
+                         label="Input Mathematical Content",
+                         placeholder="Enter mathematical content here...",
+                         lines=8,
+                     )
+                     conv_style = gr.Radio(
+                         choices=list(CONVERSATION_PROMPTS.keys()),
+                         value="teacher_student",
+                         label="Conversation Style",
+                     )
+                     conv_btn = gr.Button("🚀 Generate Conversation", variant="primary")
+                 with gr.Column():
+                     conv_output = gr.Textbox(label="Generated Conversation", lines=15)
+                     conv_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)
+
+             conv_btn.click(
+                 conversation_synthesis,
+                 inputs=[conv_input, conv_style, model_select, temperature],
+                 outputs=[conv_output, conv_raw],
+             )
+
+         # Rewrite Tab
+         with gr.TabItem("✨ Multi-style Rewrite"):
+             gr.Markdown("Rewrites mathematical content in different styles.")
+             with gr.Row():
+                 with gr.Column():
+                     rewrite_input = gr.Textbox(
+                         label="Input Mathematical Content",
+                         placeholder="Enter mathematical content here...",
+                         lines=8,
+                     )
+                     rewrite_style = gr.Radio(
+                         choices=list(MULTISTYLE_PROMPTS.keys()),
+                         value="textbook",
+                         label="Rewrite Style",
+                     )
+                     rewrite_btn = gr.Button("🚀 Rewrite", variant="primary")
+                 with gr.Column():
+                     rewrite_output = gr.Textbox(label="Rewritten Content", lines=15)
+                     rewrite_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)
+
+             rewrite_btn.click(
+                 rewrite_synthesis,
+                 inputs=[rewrite_input, rewrite_style, model_select, temperature],
+                 outputs=[rewrite_output, rewrite_raw],
+             )
+
+         # Knowledge Extraction Tab
+         with gr.TabItem("📚 Knowledge Extraction"):
+             gr.Markdown("Extracts knowledge points such as definitions, theorems, and properties from mathematical content.")
+             with gr.Row():
+                 with gr.Column():
+                     know_input = gr.Textbox(
+                         label="Input Mathematical Content",
+                         placeholder="Enter mathematical content here...",
+                         lines=10,
+                     )
+                     know_btn = gr.Button("🚀 Extract Knowledge", variant="primary")
+                 with gr.Column():
+                     know_output = gr.Textbox(label="Extracted Knowledge Points", lines=15)
+                     know_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)
+
+             know_btn.click(
+                 knowledge_extraction,
+                 inputs=[know_input, model_select, temperature],
+                 outputs=[know_output, know_raw],
+             )
+
+         # Textbook Exercise Tab
+         with gr.TabItem("📖 Textbook Exercise"):
+             gr.Markdown("Generates textbook-style exercises at different difficulty levels from knowledge points.")
+             with gr.Row():
+                 with gr.Column():
+                     textbook_input = gr.Textbox(
+                         label="Input Knowledge Point",
+                         placeholder="Enter a mathematical knowledge point...",
+                         lines=6,
+                     )
+                     textbook_diff = gr.Radio(
+                         choices=list(TEXTBOOK_EXERCISE_PROMPTS.keys()),
+                         value="easy",
+                         label="Difficulty",
+                     )
+                     textbook_btn = gr.Button("🚀 Generate Exercise", variant="primary")
+                 with gr.Column():
+                     textbook_output = gr.Textbox(label="Generated Exercise Material", lines=15)
+                     textbook_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)
+
+             textbook_btn.click(
+                 textbook_exercise,
+                 inputs=[textbook_input, textbook_diff, model_select, temperature],
+                 outputs=[textbook_output, textbook_raw],
+             )
+
+     gr.HTML("""
+     <div style="text-align: center; margin-top: 2rem; padding: 1rem; color: #64748b; font-size: 0.85rem;">
+         <p>🔬 <strong>UltraData-Math L3 Generator</strong> - Part of the UltraData-Math Project</p>
+         <p>LLM-based data synthesis for Q&A, conversations, rewriting, and more.</p>
+     </div>
+     """)
+
+
+ if __name__ == "__main__":
+     demo.launch(ssr_mode=False)
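Every handler in `app.py` follows the same pattern: reject empty input, fill a prompt template with `str.format`, then run an async API call to completion from synchronous code. The sketch below isolates that pattern. It is only an illustration under stated assumptions: `TEMPLATE`, `fake_call_api`, and `handle` are hypothetical names, the API call is a stub, and `asyncio.run` is used in place of the file's `run_async` helper, which assumes no event loop is already running in the calling thread.

```python
import asyncio

# Hypothetical template; the real ones live in the *_synthesis.py modules.
TEMPLATE = "Math Content:{text}\n\nSummarize the content."


async def fake_call_api(prompt: str) -> str:
    """Stand-in for the real API call: reports how long the prompt was."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return f"<discussions>{len(prompt)} chars received</discussions>"


def handle(text: str) -> str:
    """Synchronous handler: validate input, build the prompt, run the coroutine."""
    if not text.strip():
        return ""  # mirror the early return used for empty input
    # str.format parses only the template, so braces inside `text` are safe
    prompt = TEMPLATE.format(text=text)
    return asyncio.run(fake_call_api(prompt))


if __name__ == "__main__":
    print(handle("The set {1, 2, 3} has three elements."))
```

One consequence of this design is that the template itself must not contain stray `{` or `}` characters (for example in LaTeX snippets), since `str.format` would try to interpret them as fields.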
conversation_synthesis.py ADDED
@@ -0,0 +1,167 @@
+ # -*- coding: utf-8 -*-
+ """
+ UltraData-Math L3 - Conversation Synthesis Prompts
+
+ Reference: MIND
+ Conversation types: Two Professors, Teacher-Student, Two Students, Interview, Problem Solving, Layman-Expert, Debate
+ """
+
+ # ============================================================================
+ # Two Professors Discussion
+ # ============================================================================
+
+ MATH_INSTRUCT_TWO_PROFESSORS_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn discussion between two professors, according to the following requirements.
+ - Make sure that their discussion strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the discussion must be formatted using LaTeX.
+ Finally, please put the discussion within <discussions></discussions>.
+ The result format is as follows:
+ <discussions></discussions>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Teacher-Student Discussion
+ # ============================================================================
+
+ MATH_INSTRUCT_TEACHER_STUDENT_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn discussion between a teacher and a student, according to the following requirements.
+ - The student has questions about the provided math content and the teacher solves each of them step-by-step.
+ - Make sure that their discussion strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the discussion must be formatted using LaTeX.
+ Finally, please put the discussion within <discussions></discussions>.
+ The result format is as follows:
+ <discussions></discussions>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Two Students Discussion
+ # ============================================================================
+
+ MATH_INSTRUCT_TWO_STUDENTS_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn discussion between two students who are working on an assignment related to the provided math content, according to the following requirements.
+ - Make sure that their discussion strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the discussion must be formatted using LaTeX.
+ Finally, please put the discussion within <discussions></discussions>.
+ The result format is as follows:
+ <discussions></discussions>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Interview Style
+ # ============================================================================
+
+ MATH_INSTRUCT_INTERVIEW_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn interview-style conversation between an interviewer and an interviewee, according to the following requirements.
+ - One participant acts as the interviewer who asks questions exclusively related to the provided math content, while the other participant serves as the subject matter expert, providing detailed responses based on the provided math content.
+ - Make sure that their conversation strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the conversation must be formatted using LaTeX.
+ Finally, please put the conversation within <conversation></conversation>.
+ The result format is as follows:
+ <conversation></conversation>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Problem Solving
+ # ============================================================================
+
+ MATH_INSTRUCT_PROBLEM_SOLVING_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn problem-solving conversation, according to the following requirements.
+ - Participants analyze challenges or scenarios presented in the provided math content and brainstorm solutions within the provided math content, avoiding speculation or unrelated discussions.
+ - Make sure that their conversation strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the conversation must be formatted using LaTeX.
+ Finally, please put the conversation within <conversation></conversation>.
+ The result format is as follows:
+ <conversation></conversation>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Layman-Expert
+ # ============================================================================
+
+ MATH_INSTRUCT_LAYMAN_EXPERT_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn interaction between a layman and an expert, according to the following requirements.
+ - While the expert presents the provided math content step-by-step to the layman, the layman asks many follow-up questions about the presentation. The expert answers the questions step-by-step with chain-of-thought reasoning.
+ - Make sure that their interaction strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the interaction must be formatted using LaTeX.
+ Finally, please put the interaction within <interaction></interaction>.
+ The result format is as follows:
+ <interaction></interaction>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Debate Style
+ # ============================================================================
+
+ MATH_INSTRUCT_DEBATE_PROMPT = '''Math Content:{text}
+
+ As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
+ Your goal is to use your abilities to convert the provided math content into a multi-turn debate-style conversation, according to the following requirements.
+ - The participants present arguments and counterarguments based solely on the provided math content, without introducing external information or personal opinions. Each participant defends their position against the others' arguments step-by-step with chain-of-thought reasoning.
+ - Make sure that their conversation strictly adheres to the provided math content and remains faithful to the information in it.
+ - Please DO NOT add any new information or references beyond the provided math content.
+ - All mathematical expressions in the conversation must be formatted using LaTeX.
+ Finally, please put the conversation within <conversation></conversation>.
+ The result format is as follows:
+ <conversation></conversation>
+
+ In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
+
+
+ # ============================================================================
+ # Prompt Registry
+ # ============================================================================
+
+ CONVERSATION_PROMPTS = {
+     "two_professors": MATH_INSTRUCT_TWO_PROFESSORS_PROMPT,
+     "teacher_student": MATH_INSTRUCT_TEACHER_STUDENT_PROMPT,
+     "two_students": MATH_INSTRUCT_TWO_STUDENTS_PROMPT,
+     "interview": MATH_INSTRUCT_INTERVIEW_PROMPT,
+     "problem_solving": MATH_INSTRUCT_PROBLEM_SOLVING_PROMPT,
+     "layman_expert": MATH_INSTRUCT_LAYMAN_EXPERT_PROMPT,
+     "debate": MATH_INSTRUCT_DEBATE_PROMPT,
+ }
+
+
+ def get_conversation_prompt(style: str) -> str:
+     """
+     Get conversation synthesis prompt for specified style
+
+     Args:
+         style: Conversation style, see CONVERSATION_PROMPTS.keys() for options
+
+     Returns:
+         Corresponding prompt template string
+     """
+     if style not in CONVERSATION_PROMPTS:
+         raise ValueError(f"Unknown style: {style}. Available styles: {list(CONVERSATION_PROMPTS.keys())}")
+     return CONVERSATION_PROMPTS[style]
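The prompts above ask the model to wrap its answer in one of three container tags: `<discussions>`, `<conversation>`, or `<interaction>`. A minimal sketch of how such a response could be unwrapped follows. It is not the repository's `parse_conversation_output` (imported from `run_synthesis.py`, which is not shown here); `CONTAINER_TAGS` and `extract_tagged` are hypothetical names used only for illustration.

```python
import re
from typing import Optional

# Container tags named in the prompt templates above.
CONTAINER_TAGS = ("discussions", "conversation", "interaction")


def extract_tagged(response: str) -> Optional[str]:
    """Return the text inside the first matching container tag, or None."""
    for tag in CONTAINER_TAGS:
        # DOTALL lets "." span newlines; the non-greedy group stops at the
        # first closing tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            return match.group(1).strip()
    return None


if __name__ == "__main__":
    reply = "<conversation>\nA: What is $x^2$?\nB: The square of $x$.\n</conversation>"
    print(extract_tagged(reply))
```

Returning `None` when no tag is found lets the caller fall back to the raw response, which matches how the handlers in `app.py` use `parsed.get("content", response)`.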
example_data.jsonl ADDED
@@ -0,0 +1,3 @@
+ {"text": "The quadratic formula states that for any quadratic equation of the form ax² + bx + c = 0, where a ≠ 0, the solutions are given by x = (-b ± √(b² - 4ac)) / (2a). The expression b² - 4ac is called the discriminant. When the discriminant is positive, the equation has two distinct real roots; when it equals zero, there is exactly one real root (a repeated root); when it is negative, the equation has two complex conjugate roots."}
+ {"text": "The Pythagorean theorem is a fundamental relation in Euclidean geometry among the three sides of a right triangle. It states that the area of the square whose side is the hypotenuse (the side opposite the right angle) is equal to the sum of the areas of the squares on the other two sides. This can be written as a² + b² = c², where c represents the length of the hypotenuse and a and b represent the lengths of the triangle's other two sides."}
+ {"text": "In calculus, the derivative of a function measures the sensitivity to change of the function value with respect to a change in its argument. The derivative of f(x) with respect to x is written as f'(x) or df/dx. For example, the derivative of f(x) = x² is f'(x) = 2x, which means the rate of change of x² at any point x is 2x."}
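The first example record states the quadratic formula and the role of the discriminant. As a quick sanity check of that content, the sketch below computes the discriminant and both roots for real coefficients. It is an illustration only and not part of the repository; `solve_quadratic` is a hypothetical helper.

```python
import cmath
from typing import Tuple


def solve_quadratic(a: float, b: float, c: float) -> Tuple[float, Tuple[complex, complex]]:
    """Return (discriminant, (root1, root2)) for ax^2 + bx + c = 0 with a != 0."""
    if a == 0:
        raise ValueError("a must be non-zero for a quadratic equation")
    disc = b * b - 4 * a * c
    # cmath.sqrt handles negative discriminants by returning a complex value
    sqrt_disc = cmath.sqrt(disc)
    return disc, ((-b + sqrt_disc) / (2 * a), (-b - sqrt_disc) / (2 * a))


if __name__ == "__main__":
    for coeffs in [(1, -3, 2), (1, 2, 1), (1, 2, 5)]:
        disc, roots = solve_quadratic(*coeffs)
        print(coeffs, "discriminant:", disc, "roots:", roots)
```

The three coefficient sets cover the three cases named in the record: a positive discriminant, a zero discriminant, and a negative one.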
knowledge_textbook.py ADDED
@@ -0,0 +1,168 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ UltraData-Math L3 - Knowledge Extraction & Textbook Exercise Prompts
4
+
5
+ Features:
6
+ 1. Knowledge Extraction: Extract definitions, axioms, theorems, properties from math content
7
+ 2. Textbook Exercise Generation: Generate exercises at different difficulty levels (Easy/Medium/Hard)
8
+ """
9
+
10
+ # ============================================================================
11
+ # Knowledge Point Extraction
12
+ # ============================================================================
13
+
14
+ MATH_INSTRUCT_KNOWLEDGE_EXTRACTION_PROMPT = '''Math Content:{text}
15
+
16
+ As a math teacher, you are highly proficient in mathematical knowledge.
17
+ Your goal is to utilize your abilities, extract mathematical knowledge points based on the provided math content.
18
+ You should follow these steps:
19
+ 1. First, If the provided math content does not include specific mathematical definitions, axioms, assumptions, hypotheses, conjectures, propositions, lemmas, theorems, corollaries, properties, proofs, return 'no result' directly.
20
+ 2. Then, carefully read the provided math content to provide mathematical knowledge point according to the following requirements.
21
+ - The mathematical knowledge point must be specific mathematical definitions, axioms, assumptions, hypotheses, conjectures, propositions, lemmas, theorems, corollaries, properties, proofs. Otherwise, it must not be output.
22
+ - The mathematical knowledge point must be findable within the provided math content. Otherwise, it must not be output.
23
+ - The beginning of the mathematical knowledge point must state specific mathematical definitions, axioms, assumptions, hypotheses, conjectures, propositions, lemmas, theorems, corollaries, properties, and proofs.
24
+ - The mathematical knowledge point must not be repeated.
25
+ - The mathematical knowledge point must be clear, concise, accurate, and easy to learn.
26
+ - The mathematical knowledge point may appropriately include relevant explanations to make the knowledge point more complete.
27
+ - All mathematical expressions in the mathematical knowledge point must be formatted using LaTeX.
28
+
29
+ The result format is as follows:
30
+ <mathematical knowledge point1></mathematical knowledge point1>
31
+ <mathematical knowledge point2></mathematical knowledge point2>
32
+ and more
33
+
34
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
35
+
36
+
37
+ # ============================================================================
38
+ # Textbook Exercise - Easy
39
+ # ============================================================================
40
+
41
+ MATH_INSTRUCT_TEXTBOOK_EASY_PROMPT = '''Mathematical Knowledge Point:{mathematical_knowledge_point}
42
+
43
+ As a math teacher, you are highly proficient in mathematical knowledge.
44
+ Your goal is to use your abilities to generate informative, textbook-style mathematical learning material suitable for students.
45
+ You should follow these steps:
46
+ 1. First, provide a detailed explanation based on the given mathematical knowledge point.
47
+ 2. Second, generate an exercise based on the provided explanation according to the following requirements.
48
+ - The exercise must be self-contained.
49
+ - Ensure the exercise is fully text-based and solvable without images.
50
+ 3. Third, provide a solution based on the generated exercise according to the following requirements.
51
+ - The solution must be detailed and step-by-step.
52
+ 4. Finally, construct the generated explanation, exercise, and solution into textbook-style learning material according to the following requirements.
53
+ - The material must be logically structured, information-dense, concise and easy to learn.
54
+ - The material must be accurate to avoid misleading students.
55
+ - The material must maintain a formal and educational tone and avoid casual expressions.
56
+ - The explanation must be at the beginning of the material.
57
+ - The exercise in the material must start with 'The exercise:'.
58
+ - The solution in the material must start with 'The solution:'.
59
+ - All mathematical expressions in the material must be formatted using LaTeX.
60
+
61
+ The result format is as follows.
62
+ <material></material>
63
+
64
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
65
+
66
+
67
+ # ============================================================================
68
+ # Textbook Exercise - Medium
69
+ # ============================================================================
70
+
71
+ MATH_INSTRUCT_TEXTBOOK_MEDIUM_PROMPT = '''Mathematical Knowledge Point:{mathematical_knowledge_point}
72
+
73
+ As a math teacher, you are highly proficient in mathematical knowledge.
74
+ Your goal is to use your abilities to generate informative, textbook-style mathematical learning material suitable for students.
75
+ You should follow these steps:
76
+ 1. First, provide a detailed explanation based on the given mathematical knowledge point.
77
+ 2. Second, generate a medium-difficulty exercise based on the provided explanation according to the following requirements.
78
+ - The goal of the exercise is to help students master the given mathematical knowledge point.
79
+ - Other mathematical knowledge points can be incorporated into the exercises to increase the difficulty to medium level.
80
+ - The exercise must be self-contained.
81
+ - Ensure the exercise is fully text-based and solvable without images.
82
+ 3. Third, provide a solution based on the generated exercise according to the following requirements.
83
+ - The solution must be detailed and step-by-step.
84
+ 4. Finally, construct the generated explanation, exercise, and solution into textbook-style learning material according to the following requirements.
85
+ - The material must be logically structured, information-dense, concise and easy to learn.
86
+ - The material must be accurate to avoid misleading students.
87
+ - The material must maintain a formal and educational tone and avoid casual expressions.
88
+ - The explanation must be at the beginning of the material.
89
+ - The exercise in the material must start with 'The exercise:'.
90
+ - The solution in the material must start with 'The solution:'.
91
+ - All mathematical expressions in the material must be formatted using LaTeX.
92
+
93
+ The result format is as follows.
94
+ <material></material>
95
+
96
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
97
+
98
+
99
+ # ============================================================================
100
+ # Textbook Exercise - Hard
101
+ # ============================================================================
102
+
103
+ MATH_INSTRUCT_TEXTBOOK_HARD_PROMPT = '''Mathematical Knowledge Point:{mathematical_knowledge_point}
104
+
105
+ As a math teacher, you are highly proficient in mathematical knowledge.
106
+ Your goal is to use your abilities to generate informative, textbook-style mathematical learning material suitable for students.
107
+ You should follow these steps:
108
+ 1. First, provide a detailed explanation based on the given mathematical knowledge point.
109
+ 2. Second, generate a hard-difficulty exercise based on the provided explanation according to the following requirements.
110
+ - The goal of the exercise is to help students deeply understand and comprehensively apply the given mathematical knowledge point.
111
+ - Other mathematical knowledge points can be incorporated into the exercises to increase the difficulty to hard level.
112
+ - The exercise must be self-contained.
113
+ - Ensure the exercise is fully text-based and solvable without images.
114
+ 3. Third, provide a solution based on the generated exercise according to the following requirements.
115
+ - The solution must be detailed and step-by-step.
116
+ 4. Finally, construct the generated explanation, exercise, and solution into textbook-style learning material according to the following requirements.
117
+ - The material must be logically structured, information-dense, concise and easy to learn.
118
+ - The material must be accurate to avoid misleading students.
119
+ - The material must maintain a formal and educational tone and avoid casual expressions.
120
+ - The explanation must be at the beginning of the material.
121
+ - The exercise in the material must start with 'The exercise:'.
122
+ - The solution in the material must start with 'The solution:'.
123
+ - All mathematical expressions in the material must be formatted using LaTeX.
124
+
125
+ The result format is as follows.
126
+ <material></material>
127
+
128
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
129
+
130
+
131
+ # ============================================================================
132
+ # Prompt Registry
133
+ # ============================================================================
134
+
135
+ KNOWLEDGE_PROMPTS = {
136
+ "knowledge_extraction": MATH_INSTRUCT_KNOWLEDGE_EXTRACTION_PROMPT,
137
+ }
138
+
139
+ TEXTBOOK_EXERCISE_PROMPTS = {
140
+ "easy": MATH_INSTRUCT_TEXTBOOK_EASY_PROMPT,
141
+ "medium": MATH_INSTRUCT_TEXTBOOK_MEDIUM_PROMPT,
142
+ "hard": MATH_INSTRUCT_TEXTBOOK_HARD_PROMPT,
143
+ }
144
+
145
+
146
+ def get_knowledge_extraction_prompt() -> str:
147
+ """
148
+ Get knowledge extraction prompt
149
+
150
+ Returns:
151
+ Knowledge extraction prompt template string
152
+ """
153
+ return MATH_INSTRUCT_KNOWLEDGE_EXTRACTION_PROMPT
154
+
155
+
156
+ def get_textbook_exercise_prompt(difficulty: str) -> str:
157
+ """
158
+ Get textbook exercise prompt for specified difficulty
159
+
160
+ Args:
161
+ difficulty: Difficulty level, options: "easy", "medium", "hard"
162
+
163
+ Returns:
164
+ Corresponding prompt template string
165
+ """
166
+ if difficulty not in TEXTBOOK_EXERCISE_PROMPTS:
167
+ raise ValueError(f"Unknown difficulty: {difficulty}. Available: {list(TEXTBOOK_EXERCISE_PROMPTS.keys())}")
168
+ return TEXTBOOK_EXERCISE_PROMPTS[difficulty]
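The tag formats requested by the prompts in this file suggest how a caller might post-process model responses. The following is a minimal sketch and not part of the commit: `parse_knowledge_points` and `split_material` are hypothetical helper names, and the code assumes the model follows the requested `<mathematical knowledge pointN>` and `<material>` formats literally, including the 'The exercise:' and 'The solution:' markers.

```python
import re
from typing import Dict, List, Optional

# Matches <mathematical knowledge pointN>...</mathematical knowledge pointN>,
# requiring the same index N in the opening and closing tag.
_POINT_RE = re.compile(
    r"<mathematical knowledge point(\d+)>(.*?)</mathematical knowledge point\1>",
    re.DOTALL,
)
_MATERIAL_RE = re.compile(r"<material>(.*?)</material>", re.DOTALL)


def parse_knowledge_points(response: str) -> List[str]:
    """Return the non-empty knowledge points found in a model response.

    Hypothetical helper: a bare 'no result' reply (with or without quotes)
    is treated as "nothing to extract".
    """
    if response.strip().strip("'\"").strip().lower() == "no result":
        return []
    points = (m.group(2).strip() for m in _POINT_RE.finditer(response))
    return [p for p in points if p]


def split_material(response: str) -> Optional[Dict[str, str]]:
    """Split a <material> block into explanation, exercise, and solution.

    Returns None when the tags or either marker is missing, so the caller
    can decide whether to retry or drop the sample.
    """
    match = _MATERIAL_RE.search(response)
    if match is None:
        return None
    body = match.group(1).strip()
    explanation, found_exercise, rest = body.partition("The exercise:")
    exercise, found_solution, solution = rest.partition("The solution:")
    if not found_exercise or not found_solution:
        return None
    return {
        "explanation": explanation.strip(),
        "exercise": exercise.strip(),
        "solution": solution.strip(),
    }
```

Each extracted knowledge point could then be substituted for `{mathematical_knowledge_point}` in the template returned by `get_textbook_exercise_prompt`. Whether the actual runner chains the two stages this way is an assumption.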
multistyle_rewrite.py ADDED
@@ -0,0 +1,224 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ UltraData-Math L3 - Multi-Style Rewrite Prompts
4
+
5
+ Style types: Wikipedia, Textbook, Blog, Popular Science, Academic Paper, Learning Note, Lecture Note
6
+ """
7
+
8
+ # ============================================================================
9
+ # Wikipedia Style
10
+ # ============================================================================
11
+
12
+ MATH_INSTRUCT_WIKI_PROMPT = '''Math Content:{text}
13
+
14
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
15
+ Your goal is to use your abilities to rewrite the provided math content in the wiki style.
16
+ Before beginning the rewrite, you will consider the following requirements:
17
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
18
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
19
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
20
+ 2. Then, focus on the captured and preserved information, combine it with the wiki style, and rewrite the text to form an initial draft, according to the following requirements.
21
+ - The overall structure of the initial draft should follow the structure used by Wikipedia, employing a modular, encyclopedic organizational format.
22
+ - The sentence expression of the initial draft should follow the sentence expression used by Wikipedia, employing highly concise and objective declarative sentences. It adheres to the "definition-first" principle, rigorously uses standard terminology, maintains formal sentence structures, and avoids colloquial or personalized expressions.
23
+ - The overall tone of the initial draft should follow the tone used by Wikipedia, maintaining an absolutely neutral, authoritative, and impersonal encyclopedic tone.
24
+ 3. Third, refine the initial draft according to the following requirements.
25
+ - The content of the refined content must be logically structured, high-quality, information-dense.
26
+ - The overall layout of the refined content must not use LaTeX formatting.
27
+ - The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
28
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
29
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
30
+
31
+ The result format is as follows:
32
+ <rewritten content></rewritten content>'''
33
+
34
+
35
+ # ============================================================================
36
+ # Textbook Style
37
+ # ============================================================================
38
+
39
+ MATH_INSTRUCT_TEXTBOOK_PROMPT = '''Math Content:{text}
40
+
41
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
42
+ Your goal is to use your abilities to rewrite the provided math content in the textbook style.
43
+ Before beginning the rewrite, you will consider the following requirements:
44
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
45
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
46
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
47
+ 2. Then, focus on the captured and preserved information, combine it with the textbook style, and rewrite the text to form an initial draft, according to the following requirements.
48
+ - The overall structure of the initial draft should follow the structure used by Textbook, employing a rigorous logical progression system, unfolding through a modular structure of "definition-theorem/proof/formula/property-example".
49
+ - The sentence expression of the initial draft should follow the sentence expression used by Textbook, combining standardized and precise disciplinary terminology with guided instructional language while avoiding colloquialism or ambiguity to ensure the accuracy and teachability of knowledge points. It must be accurate and complete.
50
+ - The overall tone of the initial draft should follow the tone used by Textbook, maintaining an authoritative, neutral, objective, and inquiry-based instructional tone. It should foster a positive learning environment while preserving professionalism.
51
+ 3. Third, refine the initial draft according to the following requirements.
52
+ - The content of the refined content must be logically structured, high-quality, information-dense.
53
+ - The overall layout of the refined content must not use LaTeX formatting.
54
+ - All examples in the refined content must include detailed and step-by-step solutions.
55
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
56
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
57
+
58
+ The result format is as follows:
59
+ <rewritten content></rewritten content>'''
60
+
61
+
62
+ # ============================================================================
63
+ # Blog Style
64
+ # ============================================================================
65
+
66
+ MATH_INSTRUCT_BLOG_PROMPT = '''Math Content:{text}
67
+
68
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
69
+ Your goal is to use your abilities to rewrite the provided math content in the blog style.
70
+ Before beginning the rewrite, you will consider the following requirements:
71
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
72
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
73
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
74
+ 2. Then, focus on the captured and preserved information, combine it with the blog style, and rewrite the text to form an initial draft, according to the following requirements.
75
+ - The overall structure of the initial draft should follow the structure used by Blog, employing a modular yet flexible content arrangement. It typically begins with captivating titles or thought-provoking questions, utilizes short paragraphs and subheadings to enhance readability, and establishes a relaxed and free-flowing reading rhythm.
76
+ - The sentence expression of the initial draft should follow the sentence expression used by Blog, employing simple and conversational sentence patterns. It should prioritize short sentences, questions, and exclamations to create rhythm and interactivity, while avoiding lengthy and complex professional jargon. Analogies, metaphors, and real-life examples should be skillfully utilized to explain complex mathematical concepts, thereby lowering the reader's barrier to comprehension. It must be accurate and complete.
77
+ - The overall tone of the initial draft should follow the tone used by Blog, maintaining a relatable and natural conversational style with infectious enthusiasm, aiming to spark readers' interest and encourage interaction and sharing.
78
+ 3. Third, refine the initial draft according to the following requirements.
79
+ - The content of the refined content must be logically structured, high-quality, information-dense.
80
+ - The overall layout of the refined content must not use LaTeX formatting.
81
+ - The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
82
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
83
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
84
+
85
+ The result format is as follows:
86
+ <rewritten content></rewritten content>'''
87
+
88
+
89
+ # ============================================================================
90
+ # Popular Science Style
91
+ # ============================================================================
92
+
93
+ MATH_INSTRUCT_POPULAR_SCIENCE_PROMPT = '''Math Content:{text}
94
+ 
95
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
96
+ Your goal is to use your abilities to rewrite the provided math content in the popular science style.
97
+ Before beginning the rewrite, you will consider the following requirements:
98
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
99
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
100
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
101
+ 2. Then, focus on the captured and preserved information, combine it with the popular science style, and rewrite the text to form an initial draft, according to the following requirements.
102
+ - The overall structure of the initial draft should follow the structure used by Popular Science, guided by an engaging narrative thread or real-world problem. It should progressively unfold step by step, gradually guiding readers to understand core concepts and construct cognitive pathways of knowledge.
103
+ - The sentence expression of the initial draft should follow the sentence expression used by Popular Science, actively avoiding specialized terminology and complex symbols. It should employ vivid, sensory descriptions and make extensive use of metaphors, analogies, and imaginative imagery to explain abstract concepts, prioritizing experiential resonance over the accumulation of technical jargon. It must be accurate and complete.
104
+ - The overall tone of the initial draft should follow the tone used by Popular Science, maintaining a narrative style filled with wonder and enthusiastic exploration. It should foster a relatable and natural conversational atmosphere, aiming to spark the imagination and interest of general readers.
105
+ 3. Third, refine the initial draft according to the following requirements.
106
+ - The content of the refined content must be logically structured, high-quality, information-dense.
107
+ - The overall layout of the refined content must not use LaTeX formatting.
108
+ - The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
109
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
110
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
111
+
112
+ The result format is as follows:
113
+ <rewritten content></rewritten content>'''
114
+
115
+
116
+ # ============================================================================
117
+ # Academic Paper Style
118
+ # ============================================================================
119
+
120
+ MATH_INSTRUCT_ACADEMIC_PAPER_PROMPT = '''Math Content:{text}
121
+
122
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
123
+ Your goal is to use your abilities to rewrite the provided math content in the academic paper style.
124
+ Before beginning the rewrite, you will consider the following requirements:
125
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
126
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
127
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
128
+ 2. Then, focus on the captured and preserved information, combine it with the academic paper style, and rewrite the text to form an initial draft, according to the following requirements.
129
+ - The overall structure of the initial draft should follow the structure used by Academic Paper, following highly standardized and rigorous formats, ensuring clear organization and logical progression.
130
+ - The sentence expression of the initial draft should follow the sentence expression used by Academic Paper, employing highly specialized disciplinary terminology and passive voice constructions, and utilizing complex sentence structures and quantitative expressions to ensure academic rigor, striving for absolute precision and clarity in order to avoid any ambiguity. It must be accurate and complete.
131
+ - The overall tone of the initial draft should follow the tone used by Academic Paper, maintaining an absolutely objective and neutral researcher's stance while eliminating any subjective elements. The focus shall be on presenting facts, evidence, and logical reasoning, aiming to engage in rigorous dialogue with academic peers.
132
+ 3. Third, refine the initial draft according to the following requirements.
133
+ - The content of the refined content must be logically structured, high-quality, information-dense.
134
+ - The overall layout of the refined content must not use LaTeX formatting.
135
+ - The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
136
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
137
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
138
+
139
+ The result format is as follows:
140
+ <rewritten content></rewritten content>'''
141
+
142
+
143
+ # ============================================================================
144
+ # Learning Note Style
145
+ # ============================================================================
146
+
147
+ MATH_INSTRUCT_LEARNING_NOTE_PROMPT = '''Math Content:{text}
148
+
149
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
150
+ Your goal is to use your abilities to rewrite the provided math content in the learning note style.
151
+ Before beginning the rewrite, you will consider the following requirements:
152
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
153
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
154
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
155
+ 2. Then, focus on the captured and preserved information, combine it with the learning note style, and rewrite the text to form an initial draft, according to the following requirements.
156
+ - The overall structure of the initial draft should follow the structure used by Learning Note, prioritizing personal comprehension over rigid formatting. It typically employs a modular approach with point-by-point enumeration to facilitate organization and clarity.
157
+ - The sentence expression of the initial draft should follow the sentence expression used by Learning Note, employing highly concise and fragmented language—predominantly keywords, phrases, and incomplete sentences. It should incorporate meta-cognitive elements such as self-posed questions and answers, error annotation, and insight notes to clarify thinking and reinforce memory. It must be accurate and complete.
158
+ - The overall tone of the initial draft should follow the tone used by Learning Note. It is subjective, direct, and exploratory, resembling a dialogue with oneself. It should focus on documenting "my" comprehension difficulties, sudden insights, and key points requiring review, all characterized by strong personal nuance.
159
+ 3. Third, refine the initial draft according to the following requirements.
160
+ - The content of the refined content must be logically structured, high-quality, information-dense.
161
+ - The overall layout of the refined content must not use LaTeX formatting.
162
+ - The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
163
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
164
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
165
+
166
+ The result format is as follows:
167
+ <rewritten content></rewritten content>'''
168
+
169
+
170
+ # ============================================================================
171
+ # Lecture Note Style
172
+ # ============================================================================
173
+
174
+ MATH_INSTRUCT_LECTURE_NOTE_PROMPT = '''Math Content:{text}
175
+
176
+ As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
177
+ Your goal is to use your abilities to rewrite the provided math content in the lecture note style.
178
+ Before beginning the rewrite, you will consider the following requirements:
179
+ 1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
180
+ - Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
181
+ - Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
182
+ 2. Then, focus on the captured and preserved information, combine it with the lecture note style, and rewrite the text to form an initial draft, according to the following requirements.
183
+ - The overall structure of the initial draft should follow the structure used by Lecture Note, guided by teaching objectives. It achieves systematic knowledge transfer through hierarchical organization of key points, formula derivation demonstrations, and case analysis modules.
184
+ - The sentence expression of the initial draft should follow the sentence expression used by Lecture Note, employing professional discourse that balances authority and guidance. It should integrate disciplinary terminology with instructional explanations, utilizing rhetorical questions, emphatic statements, and directive language to highlight key and challenging points. It must be accurate and complete.
185
+ - The overall tone of the initial draft should follow the tone used by Lecture Note, maintaining an authoritative narrative stance that combines credibility with guidance. Like an invisible teacher directing the reader's thinking in real time, it emphasizes the mastery of methods and thought processes, often anticipating potential reader confusion to create an immersive learning atmosphere.
186
+ 3. Third, refine the initial draft according to the following requirements.
187
+ - The content of the refined content must be logically structured, high-quality, information-dense.
188
+ - The overall layout of the refined content must not use LaTeX formatting.
189
+ - The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
190
+ - All mathematical expressions in the refined content must be formatted using LaTeX.
191
+ 4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
192
+
193
+ The result format is as follows:
194
+ <rewritten content></rewritten content>'''
+
+
+ # ============================================================================
+ # Prompt Registry
+ # ============================================================================
+
+ MULTISTYLE_PROMPTS = {
+     "wikipedia": MATH_INSTRUCT_WIKI_PROMPT,
+     "textbook": MATH_INSTRUCT_TEXTBOOK_PROMPT,
+     "blog": MATH_INSTRUCT_BLOG_PROMPT,
+     "popular_science": MATH_INSTRUCT_POPULAR_SCIENCE_PROMPT,
+     "academic_paper": MATH_INSTRUCT_ACADEMIC_PAPER_PROMPT,
+     "learning_note": MATH_INSTRUCT_LEARNING_NOTE_PROMPT,
+     "lecture_note": MATH_INSTRUCT_LECTURE_NOTE_PROMPT,
+ }
+
+
+ def get_multistyle_prompt(style: str) -> str:
+     """
+     Get multi-style rewrite prompt for specified style
+
+     Args:
+         style: Style type, see MULTISTYLE_PROMPTS.keys() for options
+
+     Returns:
+         Corresponding prompt template string
+     """
+     if style not in MULTISTYLE_PROMPTS:
+         raise ValueError(f"Unknown style: {style}. Available styles: {list(MULTISTYLE_PROMPTS.keys())}")
+     return MULTISTYLE_PROMPTS[style]
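The registry-plus-lookup pattern above can be sketched in isolation. The following is a minimal, self-contained illustration, not the repository's code: `DEMO_PROMPTS`, `get_demo_prompt`, and `build_prompt` are hypothetical names, and the two templates are short stand-ins for the real `MATH_INSTRUCT_*` strings. It assumes, as the Q&A templates in this commit do, that each template contains a single `{text}` placeholder filled with `str.format`.

```python
# Hypothetical stand-in templates; the real ones live in multistyle_rewrite.py.
DEMO_PROMPTS = {
    "textbook": "Math Content:{text}\n\nRewrite the content in textbook style.",
    "blog": "Math Content:{text}\n\nRewrite the content as a blog post.",
}


def get_demo_prompt(style: str) -> str:
    """Look up a template by style name, failing loudly on unknown styles."""
    if style not in DEMO_PROMPTS:
        raise ValueError(f"Unknown style: {style}. Available styles: {list(DEMO_PROMPTS)}")
    return DEMO_PROMPTS[style]


def build_prompt(style: str, text: str) -> str:
    """Fill the {text} placeholder of the selected template."""
    # str.format does not re-parse braces inside the substituted value,
    # so LaTeX such as x^{2} in `text` passes through unchanged.
    return get_demo_prompt(style).format(text=text)


if __name__ == "__main__":
    print(build_prompt("blog", "The area of a circle is \\pi r^{2}."))
```

One caveat with this design: braces in the input text are safe, but any literal brace inside a template itself would have to be doubled (`{{` and `}}`) for `str.format` to accept it.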
qa_synthesis.py ADDED
@@ -0,0 +1,143 @@
+ # -*- coding: utf-8 -*-
+ """
+ UltraData-Math L3 - Q&A Synthesis Prompts
+
+ Reference: Jiuzhang-Math, MathGPT
+ Difficulty levels: Grade School, Middle School, High School, College
+ """
+
+ # ============================================================================
+ # Grade School Q&A Prompt
+ # ============================================================================
+
+ MATH_INSTRUCT_GRADE_SCHOOL_PROMPT = '''Math Content:{text}
+
+ As a math teacher, you are highly proficient in mathematical knowledge.
+ Your goal is to use your abilities to create an age-appropriate math word problem for grade school students based on the provided math content.
+ You should follow these steps:
+ 1. First, craft a concise math word problem suitable for grade school, according to the following requirements.
+ - The crafted problem must focus on basic arithmetic operations (addition, subtraction, multiplication, division), number sense, simple shapes, or introductory measurements.
+ - The crafted problem must use relatable, real-world scenarios appropriate for the age group.
+ - The crafted problem must include all necessary information for solving it.
+ - The crafted problem must be purely text-based and solvable without images.
+ 2. Then, provide a clear, step-by-step solution to the crafted problem, according to the following requirements.
+ - The solution must use simple language that a grade school student could understand.
+ - The solution must explain the reasoning behind each step.
+ 3. Finally, please put the crafted problem within <problem></problem> and put the solution within <solution></solution>.
+ The result format is as follows:
+ <result>
+ <problem></problem>
+ <solution></solution>
+ </result>
+
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
+
+
+ # ============================================================================
+ # Middle School Q&A Prompt
+ # ============================================================================
+
+ MATH_INSTRUCT_MIDDLE_SCHOOL_PROMPT = '''Math Content:{text}
+
+ As a math teacher, you are highly proficient in mathematical knowledge.
+ Your goal is to use your abilities to create a middle school-level math problem and solution based on the provided math content.
+ You should follow these steps:
+ 1. First, create a self-contained problem for middle school students that directly incorporates a concept from the provided math content, according to the following requirements.
+ - The created problem must target a difficulty level appropriate for grades 6-8 (ages 11-14), assuming knowledge of arithmetic, pre-algebra, basic probability/statistics, and geometry.
+ - The created problem must include all necessary information for solving it.
+ - The created problem must be fully text-based and solvable without images.
+ - The created problem must use concepts typically covered by the end of 8th grade.
+ 2. Then, provide a detailed, step-by-step solution to the created problem, according to the following requirements.
+ - The solution must demonstrate the mathematical reasoning from problem statement to conclusion.
+ - The solution must explain each step to reinforce the underlying math principles being applied.
+ - All mathematical expressions in the solution must be formatted using LaTeX.
+ 3. Finally, please put the created problem within <problem></problem> and put the solution within <solution></solution>.
+ The result format is as follows:
+ <result>
+ <problem></problem>
+ <solution></solution>
+ </result>
+
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
+
+
+ # ============================================================================
+ # High School Q&A Prompt
+ # ============================================================================
+
+ MATH_INSTRUCT_HIGH_SCHOOL_PROMPT = '''Math Content:{text}
+
+ As a math teacher, you are highly proficient in mathematical knowledge.
+ Your goal is to use your abilities to create, inspired by the provided math content, a high school-level math problem that combines concepts from at least two math subjects.
+ You should follow these steps:
+ 1. First, draft a self-contained math problem for high school students based on the provided math content, according to the following requirements.
+ - The drafted problem must require knowledge from at least two of these subjects: Algebra I and II, Pre-Calculus, Calculus, Geometry, Trigonometry, Statistics and Probability.
+ - The drafted problem must include all necessary information for solving it.
+ - The drafted problem must be fully text-based and solvable without images.
+ - The drafted problem must use concepts typically covered by the end of 11th grade.
+ 2. Then, provide a detailed, step-by-step solution to the drafted problem, according to the following requirements.
+ - The solution must demonstrate the mathematical reasoning from problem statement to conclusion.
+ - The solution must explain each step to reinforce the underlying math principles being applied.
+ - All mathematical expressions in the solution must be formatted using LaTeX.
+ 3. Finally, please put the drafted problem within <problem></problem> and put the solution within <solution></solution>.
+ The result format is as follows:
+ <result>
+ <problem></problem>
+ <solution></solution>
+ </result>
+
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
+
+
+ # ============================================================================
+ # College/University Q&A Prompt
+ # ============================================================================
+
+ MATH_INSTRUCT_COLLEGE_PROMPT = '''Math Content:{text}
+
+ As a math teacher, you are highly proficient in mathematical knowledge.
+ Your goal is to use your abilities to create, inspired by the provided math content, a college-level math problem.
+ You should follow these steps:
+ 1. First, draft a self-contained, college-level math problem inspired by the math content, according to the following requirements.
+ - The drafted problem must be intellectually stimulating and designed for an audience familiar with advanced mathematics, such as Calculus, Linear Algebra, Abstract Algebra, etc.
+ - The drafted problem must include all necessary information for solving it.
+ - The drafted problem must be fully text-based and solvable without images.
+ 2. Then, provide a detailed, step-by-step solution to the drafted problem, according to the following requirements.
+ - The solution must clearly explain the reasoning, mathematical principles, and steps used.
+ - The solution must call out any key theorems or properties being applied at each step.
+ - All mathematical expressions in the solution must be formatted using LaTeX.
+ 3. Finally, please put the drafted problem within <problem></problem> and put the solution within <solution></solution>.
+ The result format is as follows:
+ <result>
+ <problem></problem>
+ <solution></solution>
+ </result>
+
+ In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''
+
+
+ # ============================================================================
+ # Prompt Registry
+ # ============================================================================
+
+ QA_PROMPTS = {
+     "grade_school": MATH_INSTRUCT_GRADE_SCHOOL_PROMPT,
+     "middle_school": MATH_INSTRUCT_MIDDLE_SCHOOL_PROMPT,
+     "high_school": MATH_INSTRUCT_HIGH_SCHOOL_PROMPT,
+     "college": MATH_INSTRUCT_COLLEGE_PROMPT,
+ }
+
+
+ def get_qa_prompt(level: str) -> str:
+     """
+     Get Q&A synthesis prompt for specified difficulty level
+
+     Args:
+         level: Difficulty level, options: "grade_school", "middle_school", "high_school", "college"
+
+     Returns:
+         Corresponding prompt template string
+     """
+     if level not in QA_PROMPTS:
+         raise ValueError(f"Unknown level: {level}. Available levels: {list(QA_PROMPTS.keys())}")
+     return QA_PROMPTS[level]
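Each Q&A prompt asks the model to wrap its answer in `<problem>` and `<solution>` tags, so the downstream step is a tag extraction. The sketch below shows one way to do that with a non-greedy regular expression, in the spirit of `parse_qa_output` in `run_synthesis.py`. The helper name `extract_tag` and the sample response are illustrative assumptions, not part of the commit.

```python
import re
from typing import Optional


def extract_tag(response: str, tag: str) -> Optional[str]:
    """Return the stripped text between <tag> and </tag>, or None if absent."""
    # re.DOTALL lets "." match newlines, so multi-line solutions are captured;
    # the non-greedy "(.*?)" stops at the first closing tag.
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # Hypothetical model response following the requested result format.
    sample = (
        "<result>\n"
        "<problem> Mia has 2 apples and buys 3 more. How many does she have? </problem>\n"
        "<solution>Add the apples:\n2 + 3 = 5.</solution>\n"
        "</result>"
    )
    print(extract_tag(sample, "problem"))
    print(extract_tag(sample, "solution"))
```

Returning `None` for a missing tag keeps malformed responses distinguishable from empty ones, which may be useful when filtering synthesized samples.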
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ gradio>=4.0.0
+ openai>=1.0.0
run_synthesis.py ADDED
@@ -0,0 +1,514 @@
+ # -*- coding: utf-8 -*-
+ """
+ UltraData-Math L3 - Data Synthesis Script
+
+ OpenAI API-based data synthesis tool, supporting:
+ - Q&A synthesis
+ - Multi-turn conversation synthesis
+ - Multi-style rewriting
+ - Knowledge extraction and textbook exercise generation
+
+ Usage:
+     python run_synthesis.py \
+         --input data.jsonl \
+         --output output.jsonl \
+         --task qa \
+         --level high_school \
+         --model gpt-4o \
+         --workers 10
+ """
+
+ import argparse
+ import asyncio
+ import json
+ import os
+ import re
+ import time
+ from pathlib import Path
+ from typing import Optional
+
+ from openai import AsyncOpenAI
+
+ # Import prompt templates
+ from qa_synthesis import QA_PROMPTS, get_qa_prompt
+ from conversation_synthesis import CONVERSATION_PROMPTS, get_conversation_prompt
+ from multistyle_rewrite import MULTISTYLE_PROMPTS, get_multistyle_prompt
+ from knowledge_textbook import (
+     get_knowledge_extraction_prompt,
+     get_textbook_exercise_prompt,
+     TEXTBOOK_EXERCISE_PROMPTS,
+ )
+
+
+ # ============================================================================
+ # Configuration
+ # ============================================================================
+
+ DEFAULT_MODEL = "gpt-4o"
+ DEFAULT_TEMPERATURE = 0.7
+ DEFAULT_MAX_TOKENS = 4096
+ DEFAULT_WORKERS = 10
+ DEFAULT_MAX_RETRIES = 3
+ DEFAULT_RETRY_DELAY = 1.0
+
+
+ # ============================================================================
+ # Output Parsers
+ # ============================================================================
+
+ def parse_qa_output(response: str) -> dict:
+     """Parse Q&A synthesis output"""
+     result = {"raw": response}
+
+     # Extract <problem> and <solution>
+     problem_match = re.search(r"<problem>(.*?)</problem>", response, re.DOTALL)
+     solution_match = re.search(r"<solution>(.*?)</solution>", response, re.DOTALL)
+
+     if problem_match:
+         result["problem"] = problem_match.group(1).strip()
+     if solution_match:
+         result["solution"] = solution_match.group(1).strip()
+
+     return result
+
+
+ def parse_conversation_output(response: str) -> dict:
+     """Parse conversation synthesis output"""
+     result = {"raw": response}
+
+     # Try multiple tags
+     for tag in ["discussions", "conversation", "interaction"]:
+         match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
+         if match:
+             result["content"] = match.group(1).strip()
+             result["type"] = tag
+             break
+
+     return result
+
+
+ def parse_rewrite_output(response: str) -> dict:
+     """Parse multi-style rewrite output"""
+     result = {"raw": response}
+
+     match = re.search(r"<rewritten content>(.*?)</rewritten content>", response, re.DOTALL)
+     if match:
+         result["rewritten"] = match.group(1).strip()
+
+     return result
+
+
+ def parse_knowledge_output(response: str) -> dict:
+     """Parse knowledge extraction output"""
+     result = {"raw": response}
+
+     if "no result" in response.lower():
+         result["knowledge_points"] = []
+         return result
+
+     # Extract all knowledge points
+     pattern = r"<mathematical knowledge point\d*>(.*?)</mathematical knowledge point\d*>"
+     matches = re.findall(pattern, response, re.DOTALL)
+     result["knowledge_points"] = [m.strip() for m in matches]
+
+     return result
+
+
+ def parse_textbook_output(response: str) -> dict:
+     """Parse textbook exercise output"""
+     result = {"raw": response}
+
+     match = re.search(r"<material>(.*?)</material>", response, re.DOTALL)
+     if match:
+         result["material"] = match.group(1).strip()
+
+     return result
+
+
+ OUTPUT_PARSERS = {
+     "qa": parse_qa_output,
+     "conversation": parse_conversation_output,
+     "rewrite": parse_rewrite_output,
+     "knowledge": parse_knowledge_output,
+     "textbook": parse_textbook_output,
+ }
+
+
+ # ============================================================================
+ # API Client
+ # ============================================================================
+
+ class SynthesisClient:
+     """Data synthesis client"""
+
+     def __init__(
+         self,
+         api_key: Optional[str] = None,
+         base_url: Optional[str] = None,
+         model: str = DEFAULT_MODEL,
+         temperature: float = DEFAULT_TEMPERATURE,
+         max_tokens: int = DEFAULT_MAX_TOKENS,
+         max_retries: int = DEFAULT_MAX_RETRIES,
+         retry_delay: float = DEFAULT_RETRY_DELAY,
+     ):
+         self.client = AsyncOpenAI(
+             api_key=api_key or os.getenv("OPENAI_API_KEY"),
+             base_url=base_url or os.getenv("OPENAI_BASE_URL"),
+         )
+         self.model = model
+         self.temperature = temperature
+         self.max_tokens = max_tokens
+         self.max_retries = max_retries
+         self.retry_delay = retry_delay
+
+     async def generate(self, prompt: str) -> str:
+         """Call API to generate content, retrying with exponential backoff"""
+         for attempt in range(self.max_retries):
+             try:
+                 response = await self.client.chat.completions.create(
+                     model=self.model,
+                     messages=[{"role": "user", "content": prompt}],
+                     temperature=self.temperature,
+                     max_tokens=self.max_tokens,
+                 )
+                 # message.content is Optional[str]; normalize None to ""
+                 return response.choices[0].message.content or ""
+             except Exception:
+                 if attempt < self.max_retries - 1:
+                     # Wait retry_delay, 2x, 4x, ... before the next attempt
+                     await asyncio.sleep(self.retry_delay * (2 ** attempt))
+                 else:
+                     # Bare raise preserves the original traceback
+                     raise
+         # Only reached when max_retries <= 0
+         return ""
+
+
+ # ============================================================================
+ # Synthesis Tasks
+ # ============================================================================
+
+ class SynthesisTask:
+     """Base class for synthesis tasks"""
+
+     def __init__(self, client: SynthesisClient, text_field: str = "text"):
+         self.client = client
+         self.text_field = text_field
+
+     def get_prompt(self, sample: dict) -> str:
+         raise NotImplementedError
+
+     def parse_output(self, response: str) -> dict:
+         raise NotImplementedError
+
+     async def process(self, sample: dict) -> dict:
+         """Process a single sample"""
+         prompt = self.get_prompt(sample)
+         response = await self.client.generate(prompt)
+         parsed = self.parse_output(response)
+         return {**sample, "synthesis_result": parsed}
+
+
+ class QASynthesisTask(SynthesisTask):
+     """Q&A synthesis task"""
+
+     def __init__(self, client: SynthesisClient, level: str, text_field: str = "text"):
+         super().__init__(client, text_field)
+         self.level = level
+         self.prompt_template = get_qa_prompt(level)
+
+     def get_prompt(self, sample: dict) -> str:
+         text = sample.get(self.text_field, "")
+         return self.prompt_template.format(text=text)
+
+     def parse_output(self, response: str) -> dict:
+         return parse_qa_output(response)
+
+
+ class ConversationSynthesisTask(SynthesisTask):
+     """Conversation synthesis task"""
+
+     def __init__(self, client: SynthesisClient, style: str, text_field: str = "text"):
+         super().__init__(client, text_field)
+         self.style = style
+         self.prompt_template = get_conversation_prompt(style)
+
+     def get_prompt(self, sample: dict) -> str:
+         text = sample.get(self.text_field, "")
+         return self.prompt_template.format(text=text)
+
+     def parse_output(self, response: str) -> dict:
+         return parse_conversation_output(response)
+
+
+ class RewriteSynthesisTask(SynthesisTask):
+     """Multi-style rewrite task"""
+
+     def __init__(self, client: SynthesisClient, style: str, text_field: str = "text"):
+         super().__init__(client, text_field)
+         self.style = style
+         self.prompt_template = get_multistyle_prompt(style)
+
+     def get_prompt(self, sample: dict) -> str:
+         text = sample.get(self.text_field, "")
+         return self.prompt_template.format(text=text)
+
+     def parse_output(self, response: str) -> dict:
+         return parse_rewrite_output(response)
+
+
+ class KnowledgeExtractionTask(SynthesisTask):
+     """Knowledge extraction task"""
+
+     def __init__(self, client: SynthesisClient, text_field: str = "text"):
+         super().__init__(client, text_field)
+         self.prompt_template = get_knowledge_extraction_prompt()
+
+     def get_prompt(self, sample: dict) -> str:
+         text = sample.get(self.text_field, "")
+         return self.prompt_template.format(text=text)
+
+     def parse_output(self, response: str) -> dict:
+         return parse_knowledge_output(response)
+
+
+ class TextbookExerciseTask(SynthesisTask):
+     """Textbook exercise generation task"""
+
+     def __init__(self, client: SynthesisClient, difficulty: str, knowledge_field: str = "knowledge_point"):
+         super().__init__(client)
+         self.difficulty = difficulty
+         self.knowledge_field = knowledge_field
+         self.prompt_template = get_textbook_exercise_prompt(difficulty)
+
+     def get_prompt(self, sample: dict) -> str:
+         knowledge = sample.get(self.knowledge_field, "")
+         return self.prompt_template.format(mathematical_knowledge_point=knowledge)
+
+     def parse_output(self, response: str) -> dict:
+         return parse_textbook_output(response)
+
+
+ # ============================================================================
+ # Batch Processing
+ # ============================================================================
+
+ async def process_batch(
+     task: SynthesisTask,
+     samples: list[dict],
+     workers: int,
+     progress_callback=None,
+ ) -> list[dict]:
+     """Process batch data concurrently, with at most `workers` requests in flight"""
+     semaphore = asyncio.Semaphore(workers)
+     completed = 0
+
+     async def process_with_semaphore(sample: dict, idx: int):
+         nonlocal completed
+         async with semaphore:
+             try:
+                 result = await task.process(sample)
+                 result["_status"] = "success"
+             except Exception as e:
+                 # Keep the failed sample in the output and record the error
+                 result = {**sample, "_status": "error", "_error": str(e)}
+
+             completed += 1
+             if progress_callback:
+                 progress_callback(completed, len(samples))
+
+             return idx, result
+
+     tasks = [process_with_semaphore(sample, i) for i, sample in enumerate(samples)]
+     task_results = await asyncio.gather(*tasks)
+
+     # Sort by original order
+     task_results.sort(key=lambda x: x[0])
+     results = [r[1] for r in task_results]
+
+     return results
+
+
+ def load_jsonl(filepath: str) -> list[dict]:
+     """Load JSONL file"""
+     data = []
+     with open(filepath, "r", encoding="utf-8") as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 data.append(json.loads(line))
+     return data
+
+
+ def save_jsonl(data: list[dict], filepath: str):
+     """Save JSONL file"""
+     with open(filepath, "w", encoding="utf-8") as f:
+         for item in data:
+             f.write(json.dumps(item, ensure_ascii=False) + "\n")
+
+
+ # ============================================================================
+ # Command Line Interface
+ # ============================================================================
+
+ def create_task(args, client: SynthesisClient) -> SynthesisTask:
+     """Create synthesis task based on arguments"""
+     task_type = args.task
+
+     if task_type == "qa":
+         level = args.level or "high_school"
+         if level not in QA_PROMPTS:
+             raise ValueError(f"Invalid QA level: {level}. Available: {list(QA_PROMPTS.keys())}")
+         return QASynthesisTask(client, level, args.text_field)
+
+     elif task_type == "conversation":
+         style = args.style or "teacher_student"
+         if style not in CONVERSATION_PROMPTS:
+             raise ValueError(f"Invalid conversation style: {style}. Available: {list(CONVERSATION_PROMPTS.keys())}")
+         return ConversationSynthesisTask(client, style, args.text_field)
+
+     elif task_type == "rewrite":
+         style = args.style or "textbook"
+         if style not in MULTISTYLE_PROMPTS:
+             raise ValueError(f"Invalid rewrite style: {style}. Available: {list(MULTISTYLE_PROMPTS.keys())}")
+         return RewriteSynthesisTask(client, style, args.text_field)
+
+     elif task_type == "knowledge":
+         return KnowledgeExtractionTask(client, args.text_field)
+
+     elif task_type == "textbook":
+         difficulty = args.difficulty or "easy"
+         if difficulty not in TEXTBOOK_EXERCISE_PROMPTS:
+             raise ValueError(f"Invalid difficulty: {difficulty}. Available: {list(TEXTBOOK_EXERCISE_PROMPTS.keys())}")
+         return TextbookExerciseTask(client, difficulty, args.knowledge_field)
+
+     else:
+         raise ValueError(f"Unknown task type: {task_type}")
+
+
+ def print_progress(completed: int, total: int):
+     """Print progress"""
+     percent = completed / total * 100
+     print(f"\rProgress: {completed}/{total} ({percent:.1f}%)", end="", flush=True)
+
+
+ async def main_async(args):
+     """Async main function"""
+     # Create client
+     client = SynthesisClient(
+         api_key=args.api_key,
+         base_url=args.base_url,
+         model=args.model,
+         temperature=args.temperature,
+         max_tokens=args.max_tokens,
+         max_retries=args.max_retries,
+     )
+
+     # Create task
+     task = create_task(args, client)
+
+     # Load data
+     print(f"Loading data from {args.input}...")
+     samples = load_jsonl(args.input)
+
+     # Limit processing count
+     if args.limit:
+         samples = samples[:args.limit]
+
+     print(f"Processing {len(samples)} samples with {args.workers} workers...")
+     start_time = time.time()
+
+     # Process data
+     results = await process_batch(
+         task,
+         samples,
+         args.workers,
+         progress_callback=print_progress if not args.quiet else None,
+     )
+
+     elapsed = time.time() - start_time
+     # Guard against division by zero when the batch is empty or finishes instantly
+     rate = len(samples) / elapsed if elapsed > 0 else 0.0
+     print(f"\nCompleted in {elapsed:.2f}s ({rate:.1f} samples/s)")
+
+     # Statistics
+     success_count = sum(1 for r in results if r.get("_status") == "success")
+     error_count = len(results) - success_count
+     print(f"Success: {success_count}, Error: {error_count}")
+
+     # Save results
+     save_jsonl(results, args.output)
+     print(f"Results saved to {args.output}")
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="UltraData-Math L3 Data Synthesis Tool",
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+         epilog="""
+ Examples:
+   # Q&A synthesis (high school level)
+   python run_synthesis.py -i data.jsonl -o qa_output.jsonl -t qa --level high_school
+
+   # Multi-turn conversation synthesis (teacher-student)
+   python run_synthesis.py -i data.jsonl -o conv_output.jsonl -t conversation --style teacher_student
+
+   # Multi-style rewrite (textbook style)
+   python run_synthesis.py -i data.jsonl -o rewrite_output.jsonl -t rewrite --style textbook
+
+   # Knowledge extraction
+   python run_synthesis.py -i data.jsonl -o knowledge_output.jsonl -t knowledge
+
+   # Textbook exercise generation (medium difficulty)
+   python run_synthesis.py -i knowledge.jsonl -o textbook_output.jsonl -t textbook --difficulty medium
+
+ Task Types:
+   qa              Q&A synthesis
+                   --level: grade_school, middle_school, high_school, college
+
+   conversation    Multi-turn conversation synthesis
+                   --style: two_professors, teacher_student, two_students,
+                            interview, problem_solving, layman_expert, debate
+
+   rewrite         Multi-style rewrite
+                   --style: wikipedia, textbook, blog, popular_science,
+                            academic_paper, learning_note, lecture_note
+
+   knowledge       Knowledge extraction
+
+   textbook        Textbook exercise generation
+                   --difficulty: easy, medium, hard
+ """
+     )
+
+     # Input/Output
+     parser.add_argument("-i", "--input", required=True, help="Input JSONL file path")
+     parser.add_argument("-o", "--output", required=True, help="Output JSONL file path")
+
+     # Task configuration
+     parser.add_argument("-t", "--task", required=True,
+                         choices=["qa", "conversation", "rewrite", "knowledge", "textbook"],
+                         help="Synthesis task type")
+     parser.add_argument("--level", help="Q&A difficulty level")
+     parser.add_argument("--style", help="Conversation/rewrite style")
+     parser.add_argument("--difficulty", help="Textbook exercise difficulty")
+
+     # Field configuration
+     parser.add_argument("--text-field", default="text", help="Input text field name (default: text)")
+     parser.add_argument("--knowledge-field", default="knowledge_point", help="Knowledge point field name (default: knowledge_point)")
+
+     # API configuration
+     parser.add_argument("--api-key", help="OpenAI API Key (or set OPENAI_API_KEY env var)")
+     parser.add_argument("--base-url", help="API Base URL (or set OPENAI_BASE_URL env var)")
+     parser.add_argument("--model", default=DEFAULT_MODEL, help=f"Model name (default: {DEFAULT_MODEL})")
+     parser.add_argument("--temperature", type=float, default=DEFAULT_TEMPERATURE, help=f"Sampling temperature (default: {DEFAULT_TEMPERATURE})")
+     parser.add_argument("--max-tokens", type=int, default=DEFAULT_MAX_TOKENS, help=f"Max tokens to generate (default: {DEFAULT_MAX_TOKENS})")
+
+     # Execution configuration
+     parser.add_argument("-w", "--workers", type=int, default=DEFAULT_WORKERS, help=f"Concurrency (default: {DEFAULT_WORKERS})")
+     parser.add_argument("--max-retries", type=int, default=DEFAULT_MAX_RETRIES, help=f"Max retries (default: {DEFAULT_MAX_RETRIES})")
+     parser.add_argument("--limit", type=int, help="Limit number of samples to process")
+     parser.add_argument("-q", "--quiet", action="store_true", help="Quiet mode")
+
+     args = parser.parse_args()
+
+     # Run
+     asyncio.run(main_async(args))
+
+
+ if __name__ == "__main__":
+     main()
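The script combines two concurrency ideas: a semaphore that caps the number of in-flight requests, and retries with exponential backoff. The sketch below isolates both without any network access, so it can be run as-is. All names (`retry_with_backoff`, `run_bounded`, `demo`, `flaky_square`) are hypothetical, and the flaky worker is a stand-in for an API call. It illustrates the pattern under those assumptions and is not the script's actual implementation.

```python
import asyncio


async def retry_with_backoff(make_call, max_retries=3, base_delay=0.01):
    """Await make_call(), retrying failures with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: propagate the last error
            await asyncio.sleep(base_delay * (2 ** attempt))


async def run_bounded(items, worker, limit):
    """Run worker(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # asyncio.gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(guarded(item) for item in items))


def demo():
    attempts = {}

    async def flaky_square(n):
        # Stand-in for an API call: fails on the first attempt, then succeeds.
        async def call():
            attempts[n] = attempts.get(n, 0) + 1
            if attempts[n] < 2:
                raise RuntimeError("transient failure")
            return n * n

        return await retry_with_backoff(call)

    results = asyncio.run(run_bounded([1, 2, 3], flaky_square, limit=2))
    return results, attempts


if __name__ == "__main__":
    print(demo())
```

Because `asyncio.gather` already preserves input order, carrying an explicit index and sorting afterwards, as `process_batch` does, is a defensive choice rather than a requirement.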