ZhouChuYue committed · Commit 787a7ad · 0 parent(s)
Initial commit: UltraData-Math L3 Generator Space
- README.md +186 -0
- app.py +340 -0
- conversation_synthesis.py +167 -0
- example_data.jsonl +3 -0
- knowledge_textbook.py +168 -0
- multistyle_rewrite.py +224 -0
- qa_synthesis.py +143 -0
- requirements.txt +2 -0
- run_synthesis.py +514 -0
README.md
ADDED
@@ -0,0 +1,186 @@
# UltraData-Math-L3-Generator

L3 synthetic data layer: an LLM-based tool for synthesizing mathematical data in multiple formats.

## 📂 Directory Structure

```
UltraData-Math-L3-Generator/
├── run_synthesis.py            # OpenAI API calling script
├── qa_synthesis.py             # Q&A pair synthesis prompts
├── conversation_synthesis.py   # Multi-turn conversation synthesis prompts
├── multistyle_rewrite.py       # Multi-style rewrite prompts
├── knowledge_textbook.py       # Knowledge extraction + textbook exercise prompts
└── README.md
```

## 🔧 Install Dependencies

```bash
pip install openai
```

## 🚀 Quick Start

### Environment Setup

```bash
# Set the API key
export OPENAI_API_KEY="your-api-key"

# Optional: set a custom API endpoint (any OpenAI-compatible API)
export OPENAI_BASE_URL="https://your-api-endpoint/v1"
```

### Basic Usage

```bash
python run_synthesis.py \
    --input data.jsonl \
    --output output.jsonl \
    --task qa \
    --level high_school \
    --model gpt-4o \
    --workers 10
```

## 📋 Task Types

### 1. Q&A Pair Synthesis (`qa`)

Generates question-answer pairs from mathematical content, graded by education level.

**Parameter `--level`:**
| Value | Description |
|:---|:---|
| `grade_school` | Grade school |
| `middle_school` | Middle school |
| `high_school` | High school (default) |
| `college` | College |

```bash
python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --level high_school
```

### 2. Multi-turn Conversation Synthesis (`conversation`)

Converts mathematical content into a multi-turn conversation.

**Parameter `--style`:**
| Value | Description |
|:---|:---|
| `two_professors` | Discussion between two professors |
| `teacher_student` | Teacher-student dialogue (default) |
| `two_students` | Discussion between two students |
| `interview` | Interview style |
| `problem_solving` | Problem solving |
| `layman_expert` | Layman and expert |
| `debate` | Debate style |

```bash
python run_synthesis.py -i data.jsonl -o output.jsonl -t conversation --style teacher_student
```
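The conversation prompts in `conversation_synthesis.py` ask the model to wrap its answer in `<discussions>`, `<conversation>`, or `<interaction>` tags. The sketch below shows one way such a response could be unwrapped. The helper name `extract_tagged_content` and its fallback behavior are assumptions for illustration; it is not the project's `parse_conversation_output`, whose implementation is not shown here.

```python
import re

# Tag names taken from the conversation prompts; the helper itself is a
# hypothetical illustration, not the project's parse_conversation_output.
CONVERSATION_TAGS = ("discussions", "conversation", "interaction")


def extract_tagged_content(response: str) -> str:
    """Return the text inside the first matching wrapper tag, else the raw response."""
    for tag in CONVERSATION_TAGS:
        # DOTALL lets "." match newlines, since conversations span many lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            return match.group(1).strip()
    # Assumed fallback: if the model ignored the tags, keep the whole response.
    return response.strip()
```

Falling back to the raw response keeps a record usable even when the model omits the wrapper tags, at the cost of possibly including extra text.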

### 3. Multi-style Rewrite (`rewrite`)

Rewrites mathematical content in a different style.

**Parameter `--style`:**
| Value | Description |
|:---|:---|
| `wikipedia` | Wikipedia style |
| `textbook` | Textbook style (default) |
| `blog` | Blog style |
| `popular_science` | Popular science style |
| `academic_paper` | Academic paper style |
| `learning_note` | Study notes style |
| `lecture_note` | Lecture notes style |

```bash
python run_synthesis.py -i data.jsonl -o output.jsonl -t rewrite --style textbook
```

### 4. Knowledge Extraction (`knowledge`)

Extracts knowledge points such as definitions, theorems, and properties from mathematical content.

```bash
python run_synthesis.py -i data.jsonl -o knowledge_output.jsonl -t knowledge
```

### 5. Textbook Exercise Generation (`textbook`)

Generates textbook-style exercises of varying difficulty from knowledge points.

**Parameter `--difficulty`:**
| Value | Description |
|:---|:---|
| `easy` | Easy (default) |
| `medium` | Medium |
| `hard` | Hard |

```bash
python run_synthesis.py -i knowledge.jsonl -o output.jsonl -t textbook --difficulty medium
```

**Note:** The input file must contain a `knowledge_point` field (the field name can be changed with `--knowledge-field`).
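Since the `textbook` task expects one `knowledge_point` per input record, the output of the `knowledge` task may need to be flattened first. The sketch below illustrates one way to do that. It assumes the knowledge output stores a list of strings under `synthesis_result.knowledge_points` (the key name that `app.py` reads from the parsed result); the function name `knowledge_to_textbook_records` is hypothetical and the real output layout may differ.

```python
import json


def knowledge_to_textbook_records(lines, field="knowledge_point"):
    """Flatten knowledge-task output lines into one record per knowledge point."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        item = json.loads(line)
        # Assumed layout: synthesis_result.knowledge_points is a list of strings.
        result = item.get("synthesis_result") or {}
        for point in result.get("knowledge_points", []):
            records.append({field: point})
    return records
```

Each returned record can then be written as one line of `knowledge.jsonl`, for example with `json.dumps(record, ensure_ascii=False)`.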

## ⚙️ Parameters

| Parameter | Description | Default |
|:---|:---|:---|
| `-i, --input` | Input JSONL file path | Required |
| `-o, --output` | Output JSONL file path | Required |
| `-t, --task` | Task type: `qa`, `conversation`, `rewrite`, `knowledge`, `textbook` | Required |
| `--level` | Q&A difficulty level | `high_school` |
| `--style` | Conversation/rewrite style | - |
| `--difficulty` | Textbook exercise difficulty | `easy` |
| `--text-field` | Input text field name | `text` |
| `--knowledge-field` | Knowledge point field name | `knowledge_point` |
| `--api-key` | OpenAI API key | Environment variable |
| `--base-url` | API base URL | Environment variable |
| `--model` | Model name | `gpt-4o` |
| `--temperature` | Sampling temperature | `0.7` |
| `--max-tokens` | Maximum number of generated tokens | `4096` |
| `-w, --workers` | Number of concurrent workers | `10` |
| `--max-retries` | Maximum number of retries | `3` |
| `--limit` | Limit on the number of samples processed | - |
| `-q, --quiet` | Quiet mode | `False` |
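For readers who want to see how the flags in the table fit together, here is a minimal `argparse` sketch that declares them with the documented defaults. It is an illustration only: the function name `build_parser` is an assumption, and the actual parser in `run_synthesis.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; a sketch, not the actual run_synthesis.py parser."""
    p = argparse.ArgumentParser(description="L3 math data synthesis (illustrative)")
    p.add_argument("-i", "--input", required=True, help="input JSONL path")
    p.add_argument("-o", "--output", required=True, help="output JSONL path")
    p.add_argument("-t", "--task", required=True,
                   choices=["qa", "conversation", "rewrite", "knowledge", "textbook"])
    p.add_argument("--level", default="high_school")
    p.add_argument("--style", default=None)  # table lists no default ("-")
    p.add_argument("--difficulty", default="easy")
    p.add_argument("--text-field", default="text")
    p.add_argument("--knowledge-field", default="knowledge_point")
    p.add_argument("--api-key", default=None)   # documented default: environment variable
    p.add_argument("--base-url", default=None)  # documented default: environment variable
    p.add_argument("--model", default="gpt-4o")
    p.add_argument("--temperature", type=float, default=0.7)
    p.add_argument("--max-tokens", type=int, default=4096)
    p.add_argument("-w", "--workers", type=int, default=10)
    p.add_argument("--max-retries", type=int, default=3)
    p.add_argument("--limit", type=int, default=None)
    p.add_argument("-q", "--quiet", action="store_true")
    return p
```

With this sketch, `build_parser().parse_args(["-i", "data.jsonl", "-o", "output.jsonl", "-t", "qa"])` fills every optional flag with the default listed in the table.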

## 📊 Input/Output Format

**Input:** JSONL, one JSON object per line (see `example_data.jsonl`):

```jsonl
{"text": "The quadratic formula states that for any quadratic equation..."}
{"text": "The Pythagorean theorem is a fundamental relation..."}
```

**Output:** the original record with an added `synthesis_result` field:

```json
{
  "text": "Original mathematical content",
  "synthesis_result": {
    "raw": "Full response",
    "problem": "Generated problem",
    "solution": "Detailed solution"
  }
}
```
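The input-to-output mapping described above can be sketched in a few lines of Python. The helper name `attach_result` is hypothetical and only illustrates the record shape; the real script also handles API calls, retries, and concurrency, which are omitted here.

```python
import json


def attach_result(line: str, result: dict) -> str:
    """Parse one input JSONL line, add `synthesis_result`, and re-serialize it."""
    record = json.loads(line)
    record["synthesis_result"] = result
    # ensure_ascii=False keeps non-ASCII math text readable in the output file.
    return json.dumps(record, ensure_ascii=False)
```

All original fields of the input record are preserved, so downstream steps can still read `text` next to the generated `problem` and `solution`.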

## 🔌 Compatibility with Other APIs

Works with any OpenAI-compatible API (such as Qwen, DeepSeek, or vLLM):

```bash
# Use the Alibaba Cloud Qwen API
export OPENAI_API_KEY="your-dashscope-api-key"
export OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

python run_synthesis.py -i data.jsonl -o output.jsonl -t qa --model qwen-plus
```
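Switching providers comes down to which key and base URL the client receives. The sketch below shows one way to resolve both values from the environment, using the same fallback URL as `app.py` (`https://api.openai.com/v1`). The function name `resolve_api_config` and the choice to raise on a missing key are assumptions made for illustration.

```python
import os

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_api_config(env=None):
    """Return (api_key, base_url) from an environment-style mapping."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    # An unset or empty OPENAI_BASE_URL falls back to the default endpoint.
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return api_key, base_url
```

The returned pair can be passed to `AsyncOpenAI(api_key=..., base_url=...)`, which is how `app.py` constructs its client.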

## 📜 License

This project is released under the [Apache 2.0](../LICENSE) license.
app.py
ADDED
@@ -0,0 +1,340 @@
# -*- coding: utf-8 -*-
"""
UltraData-Math L3 Generator - Hugging Face Space Demo
"""

import os
import asyncio
import json
import gradio as gr

from openai import AsyncOpenAI

from qa_synthesis import QA_PROMPTS, get_qa_prompt
from conversation_synthesis import CONVERSATION_PROMPTS, get_conversation_prompt
from multistyle_rewrite import MULTISTYLE_PROMPTS, get_multistyle_prompt
from knowledge_textbook import (
    get_knowledge_extraction_prompt,
    get_textbook_exercise_prompt,
    TEXTBOOK_EXERCISE_PROMPTS,
)
from run_synthesis import (
    parse_qa_output,
    parse_conversation_output,
    parse_rewrite_output,
    parse_knowledge_output,
    parse_textbook_output,
)

# API configuration is read from environment variables (set via HF Secrets)
API_KEY = os.getenv("OPENAI_API_KEY")
BASE_URL = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
DEFAULT_MODEL = "gpt-4o"


async def call_api(prompt: str, model: str = DEFAULT_MODEL, temperature: float = 0.7) -> str:
    """Call the API to generate content."""
    if not API_KEY:
        return "Error: API Key not configured. Please contact administrator."

    client = AsyncOpenAI(api_key=API_KEY, base_url=BASE_URL)
    try:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=4096,
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"


def run_async(coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        # No event loop in this thread (e.g. a worker thread): create one.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
    return loop.run_until_complete(coro)


# ============================================================================
# Task Handlers
# ============================================================================

def qa_synthesis(text: str, level: str, model: str, temperature: float):
    """Q&A pair synthesis."""
    if not text.strip():
        return "", "", ""

    prompt_template = get_qa_prompt(level)
    prompt = prompt_template.format(text=text)

    response = run_async(call_api(prompt, model, temperature))
    parsed = parse_qa_output(response)

    return (
        parsed.get("problem", ""),
        parsed.get("solution", ""),
        response
    )


def conversation_synthesis(text: str, style: str, model: str, temperature: float):
    """Multi-turn conversation synthesis."""
    if not text.strip():
        return "", ""

    prompt_template = get_conversation_prompt(style)
    prompt = prompt_template.format(text=text)

    response = run_async(call_api(prompt, model, temperature))
    parsed = parse_conversation_output(response)

    return parsed.get("content", response), response


def rewrite_synthesis(text: str, style: str, model: str, temperature: float):
    """Multi-style rewrite."""
    if not text.strip():
        return "", ""

    prompt_template = get_multistyle_prompt(style)
    prompt = prompt_template.format(text=text)

    response = run_async(call_api(prompt, model, temperature))
    parsed = parse_rewrite_output(response)

    return parsed.get("rewritten", response), response


def knowledge_extraction(text: str, model: str, temperature: float):
    """Knowledge point extraction."""
    if not text.strip():
        return "", ""

    prompt_template = get_knowledge_extraction_prompt()
    prompt = prompt_template.format(text=text)

    response = run_async(call_api(prompt, model, temperature))
    parsed = parse_knowledge_output(response)

    knowledge_points = parsed.get("knowledge_points", [])
    formatted = "\n\n---\n\n".join(knowledge_points) if knowledge_points else "No knowledge points extracted."

    return formatted, response


def textbook_exercise(knowledge_point: str, difficulty: str, model: str, temperature: float):
    """Textbook exercise generation."""
    if not knowledge_point.strip():
        return "", ""

    prompt_template = get_textbook_exercise_prompt(difficulty)
    prompt = prompt_template.format(mathematical_knowledge_point=knowledge_point)

    response = run_async(call_api(prompt, model, temperature))
    parsed = parse_textbook_output(response)

    return parsed.get("material", response), response


# ============================================================================
# Gradio UI
# ============================================================================

custom_css = """
.gradio-container {
    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif !important;
    background: linear-gradient(135deg, #1a1a2e 0%, #16213e 50%, #0f3460 100%) !important;
}

.main-title {
    font-weight: 700 !important;
    font-size: 2.2rem !important;
    background: linear-gradient(90deg, #e94560, #f39c12, #00d9ff) !important;
    -webkit-background-clip: text !important;
    -webkit-text-fill-color: transparent !important;
    background-clip: text !important;
    text-align: center !important;
}

.subtitle {
    text-align: center !important;
    color: #94a3b8 !important;
    font-size: 1rem !important;
    margin-bottom: 1.5rem !important;
}

.gr-button-primary {
    background: linear-gradient(135deg, #e94560 0%, #f39c12 100%) !important;
    border: none !important;
    font-weight: 600 !important;
}

.gr-button-primary:hover {
    transform: translateY(-2px) !important;
    box-shadow: 0 8px 25px rgba(233, 69, 96, 0.4) !important;
}

footer {
    display: none !important;
}
"""

with gr.Blocks(title="UltraData-Math L3 Generator", css=custom_css) as demo:
    gr.HTML('<h1 class="main-title">🧮 UltraData-Math L3 Generator</h1>')
    gr.HTML('<p class="subtitle">LLM-based Mathematical Data Synthesis Tool</p>')

    with gr.Row():
        model_select = gr.Dropdown(
            choices=["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"],
            value="gpt-4o",
            label="Model",
            scale=1,
        )
        temperature = gr.Slider(
            minimum=0.0, maximum=1.5, value=0.7, step=0.1,
            label="Temperature",
            scale=1,
        )

    with gr.Tabs():
        # Q&A Synthesis Tab
        with gr.TabItem("📝 Q&A Synthesis"):
            gr.Markdown("Generates question-answer pairs from mathematical content, graded by education level.")
            with gr.Row():
                with gr.Column():
                    qa_input = gr.Textbox(
                        label="Input Mathematical Content",
                        placeholder="Enter mathematical content here...",
                        lines=8,
                    )
                    qa_level = gr.Radio(
                        choices=list(QA_PROMPTS.keys()),
                        value="high_school",
                        label="Difficulty Level",
                    )
                    qa_btn = gr.Button("🚀 Generate Q&A", variant="primary")
                with gr.Column():
                    qa_problem = gr.Textbox(label="Generated Problem", lines=4)
                    qa_solution = gr.Textbox(label="Generated Solution", lines=8)
                    qa_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)

            qa_btn.click(
                qa_synthesis,
                inputs=[qa_input, qa_level, model_select, temperature],
                outputs=[qa_problem, qa_solution, qa_raw],
            )

        # Conversation Synthesis Tab
        with gr.TabItem("💬 Conversation Synthesis"):
            gr.Markdown("Converts mathematical content into a multi-turn conversation.")
            with gr.Row():
                with gr.Column():
                    conv_input = gr.Textbox(
                        label="Input Mathematical Content",
                        placeholder="Enter mathematical content here...",
                        lines=8,
                    )
                    conv_style = gr.Radio(
                        choices=list(CONVERSATION_PROMPTS.keys()),
                        value="teacher_student",
                        label="Conversation Style",
                    )
                    conv_btn = gr.Button("🚀 Generate Conversation", variant="primary")
                with gr.Column():
                    conv_output = gr.Textbox(label="Generated Conversation", lines=15)
                    conv_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)

            conv_btn.click(
                conversation_synthesis,
                inputs=[conv_input, conv_style, model_select, temperature],
                outputs=[conv_output, conv_raw],
            )

        # Rewrite Tab
        with gr.TabItem("✨ Multi-style Rewrite"):
            gr.Markdown("Rewrites mathematical content in a different style.")
            with gr.Row():
                with gr.Column():
                    rewrite_input = gr.Textbox(
                        label="Input Mathematical Content",
                        placeholder="Enter mathematical content here...",
                        lines=8,
                    )
                    rewrite_style = gr.Radio(
                        choices=list(MULTISTYLE_PROMPTS.keys()),
                        value="textbook",
                        label="Rewrite Style",
                    )
                    rewrite_btn = gr.Button("🚀 Rewrite", variant="primary")
                with gr.Column():
                    rewrite_output = gr.Textbox(label="Rewritten Content", lines=15)
                    rewrite_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)

            rewrite_btn.click(
                rewrite_synthesis,
                inputs=[rewrite_input, rewrite_style, model_select, temperature],
                outputs=[rewrite_output, rewrite_raw],
            )

        # Knowledge Extraction Tab
        with gr.TabItem("📚 Knowledge Extraction"):
            gr.Markdown("Extracts knowledge points such as definitions, theorems, and properties from mathematical content.")
            with gr.Row():
                with gr.Column():
                    know_input = gr.Textbox(
                        label="Input Mathematical Content",
                        placeholder="Enter mathematical content here...",
                        lines=10,
                    )
                    know_btn = gr.Button("🚀 Extract Knowledge", variant="primary")
                with gr.Column():
                    know_output = gr.Textbox(label="Extracted Knowledge Points", lines=15)
                    know_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)

            know_btn.click(
                knowledge_extraction,
                inputs=[know_input, model_select, temperature],
                outputs=[know_output, know_raw],
            )

        # Textbook Exercise Tab
        with gr.TabItem("📖 Textbook Exercise"):
            gr.Markdown("Generates textbook-style exercises of varying difficulty from knowledge points.")
            with gr.Row():
                with gr.Column():
                    textbook_input = gr.Textbox(
                        label="Input Knowledge Point",
                        placeholder="Enter a mathematical knowledge point...",
                        lines=6,
                    )
                    textbook_diff = gr.Radio(
                        choices=list(TEXTBOOK_EXERCISE_PROMPTS.keys()),
                        value="easy",
                        label="Difficulty",
                    )
                    textbook_btn = gr.Button("🚀 Generate Exercise", variant="primary")
                with gr.Column():
                    textbook_output = gr.Textbox(label="Generated Exercise Material", lines=15)
                    textbook_raw = gr.Textbox(label="Raw Response", lines=4, visible=False)

            textbook_btn.click(
                textbook_exercise,
                inputs=[textbook_input, textbook_diff, model_select, temperature],
                outputs=[textbook_output, textbook_raw],
            )

    gr.HTML("""
    <div style="text-align: center; margin-top: 2rem; padding: 1rem; color: #64748b; font-size: 0.85rem;">
        <p>🔬 <strong>UltraData-Math L3 Generator</strong> - Part of the UltraData-Math Project</p>
        <p>LLM-based data synthesis for Q&A, conversations, rewriting, and more.</p>
    </div>
    """)


if __name__ == "__main__":
    demo.launch(ssr_mode=False)
conversation_synthesis.py
ADDED
@@ -0,0 +1,167 @@
# -*- coding: utf-8 -*-
"""
UltraData-Math L3 - Conversation Synthesis Prompts

Reference: MIND
Conversation types: Two Professors, Teacher-Student, Two Students, Interview, Problem Solving, Layman-Expert, Debate
"""

# ============================================================================
# Two Professors Discussion
# ============================================================================

MATH_INSTRUCT_TWO_PROFESSORS_PROMPT = '''Math Content:{text}

As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
Your goal is to use your abilities to convert the provided math content into multi-turn discussions between two professors, according to the following requirements.
- Make sure that their discussions strictly adhere to the provided math content and remain faithful to the information in the provided math content.
- Please DO NOT add any new information/reference other than the provided math content.
- All mathematical expressions in the discussions must be formatted using LaTeX.
Finally, please put the discussions within <discussions></discussions>.
The result format is as follows:
<discussions></discussions>

In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''


# ============================================================================
# Teacher-Student Discussion
# ============================================================================

MATH_INSTRUCT_TEACHER_STUDENT_PROMPT = '''Math Content:{text}

As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
Your goal is to use your abilities to convert the provided math content into multi-turn discussions between a teacher and a student, according to the following requirements.
- The student has questions about the provided math content and the teacher solves each of them step-by-step.
- Make sure that their discussions strictly adhere to the provided math content and remain faithful to the information in the provided math content.
- Please DO NOT add any new information/reference other than the provided math content.
- All mathematical expressions in the discussions must be formatted using LaTeX.
Finally, please put the discussions within <discussions></discussions>.
The result format is as follows:
<discussions></discussions>

In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''


# ============================================================================
# Two Students Discussion
# ============================================================================

MATH_INSTRUCT_TWO_STUDENTS_PROMPT = '''Math Content:{text}

As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
Your goal is to use your abilities to convert the provided math content into multi-turn discussions between two students who are working on their assignment related to the provided math content, according to the following requirements.
- Make sure that their discussions strictly adhere to the provided math content and remain faithful to the information in the provided math content.
- Please DO NOT add any new information/reference other than the provided math content.
- All mathematical expressions in the discussions must be formatted using LaTeX.
Finally, please put the discussions within <discussions></discussions>.
The result format is as follows:
<discussions></discussions>

In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''


# ============================================================================
# Interview Style
# ============================================================================

MATH_INSTRUCT_INTERVIEW_PROMPT = '''Math Content:{text}

As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
Your goal is to use your abilities to convert the provided math content into a multi-turn interview-style conversation between an interviewer and an interviewee, according to the following requirements.
- One participant acts as the interviewer who asks questions exclusively related to the provided math content, while the other participant serves as the subject matter expert, providing detailed responses based on the provided math content.
- Make sure that their conversation strictly adheres to the provided math content and remains faithful to the information in the provided math content.
- Please DO NOT add any new information/reference other than the provided math content.
- All mathematical expressions in the conversation must be formatted using LaTeX.
Finally, please put the conversation within <conversation></conversation>.
The result format is as follows:
<conversation></conversation>

In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''


# ============================================================================
# Problem Solving
# ============================================================================

MATH_INSTRUCT_PROBLEM_SOLVING_PROMPT = '''Math Content:{text}

As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis, and content creation.
Your goal is to use your abilities to convert the provided math content into a multi-turn problem-solving conversation, according to the following requirements.
- Participants analyze challenges or scenarios presented in the provided math content and brainstorm solutions within the provided math content, avoiding speculation or unrelated discussions.
- Make sure that their conversation strictly adheres to the provided math content and remains faithful to the information in the provided math content.
- Please DO NOT add any new information/reference other than the provided math content.
- All mathematical expressions in the conversation must be formatted using LaTeX.
Finally, please put the conversation within <conversation></conversation>.
The result format is as follows:
<conversation></conversation>

In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''


| 102 |
+
# ============================================================================
|
| 103 |
+
# Layman-Expert
|
| 104 |
+
# ============================================================================
|
| 105 |
+
|
| 106 |
+
MATH_INSTRUCT_LAYMAN_EXPERT_PROMPT = '''Math Content:{text}
|
| 107 |
+
|
| 108 |
+
As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge, mathematical content analysis and creating.
|
| 109 |
+
Your goal is to utilize your abilities, convert the provided math content as a multi-turn interaction between a layman and a expert, according to the following requirements.
|
| 110 |
+
- While the expert presents the provided math content step by step to the layman, the layman asks many follow-up questions about the presentation. The expert answers the questions step by step with chain-of-thought reasoning.
|
| 111 |
+
- Make sure that their interaction strictly adheres to the provided math content and remains faithful to the information in it.
|
| 112 |
+
- Please DO NOT add any new information or references beyond the provided math content.
|
| 113 |
+
- All mathematical expressions in the interaction must be formatted using LaTeX.
|
| 114 |
+
Finally, please put the interaction within <interaction></interaction>.
|
| 115 |
+
The result format is as follows:
|
| 116 |
+
<interaction></interaction>
|
| 117 |
+
|
| 118 |
+
In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
# ============================================================================
|
| 122 |
+
# Debate Style
|
| 123 |
+
# ============================================================================
|
| 124 |
+
|
| 125 |
+
MATH_INSTRUCT_DEBATE_PROMPT = '''Math Content:{text}
|
| 126 |
+
|
| 127 |
+
As a mathematics expert and mathematics content creation expert, you are highly proficient in mathematical knowledge and in analyzing and creating mathematical content.
|
| 128 |
+
Your goal is to use your abilities to convert the provided math content into a multi-turn debate-style conversation, according to the following requirements.
|
| 129 |
+
- The participants present arguments and counterarguments based solely on the provided math content, without introducing external information or personal opinions. Each participant defends their arguments step by step with chain-of-thought reasoning.
|
| 130 |
+
- Make sure that their conversation strictly adheres to the provided math content and remains faithful to the information in it.
|
| 131 |
+
- Please DO NOT add any new information or references beyond the provided math content.
|
| 132 |
+
- All mathematical expressions in the conversation must be formatted using LaTeX.
|
| 133 |
+
|
| 134 |
+
The result format is as follows:
|
| 135 |
+
<conversation></conversation>
|
| 136 |
+
|
| 137 |
+
In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
# ============================================================================
|
| 141 |
+
# Prompt Registry
|
| 142 |
+
# ============================================================================
|
| 143 |
+
|
CONVERSATION_PROMPTS = {
    "two_professors": MATH_INSTRUCT_TWO_PROFESSORS_PROMPT,
    "teacher_student": MATH_INSTRUCT_TEACHER_STUDENT_PROMPT,
    "two_students": MATH_INSTRUCT_TWO_STUDENTS_PROMPT,
    "interview": MATH_INSTRUCT_INTERVIEW_PROMPT,
    "problem_solving": MATH_INSTRUCT_PROBLEM_SOLVING_PROMPT,
    "layman_expert": MATH_INSTRUCT_LAYMAN_EXPERT_PROMPT,
    "debate": MATH_INSTRUCT_DEBATE_PROMPT,
}


def get_conversation_prompt(style: str) -> str:
    """
    Get conversation synthesis prompt for specified style

    Args:
        style: Conversation style, see CONVERSATION_PROMPTS.keys() for options

    Returns:
        Corresponding prompt template string
    """
    if style not in CONVERSATION_PROMPTS:
        raise ValueError(f"Unknown style: {style}. Available styles: {list(CONVERSATION_PROMPTS.keys())}")
    return CONVERSATION_PROMPTS[style]
|
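A minimal sketch of how one of these templates might be used: substitute a passage for the `{text}` placeholder, then pull the model's reply out of the `<conversation>` wrapper. The `TEMPLATE` stand-in and the helper names `fill_prompt` and `extract_tagged` are hypothetical and not part of this commit; how `run_synthesis.py` actually fills and parses prompts is not shown here.

```python
import re
from typing import Optional

# Hypothetical stand-in for one of the templates above; only the {text}
# placeholder and the <conversation> wrapper matter for this sketch.
TEMPLATE = (
    "Math Content:{text}\n\n"
    "Finally, please put the conversation within <conversation></conversation>."
)


def fill_prompt(template: str, text: str) -> str:
    """Substitute the source passage for the {text} placeholder."""
    # str.replace is used instead of str.format so that any other braces
    # in a template (an assumption, none appear above) cannot raise errors.
    return template.replace("{text}", text)


def extract_tagged(response: str, tag: str = "conversation") -> Optional[str]:
    """Return the text inside <tag>...</tag>, or None if the tags are missing."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else None


prompt = fill_prompt(TEMPLATE, "The discriminant is b^2 - 4ac.")
reply = "<conversation>\nA: What is the discriminant?\nB: It is $b^2 - 4ac$.\n</conversation>"
conversation = extract_tagged(reply)
```

Returning `None` for a reply without the wrapper lets the caller decide whether to retry or drop the sample. The same helper would work for the layman-expert style by passing `tag="interaction"`.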
example_data.jsonl
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
{"text": "The quadratic formula states that for any quadratic equation of the form ax² + bx + c = 0, where a ≠ 0, the solutions are given by x = (-b ± √(b² - 4ac)) / (2a). The expression b² - 4ac is called the discriminant. When the discriminant is positive, the equation has two distinct real roots; when it equals zero, there is exactly one real root (a repeated root); when it is negative, the equation has two complex conjugate roots."}
|
| 2 |
+
{"text": "The Pythagorean theorem is a fundamental relation in Euclidean geometry among the three sides of a right triangle. It states that the area of the square whose side is the hypotenuse (the side opposite the right angle) is equal to the sum of the areas of the squares on the other two sides. This can be written as a² + b² = c², where c represents the length of the hypotenuse and a and b represent the lengths of the triangle's other two sides."}
|
| 3 |
+
{"text": "In calculus, the derivative of a function measures the sensitivity to change of the function value with respect to a change in its argument. The derivative of f(x) with respect to x is written as f'(x) or df/dx. For example, the derivative of f(x) = x² is f'(x) = 2x, which means the rate of change of x² at any point x is 2x."}
|
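Each line of `example_data.jsonl` is a JSON object with a single `text` field. A minimal sketch of reading such a file, assuming one record per line and tolerating blank lines; the helper name `iter_texts` and the in-memory `SAMPLE` are illustrative only and not part of this commit.

```python
import io
import json
from typing import Iterator, TextIO


def iter_texts(handle: TextIO, field: str = "text") -> Iterator[str]:
    """Yield the chosen field from each non-empty JSON Lines record."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield record[field]


# In-memory sample standing in for example_data.jsonl.
SAMPLE = (
    '{"text": "a^2 + b^2 = c^2"}\n'
    "\n"
    '{"text": "The derivative of x^2 is 2x."}\n'
)
texts = list(iter_texts(io.StringIO(SAMPLE)))
```

For the real file, the same function can be called on `open("example_data.jsonl", encoding="utf-8")`; the explicit encoding matters because the records contain non-ASCII characters such as `²` and `√`.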
knowledge_textbook.py
ADDED
|
@@ -0,0 +1,168 @@
|
| 1 |
+
# -*- coding: utf-8 -*-
|
| 2 |
+
"""
|
| 3 |
+
UltraData-Math L3 - Knowledge Extraction & Textbook Exercise Prompts
|
| 4 |
+
|
| 5 |
+
Features:
|
| 6 |
+
1. Knowledge Extraction: Extract definitions, axioms, theorems, properties from math content
|
| 7 |
+
2. Textbook Exercise Generation: Generate exercises at different difficulty levels (Easy/Medium/Hard)
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
# ============================================================================
|
| 11 |
+
# Knowledge Point Extraction
|
| 12 |
+
# ============================================================================
|
| 13 |
+
|
| 14 |
+
MATH_INSTRUCT_KNOWLEDGE_EXTRACTION_PROMPT = '''Math Content:{text}
|
| 15 |
+
|
| 16 |
+
As a math teacher, you are highly proficient in mathematical knowledge.
|
| 17 |
+
Your goal is to use your abilities to extract mathematical knowledge points from the provided math content.
|
| 18 |
+
You should follow these steps:
|
| 19 |
+
1. First, if the provided math content does not include specific mathematical definitions, axioms, assumptions, hypotheses, conjectures, propositions, lemmas, theorems, corollaries, properties, or proofs, return 'no result' directly.
|
| 20 |
+
2. Then, carefully read the provided math content and provide mathematical knowledge points according to the following requirements.
|
| 21 |
+
- Each mathematical knowledge point must be a specific mathematical definition, axiom, assumption, hypothesis, conjecture, proposition, lemma, theorem, corollary, property, or proof. Otherwise, it must not be output.
|
| 22 |
+
- The mathematical knowledge point must be findable within the provided math content. Otherwise, it must not be output.
|
| 23 |
+
- The beginning of the mathematical knowledge point must state specific mathematical definitions, axioms, assumptions, hypotheses, conjectures, propositions, lemmas, theorems, corollaries, properties, and proofs.
|
| 24 |
+
- The mathematical knowledge point must not be repeated.
|
| 25 |
+
- The mathematical knowledge point must be clear, concise, accurate, and easy to learn.
|
| 26 |
+
- The mathematical knowledge point may appropriately include relevant explanations to make the knowledge point more complete.
|
| 27 |
+
- All mathematical expressions in the mathematical knowledge point must be formatted using LaTeX.
|
| 28 |
+
|
| 29 |
+
The result format is as follows:
|
| 30 |
+
<mathematical knowledge point1></mathematical knowledge point1>
|
| 31 |
+
<mathematical knowledge point2></mathematical knowledge point2>
|
| 32 |
+
and more
|
| 33 |
+
|
| 34 |
+
In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
# ============================================================================
|
| 38 |
+
# Textbook Exercise - Easy
|
| 39 |
+
# ============================================================================
|
| 40 |
+
|
| 41 |
+
MATH_INSTRUCT_TEXTBOOK_EASY_PROMPT = '''Mathematical Knowledge Point:{mathematical_knowledge_point}
|
| 42 |
+
|
| 43 |
+
As a math teacher, you are highly proficient in mathematical knowledge.
|
| 44 |
+
Your goal is to use your abilities to generate informative, textbook-style mathematical learning material suitable for students.
|
| 45 |
+
You should follow these steps:
|
| 46 |
+
1. First, provide a detailed explanation based on the given mathematical knowledge point.
|
| 47 |
+
2. Second, generate an exercise based on the provided explanation according to the following requirements.
|
| 48 |
+
- The exercise must be self-contained.
|
| 49 |
+
- Ensure the exercise is fully text-based and solvable without images.
|
| 50 |
+
3. Third, provide a solution based on the generated exercise according to the following requirements.
|
| 51 |
+
- The solution must be detailed and step-by-step.
|
| 52 |
+
4. Finally, construct the generated explanation, exercise, and solution into textbook-style learning material according to the following requirements.
|
| 53 |
+
- The material must be logically structured, information-dense, concise and easy to learn.
|
| 54 |
+
- The material must be accurate to avoid misleading students.
|
| 55 |
+
- The material must maintain a formal and educational tone and avoid casual expressions.
|
| 56 |
+
- The explanation must be at the beginning of the material.
|
| 57 |
+
- The exercise in the material must start with 'The exercise:'.
|
| 58 |
+
- The solution in the material must start with 'The solution:'.
|
| 59 |
+
- All mathematical expressions in the material must be formatted using LaTeX.
|
| 60 |
+
|
| 61 |
+
The result format is as follows.
|
| 62 |
+
<material></material>
|
| 63 |
+
|
| 64 |
+
In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
# ============================================================================
|
| 68 |
+
# Textbook Exercise - Medium
|
| 69 |
+
# ============================================================================
|
| 70 |
+
|
| 71 |
+
MATH_INSTRUCT_TEXTBOOK_MEDIUM_PROMPT = '''Mathematical Knowledge Point:{mathematical_knowledge_point}
|
| 72 |
+
|
| 73 |
+
As a math teacher, you are highly proficient in mathematical knowledge.
|
| 74 |
+
Your goal is to use your abilities to generate informative, textbook-style mathematical learning material suitable for students.
|
| 75 |
+
You should follow these steps:
|
| 76 |
+
1. First, provide a detailed explanation based on the given mathematical knowledge point.
|
| 77 |
+
2. Second, generate a medium-difficulty exercise based on the provided explanation according to the following requirements.
|
| 78 |
+
- The goal of the exercise is to help students master the given mathematical knowledge point.
|
| 79 |
+
- Other mathematical knowledge points can be incorporated into the exercises to increase the difficulty to medium level.
|
| 80 |
+
- The exercise must be self-contained.
|
| 81 |
+
- Ensure the exercise is fully text-based and solvable without images.
|
| 82 |
+
3. Third, provide a solution based on the generated exercise according to the following requirements.
|
| 83 |
+
- The solution must be detailed and step-by-step.
|
| 84 |
+
4. Finally, construct the generated explanation, exercise, and solution into textbook-style learning material according to the following requirements.
|
| 85 |
+
- The material must be logically structured, information-dense, concise and easy to learn.
|
| 86 |
+
- The material must be accurate to avoid misleading students.
|
| 87 |
+
- The material must maintain a formal and educational tone and avoid casual expressions.
|
| 88 |
+
- The explanation must be at the beginning of the material.
|
| 89 |
+
- The exercise in the material must start with 'The exercise:'.
|
| 90 |
+
- The solution in the material must start with 'The solution:'.
|
| 91 |
+
- All mathematical expressions in the material must be formatted using LaTeX.
|
| 92 |
+
|
| 93 |
+
The result format is as follows.
|
| 94 |
+
<material></material>
|
| 95 |
+
|
| 96 |
+
In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
# ============================================================================
|
| 100 |
+
# Textbook Exercise - Hard
|
| 101 |
+
# ============================================================================
|
| 102 |
+
|
| 103 |
+
MATH_INSTRUCT_TEXTBOOK_HARD_PROMPT = '''Mathematical Knowledge Point:{mathematical_knowledge_point}
|
| 104 |
+
|
| 105 |
+
As a math teacher, you are highly proficient in mathematical knowledge.
|
| 106 |
+
Your goal is to use your abilities to generate informative, textbook-style mathematical learning material suitable for students.
|
| 107 |
+
You should follow these steps:
|
| 108 |
+
1. First, provide a detailed explanation based on the given mathematical knowledge point.
|
| 109 |
+
2. Second, generate a hard-difficulty exercise based on the provided explanation according to the following requirements.
|
| 110 |
+
- The goal of the exercise is to help students deeply understand and comprehensively apply the given mathematical knowledge point.
|
| 111 |
+
- Other mathematical knowledge points can be incorporated into the exercises to increase the difficulty to hard level.
|
| 112 |
+
- The exercise must be self-contained.
|
| 113 |
+
- Ensure the exercise is fully text-based and solvable without images.
|
| 114 |
+
3. Third, provide a solution based on the generated exercise according to the following requirements.
|
| 115 |
+
- The solution must be detailed and step-by-step.
|
| 116 |
+
4. Finally, construct the generated explanation, exercise, and solution into textbook-style learning material according to the following requirements.
|
| 117 |
+
- The material must be logically structured, information-dense, concise and easy to learn.
|
| 118 |
+
- The material must be accurate to avoid misleading students.
|
| 119 |
+
- The material must maintain a formal and educational tone and avoid casual expressions.
|
| 120 |
+
- The explanation must be at the beginning of the material.
|
| 121 |
+
- The exercise in the material must start with 'The exercise:'.
|
| 122 |
+
- The solution in the material must start with 'The solution:'.
|
| 123 |
+
- All mathematical expressions in the material must be formatted using LaTeX.
|
| 124 |
+
|
| 125 |
+
The result format is as follows.
|
| 126 |
+
<material></material>
|
| 127 |
+
|
| 128 |
+
In addition, the output must refrain from using Markdown: avoid bold or italic styles, and do not add any text decorations.'''
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
# ============================================================================
|
| 132 |
+
# Prompt Registry
|
| 133 |
+
# ============================================================================
|
| 134 |
+
|
KNOWLEDGE_PROMPTS = {
    "knowledge_extraction": MATH_INSTRUCT_KNOWLEDGE_EXTRACTION_PROMPT,
}

TEXTBOOK_EXERCISE_PROMPTS = {
    "easy": MATH_INSTRUCT_TEXTBOOK_EASY_PROMPT,
    "medium": MATH_INSTRUCT_TEXTBOOK_MEDIUM_PROMPT,
    "hard": MATH_INSTRUCT_TEXTBOOK_HARD_PROMPT,
}


def get_knowledge_extraction_prompt() -> str:
    """
    Get knowledge extraction prompt

    Returns:
        Knowledge extraction prompt template string
    """
    return MATH_INSTRUCT_KNOWLEDGE_EXTRACTION_PROMPT


def get_textbook_exercise_prompt(difficulty: str) -> str:
    """
    Get textbook exercise prompt for specified difficulty

    Args:
        difficulty: Difficulty level, options: "easy", "medium", "hard"

    Returns:
        Corresponding prompt template string
    """
    if difficulty not in TEXTBOOK_EXERCISE_PROMPTS:
        raise ValueError(f"Unknown difficulty: {difficulty}. Available: {list(TEXTBOOK_EXERCISE_PROMPTS.keys())}")
    return TEXTBOOK_EXERCISE_PROMPTS[difficulty]
|
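The two prompt families in `knowledge_textbook.py` appear designed to chain: knowledge points extracted by the first prompt fill the `{mathematical_knowledge_point}` placeholder of a textbook prompt. A rough sketch of the glue this might need, assuming the model follows the numbered-tag output format and the 'The exercise:' / 'The solution:' markers requested above. The helper names and the shortened `TEXTBOOK_TEMPLATE` stand-in are hypothetical and not part of this commit.

```python
import re
from typing import Dict, List

# Matches <mathematical knowledge pointN>...</mathematical knowledge pointN>,
# requiring the same number N in the opening and closing tag.
KNOWLEDGE_TAG = re.compile(
    r"<mathematical knowledge point(\d+)>(.*?)</mathematical knowledge point\1>",
    re.DOTALL,
)

# Hypothetical, shortened stand-in for a textbook exercise template.
TEXTBOOK_TEMPLATE = "Mathematical Knowledge Point:{mathematical_knowledge_point}\n\n..."


def parse_knowledge_points(response: str) -> List[str]:
    """Return the extracted knowledge points, or [] for a 'no result' reply."""
    if response.strip().strip("'\"").strip().lower() == "no result":
        return []
    return [body.strip() for _, body in KNOWLEDGE_TAG.findall(response) if body.strip()]


def fill_textbook_prompt(template: str, point: str) -> str:
    """Insert one knowledge point into a textbook exercise template."""
    return template.replace("{mathematical_knowledge_point}", point)


def split_material(material: str) -> Dict[str, str]:
    """Split generated material at the 'The exercise:' and 'The solution:' markers."""
    explanation, _, rest = material.partition("The exercise:")
    exercise, _, solution = rest.partition("The solution:")
    return {
        "explanation": explanation.strip(),
        "exercise": exercise.strip(),
        "solution": solution.strip(),
    }


reply = (
    "<mathematical knowledge point1>Definition: A</mathematical knowledge point1>\n"
    "<mathematical knowledge point2>Theorem: B</mathematical knowledge point2>"
)
points = parse_knowledge_points(reply)
prompts = [fill_textbook_prompt(TEXTBOOK_TEMPLATE, p) for p in points]
```

If a marker is missing, `split_material` leaves the corresponding field empty instead of raising, so malformed generations can be filtered afterwards.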
multistyle_rewrite.py
ADDED
|
@@ -0,0 +1,224 @@
|
| 1 |
+
# -*- coding: utf-8 -*-
|
| 2 |
+
"""
|
| 3 |
+
UltraData-Math L3 - Multi-Style Rewrite Prompts
|
| 4 |
+
|
| 5 |
+
Style types: Wikipedia, Textbook, Blog, Popular Science, Academic Paper, Learning Note, Lecture Note
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
# ============================================================================
|
| 9 |
+
# Wikipedia Style
|
| 10 |
+
# ============================================================================
|
| 11 |
+
|
| 12 |
+
MATH_INSTRUCT_WIKI_PROMPT = '''Math Content:{text}
|
| 13 |
+
|
| 14 |
+
As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
|
| 15 |
+
Your goal is to use your abilities to rewrite the provided math content in the wiki style.
|
| 16 |
+
Before beginning the rewrite, you will consider the following requirements:
|
| 17 |
+
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
|
| 18 |
+
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
|
| 19 |
+
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
|
| 20 |
+
2. Then, focus on the captured and preserved information, combine it with the wiki style, and rewrite the text to form an initial draft, according to the following requirements.
|
| 21 |
+
- The overall structure of the initial draft should follow the structure used by Wikipedia, employing a modular, encyclopedic organizational format.
|
| 22 |
+
- The sentence expression of the initial draft should follow the sentence expression used by Wikipedia, employing highly concise and objective declarative sentences. It should adhere to the "definition-first" principle, rigorously use standard terminology, maintain formal sentence structures, and avoid colloquial or personalized expressions.
|
| 23 |
+
- The overall tone of the initial draft should follow the tone used by Wikipedia, maintaining an absolutely neutral, authoritative, and impersonal encyclopedic tone.
|
| 24 |
+
3. Third, refine the initial draft according to the following requirements.
|
| 25 |
+
- The content of the refined content must be logically structured, high-quality, information-dense.
|
| 26 |
+
- The overall layout of the refined content must not use LaTeX formatting.
|
| 27 |
+
- The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
|
| 28 |
+
- All mathematical expressions in the refined content must be formatted using LaTeX.
|
| 29 |
+
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
|
| 30 |
+
|
| 31 |
+
The result format is as follows:
|
| 32 |
+
<rewritten content></rewritten content>'''
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
# ============================================================================
|
| 36 |
+
# Textbook Style
|
| 37 |
+
# ============================================================================
|
| 38 |
+
|
| 39 |
+
MATH_INSTRUCT_TEXTBOOK_PROMPT = '''Math Content:{text}
|
| 40 |
+
|
| 41 |
+
As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
|
| 42 |
+
Your goal is to use your abilities to rewrite the provided math content in the textbook style.
|
| 43 |
+
Before beginning the rewrite, you will consider the following requirements:
|
| 44 |
+
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
|
| 45 |
+
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
|
| 46 |
+
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
|
| 47 |
+
2. Then, focus on the captured and preserved information, combine it with the textbook style, and rewrite the text to form an initial draft, according to the following requirements.
|
| 48 |
+
- The overall structure of the initial draft should follow the structure used by Textbook, employing a rigorous logical progression system, unfolding through a modular structure of "definition-theorem/proof/formula/property-example".
|
| 49 |
+
- The sentence expression of the initial draft should follow the sentence expression used by Textbook, combining standardized and precise disciplinary terminology with guided instructional language while avoiding colloquialism or ambiguity to ensure the accuracy and teachability of knowledge points. It must be accurate and complete.
|
| 50 |
+
- The overall tone of the initial draft should follow the tone used by Textbook, maintaining an authoritative, neutral, objective, and inquiry-based instructional tone. It should foster a positive learning environment while preserving professionalism.
|
| 51 |
+
3. Third, refine the initial draft according to the following requirements.
|
| 52 |
+
- The content of the refined content must be logically structured, high-quality, information-dense.
|
| 53 |
+
- The overall layout of the refined content must not use LaTeX formatting.
|
| 54 |
+
- All examples in the refined content must include detailed and step-by-step solutions.
|
| 55 |
+
- All mathematical expressions in the refined content must be formatted using LaTeX.
|
| 56 |
+
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
|
| 57 |
+
|
| 58 |
+
The result format is as follows:
|
| 59 |
+
<rewritten content></rewritten content>'''
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
# ============================================================================
|
| 63 |
+
# Blog Style
|
| 64 |
+
# ============================================================================
|
| 65 |
+
|
| 66 |
+
MATH_INSTRUCT_BLOG_PROMPT = '''Math Content:{text}
|
| 67 |
+
|
| 68 |
+
As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
|
| 69 |
+
Your goal is to use your abilities to rewrite the provided math content in the blog style.
|
| 70 |
+
Before beginning the rewrite, you will consider the following requirements:
|
| 71 |
+
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
|
| 72 |
+
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
|
| 73 |
+
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
|
| 74 |
+
2. Then, focus on the captured and preserved information, combine it with the blog style, and rewrite the text to form an initial draft, according to the following requirements.
|
| 75 |
+
- The overall structure of the initial draft should follow the structure used by Blog, employing a modular yet flexible content arrangement. It typically begins with captivating titles or thought-provoking questions, utilizes short paragraphs and subheadings to enhance readability, and establishes a relaxed and free-flowing reading rhythm.
|
| 76 |
+
- The sentence expression of the initial draft should follow the sentence expression used by Blog, employing simple and conversational sentence patterns. It should prioritize short sentences, questions, and exclamations to create rhythm and interactivity, while avoiding lengthy and complex professional jargon. Analogies, metaphors, and real-life examples should be skillfully utilized to explain complex mathematical concepts, thereby lowering the reader's barrier to comprehension. It must be accurate and complete.
|
| 77 |
+
- The overall tone of the initial draft should follow the tone used by Blog, maintaining a relatable and natural conversational style with infectious enthusiasm, aiming to spark readers' interest and encourage interaction and sharing.
|
| 78 |
+
3. Third, refine the initial draft according to the following requirements.
|
| 79 |
+
- The content of the refined content must be logically structured, high-quality, information-dense.
|
| 80 |
+
- The overall layout of the refined content must not use LaTeX formatting.
|
| 81 |
+
- The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
|
| 82 |
+
- All mathematical expressions in the refined content must be formatted using LaTeX.
|
| 83 |
+
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
|
| 84 |
+
|
| 85 |
+
The result format is as follows:
|
| 86 |
+
<rewritten content></rewritten content>'''
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
# ============================================================================
|
| 90 |
+
# Popular Science Style
|
| 91 |
+
# ============================================================================
|
| 92 |
+
|
| 93 |
+
MATH_INSTRUCT_POPULAR_SCIENCE_PROMPT = '''Math Content:{text}
|
| 94 |
+
As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
|
| 95 |
+
|
| 96 |
+
Your goal is to use your abilities to rewrite the provided math content in the popular science style.
|
| 97 |
+
Before beginning the rewrite, you will consider the following requirements:
|
| 98 |
+
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
|
| 99 |
+
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
|
| 100 |
+
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
|
| 101 |
+
2. Then, focus on the captured and preserved information, combine it with the popular science style, and rewrite the text to form an initial draft, according to the following requirements.
|
| 102 |
+
- The overall structure of the initial draft should follow the structure used by Popular Science, guided by an engaging narrative thread or real-world problem. It should progressively unfold step by step, gradually guiding readers to understand core concepts and construct cognitive pathways of knowledge.
|
| 103 |
+
- The sentence expression of the initial draft should follow the sentence expression used by Popular Science, actively avoiding specialized terminology and complex symbols. It should employ vivid, sensory descriptions and make extensive use of metaphors, analogies, and imaginative imagery to explain abstract concepts, prioritizing experiential resonance over the accumulation of technical jargon. It must be accurate and complete.
|
| 104 |
+
- The overall tone of the initial draft should follow the tone used by Popular Science, maintaining a narrative style filled with wonder and enthusiastic exploration. It should foster a relatable and natural conversational atmosphere, aiming to spark the imagination and interest of general readers.
|
| 105 |
+
3. Third, refine the initial draft according to the following requirements.
|
| 106 |
+
- The content of the refined content must be logically structured, high-quality, information-dense.
|
| 107 |
+
- The overall layout of the refined content must not use LaTeX formatting.
|
| 108 |
+
- The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
|
| 109 |
+
- All mathematical expressions in the refined content must be formatted using LaTeX.
|
| 110 |
+
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.
|
| 111 |
+
|
| 112 |
+
The result format is as follows:
|
| 113 |
+
<rewritten content></rewritten content>'''


# ============================================================================
# Academic Paper Style
# ============================================================================

MATH_INSTRUCT_ACADEMIC_PAPER_PROMPT = '''Math Content:{text}

As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
Your goal is to utilize your abilities to rewrite the provided math content in the academic paper style.
Before beginning the rewrite, you will consider the following requirements:
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
2. Then, focus on the captured and preserved information, combine it with the academic paper style, and rewrite the text to form an initial draft, according to the following requirements.
- The overall structure of the initial draft should follow the structure used by Academic Paper, following highly standardized and rigorous formats, ensuring clear organization and logical progression.
- The sentence expression of the initial draft should follow the sentence expression used by Academic Paper, employing highly specialized disciplinary terminology and passive voice constructions, and utilizing complex sentence structures and quantitative expressions to ensure academic rigor, striving for absolute precision and clarity in order to avoid any ambiguity. It must be accurate and complete.
- The overall tone of the initial draft should follow the tone used by Academic Paper, maintaining an absolutely objective and neutral researcher's stance while eliminating any subjective elements. The focus shall be on presenting facts, evidence, and logical reasoning, aiming to engage in rigorous dialogue with academic peers.
3. Third, refine the initial draft according to the following requirements.
- The content of the refined content must be logically structured, high-quality, information-dense.
- The overall layout of the refined content must not use LaTeX formatting.
- The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
- All mathematical expressions in the refined content must be formatted using LaTeX.
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.

The result format is as follows:
<rewritten content></rewritten content>'''


# ============================================================================
# Learning Note Style
# ============================================================================

MATH_INSTRUCT_LEARNING_NOTE_PROMPT = '''Math Content:{text}

As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
Your goal is to utilize your abilities to rewrite the provided math content in the learning note style.
Before beginning the rewrite, you will consider the following requirements:
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
2. Then, focus on the captured and preserved information, combine it with the learning note style, and rewrite the text to form an initial draft, according to the following requirements.
- The overall structure of the initial draft should follow the structure used by Learning Note, prioritizing personal comprehension over rigid formatting. It typically employs a modular approach with point-by-point enumeration to facilitate organization and clarity.
- The sentence expression of the initial draft should follow the sentence expression used by Learning Note, employing highly concise and fragmented language, predominantly keywords, phrases, and incomplete sentences. It should incorporate meta-cognitive elements such as self-posed questions and answers, error annotation, and insight notes to clarify thinking and reinforce memory. It must be accurate and complete.
- The overall tone of the initial draft should follow the tone used by Learning Note. It is subjective, direct, and exploratory, resembling a dialogue with oneself. It should focus on documenting "my" comprehension difficulties, sudden insights, and key points requiring review, all characterized by strong personal nuance.
3. Third, refine the initial draft according to the following requirements.
- The content of the refined content must be logically structured, high-quality, information-dense.
- The overall layout of the refined content must not use LaTeX formatting.
- The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
- All mathematical expressions in the refined content must be formatted using LaTeX.
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.

The result format is as follows:
<rewritten content></rewritten content>'''


# ============================================================================
# Lecture Note Style
# ============================================================================

MATH_INSTRUCT_LECTURE_NOTE_PROMPT = '''Math Content:{text}

As a mathematical content creation expert, you are highly proficient in mathematical knowledge and in the analysis and rewriting of mathematical content, capable of adapting content based on different creation styles to produce diverse, informative, and high-quality mathematical content.
Your goal is to utilize your abilities to rewrite the provided math content in the lecture note style.
Before beginning the rewrite, you will consider the following requirements:
1. First, read the provided math content thoroughly, carefully analyze the provided math content to capture and preserve information according to the following requirements.
- Capture and preserve crucial mathematical information, key mathematical concepts, important mathematical values, and factual mathematical details in the original text.
- Capture and preserve mathematical examples, reasoning processes, as well as related explanations and proofs in the original text.
2. Then, focus on the captured and preserved information, combine it with the lecture note style, and rewrite the text to form an initial draft, according to the following requirements.
- The overall structure of the initial draft should follow the structure used by Lecture Note, guided by teaching objectives. It achieves systematic knowledge transfer through hierarchical organization of key points, formula derivation demonstrations, and case analysis modules.
- The sentence expression of the initial draft should follow the sentence expression used by Lecture Note, employing professional discourse that balances authority and guidance. It should integrate disciplinary terminology with instructional explanations, utilizing rhetorical questions, emphatic statements, and directive language to highlight key and challenging points. It must be accurate and complete.
- The overall tone of the initial draft should follow the tone used by Lecture Note, maintaining an authoritative narrative stance that combines credibility with guidance. Like an invisible teacher directing the reader's thinking in real time, it emphasizes the mastery of methods and thought processes, often anticipating potential reader confusion to create an immersive learning atmosphere.
3. Third, refine the initial draft according to the following requirements.
- The content of the refined content must be logically structured, high-quality, information-dense.
- The overall layout of the refined content must not use LaTeX formatting.
- The refined content may appropriately include relevant examples to enhance overall comprehensibility, and these examples must include detailed and step-by-step solutions.
- All mathematical expressions in the refined content must be formatted using LaTeX.
4. Finally, please put the final rewritten content within <rewritten content></rewritten content>.

The result format is as follows:
<rewritten content></rewritten content>'''


# ============================================================================
# Prompt Registry
# ============================================================================

MULTISTYLE_PROMPTS = {
    "wikipedia": MATH_INSTRUCT_WIKI_PROMPT,
    "textbook": MATH_INSTRUCT_TEXTBOOK_PROMPT,
    "blog": MATH_INSTRUCT_BLOG_PROMPT,
    "popular_science": MATH_INSTRUCT_POPULAR_SCIENCE_PROMPT,
    "academic_paper": MATH_INSTRUCT_ACADEMIC_PAPER_PROMPT,
    "learning_note": MATH_INSTRUCT_LEARNING_NOTE_PROMPT,
    "lecture_note": MATH_INSTRUCT_LECTURE_NOTE_PROMPT,
}


def get_multistyle_prompt(style: str) -> str:
    """
    Get multi-style rewrite prompt for specified style

    Args:
        style: Style type, see MULTISTYLE_PROMPTS.keys() for options

    Returns:
        Corresponding prompt template string
    """
    if style not in MULTISTYLE_PROMPTS:
        raise ValueError(f"Unknown style: {style}. Available styles: {list(MULTISTYLE_PROMPTS.keys())}")
    return MULTISTYLE_PROMPTS[style]
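The rewrite templates above share one contract: a `{text}` placeholder on input and a `<rewritten content>` wrapper on output. The following is a minimal sketch of that round trip, not code from the repository. `DEMO_TEMPLATE`, `build_prompt`, `extract_rewritten`, and the sample reply are hypothetical names and data introduced only for illustration.

```python
import re
from typing import Optional

# Hypothetical, shortened stand-in for a template from MULTISTYLE_PROMPTS;
# the real templates are much longer but use the same {text} placeholder.
DEMO_TEMPLATE = '''Math Content:{text}

Rewrite the provided math content in the lecture note style.
Put the final rewritten content within <rewritten content></rewritten content>.'''


def build_prompt(template: str, text: str) -> str:
    """Fill the {text} placeholder; braces inside `text` are left untouched."""
    return template.format(text=text)


def extract_rewritten(response: str) -> Optional[str]:
    """Return the text between the <rewritten content> tags, or None if absent."""
    match = re.search(r"<rewritten content>(.*?)</rewritten content>", response, re.DOTALL)
    return match.group(1).strip() if match else None


prompt = build_prompt(DEMO_TEMPLATE, "The set {1, 2, 3} has 3 elements.")
# Made-up model reply, used only to exercise the extraction step.
reply = "<rewritten content>\nA set with three elements: {1, 2, 3}.\n</rewritten content>"
print(extract_rewritten(reply))
```

Because `str.format` only interprets braces in the template itself, source text containing set notation or LaTeX braces passes through unchanged. The templates would need escaping only if they gained literal braces of their own.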
qa_synthesis.py
ADDED
@@ -0,0 +1,143 @@
# -*- coding: utf-8 -*-
"""
UltraData-Math L3 - Q&A Synthesis Prompts

Reference: Jiuzhang-Math, MathGPT
Difficulty levels: Grade School, Middle School, High School, College
"""

# ============================================================================
# Grade School Q&A Prompt
# ============================================================================

MATH_INSTRUCT_GRADE_SCHOOL_PROMPT = '''Math Content:{text}

As a math teacher, you are highly proficient in mathematical knowledge.
Your goal is to utilize your abilities to create an age-appropriate math word problem for grade school students based on the provided math content.
You should follow these steps:
1. First, craft a concise math word problem suitable for grade school, according to the following requirements.
- The crafted problem must focus on basic arithmetic operations (addition, subtraction, multiplication, division), number sense, simple shapes, or introductory measurements.
- The crafted problem must use relatable, real-world scenarios appropriate for the age group.
- The crafted problem must include all necessary information for solving it.
- The crafted problem must be purely text-based and solvable without images.
2. Then, provide a clear, step-by-step solution to the crafted problem, according to the following requirements.
- The solution must use simple language that a grade school student could understand.
- The solution must explain the reasoning behind each step.
3. Finally, please put the crafted problem within <problem></problem> and put the solution within <solution></solution>.
The result format is as follows:
<result>
<problem></problem>
<solution></solution>
</result>

In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''


# ============================================================================
# Middle School Q&A Prompt
# ============================================================================

MATH_INSTRUCT_MIDDLE_SCHOOL_PROMPT = '''Math Content:{text}

As a math teacher, you are highly proficient in mathematical knowledge.
Your goal is to utilize your abilities to create a middle school-level math problem and solution based on the provided math content.
You should follow these steps:
1. First, create a self-contained problem for middle school students that directly incorporates a concept from the provided math content, according to the following requirements.
- The created problem must target a difficulty level appropriate for grades 6-8 (ages 11-14), assuming knowledge of arithmetic, pre-algebra, basic probability/statistics, and geometry.
- The created problem must include all necessary information for solving it.
- The created problem must be fully text-based and solvable without images.
- The created problem must use concepts typically covered by the end of 8th grade.
2. Then, provide a detailed, step-by-step solution to the created problem, according to the following requirements.
- The solution must demonstrate the mathematical reasoning from problem statement to conclusion.
- The solution must explain each step to reinforce the underlying math principles being applied.
- All mathematical expressions in the solution must be formatted using LaTeX.
3. Finally, please put the created problem within <problem></problem> and put the solution within <solution></solution>.
The result format is as follows:
<result>
<problem></problem>
<solution></solution>
</result>

In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''


# ============================================================================
# High School Q&A Prompt
# ============================================================================

MATH_INSTRUCT_HIGH_SCHOOL_PROMPT = '''Math Content:{text}

As a math teacher, you are highly proficient in mathematical knowledge.
Your goal is to utilize your abilities to create, inspired by the provided math content, a high school-level math problem that combines concepts from at least two math subjects.
You should follow these steps:
1. First, draft a self-contained math problem for high school students based on the provided math content, according to the following requirements.
- The drafted problem must require knowledge from one of these subjects: Algebra I and II, Pre-Calculus, Calculus, Geometry, Trigonometry, Statistics and Probability.
- The drafted problem must include all necessary information for solving it.
- The drafted problem must be fully text-based and solvable without images.
- The drafted problem must use concepts typically covered by the end of 11th grade.
2. Then, provide a detailed, step-by-step solution to the drafted problem, according to the following requirements.
- The solution must demonstrate the mathematical reasoning from problem statement to conclusion.
- The solution must explain each step to reinforce the underlying math principles being applied.
- All mathematical expressions in the solution must be formatted using LaTeX.
3. Finally, please put the drafted problem within <problem></problem> and put the solution within <solution></solution>.
The result format is as follows:
<result>
<problem></problem>
<solution></solution>
</result>

In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''


# ============================================================================
# College/University Q&A Prompt
# ============================================================================

MATH_INSTRUCT_COLLEGE_PROMPT = '''Math Content:{text}

As a math teacher, you are highly proficient in mathematical knowledge.
Your goal is to utilize your abilities to create, inspired by the provided math content, a college-level math problem.
You should follow these steps:
1. First, draft a self-contained, college-level math problem inspired by the math content, according to the following requirements.
- The drafted problem must be intellectually stimulating and designed for an audience familiar with advanced mathematics, such as Calculus, Linear Algebra, Abstract Algebra, etc.
- The drafted problem must include all necessary information for solving it.
- The drafted problem must be fully text-based and solvable without images.
2. Then, provide a detailed, step-by-step solution to the drafted problem, according to the following requirements.
- The solution must clearly explain the reasoning, mathematical principles, and steps used.
- Call out any key theorems or properties being applied at each step.
- All mathematical expressions in the solution must be formatted using LaTeX.
3. Finally, please put the drafted problem within <problem></problem> and put the solution within <solution></solution>.
The result format is as follows:
<result>
<problem></problem>
<solution></solution>
</result>

In addition, the output must refrain from using Markdown, avoid bold or italic styles, and not add any text decorations.'''


# ============================================================================
# Prompt Registry
# ============================================================================

QA_PROMPTS = {
    "grade_school": MATH_INSTRUCT_GRADE_SCHOOL_PROMPT,
    "middle_school": MATH_INSTRUCT_MIDDLE_SCHOOL_PROMPT,
    "high_school": MATH_INSTRUCT_HIGH_SCHOOL_PROMPT,
    "college": MATH_INSTRUCT_COLLEGE_PROMPT,
}


def get_qa_prompt(level: str) -> str:
    """
    Get Q&A synthesis prompt for specified difficulty level

    Args:
        level: Difficulty level, options: "grade_school", "middle_school", "high_school", "college"

    Returns:
        Corresponding prompt template string
    """
    if level not in QA_PROMPTS:
        raise ValueError(f"Unknown level: {level}. Available levels: {list(QA_PROMPTS.keys())}")
    return QA_PROMPTS[level]
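All four Q&A prompts ask the model for the same `<result>` envelope with `<problem>` and `<solution>` inside it. The sketch below shows one way a caller might pull the two parts out of a reply and reject incomplete samples. `SAMPLE_REPLY` is a made-up reply, and `extract_qa` and `is_complete` are hypothetical helpers that are not part of this commit.

```python
import re
from typing import Dict

# Made-up model reply following the <result> format requested by the Q&A prompts.
SAMPLE_REPLY = """<result>
<problem>Mia has 3 bags with 4 apples in each bag. How many apples does she have in total?</problem>
<solution>There are 3 bags and each bag holds 4 apples, so Mia has 3 x 4 = 12 apples.</solution>
</result>"""


def extract_qa(response: str) -> Dict[str, str]:
    """Collect the <problem> and <solution> bodies that are present in a reply."""
    fields: Dict[str, str] = {}
    for tag in ("problem", "solution"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            fields[tag] = match.group(1).strip()
    return fields


def is_complete(fields: Dict[str, str]) -> bool:
    """A sample is usable only if both parts were found and are non-empty."""
    return all(fields.get(tag) for tag in ("problem", "solution"))


parsed = extract_qa(SAMPLE_REPLY)
print(parsed["problem"])
print(is_complete(parsed))
```

Checking completeness separately from extraction keeps the raw reply available for debugging when a model drops one of the tags.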
requirements.txt
ADDED
@@ -0,0 +1,2 @@
gradio>=4.0.0
openai>=1.0.0
run_synthesis.py
ADDED
@@ -0,0 +1,514 @@
# -*- coding: utf-8 -*-
"""
UltraData-Math L3 - Data Synthesis Script

OpenAI API-based data synthesis tool, supporting:
- Q&A synthesis
- Multi-turn conversation synthesis
- Multi-style rewriting
- Knowledge extraction and textbook exercise generation

Usage:
    python run_synthesis.py \
        --input data.jsonl \
        --output output.jsonl \
        --task qa \
        --level high_school \
        --model gpt-4o \
        --workers 10
"""

import argparse
import asyncio
import json
import os
import re
import time
from pathlib import Path
from typing import Optional

from openai import AsyncOpenAI

# Import prompt templates
from qa_synthesis import QA_PROMPTS, get_qa_prompt
from conversation_synthesis import CONVERSATION_PROMPTS, get_conversation_prompt
from multistyle_rewrite import MULTISTYLE_PROMPTS, get_multistyle_prompt
from knowledge_textbook import (
    get_knowledge_extraction_prompt,
    get_textbook_exercise_prompt,
    TEXTBOOK_EXERCISE_PROMPTS,
)


# ============================================================================
# Configuration
# ============================================================================

DEFAULT_MODEL = "gpt-4o"
DEFAULT_TEMPERATURE = 0.7
DEFAULT_MAX_TOKENS = 4096
DEFAULT_WORKERS = 10
DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_DELAY = 1.0


# ============================================================================
# Output Parsers
# ============================================================================

def parse_qa_output(response: str) -> dict:
    """Parse Q&A synthesis output"""
    result = {"raw": response}

    # Extract <problem> and <solution>
    problem_match = re.search(r"<problem>(.*?)</problem>", response, re.DOTALL)
    solution_match = re.search(r"<solution>(.*?)</solution>", response, re.DOTALL)

    if problem_match:
        result["problem"] = problem_match.group(1).strip()
    if solution_match:
        result["solution"] = solution_match.group(1).strip()

    return result


def parse_conversation_output(response: str) -> dict:
    """Parse conversation synthesis output"""
    result = {"raw": response}

    # Try multiple tags
    for tag in ["discussions", "conversation", "interaction"]:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            result["content"] = match.group(1).strip()
            result["type"] = tag
            break

    return result


def parse_rewrite_output(response: str) -> dict:
    """Parse multi-style rewrite output"""
    result = {"raw": response}

    match = re.search(r"<rewritten content>(.*?)</rewritten content>", response, re.DOTALL)
    if match:
        result["rewritten"] = match.group(1).strip()

    return result


def parse_knowledge_output(response: str) -> dict:
    """Parse knowledge extraction output"""
    result = {"raw": response}

    if "no result" in response.lower():
        result["knowledge_points"] = []
        return result

    # Extract all knowledge points
    pattern = r"<mathematical knowledge point\d*>(.*?)</mathematical knowledge point\d*>"
    matches = re.findall(pattern, response, re.DOTALL)
    result["knowledge_points"] = [m.strip() for m in matches]

    return result


def parse_textbook_output(response: str) -> dict:
    """Parse textbook exercise output"""
    result = {"raw": response}

    match = re.search(r"<material>(.*?)</material>", response, re.DOTALL)
    if match:
        result["material"] = match.group(1).strip()

    return result


OUTPUT_PARSERS = {
    "qa": parse_qa_output,
    "conversation": parse_conversation_output,
    "rewrite": parse_rewrite_output,
    "knowledge": parse_knowledge_output,
    "textbook": parse_textbook_output,
}
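The knowledge parser differs from the others in that it collects every match rather than the first one, and its pattern tolerates an optional number in the tag name. The snippet below is a small illustration of that behaviour. The sample reply is invented, and its numbered-tag layout is only an assumption inferred from the `\d*` in the pattern, since the knowledge extraction prompt itself is not shown here.

```python
import re
from typing import List

# Same pattern as in parse_knowledge_output: the tag name may carry a trailing number.
PATTERN = r"<mathematical knowledge point\d*>(.*?)</mathematical knowledge point\d*>"


def extract_points(response: str) -> List[str]:
    """Return every tagged knowledge point, stripped of surrounding whitespace."""
    return [m.strip() for m in re.findall(PATTERN, response, re.DOTALL)]


# Invented reply; the numbered tags are an assumption about the prompt's output format.
sample = (
    "<mathematical knowledge point1>\nPythagorean theorem\n</mathematical knowledge point1>\n"
    "<mathematical knowledge point2>Area of a circle</mathematical knowledge point2>"
)
print(extract_points(sample))  # ['Pythagorean theorem', 'Area of a circle']
```

The non-greedy `(.*?)` together with `re.DOTALL` lets a single point span several lines without swallowing the next tag.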


# ============================================================================
# API Client
# ============================================================================

class SynthesisClient:
    """Data synthesis client"""

    def __init__(
        self,
        api_key: Optional[str] = None,
        base_url: Optional[str] = None,
        model: str = DEFAULT_MODEL,
        temperature: float = DEFAULT_TEMPERATURE,
        max_tokens: int = DEFAULT_MAX_TOKENS,
        max_retries: int = DEFAULT_MAX_RETRIES,
        retry_delay: float = DEFAULT_RETRY_DELAY,
    ):
        self.client = AsyncOpenAI(
            api_key=api_key or os.getenv("OPENAI_API_KEY"),
            base_url=base_url or os.getenv("OPENAI_BASE_URL"),
        )
        self.model = model
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    async def generate(self, prompt: str) -> str:
        """Call API to generate content"""
        for attempt in range(self.max_retries):
            try:
                response = await self.client.chat.completions.create(
                    model=self.model,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=self.temperature,
                    max_tokens=self.max_tokens,
                )
                # content may be None; fall back to an empty string so parsers get a str
                return response.choices[0].message.content or ""
            except Exception:
                if attempt < self.max_retries - 1:
                    # Exponential backoff: retry_delay, 2 * retry_delay, 4 * retry_delay, ...
                    await asyncio.sleep(self.retry_delay * (2 ** attempt))
                else:
                    raise
        return ""
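The retry loop in `generate` sleeps `retry_delay * 2 ** attempt` between attempts and re-raises after the last one. The sketch below isolates that schedule so it can be run without an API key. `backoff_delays`, `call_with_retry`, and `make_flaky` are hypothetical names introduced for illustration, and the simulated failure stands in for a real API error.

```python
import asyncio
from typing import Awaitable, Callable, List


def backoff_delays(max_retries: int, retry_delay: float) -> List[float]:
    """Delays slept between attempts; there is no sleep after the final attempt."""
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


async def call_with_retry(
    func: Callable[[], Awaitable[str]],
    max_retries: int = 3,
    retry_delay: float = 1.0,
) -> str:
    """Same control flow as the generate() loop, with the API call abstracted away."""
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception:
            if attempt < max_retries - 1:
                await asyncio.sleep(retry_delay * (2 ** attempt))
            else:
                raise
    return ""


def make_flaky(failures: int) -> Callable[[], Awaitable[str]]:
    """Hypothetical stand-in for an API call that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    async def flaky() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated transient error")
        return "ok"

    return flaky


print(backoff_delays(3, 1.0))  # [1.0, 2.0]
print(asyncio.run(call_with_retry(make_flaky(2), retry_delay=0.01)))  # ok
```

With the defaults shown earlier (3 retries, 1.0 s base delay), a request that keeps failing waits about 3 seconds in total before the error propagates to the caller.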


# ============================================================================
# Synthesis Tasks
# ============================================================================

class SynthesisTask:
    """Base class for synthesis tasks"""

    def __init__(self, client: SynthesisClient, text_field: str = "text"):
        self.client = client
        self.text_field = text_field

    def get_prompt(self, sample: dict) -> str:
        raise NotImplementedError

    def parse_output(self, response: str) -> dict:
        raise NotImplementedError

    async def process(self, sample: dict) -> dict:
        """Process a single sample"""
        prompt = self.get_prompt(sample)
        response = await self.client.generate(prompt)
        parsed = self.parse_output(response)
        return {**sample, "synthesis_result": parsed}
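
Reviewer note: `process` fixes the order of the steps (build the prompt, call the client, parse the reply) and leaves the two hooks to subclasses. The following is a minimal sketch of that flow that runs without network access. `EchoClient` and `UppercaseTask` are hypothetical stand-ins written for this illustration; they are not part of the repository.

```python
import asyncio


class EchoClient:
    """Hypothetical stand-in for the real client: returns a canned reply."""

    async def generate(self, prompt: str) -> str:
        return f"ANSWER: {prompt.upper()}"


class UppercaseTask:
    """Hypothetical task with the same get_prompt / parse_output / process split."""

    def __init__(self, client, text_field: str = "text"):
        self.client = client
        self.text_field = text_field

    def get_prompt(self, sample: dict) -> str:
        # A missing field yields an empty prompt, as in the tasks in this file.
        return sample.get(self.text_field, "")

    def parse_output(self, response: str) -> dict:
        return {"answer": response.removeprefix("ANSWER: ")}

    async def process(self, sample: dict) -> dict:
        prompt = self.get_prompt(sample)
        response = await self.client.generate(prompt)
        # The input record is kept and the parsed result is attached to it.
        return {**sample, "synthesis_result": self.parse_output(response)}


def run_demo() -> dict:
    task = UppercaseTask(EchoClient())
    return asyncio.run(task.process({"id": 1, "text": "two plus two"}))


if __name__ == "__main__":
    print(run_demo())
```

Swapping the client for a stub like this is also a cheap way to unit test a parser without spending API calls.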


class QASynthesisTask(SynthesisTask):
    """Q&A synthesis task"""

    def __init__(self, client: SynthesisClient, level: str, text_field: str = "text"):
        super().__init__(client, text_field)
        self.level = level
        self.prompt_template = get_qa_prompt(level)

    def get_prompt(self, sample: dict) -> str:
        text = sample.get(self.text_field, "")
        return self.prompt_template.format(text=text)

    def parse_output(self, response: str) -> dict:
        return parse_qa_output(response)


class ConversationSynthesisTask(SynthesisTask):
    """Conversation synthesis task"""

    def __init__(self, client: SynthesisClient, style: str, text_field: str = "text"):
        super().__init__(client, text_field)
        self.style = style
        self.prompt_template = get_conversation_prompt(style)

    def get_prompt(self, sample: dict) -> str:
        text = sample.get(self.text_field, "")
        return self.prompt_template.format(text=text)

    def parse_output(self, response: str) -> dict:
        return parse_conversation_output(response)


class RewriteSynthesisTask(SynthesisTask):
    """Multi-style rewrite task"""

    def __init__(self, client: SynthesisClient, style: str, text_field: str = "text"):
        super().__init__(client, text_field)
        self.style = style
        self.prompt_template = get_multistyle_prompt(style)

    def get_prompt(self, sample: dict) -> str:
        text = sample.get(self.text_field, "")
        return self.prompt_template.format(text=text)

    def parse_output(self, response: str) -> dict:
        return parse_rewrite_output(response)


class KnowledgeExtractionTask(SynthesisTask):
    """Knowledge extraction task"""

    def __init__(self, client: SynthesisClient, text_field: str = "text"):
        super().__init__(client, text_field)
        self.prompt_template = get_knowledge_extraction_prompt()

    def get_prompt(self, sample: dict) -> str:
        text = sample.get(self.text_field, "")
        return self.prompt_template.format(text=text)

    def parse_output(self, response: str) -> dict:
        return parse_knowledge_output(response)


class TextbookExerciseTask(SynthesisTask):
    """Textbook exercise generation task"""

    def __init__(self, client: SynthesisClient, difficulty: str, knowledge_field: str = "knowledge_point"):
        super().__init__(client)
        self.difficulty = difficulty
        self.knowledge_field = knowledge_field
        self.prompt_template = get_textbook_exercise_prompt(difficulty)

    def get_prompt(self, sample: dict) -> str:
        knowledge = sample.get(self.knowledge_field, "")
        return self.prompt_template.format(mathematical_knowledge_point=knowledge)

    def parse_output(self, response: str) -> dict:
        return parse_textbook_output(response)


# ============================================================================
# Batch Processing
# ============================================================================

async def process_batch(
    task: SynthesisTask,
    samples: list[dict],
    workers: int,
    progress_callback=None,
) -> list[dict]:
    """Process batch data concurrently"""
    # At most `workers` samples are in flight at any time.
    semaphore = asyncio.Semaphore(workers)
    completed = 0

    async def process_with_semaphore(sample: dict, idx: int):
        nonlocal completed
        async with semaphore:
            try:
                result = await task.process(sample)
                result["_status"] = "success"
            except Exception as e:
                # A failed sample is recorded, not raised, so the batch keeps going.
                result = {**sample, "_status": "error", "_error": str(e)}

            completed += 1
            if progress_callback:
                progress_callback(completed, len(samples))

            return idx, result

    tasks = [process_with_semaphore(sample, i) for i, sample in enumerate(samples)]
    task_results = await asyncio.gather(*tasks)

    # Sort by original order
    task_results.sort(key=lambda x: x[0])
    results = [r[1] for r in task_results]

    return results
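
Reviewer note: the batch function relies on two `asyncio` behaviours: a `Semaphore` caps how many coroutines run their body at once, and `gather` returns results in the order the awaitables were passed in, regardless of completion order. The sketch below shows both with a toy workload. `bounded_map` and the squaring task are hypothetical and exist only for this demonstration.

```python
import asyncio


async def bounded_map(items: list[int], workers: int) -> tuple[list[int], int]:
    """Square each item with at most `workers` coroutines active at once.

    Returns the results and the peak number of concurrently active workers.
    """
    semaphore = asyncio.Semaphore(workers)
    in_flight = 0
    peak = 0

    async def work(item: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            # Later items sleep less, so they tend to finish before earlier ones.
            await asyncio.sleep(0.001 * (len(items) - item))
            in_flight -= 1
            return item * item

    # gather preserves the order of its arguments in the returned list.
    results = await asyncio.gather(*(work(item) for item in items))
    return list(results), peak


def run_demo() -> tuple[list[int], int]:
    return asyncio.run(bounded_map([0, 1, 2, 3, 4, 5], workers=2))


if __name__ == "__main__":
    print(run_demo())
```

Because `gather` already keeps input order, carrying an index and sorting afterwards acts as a safeguard and does not change the result.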


def load_jsonl(filepath: str) -> list[dict]:
    """Load JSONL file"""
    data = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data


def save_jsonl(data: list[dict], filepath: str):
    """Save JSONL file"""
    with open(filepath, "w", encoding="utf-8") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
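
Reviewer note: the two helpers above treat JSONL as one JSON object per line, skip blank lines on load, and write non-ASCII text unescaped because of `ensure_ascii=False`. The round trip below is a small self-contained check of those properties. It redefines the two functions so it can run on its own; the `roundtrip` helper, the file name, and the sample record are made up for the example.

```python
import json
import os
import tempfile


def save_jsonl(data: list[dict], filepath: str) -> None:
    with open(filepath, "w", encoding="utf-8") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_jsonl(filepath: str) -> list[dict]:
    data = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # blank lines are ignored
                data.append(json.loads(line))
    return data


def roundtrip(records: list[dict]) -> tuple[list[dict], str]:
    """Write records to a temporary file, then return (reloaded records, raw text)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.jsonl")
        save_jsonl(records, path)
        with open(path, "r", encoding="utf-8") as f:
            raw = f.read()
        return load_jsonl(path), raw


if __name__ == "__main__":
    loaded, raw = roundtrip([{"text": "勾股定理", "_status": "success"}])
    print(loaded)
    print(raw, end="")
```

Keeping `ensure_ascii=False` leaves Chinese source text readable in the output file, where the default would write `\uXXXX` escapes.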


# ============================================================================
# Command Line Interface
# ============================================================================

def create_task(args, client: SynthesisClient) -> SynthesisTask:
    """Create synthesis task based on arguments"""
    task_type = args.task

    if task_type == "qa":
        level = args.level or "high_school"
        if level not in QA_PROMPTS:
            raise ValueError(f"Invalid QA level: {level}. Available: {list(QA_PROMPTS.keys())}")
        return QASynthesisTask(client, level, args.text_field)

    elif task_type == "conversation":
        style = args.style or "teacher_student"
        if style not in CONVERSATION_PROMPTS:
            raise ValueError(f"Invalid conversation style: {style}. Available: {list(CONVERSATION_PROMPTS.keys())}")
        return ConversationSynthesisTask(client, style, args.text_field)

    elif task_type == "rewrite":
        style = args.style or "textbook"
        if style not in MULTISTYLE_PROMPTS:
            raise ValueError(f"Invalid rewrite style: {style}. Available: {list(MULTISTYLE_PROMPTS.keys())}")
        return RewriteSynthesisTask(client, style, args.text_field)

    elif task_type == "knowledge":
        return KnowledgeExtractionTask(client, args.text_field)

    elif task_type == "textbook":
        difficulty = args.difficulty or "easy"
        if difficulty not in TEXTBOOK_EXERCISE_PROMPTS:
            raise ValueError(f"Invalid difficulty: {difficulty}. Available: {list(TEXTBOOK_EXERCISE_PROMPTS.keys())}")
        return TextbookExerciseTask(client, difficulty, args.knowledge_field)

    else:
        raise ValueError(f"Unknown task type: {task_type}")
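
Reviewer note: each branch of `create_task` follows the same pattern: fall back to a default when the option is omitted, then reject values that are not keys of the matching prompt table. The sketch below isolates that pattern. `resolve_option` is a hypothetical helper, and `QA_LEVELS` is an assumption copied from the `--level` values listed in the script's help text, not from the real `QA_PROMPTS` table.

```python
# Assumed set of levels, taken from the CLI help text (not from QA_PROMPTS itself).
QA_LEVELS = {"grade_school", "middle_school", "high_school", "college"}


def resolve_option(value, default: str, allowed, label: str) -> str:
    """Apply the default-then-validate pattern used in create_task (hypothetical helper)."""
    chosen = value or default  # None or "" falls back to the default
    if chosen not in allowed:
        raise ValueError(f"Invalid {label}: {chosen}. Available: {sorted(allowed)}")
    return chosen


if __name__ == "__main__":
    print(resolve_option(None, "high_school", QA_LEVELS, "QA level"))
    print(resolve_option("college", "high_school", QA_LEVELS, "QA level"))
```

One consequence of `value or default` is that an empty string is treated the same as an omitted option.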


def print_progress(completed: int, total: int):
    """Print progress"""
    percent = completed / total * 100
    print(f"\rProgress: {completed}/{total} ({percent:.1f}%)", end="", flush=True)


async def main_async(args):
    """Async main function"""
    # Create client
    client = SynthesisClient(
        api_key=args.api_key,
        base_url=args.base_url,
        model=args.model,
        temperature=args.temperature,
        max_tokens=args.max_tokens,
        max_retries=args.max_retries,
    )

    # Create task
    task = create_task(args, client)

    # Load data
    print(f"Loading data from {args.input}...")
    samples = load_jsonl(args.input)

    # Limit processing count
    if args.limit:
        samples = samples[:args.limit]

    print(f"Processing {len(samples)} samples with {args.workers} workers...")
    start_time = time.time()

    # Process data
    results = await process_batch(
        task,
        samples,
        args.workers,
        progress_callback=print_progress if not args.quiet else None,
    )

    elapsed = time.time() - start_time
    # Guard against a zero interval (empty input or a coarse system clock).
    rate = len(samples) / elapsed if elapsed > 0 else 0.0
    print(f"\nCompleted in {elapsed:.2f}s ({rate:.1f} samples/s)")

    # Statistics
    success_count = sum(1 for r in results if r.get("_status") == "success")
    error_count = len(results) - success_count
    print(f"Success: {success_count}, Error: {error_count}")

    # Save results
    save_jsonl(results, args.output)
    print(f"Results saved to {args.output}")


def main():
    parser = argparse.ArgumentParser(
        description="UltraData-Math L3 Data Synthesis Tool",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Q&A synthesis (high school level)
  python run_synthesis.py -i data.jsonl -o qa_output.jsonl -t qa --level high_school

  # Multi-turn conversation synthesis (teacher-student)
  python run_synthesis.py -i data.jsonl -o conv_output.jsonl -t conversation --style teacher_student

  # Multi-style rewrite (textbook style)
  python run_synthesis.py -i data.jsonl -o rewrite_output.jsonl -t rewrite --style textbook

  # Knowledge extraction
  python run_synthesis.py -i data.jsonl -o knowledge_output.jsonl -t knowledge

  # Textbook exercise generation (medium difficulty)
  python run_synthesis.py -i knowledge.jsonl -o textbook_output.jsonl -t textbook --difficulty medium

Task Types:
  qa            Q&A synthesis
                --level: grade_school, middle_school, high_school, college

  conversation  Multi-turn conversation synthesis
                --style: two_professors, teacher_student, two_students,
                         interview, problem_solving, layman_expert, debate

  rewrite       Multi-style rewrite
                --style: wikipedia, textbook, blog, popular_science,
                         academic_paper, learning_note, lecture_note

  knowledge     Knowledge extraction

  textbook      Textbook exercise generation
                --difficulty: easy, medium, hard
"""
    )

    # Input/Output
    parser.add_argument("-i", "--input", required=True, help="Input JSONL file path")
    parser.add_argument("-o", "--output", required=True, help="Output JSONL file path")

    # Task configuration
    parser.add_argument("-t", "--task", required=True,
                        choices=["qa", "conversation", "rewrite", "knowledge", "textbook"],
                        help="Synthesis task type")
    parser.add_argument("--level", help="Q&A difficulty level")
    parser.add_argument("--style", help="Conversation/rewrite style")
    parser.add_argument("--difficulty", help="Textbook exercise difficulty")

    # Field configuration
    parser.add_argument("--text-field", default="text", help="Input text field name (default: text)")
    parser.add_argument("--knowledge-field", default="knowledge_point", help="Knowledge point field name (default: knowledge_point)")

    # API configuration
    parser.add_argument("--api-key", help="OpenAI API Key (or set OPENAI_API_KEY env var)")
    parser.add_argument("--base-url", help="API Base URL (or set OPENAI_BASE_URL env var)")
    parser.add_argument("--model", default=DEFAULT_MODEL, help=f"Model name (default: {DEFAULT_MODEL})")
    parser.add_argument("--temperature", type=float, default=DEFAULT_TEMPERATURE, help=f"Sampling temperature (default: {DEFAULT_TEMPERATURE})")
    parser.add_argument("--max-tokens", type=int, default=DEFAULT_MAX_TOKENS, help=f"Max tokens to generate (default: {DEFAULT_MAX_TOKENS})")

    # Execution configuration
    parser.add_argument("-w", "--workers", type=int, default=DEFAULT_WORKERS, help=f"Concurrency (default: {DEFAULT_WORKERS})")
    parser.add_argument("--max-retries", type=int, default=DEFAULT_MAX_RETRIES, help=f"Max retries (default: {DEFAULT_MAX_RETRIES})")
    parser.add_argument("--limit", type=int, help="Limit number of samples to process")
    parser.add_argument("-q", "--quiet", action="store_true", help="Quiet mode")

    args = parser.parse_args()

    # Run
    asyncio.run(main_async(args))


if __name__ == "__main__":
    main()