---
title: DataFlow-VQA
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.25.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: "3.11"
---

# DataFlow-VQA

**[中文文档 (Chinese documentation)](README_zh.md)**

A pipeline for extracting, curating, and generating chain-of-thought (CoT) data from PDF textbooks and exam papers.

[🤗 Dataset](https://huggingface.co/datasets/OpenDCAI/FlipVQA)

## Overview
![DataFlow-VQA overview](static/overview_2.png)
DataFlow-VQA processes PDF documents through three sequential stages:

- Stage 1 (**Section 3.1: VQA Extraction**): Parses PDFs using [MinerU](https://github.com/opendatalab/MinerU) for document layout analysis, then uses an LLM to extract structured question-answer pairs with images.
- Stage 2 (**Section 3.2.1 to Section 3.2.5: Data Curation**): Filters and cleans the extracted QA pairs: splits sub-questions, classifies question types, extracts concise answers, and removes low-quality items.
- Stage 3 (**Section 3.2.6: CoT Generation**): Generates chain-of-thought reasoning via rejection sampling: an LLM generates answers, each answer is verified against the ground truth, and incorrect answers are retried.



## Installation

This project is built on top of [DataFlow](https://github.com/OpenDCAI/DataFlow). Clone and install it first:

```shell
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"
```

Then clone this repository:

```shell
git clone <this-repo-url>
cd DataFlow-VQA
```

## Configuration

### API Keys

Two API keys are required:

- `DF_API_KEY`: API key for the LLM service (OpenAI, Google Gemini, DeepSeek, etc.)
- `MINERU_API_KEY`: API key for [MinerU](https://mineru.net/apiManage/token) document layout parsing

```shell
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"
```
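
Before launching a pipeline, it can help to confirm that both variables are visible to Python. The following is a minimal sketch; the helper name `missing_keys` is hypothetical and not part of this repository.

```python
import os

# The two environment variables named in this section.
REQUIRED_KEYS = ("DF_API_KEY", "MINERU_API_KEY")


def missing_keys(env=os.environ):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling `missing_keys()` with no arguments checks the current process environment; an empty list means both keys are set.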

### LLM Endpoint

Each pipeline accepts `--api_url` and `--model` arguments. Any [OpenAI-compatible API](https://platform.openai.com/docs/api-reference) endpoint is supported, including OpenAI, Google Gemini (via proxy), DeepSeek, and others.

Provide the **base URL** without `/chat/completions` (e.g. `https://api.openai.com/v1`).
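
If you are copying a full endpoint URL from provider documentation, the sketch below shows one way to reduce it to the base URL. The helper `to_base_url` is a hypothetical convenience function, not something the pipelines provide.

```python
def to_base_url(url: str) -> str:
    """Drop a trailing /chat/completions; return other URLs unchanged."""
    suffix = "/chat/completions"
    trimmed = url.rstrip("/")
    if trimmed.endswith(suffix):
        return trimmed[: -len(suffix)]
    return url
```

For example, `to_base_url("https://api.openai.com/v1/chat/completions")` returns `https://api.openai.com/v1`, which is the form expected by `--api_url`.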

---

## Stage 1: VQA Extraction

### Input Format

Create a JSONL file where each line describes one PDF extraction task:

```jsonl
{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
```

- `input_pdf_paths`: A single PDF (questions and answers interleaved) or a list of two or more PDFs (questions before answers).
- `name`: A unique identifier for this task (used for directory naming and caching).
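
For more than a handful of PDFs, the input file can be generated rather than written by hand. This is a minimal sketch that reuses the two example tasks above; the function name `tasks_to_jsonl` is hypothetical, and the uniqueness check simply mirrors the requirement that `name` be unique.

```python
import json

# The same two tasks as in the example above.
TASKS = [
    {"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"},
    {
        "input_pdf_paths": [
            "./examples/VQA/math_question.pdf",
            "./examples/VQA/math_answer.pdf",
        ],
        "name": "math2",
    },
]


def tasks_to_jsonl(tasks) -> str:
    """Serialize tasks to JSONL text, one JSON object per line."""
    names = [task["name"] for task in tasks]
    if len(names) != len(set(names)):
        raise ValueError("each task needs a unique 'name'")
    return "".join(json.dumps(task, ensure_ascii=False) + "\n" for task in tasks)
```

Write the returned string to a file such as `vqa_extract_test.jsonl` and pass that path as `--input_file`.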

### Run

```bash
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro
```

**Important:** We recommend using a strong model for this stage. Weaker models such as `gpt-5-mini` may perform poorly.

### Output

- `{output_dir}/raw_vqa.jsonl`: Extracted QA pairs with image references
- `{output_dir}/{name}/vqa_images/`: Extracted images
- `cache/{name}/extracted_vqa.jsonl`, `merged_qa_pairs.jsonl`, `merged_qa_pairs.md`: Per-task intermediate files

Each QA item contains:

```json
{
  "question": "Compute $x$ such that $x^2 - 1 = 0$.",
  "answer": "$x = 1$ or $x = -1$",
  "solution": "Factor as $(x-1)(x+1)=0$.",
  "label": 1,
  "question_chapter_title": "Chapter 1: Quadratic Equations",
  "answer_chapter_title": "Chapter 1: Quadratic Equations",
  "image_basedir": "/path/to/your/images"
}
```
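
To inspect the extracted items, the JSONL output can be parsed with the standard library. The sketch below assumes only the field names shown above; `parse_jsonl` and `count_by_chapter` are hypothetical helpers.

```python
import json
from collections import Counter


def parse_jsonl(text: str):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by_chapter(items):
    """Count QA items per question chapter title."""
    return Counter(item.get("question_chapter_title", "") for item in items)
```

Reading `raw_vqa.jsonl` as text and passing it through these two functions gives a quick per-chapter tally of what was extracted.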

### Note

**We also support using a local MinerU deployment**: Replace `FileOrURLToMarkdownConverterAPI` with `FileOrURLToMarkdownConverterLocal` or `FileOrURLToMarkdownConverterFlash` in `pipelines/vqa_extract_optimized_pipeline.py`:

```python
# Original opendatalab local version
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
)

# Accelerated version (Flash)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
    batch_size=4,
    replicas=1,
    num_gpus_per_replica=1,
    engine_gpu_util_rate_to_ray_cap=0.9,
)
```

See [DataFlow's MinerU operators](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py) for full parameter documentation.

<details>
<summary>Pipeline details</summary>

The extraction pipeline runs six steps:

1. **PDF Merging** (`PDF_Merger`): If multiple PDFs are provided, merges them into one.
2. **Document Layout Parsing** (`FileOrURLToMarkdownConverterAPI`): Calls the MinerU API to produce structured JSON layout tokens and page images.
3. **Layout Preprocessing** (`MinerU2LLMInputOperator`): Flattens list items and re-indexes IDs to prepare LLM-ready input.
4. **LLM Extraction** (`ChunkedPromptedGenerator`): Chunks the layout JSON (max 128k tokens per chunk) and calls the LLM with `QAExtractPrompt` to extract QA pairs as structured XML.
5. **Output Parsing** (`LLMOutputParser`): Parses the XML response into JSONL and copies images to `vqa_images/`.
6. **QA Merging** (`QA_Merger`): For separated question/answer PDFs, matches question and answer blocks by chapter title and question number. This operator has a `strict_title_match` parameter: when set to `True`, it requires an exact string match on chapter titles; otherwise, it attempts to extract Chinese or English sequence numbers from the titles and matches on those.
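
The difference between the two title-matching modes can be illustrated with a small sketch. This is not the `QA_Merger` implementation: the helper names, the regular expression, and the tiny numeral table are all assumptions made for illustration.

```python
import re

# Illustrative table only; real titles may need a fuller numeral parser.
CHINESE_DIGITS = {
    "一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
    "六": 6, "七": 7, "八": 8, "九": 9, "十": 10,
}


def sequence_number(title: str):
    """Return the first Arabic or single-character Chinese numeral, or None."""
    match = re.search(r"\d+", title)
    if match:
        return int(match.group())
    for char in title:
        if char in CHINESE_DIGITS:
            return CHINESE_DIGITS[char]
    return None


def titles_match(question_title, answer_title, strict_title_match=False):
    """Compare titles exactly, or by extracted sequence number."""
    if strict_title_match:
        return question_title == answer_title
    q_num = sequence_number(question_title)
    a_num = sequence_number(answer_title)
    return q_num is not None and q_num == a_num
```

Under these assumptions, "Chapter 1: Quadratic Equations" and "Chapter 1 Answers" match in the relaxed mode but not in the strict mode.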

</details>

---

## Stage 2: Data Curation

```bash
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini
```

Output is saved as `curated_vqa.jsonl` in the same directory as `--input_file`.

<details>
<summary>Pipeline details</summary>

Four sequential steps:

**1. Sub-question Splitting**

Questions with multiple independent parts (e.g. (a), (b), (c)) are split into separate items. Each sub-question is paired with its corresponding sub-answer and sub-solution. An item is discarded if its question is empty, or if both its answer and its solution are empty.

Sub-questions that are context-sensitive (e.g. (b) uses the result of (a)) will not be split into separate items.

Adds field: `split_qa`
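
The discard rule for this step can be written as a small predicate. This is a sketch based on the description above, using the `question`, `answer`, and `solution` fields from the Stage 1 output; the function name `should_discard` is hypothetical and the pipeline's own check may differ in detail.

```python
def should_discard(item: dict) -> bool:
    """True if the question is empty, or both answer and solution are empty."""

    def is_empty(key: str) -> bool:
        # Treat missing, None, and whitespace-only values as empty.
        return not str(item.get(key) or "").strip()

    return is_empty("question") or (is_empty("answer") and is_empty("solution"))
```

An item with a question and only a solution is kept, because the answer can still be recovered from the solution in step 3.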

**2. Question Type Classification**

Each question is classified as one of: `Calculation`, `Proof`, `Explanation`, `Fill-in`, `Multiple-choice`, `Sketching`, `Other`.

By default, only `Calculation`, `Fill-in`, and `Multiple-choice` are retained. To change this, edit the `filter_rules` list in `DataCurationPipeline.__init__`.

Adds fields: `type`, `type_reason`
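
The effect of the default type filter can be reproduced on already-classified items as follows. This sketch does not show the actual format of `filter_rules`; it only assumes that each item carries the `type` field added by this step, and `keep_by_type` is a hypothetical helper.

```python
# The three types retained by default, as listed above.
DEFAULT_KEPT_TYPES = frozenset({"Calculation", "Fill-in", "Multiple-choice"})


def keep_by_type(items, kept_types=DEFAULT_KEPT_TYPES):
    """Return only the items whose 'type' is in kept_types."""
    return [item for item in items if item.get("type") in kept_types]
```

Passing a different set, for example one that also contains `Proof`, shows which items a relaxed filter would retain.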

**3. Answer Extraction**

Extracts a concise final answer from the `solution` field and writes it to `answer`. Items that already have a non-empty `answer` are skipped (set `overwrite=True` in `AnswerExtractionOperator` to override).

**4. QA Filtering**

Removes items based on the following criteria:

- The question must pose a clear, specific problem suitable for an exam. Examples, statements without questions, and open-ended discussions are rejected.
- The answer must directly address the question.
- The question and answer must be self-contained, without relying on external references or omitted context.

Adds fields: `filter_result`, `filter_reason`

</details>

---

## Stage 3: Generate CoT

The answer model and judge model can use different API endpoints and API keys, which is useful when the answer model is a self-hosted open-source VLM (e.g. Qwen3-VL served via vLLM) and the judge model is a commercial API.

Use `--answer_api_key_env` / `--judge_api_key_env` to specify which environment variable holds the API key for each model (default: `DF_API_KEY` for both).

```bash
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx"   # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
```

Output is saved as `curated_vqa_with_cot.jsonl` in the same directory as `--input_file`.
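
The retry behaviour controlled by `--max_retries` can be pictured with the sketch below. It is a simplified illustration, not the pipeline's code: `generate` and `judge` are hypothetical callables standing in for the answer model and the judge model, and the thinking-cleanup step is omitted.

```python
def rejection_sample(items, generate, judge, max_retries=5):
    """Retry generation for items whose answer has not yet been accepted."""
    for _ in range(max_retries):
        pending = [item for item in items if not item.get("answer_match_result")]
        if not pending:
            break  # every item already has an accepted answer
        for item in pending:
            item["generated_cot"] = generate(item["question"])
            item["answer_match_result"] = judge(item["generated_cot"], item["answer"])
    return items
```

Items that pass in an early round are never regenerated, so the cost of later rounds shrinks as more answers are accepted.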

<details>
<summary>Pipeline details</summary>

Uses rejection sampling over up to `max_retries` rounds:

**1. Answer Generation** (`VQAReasoningAnswerGenerator`)

The LLM generates a step-by-step answer. Set `skip_text_only=True` in `RejectSamplingPipeline` to process only VQA items (questions containing images); set it to `False` to process all items. The generated answer is stored in `generated_cot`.

**2. Thinking Cleanup**

Strips `<think>...</think>` content from the generated answer to reduce verification cost. The cleaned answer is stored in `llm_short_answer`. Assumes the model outputs `<think>THINK</think>ANSWER` or `THINK</think>ANSWER`.
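
A minimal sketch of this cleanup, covering both output shapes mentioned above, is shown here. The function name `strip_thinking` is hypothetical, and splitting on the last closing tag is an assumption about how ambiguous outputs should be handled.

```python
def strip_thinking(generated: str) -> str:
    """Keep only the text after the last </think> tag, if there is one."""
    marker = "</think>"
    if marker in generated:
        return generated.rsplit(marker, 1)[1].strip()
    return generated.strip()
```

Because only the closing tag is used, the same code handles outputs where the opening `<think>` tag is missing.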

**3. Answer Verification** (`BenchDatasetEvaluatorQuestion`)

Compares `llm_short_answer` against the ground truth `answer` using semantic LLM evaluation (with 5% numerical tolerance). Items that pass are marked `answer_match_result = True` and skipped in subsequent rounds.
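
For purely numerical answers, a 5% tolerance can be read as a relative-error bound. The sketch below shows that reading; it is an assumption for illustration, since the actual comparison is performed by the judge LLM and `within_tolerance` is a hypothetical helper.

```python
def within_tolerance(predicted: float, truth: float, rel_tol: float = 0.05) -> bool:
    """True if predicted is within rel_tol of truth, relative to truth."""
    if truth == 0:
        # Relative error is undefined at zero, so require an exact match.
        return predicted == 0
    return abs(predicted - truth) <= rel_tol * abs(truth)
```

Under this reading, a generated answer of 104 would be accepted against a ground truth of 100, while 106 would be rejected.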

Set `support_subquestions=True` to evaluate each sub-question independently; `answer_match_result` is `False` if any sub-question is wrong.

Evaluation statistics (overall accuracy, sub-question accuracy) are saved to `./cot_cache/eval_results.jsonl`:

```json
{
  "total_samples": 23584,
  "matched_samples": 12281,
  "accuracy": 0.521,
  "total_subquestions": 26380,
  "correct_subquestions": 13807,
  "subquestion_accuracy": 0.523
}
```
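
The relationship between these fields can be sketched as follows, assuming each item is represented by a list of per-sub-question booleans. The function `summarize` is hypothetical and only mirrors the field names above and the rule that an item matches when none of its sub-questions is wrong.

```python
def summarize(results):
    """results: one list of booleans per item, one boolean per sub-question."""
    total_samples = len(results)
    # An item counts as matched only if every sub-question is correct.
    matched_samples = sum(1 for subs in results if all(subs))
    total_sub = sum(len(subs) for subs in results)
    correct_sub = sum(sum(subs) for subs in results)
    return {
        "total_samples": total_samples,
        "matched_samples": matched_samples,
        "accuracy": round(matched_samples / total_samples, 3) if total_samples else 0.0,
        "total_subquestions": total_sub,
        "correct_subquestions": correct_sub,
        "subquestion_accuracy": round(correct_sub / total_sub, 3) if total_sub else 0.0,
    }
```

The sample file above is consistent with these ratios: 12281 / 23584 rounds to 0.521 and 13807 / 26380 rounds to 0.523.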

</details>

---

## Examples

Sample PDFs and input JSONL are provided in `examples/VQA/`:

```
examples/VQA/
├── vqa_extract_test.jsonl    # Example input for Stage 1
├── questionextract_test.pdf  # Single PDF with interleaved Q&A
├── math_question.pdf         # Questions PDF (for separated Q&A demo)
└── math_answer.pdf           # Answers PDF (for separated Q&A demo)
```

To run the full pipeline on the examples:

```bash
# Stage 1: Extract
python -m pipelines.vqa_extract_optimized_pipeline \
    --input_file ./examples/VQA/vqa_extract_test.jsonl \
    --output_dir ./output \
    --api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
    --model gemini-2.5-pro

# Stage 2: Curate
python -m pipelines.curate_data \
    --input_file ./output/raw_vqa.jsonl \
    --api_url https://api.openai.com/v1 \
    --model gpt-5-mini

# Stage 3: Generate CoT
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx"   # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"

python -m pipelines.generate_cot \
    --input_file ./output/curated_vqa.jsonl \
    --max_retries 5 \
    --answer_api_url https://your-vllm-server/v1 \
    --answer_model qwen3-vl-235b-thinking \
    --answer_api_key_env VLLM_API_KEY \
    --judge_api_url https://api.openai.com/v1 \
    --judge_model gpt-5-mini \
    --judge_api_key_env DF_API_KEY
```

## Note
The implementation in this repository is intended only for small-scale demos. To run the pipeline on a large number of books, you will probably need the [Checkpoint Resume](https://opendcai.github.io/DataFlow-Doc/en/guide/resume/) and [Batched Inference](https://opendcai.github.io/DataFlow-Doc/en/guide/batch/) features.

## License

This project is licensed under the [Apache License 2.0](LICENSE).