---
title: DataFlow-VQA
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.25.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: "3.11"
---
# DataFlow-VQA
**[Chinese documentation (中文文档)](README_zh.md)**
A pipeline for extracting, curating, and generating chain-of-thought (CoT) data from PDF textbooks and exam papers.
[🤗 Dataset](https://huggingface.co/datasets/OpenDCAI/FlipVQA)
## Overview

DataFlow-VQA processes PDF documents through three sequential stages:
- Stage 1 (**Section 3.1: VQA Extraction**): Parses PDFs using [MinerU](https://github.com/opendatalab/MinerU) for document layout analysis, then uses an LLM to extract structured question-answer pairs with images.
- Stage 2 (**Section 3.2.1 to Section 3.2.5: Data Curation**): Filters and cleans the extracted QA pairs: splits sub-questions, classifies question types, extracts concise answers, and removes low-quality items.
- Stage 3 (**Section 3.2.6: CoT Generation**): Generates chain-of-thought reasoning via rejection sampling: an LLM generates answers, each answer is verified against the ground truth, and incorrect ones are retried.
## Installation
This project is built on top of [DataFlow](https://github.com/OpenDCAI/DataFlow). Clone and install it first:
```shell
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"
```
Then clone this repository:
```shell
git clone <this-repo-url>
cd DataFlow-VQA
```
## Configuration
### API Keys
Two API keys are required:
- `DF_API_KEY`: API key for the LLM service (OpenAI, Google Gemini, DeepSeek, etc.)
- `MINERU_API_KEY`: API key for [MinerU](https://mineru.net/apiManage/token) document layout parsing
```shell
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"
```
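As a pre-flight check, a script can confirm that both variables are set before launching a pipeline. The sketch below is illustrative only: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, not part of DataFlow-VQA.

```python
import os

# The two variable names come from this README; the helper itself is hypothetical.
REQUIRED_KEYS = ("DF_API_KEY", "MINERU_API_KEY")

def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

# Example: only one of the two keys is present.
print(missing_keys({"DF_API_KEY": "sk-xxxxx"}))
```

Calling `missing_keys()` with no argument checks the current process environment.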
### LLM Endpoint
Each pipeline accepts `--api_url` and `--model` arguments. Any [OpenAI-compatible API](https://platform.openai.com/docs/api-reference) endpoint is supported, including OpenAI, Google Gemini (via proxy), DeepSeek, and others.
Provide the **base URL** without `/chat/completions` (e.g. `https://api.openai.com/v1`).
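If endpoint URLs are copied from provider documentation, they often include the `/chat/completions` suffix. A minimal sketch of trimming it before passing the value to `--api_url` follows; `normalize_base_url` is a hypothetical helper, and the pipelines may or may not perform a similar check themselves.

```python
SUFFIX = "/chat/completions"

def normalize_base_url(url: str) -> str:
    """Remove a trailing '/chat/completions' (with or without a final slash).

    URLs that do not end with the suffix are returned unchanged apart from
    surrounding whitespace.
    """
    trimmed = url.strip()
    candidate = trimmed.rstrip("/")
    if candidate.endswith(SUFFIX):
        return candidate[: -len(SUFFIX)]
    return trimmed

print(normalize_base_url("https://api.openai.com/v1/chat/completions"))
```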
---
## Stage 1: VQA Extraction
### Input Format
Create a JSONL file where each line describes one PDF extraction task:
```jsonl
{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
```
- `input_pdf_paths`: A single PDF (questions and answers interleaved) or a list of two or more PDFs (questions before answers).
- `name`: A unique identifier for this task (used for directory naming and caching).
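For larger batches, the task file can be generated programmatically. The following sketch validates the two fields described above and serializes the tasks as JSONL; `validate_task` and `dump_tasks` are hypothetical helpers, and the checks reflect only the constraints stated in this section.

```python
import json

def validate_task(task: dict) -> None:
    """Check that a task has the two fields described in this README."""
    paths = task.get("input_pdf_paths")
    if isinstance(paths, str):
        paths = [paths]
    if not isinstance(paths, list) or not paths or not all(
        isinstance(p, str) and p for p in paths
    ):
        raise ValueError("input_pdf_paths must be a path or a non-empty list of paths")
    name = task.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")

def dump_tasks(tasks: list) -> str:
    """Validate all tasks, require unique names, and return JSONL text."""
    for task in tasks:
        validate_task(task)
    names = [task["name"] for task in tasks]
    if len(names) != len(set(names)):
        raise ValueError("task names must be unique")
    return "\n".join(json.dumps(t, ensure_ascii=False) for t in tasks) + "\n"

tasks = [
    {"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"},
    {"input_pdf_paths": ["./examples/VQA/math_question.pdf",
                         "./examples/VQA/math_answer.pdf"], "name": "math2"},
]
print(dump_tasks(tasks), end="")
```

Writing the returned string to a file produces an input suitable for `--input_file`.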
### Run
```bash
python -m pipelines.vqa_extract_optimized_pipeline \
--input_file ./examples/VQA/vqa_extract_test.jsonl \
--output_dir ./output \
--api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
--model gemini-2.5-pro
```
**Important:** We recommend a strong model for this stage. Weaker models such as `gpt-5-mini` may perform poorly.
### Output
- `{output_dir}/raw_vqa.jsonl`: Extracted QA pairs with image references
- `{output_dir}/{name}/vqa_images/`: Extracted images
- `cache/{name}/extracted_vqa.jsonl`, `merged_qa_pairs.jsonl`, `merged_qa_pairs.md`: Per-task intermediate files
Each QA item contains:
```json
{
"question": "Compute $x$ such that $x^2 - 1 = 0$.",
"answer": "$x = 1$ or $x = -1$",
"solution": "Factor as $(x-1)(x+1)=0$.",
"label": 1,
"question_chapter_title": "Chapter 1: Quadratic Equations",
"answer_chapter_title": "Chapter 1: Quadratic Equations",
"image_basedir": "/path/to/your/images"
}
```
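To inspect the extracted items, the JSONL output can be loaded with the standard library. This is a minimal sketch; `parse_jsonl` and `count_by_chapter` are hypothetical helpers, and only the field names shown in the example item above are assumed.

```python
import json
from collections import Counter

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]

def count_by_chapter(items):
    """Count QA items per question chapter title."""
    return Counter(item.get("question_chapter_title", "") for item in items)

# Typical use (path taken from the Output list above):
# with open("./output/raw_vqa.jsonl", encoding="utf-8") as f:
#     items = parse_jsonl(f)
#     print(count_by_chapter(items).most_common(5))
```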
### Note
**We also support using a local MinerU deployment**: Replace `FileOrURLToMarkdownConverterAPI` with `FileOrURLToMarkdownConverterLocal` or `FileOrURLToMarkdownConverterFlash` in `pipelines/vqa_extract_optimized_pipeline.py`:
```python
# Original opendatalab local version
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
)
# Accelerated version (Flash)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
    intermediate_dir="intermediate",
    mineru_model_path="path/to/mineru/model",
    batch_size=4,
    replicas=1,
    num_gpus_per_replica=1,
    engine_gpu_util_rate_to_ray_cap=0.9,
)
```
See [DataFlow's MinerU operators](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py) for full parameter documentation.
<details>
<summary>Pipeline details</summary>
The extraction pipeline runs six steps:
1. **PDF Merging** (`PDF_Merger`): If multiple PDFs are provided, merges them into one.
2. **Document Layout Parsing** (`FileOrURLToMarkdownConverterAPI`): Calls the MinerU API to produce structured JSON layout tokens and page images.
3. **Layout Preprocessing** (`MinerU2LLMInputOperator`): Flattens list items and re-indexes IDs to prepare LLM-ready input.
4. **LLM Extraction** (`ChunkedPromptedGenerator`): Chunks the layout JSON (max 128k tokens per chunk) and calls the LLM with `QAExtractPrompt` to extract QA pairs as structured XML.
5. **Output Parsing** (`LLMOutputParser`): Parses the XML response into JSONL and copies images to `vqa_images/`.
6. **QA Merging** (`QA_Merger`): For separated question/answer PDFs, matches question and answer blocks by chapter title and question number.
This operator includes a `strict_title_match` parameter: When set to True, the operator performs an exact string match on chapter titles. Otherwise, the operator attempts to extract Chinese or English sequence numbers from the titles for matching.
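The two matching modes can be pictured roughly as follows. This is a simplified sketch, not the `QA_Merger` implementation: `title_key` and `chinese_to_int` are hypothetical helpers, and the numeral handling covers only Chinese numerals below 100.

```python
import re

# Hypothetical lookup table for single Chinese digits.
CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

def chinese_to_int(text):
    """Convert a Chinese numeral below 100 (e.g. 三, 十二, 二十) to an int.

    Returns None if the text is not a numeral this sketch understands.
    """
    tens, sep, ones = text.partition("十")
    try:
        if sep:
            return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
        return CN_DIGITS[text]
    except KeyError:
        return None

def title_key(title, strict_title_match=False):
    """Return the key used to pair a question chapter with an answer chapter."""
    if strict_title_match:
        return title  # exact string comparison
    match = re.search(r"\d+", title)
    if match:
        return int(match.group())
    match = re.search(r"[零一二三四五六七八九十]+", title)
    if match:
        value = chinese_to_int(match.group())
        if value is not None:
            return value
    return title  # no sequence number found: fall back to the full title

print(title_key("Chapter 3: Limits"), title_key("第三章 极限"))
```

With the relaxed mode, a question chapter and an answer chapter whose titles differ in wording can still be paired as long as both carry the same sequence number.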
</details>
---
## Stage 2: Data Curation
```bash
python -m pipelines.curate_data \
--input_file ./output/raw_vqa.jsonl \
--api_url https://api.openai.com/v1 \
--model gpt-5-mini
```
Output is saved as `curated_vqa.jsonl` in the same directory as `--input_file`.
<details>
<summary>Pipeline details</summary>
Four sequential steps:
**1. Sub-question Splitting**
Questions with multiple independent parts (e.g. (a), (b), (c)) are split into separate items. Each sub-question is paired with its corresponding sub-answer and sub-solution. Items with an empty question, or with both the answer and the solution empty, are discarded.
Sub-questions that are context-sensitive (e.g. (b) uses the result of (a)) will not be split into separate items.
Adds field: `split_qa`
**2. Question Type Classification**
Each question is classified as one of: `Calculation`, `Proof`, `Explanation`, `Fill-in`, `Multiple-choice`, `Sketching`, `Other`.
By default, only `Calculation`, `Fill-in`, and `Multiple-choice` are retained. To change this, edit the `filter_rules` list in `DataCurationPipeline.__init__`.
Adds fields: `type`, `type_reason`
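Conceptually, the default type filter behaves like the sketch below. The actual format of `filter_rules` is not shown in this README, so `KEEP_TYPES` and `filter_by_type` are hypothetical stand-ins that only illustrate the retention rule.

```python
# Default retained types, as listed above (variable name is hypothetical).
KEEP_TYPES = ["Calculation", "Fill-in", "Multiple-choice"]

def filter_by_type(items, keep_types=KEEP_TYPES):
    """Keep only items whose `type` field is in the allowed list."""
    allowed = set(keep_types)
    return [item for item in items if item.get("type") in allowed]

items = [{"type": "Calculation"}, {"type": "Proof"}, {"type": "Fill-in"}]
print(filter_by_type(items))
```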
**3. Answer Extraction**
Extracts a concise final answer from the `solution` field and writes it to `answer`. Items that already have a non-empty `answer` are skipped (set `overwrite=True` in `AnswerExtractionOperator` to override).
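The skip rule can be summarized in a few lines. This is a sketch of the described behavior, not the `AnswerExtractionOperator` code; `needs_extraction` is a hypothetical helper, and treating whitespace-only answers as empty is an assumption.

```python
def needs_extraction(item: dict, overwrite: bool = False) -> bool:
    """Decide whether an answer should be extracted from `solution`."""
    if overwrite:
        return True  # always re-extract, replacing any existing answer
    # Assumption: an answer containing only whitespace counts as empty.
    return not str(item.get("answer") or "").strip()

print(needs_extraction({"answer": "", "solution": "Factor as $(x-1)(x+1)=0$."}))
```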
**4. QA Filtering**
Removes items based on the following criteria:
- The question must pose a clear, specific problem suitable for an exam. Examples, statements without questions, and open-ended discussions are rejected.
- The answer must directly address the question.
- The question and answer must be self-contained, without relying on external references or omitted context.
Adds fields: `filter_result`, `filter_reason`
</details>
---
## Stage 3: Generate CoT
The answer model and judge model can use different API endpoints and API keys, which is useful when the answer model is a self-hosted open-source VLM (e.g. Qwen3-VL served via vLLM) and the judge model is a commercial API.
Use `--answer_api_key_env` / `--judge_api_key_env` to specify which environment variable holds the API key for each model (default: `DF_API_KEY` for both).
```bash
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx" # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"
python -m pipelines.generate_cot \
--input_file ./output/curated_vqa.jsonl \
--max_retries 5 \
--answer_api_url https://your-vllm-server/v1 \
--answer_model qwen3-vl-235b-thinking \
--answer_api_key_env VLLM_API_KEY \
--judge_api_url https://api.openai.com/v1 \
--judge_model gpt-5-mini \
--judge_api_key_env DF_API_KEY
```
Output is saved as `curated_vqa_with_cot.jsonl` in the same directory as `--input_file`.
<details>
<summary>Pipeline details</summary>
Uses rejection sampling over up to `max_retries` rounds:
**1. Answer Generation** (`VQAReasoningAnswerGenerator`)
The LLM generates a step-by-step answer. Set `skip_text_only=True` in `RejectSamplingPipeline` to process only VQA items (questions containing images); set to `False` to process all items. Generated answer stored in `generated_cot`.
**2. Thinking Cleanup**
Strips `<think>...</think>` content from the generated answer to reduce verification cost. The cleaned answer is stored in `llm_short_answer`. Assumes the model outputs `<think>THINK</think>ANSWER` or `THINK</think>ANSWER`.
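Under the two output formats assumed above, the cleanup amounts to keeping everything after the last `</think>` tag. A minimal sketch follows; `strip_thinking` is a hypothetical helper, not the pipeline's own function.

```python
def strip_thinking(text: str) -> str:
    """Return the answer part of `<think>THINK</think>ANSWER` or `THINK</think>ANSWER`.

    Text without a closing tag is returned unchanged apart from whitespace.
    """
    _, closing_tag, answer = text.rpartition("</think>")
    return answer.strip() if closing_tag else text.strip()

print(strip_thinking("<think>Factor the quadratic.</think>x = 1 or x = -1"))
```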
**3. Answer Verification** (`BenchDatasetEvaluatorQuestion`)
Compares `llm_short_answer` against the ground truth `answer` using semantic LLM evaluation (with 5% numerical tolerance). Items that pass are marked `answer_match_result = True` and skipped in subsequent rounds.
Set `support_subquestions=True` to evaluate each sub-question independently; `answer_match_result` is `False` if any sub-question is wrong.
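The three steps combine into a retry loop along these lines. This is a simplified sketch under stated assumptions: `reject_sample` and `within_tolerance` are hypothetical, `generate` and `verify` stand in for the answer model and the LLM judge, and the numeric check is only a toy substitute for semantic evaluation with 5% tolerance.

```python
from typing import Callable

def within_tolerance(predicted: float, truth: float, rel_tol: float = 0.05) -> bool:
    """Toy verifier: accept values within a relative tolerance of the truth."""
    if truth == 0:
        return predicted == 0
    return abs(predicted - truth) <= rel_tol * abs(truth)

def reject_sample(items: list, generate: Callable, verify: Callable,
                  max_retries: int = 5) -> list:
    """Regenerate answers for unverified items for up to `max_retries` rounds."""
    for _ in range(max_retries):
        pending = [it for it in items if not it.get("answer_match_result")]
        if not pending:
            break  # every item already passed verification
        for item in pending:
            # Step 1: generate a step-by-step answer.
            item["generated_cot"] = generate(item)
            # Step 2: keep only the text after the last </think> tag (if any).
            item["llm_short_answer"] = item["generated_cot"].rpartition("</think>")[2].strip()
            # Step 3: compare against the ground-truth answer.
            item["answer_match_result"] = verify(item["llm_short_answer"], item["answer"])
    return items
```

Items that pass are never regenerated, so later rounds only spend tokens on the questions that are still wrong.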
Evaluation statistics (overall accuracy, sub-question accuracy) are saved to `./cot_cache/eval_results.jsonl`:
```json
{
"total_samples": 23584,
"matched_samples": 12281,
"accuracy": 0.521,
"total_subquestions": 26380,
"correct_subquestions": 13807,
"subquestion_accuracy": 0.523
}
```
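The two accuracy fields appear to be plain ratios of the counts, rounded to three decimals; the sample numbers above are consistent with that reading. The sketch below recomputes them under that assumption, with `summarize` as a hypothetical helper.

```python
def summarize(total_samples, matched_samples, total_subquestions, correct_subquestions):
    """Recompute the accuracy fields from the raw counts (assumed to be ratios)."""
    def ratio(numerator, denominator):
        return round(numerator / denominator, 3) if denominator else 0.0
    return {
        "accuracy": ratio(matched_samples, total_samples),
        "subquestion_accuracy": ratio(correct_subquestions, total_subquestions),
    }

print(summarize(23584, 12281, 26380, 13807))
```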
</details>
---
## Examples
Sample PDFs and input JSONL are provided in `examples/VQA/`:
```
examples/VQA/
├── vqa_extract_test.jsonl # Example input for Stage 1
├── questionextract_test.pdf # Single PDF with interleaved Q&A
├── math_question.pdf # Questions PDF (for separated Q&A demo)
└── math_answer.pdf # Answers PDF (for separated Q&A demo)
```
To run the full pipeline on the examples:
```bash
# Stage 1: Extract
python -m pipelines.vqa_extract_optimized_pipeline \
--input_file ./examples/VQA/vqa_extract_test.jsonl \
--output_dir ./output \
--api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
--model gemini-2.5-pro
# Stage 2: Curate
python -m pipelines.curate_data \
--input_file ./output/raw_vqa.jsonl \
--api_url https://api.openai.com/v1 \
--model gpt-5-mini
# Stage 3: Generate CoT
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx" # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"
python -m pipelines.generate_cot \
--input_file ./output/curated_vqa.jsonl \
--max_retries 5 \
--answer_api_url https://your-vllm-server/v1 \
--answer_model qwen3-vl-235b-thinking \
--answer_api_key_env VLLM_API_KEY \
--judge_api_url https://api.openai.com/v1 \
--judge_model gpt-5-mini \
--judge_api_key_env DF_API_KEY
```
## Note
The implementation in this repository is intended only for small-scale demos. To run the pipeline on a large number of books, you will probably need the [Checkpoint Resume](https://opendcai.github.io/DataFlow-Doc/en/guide/resume/) and [Batched Inference](https://opendcai.github.io/DataFlow-Doc/en/guide/batch/) features.
## License
This project is licensed under the [Apache License 2.0](LICENSE).