DataFlow-VQA / README.md
aaron1141's picture
chore: update Space title to DataFlow-VQA
7ab6de8
---
title: DataFlow-VQA
emoji: πŸ”¬
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.25.0
app_file: app.py
pinned: false
license: apache-2.0
python_version: "3.11"
---
# DataFlow-VQA
**[δΈ­ζ–‡ζ–‡ζ‘£](README_zh.md)**
A pipeline for extracting, curating, and generating chain-of-thought (CoT) data from PDF textbooks and exam papers.
[πŸ€—Dataset](https://huggingface.co/datasets/OpenDCAI/FlipVQA)
## Overview
![DataFlow-VQA overview](static/overview_2.png)
DataFlow-VQA processes PDF documents through three sequential stages:
- Stage1 (**Section 3.1: VQA Extraction**): Parses PDFs using [MinerU](https://github.com/opendatalab/MinerU) for document layout analysis, then uses an LLM to extract structured question-answer pairs with images.
- Stage2 (**Section 3.2.1 to Section 3.2.5: Data Curation**): Filters and cleans the extracted QA pairs β€” splits sub-questions, classifies question types, extracts concise answers, and removes low-quality items.
- Stage3 (**Section 3.2.6: CoT Generation**): Generates chain-of-thought reasoning via reject sampling β€” an LLM generates answers, which are verified against ground truth, and incorrect ones are retried.
## Installation
This project is built on top of [DataFlow](https://github.com/OpenDCAI/DataFlow). Clone and install it first:
```shell
git clone https://github.com/OpenDCAI/DataFlow.git
cd DataFlow
pip install -e ".[pdf2vqa]"
```
Then clone this repository:
```shell
git clone <this-repo-url>
cd DataFlow-VQA
```
## Configuration
### API Keys
Two API keys are required:
- `DF_API_KEY`: API key for the LLM service (OpenAI, Google Gemini, DeepSeek, etc.)
- `MINERU_API_KEY`: API key for [MinerU](https://mineru.net/apiManage/token) document layout parsing
```shell
export DF_API_KEY="sk-xxxxx"
export MINERU_API_KEY="sk2-xxxxx"
```
### LLM Endpoint
Each pipeline accepts `--api_url` and `--model` arguments. Any [OpenAI-compatible API](https://platform.openai.com/docs/api-reference) endpoint is supported, including OpenAI, Google Gemini (via proxy), DeepSeek, and others.
Provide the **base URL** without `/chat/completions` (e.g. `https://api.openai.com/v1`).
---
## Stage 1: VQA Extraction
### Input Format
Create a JSONL file where each line describes one PDF extraction task:
```jsonl
{"input_pdf_paths": "./examples/VQA/questionextract_test.pdf", "name": "math1"}
{"input_pdf_paths": ["./examples/VQA/math_question.pdf", "./examples/VQA/math_answer.pdf"], "name": "math2"}
```
- `input_pdf_paths`: A single PDF (questions and answers interleaved) or a list of two or more PDFs (questions before answers).
- `name`: A unique identifier for this task (used for directory naming and caching).
### Run
```bash
python -m pipelines.vqa_extract_optimized_pipeline \
--input_file ./examples/VQA/vqa_extract_test.jsonl \
--output_dir ./output \
--api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
--model gemini-2.5-pro
```
**Important:** We recommend using a strong powerful model here. Weak models like `gpt-5-mini` might perform bad.
### Output
- `{output_dir}/raw_vqa.jsonl`: Extracted QA pairs with image references
- `{output_dir}/{name}/vqa_images/`: Extracted images
- `cache/{name}/extracted_vqa.jsonl`, `merged_qa_pairs.jsonl`, `merged_qa_pairs.md`: Per-task intermediate files
Each QA item contains:
```json
{
"question": "Compute $x$ such that $x^2 - 1 = 0$.",
"answer": "$x = 1$ or $x = -1$",
"solution": "Factor as $(x-1)(x+1)=0$.",
"label": 1,
"question_chapter_title": "Chapter 1: Quadratic Equations",
"answer_chapter_title": "Chapter 1: Quadratic Equations",
"image_basedir": "/path/to/your/images"
}
```
### Note
**We also support using a local MinerU deployment**: Replace `FileOrURLToMarkdownConverterAPI` with `FileOrURLToMarkdownConverterLocal` or `FileOrURLToMarkdownConverterFlash` in `pipelines/vqa_extract_optimized_pipeline.py`:
```python
# Original opendatalab local version
self.mineru_executor = FileOrURLToMarkdownConverterLocal(
intermediate_dir="intermediate",
mineru_model_path="path/to/mineru/model",
)
# Accelerated version (Flash)
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
intermediate_dir="intermediate",
mineru_model_path="path/to/mineru/model",
batch_size=4,
replicas=1,
num_gpus_per_replica=1,
engine_gpu_util_rate_to_ray_cap=0.9,
)
```
See [DataFlow's MinerU operators](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py) for full parameter documentation.
<details>
<summary>Pipeline details</summary>
The extraction pipeline runs six steps:
1. **PDF Merging** (`PDF_Merger`): If multiple PDFs are provided, merges them into one.
2. **Document Layout Parsing** (`FileOrURLToMarkdownConverterAPI`): Calls the MinerU API to produce structured JSON layout tokens and page images.
3. **Layout Preprocessing** (`MinerU2LLMInputOperator`): Flattens list items and re-indexes IDs to prepare LLM-ready input.
4. **LLM Extraction** (`ChunkedPromptedGenerator`): Chunks the layout JSON (max 128k tokens per chunk) and calls the LLM with `QAExtractPrompt` to extract QA pairs as structured XML.
5. **Output Parsing** (`LLMOutputParser`): Parses the XML response into JSONL and copies images to `vqa_images/`.
6. **QA Merging** (`QA_Merger`): For separated question/answer PDFs, matches question and answer blocks by chapter title and question number.
This operator includes a `strict_title_match` parameter: When set to True, the operator performs an exact string match on chapter titles. Otherwise, the operator attempts to extract Chinese or English sequence numbers from the titles for matching.
</details>
---
## Stage 2: Data Curation
```bash
python -m pipelines.curate_data \
--input_file ./output/raw_vqa.jsonl \
--api_url https://api.openai.com/v1 \
--model gpt-5-mini
```
Output is saved as `curated_vqa.jsonl` in the same directory as `--input_file`.
<details>
<summary>Pipeline details</summary>
Four sequential steps:
**1. Sub-question Splitting**
Questions with multiple independent parts (e.g. (a), (b), (c)) are split into separate items. Each sub-question is paired with its corresponding sub-answer and sub-solution. Items where the question or both answer and solution are empty are discarded.
Sub-questions that are context-sensitive (e.g. (b) uses the result of (a)) will not be split into separate items.
Adds field: `split_qa`
**2. Question Type Classification**
Each question is classified as one of: `Calculation`, `Proof`, `Explanation`, `Fill-in`, `Multiple-choice`, `Sketching`, `Other`.
By default, only `Calculation`, `Fill-in`, and `Multiple-choice` are retained. To change this, edit the `filter_rules` list in `DataCurationPipeline.__init__`.
Adds fields: `type`, `type_reason`
**3. Answer Extraction**
Extracts a concise final answer from the `solution` field and writes it to `answer`. Items that already have a non-empty `answer` are skipped (set `overwrite=True` in `AnswerExtractionOperator` to override).
**4. QA Filtering**
Removes items based on the following criteria:
- The question must pose a clear, specific problem suitable for an exam. Examples, statements without questions, and open-ended discussions are rejected.
- The answer must directly address the question.
- The question and answer must be self-contained, without relying on external references or omitted context.
Adds fields: `filter_result`, `filter_reason`
</details>
---
## Stage 3: Generate CoT
The answer model and judge model can use different API endpoints and API keys, which is useful when the answer model is a self-hosted open-source VLM (e.g. Qwen3-VL served via vLLM) and the judge model is a commercial API.
Use `--answer_api_key_env` / `--judge_api_key_env` to specify which environment variable holds the API key for each model (default: `DF_API_KEY` for both).
```bash
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx" # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"
python -m pipelines.generate_cot \
--input_file ./output/curated_vqa.jsonl \
--max_retries 5 \
--answer_api_url https://your-vllm-server/v1 \
--answer_model qwen3-vl-235b-thinking \
--answer_api_key_env VLLM_API_KEY \
--judge_api_url https://api.openai.com/v1 \
--judge_model gpt-5-mini \
--judge_api_key_env DF_API_KEY
```
Output is saved as `curated_vqa_with_cot.jsonl` in the same directory as `--input_file`.
<details>
<summary>Pipeline details</summary>
Uses reject sampling over up to `max_retries` rounds:
**1. Answer Generation** (`VQAReasoningAnswerGenerator`)
The LLM generates a step-by-step answer. Set `skip_text_only=True` in `RejectSamplingPipeline` to process only VQA items (questions containing images); set to `False` to process all items. Generated answer stored in `generated_cot`.
**2. Thinking Cleanup**
Strips `<think>...</think>` content from the generated answer to reduce verification cost. The cleaned answer is stored in `llm_short_answer`. Assumes the model outputs `<think>THINK</think>ANSWER` or `THINK</think>ANSWER`.
**3. Answer Verification** (`BenchDatasetEvaluatorQuestion`)
Compares `llm_short_answer` against the ground truth `answer` using semantic LLM evaluation (with 5% numerical tolerance). Items that pass are marked `answer_match_result = True` and skipped in subsequent rounds.
Set `support_subquestions=True` to evaluate each sub-question independently; `answer_match_result` is `False` if any sub-question is wrong.
Evaluation statistics (overall accuracy, sub-question accuracy) are saved to `./cot_cache/eval_results.jsonl`:
```json
{
"total_samples": 23584,
"matched_samples": 12281,
"accuracy": 0.521,
"total_subquestions": 26380,
"correct_subquestions": 13807,
"subquestion_accuracy": 0.523
}
```
</details>
---
## Examples
Sample PDFs and input JSONL are provided in `examples/VQA/`:
```
examples/VQA/
β”œβ”€β”€ vqa_extract_test.jsonl # Example input for Stage 1
β”œβ”€β”€ questionextract_test.pdf # Single PDF with interleaved Q&A
β”œβ”€β”€ math_question.pdf # Questions PDF (for separated Q&A demo)
└── math_answer.pdf # Answers PDF (for separated Q&A demo)
```
To run the full pipeline on the examples:
```bash
# Stage 1: Extract
python -m pipelines.vqa_extract_optimized_pipeline \
--input_file ./examples/VQA/vqa_extract_test.jsonl \
--output_dir ./output \
--api_url https://generativelanguage.googleapis.com/v1beta/openai/ \
--model gemini-2.5-pro
# Stage 2: Curate
python -m pipelines.curate_data \
--input_file ./output/raw_vqa.jsonl \
--api_url https://api.openai.com/v1 \
--model gpt-5-mini
# Stage 3: Generate CoT
# Example: self-hosted Qwen3-VL for answers, OpenAI for judging
export VLLM_API_KEY="token-xxxx" # or leave empty if your vLLM server needs no key
export DF_API_KEY="sk-xxxx"
python -m pipelines.generate_cot \
--input_file ./output/curated_vqa.jsonl \
--max_retries 5 \
--answer_api_url https://your-vllm-server/v1 \
--answer_model qwen3-vl-235b-thinking \
--answer_api_key_env VLLM_API_KEY \
--judge_api_url https://api.openai.com/v1 \
--judge_model gpt-5-mini \
--judge_api_key_env DF_API_KEY
```
## Note
The implementation in this repository is only for running a demo at small scale. If you wish to run the pipeline on large number of books, you will probably need features [Checkpoint Resume](https://opendcai.github.io/DataFlow-Doc/en/guide/resume/) and [Batched Inference](https://opendcai.github.io/DataFlow-Doc/en/guide/batch/).
## License
This project is licensed under the [Apache License 2.0](LICENSE).