---
tags:
- Visual Question-Visual Answering (VQVA)
- dataset-construction
- image-editing
- multimodal
- instruction-tuning
- visual-reasoning
---
# 🥯 **BAGEL-World-model**
**An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.**
---
The BAGEL-World framework outputs high-quality VQ-VA data via the following steps:
### 🔄 **Preprocessing**
Filters noisy web-interleaved data and classifies it into design-related and knowledge-related documents.
### 🤖 **Agentic Pipeline for VQ-VA Data Creation**
**1. Retriever** selects image pairs containing non-trivial transformations from interleaved documents that can serve as the basis for free-form questions.
**2. Instruction Generator** writes a natural-language question about one image so that the other image serves as the correct answer.
**3. Filterer** removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩.
**4. Rewriter** increases instruction diversity by producing multiple variants of the original questions.
**5. Reasoner** generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image.
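
The five stages above can be read as a chain of functions over candidate records. The sketch below is illustrative only: every name in it (`Candidate`, `run_pipeline`, and the stage callables) is hypothetical and not taken from any released BAGEL-World code. The stages are passed in as plain callables standing in for the model-backed agents, and the choice to generate one reasoning trace per rewritten question is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record passed between stages (not the authors' schema)."""
    question_image: str              # path or ID of the source image
    answer_image: str                # path or ID of the target image
    question: Optional[str] = None   # filled in by the Instruction Generator
    reasoning: Optional[str] = None  # filled in by the Reasoner


def run_pipeline(
    documents: Iterable[List[str]],
    retrieve: Callable[[List[str]], List[Tuple[str, str]]],
    instruct: Callable[[str, str], str],
    keep: Callable[[Candidate], bool],
    rewrite: Callable[[str], List[str]],
    reason: Callable[[Candidate], str],
) -> List[Candidate]:
    """Chain the five stages; each callable is a stand-in for an agent."""
    results: List[Candidate] = []
    for doc in documents:
        # 1. Retriever: pick image pairs with a non-trivial transformation.
        for q_img, a_img in retrieve(doc):
            # 2. Instruction Generator: write a question answered by a_img.
            cand = Candidate(q_img, a_img, question=instruct(q_img, a_img))
            # 3. Filterer: drop low-quality triplets.
            if not keep(cand):
                continue
            # 4. Rewriter: keep the original question and add variants.
            for variant in [cand.question, *rewrite(cand.question)]:
                varied = replace(cand, question=variant)
                # 5. Reasoner: attach a chain-of-thought explanation.
                results.append(replace(varied, reasoning=reason(varied)))
    return results
```

Passing the stages as arguments keeps the control flow separate from whichever models implement each agent, so individual stages can be swapped or stubbed out for testing.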
The framework ultimately outputs **interleaved quadruplets**:
- 🧠 *Question Image*
- 💬 *Visual Question / Instruction*
- 🔍 *Reasoning Trace*
- 🎨 *Answer Image*
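
One possible way to hold such a quadruplet in code is a small record type that serializes to a JSON line. This is a sketch under assumptions: the class name `VQVAQuadruplet`, its field names, the use of file paths for images, and the JSONL layout are all hypothetical, since the actual storage format has not been published here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VQVAQuadruplet:
    """Hypothetical container for one sample; field names are assumptions."""
    question_image: str   # path or ID of the Question Image
    instruction: str      # Visual Question / Instruction text
    reasoning_trace: str  # language-based chain-of-thought
    answer_image: str     # path or ID of the Answer Image


def to_jsonl_line(sample: VQVAQuadruplet) -> str:
    """Serialize one sample as a single JSON line."""
    return json.dumps(asdict(sample), ensure_ascii=False)


def from_jsonl_line(line: str) -> VQVAQuadruplet:
    """Rebuild a sample from a JSON line produced by to_jsonl_line."""
    return VQVAQuadruplet(**json.loads(line))
```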
Stay tuned for updates and examples!