| --- |
| tags: |
| - Visual Question-Visual Answering (VQVA) |
| - dataset-construction |
| - image-editing |
| - multimodal |
| - instruction-tuning |
| - visual-reasoning |
| --- |
| |
| # 🥯 **BAGEL-World-model** |
|
|
| **An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.** |
|
|
|
|
| --- |
|
|
| The BAGEL-World framework outputs high-quality VQ-VA data via the following steps: |
|
|
| ### 🔄 **Preprocessing** |
|
|
| Filters noisy web-interleaved data and classifies it into design- and knowledge-related documents. |
|
|
|
|
| ### 🤖 **Agentic Pipeline for VQ-VA Data Creation** |
|
|
| **1. Retriever** selects image pairs containing non-trivial transformations from interleaved documents that can serve as the basis for free-form questions. |
|
|
| **2. Instruction Generator** writes a natural-language question about one image so that the other image serves as the correct answer. |
|
|
| **3. Filterer** removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩. |
|
|
| **4. Rewriter** increases instruction diversity by producing multiple variants of the original questions. |
|
|
| **5. Reasoner** generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image. |
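
The five stages above can be read as a simple composition of functions. The following is a minimal sketch, not the framework's actual implementation: `run_pipeline` and all of its parameter names are hypothetical, each stage is passed in as a plain callable, and it assumes that the original question is kept alongside its rewrites and that the Reasoner runs once per question variant.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_pipeline(
    documents: Iterable[object],
    retrieve: Callable[[object], Iterable[Tuple[str, str]]],
    write_question: Callable[[str, str], str],
    keep: Callable[[str, str, str], bool],
    rewrite: Callable[[str], List[str]],
    reason: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Chain hypothetical stand-ins for the five agents over a set of documents."""
    samples: List[Dict[str, str]] = []
    for doc in documents:
        # 1. Retriever: candidate (question image, answer image) pairs.
        for src, tgt in retrieve(doc):
            # 2. Instruction Generator: a question whose answer is the target image.
            question = write_question(src, tgt)
            # 3. Filterer: drop low-quality triplets.
            if not keep(src, question, tgt):
                continue
            # 4. Rewriter: original question plus its variants (an assumption).
            for variant in [question, *rewrite(question)]:
                # 5. Reasoner: explanation of how to get from source to target.
                samples.append(
                    {
                        "question_image": src,
                        "question_text": variant,
                        "reasoning": reason(src, variant, tgt),
                        "answer_image": tgt,
                    }
                )
    return samples
```

In practice each callable would wrap a model or agent call; passing them in as arguments keeps the control flow testable with simple stubs.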
|
|
| Finally, the framework outputs **interleaved quadruplets**: |
|
|
| - 🧠 *Question Image* |
| - 💬 *Visual Question / Instruction* |
| - 🔍 *Reasoning Trace* |
| - 🎨 *Answer Image* |
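
One way to picture a single quadruplet is as a small record with four fields. This is an illustrative sketch only: the class name `VQVASample`, the field names, and the JSON-lines serialization are assumptions, not the released data schema, and images are represented here by path strings.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VQVASample:
    """Hypothetical record for one interleaved quadruplet."""

    question_image: str   # path or identifier of the source image (assumed)
    instruction: str      # the visual question / instruction text
    reasoning_trace: str  # chain-of-thought linking source to target
    answer_image: str     # path or identifier of the target image (assumed)

    def to_json(self) -> str:
        # One JSON object per line suits large, streamable datasets.
        return json.dumps(asdict(self), ensure_ascii=False)

    @staticmethod
    def from_json(line: str) -> "VQVASample":
        return VQVASample(**json.loads(line))
```

A record can then be written with `sample.to_json()` and read back with `VQVASample.from_json(line)`.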
|
|
|
|
|
|
|
|
|
|
| Stay tuned for updates and examples! |
|
|