---
tags:
- Visual Question-Visual Answering (VQVA)
- dataset-construction
- image-editing
- multimodal
- instruction-tuning
- visual-reasoning
---

# 🥯 **BAGEL-World-model**

**An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.**

---

The BAGEL-World framework produces high-quality VQ-VA data through the following steps:

### 🔄 **Preprocessing**

Filters and classifies noisy web-interleaved data into design- and knowledge-related documents.

### 🤖 **Agentic Pipeline for VQ-VA Data Creation**

**1. Retriever** selects image pairs containing non-trivial transformations from interleaved documents; these pairs serve as the basis for free-form questions.

**2. Instruction Generator** writes a natural-language question about one image so that the other image serves as the correct answer.

**3. Filterer** removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩.

**4. Rewriter** increases instruction diversity by producing multiple variants of each original question.

**5. Reasoner** generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image.

The framework finally outputs **interleaved quadruplets**:

- 🧠 *Question Image*
- 💬 *Visual Question / Instruction*
- 🔍 *Reasoning Trace*
- 🎨 *Answer Image*

Stay tuned for updates and examples!
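
Until official examples are published, the agentic pipeline and quadruplet format described above can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the `Quadruplet` class, the `run_pipeline` function, its callable parameters, and the file names are hypothetical and are not part of any released BAGEL-World code. The sketch also assumes that the original question is kept alongside its rewritten variants and that the Reasoner runs once per variant; the actual framework may do either differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


# Hypothetical container for one output sample (names are illustrative only).
@dataclass(frozen=True)
class Quadruplet:
    question_image: str   # path or ID of the source image
    question_text: str    # free-form visual question / instruction
    reasoning_trace: str  # chain-of-thought describing the transformation
    answer_image: str     # path or ID of the target image


def run_pipeline(
    image_pairs: Iterable[Tuple[str, str]],         # pairs chosen by a retriever
    write_question: Callable[[str, str], str],      # instruction generator
    keep: Callable[[str, str, str], bool],          # filterer
    rewrite: Callable[[str], List[str]],            # rewriter
    reason: Callable[[str, str, str], str],         # reasoner
) -> List[Quadruplet]:
    """Chain the generator, filterer, rewriter, and reasoner over image pairs."""
    samples: List[Quadruplet] = []
    for source, target in image_pairs:
        question = write_question(source, target)
        if not keep(source, question, target):
            continue  # drop low-quality triplets
        # Assumption: keep the original question together with its variants.
        for variant in [question, *rewrite(question)]:
            trace = reason(source, variant, target)
            samples.append(Quadruplet(source, variant, trace, target))
    return samples


def demo() -> List[Quadruplet]:
    """Run the sketch with trivial stand-ins for each agent."""
    pairs = [("sketch.png", "render.png"), ("a.png", "a_copy.png")]
    return run_pipeline(
        image_pairs=pairs,
        write_question=lambda s, t: f"What would {s} look like after the change?",
        keep=lambda s, q, t: not t.endswith("_copy.png"),  # toy triviality check
        rewrite=lambda q: [q.replace("What would", "How would")],
        reason=lambda s, q, t: f"Transform {s} into {t}.",
    )
```

In a real setting each callable would wrap a model call; passing the stages in as plain functions keeps the sketch independent of any particular model or API.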