# Model Architecture
This document describes the architecture and training methodology for models fine-tuned on Zebra-CoT.
## Mixture-of-Transformer-Experts (MoT)

The architecture adopts a Mixture-of-Transformer-Experts (MoT) design to maximize the model's capacity to learn from the richly diverse multimodal information in Zebra-CoT.
### Key Design Principles

- **Capacity Maximization:** MoT enables the model to handle the diversity of visual reasoning tasks across scientific, 2D, 3D, and logic/game domains.
- **Expert Specialization:** Different experts can specialize in different types of reasoning patterns (geometric, spatial, strategic, etc.).
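The expert-specialization idea can be sketched in a few lines. The function name, the modality-keyed hard routing, and the lambda "experts" below are illustrative assumptions, not the actual MoT implementation (real experts are full transformer sub-networks):

```python
def route_tokens(tokens, experts):
    """Send each token to the expert assigned to its modality (hard routing)."""
    return [experts[tok["modality"]](tok["value"]) for tok in tokens]

# Hypothetical stand-in experts; in a real MoT these are transformer stacks.
experts = {
    "text": lambda v: f"text_expert({v})",
    "image": lambda v: f"image_expert({v})",
}

tokens = [
    {"modality": "text", "value": "t0"},
    {"modality": "image", "value": "p0"},
]
print(route_tokens(tokens, experts))  # ['text_expert(t0)', 'image_expert(p0)']
```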
### Dual Encoder System

In line with the capacity-maximization principle, the architecture uses two separate encoders:
| Encoder | Purpose | Features Captured |
|---|---|---|
| Pixel-Level Encoder | Low-level visual processing | Edges, textures, fine details |
| Semantic-Level Encoder | High-level understanding | Objects, relationships, concepts |
```
┌───────────────────────────────────────────────────────┐
│                      Input Image                      │
└───────────────────────────────────────────────────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
┌─────────────────────────┐   ┌─────────────────────────┐
│   Pixel-Level Encoder   │   │ Semantic-Level Encoder  │
│  (Fine visual details)  │   │  (High-level concepts)  │
└─────────────────────────┘   └─────────────────────────┘
            │                             │
            └──────────────┬──────────────┘
                           ▼
┌───────────────────────────────────────────────────────┐
│            Mixture-of-Transformer-Experts             │
│                         (MoT)                         │
└───────────────────────────────────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────────┐
│            Next Group of Token Prediction             │
└───────────────────────────────────────────────────────┘
```
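A minimal sketch of the dual-encoder flow, assuming the two token streams are simply concatenated before entering the MoT backbone (the real fusion may differ); the toy encoders are hypothetical stand-ins:

```python
def dual_encode(image, pixel_encoder, semantic_encoder):
    # Concatenating the two token streams is an assumption for illustration.
    return pixel_encoder(image) + semantic_encoder(image)

# Toy stand-ins: the pixel branch emits one token per patch (fine detail),
# the semantic branch emits a single high-level scene token.
pixel_encoder = lambda img: [("pixel", patch) for patch in img]
semantic_encoder = lambda img: [("semantic", "scene:" + "+".join(img))]

tokens = dual_encode(["edge_patch", "texture_patch"], pixel_encoder, semantic_encoder)
print(tokens)
```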
## Next Group of Token Prediction (NGTP)

The training paradigm follows Next Group of Token Prediction: at each step the model predicts the next group of language or visual tokens, rather than a single token, using the group as a compression target.
### Advantages
- Interleaved Generation: Naturally supports generating interleaved text-image reasoning traces
- Efficient Compression: Groups of tokens provide better information density
- Flexible Modality: Can predict either language tokens or visual tokens based on context
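One way to picture the grouping is to split an interleaved stream into maximal same-modality runs, each run forming one prediction target. This grouping rule is an assumption for illustration, not the paper's exact tokenization:

```python
def split_into_groups(tokens):
    """Group a mixed (modality, value) stream into maximal same-modality runs."""
    groups = []
    for modality, value in tokens:
        if groups and groups[-1][0] == modality:
            groups[-1][1].append(value)   # extend the current run
        else:
            groups.append((modality, [value]))  # start a new group
    return groups

stream = [("text", "t0"), ("text", "t1"), ("image", "p0"), ("image", "p1"), ("text", "t2")]
print(split_into_groups(stream))
# [('text', ['t0', 't1']), ('image', ['p0', 'p1']), ('text', ['t2'])]
```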
### Token Groups
```
Input:  [Question] [Problem Image]
            │
            ▼
Step 1: [THOUGHT 1]          ───────▶ Language tokens
            │
            ▼
Step 2: [REASONING IMAGE 1]  ───────▶ Visual tokens
            │
            ▼
Step 3: [THOUGHT 2]          ───────▶ Language tokens
            │
            ▼
Output: [FINAL ANSWER]
```
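This trace can be sketched as a generation loop that emits one token group per step, choosing the modality at each step and stopping at the final answer. The toy model and group format below are hypothetical:

```python
def generate_trace(model, prompt, max_groups=8):
    """Emit one group (language or visual) per step until an answer group."""
    trace = list(prompt)
    for _ in range(max_groups):
        group = model(trace)   # predicts the next token group from the trace
        trace.append(group)
        if group["kind"] == "answer":
            break
    return trace

# Toy stand-in model that replays a scripted interleaved trace.
script = iter([
    {"kind": "language", "tokens": "[THOUGHT 1]"},
    {"kind": "visual",   "tokens": "[REASONING IMAGE 1]"},
    {"kind": "language", "tokens": "[THOUGHT 2]"},
    {"kind": "answer",   "tokens": "[FINAL ANSWER]"},
])
toy_model = lambda trace: next(script)

trace = generate_trace(toy_model, [{"kind": "input", "tokens": "[Question] [Problem Image]"}])
print([g["tokens"] for g in trace])
# ['[Question] [Problem Image]', '[THOUGHT 1]', '[REASONING IMAGE 1]',
#  '[THOUGHT 2]', '[FINAL ANSWER]']
```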
## Performance Results
### In-Distribution Test Accuracy
| Model | Before Fine-tuning | After Fine-tuning | Improvement |
|---|---|---|---|
| Anole-7B | 4.2% | 16.9% | +12.7 pts |
| Bagel-7B | — | High-quality interleaved chains | Qualitative |
### Benchmark Improvements

Fine-tuning on Zebra-CoT yields gains of up to +13% on standard VLM benchmarks:
- Enhanced visual reasoning capabilities
- Improved chain-of-thought generation
- Better intermediate step visualization
### Category-Specific Performance
| Category | Description | Key Reasoning Skills |
|---|---|---|
| Scientific | Geometry, Physics, Algorithms | Diagram construction, step-by-step derivation |
| 2D Visual | Visual search, Jigsaw | Pattern recognition, spatial arrangement |
| 3D Visual | Multi-hop inference, Embodied planning | Depth perception, navigation |
| Logic/Games | Chess, Visual logic | Strategic thinking, rule application |