# Model Architecture
This document describes the architecture and training methodology for models fine-tuned on Zebra-CoT.
## Mixture-of-Transformer-Experts (MoT)
The architecture adopts a **Mixture-of-Transformer-Experts (MoT)** design to maximize the model's capacity to learn from richly diverse multimodal information.
### Key Design Principles
1. **Capacity Maximization**: MoT enables the model to handle the diversity of visual reasoning tasks across scientific, 2D, 3D, and logic/game domains.
2. **Expert Specialization**: Different experts can specialize in different types of reasoning patterns (geometric, spatial, strategic, etc.).
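The paper does not specify the exact routing scheme, but a common MoT pattern is to share self-attention across the interleaved sequence while giving each modality its own feed-forward expert. A minimal PyTorch sketch of one such layer, assuming modality-based routing and illustrative dimensions (`MoTLayer`, `d_model`, and the text/image expert split are hypothetical names, not the released implementation):

```python
import torch
import torch.nn as nn

class MoTLayer(nn.Module):
    """Hypothetical MoT layer: attention is shared over the full interleaved
    sequence, while each modality's tokens pass through their own FFN expert."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # one feed-forward expert per modality
        self.experts = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model)),
            "image": nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model)),
        })
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, is_text):
        # shared self-attention over the interleaved text/image sequence
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # modality-specific experts: text tokens -> text FFN, image -> image FFN
        h = self.norm2(x)
        out = torch.zeros_like(h)
        out[is_text] = self.experts["text"](h[is_text])
        out[~is_text] = self.experts["image"](h[~is_text])
        return x + out

layer = MoTLayer()
tokens = torch.randn(2, 10, 64)       # batch of interleaved sequences
is_text = torch.rand(2, 10) > 0.5     # True = language token, False = visual token
y = layer(tokens, is_text)
print(y.shape)  # torch.Size([2, 10, 64])
```

Routing by modality (rather than by a learned gate) is what lets each expert specialize on its token type while attention still mixes information across the whole reasoning trace.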
## Dual Encoder System
Following the capacity maximization principle, the architecture utilizes **two separate encoders**:
| Encoder | Purpose | Features Captured |
|---------|---------|-------------------|
| **Pixel-Level Encoder** | Low-level visual processing | Edges, textures, fine details |
| **Semantic-Level Encoder** | High-level understanding | Objects, relationships, concepts |
```
+----------------------------------------------------------+
|                       Input Image                        |
+----------------------------------------------------------+
                     |
            +--------+--------+
            v                 v
+-----------------------+   +------------------------+
|  Pixel-Level Encoder  |   | Semantic-Level Encoder |
| (Fine visual details) |   | (High-level concepts)  |
+-----------------------+   +------------------------+
            |                 |
            +--------+--------+
                     v
+----------------------------------------------------------+
|           Mixture-of-Transformer-Experts (MoT)           |
+----------------------------------------------------------+
                     |
                     v
+----------------------------------------------------------+
|             Next Group of Token Prediction               |
+----------------------------------------------------------+
```
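One way the two encoders could feed a single backbone is to patchify the image at two granularities and concatenate the resulting token streams. A minimal sketch, assuming convolutional patch embeddings as stand-ins for the two encoders (the class name, patch sizes, and dimensions are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Hypothetical dual-encoder front end: fine patches for pixel-level
    detail, coarse patches for semantic-level content; both streams are
    concatenated into one token sequence for the MoT backbone."""
    def __init__(self, d_model=64):
        super().__init__()
        # small patches -> many tokens -> edges, textures, fine details
        self.pixel_proj = nn.Conv2d(3, d_model, kernel_size=8, stride=8)
        # large patches -> few tokens -> objects, relationships, concepts
        self.semantic_proj = nn.Conv2d(3, d_model, kernel_size=16, stride=16)

    def forward(self, image):
        # (B, 3, H, W) -> (B, N_tokens, d_model) for each stream
        px = self.pixel_proj(image).flatten(2).transpose(1, 2)
        sem = self.semantic_proj(image).flatten(2).transpose(1, 2)
        return torch.cat([px, sem], dim=1)

enc = DualEncoder()
img = torch.randn(1, 3, 64, 64)
tokens = enc(img)
print(tokens.shape)  # torch.Size([1, 80, 64]): 8x8 fine + 4x4 coarse tokens
```

The point of the split is that neither encoder has to compromise: the fine stream preserves detail a semantic encoder would discard, and the coarse stream supplies abstractions the pixel stream cannot.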
## Next Group of Token Prediction (NGTP)
The training paradigm follows **Next Group of Token Prediction**, where the model predicts the next group of language or visual tokens as a compression target.
### Advantages
- **Interleaved Generation**: Naturally supports generating interleaved text-image reasoning traces
- **Efficient Compression**: Groups of tokens provide better information density
- **Flexible Modality**: Can predict either language tokens or visual tokens based on context
### Token Groups
```
Input:   [Question] [Problem Image]
                 |
                 v
Step 1:  [THOUGHT 1]           ----> Language tokens
                 |
                 v
Step 2:  [REASONING IMAGE 1]   ----> Visual tokens
                 |
                 v
Step 3:  [THOUGHT 2]           ----> Language tokens
                 |
                 v
Output:  [FINAL ANSWER]
```
## Performance Results
### In-Distribution Test Accuracy
| Model | Before Fine-tuning | After Fine-tuning | Improvement |
|-------|-------------------|-------------------|-------------|
| Anole-7B | 4.2% | 16.9% | **+12.7%** |
| Bagel-7B | N/A | High-quality interleaved chains | Qualitative |
### Benchmark Improvements
Fine-tuning on Zebra-CoT yields up to **+13%** performance gain on standard VLM benchmarks:
- Enhanced visual reasoning capabilities
- Improved chain-of-thought generation
- Better intermediate step visualization
## Category-Specific Performance
| Category | Description | Key Reasoning Skills |
|----------|-------------|---------------------|
| Scientific | Geometry, Physics, Algorithms | Diagram construction, step-by-step derivation |
| 2D Visual | Visual search, Jigsaw | Pattern recognition, spatial arrangement |
| 3D Visual | Multi-hop inference, Embodied planning | Depth perception, navigation |
| Logic/Games | Chess, Visual logic | Strategic thinking, rule application |
## References
- [Zebra-CoT Paper](https://arxiv.org/abs/2507.16746)
- [Anole-Zebra-CoT Model](https://huggingface.co/multimodal-reasoning-lab/Anole-Zebra-CoT)
- [Bagel-Zebra-CoT Model](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)