# Model Architecture

This document describes the architecture and training methodology for models fine-tuned on Zebra-CoT.

## Mixture-of-Transformer-Experts (MoT)

The architecture adopts a **Mixture-of-Transformer-Experts (MoT)** design to maximize the model's capacity to learn from richly diverse multimodal information.

### Key Design Principles

1. **Capacity Maximization**: MoT enables the model to handle the diversity of visual reasoning tasks across scientific, 2D, 3D, and logic/game domains.

2. **Expert Specialization**: Different experts can specialize in different types of reasoning patterns (geometric, spatial, strategic, etc.).
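The routing idea behind expert specialization can be sketched in a few lines. This is a toy illustration under assumed names and shapes (`expert_ffn`, `mot_layer`, per-modality weight dicts are all invented here, not the paper's implementation): each token is dispatched to the feed-forward expert registered for its modality, while attention (omitted) would remain shared.

```python
# Toy sketch of modality-routed experts in the spirit of MoT.
# All names and shapes are illustrative assumptions.
import numpy as np

def expert_ffn(x, w1, w2):
    """A minimal feed-forward expert: linear -> ReLU -> linear."""
    return np.maximum(x @ w1, 0.0) @ w2

def mot_layer(tokens, modality, experts):
    """Route each token to the expert registered for its modality tag.

    tokens:   (n, d) array of token embeddings
    modality: length-n list of tags, e.g. "text" / "image"
    experts:  dict mapping tag -> (w1, w2) expert weights
    """
    out = np.empty_like(tokens)
    for tag, (w1, w2) in experts.items():
        mask = np.array([m == tag for m in modality])
        if mask.any():
            out[mask] = expert_ffn(tokens[mask], w1, w2)
    return out

rng = np.random.default_rng(0)
d, h = 8, 16
experts = {
    "text": (rng.normal(size=(d, h)), rng.normal(size=(h, d))),
    "image": (rng.normal(size=(d, h)), rng.normal(size=(h, d))),
}
tokens = rng.normal(size=(4, d))
modality = ["text", "image", "image", "text"]
out = mot_layer(tokens, modality, experts)
```

Because each expert only ever sees tokens of one modality, its weights are free to specialize, which is the capacity argument made above.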

## Dual Encoder System

Following the capacity maximization principle, the architecture utilizes **two separate encoders**:

| Encoder | Purpose | Features Captured |
|---------|---------|-------------------|
| **Pixel-Level Encoder** | Low-level visual processing | Edges, textures, fine details |
| **Semantic-Level Encoder** | High-level understanding | Objects, relationships, concepts |

```
┌──────────────────────────────────────────────────────────┐
│                      Input Image                         │
└──────────────────────────────────────────────────────────┘
                            │
              ┌─────────────┴─────────────┐
              ▼                           ▼
┌─────────────────────────┐   ┌─────────────────────────┐
│   Pixel-Level Encoder   │   │  Semantic-Level Encoder │
│   (Fine visual details) │   │  (High-level concepts)  │
└─────────────────────────┘   └─────────────────────────┘
              │                           │
              └─────────────┬─────────────┘
                            ▼
┌──────────────────────────────────────────────────────────┐
│              Mixture-of-Transformer-Experts              │
│                        (MoT)                             │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│           Next Group of Token Prediction                 │
└──────────────────────────────────────────────────────────┘
```
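The fusion step can be made concrete with a toy sketch. Everything here is an assumption for illustration: the pixel-level encoder is stood in for by raw patch flattening, the semantic-level encoder by a random projection, and the two token streams are simply concatenated before the MoT backbone (the paper's actual interface may combine them differently).

```python
# Toy dual-encoder sketch: low-level patch tokens plus high-level
# semantic tokens, concatenated into one visual token sequence.
# Both "encoders" are illustrative stand-ins, not real models.
import numpy as np

def pixel_encoder(image, patch=4):
    """Split the image into patch x patch tiles; flatten each tile
    into one low-level token (preserves edges and fine detail)."""
    h, w = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def semantic_encoder(image, n_tokens=4, d=16):
    """Stand-in for a pretrained vision encoder that emits a few
    coarse, high-level tokens for the whole image."""
    rng = np.random.default_rng(int(image.sum()) % (2**32))
    return rng.normal(size=(n_tokens, d))

image = np.arange(16 * 16, dtype=float).reshape(16, 16)
low = pixel_encoder(image)       # (16, 16): one token per 4x4 patch
high = semantic_encoder(image)   # (4, 16): coarse semantic tokens
visual_tokens = np.concatenate([low, high], axis=0)  # fed to the MoT
```

The point of the sketch is the division of labor: the pixel stream keeps every input value recoverable, while the semantic stream is a small, lossy summary.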

## Next Group of Token Prediction (NGTP)

The training paradigm follows **Next Group of Token Prediction**, where the model predicts the next group of language or visual tokens as a compression target.

### Advantages

- **Interleaved Generation**: Naturally supports generating interleaved text-image reasoning traces
- **Efficient Compression**: Groups of tokens provide better information density
- **Flexible Modality**: Can predict either language tokens or visual tokens based on context

### Token Groups

```
Input:  [Question] [Problem Image]
          │
          ▼
Step 1: [THOUGHT 1] ────────────────► Language tokens
          │
          ▼
Step 2: [REASONING IMAGE 1] ────────► Visual tokens
          │
          ▼
Step 3: [THOUGHT 2] ────────────────► Language tokens
          │
          ▼
Output: [FINAL ANSWER]
```
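The training pairs implied by the trace above can be enumerated schematically. This is a sketch under assumptions: the trace is modeled as a list of `(modality, tokens)` groups with invented token names, and at each step the target is the entire next group rather than a single token, with the target's modality deciding whether language or visual tokens are emitted.

```python
# Schematic of next-group-of-token prediction over an interleaved
# text-image reasoning trace. Group contents are invented examples.
def training_pairs(groups):
    """Yield (context, target_group) pairs: the model conditions on
    all earlier groups and predicts the whole next group."""
    for i in range(1, len(groups)):
        yield groups[:i], groups[i]

trace = [
    ("text",  ["<question>", "<problem-image-ref>"]),
    ("text",  ["thought", "1"]),
    ("image", ["vis_tok_1", "vis_tok_2", "vis_tok_3"]),
    ("text",  ["thought", "2"]),
    ("text",  ["final", "answer"]),
]
pairs = list(training_pairs(trace))
# Each target is a whole group; its modality tag tells the model
# whether the next prediction is language tokens or visual tokens.
```

Predicting whole groups is what lets a single objective cover both the textual thoughts and the intermediate reasoning images in one interleaved sequence.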

## Performance Results

### In-Distribution Test Accuracy

| Model | Before Fine-tuning | After Fine-tuning | Improvement |
|-------|-------------------|-------------------|-------------|
| Anole-7B | 4.2% | 16.9% | **+12.7 pts** |
| Bagel-7B | — | High-quality interleaved chains | Qualitative |

### Benchmark Improvements

Fine-tuning on Zebra-CoT yields up to **+13%** performance gain on standard VLM benchmarks:

- Enhanced visual reasoning capabilities
- Improved chain-of-thought generation
- Better intermediate step visualization

## Category-Specific Performance

| Category | Description | Key Reasoning Skills |
|----------|-------------|---------------------|
| Scientific | Geometry, Physics, Algorithms | Diagram construction, step-by-step derivation |
| 2D Visual | Visual search, Jigsaw | Pattern recognition, spatial arrangement |
| 3D Visual | Multi-hop inference, Embodied planning | Depth perception, navigation |
| Logic/Games | Chess, Visual logic | Strategic thinking, rule application |

## References

- [Zebra-CoT Paper](https://arxiv.org/abs/2507.16746)
- [Anole-Zebra-CoT Model](https://huggingface.co/multimodal-reasoning-lab/Anole-Zebra-CoT)
- [Bagel-Zebra-CoT Model](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)