# Model Architecture
This document describes the architecture and training methodology for models fine-tuned on Zebra-CoT.
## Mixture-of-Transformer-Experts (MoT)
The architecture adopts a **Mixture-of-Transformer-Experts (MoT)** design to maximize the model's capacity to learn from richly diverse multimodal information.
### Key Design Principles
1. **Capacity Maximization**: MoT enables the model to handle the diversity of visual reasoning tasks across scientific, 2D, 3D, and logic/game domains.
2. **Expert Specialization**: Different experts can specialize in different types of reasoning patterns (geometric, spatial, strategic, etc.).
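The routing idea behind expert specialization can be sketched in a few lines. This is a hypothetical illustration, not the actual MoT implementation: each token carries a modality/type tag, and a tag-specific feed-forward "expert" processes it, while attention (not shown) would be shared across all tokens.

```python
import numpy as np

def mot_ffn(tokens, tags, experts):
    """Apply to each token the expert weight matching its tag.

    tokens:  (n, d) array of token embeddings
    tags:    (n,) array of string tags (e.g. "text" / "image")
    experts: dict mapping tag -> (d, d) expert projection (toy stand-in
             for a full expert feed-forward block)
    """
    out = np.empty_like(tokens)
    for name, weight in experts.items():
        mask = tags == name
        out[mask] = tokens[mask] @ weight  # expert-specific projection
    return out

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(6, d))  # 6 tokens of dimension 8
tags = np.array(["text", "text", "image", "image", "text", "image"])
experts = {"text": rng.normal(size=(d, d)),
           "image": rng.normal(size=(d, d))}

out = mot_ffn(tokens, tags, experts)
```

Because each expert only ever sees tokens of its own kind, its weights are free to specialize (e.g. geometric vs. strategic reasoning patterns) without interference from the other streams.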
## Dual Encoder System
Following the capacity maximization principle, the architecture utilizes **two separate encoders**:
| Encoder | Purpose | Features Captured |
|---------|---------|-------------------|
| **Pixel-Level Encoder** | Low-level visual processing | Edges, textures, fine details |
| **Semantic-Level Encoder** | High-level understanding | Objects, relationships, concepts |
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Input Image β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Pixel-Level Encoder β”‚ β”‚ Semantic-Level Encoder β”‚
β”‚ (Fine visual details) β”‚ β”‚ (High-level concepts) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Mixture-of-Transformer-Experts β”‚
β”‚ (MoT) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Next Group of Token Prediction β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
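The split above can be sketched with toy patch encoders. This is an illustrative assumption, not the real encoders: small patches stand in for the pixel-level path (many tokens, fine detail), large patches for the semantic-level path (few tokens, coarse context), and plain linear maps stand in for the actual networks. Both streams project into a shared model dimension so the MoT backbone can attend over their concatenation.

```python
import numpy as np

def patchify(img, patch):
    """Split a square image into flattened non-overlapping patches."""
    h, w = img.shape
    patches = img.reshape(h // patch, patch, w // patch, patch)
    return patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))  # toy single-channel image

d_model = 16
pixel_patches = patchify(image, 4)      # 64 fine-grained tokens
semantic_patches = patchify(image, 16)  # 4 coarse tokens

w_pixel = rng.normal(size=(4 * 4, d_model))       # 16 -> d_model
w_semantic = rng.normal(size=(16 * 16, d_model))  # 256 -> d_model

pixel_tokens = pixel_patches @ w_pixel
semantic_tokens = semantic_patches @ w_semantic

# Shared d_model lets the backbone mix both token streams directly.
fused = np.concatenate([pixel_tokens, semantic_tokens], axis=0)
```

The design choice this illustrates: edges and textures need many high-resolution tokens, while objects and relationships are captured adequately by far fewer, more abstract ones.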
## Next Group of Token Prediction (NGTP)
The training paradigm follows **Next Group of Token Prediction**: rather than predicting one token at a time, the model predicts the next group of language or visual tokens, treating each group as a compression target.
### Advantages
- **Interleaved Generation**: Naturally supports generating interleaved text-image reasoning traces
- **Efficient Compression**: Groups of tokens provide better information density
- **Flexible Modality**: Can predict either language tokens or visual tokens based on context
### Token Groups
```
Input: [Question] [Problem Image]
β”‚
β–Ό
Step 1: [THOUGHT 1] ────────────────► Language tokens
β”‚
β–Ό
Step 2: [REASONING IMAGE 1] ────────► Visual tokens
β”‚
β–Ό
Step 3: [THOUGHT 2] ────────────────► Language tokens
β”‚
β–Ό
Output: [FINAL ANSWER]
```
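The grouping scheme above can be sketched as follows. This is a simplified illustration under assumed conventions (a fixed group size and toy token strings); the real tokenization and group boundaries are model-specific.

```python
def make_group_targets(tokens, group_size):
    """Pair each prefix of complete groups with the next group as target.

    Returns a list of (context, target) pairs, where context is the
    flattened sequence of all earlier groups and target is the next
    whole group to be predicted at once.
    """
    groups = [tokens[i:i + group_size]
              for i in range(0, len(tokens), group_size)]
    return [(sum(groups[:i], []), groups[i])
            for i in range(1, len(groups))]

# Toy interleaved trace: text thought, visual tokens, text thought.
trace = ["<t1>", "<t2>", "<img1>", "<img2>", "<t3>", "<t4>"]
pairs = make_group_targets(trace, group_size=2)
# pairs[0] -> (["<t1>", "<t2>"], ["<img1>", "<img2>"])
```

Note how the second target group is visual while the others are textual: the same prediction objective covers both modalities, which is what makes interleaved text-image traces a natural fit.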
## Performance Results
### In-Distribution Test Accuracy
| Model | Before Fine-tuning | After Fine-tuning | Improvement |
|-------|-------------------|-------------------|-------------|
| Anole-7B | 4.2% | 16.9% | **+12.7 pts** |
| Bagel-7B | β€” | High-quality interleaved chains | Qualitative |
### Benchmark Improvements
Fine-tuning on Zebra-CoT yields up to **+13%** performance gain on standard VLM benchmarks:
- Enhanced visual reasoning capabilities
- Improved chain-of-thought generation
- Better intermediate step visualization
## Category-Specific Performance
| Category | Description | Key Reasoning Skills |
|----------|-------------|---------------------|
| Scientific | Geometry, Physics, Algorithms | Diagram construction, step-by-step derivation |
| 2D Visual | Visual search, Jigsaw | Pattern recognition, spatial arrangement |
| 3D Visual | Multi-hop inference, Embodied planning | Depth perception, navigation |
| Logic/Games | Chess, Visual logic | Strategic thinking, rule application |
## References
- [Zebra-CoT Paper](https://arxiv.org/abs/2507.16746)
- [Anole-Zebra-CoT Model](https://huggingface.co/multimodal-reasoning-lab/Anole-Zebra-CoT)
- [Bagel-Zebra-CoT Model](https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT)