# Model Architecture
This document describes the architecture and training methodology for models fine-tuned on Zebra-CoT.
## Mixture-of-Transformer-Experts (MoT)

The architecture adopts a Mixture-of-Transformer-Experts (MoT) design to maximize the model's capacity to learn from the richly diverse multimodal information in Zebra-CoT.
### Key Design Principles

- **Capacity Maximization:** MoT enables the model to handle the diversity of visual reasoning tasks across scientific, 2D, 3D, and logic/game domains.
- **Expert Specialization:** Different experts can specialize in different types of reasoning patterns (geometric, spatial, strategic, etc.).
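The expert-specialization idea can be sketched in a few lines. The function name, the modality-keyed hard routing, and the lambda "experts" below are illustrative assumptions, not the actual MoT implementation (real experts are full transformer sub-networks):

```python
def route_tokens(tokens, experts):
    """Send each token to the expert assigned to its modality (hard routing)."""
    return [experts[tok["modality"]](tok["value"]) for tok in tokens]

# Hypothetical stand-in experts; in a real MoT these are transformer stacks.
experts = {
    "text": lambda v: f"text_expert({v})",
    "image": lambda v: f"image_expert({v})",
}

tokens = [
    {"modality": "text", "value": "t0"},
    {"modality": "image", "value": "p0"},
]
print(route_tokens(tokens, experts))  # ['text_expert(t0)', 'image_expert(p0)']
```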
### Dual Encoder System

In line with the capacity-maximization principle, the architecture uses two separate encoders:
| Encoder | Purpose | Features Captured |
|---|---|---|
| Pixel-Level Encoder | Low-level visual processing | Edges, textures, fine details |
| Semantic-Level Encoder | High-level understanding | Objects, relationships, concepts |
```
┌───────────────────────────────────────────────────────┐
│                      Input Image                      │
└───────────────────────────────────────────────────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
┌─────────────────────────┐   ┌─────────────────────────┐
│   Pixel-Level Encoder   │   │ Semantic-Level Encoder  │
│  (Fine visual details)  │   │  (High-level concepts)  │
└─────────────────────────┘   └─────────────────────────┘
            │                             │
            └──────────────┬──────────────┘
                           ▼
┌───────────────────────────────────────────────────────┐
│            Mixture-of-Transformer-Experts             │
│                         (MoT)                         │
└───────────────────────────────────────────────────────┘
                           │
                           ▼
┌───────────────────────────────────────────────────────┐
│            Next Group of Token Prediction             │
└───────────────────────────────────────────────────────┘
```
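A minimal sketch of the dual-encoder flow, assuming the two token streams are simply concatenated before entering the MoT backbone (the real fusion may differ); the toy encoders are hypothetical stand-ins:

```python
def dual_encode(image, pixel_encoder, semantic_encoder):
    # Concatenating the two token streams is an assumption for illustration.
    return pixel_encoder(image) + semantic_encoder(image)

# Toy stand-ins: the pixel branch emits one token per patch (fine detail),
# the semantic branch emits a single high-level scene token.
pixel_encoder = lambda img: [("pixel", patch) for patch in img]
semantic_encoder = lambda img: [("semantic", "scene:" + "+".join(img))]

tokens = dual_encode(["edge_patch", "texture_patch"], pixel_encoder, semantic_encoder)
print(tokens)
```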
## Next Group of Token Prediction (NGTP)

The training paradigm follows Next Group of Token Prediction: at each step the model predicts the next group of language or visual tokens, rather than a single token, using the group as a compression target.
### Advantages
- Interleaved Generation: Naturally supports generating interleaved text-image reasoning traces
- Efficient Compression: Groups of tokens provide better information density
- Flexible Modality: Can predict either language tokens or visual tokens based on context
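One way to picture the grouping is to split an interleaved stream into maximal same-modality runs, each run forming one prediction target. This grouping rule is an assumption for illustration, not the paper's exact tokenization:

```python
def split_into_groups(tokens):
    """Group a mixed (modality, value) stream into maximal same-modality runs."""
    groups = []
    for modality, value in tokens:
        if groups and groups[-1][0] == modality:
            groups[-1][1].append(value)   # extend the current run
        else:
            groups.append((modality, [value]))  # start a new group
    return groups

stream = [("text", "t0"), ("text", "t1"), ("image", "p0"), ("image", "p1"), ("text", "t2")]
print(split_into_groups(stream))
# [('text', ['t0', 't1']), ('image', ['p0', 'p1']), ('text', ['t2'])]
```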
### Token Groups
```
Input:  [Question] [Problem Image]
            │
            ▼
Step 1: [THOUGHT 1]          ───────▶ Language tokens
            │
            ▼
Step 2: [REASONING IMAGE 1]  ───────▶ Visual tokens
            │
            ▼
Step 3: [THOUGHT 2]          ───────▶ Language tokens
            │
            ▼
Output: [FINAL ANSWER]
```
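This trace can be sketched as a generation loop that emits one token group per step, choosing the modality at each step and stopping at the final answer. The toy model and group format below are hypothetical:

```python
def generate_trace(model, prompt, max_groups=8):
    """Emit one group (language or visual) per step until an answer group."""
    trace = list(prompt)
    for _ in range(max_groups):
        group = model(trace)   # predicts the next token group from the trace
        trace.append(group)
        if group["kind"] == "answer":
            break
    return trace

# Toy stand-in model that replays a scripted interleaved trace.
script = iter([
    {"kind": "language", "tokens": "[THOUGHT 1]"},
    {"kind": "visual",   "tokens": "[REASONING IMAGE 1]"},
    {"kind": "language", "tokens": "[THOUGHT 2]"},
    {"kind": "answer",   "tokens": "[FINAL ANSWER]"},
])
toy_model = lambda trace: next(script)

trace = generate_trace(toy_model, [{"kind": "input", "tokens": "[Question] [Problem Image]"}])
print([g["tokens"] for g in trace])
# ['[Question] [Problem Image]', '[THOUGHT 1]', '[REASONING IMAGE 1]',
#  '[THOUGHT 2]', '[FINAL ANSWER]']
```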
## Performance Results
### In-Distribution Test Accuracy
| Model | Before Fine-tuning | After Fine-tuning | Improvement |
|---|---|---|---|
| Anole-7B | 4.2% | 16.9% | +12.7 pts |
| Bagel-7B | — | High-quality interleaved chains | Qualitative |
### Benchmark Improvements

Fine-tuning on Zebra-CoT yields gains of up to +13% on standard VLM benchmarks:
- Enhanced visual reasoning capabilities
- Improved chain-of-thought generation
- Better intermediate step visualization
### Category-Specific Performance
| Category | Description | Key Reasoning Skills |
|---|---|---|
| Scientific | Geometry, Physics, Algorithms | Diagram construction, step-by-step derivation |
| 2D Visual | Visual search, Jigsaw | Pattern recognition, spatial arrangement |
| 3D Visual | Multi-hop inference, Embodied planning | Depth perception, navigation |
| Logic/Games | Chess, Visual logic | Strategic thinking, rule application |