
# Model Architecture

This document describes the architecture and training methodology for models fine-tuned on Zebra-CoT.

## Mixture-of-Transformer-Experts (MoT)

The architecture adopts a Mixture-of-Transformer-Experts (MoT) design to maximize the model's capacity to learn from the dataset's diverse multimodal information.

### Key Design Principles

  1. Capacity Maximization: MoT enables the model to handle the diversity of visual reasoning tasks across scientific, 2D, 3D, and logic/game domains.

  2. Expert Specialization: Different experts can specialize in different types of reasoning patterns (geometric, spatial, strategic, etc.).
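The routing idea behind an expert-specialized block can be sketched in a few lines. The following is an illustrative top-1 routing layer, not the released implementation; all names, shapes, and the routing scheme are assumptions for exposition:

```python
import numpy as np

rng = np.random.default_rng(0)

def mot_layer(tokens, experts, router_w):
    """Illustrative top-1 expert routing.

    tokens:   (n_tokens, d_model) hidden states
    experts:  list of (d_model, d_model) expert weight matrices
    router_w: (d_model, n_experts) routing projection
    """
    logits = tokens @ router_w          # (n_tokens, n_experts) routing scores
    choice = logits.argmax(axis=1)      # winning expert index per token
    out = np.empty_like(tokens)
    for e, w in enumerate(experts):
        mask = choice == e
        out[mask] = tokens[mask] @ w    # each expert processes only its tokens
    return out, choice

d_model, n_experts = 8, 4
tokens = rng.normal(size=(16, d_model))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts))
out, choice = mot_layer(tokens, experts, router_w)
```

In this toy setup, specialization emerges because gradients for each expert come only from the tokens routed to it, so different experts can adapt to different reasoning patterns.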

## Dual Encoder System

Following the capacity maximization principle, the architecture utilizes two separate encoders:

| Encoder | Purpose | Features Captured |
|---|---|---|
| Pixel-Level Encoder | Low-level visual processing | Edges, textures, fine details |
| Semantic-Level Encoder | High-level understanding | Objects, relationships, concepts |

```
┌──────────────────────────────────────────────────────────┐
│                      Input Image                         │
└──────────────────────────────────────────────────────────┘
                            │
              ┌─────────────┴─────────────┐
              ▼                           ▼
┌─────────────────────────┐   ┌─────────────────────────┐
│   Pixel-Level Encoder   │   │  Semantic-Level Encoder │
│   (Fine visual details) │   │  (High-level concepts)  │
└─────────────────────────┘   └─────────────────────────┘
              │                           │
              └─────────────┬─────────────┘
                            ▼
┌──────────────────────────────────────────────────────────┐
│              Mixture-of-Transformer-Experts              │
│                          (MoT)                           │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│            Next Group of Token Prediction                │
└──────────────────────────────────────────────────────────┘
```
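As a rough sketch of how the two encoder streams might feed the backbone together, the snippet below keeps per-patch detail from one projection and a pooled global summary from the other. The projections, pooling choice, and shapes are illustrative assumptions, not the published encoders:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode_image(image, pixel_proj, semantic_proj):
    """Illustrative dual-encoder split (assumed design, not the released code):
    one projection keeps per-patch detail; the other pools patches into a
    single high-level summary token. Both streams feed the backbone."""
    patches = image.reshape(-1, image.shape[-1])           # (n_patches, d_in)
    pixel_tokens = patches @ pixel_proj                    # fine-grained tokens
    semantic_token = patches.mean(axis=0) @ semantic_proj  # pooled concept token
    return np.vstack([pixel_tokens, semantic_token[None, :]])

d_in, d_model = 12, 8
image = rng.normal(size=(4, 4, d_in))                      # a 4x4 grid of "patches"
fused = encode_image(image,
                     rng.normal(size=(d_in, d_model)),
                     rng.normal(size=(d_in, d_model)))
```

The point of the sketch is the split of responsibilities: the pixel stream preserves edges and textures patch by patch, while the semantic stream compresses the whole image into object- and concept-level information.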

## Next Group of Token Prediction (NGTP)

The training paradigm follows Next Group of Token Prediction, where the model predicts the next group of language or visual tokens as a compression target.

### Advantages

- **Interleaved Generation:** naturally supports generating interleaved text-image reasoning traces
- **Efficient Compression:** groups of tokens provide better information density
- **Flexible Modality:** can predict either language tokens or visual tokens based on context
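A minimal sketch of how grouped targets differ from standard next-token prediction, assuming fixed-size groups for simplicity (in practice the group boundaries may be modality-dependent):

```python
def to_groups(token_ids, group_size):
    """Chunk a flat token sequence into fixed-size groups. Under NGTP-style
    training, the target at step t is the entire group t+1, not one token."""
    return [token_ids[i:i + group_size]
            for i in range(0, len(token_ids), group_size)]

seq = list(range(10))               # stand-in for mixed language/visual token ids
groups = to_groups(seq, 4)          # [[0,1,2,3], [4,5,6,7], [8,9]]

# Each training pair: all groups up to t as context, group t+1 as the target.
pairs = [(groups[:t + 1], groups[t + 1]) for t in range(len(groups) - 1)]
```

Because a whole group is predicted at once, a single step can emit a block of visual tokens (an intermediate image) or a block of language tokens (a thought), which is what makes interleaved traces natural to generate.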

### Token Groups

```
Input:  [Question] [Problem Image]
          │
          ▼
Step 1: [THOUGHT 1] ────────────────► Language tokens
          │
          ▼
Step 2: [REASONING IMAGE 1] ────────► Visual tokens
          │
          ▼
Step 3: [THOUGHT 2] ────────────────► Language tokens
          │
          ▼
Output: [FINAL ANSWER]
```
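One way to represent such an interleaved trace in code, for illustration only (the field names are hypothetical, not the dataset's actual schema):

```python
# Hypothetical record layout for one interleaved reasoning trace.
trace = [
    {"modality": "text",  "content": "[QUESTION]"},
    {"modality": "image", "content": "[PROBLEM IMAGE]"},
    {"modality": "text",  "content": "[THOUGHT 1]"},
    {"modality": "image", "content": "[REASONING IMAGE 1]"},
    {"modality": "text",  "content": "[THOUGHT 2]"},
    {"modality": "text",  "content": "[FINAL ANSWER]"},
]

def modality_pattern(trace):
    """Collapse a trace to its modality sequence for quick inspection."""
    return [step["modality"] for step in trace]
```

A representation like this makes the alternation explicit: each step is a group of tokens in one modality, matching the step-by-step diagram above.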

## Performance Results

### In-Distribution Test Accuracy

| Model | Before Fine-tuning | After Fine-tuning | Improvement |
|---|---|---|---|
| Anole-7B | 4.2% | 16.9% | +12.7% |
| Bagel-7B | — | High-quality interleaved chains | Qualitative |

### Benchmark Improvements

Fine-tuning on Zebra-CoT yields up to +13% performance gain on standard VLM benchmarks:

- Enhanced visual reasoning capabilities
- Improved chain-of-thought generation
- Better intermediate step visualization

### Category-Specific Performance

| Category | Description | Key Reasoning Skills |
|---|---|---|
| Scientific | Geometry, physics, algorithms | Diagram construction, step-by-step derivation |
| 2D Visual | Visual search, jigsaw puzzles | Pattern recognition, spatial arrangement |
| 3D Visual | Multi-hop inference, embodied planning | Depth perception, navigation |
| Logic/Games | Chess, visual logic | Strategic thinking, rule application |

## References