
# Gemmagain Multimodal

A Gemma3 multimodal model with layer-looping support in the text decoder. Layer looping runs the same physical text decoder layers multiple times in sequence, enabling parameter-efficient deep networks while leaving the vision tower unchanged.
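As a rough illustration, the idea can be sketched with a toy decoder (an illustrative toy, not the actual `GemmagainTextModel` implementation): the model walks an expanded execution order that may visit the same physical layer more than once, so depth grows without adding parameters.

```python
import torch
import torch.nn as nn

class ToyLoopedDecoder(nn.Module):
    """Toy sketch of layer looping: weights are shared across repeated visits."""

    def __init__(self, num_layers=4, hidden=8, execution_order=None):
        super().__init__()
        # Physical layers: allocated exactly once.
        self.layers = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_layers)]
        )
        # Execution order: physical layer indices, possibly with repeats.
        self.execution_order = execution_order or list(range(num_layers))

    def forward(self, x):
        for idx in self.execution_order:
            # Repeated indices reuse the same weights on a new hidden state.
            x = torch.tanh(self.layers[idx](x))
        return x

# 4 physical layers, 6 effective layers (layers 1-2 run twice).
model = ToyLoopedDecoder(execution_order=[0, 1, 2, 1, 2, 3])
out = model(torch.randn(1, 8))
```

The parameter count stays that of 4 layers even though 6 layer applications happen per forward pass.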

## Features

- **Layer looping for the text decoder only** - the vision tower (`SiglipVisionModel`) is unchanged
- **100% weight compatible** with `unsloth/gemma-3-4b-pt` and other Gemma3 multimodal checkpoints
- **Generation with KV caching** - cache slots are properly allocated for looped layers
- **Flexible layer sequence format** - specify which layers to loop and how many times
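The KV-caching point deserves a word: two passes over the same physical layer see different hidden states, so they cannot share a cache slot. A hedged sketch of the bookkeeping (an assumption for illustration; the repo's exact implementation may differ):

```python
# Each step in the expanded execution order gets its own KV-cache slot,
# even when it reuses the weights of a physical layer that already ran.
execution_order = [0, 1, 2, 1, 2, 3]                  # physical layer per step
cache_slot = {step: step for step in range(len(execution_order))}

# Step 3 reruns physical layer 1, but reads/writes cache slot 3, not the
# slot written by step 1 -- the two passes attend over different states.
assert execution_order[3] == execution_order[1]
assert cache_slot[3] != cache_slot[1]
```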

## Usage

```python
import torch
from transformers import AutoConfig, Gemma3ForConditionalGeneration

# Load config with layer looping
config = AutoConfig.from_pretrained('rpDungeon/gemmagain-mm', trust_remote_code=True)

# Configure layer looping: layers 0-9 once, layers 10-27 twice, layers 28-33 once
config.text_config.layer_sequence = [[0, 10], [10, 28, 2], [28, 34]]

# Import and create model
from modeling_gemmagain import GemmagainForConditionalGeneration

model = GemmagainForConditionalGeneration(config)

# Load weights from any Gemma3 multimodal checkpoint
orig = Gemma3ForConditionalGeneration.from_pretrained(
    'unsloth/gemma-3-4b-pt',
    torch_dtype=torch.bfloat16,
)
model.load_state_dict(orig.state_dict())
del orig

model = model.to(dtype=torch.bfloat16, device='cuda')
```

## Layer Sequence Format

The `layer_sequence` config option accepts a flexible entry format:

| Format         | Example       | Meaning                        |
|----------------|---------------|--------------------------------|
| Integer        | `5`           | Single layer 5                 |
| 2-element list | `[4, 20]`     | Layers 4-19 (end exclusive)    |
| 3-element list | `[10, 28, 2]` | Layers 10-27, repeated 2 times |

Example configurations:

```python
# Default: all 34 layers once
config.text_config.layer_sequence = [[0, 34, 1]]

# Loopstral-style: loop middle layers twice
# Physical: 34 layers, Effective: 52 layers
config.text_config.layer_sequence = [[0, 10], [10, 28, 2], [28, 34]]

# Loop all layers twice (2x depth, same params)
config.text_config.layer_sequence = [[0, 34, 2]]
```
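To see how a spec turns into an execution order, here is a hypothetical helper (not part of this repo) that expands all three entry forms from the table above into a flat list of physical layer indices:

```python
def expand_layer_sequence(spec):
    """Expand a layer_sequence spec into the flat execution order."""
    order = []
    for entry in spec:
        if isinstance(entry, int):             # single layer
            order.append(entry)
        elif len(entry) == 2:                  # [start, end): run once
            order.extend(range(entry[0], entry[1]))
        else:                                  # [start, end, repeats]
            start, end, repeats = entry
            for _ in range(repeats):
                order.extend(range(start, end))
    return order

order = expand_layer_sequence([[0, 10], [10, 28, 2], [28, 34]])
print(len(order))  # 10 + 2*18 + 6 = 52 effective layers from 34 physical ones
```

This makes the "Physical: 34 layers, Effective: 52 layers" arithmetic for the Loopstral-style config explicit.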

## Architecture

```text
GemmagainForConditionalGeneration
β”œβ”€β”€ model (GemmagainModel)
β”‚   β”œβ”€β”€ vision_tower (SiglipVisionModel)      # Unchanged from Gemma3
β”‚   β”œβ”€β”€ multi_modal_projector                 # Unchanged from Gemma3
β”‚   └── language_model (GemmagainTextModel)   # Layer looping support
β”‚       β”œβ”€β”€ embed_tokens
β”‚       β”œβ”€β”€ layers[0..33]                     # Physical layers
β”‚       β”œβ”€β”€ _layer_sequence                   # Execution order with loops
β”‚       └── norm
└── lm_head
```

## License

Apache 2.0 (same as Gemma3)
