arxiv:2502.15130

TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

Published on Oct 9, 2025

Authors:

Abstract

TransMamba enables efficient training of Mamba-based models through cross-architecture knowledge transfer from Transformers, using selective weight subcloning and adaptive distillation techniques.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models such as LLaVA, CLIP, and DeiT, etc. In parallel, emerging sub-quadratic architectures like Mamba offer promising efficiency gains by enabling global context modeling with linear complexity. However, training these architectures from scratch remains resource-intensive (e.g., in terms of data and time). Motivated by this challenge, we explore a cross-architecture knowledge transfer paradigm, termed TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks. The first stage leverages pre-trained Transformer models to initialize critical components of the Mamba architecture. To bridge architectural and dimensional gaps, we develop a selective weight subcloning strategy and a layered initialization scheme that prioritizes the early n layers. Building on this initialization, the second stage introduces an adaptive multi-directional knowledge distillation method. This mechanism employs layer-wise adaptive scaling factors to align Mamba representations with their Transformer counterparts, while accommodating the scanning order variations inherent to multi-modal Mamba architectures. Despite operating with a reduced training dataset and a more compact model architecture, TransMamba consistently outperforms baseline approaches across diverse mamba-based backbones (e.g., PlainMamba, Vmamba, ViM and VideoMamba) and downstream tasks (e.g., image classification, visual question answering, text-video retrieval and multimodal reasoning). All code and implementation details will be released.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2502.15130

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.15130 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.15130 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.15130 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.