NEO-unify: Building Native Multimodal Unified Models End to End

Community Article Published March 5, 2026

Existing Multimodal AI Dilemma

For years, multimodal AI has typically adopted a vision encoder (VE) to perceive and a variational autoencoder (VAE) to generate. Recent efforts seek to unify both with a shared tokenizer, but often with trade-offs. We return to first principles: building a model that directly engages with native inputs, pixels and words.

Today, SenseTime, in collaboration with NTU, introduces a native, unified, end-to-end paradigm dubbed NEO-unify (preview), stepping beyond representation arguments and breaking free from pre-trained priors and scaling-law bottlenecks. No VE! No VAE!

NEO-unify: End-to-End Native Unified Model Paradigm

[Figure: NEO-unify overview]

NEO-unify is the first step toward truly end-to-end unified models, learning directly from near-lossless inputs via a representation space shaped by the model itself. NEO-unify incorporates:

  • a near-lossless visual interface for both input and output,
  • a native Mixture-of-Transformers (MoT) backbone that synergizes understanding and generation,
  • unified learning: autoregressive cross-entropy for text and pixel flow matching for vision.
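As a minimal sketch of such a unified objective (the function names, tensor shapes, and the rectified-flow form of flow matching are our assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def text_ce_loss(logits, targets):
    # Autoregressive objective: cross-entropy of next-token logits vs. targets.
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def flow_matching_loss(velocity_fn, x1, t, rng):
    # Pixel objective: regress the predicted velocity at the interpolant
    # x_t = (1 - t) * noise + t * image onto the straight-line target x1 - x0.
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    v_pred = velocity_fn(xt, t)
    return np.mean((v_pred - (x1 - x0)) ** 2)

def unified_loss(logits, targets, velocity_fn, images, t, rng, gen_weight=1.0):
    # One scalar loss over both streams; gen_weight balances the two terms.
    return text_ce_loss(logits, targets) + gen_weight * flow_matching_loss(
        velocity_fn, images, t, rng)
```

Both streams reduce to a single scalar, so one optimizer step updates text and pixel pathways jointly.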

Model Performance

1. Quantitative Results

[Figures: quantitative benchmark results]

2. Qualitative Results

[Figures: qualitative results]

Key Findings

1. Encoder-Free Design Preserves Both Semantic and Pixel Representations

[Image Reconstruction]

Our earlier work NEO (Diao et al., ICLR 2026) shows that a native end-to-end model can learn rich semantic representations. Here we make a surprising discovery: a separate generative pathway can recover fine-grained visual details even from a frozen understanding (und) pathway.

NEO-unify (2B) reaches 31.56 PSNR and 0.85 SSIM on MS COCO 2017 after only 90K initial pretraining steps, compared with 32.65 and 0.91 for the Flux VAE. This indicates that near-lossless inputs can support both semantic understanding and pixel-level fidelity well, without pre-trained encoders.
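For reference, PSNR and a simplified SSIM can be computed as below. Note the SSIM here is a single-window (global) variant for illustration; standard implementations average over local 11×11 Gaussian windows, so values will differ slightly.

```python
import numpy as np

def psnr(ref, recon, data_range=1.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the reference.
    mse = np.mean((ref - recon) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim_global(x, y, data_range=1.0):
    # SSIM computed over the whole image as one window (a simplification of
    # the windowed standard form; same constants c1, c2).
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den
```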

  • Reconstructing Out-domain Images (2B NEO-unify with frozen understanding branch):

[Figure: out-of-domain image reconstructions]

[Image Editing]

Building on this insight, we explore further: NEO-unify routes all conditioning context through the understanding pathway, while the generative pathway produces the new images directly.

Even with a frozen understanding branch, NEO-unify (2B) still shows strong editing capabilities while substantially improving token efficiency. Using public T2I and editing datasets, it attains a 3.32 score on ImgEdit after 60K steps of initial mixed training, with the understanding branch kept frozen.
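The editing flow described above can be sketched roughly as follows. The branch interfaces and the Euler-integrated rectified flow are our assumptions for illustration, not the released design:

```python
import numpy as np

def edit_image(source, instruction, und_branch, gen_branch, rng, steps=4):
    # The frozen understanding branch encodes the full condition context
    # (source pixels + edit instruction); it receives no gradient updates.
    context = und_branch(source, instruction)
    # The generation branch integrates its velocity field from pure noise
    # toward the edited image with simple Euler steps over t in [0, 1).
    x = rng.standard_normal(source.shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * gen_branch(x, t, context)
    return x
```

One useful sanity check: with an ideal straight-line velocity field, these Euler steps land exactly on the target image, since each step stays on the noise-to-image line.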

  • Fitting Small In-domain Data (2B NEO-unify with frozen understanding branch):

[Figure: small in-domain editing fits]

  • Validating ImgEdit Prompts (2B NEO-unify with frozen understanding branch):

[Figure: ImgEdit prompt validation samples]

2. Encoder-Free Design and the MoT Backbone Synergize with Minimal Intrinsic Conflict

With the pre-trained understanding and generation branches, we jointly train both in NEO-unify on the same mid-training and supervised fine-tuning data sources. Even with low data ratios and loss weights, understanding remains stable and generation converges faster. The two branches co-evolve within the MoT backbone with minimal conflict.
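A toy version of such a joint schedule, where the generation ratio and loss weight values are purely illustrative, not the paper's settings:

```python
import numpy as np

def mixed_step(rng, und_loss_fn, gen_loss_fn, gen_ratio=0.2, gen_weight=0.5):
    # Each step draws a batch type by the data ratio; generation batches are
    # further down-weighted so understanding stays stable during co-training.
    if rng.random() < gen_ratio:
        return "gen", gen_weight * gen_loss_fn()
    return "und", und_loss_fn()

def gen_fraction(n_steps, seed=0):
    # Empirical fraction of generation batches under the schedule above.
    rng = np.random.default_rng(seed)
    kinds = [mixed_step(rng, lambda: 1.0, lambda: 1.0)[0] for _ in range(n_steps)]
    return kinds.count("gen") / n_steps
```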

[Figure: understanding/generation conflict analysis]

3. Encoder-Free Design Shows High Data-Scaling Efficiency

We begin with web-scale pretraining, followed by mid-training (MT) and supervised fine-tuning (SFT) stages using diverse, high-quality data corpora. NEO-unify shows substantially better data-scaling efficiency than its counterpart Bagel, achieving higher performance with fewer training tokens.

[Figure: data-scaling efficiency vs. Bagel]

Outlook

This is not just a model; it is a step toward:

  • Interleaved perception–generation loops
  • Omni-modal reasoning
  • Vision-centric intelligence
  • Spatial intelligence
  • World model
  • ...

A roadmap where models do not translate between modalities but think across them natively. Multimodal AI is no longer about connecting systems; it is about building one that was never divided, and trusting the necessary capabilities to emerge from within.

We are scaling our efforts. Stay tuned: we will show more and open our models, hopefully before long.
