Ming-flash-omni 2.0

📑 Technical Report｜🤗 Hugging Face｜ 🤖 ModelScope

Introduction

The newly released Ming-flash-omni 2.0 leverages the Ling-2.0 architecture—a Mixture-of-Experts (MoE) framework comprising 100B total and 6B active parameters. Representing a generational advancement over its predecessor, it establishes new State-of-the-Art (SOTA) benchmarks among open-source omni-MLLMs. Ming-flash-omni 2.0 effectively synergizes foundational abilities with specialized domain expertise. In particular, it exhibits superior performance in visual encyclopedic knowledge, immersive speech synthesis, and high-dynamic image generation and manipulation.

📌 Updates

[2026.02.11] 🔥 We release the official version of Ming-flash-omni 2.0, an open-source SOTA omni-MLLM that pushes the boundaries of multimodal understanding and synthesis.
[2025.10.27] 🔥 We release the preview version of Ming-flash-omni：Ming-flash-omni Preview.
[2025.07.15] 🔥 We release Ming-lite-omni v1.5 with significant improvements across all modalities.
[2025.06.12] 🔥 Our Technical Report is in public on arxiv.
[2025.05.28] 🔥 The official version of Ming-lite-omni v1 is released, with better performance and image generation support.
[2025.05.04] 🔥 We release the test version of Ming-lite-omni：Ming-lite-omni-Preview.

Key Features

Compared to Ming-flash-omni Preview, Ming-flash-omni 2.0 focuses on optimizing capabilities across the following key domains:

Expert-level Multimodal Cognition: It accurately identifies plants and animals, recognizing cultural references (from regional cuisines to global landmarks), and delivering expert-level analysis of artifacts, including era, form, and craftsmanship. By synergizing high-resolution visual capture with a vast knowledge graph, the model achieves "vision-to-knowledge" synthesis, enabling superior knowledge understanding.
Immersive and Controllable Unified Acoustic Synthesis: Ming-flash-omni 2.0 introduces a unified end-to-end acoustic generation pipeline that integrates Speech, Audio, and Music within a single channel. Leveraging Continuous Autoregression coupled with a Diffusion Transformer (DiT) head, the model enables zero-shot voice cloning and nuanced attribute control (e.g., emotion, timbre, and ambient atmosphere). This architecture facilitates a transition from simple text-to-speech to highly expressive, emotionally resonant, and immersive auditory experiences.
High-Dynamic Controllable Image Generation and Manipulation: Ming-flash-omni 2.0 features a native multi-task architecture that unifies segmentation, generation, and editing, allowing for sophisticated spatiotemporal semantic decoupling. It excels in high-dynamic content creation, including atmospheric reconstruction, seamless scene composition, and context-aware object removal. By maintaining texture coherence and spatial depth consistency, Ming-flash-omni 2.0 achieves state-of-the-art precision in complex image manipulation tasks.

Use Cases

Enhanced Multimodal Cognition & Free Modality Switching

Streaming Video Conversation

Controllable Audio Generation

Image Generation & Editing

Model Downloads

You can download our latest model from both Huggingface and ModelScope. For previous version model like Ming-flash-omni-Preview, Please refer to this link.

Model	Input modality	Oput modality	Download
Ming-flash-omni 2.0	Image,text,video,audio	Image,text,audio	🤗 HuggingFace 🤖 ModelScope

If you're in mainland China, we strongly recommend you to download our model from 🤖 ModelScope.

pip install modelscope
modelscope download --model inclusionAI/Ming-flash-omni-2.0 --local_dir inclusionAI/Ming-flash-omni-2.0  --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.

Environment Preparation

Installation with pip

pip install -r requirements.txt
pip install nvidia-cublas-cu12==12.4.5.8  # for H20 GPU

Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

git clone https://github.com/inclusionAI/Ming.git 
cd Ming

Step 2 - Download the model weights and create a soft link to the source code directory

Download our model following Model Downloads

mkdir inclusionAI 
ln -s /path/to/inclusionAI/Ming-flash-omni-2.0 inclusionAI/Ming-flash-omni-2.0

Step 3 - Enter the code directory, you can refer to the following codes to run the Ming-flash-omni model.

jupyter notebook cookbook.ipynb

We also provide a simple example on the usage of this repo. For detailed usage, please refer to cookbook.ipynb.

import os
import torch
import warnings
from bisect import bisect_left
warnings.filterwarnings("ignore")

from transformers import AutoProcessor
from modeling_bailingmm2 import BailingMM2NativeForConditionalGeneration

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 32
    layer_per_gpu = num_layers // world_size
    layer_per_gpu = [i * layer_per_gpu for i in range(1, world_size + 1)]
    for i in range(num_layers):
        device_map[f'model.model.layers.{i}'] = bisect_left(layer_per_gpu, i)
    device_map['vision'] = 0
    device_map['audio'] = 0
    device_map['linear_proj'] = 0
    device_map['linear_proj_audio'] = 0
    device_map['model.model.word_embeddings.weight'] = 0
    device_map['model.model.norm.weight'] = 0
    device_map['model.lm_head.weight'] = 0
    device_map['model.model.norm'] = 0
    device_map[f'model.model.layers.{num_layers - 1}'] = 0
    return device_map

# Load pre-trained model with optimized settings, this will take ~10 minutes
model_path = "inclusionAI/Ming-flash-omni-2.0"
model = BailingMM2NativeForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=split_model(),
    load_image_gen=True,
    load_talker=True,
).to(dtype=torch.bfloat16)

# Initialize processor for handling multimodal inputs
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Inference Pipeline
def generate(messages, processor, model, sys_prompt_exp=None, use_cot_system_prompt=False, max_new_tokens=512):
    text = processor.apply_chat_template(
        messages, 
        sys_prompt_exp=sys_prompt_exp,
        use_cot_system_prompt=use_cot_system_prompt
    )
    image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        audios=audio_inputs,
        return_tensors="pt",
        audio_kwargs={"use_whisper_encoder": True},
    ).to(model.device)

    for k in inputs.keys():
        if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
            inputs[k] = inputs[k].to(dtype=torch.bfloat16)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            eos_token_id=processor.gen_terminator,
            num_logits_to_keep=1,
        )

    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return output_text

# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
output_text = generate(messages, processor=processor, model=model)
print(output_text)
# Output:

# 鹦鹉是一种非常受欢迎的宠物鸟类，它们以其鲜艳的羽毛、聪明的头脑和模仿人类语言的能力而闻名。鹦鹉的生活习性非常丰富，以下是一些主要的习性：

# 1. **社交性**：鹦鹉是高度社交的鸟类，它们在野外通常生活在群体中，与同伴互动、玩耍和寻找食物。在家庭环境中，鹦鹉需要与人类或其他鹦鹉进行定期的互动，以保持其心理健康。

# 2. **智力**：鹦鹉拥有非常高的智力，它们能够学习各种技能，包括模仿人类语言、识别物体、解决问题等。这种智力使它们成为非常有趣的宠物。

# ......

Citation

If you find our work helpful, feel free to give us a cite.


@misc{Mingomni2025,
      title  = {Ming-Omni: A Unified Multimodal Model for Perception and Generation}, 
      author = {Inclusion AI},
      year = {2025},
      eprint = {2506.09344},
      archivePrefix = {arXiv},
      url = {https://arxiv.org/abs/2506.09344}
}

@article{ai2025ming,
  title={Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation},
  author={Inclusion AI},
  journal={arXiv preprint arXiv:2510.24821},
  year={2025}
}