A newer version of this model is available: inclusionAI/Ming-flash-omni-2.0

Ming-flash-omni Preview

📑 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-flash-omni Preview is an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100B total parameters, of which only 6B are active per token. Compared to its predecessor, it shows substantial improvements across multimodal understanding and generation. Speech recognition is significantly advanced, with state-of-the-art performance in both contextual ASR and dialect-aware ASR. In image generation, Ming-flash-omni Preview introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. It also introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Across benchmarks in each modality, it delivers results that are highly competitive with industry-leading models.

📌 Updates

  • [2025.10.27] 🔥 We release the preview version of Ming-flash-omni: Ming-flash-omni Preview.
  • [2025.07.15] 🔥 We release Ming-lite-omni v1.5 with significant improvements across all modalities.
  • [2025.06.12] 🔥 Our Technical Report is publicly available on arXiv.
  • [2025.05.28] 🔥 The official version of Ming-lite-omni v1 is released, with better performance and image generation support.
  • [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

Compared to Ming-lite-omni v1.5, Ming-flash-omni Preview features key optimizations in the following 3 areas:

  • Sparse MoE Architecture for Omni-Modality: Ming-flash-omni Preview uses a 100B-A6B MoE backbone (an extension of Ling-Flash-2.0). To ensure uniform expert activation and stable training across all modalities, it employs a Dual-Balanced Routing Mechanism that combines an Auxiliary Load Balancing Loss with a Modality-Level Router Bias Update.
  • Generative Segmentation-as-Editing Paradigm: It unifies segmentation and editing into a semantics-preserving generation task, and achieves $0.90$ on GenEval, surpassing non-RL methods in fine-grained spatial control.
  • Context-Aware and Dialectal Speech Recognition: Ming-flash-omni Preview sets new state-of-the-art performance on all 12 ContextASR benchmarks and significantly improves recognition for 15 Chinese dialects.
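
To make the Dual-Balanced Routing idea above more concrete, here is a minimal pure-Python sketch. It is not the implementation used in Ming-flash-omni Preview: it assumes a Switch-Transformer-style auxiliary loss and a sign-based bias update kept separately per modality, and every name in it (`route_top_k`, `aux_load_balance_loss`, `update_router_bias`, the `step` size, the toy logits) is hypothetical.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, bias, k):
    """Select k experts by biased score; the bias affects selection only."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda e: logits[e] + bias[e], reverse=True)
    return ranked[:k], probs


def aux_load_balance_loss(all_probs, all_chosen, num_experts):
    """N * sum_e(f_e * P_e); equals 1.0 under perfectly uniform routing."""
    counts = Counter(e for chosen in all_chosen for e in chosen)
    total_assignments = sum(counts.values())
    loss = 0.0
    for e in range(num_experts):
        load_fraction = counts[e] / total_assignments               # f_e
        mean_prob = sum(p[e] for p in all_probs) / len(all_probs)   # P_e
        loss += load_fraction * mean_prob
    return num_experts * loss


def update_router_bias(bias, all_chosen, num_experts, step=0.01):
    """Raise the bias of under-loaded experts and lower it for over-loaded ones."""
    counts = Counter(e for chosen in all_chosen for e in chosen)
    mean_load = sum(counts.values()) / num_experts
    return [
        b + step * ((mean_load > counts[e]) - (mean_load < counts[e]))
        for e, b in enumerate(bias)
    ]


num_experts, k = 4, 2
# One bias vector per modality (an assumption about what "modality-level" means).
bias = {"text": [0.0] * num_experts, "audio": [0.0] * num_experts}
# Toy router logits for three audio tokens, all favoring experts 0 and 1.
audio_logits = [[2.0, 1.0, 0.1, 0.0], [1.5, 1.2, 0.0, 0.2], [2.2, 0.9, 0.3, 0.1]]

routed = [route_top_k(logits, bias["audio"], k) for logits in audio_logits]
chosen = [experts for experts, _ in routed]
probs = [p for _, p in routed]
loss = aux_load_balance_loss(probs, chosen, num_experts)
bias["audio"] = update_router_bias(bias["audio"], chosen, num_experts)
print(chosen, round(loss, 3), bias["audio"])
```

In this toy run all three audio tokens pick experts 0 and 1, so the loss sits above its uniform-routing value of 1.0 and the audio bias shifts toward the idle experts 2 and 3, while the text bias is left untouched.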

Use Cases

Streaming Video Conversation

Audio Context ASR & Dialect ASR

Audio Voice Clone

Image Generation & Editing

Model Downloads

You can download our latest model from both Hugging Face and ModelScope. For previous versions such as Ming-Lite-Omni v1.5, please refer to this link.

| Model | Input modality | Output modality | Download |
|---|---|---|---|
| Ming-flash-omni Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace, 🤖 ModelScope |

If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.
pip install modelscope
modelscope download --model inclusionAI/Ming-flash-omni-Preview --local_dir inclusionAI/Ming-flash-omni-Preview  --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.
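
The same download can also be scripted from Python. The sketch below assumes recent versions of the `modelscope` and `huggingface_hub` packages, both of which expose a `snapshot_download` function; the wrapper `download_model` and its arguments are hypothetical names introduced only for illustration.

```python
def download_model(source="modelscope",
                   repo_id="inclusionAI/Ming-flash-omni-Preview",
                   local_dir=None):
    """Download the model snapshot and return the local path (hypothetical helper)."""
    if source not in ("modelscope", "huggingface"):
        raise ValueError(f"unknown source: {source!r}")
    local_dir = local_dir or repo_id
    if source == "modelscope":
        # Assumes a modelscope release whose snapshot_download accepts local_dir.
        from modelscope import snapshot_download
        return snapshot_download(repo_id, revision="master", local_dir=local_dir)
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (starts a large download, so it is left commented out):
# model_path = download_model(source="modelscope")
```

The imports are deferred into the function so that only the package for the chosen source needs to be installed.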

Evaluation

Ming-flash-omni Preview shows competitive performance in vision-text understanding, image generation, audio understanding, and text-to-speech. For detailed evaluation results, please refer to our technical report.

Example Usage

We provide a simple example of how to use this repo. For detailed usage, please refer to cookbook.ipynb.

import os
import torch
import warnings
from bisect import bisect_left
warnings.filterwarnings("ignore")

from transformers import AutoProcessor
from modeling_bailingmm2 import BailingMM2NativeForConditionalGeneration

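def split_model():
    """Build a device map that spreads the 32 decoder layers across all visible GPUs."""
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 32
    layer_per_gpu = num_layers // world_size
    # Cumulative layer boundaries per GPU, e.g. [8, 16, 24, 32] for 4 GPUs.
    layer_per_gpu = [i * layer_per_gpu for i in range(1, world_size + 1)]
    for i in range(num_layers):
        # Clamp the index so leftover layers never map to a non-existent GPU
        # when the GPU count does not divide the layer count evenly.
        device_map[f'model.model.layers.{i}'] = min(bisect_left(layer_per_gpu, i), world_size - 1)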
    device_map['vision'] = 0
    device_map['audio'] = 0
    device_map['linear_proj'] = 0
    device_map['linear_proj_audio'] = 0
    device_map['model.model.word_embeddings.weight'] = 0
    device_map['model.model.norm.weight'] = 0
    device_map['model.lm_head.weight'] = 0
    device_map['model.model.norm'] = 0
    device_map[f'model.model.layers.{num_layers - 1}'] = 0
    return device_map

# Load pre-trained model with optimized settings, this will take ~10 minutes
model_path = "inclusionAI/Ming-flash-omni-Preview"
model = BailingMM2NativeForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=split_model(),
    load_image_gen=True,
    load_talker=True,
).to(dtype=torch.bfloat16)

# Initialize processor for handling multimodal inputs
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Inference Pipeline
def generate(messages, processor, model, sys_prompt_exp=None, use_cot_system_prompt=False, max_new_tokens=512):
    text = processor.apply_chat_template(
        messages, 
        sys_prompt_exp=sys_prompt_exp,
        use_cot_system_prompt=use_cot_system_prompt
    )
    image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        audios=audio_inputs,
        return_tensors="pt",
        audio_kwargs={"use_whisper_encoder": True},
    ).to(model.device)

    for k in inputs.keys():
        if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
            inputs[k] = inputs[k].to(dtype=torch.bfloat16)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            eos_token_id=processor.gen_terminator,
            num_logits_to_keep=1,
        )

    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return output_text

# qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}  # "Please describe the living habits of parrots in detail."
        ],
    },
]
output_text = generate(messages, processor=processor, model=model)
print(output_text)
# Output (translated from Chinese):

# Parrots are very intelligent and highly social birds, and their living habits are rich and interesting. Below is a detailed introduction to the living habits of parrots:
# ### 1. **Habitat**
# Parrots are mainly found in tropical and subtropical regions, including Africa, Asia, Australia, and South America. They usually live in forests, grasslands, deserts, and urban environments. Different species have different habitat requirements, but most parrots prefer places with abundant vegetation and water sources.
# ### 2. **Diet**
# Parrots are omnivores with a highly varied diet. Their food includes seeds, nuts, fruits, vegetables, nectar, and insects. Their beaks are very strong and can easily crack open hard shells and nuts. Some parrots also eat soil or sand to aid digestion and supplement minerals.
# ......
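
Before loading the weights, it can help to preview how `split_model` in the example above distributes layers for a given GPU count. The following sketch reproduces only its layer-assignment rule in plain Python, without touching CUDA; `preview_layer_placement` is a hypothetical helper, and it ignores the non-layer modules (vision, audio, projections, embeddings, norm, LM head) that the example also pins to GPU 0.

```python
from bisect import bisect_left
from collections import Counter


def preview_layer_placement(world_size, num_layers=32):
    """Return {layer_index: gpu_index} using the same rule as split_model."""
    per_gpu = num_layers // world_size
    boundaries = [i * per_gpu for i in range(1, world_size + 1)]
    placement = {i: bisect_left(boundaries, i) for i in range(num_layers)}
    placement[num_layers - 1] = 0  # the example pins the last layer to GPU 0
    return placement


for gpus in (1, 2, 4, 8):
    placement = preview_layer_placement(gpus)
    per_gpu_counts = Counter(placement.values())
    print(gpus, dict(sorted(per_gpu_counts.items())))
```

With four GPUs this yields 10, 8, 8, and 6 layers on GPUs 0 to 3. GPU 0 carries the most layers on top of the encoders and embeddings, so it is likely the first device to run out of memory.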

Citation

If you find our work helpful, please consider citing it.


@misc{Mingflash2025,
      title  = {Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation}, 
      author = {Inclusion AI},
      year = {2025},
      eprint = {2510.24821},
      archivePrefix = {arXiv},
      url = {https://arxiv.org/abs/2510.24821}
}


@misc{Mingomni2025,
      title  = {Ming-Omni: A Unified Multimodal Model for Perception and Generation}, 
      author = {Inclusion AI},
      year = {2025},
      eprint = {2506.09344},
      archivePrefix = {arXiv},
      url = {https://arxiv.org/abs/2506.09344}
}