Overview

LongCat-Image-Edit-MNN-int8 is an MNN diffusion resource package for LongCat-Image-Edit, intended for use with the MNN C++ diffusion_demo binary for both Text-to-Image (T2I) and Image Editing tasks.

Supported Modes:

  • 🎨 T2I Mode: Generate images from text prompts (no input image)
  • ✏️ Edit Mode: Edit existing images based on text instructions (with input image)

Note: LongCat uses model_type=3 for both modes. The mode is automatically determined by whether an input image is provided.

LongCat-Image-Edit is a state-of-the-art bilingual (Chinese-English) multimodal model that supports:

  • 🎨 Text-to-Image Generation: Create images from text descriptions
  • 🖌️ Global editing: Modify overall image style, atmosphere, or theme
  • ✏️ Local editing: Edit specific regions or objects
  • 📝 Text modification: Change text content in images
  • 🖼️ Reference-guided editing: Edit based on reference images
  • 🔄 Multi-turn editing: Perform sequential edits while preserving consistency

What's included

  • text_encoder/ directory containing:
    • visual.mnn / visual.mnn.weight (vision encoder for image understanding, int8 quantized)
    • llm.mnn / llm.mnn.weight (language model for text processing, int4 quantized)
    • embeddings_bf16.bin (text embeddings)
    • tokenizer.txt (tokenizer vocabulary)
    • config.json (text encoder configuration with model metadata and precision info)
  • unet.mnn / unet.mnn.weight (diffusion transformer, int8 quantized)
  • vae_encoder.mnn / vae_decoder.mnn (VAE for image encoding/decoding, fp16)
  • config.json: resource description (filenames, precision labels, default inference parameters, etc.)
  • configuration.json: generic task description
  • scheduler_config.json: FlowMatchEulerDiscreteScheduler config
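
Taken together, the expected on-disk layout (a sketch assembled from the list above):

```
LongCat-Image-Edit-MNN-int8/
├── text_encoder/
│   ├── visual.mnn
│   ├── visual.mnn.weight
│   ├── llm.mnn
│   ├── llm.mnn.weight
│   ├── embeddings_bf16.bin
│   ├── tokenizer.txt
│   └── config.json
├── unet.mnn
├── unet.mnn.weight
├── vae_encoder.mnn
├── vae_decoder.mnn
├── config.json
├── configuration.json
└── scheduler_config.json
```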

The directory name LongCat-Image-Edit-MNN-int8 denotes the primary quantization level. The actual bit-widths are:

  • Visual encoder: int8
  • LLM: int4
  • Transformer (UNet): int8
  • VAE: fp16

Bit-width is set via MNNConvert --weightQuantBits. Treat config.json as the source of truth.

How to export / generate

  1. Export ONNX from PyTorch model
    • visual.onnx, llm.onnx, unet.onnx, vae_encoder.onnx, vae_decoder.onnx
    • IO names expected by the MNN LongCat-Image-Edit pipeline:
      • visual: inputs pixel_values; output image_embeds
      • llm: inputs input_ids, attention_mask, image_embeds; output last_hidden_state
      • unet: inputs sample, timestep (float), encoder_hidden_states, image_latents; output out_sample
      • vae_encoder: input sample; output latent_sample
      • vae_decoder: input latent_sample; output sample
  2. Convert ONNX → MNN:
    • Note: For the text_encoder parts (visual + llm), use the specialized export script from the custom MNN fork so the required MNN structures are produced correctly; the remaining models can be converted with MNNConvert:
    
    MNNConvert -f ONNX --modelFile unet.onnx        --MNNModel unet.mnn        --weightQuantBits 8
    MNNConvert -f ONNX --modelFile vae_encoder.onnx --MNNModel vae_encoder.mnn --weightQuantBits 16
    MNNConvert -f ONNX --modelFile vae_decoder.onnx --MNNModel vae_decoder.mnn --weightQuantBits 16
    
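After conversion, it can help to verify that each expected artifact exists and is non-empty before packaging. A minimal sketch (check_artifacts is a hypothetical helper, not part of MNN):

```shell
# Hypothetical post-conversion check: report any listed artifact that is
# missing or zero-length.
check_artifacts() {
  local status=0
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing or empty: $f"; status=1; }
  done
  return $status
}
check_artifacts unet.mnn unet.mnn.weight vae_encoder.mnn vae_decoder.mnn || true
```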

How to run

Windows (PowerShell / cmd) — longcat-image-edit.bat

The script automatically detects whether to use T2I mode or Edit mode based on the arguments:

T2I Mode (Text-to-Image):

# Generate single image
.\longcat-image-edit.bat "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise."

# Generate multiple images (batch)
.\longcat-image-edit.bat "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise." 10

Edit Mode (Image Editing):

# Edit single image
.\longcat-image-edit.bat "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg"

# Edit with multiple variations (batch)
.\longcat-image-edit.bat "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg" 10

Auto-detection logic:

  • If 2nd argument is an existing file → Edit mode (input image provided)
  • If 2nd argument is a number → T2I mode with batch count (no input image)
  • If 2nd argument is empty → T2I mode, single image (no input image)
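
The decision above can be sketched as a small shell function (an illustration of the described behavior, not the literal script source; detect_mode is a hypothetical name):

```shell
# Sketch of the wrapper scripts' auto-detection of the 2nd argument.
detect_mode() {
  local second="$1"
  if [ -f "$second" ]; then
    echo "edit"            # existing file -> Edit mode
  elif [ -n "$second" ] && [ "$second" -eq "$second" ] 2>/dev/null; then
    echo "t2i-batch"       # number -> T2I mode with batch count
  else
    echo "t2i"             # empty -> T2I mode, single image
  fi
}
```

For example, when input.jpg exists on disk, `detect_mode input.jpg` prints `edit`.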

Arguments passed to diffusion_demo.exe:
resource_path model_type memory_mode backend_type iteration_num random_seed output_image_name [image_size] [cfg_scale] [cfg_mode] [gpu_mem_mode] [precision_mode] [te_on_cpu] [vae_on_cpu] <prompt_text> [input_image]
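
For illustration, a hypothetical fully expanded Edit-mode invocation (example values chosen to match the defaults listed below; the wrapper scripts normally assemble this for you):

```shell
# Illustrative expansion of the positional argument list above:
# resource_path=. model_type=3 memory_mode=2 backend=3 steps=20 seed=0
# output=output.png size=1024 cfg=4.5 cfg_mode=0 gpu_mem_mode=0 precision=0
# te_on_cpu=1 vae_on_cpu=0, then the prompt and the (hypothetical) input image.
ARGS=(. 3 2 3 20 0 output.png 1024 4.5 0 0 0 1 0 \
      "change the sweater to red" input.jpg)
echo "diffusion_demo ${ARGS[*]}"
```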

Script parameters (modifiable at top of longcat-image-edit.bat):

  • MODEL_DIR (default .): model directory (relative or absolute)
  • MEMORY_TYPE (0/1/2): Diffusion memory mode (0=low-memory, 1=memory-sufficient, 2=balanced)
  • BACKEND (0=cpu, 3=opencl, 7=vulkan)
  • STEPS: diffusion steps (20–50 recommended; default 20)
  • SEED: 0=auto random; non-0=fixed seed
  • SIZE: 512/640/768/896/1024 (default 1024)
  • CFG: classifier-free guidance scale
    • T2I mode: 3.0–8.0 (default 7.5)
    • Edit mode: 3.0–6.0 (default 4.5, auto-adjusted)
  • CFG_MODE: CFG sigma range for dual-UNet models (LongCat only)
    • 0=auto(0.1~0.8), 1=wide(0.1~0.9), 2=standard(0.1~0.8)
    • 3=medium(0.15~0.7), 4=narrow(0.2~0.6), 5=minimal(0.25~0.5)
  • GPU_MEM_MODE (OpenCL only): 0=auto, 1=buffer, 2=image
  • PRECISION (0=auto, 1=FP16, 2=FP32 normal, 3=FP32 high)
  • TE_ON_CPU (0 same as UNet, 1 forces text_encoder on CPU; default 1)
  • VAE_ON_CPU (0 same as UNet, 1 forces VAE on CPU for GPU memory saving; default 0)
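
Example overrides near the top of the script (variable names assumed to match the parameter list above; check the script header to confirm):

```shell
# Hypothetical override values for the script parameters listed above.
MODEL_DIR=.
BACKEND=0        # run on CPU instead of a GPU backend
STEPS=30         # more steps for higher quality, at the cost of speed
SEED=42          # fixed seed so runs are reproducible
SIZE=768         # smaller canvas to reduce memory use
```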

Linux/macOS — longcat-image-edit.sh

Same auto-detection logic as Windows:

T2I Mode:

# Generate single image
./longcat-image-edit.sh "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise."

# Generate multiple images (batch)
./longcat-image-edit.sh "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise." 10

Edit Mode:

# Edit single image
./longcat-image-edit.sh "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg"

# Edit with multiple variations (batch)
./longcat-image-edit.sh "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg" 10

Script parameters mirror the Windows script:

  • MODEL_DIR
  • MEMORY_TYPE
  • BACKEND (default 3=OpenCL on Linux/macOS)
  • STEPS (default 20)
  • SEED
  • SIZE (default 1024)
  • CFG (T2I: 7.5, Edit: 4.5, auto-adjusted)
  • CFG_MODE (default 0=auto)
  • GPU_MEM_MODE
  • PRECISION
  • TE_ON_CPU (default 1)
  • VAE_ON_CPU (default 0)

Building MNN for Windows

For detailed Windows build instructions, see BUILD_WINDOWS.md.

Quick summary:

cd MNN\build
cmake .. -DCMAKE_BUILD_TYPE=Release -DMNN_USE_SSE=ON -DMNN_AVX2=ON -DMNN_AVX512=ON -DMNN_AVX512_VNNI=ON -DMNN_OPENCL=ON -DMNN_VULKAN=ON -DMNN_VULKAN_IMAGE=ON -DMNN_SUPPORT_TRANSFORMER_FUSE=ON -DMNN_BUILD_CONVERTER=ON -DMNN_OPENMP=ON -DMNN_BUILD_DIFFUSION=ON -DMNN_BUILD_LLM=ON -DMNN_BUILD_OPENCV=ON -DMNN_LOW_MEMORY=ON -DMNN_USE_THREAD_POOL=ON
cmake --build . --config Release --target diffusion_demo -j 8

Then copy diffusion_demo.exe and MNN.dll from MNN\build\Release\ to the bin\ directory.

Model Type

  • model_type=3: LongCat-Image-Edit (supports both T2I and image editing)
    • T2I mode: Activated when no input image is provided
    • Edit mode: Activated when input image is provided as the last parameter

The scripts use model_type=3 for all operations. The diffusion_demo binary automatically detects whether to perform T2I or image editing based on the presence of the input_image parameter.

Notes

  • Weight files (.mnn.weight) are large; use Git LFS/external storage when sharing.
  • MNN_BUILD_DIFFUSION=ON is required for diffusion model support
  • MNN_BUILD_LLM=ON is required for LongCat's multimodal text encoder (vision + LLM)
  • MNN_BUILD_OPENCV=ON is recommended for better image I/O support
  • Custom MNN Fork: This model requires the specialized MNN fork with LongCat support. Use: https://github.com/er6y/MNN
  • The same model weights support both T2I and Edit modes; no separate downloads are needed

Model Information

Key Features

  • 🌟 Dual-Mode Support: Both T2I generation and image editing in one model
  • 🌟 Auto-Detection: Scripts automatically detect mode based on input parameters
  • 🌟 Superior Precise Editing: Supports global editing, local editing, text modification, and reference-guided editing
  • 🌟 Consistency Preservation: Strong consistency preservation in non-edited regions during multi-turn editing
  • 🌟 Bilingual Support: Native Chinese and English instruction following
  • 🌟 State-of-the-art Performance: Leading performance among open-source image editing models
  • 🌟 Efficient Inference: Optimized for fast inference with MNN quantization
  • 🌟 Advanced CFG Control: Dual-UNet architecture with configurable CFG sigma ranges