Overview
LongCat-Image-Edit-MNN-int8 is an MNN diffusion resource package for LongCat-Image-Edit, intended to be used with the MNN C++ diffusion_demo binary for both Text-to-Image (T2I) and Image Editing tasks.
Supported Modes:
- 🎨 T2I Mode: Generate images from text prompts (no input image)
- ✏️ Edit Mode: Edit existing images based on text instructions (with input image)
Note: LongCat uses model_type=3 for both modes. The mode is automatically determined by whether an input image is provided.
LongCat-Image-Edit is a state-of-the-art bilingual (Chinese-English) multimodal model that supports:
- 🎨 Text-to-Image Generation: Create images from text descriptions
- 🖌️ Global editing: Modify overall image style, atmosphere, or theme
- ✏️ Local editing: Edit specific regions or objects
- 📝 Text modification: Change text content in images
- 🖼️ Reference-guided editing: Edit based on reference images
- 🔄 Multi-turn editing: Perform sequential edits while preserving consistency
What's included
- text_encoder/ directory containing:
  - visual.mnn / visual.mnn.weight (vision encoder for image understanding, int8 quantized)
  - llm.mnn / llm.mnn.weight (language model for text processing, int4 quantized)
  - embeddings_bf16.bin (text embeddings)
  - tokenizer.txt (tokenizer vocabulary)
  - config.json (text encoder configuration with model metadata and precision info)
- unet.mnn / unet.mnn.weight (diffusion transformer, int8 quantized)
- vae_encoder.mnn / vae_decoder.mnn (VAE for image encoding/decoding, fp16)
- config.json: resource description (filenames, precision labels, default inference parameters, etc.)
- configuration.json: generic task description
- scheduler_config.json: FlowMatchEulerDiscreteScheduler config
The directory name LongCat-Image-Edit-MNN-int8 denotes the primary quantization level. The actual bit-widths are:
- Visual encoder: int8
- LLM: int4
- Transformer (UNet): int8
- VAE: fp16
Bit-width is set via MNNConvert --weightQuantBits. Treat config.json as the source of truth.
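As a sanity check before running, the package layout listed above can be verified with a small script. This is a minimal sketch; `missing_files` is a hypothetical helper, and the file list is taken directly from this README.

```python
from pathlib import Path

# Expected package layout, taken from the file list in this README.
EXPECTED_FILES = [
    "text_encoder/visual.mnn", "text_encoder/visual.mnn.weight",
    "text_encoder/llm.mnn", "text_encoder/llm.mnn.weight",
    "text_encoder/embeddings_bf16.bin", "text_encoder/tokenizer.txt",
    "text_encoder/config.json",
    "unet.mnn", "unet.mnn.weight",
    "vae_encoder.mnn", "vae_decoder.mnn",
    "config.json", "configuration.json", "scheduler_config.json",
]

def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]
```

Running `missing_files(".")` inside the model directory should return an empty list for a complete download.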
How to export / generate
- Export ONNX from the PyTorch model: visual.onnx, llm.onnx, unet.onnx, vae_encoder.onnx, vae_decoder.onnx
- IO names expected by the MNN LongCat-Image-Edit pipeline:
  - visual: inputs pixel_values; output image_embeds
  - llm: inputs input_ids, attention_mask, image_embeds; output last_hidden_state
  - unet: inputs sample, timestep (float), encoder_hidden_states, image_latents; output out_sample
  - vae_encoder: input sample; output latent_sample
  - vae_decoder: input latent_sample; output sample
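The IO contract above can be captured as data for a quick pre-conversion check. A minimal sketch: `verify_io` is a hypothetical helper, and you would pass it the tensor names reported by your ONNX tooling.

```python
# Expected input/output tensor names per exported ONNX model,
# copied from the IO list in this README.
EXPECTED_IO = {
    "visual":      {"inputs": ["pixel_values"], "outputs": ["image_embeds"]},
    "llm":         {"inputs": ["input_ids", "attention_mask", "image_embeds"],
                    "outputs": ["last_hidden_state"]},
    "unet":        {"inputs": ["sample", "timestep", "encoder_hidden_states", "image_latents"],
                    "outputs": ["out_sample"]},
    "vae_encoder": {"inputs": ["sample"], "outputs": ["latent_sample"]},
    "vae_decoder": {"inputs": ["latent_sample"], "outputs": ["sample"]},
}

def verify_io(model_name, input_names, output_names):
    """True if the reported tensor names match the expected contract."""
    exp = EXPECTED_IO[model_name]
    return (sorted(input_names) == sorted(exp["inputs"])
            and sorted(output_names) == sorted(exp["outputs"]))
```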
- Convert ONNX → MNN. Note: for the text_encoder models (visual + llm), use the specialized export script from the custom MNN fork to ensure proper MNN structure support; the remaining models can be converted with MNNConvert:
  MNNConvert -f ONNX --modelFile unet.onnx --MNNModel unet.mnn --weightQuantBits 8
  MNNConvert -f ONNX --modelFile vae_encoder.onnx --MNNModel vae_encoder.mnn --weightQuantBits 16
  MNNConvert -f ONNX --modelFile vae_decoder.onnx --MNNModel vae_decoder.mnn --weightQuantBits 16
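The conversion commands above can be scripted per component, since only the file names and bit-widths vary. A minimal sketch; `build_cmd` is a hypothetical helper, and you would execute the resulting lists with subprocess.run.

```python
# Build MNNConvert command lines for the non-text_encoder models, using
# the bit-widths from this package (unet int8, VAE fp16 weights).
def build_cmd(onnx_file, mnn_file, weight_quant_bits):
    return ["MNNConvert", "-f", "ONNX",
            "--modelFile", onnx_file,
            "--MNNModel", mnn_file,
            "--weightQuantBits", str(weight_quant_bits)]

CONVERSIONS = [
    ("unet.onnx", "unet.mnn", 8),
    ("vae_encoder.onnx", "vae_encoder.mnn", 16),
    ("vae_decoder.onnx", "vae_decoder.mnn", 16),
]
```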
How to run
Windows (PowerShell / cmd) — longcat-image-edit.bat
The script automatically detects whether to use T2I mode or Edit mode based on the arguments:
T2I Mode (Text-to-Image):
# Generate single image
.\longcat-image-edit.bat "一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。"
# Generate multiple images (batch)
.\longcat-image-edit.bat "一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。" 10
Edit Mode (Image Editing):
# Edit single image
.\longcat-image-edit.bat "将亚裔女性的黄色针织衫换成红色,白色项链换成金色项链,表情微笑" "input.jpg"
# Edit with multiple variations (batch)
.\longcat-image-edit.bat "将亚裔女性的黄色针织衫换成红色,白色项链换成金色项链,表情微笑" "input.jpg" 10
Auto-detection logic:
- If 2nd argument is an existing file → Edit mode (input image provided)
- If 2nd argument is a number → T2I mode with batch count (no input image)
- If 2nd argument is empty → T2I mode, single image (no input image)
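The auto-detection rule above can be sketched in Python. This is a simplified illustration of the batch-file logic; `detect_mode` is a hypothetical helper, not part of the package.

```python
import os

def detect_mode(second_arg):
    """Mirror the scripts' auto-detection of T2I vs. Edit mode.

    Returns (mode, input_image, batch_count)."""
    if not second_arg:
        return ("t2i", None, 1)                # empty -> T2I, single image
    if os.path.isfile(second_arg):
        return ("edit", second_arg, 1)         # existing file -> Edit mode
    if second_arg.isdigit():
        return ("t2i", None, int(second_arg))  # number -> T2I batch count
    raise ValueError("expected an image path or a batch count")
```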
Arguments passed to diffusion_demo.exe (in order):
resource_path model_type memory_mode backend_type iteration_num random_seed output_image_name [image_size] [cfg_scale] [cfg_mode] [gpu_mem_mode] [precision_mode] [te_on_cpu] [vae_on_cpu] <prompt_text> [input_image]
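As an illustration of the argument order above, the vector the scripts hand to diffusion_demo could be assembled like this. A hedged sketch only: `demo_args` and the output file name are hypothetical, the optional size/CFG parameters are omitted, and the numeric values are examples based on defaults stated in this README.

```python
# Assemble the positional argument list in the documented order.
def demo_args(prompt, input_image=None):
    args = ["diffusion_demo",
            ".",        # resource_path (model directory)
            "3",        # model_type: 3 = LongCat-Image-Edit
            "2",        # memory_mode: 2 = balance (example value)
            "0",        # backend_type: 0 = CPU
            "20",       # iteration_num: default 20 steps
            "0",        # random_seed: 0 = auto random
            "out.jpg",  # output_image_name (hypothetical)
            prompt]
    if input_image:
        args.append(input_image)  # presence of an input image selects Edit mode
    return args
```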
Script parameters (modifiable at top of longcat-image-edit.bat):
- MODEL_DIR (default: .): model directory (relative or absolute)
- MEMORY_TYPE (0/1/2): diffusion memory mode (0=memory lack, 1=memory enough, 2=balance)
- BACKEND (0=cpu, 3=opencl, 7=vulkan)
- STEPS: diffusion steps (20–50 recommended; default 20)
- SEED: 0=auto random; non-0=fixed seed
- SIZE: 512/640/768/896/1024 (default 1024)
- CFG: classifier-free guidance scale
  - T2I mode: 3.0–8.0 (default 7.5)
  - Edit mode: 3.0–6.0 (default 4.5, auto-adjusted)
- CFG_MODE: CFG sigma range for dual-UNet models (LongCat only)
  - 0=auto (0.1~0.8), 1=wide (0.1~0.9), 2=standard (0.1~0.8)
  - 3=medium (0.15~0.7), 4=narrow (0.2~0.6), 5=minimal (0.25~0.5)
- GPU_MEM_MODE (OpenCL only): 0=auto, 1=buffer, 2=image
- PRECISION (0=auto, 1=FP16, 2=FP32 normal, 3=FP32 high)
- TE_ON_CPU (0=same backend as UNet, 1=force text_encoder on CPU; default 1)
- VAE_ON_CPU (0=same backend as UNet, 1=force VAE on CPU to save GPU memory; default 0)
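For reference, the CFG_MODE sigma ranges above can be written out as a lookup table (values copied from this README):

```python
# CFG sigma range per CFG_MODE value, as documented for this package.
CFG_SIGMA_RANGES = {
    0: (0.10, 0.80),  # auto
    1: (0.10, 0.90),  # wide
    2: (0.10, 0.80),  # standard
    3: (0.15, 0.70),  # medium
    4: (0.20, 0.60),  # narrow
    5: (0.25, 0.50),  # minimal
}
```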
Linux/macOS — longcat-image-edit.sh
Same auto-detection logic as Windows:
T2I Mode:
# Generate single image
./longcat-image-edit.sh "一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。"
# Generate multiple images (batch)
./longcat-image-edit.sh "一个年轻的亚裔女性,身穿黄色针织衫,搭配白色项链。她的双手放在膝盖上,表情恬静。背景是一堵粗糙的砖墙,午后的阳光温暖地洒在她身上,营造出一种宁静而温馨的氛围。镜头采用中距离视角,突出她的神态和服饰的细节。光线柔和地打在她的脸上,强调她的五官和饰品的质感,增加画面的层次感与亲和力。整个画面构图简洁,砖墙的纹理与阳光的光影效果相得益彰,突显出人物的优雅与从容。" 10
Edit Mode:
# Edit single image
./longcat-image-edit.sh "将亚裔女性的黄色针织衫换成红色,白色项链换成金色项链,表情微笑" "input.jpg"
# Edit with multiple variations (batch)
./longcat-image-edit.sh "将亚裔女性的黄色针织衫换成红色,白色项链换成金色项链,表情微笑" "input.jpg" 10
Script parameters mirror the Windows script:
- MODEL_DIR
- MEMORY_TYPE
- BACKEND (default 3=OpenCL on Linux/macOS)
- STEPS (default 20)
- SEED
- SIZE (default 1024)
- CFG (T2I: 7.5, Edit: 4.5, auto-adjusted)
- CFG_MODE (default 0=auto)
- GPU_MEM_MODE
- PRECISION
- TE_ON_CPU (default 1)
- VAE_ON_CPU (default 0)
Building MNN for Windows
For detailed Windows build instructions, see BUILD_WINDOWS.md.
Quick summary:
cd MNN\build
cmake .. -DCMAKE_BUILD_TYPE=Release -DMNN_USE_SSE=ON -DMNN_AVX2=ON -DMNN_AVX512=ON -DMNN_AVX512_VNNI=ON -DMNN_OPENCL=ON -DMNN_VULKAN=ON -DMNN_VULKAN_IMAGE=ON -DMNN_SUPPORT_TRANSFORMER_FUSE=ON -DMNN_BUILD_CONVERTER=ON -DMNN_OPENMP=ON -DMNN_BUILD_DIFFUSION=ON -DMNN_BUILD_LLM=ON -DMNN_BUILD_OPENCV=ON -DMNN_LOW_MEMORY=ON -DMNN_USE_THREAD_POOL=ON
cmake --build . --config Release --target diffusion_demo -j 8
Then copy diffusion_demo.exe and MNN.dll from MNN\build\Release\ to the bin\ directory.
Model Type
- model_type=3: LongCat-Image-Edit (supports both T2I and image editing)
- T2I mode: Activated when no input image is provided
- Edit mode: Activated when input image is provided as the last parameter
The scripts use model_type=3 for all operations. The diffusion_demo binary automatically detects whether to perform T2I or image editing based on the presence of the input_image parameter.
Notes
- Weight files (.mnn.weight) are large; use Git LFS/external storage when sharing.
- MNN_BUILD_DIFFUSION=ON is required for diffusion model support
- MNN_BUILD_LLM=ON is required for LongCat's multimodal text encoder (vision + LLM)
- MNN_BUILD_OPENCV=ON is recommended for better image I/O support
- Custom MNN Fork: This model requires the specialized MNN fork with LongCat support. Use: https://github.com/er6y/MNN
- The same model weights support both T2I and Edit modes; no separate download is needed
Model Information
- Base Model: meituan-longcat/LongCat-Image-Edit
- Paper: LongCat-Image Technical Report
- GitHub: LongCat-Image
- Hugging Face: LongCat-Image-Edit
Key Features
- 🌟 Dual-Mode Support: Both T2I generation and image editing in one model
- 🌟 Auto-Detection: Scripts automatically detect mode based on input parameters
- 🌟 Superior Precise Editing: Supports global editing, local editing, text modification, and reference-guided editing
- 🌟 Consistency Preservation: Strong consistency preservation in non-edited regions during multi-turn editing
- 🌟 Bilingual Support: Native Chinese and English instruction following
- 🌟 State-of-the-art Performance: Leading performance among open-source image editing models
- 🌟 Efficient Inference: Optimized for fast inference with MNN quantization
- 🌟 Advanced CFG Control: Dual-UNet architecture with configurable CFG sigma ranges