Overview

LongCat-Image-Edit-MNN-int8 is an MNN diffusion resource package for LongCat-Image-Edit, intended for use with the MNN C++ diffusion_demo binary for both Text-to-Image (T2I) and Image Editing tasks.

Supported Modes:

  • 🎨 T2I Mode: Generate images from text prompts (no input image)
  • ✏️ Edit Mode: Edit existing images based on text instructions (with input image)

Note: LongCat uses model_type=3 for both modes. The mode is automatically determined by whether an input image is provided.

LongCat-Image-Edit is a state-of-the-art bilingual (Chinese-English) multimodal model that supports:

  • 🎨 Text-to-Image Generation: Create images from text descriptions
  • 🖌️ Global editing: Modify overall image style, atmosphere, or theme
  • ✏️ Local editing: Edit specific regions or objects
  • 📝 Text modification: Change text content in images
  • 🖼️ Reference-guided editing: Edit based on reference images
  • 🔄 Multi-turn editing: Perform sequential edits while preserving consistency

What's included

  • text_encoder/ directory containing:
    • visual.mnn / visual.mnn.weight (vision encoder for image understanding, int8 quantized)
    • llm.mnn / llm.mnn.weight (language model for text processing, int4 quantized)
    • embeddings_bf16.bin (text embeddings)
    • tokenizer.txt (tokenizer vocabulary)
    • config.json (text encoder configuration with model metadata and precision info)
  • unet.mnn / unet.mnn.weight (diffusion transformer, int8 quantized)
  • vae_encoder.mnn / vae_decoder.mnn (VAE for image encoding/decoding, fp16)
  • config.json: resource description (filenames, precision labels, default inference parameters, etc.)
  • configuration.json: generic task description
  • scheduler_config.json: FlowMatchEulerDiscreteScheduler config
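
Taken together, the expected on-disk layout (a sketch assembled from the list above):

```
LongCat-Image-Edit-MNN-int8/
├── text_encoder/
│   ├── visual.mnn
│   ├── visual.mnn.weight
│   ├── llm.mnn
│   ├── llm.mnn.weight
│   ├── embeddings_bf16.bin
│   ├── tokenizer.txt
│   └── config.json
├── unet.mnn
├── unet.mnn.weight
├── vae_encoder.mnn
├── vae_decoder.mnn
├── config.json
├── configuration.json
└── scheduler_config.json
```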

The directory name LongCat-Image-Edit-MNN-int8 denotes the primary quantization level. The actual bit-widths are:

  • Visual encoder: int8
  • LLM: int4
  • Transformer (UNet): int8
  • VAE: fp16

Bit-width is set via MNNConvert --weightQuantBits. Treat config.json as the source of truth.

How to export / generate

  1. Export ONNX from PyTorch model
    • visual.onnx, llm.onnx, unet.onnx, vae_encoder.onnx, vae_decoder.onnx
    • IO names expected by the MNN LongCat-Image-Edit pipeline:
      • visual: inputs pixel_values; output image_embeds
      • llm: inputs input_ids, attention_mask, image_embeds; output last_hidden_state
      • unet: inputs sample, timestep (float), encoder_hidden_states, image_latents; output out_sample
      • vae_encoder: input sample; output latent_sample
      • vae_decoder: input latent_sample; output sample
  2. Convert ONNX → MNN:
    • Note: For the text_encoder parts (visual + llm), use the specialized export script from the custom MNN fork so the required MNN structures are produced correctly; the remaining models can be converted with MNNConvert:
    
    MNNConvert -f ONNX --modelFile unet.onnx        --MNNModel unet.mnn        --weightQuantBits 8
    MNNConvert -f ONNX --modelFile vae_encoder.onnx --MNNModel vae_encoder.mnn --weightQuantBits 16
    MNNConvert -f ONNX --modelFile vae_decoder.onnx --MNNModel vae_decoder.mnn --weightQuantBits 16
    
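After conversion, it can help to verify that each expected artifact exists and is non-empty before packaging. A minimal sketch (check_artifacts is a hypothetical helper, not part of MNN):

```shell
# Hypothetical post-conversion check: report any listed artifact that is
# missing or zero-length.
check_artifacts() {
  local status=0
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing or empty: $f"; status=1; }
  done
  return $status
}
check_artifacts unet.mnn unet.mnn.weight vae_encoder.mnn vae_decoder.mnn || true
```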

How to run

Windows (PowerShell / cmd) — longcat-image-edit.bat

The script automatically detects whether to use T2I mode or Edit mode based on the arguments:

T2I Mode (Text-to-Image):

# Generate single image
.\longcat-image-edit.bat "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise."

# Generate multiple images (batch)
.\longcat-image-edit.bat "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise." 10

Edit Mode (Image Editing):

# Edit single image
.\longcat-image-edit.bat "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg"

# Edit with multiple variations (batch)
.\longcat-image-edit.bat "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg" 10

Auto-detection logic:

  • If 2nd argument is an existing file → Edit mode (input image provided)
  • If 2nd argument is a number → T2I mode with batch count (no input image)
  • If 2nd argument is empty → T2I mode, single image (no input image)
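
The decision above can be sketched as a small shell function (an illustration of the described behavior, not the literal script source; detect_mode is a hypothetical name):

```shell
# Sketch of the wrapper scripts' auto-detection of the 2nd argument.
detect_mode() {
  local second="$1"
  if [ -f "$second" ]; then
    echo "edit"            # existing file -> Edit mode
  elif [ -n "$second" ] && [ "$second" -eq "$second" ] 2>/dev/null; then
    echo "t2i-batch"       # number -> T2I mode with batch count
  else
    echo "t2i"             # empty -> T2I mode, single image
  fi
}
```

For example, when input.jpg exists on disk, `detect_mode input.jpg` prints `edit`.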

Arguments passed to diffusion_demo.exe:
resource_path model_type memory_mode backend_type iteration_num random_seed output_image_name [image_size] [cfg_scale] [cfg_mode] [gpu_mem_mode] [precision_mode] [te_on_cpu] [vae_on_cpu] <prompt_text> [input_image]
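
For illustration, a hypothetical fully expanded Edit-mode invocation (example values chosen to match the defaults listed below; the wrapper scripts normally assemble this for you):

```shell
# Illustrative expansion of the positional argument list above:
# resource_path=. model_type=3 memory_mode=2 backend=3 steps=20 seed=0
# output=output.png size=1024 cfg=4.5 cfg_mode=0 gpu_mem_mode=0 precision=0
# te_on_cpu=1 vae_on_cpu=0, then the prompt and the (hypothetical) input image.
ARGS=(. 3 2 3 20 0 output.png 1024 4.5 0 0 0 1 0 \
      "change the sweater to red" input.jpg)
echo "diffusion_demo ${ARGS[*]}"
```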

Script parameters (modifiable at top of longcat-image-edit.bat):

  • MODEL_DIR (default .): model directory (relative or absolute)
  • MEMORY_TYPE (0/1/2): Diffusion memory mode (0=low-memory, 1=memory-sufficient, 2=balanced)
  • BACKEND (0=cpu, 3=opencl, 7=vulkan)
  • STEPS: diffusion steps (20–50 recommended; default 20)
  • SEED: 0=auto random; non-0=fixed seed
  • SIZE: 512/640/768/896/1024 (default 1024)
  • CFG: classifier-free guidance scale
    • T2I mode: 3.0–8.0 (default 7.5)
    • Edit mode: 3.0–6.0 (default 4.5, auto-adjusted)
  • CFG_MODE: CFG sigma range for dual-UNet models (LongCat only)
    • 0=auto(0.1~0.8), 1=wide(0.1~0.9), 2=standard(0.1~0.8)
    • 3=medium(0.15~0.7), 4=narrow(0.2~0.6), 5=minimal(0.25~0.5)
  • GPU_MEM_MODE (OpenCL only): 0=auto, 1=buffer, 2=image
  • PRECISION (0=auto, 1=FP16, 2=FP32 normal, 3=FP32 high)
  • TE_ON_CPU (0 same as UNet, 1 forces text_encoder on CPU; default 1)
  • VAE_ON_CPU (0 same as UNet, 1 forces VAE on CPU for GPU memory saving; default 0)
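
Example overrides near the top of the script (variable names assumed to match the parameter list above; check the script header to confirm):

```shell
# Hypothetical override values for the script parameters listed above.
MODEL_DIR=.
BACKEND=0        # run on CPU instead of a GPU backend
STEPS=30         # more steps for higher quality, at the cost of speed
SEED=42          # fixed seed so runs are reproducible
SIZE=768         # smaller canvas to reduce memory use
```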

Linux/macOS — longcat-image-edit.sh

Same auto-detection logic as Windows:

T2I Mode:

# Generate single image
./longcat-image-edit.sh "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise."

# Generate multiple images (batch)
./longcat-image-edit.sh "A young Asian woman in a yellow knit sweater with a white necklace. Her hands rest on her knees and her expression is serene. The background is a rough brick wall, with warm afternoon sunlight falling on her, creating a quiet, cozy atmosphere. The shot uses a medium-distance view to highlight her expression and the details of her outfit. Soft light falls on her face, emphasizing her features and the texture of her jewelry, adding depth and warmth to the frame. The composition is simple; the brick wall's texture complements the play of sunlight and shadow, underscoring her elegance and poise." 10

Edit Mode:

# Edit single image
./longcat-image-edit.sh "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg"

# Edit with multiple variations (batch)
./longcat-image-edit.sh "Change the Asian woman's yellow knit sweater to red, replace the white necklace with a gold one, and give her a smiling expression" "input.jpg" 10

Script parameters mirror the Windows script:

  • MODEL_DIR
  • MEMORY_TYPE
  • BACKEND (default 3=OpenCL on Linux/macOS)
  • STEPS (default 20)
  • SEED
  • SIZE (default 1024)
  • CFG (T2I: 7.5, Edit: 4.5, auto-adjusted)
  • CFG_MODE (default 0=auto)
  • GPU_MEM_MODE
  • PRECISION
  • TE_ON_CPU (default 1)
  • VAE_ON_CPU (default 0)

Building MNN for Windows

For detailed Windows build instructions, see BUILD_WINDOWS.md.

Quick summary:

cd MNN\build
cmake .. -DCMAKE_BUILD_TYPE=Release -DMNN_USE_SSE=ON -DMNN_AVX2=ON -DMNN_AVX512=ON -DMNN_AVX512_VNNI=ON -DMNN_OPENCL=ON -DMNN_VULKAN=ON -DMNN_VULKAN_IMAGE=ON -DMNN_SUPPORT_TRANSFORMER_FUSE=ON -DMNN_BUILD_CONVERTER=ON -DMNN_OPENMP=ON -DMNN_BUILD_DIFFUSION=ON -DMNN_BUILD_LLM=ON -DMNN_BUILD_OPENCV=ON -DMNN_LOW_MEMORY=ON -DMNN_USE_THREAD_POOL=ON
cmake --build . --config Release --target diffusion_demo -j 8

Then copy diffusion_demo.exe and MNN.dll from MNN\build\Release\ to the bin\ directory.

Model Type

  • model_type=3: LongCat-Image-Edit (supports both T2I and image editing)
    • T2I mode: Activated when no input image is provided
    • Edit mode: Activated when input image is provided as the last parameter

The scripts use model_type=3 for all operations. The diffusion_demo binary automatically detects whether to perform T2I or image editing based on the presence of the input_image parameter.

Notes

  • Weight files (.mnn.weight) are large; use Git LFS/external storage when sharing.
  • MNN_BUILD_DIFFUSION=ON is required for diffusion model support
  • MNN_BUILD_LLM=ON is required for LongCat's multimodal text encoder (vision + LLM)
  • MNN_BUILD_OPENCV=ON is recommended for better image I/O support
  • Custom MNN Fork: This model requires the specialized MNN fork with LongCat support. Use: https://github.com/er6y/MNN
  • The same model weights support both T2I and Edit modes; no separate downloads are needed

Model Information

Key Features

  • 🌟 Dual-Mode Support: Both T2I generation and image editing in one model
  • 🌟 Auto-Detection: Scripts automatically detect mode based on input parameters
  • 🌟 Superior Precise Editing: Supports global editing, local editing, text modification, and reference-guided editing
  • 🌟 Consistency Preservation: Strong consistency preservation in non-edited regions during multi-turn editing
  • 🌟 Bilingual Support: Native Chinese and English instruction following
  • 🌟 State-of-the-art Performance: Leading performance among open-source image editing models
  • 🌟 Efficient Inference: Optimized for fast inference with MNN quantization
  • 🌟 Advanced CFG Control: Dual-UNet architecture with configurable CFG sigma ranges