LongCat-Flash-Lite GGUF

GGUF quantizations of meituan-longcat/LongCat-Flash-Lite for use with a custom llama.cpp fork.

Custom fork required. This model uses a novel architecture (MLA + MoE with identity experts + N-gram embeddings) that is not supported by upstream llama.cpp. You must build from the longcat-flash-ngram branch of the linked fork.

About LongCat-Flash-Lite

LongCat-Flash-Lite is a 68.5B parameter Mixture-of-Experts language model from Meituan, with only 3–4.5B parameters activated per token. It combines three architectural innovations that make it unusually efficient:

  • N-gram embeddings augment the standard token embedding with context from neighboring tokens
  • Multi-head Latent Attention (MLA) compresses the KV cache for efficient long-context inference
  • Identity experts in the MoE layer allow tokens to bypass expert computation via learned residual paths

The model supports a 327,680 token context window.

Why a custom fork?

Two upstream llama.cpp PRs attempted to add this architecture:

  • PR #19167 (ngxson): N-gram embedding support, blocked because the base model architecture was not yet supported
  • PR #19182 (ngxson): LongCat-Flash base architecture, abandoned after maintainers deemed identity experts too complex

This fork implements the complete architecture in a single self-contained addition (903 lines across 15 files). The implementation was AI-generated using Claude Code, which means it cannot be submitted upstream per llama.cpp's AI usage policy. It will remain available as a standalone fork.

Available Quantizations

Quantization guidance: The sweet spot for this MoE architecture is Q4_K_M or Q5_K_M, which give the best balance of quality, speed, and VRAM. Hallucination rate climbs steadily as bit-width drops: going above Q4 yields only marginal accuracy gains at a steep speed/VRAM cost, while going below Q4 loses real knowledge. Q3_K_L is usable but noticeably degraded. Lower quantizations (Q2 and below) are not provided because the model degenerates: accuracy halves, response times spike from looping, and the hallucination rate exceeds 91%.

Quantization   Size       Filename
Q3_K_L         30.5 GB    LongCat-Flash-Lite-Q3_K_L.gguf
Q4_K_M         37.4 GB    LongCat-Flash-Lite-Q4_K_M.gguf (recommended)
Q5_K_M         44.7 GB    LongCat-Flash-Lite-Q5_K_M.gguf (recommended)
Q6_K           52.4 GB    LongCat-Flash-Lite-Q6_K.gguf
Q8_0           67.8 GB    LongCat-Flash-Lite-Q8_0.gguf
BF16           127.7 GB   LongCat-Flash-Lite-bf16.gguf
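As a sanity check, the listed file sizes translate into approximate bits per weight (a rough figure: the 68.5B parameter count is taken from above, and GGUF files also carry metadata, so these slightly overestimate the per-tensor bit-width):

```python
# Approximate bits per weight implied by each file size, assuming
# 68.5e9 total parameters and decimal GB as listed in the table.
PARAMS = 68.5e9

sizes_gb = {
    "Q3_K_L": 30.5, "Q4_K_M": 37.4, "Q5_K_M": 44.7,
    "Q6_K": 52.4, "Q8_0": 67.8, "BF16": 127.7,
}

bpw = {name: gb * 1e9 * 8 / PARAMS for name, gb in sizes_gb.items()}
for name, bits in bpw.items():
    print(f"{name}: {bits:.2f} bits/weight")
```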

How to Run

1. Build the custom llama.cpp fork

git clone -b longcat-flash-ngram https://github.com/InquiringMinds-AI/llama.cpp.git
cd llama.cpp

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -t llama-server -j$(nproc)

2. Download a quantization

# Example: Q4_K_M (37.4 GB)
huggingface-cli download InquiringMinds-AI/LongCat-Flash-Lite-GGUF \
  LongCat-Flash-Lite-Q4_K_M.gguf --local-dir ./models

3. Run the server

./build/bin/llama-server \
  -m ./models/LongCat-Flash-Lite-Q4_K_M.gguf \
  -c 16384 -ngl 999 --host 0.0.0.0 --port 8080

The server exposes an OpenAI-compatible API at http://localhost:8080/v1.
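Any OpenAI-compatible client can talk to this endpoint. A minimal stdlib-only sketch (the "model" field is required by the OpenAI schema, but llama-server serves whichever model it was launched with regardless of its value):

```python
import json
import urllib.request

# Minimal chat-completion call against the llama-server endpoint
# started in step 3; no third-party client library needed.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "LongCat-Flash-Lite",
    "messages": [{"role": "user", "content": "Summarize MLA in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

def chat(url: str = URL) -> str:
    """POST the chat payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat() returns the reply once the server from step 3 is running
```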

Inference Performance

Measured on NVIDIA GB10 (128 GB unified memory) with full GPU offload:

Quantization Generation Speed
Q4_K_M ~57 tok/s

Architecture Details

LongCat-Flash-Lite uses a double-block layout: the original 14 transformer layers each contain two sub-blocks, mapped to 28 llama.cpp blocks. Key parameters:

Parameter Value
Total parameters 68.5B
Activated parameters 3–4.5B
Vocabulary 131,072 tokens
Hidden dimension 3,072
Attention heads 32
KV heads (GQA) 1
Q LoRA rank 1,536
KV LoRA rank 512
Real experts 256
Identity experts 128
Active experts (top-k) 12
Shared experts 1
Expert FFN dimension 1,024
N-gram tables 12 (4 neighbors x 3 splits)
Context window 327,680
RoPE YaRN (factor=10, base=5M)

N-gram Embeddings

Instead of using only the current token's embedding, the model hashes neighboring tokens (4 neighbors, split into 3 groups) through 12 polynomial rolling hash tables. The final embedding is computed as:

embed = base_embedding / 13 + sum(ngram_embeddings)

This gives the model sub-word and local context awareness at the embedding level.
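A toy version of this combination, with shrunken dimensions and made-up hash constants (the real hash scheme, window layout, and table sizes live in the fork's conversion code; the 1/13 factor presumably balances the base embedding against the 12 n-gram terms):

```python
import numpy as np

# Toy sketch of embed = base/13 + sum(ngram_embeddings). Dimensions are
# shrunk for illustration (the real model uses d_model 3072 and a
# 131,072-token vocabulary); hash constants here are hypothetical.
D, VOCAB, TABLE_SIZE, N_TABLES = 64, 1000, 1024, 12

rng = np.random.default_rng(0)
base = rng.standard_normal((VOCAB, D)).astype(np.float32)
tables = rng.standard_normal((N_TABLES, TABLE_SIZE, D)).astype(np.float32)

def poly_hash(tokens, mult=31):
    """Polynomial rolling hash of a token window into a table index."""
    h = 0
    for t in tokens:
        h = (h * mult + t) % TABLE_SIZE
    return h

def embed(token_ids, pos):
    """Mix the base embedding with 12 hashed-neighbor lookups."""
    e = base[token_ids[pos]] / 13.0
    for i in range(N_TABLES):
        # each table sees a different window over the preceding tokens
        window = token_ids[max(0, pos - 1 - i % 4): pos + 1]
        e = e + tables[i, poly_hash(window, mult=31 + i)]
    return e

v = embed([5, 17, 256, 3], pos=3)
```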

Multi-head Latent Attention (MLA)

MLA compresses keys and values through a low-rank bottleneck (KV LoRA rank 512), reducing the KV cache size while maintaining attention quality. LoRA scaling factors (sqrt(2) for Q, sqrt(6) for KV) are applied at runtime.
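A shape-level sketch of the low-rank path with random stand-in weights, assuming head dim 96 (hidden 3,072 / 32 heads); the point is that only the rank-512 latent needs to be cached per token, not the expanded K and V:

```python
import numpy as np

# MLA KV compression sketch: cache the rank-512 latent, expand at
# attention time. Weights are random stand-ins; head dim 96 is assumed.
D_MODEL, KV_RANK, N_HEADS, HEAD_DIM = 3072, 512, 32, 96

rng = np.random.default_rng(0)
w_down = rng.standard_normal((D_MODEL, KV_RANK)).astype(np.float32) * 0.02
w_up_k = rng.standard_normal((KV_RANK, N_HEADS * HEAD_DIM)).astype(np.float32) * 0.02
w_up_v = rng.standard_normal((KV_RANK, N_HEADS * HEAD_DIM)).astype(np.float32) * 0.02

seq_len = 8
x = rng.standard_normal((seq_len, D_MODEL)).astype(np.float32)

# compress: this latent is all the KV cache stores per token
# (sqrt(6) is the runtime KV LoRA scaling mentioned above)
latent = (x @ w_down) * np.float32(6 ** 0.5)

# expand at attention time
k = (latent @ w_up_k).reshape(seq_len, N_HEADS, HEAD_DIM)
v = (latent @ w_up_v).reshape(seq_len, N_HEADS, HEAD_DIM)

# per-token cache: 512 floats vs 2 * 32 * 96 = 6144 for full K+V
print(f"compression: {2 * N_HEADS * HEAD_DIM / KV_RANK:.0f}x")
```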

Identity Experts

Of the 384 total experts per MoE layer, 128 are "identity" experts that pass the input through unchanged. When the router selects an identity expert, the token's representation is carried forward via a residual connection without any computation. This allows the model to learn which tokens benefit from expert processing and which are better left alone.
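A routing sketch under assumed gating details (softmax over the top-k logits; the shared expert and the real router's normalization are omitted for brevity):

```python
import numpy as np

# MoE routing sketch: 384 experts per layer, the last 128 being
# identities. An identity expert contributes the token's own activation,
# scaled by its router weight, at zero FFN cost.
N_REAL, N_IDENTITY, TOP_K, D, FFN = 256, 128, 12, 3072, 1024
N_EXPERTS = N_REAL + N_IDENTITY  # 384

def expert_ffn(e, x):
    """Stand-in for a real expert's MLP (expert FFN dim 1024)."""
    r = np.random.default_rng(e)
    w1 = r.standard_normal((D, FFN)).astype(np.float32) * 0.02
    w2 = r.standard_normal((FFN, D)).astype(np.float32) * 0.02
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(1)
x = rng.standard_normal(D).astype(np.float32)
logits = rng.standard_normal(N_EXPERTS).astype(np.float32)

top = np.argsort(logits)[-TOP_K:]
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()

out = np.zeros(D, dtype=np.float32)
for w, e in zip(weights, top):
    if e >= N_REAL:
        out += w * x                  # identity expert: free pass-through
    else:
        out += w * expert_ffn(e, x)   # real expert: full MLP
```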

License

MIT โ€” same as the source model.
