# LongCat-Flash-Lite GGUF
GGUF quantizations of meituan-longcat/LongCat-Flash-Lite for use with a custom llama.cpp fork.
**Custom fork required.** This model uses a novel architecture (MLA + MoE with identity experts + N-gram embeddings) that is not supported by upstream llama.cpp. You must build from the `longcat-flash-ngram` branch of the linked fork.
## About LongCat-Flash-Lite
LongCat-Flash-Lite is a 68.5B parameter Mixture-of-Experts language model from Meituan, with only 3–4.5B parameters activated per token. It combines three architectural innovations that make it unusually efficient:
- N-gram embeddings augment the standard token embedding with context from neighboring tokens
- Multi-head Latent Attention (MLA) compresses the KV cache for efficient long-context inference
- Identity experts in the MoE layer allow tokens to bypass expert computation via learned residual paths
The model supports a 327,680 token context window.
## Why a custom fork?
Two upstream llama.cpp PRs attempted to add this architecture:
- PR #19167 (ngxson) – N-gram embedding support, blocked because the base model was not yet supported
- PR #19182 (ngxson) – LongCat-Flash base architecture, abandoned after maintainers deemed identity experts too complex
This fork implements the complete architecture in a single self-contained addition (903 lines across 15 files). The implementation was AI-generated using Claude Code, which means it cannot be submitted upstream per llama.cpp's AI usage policy. It will remain available as a standalone fork.
## Available Quantizations
**Quantization guidance:** The sweet spot for this MoE architecture is Q4_K_M or Q5_K_M, which offer the best balance of quality, speed, and VRAM. Hallucination rate climbs monotonically as quantization becomes more aggressive: going above Q4 yields only marginal accuracy gains at steep speed/VRAM cost, while going below Q4 loses real knowledge with no quality benefit. Q3_K_L is usable but noticeably degraded. Lower quantizations (Q2 and below) are not provided because the model degenerates: accuracy halves, response times spike from looping, and the hallucination rate exceeds 91%.
| Quantization | Size | Filename |
|---|---|---|
| Q3_K_L | 30.5 GB | LongCat-Flash-Lite-Q3_K_L.gguf |
| Q4_K_M | 37.4 GB | LongCat-Flash-Lite-Q4_K_M.gguf (recommended) |
| Q5_K_M | 44.7 GB | LongCat-Flash-Lite-Q5_K_M.gguf (recommended) |
| Q6_K | 52.4 GB | LongCat-Flash-Lite-Q6_K.gguf |
| Q8_0 | 67.8 GB | LongCat-Flash-Lite-Q8_0.gguf |
| BF16 | 127.7 GB | LongCat-Flash-Lite-bf16.gguf |
## How to Run
### 1. Build the custom llama.cpp fork

```bash
git clone -b longcat-flash-ngram https://github.com/InquiringMinds-AI/llama.cpp.git
cd llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -t llama-server -j$(nproc)
```
### 2. Download a quantization

```bash
# Example: Q4_K_M (37.4 GB)
huggingface-cli download InquiringMinds-AI/LongCat-Flash-Lite-GGUF \
  LongCat-Flash-Lite-Q4_K_M.gguf --local-dir ./models
```
### 3. Run the server

```bash
./build/bin/llama-server \
  -m ./models/LongCat-Flash-Lite-Q4_K_M.gguf \
  -c 16384 -ngl 999 --host 0.0.0.0 --port 8080
```
The server exposes an OpenAI-compatible API at `http://localhost:8080/v1`.
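As a minimal sketch of talking to that endpoint with only the Python standard library (the model name string and `max_tokens` value below are arbitrary; llama-server serves whichever single model it loaded regardless of the `model` field):

```python
import json
import urllib.request

def build_chat_request(prompt, base="http://localhost:8080/v1"):
    """Build a chat-completions request for the llama-server OpenAI-compatible API."""
    body = json.dumps({
        "model": "LongCat-Flash-Lite",  # arbitrary: single-model llama-server ignores this
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("What is MLA attention?")
# With the server from step 3 running, send it like so:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (e.g. the `openai` Python package with `base_url` pointed at the server) works the same way.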
## Inference Performance
Measured on NVIDIA GB10 (128 GB unified memory) with full GPU offload:
| Quantization | Generation Speed |
|---|---|
| Q4_K_M | ~57 tok/s |
## Architecture Details
LongCat-Flash-Lite uses a double-block layout: the original 14 transformer layers each contain two sub-blocks, mapped to 28 llama.cpp blocks. Key parameters:
| Parameter | Value |
|---|---|
| Total parameters | 68.5B |
| Activated parameters | 3–4.5B |
| Vocabulary | 131,072 tokens |
| Hidden dimension | 3,072 |
| Attention heads | 32 |
| KV heads (GQA) | 1 |
| Q LoRA rank | 1,536 |
| KV LoRA rank | 512 |
| Real experts | 256 |
| Identity experts | 128 |
| Active experts (top-k) | 12 |
| Shared experts | 1 |
| Expert FFN dimension | 1,024 |
| N-gram tables | 12 (4 neighbors × 3 splits) |
| Context window | 327,680 |
| RoPE | YaRN (factor=10, base=5M) |
### N-gram Embeddings
Instead of using only the current token's embedding, the model hashes neighboring tokens (4 neighbors, split into 3 groups) through 12 polynomial rolling hash tables. The final embedding is computed as:
```
embed = base_embedding / 13 + sum(ngram_embeddings)
```
This gives the model sub-word and local context awareness at the embedding level.
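A toy NumPy sketch of that combination, with deliberately small dimensions (the real model uses a 131,072-token vocabulary and hidden dimension 3,072). The hash multiplier, bucket count, and neighbor-grouping scheme here are illustrative assumptions, not the model's actual constants:

```python
import numpy as np

VOCAB, DIM, N_TABLES = 1000, 8, 12   # toy sizes; N_TABLES matches the model's 12 tables
BUCKETS = 4096                       # hypothetical hash-table size
rng = np.random.default_rng(0)
base_table = rng.standard_normal((VOCAB, DIM))
ngram_tables = rng.standard_normal((N_TABLES, BUCKETS, DIM))

def poly_hash(tokens, mult=31, mod=BUCKETS):
    # Illustrative polynomial rolling hash over a group of neighbor tokens.
    h = 0
    for t in tokens:
        h = (h * mult + t) % mod
    return h

def embed(context, cur):
    # context: the 4 preceding tokens; split into groups (grouping is hypothetical).
    groups = [context[-1:], context[-2:], context[-4:]]
    e = base_table[cur] / 13.0       # base embedding down-weighted by 1/13
    for i in range(N_TABLES):
        e = e + ngram_tables[i, poly_hash(groups[i % 3])]
    return e

v = embed([5, 7, 11, 13], 42)
```

The key point is that the final embedding mixes one down-weighted base-table lookup with 12 hash-table lookups keyed on neighboring tokens.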
### Multi-head Latent Attention (MLA)
MLA compresses keys and values through a low-rank bottleneck (KV LoRA rank 512), reducing the KV cache size while maintaining attention quality. LoRA scaling factors (sqrt(2) for Q, sqrt(6) for KV) are applied at runtime.
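A minimal sketch of the cache saving, assuming head dimension 96 (3,072 / 32 heads) and random weights; only the rank-512 latent is cached per token, with K and V reconstructed on demand:

```python
import numpy as np

DIM, HEADS, HEAD_DIM, KV_RANK = 3072, 32, 96, 512  # HEAD_DIM = DIM / HEADS (assumed)
rng = np.random.default_rng(0)
W_down = rng.standard_normal((DIM, KV_RANK)) * 0.01          # compress hidden state
W_up_k = rng.standard_normal((KV_RANK, HEADS * HEAD_DIM)) * 0.01
W_up_v = rng.standard_normal((KV_RANK, HEADS * HEAD_DIM)) * 0.01

h = rng.standard_normal(DIM)
latent = (h @ W_down) * np.sqrt(6)   # KV LoRA scaling sqrt(6) applied at runtime
# Only `latent` (512 floats) enters the KV cache; K/V are decompressed per step:
k = (latent @ W_up_k).reshape(HEADS, HEAD_DIM)
v = (latent @ W_up_v).reshape(HEADS, HEAD_DIM)

full_cache = 2 * HEADS * HEAD_DIM    # naive MHA: full K and V per token
mla_cache = KV_RANK                  # MLA: one shared latent per token
```

Under these dimensions the per-token cache shrinks by 12× (6,144 → 512 floats) versus caching full K/V for all heads.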
### Identity Experts
Of the 384 total experts per MoE layer, 128 are "identity" experts that pass the input through unchanged. When the router selects an identity expert, the token's representation is carried forward via a residual connection without any computation. This allows the model to learn which tokens benefit from expert processing and which are better left alone.
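A toy sketch of that routing, with a tiny hidden dimension and made-up FFN sizes; the expert counts and top-k match the table above, everything else is illustrative:

```python
import numpy as np

DIM, REAL, IDENTITY, TOP_K = 16, 256, 128, 12  # toy DIM; expert counts from the model card
N_EXPERTS = REAL + IDENTITY                     # 384 routable experts
rng = np.random.default_rng(0)
router = rng.standard_normal((DIM, N_EXPERTS))
# FFN weights exist only for the 256 real experts; identity experts have no parameters.
w1 = rng.standard_normal((REAL, DIM, 32)) * 0.1
w2 = rng.standard_normal((REAL, 32, DIM)) * 0.1

def moe(x):
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]           # select 12 experts
    gates = np.exp(logits[top])
    gates /= gates.sum()
    out = np.zeros_like(x)
    for e, g in zip(top, gates):
        if e >= REAL:
            out += g * x                        # identity expert: pass-through, zero FLOPs
        else:
            out += g * (np.maximum(x @ w1[e], 0) @ w2[e])
    return out

y = moe(rng.standard_normal(DIM))
```

When the router favors identity experts for a token, that token's share of the output is just its gated input, so compute is spent only where the router deems it worthwhile.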
## Acknowledgments
- ngxson for the initial llama.cpp PRs #19167 and #19182 that explored this architecture
- kernelpool (Tarjei Mandt) for the mlx-lm implementation (merged Jan 2026), used as architectural reference
- Meituan LongCat for the original model
## License
MIT – same as the source model.