LongCat-Flash-Lite GGUF
GGUF quantizations of meituan-longcat/LongCat-Flash-Lite for use with a custom llama.cpp fork.
Custom fork required. This model uses a novel architecture (MLA + MoE with identity experts + N-gram embeddings) that is not supported by upstream llama.cpp. You must build from the longcat-flash-ngram branch of the linked fork.
About LongCat-Flash-Lite
LongCat-Flash-Lite is a 68.5B parameter Mixture-of-Experts language model from Meituan, with only 3–4.5B parameters activated per token. It combines three architectural innovations that make it unusually efficient:
- N-gram embeddings augment the standard token embedding with context from neighboring tokens
- Multi-head Latent Attention (MLA) compresses the KV cache for efficient long-context inference
- Identity experts in the MoE layer allow tokens to bypass expert computation via learned residual paths
The model supports a 327,680 token context window.
Why a custom fork?
Two upstream llama.cpp PRs attempted to add this architecture:
- PR #19167 (ngxson) – N-gram embedding support, blocked by the base model not yet being supported
- PR #19182 (ngxson) – LongCat-Flash base architecture, abandoned after maintainers deemed identity experts too complex
This fork implements the complete architecture in a single self-contained addition (903 lines across 15 files). The implementation was AI-generated using Claude Code, which means it cannot be submitted upstream per llama.cpp's AI usage policy. It will remain available as a standalone fork.
Available Quantizations
Quantization guidance: The sweet spot for this MoE architecture is Q4_K_M or Q5_K_M – the best balance of quality, speed, and VRAM. Hallucination rate climbs monotonically as quantization becomes more aggressive: going above Q4 yields only marginal accuracy gains at steep speed/VRAM cost, while going below Q4 loses real knowledge with no quality benefit. Q3_K_L is usable but noticeably degraded. Lower quantizations (Q2 and below) are not provided because the model degenerates – accuracy halves, response times spike from looping, and the hallucination rate exceeds 91%.
| Quantization | Size | Filename |
|---|---|---|
| Q3_K_L | 30.5 GB | LongCat-Flash-Lite-Q3_K_L.gguf |
| Q4_K_M | 37.4 GB | LongCat-Flash-Lite-Q4_K_M.gguf (recommended) |
| Q5_K_M | 44.7 GB | LongCat-Flash-Lite-Q5_K_M.gguf (recommended) |
| Q6_K | 52.4 GB | LongCat-Flash-Lite-Q6_K.gguf |
| Q8_0 | 67.8 GB | LongCat-Flash-Lite-Q8_0.gguf |
| BF16 | 127.7 GB | LongCat-Flash-Lite-bf16.gguf |
How to Run
1. Build the custom llama.cpp fork
git clone -b longcat-flash-ngram https://github.com/InquiringMinds-AI/llama.cpp.git
cd llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -t llama-server -j$(nproc)
2. Download a quantization
# Example: Q4_K_M (37.4 GB)
huggingface-cli download InquiringMinds-AI/LongCat-Flash-Lite-GGUF \
LongCat-Flash-Lite-Q4_K_M.gguf --local-dir ./models
3. Run the server
./build/bin/llama-server \
-m ./models/LongCat-Flash-Lite-Q4_K_M.gguf \
-c 16384 -ngl 999 --host 0.0.0.0 --port 8080
The server exposes an OpenAI-compatible API at http://localhost:8080/v1.
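Once the server is running, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (the model name passed here is illustrative; llama-server generally accepts whatever name the client sends for a single loaded model):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server default from the command above

def build_request(prompt, model="LongCat-Flash-Lite-Q4_K_M"):
    # OpenAI-style chat-completions payload
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt):
    # POST to the OpenAI-compatible chat endpoint and return the reply text
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same endpoint also works with the official openai Python SDK by setting base_url to http://localhost:8080/v1 and any placeholder API key.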
Inference Performance
Measured on NVIDIA GB10 (128 GB unified memory) with full GPU offload:
| Quantization | Generation Speed |
|---|---|
| Q4_K_M | ~57 tok/s |
Architecture Details
LongCat-Flash-Lite uses a double-block layout: the original 14 transformer layers each contain two sub-blocks, mapped to 28 llama.cpp blocks. Key parameters:
| Parameter | Value |
|---|---|
| Total parameters | 68.5B |
| Activated parameters | 3โ4.5B |
| Vocabulary | 131,072 tokens |
| Hidden dimension | 3,072 |
| Attention heads | 32 |
| KV heads (GQA) | 1 |
| Q LoRA rank | 1,536 |
| KV LoRA rank | 512 |
| Real experts | 256 |
| Identity experts | 128 |
| Active experts (top-k) | 12 |
| Shared experts | 1 |
| Expert FFN dimension | 1,024 |
| N-gram tables | 12 (4 neighbors × 3 splits) |
| Context window | 327,680 |
| RoPE | YaRN (factor=10, base=5M) |
N-gram Embeddings
Instead of using only the current token's embedding, the model hashes neighboring tokens (4 neighbors, split into 3 groups) through 12 polynomial rolling hash tables. The final embedding is computed as:
embed = base_embedding / 13 + sum(ngram_embeddings)
This gives the model sub-word and local context awareness at the embedding level.
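A toy numpy sketch of this combination. The table sizes, hash base, and exact neighbor/split scheme are illustrative assumptions; only the 12-table count, the base_embedding / 13 scaling, and the additive combination follow the text:

```python
import numpy as np

HIDDEN = 8        # toy hidden dimension (real model: 3072)
TABLE_SIZE = 97   # toy hash-table size (assumption)
rng = np.random.default_rng(0)

base_table = rng.normal(size=(100, HIDDEN))               # toy token embeddings
ngram_tables = rng.normal(size=(12, TABLE_SIZE, HIDDEN))  # 12 n-gram tables

def poly_hash(tokens, base):
    # Polynomial rolling hash over a token window
    h = 0
    for t in tokens:
        h = (h * base + t) % TABLE_SIZE
    return h

def embed(context, cur):
    # Base embedding is down-weighted by 13, then 12 n-gram lookups are added:
    # 4 neighbor-window sizes x 3 hash variants ("splits") per window.
    vec = base_table[cur] / 13.0
    idx = 0
    for n in range(1, 5):                  # windows over the last 1..4 neighbors
        tokens = context[-n:] + [cur]
        for split in range(3):             # 3 hash variants per window (assumption)
            h = poly_hash(tokens, base=131 + split)
            vec = vec + ngram_tables[idx, h]
            idx += 1
    return vec

v = embed([5, 9, 2, 7], 3)
```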
Multi-head Latent Attention (MLA)
MLA compresses keys and values through a low-rank bottleneck (KV LoRA rank 512), reducing the KV cache size while maintaining attention quality. LoRA scaling factors (sqrt(2) for Q, sqrt(6) for KV) are applied at runtime.
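A down-scaled numpy sketch of the low-rank KV path (all dimensions are toy values; the real model uses hidden size 3,072 and KV LoRA rank 512; the sqrt(6) factor follows the text):

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, kv_rank, n_heads, head_dim, seq = 64, 16, 4, 16, 10

W_down = rng.normal(size=(hidden, kv_rank)) / np.sqrt(hidden)     # compress to latent
W_up_k = rng.normal(size=(kv_rank, n_heads * head_dim)) / np.sqrt(kv_rank)  # expand to keys

x = rng.normal(size=(seq, hidden))           # token hidden states
latent = (x @ W_down) * np.sqrt(6)           # sqrt(6) KV LoRA scaling applied at runtime
k = (latent @ W_up_k).reshape(seq, n_heads, head_dim)

# The KV cache stores only `latent` (seq x kv_rank) rather than full per-head
# K and V, which is where MLA's memory saving comes from.
```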
Identity Experts
Of the 384 total experts per MoE layer, 128 are "identity" experts that pass the input through unchanged. When the router selects an identity expert, the token's representation is carried forward via a residual connection without any computation. This allows the model to learn which tokens benefit from expert processing and which are better left alone.
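A toy router sketch of this mechanism. Dimensions and weights are illustrative; only the 256/128 real/identity split, top-k of 12, and the pass-through behavior of identity experts follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)
N_REAL, N_IDENTITY, TOP_K, DIM = 256, 128, 12, 8   # DIM is a toy hidden size

router = rng.normal(size=(DIM, N_REAL + N_IDENTITY))     # router over all 384 experts
expert_w = rng.normal(size=(N_REAL, DIM, DIM)) / DIM     # real experts only

def moe(x):
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]                    # top-k expert indices
    gates = np.exp(logits[top]); gates /= gates.sum()    # softmax over selected
    out = np.zeros(DIM)
    for g, e in zip(gates, top):
        if e >= N_REAL:
            out += g * x                 # identity expert: pass input through unchanged
        else:
            out += g * (x @ expert_w[e]) # real expert computation
    return out

y = moe(rng.normal(size=DIM))
```

Because identity experts cost no FLOPs, the effective compute per token depends on how many of the 12 selected experts are real, which is why the activated-parameter count is quoted as a 3–4.5B range.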
Acknowledgments
- ngxson for the initial llama.cpp PRs #19167 and #19182 that explored this architecture
- kernelpool (Tarjei Mandt) for the mlx-lm implementation (merged Jan 2026), used as architectural reference
- Meituan LongCat for the original model
License
MIT – same as the source model.