# LongCat-Flash-Lite GGUF
GGUF quantizations of meituan-longcat/LongCat-Flash-Lite for use with a custom llama.cpp fork.
**Custom fork required.** This model uses a novel architecture (MLA + MoE with identity experts + N-gram embeddings) that is not supported by upstream llama.cpp. You must build from the `longcat-flash-ngram` branch of the linked fork.
## About LongCat-Flash-Lite
LongCat-Flash-Lite is a 68.5B parameter Mixture-of-Experts language model from Meituan, with only 3–4.5B parameters activated per token. It combines three architectural innovations that make it unusually efficient:
- N-gram embeddings augment the standard token embedding with context from neighboring tokens
- Multi-head Latent Attention (MLA) compresses the KV cache for efficient long-context inference
- Identity experts in the MoE layer allow tokens to bypass expert computation via learned residual paths
The model supports a 327,680 token context window.
## Why a custom fork?
Two upstream llama.cpp PRs attempted to add this architecture:
- PR #19167 (ngxson) – N-gram embedding support, blocked because the base model was not yet supported
- PR #19182 (ngxson) – LongCat-Flash base architecture, abandoned after maintainers deemed identity experts too complex
This fork implements the complete architecture in a single self-contained addition (903 lines across 15 files). The implementation was AI-generated using Claude Code, which means it cannot be submitted upstream per llama.cpp's AI usage policy. It will remain available as a standalone fork.
## Available Quantizations
**Quantization guidance:** The sweet spot for this MoE architecture is Q4_K_M or Q5_K_M, which offer the best balance of quality, speed, and VRAM. Hallucination rate climbs monotonically as quantization becomes more aggressive: going above Q4 yields only marginal accuracy gains at steep speed/VRAM cost, while going below Q4 loses real knowledge with no quality benefit. Q3_K_L is usable but noticeably degraded. Lower quantizations (Q2 and below) are not provided because the model degenerates: accuracy halves, response times spike from looping, and the hallucination rate exceeds 91%.
| Quantization | Size | Filename |
|---|---|---|
| Q3_K_L | 30.5 GB | LongCat-Flash-Lite-Q3_K_L.gguf |
| Q4_K_M | 37.4 GB | LongCat-Flash-Lite-Q4_K_M.gguf (recommended) |
| Q5_K_M | 44.7 GB | LongCat-Flash-Lite-Q5_K_M.gguf (recommended) |
| Q6_K | 52.4 GB | LongCat-Flash-Lite-Q6_K.gguf |
| Q8_0 | 67.8 GB | LongCat-Flash-Lite-Q8_0.gguf |
| BF16 | 127.7 GB | LongCat-Flash-Lite-bf16.gguf |
## How to Run
### 1. Build the custom llama.cpp fork

```bash
git clone -b longcat-flash-ngram https://github.com/InquiringMinds-AI/llama.cpp.git
cd llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -t llama-server -j$(nproc)
```
### 2. Download a quantization

```bash
# Example: Q4_K_M (37.4 GB)
huggingface-cli download InquiringMinds-AI/LongCat-Flash-Lite-GGUF \
  LongCat-Flash-Lite-Q4_K_M.gguf --local-dir ./models
```
### 3. Run the server

```bash
./build/bin/llama-server \
  -m ./models/LongCat-Flash-Lite-Q4_K_M.gguf \
  -c 16384 -ngl 999 --host 0.0.0.0 --port 8080
```
The server exposes an OpenAI-compatible API at `http://localhost:8080/v1`.
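As a minimal sketch of talking to that endpoint with only the Python standard library (the model name string and `max_tokens` value below are arbitrary; llama-server serves whichever single model it loaded regardless of the `model` field):

```python
import json
import urllib.request

def build_chat_request(prompt, base="http://localhost:8080/v1"):
    """Build a chat-completions request for the llama-server OpenAI-compatible API."""
    body = json.dumps({
        "model": "LongCat-Flash-Lite",  # arbitrary: single-model llama-server ignores this
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("What is MLA attention?")
# With the server from step 3 running, send it like so:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (e.g. the `openai` Python package with `base_url` pointed at the server) works the same way.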
## Inference Performance
Measured on NVIDIA GB10 (128 GB unified memory) with full GPU offload:
| Quantization | Generation Speed |
|---|---|
| Q4_K_M | ~57 tok/s |
## Architecture Details
LongCat-Flash-Lite uses a double-block layout: the original 14 transformer layers each contain two sub-blocks, mapped to 28 llama.cpp blocks. Key parameters:
| Parameter | Value |
|---|---|
| Total parameters | 68.5B |
| Activated parameters | 3–4.5B |
| Vocabulary | 131,072 tokens |
| Hidden dimension | 3,072 |
| Attention heads | 32 |
| KV heads (GQA) | 1 |
| Q LoRA rank | 1,536 |
| KV LoRA rank | 512 |
| Real experts | 256 |
| Identity experts | 128 |
| Active experts (top-k) | 12 |
| Shared experts | 1 |
| Expert FFN dimension | 1,024 |
| N-gram tables | 12 (4 neighbors × 3 splits) |
| Context window | 327,680 |
| RoPE | YaRN (factor=10, base=5M) |
### N-gram Embeddings
Instead of using only the current token's embedding, the model hashes neighboring tokens (4 neighbors, split into 3 groups) through 12 polynomial rolling hash tables. The final embedding is computed as:
```
embed = base_embedding / 13 + sum(ngram_embeddings)
```
This gives the model sub-word and local context awareness at the embedding level.
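A toy NumPy sketch of that combination, with deliberately small dimensions (the real model uses a 131,072-token vocabulary and hidden dimension 3,072). The hash multiplier, bucket count, and neighbor-grouping scheme here are illustrative assumptions, not the model's actual constants:

```python
import numpy as np

VOCAB, DIM, N_TABLES = 1000, 8, 12   # toy sizes; N_TABLES matches the model's 12 tables
BUCKETS = 4096                       # hypothetical hash-table size
rng = np.random.default_rng(0)
base_table = rng.standard_normal((VOCAB, DIM))
ngram_tables = rng.standard_normal((N_TABLES, BUCKETS, DIM))

def poly_hash(tokens, mult=31, mod=BUCKETS):
    # Illustrative polynomial rolling hash over a group of neighbor tokens.
    h = 0
    for t in tokens:
        h = (h * mult + t) % mod
    return h

def embed(context, cur):
    # context: the 4 preceding tokens; split into groups (grouping is hypothetical).
    groups = [context[-1:], context[-2:], context[-4:]]
    e = base_table[cur] / 13.0       # base embedding down-weighted by 1/13
    for i in range(N_TABLES):
        e = e + ngram_tables[i, poly_hash(groups[i % 3])]
    return e

v = embed([5, 7, 11, 13], 42)
```

The key point is that the final embedding mixes one down-weighted base-table lookup with 12 hash-table lookups keyed on neighboring tokens.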
### Multi-head Latent Attention (MLA)
MLA compresses keys and values through a low-rank bottleneck (KV LoRA rank 512), reducing the KV cache size while maintaining attention quality. LoRA scaling factors (sqrt(2) for Q, sqrt(6) for KV) are applied at runtime.
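A minimal sketch of the cache saving, assuming head dimension 96 (3,072 / 32 heads) and random weights; only the rank-512 latent is cached per token, with K and V reconstructed on demand:

```python
import numpy as np

DIM, HEADS, HEAD_DIM, KV_RANK = 3072, 32, 96, 512  # HEAD_DIM = DIM / HEADS (assumed)
rng = np.random.default_rng(0)
W_down = rng.standard_normal((DIM, KV_RANK)) * 0.01          # compress hidden state
W_up_k = rng.standard_normal((KV_RANK, HEADS * HEAD_DIM)) * 0.01
W_up_v = rng.standard_normal((KV_RANK, HEADS * HEAD_DIM)) * 0.01

h = rng.standard_normal(DIM)
latent = (h @ W_down) * np.sqrt(6)   # KV LoRA scaling sqrt(6) applied at runtime
# Only `latent` (512 floats) enters the KV cache; K/V are decompressed per step:
k = (latent @ W_up_k).reshape(HEADS, HEAD_DIM)
v = (latent @ W_up_v).reshape(HEADS, HEAD_DIM)

full_cache = 2 * HEADS * HEAD_DIM    # naive MHA: full K and V per token
mla_cache = KV_RANK                  # MLA: one shared latent per token
```

Under these dimensions the per-token cache shrinks by 12× (6,144 → 512 floats) versus caching full K/V for all heads.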
### Identity Experts
Of the 384 total experts per MoE layer, 128 are "identity" experts that pass the input through unchanged. When the router selects an identity expert, the token's representation is carried forward via a residual connection without any computation. This allows the model to learn which tokens benefit from expert processing and which are better left alone.
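A toy sketch of that routing, with a tiny hidden dimension and made-up FFN sizes; the expert counts and top-k match the table above, everything else is illustrative:

```python
import numpy as np

DIM, REAL, IDENTITY, TOP_K = 16, 256, 128, 12  # toy DIM; expert counts from the model card
N_EXPERTS = REAL + IDENTITY                     # 384 routable experts
rng = np.random.default_rng(0)
router = rng.standard_normal((DIM, N_EXPERTS))
# FFN weights exist only for the 256 real experts; identity experts have no parameters.
w1 = rng.standard_normal((REAL, DIM, 32)) * 0.1
w2 = rng.standard_normal((REAL, 32, DIM)) * 0.1

def moe(x):
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]           # select 12 experts
    gates = np.exp(logits[top])
    gates /= gates.sum()
    out = np.zeros_like(x)
    for e, g in zip(top, gates):
        if e >= REAL:
            out += g * x                        # identity expert: pass-through, zero FLOPs
        else:
            out += g * (np.maximum(x @ w1[e], 0) @ w2[e])
    return out

y = moe(rng.standard_normal(DIM))
```

When the router favors identity experts for a token, that token's share of the output is just its gated input, so compute is spent only where the router deems it worthwhile.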
## Acknowledgments
- ngxson for the initial llama.cpp PRs #19167 and #19182 that explored this architecture
- kernelpool (Tarjei Mandt) for the mlx-lm implementation (merged Jan 2026), used as architectural reference
- Meituan LongCat for the original model
## License
MIT – same as the source model.