Instructions to use mkit/Yuan3.0-Flash-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use mkit/Yuan3.0-Flash-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="mkit/Yuan3.0-Flash-GGUF",
	filename="Yuan3.0-Flash-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use mkit/Yuan3.0-Flash-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M

Use Docker

docker model run hf.co/mkit/Yuan3.0-Flash-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use mkit/Yuan3.0-Flash-GGUF with Ollama:
```
ollama run hf.co/mkit/Yuan3.0-Flash-GGUF:Q4_K_M
```

Unsloth Studio

How to use mkit/Yuan3.0-Flash-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mkit/Yuan3.0-Flash-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for mkit/Yuan3.0-Flash-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for mkit/Yuan3.0-Flash-GGUF to start chatting

Atomic Chat new
Docker Model Runner
How to use mkit/Yuan3.0-Flash-GGUF with Docker Model Runner:
```
docker model run hf.co/mkit/Yuan3.0-Flash-GGUF:Q4_K_M
```

Lemonade

How to use mkit/Yuan3.0-Flash-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull mkit/Yuan3.0-Flash-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Yuan3.0-Flash-GGUF-Q4_K_M

List all available models

lemonade list

Yuan3.0-Flash-GGUF

⚠️ Bleeding Edge: This GGUF requires a custom llama.cpp build with Yuan3.0 support. Not yet in mainstream llama.cpp.

See Links below for the custom branch and Docker images.

GGUF quantized versions of YuanLabAI/Yuan3.0-Flash, a 40B parameter multimodal MoE model (~3.7B activated).

Model Overview

Attribute	Value
Base Model	YuanLabAI/Yuan3.0-Flash
Architecture	MoE (256 experts, 8 activated + 1 shared)
Total Parameters	40B
Activated Parameters	~3.7B
Context Length	128K
Input Modality	Text + Images

Available Quantizations

Quantization	Size	Use Case
F16 (3 shards)	~77GB	Full precision
Q4_K_M	~23GB	Good balance of speed/quality

Quickstart

Option 1: Pre-built Docker Image (Recommended)

# Pull the latest image
docker pull ghcr.io/qades/llama.cpp:latest

# Run with GPU
docker run --gpus all -v /home/mk/yuan/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-cli -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7

# Or use the OAI-compatible server
docker run --gpus all -p 8080:8080 -v /home/mk/yuan/Yuan3.0-Flash-GGUF:/model \
  ghcr.io/qades/llama.cpp:latest \
  ./llama-server -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf -c 131072

Available tags:

latest - Main branch
yuan3_0 - Yuan3.0 specific branch
sha-XXXXXX - Specific commits

Option 2: Build from Source

# Clone the custom branch
cd ~/llama.cpp
git checkout yuan3_0
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

# Run
./build/bin/llama-cli -m ../Yuan3.0-Flash-GGUF/Yuan3.0-Flash-Q4_K_M.gguf \
  --mmproj ../Yuan3.0-Flash-GGUF/mmproj-Yuan3.0-Flash-f16.gguf \
  -c 131072 -n 4096 --temp 0.7

Option 3: llama-cpp-python

Build from the custom branch, then:

from llama_cpp import Llama

llm = Llama(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    mmproj_path="mmproj-Yuan3.0-Flash-f16.gguf",
    n_ctx=131072,
    n_gpu_layers=-1,
)

# Text-only
output = llm("Explain quantum computing in simple terms")

# With image
from llama_cpp import LlamaVision

llm = LlamaVision(
    model_path="Yuan3.0-Flash-Q4_K_M.gguf",
    mmproj_path="mmproj-Yuan3.0-Flash-f16.gguf",
)
output = llm([{"type": "image", "image": "photo.jpg"}, {"type": "text", "text": "What do you see?"}])

Option 4: Ollama

Create Modelfile:

FROM ./Yuan3.0-Flash-Q4_K_M.gguf
PARAMETER mmproj ./mmproj-Yuan3.0-Flash-f16.gguf
PARAMETER context_length 131072
PARAMETER temperature 0.7

ollama create yuan3.0-flash -f Modelfile
ollama run yuan3.0-flash

vLLM (FP16 recommended for GPU)

vllm serve YuanLabAI/Yuan3.0-Flash \
    --dtype half \
    --max-model-len 131072

Memory Requirements

Quantization	RAM/VRAM
F16	~80GB
Q4_K_M	~24GB
Q4_K_M + CPU offload	~8GB VRAM

Performance Notes

Context length tested up to 131K tokens
Vision encoding adds ~0.5s per image
Recommended: --threads 8 for CPU inference

Citation

If you use this GGUF conversion:

@software{yuan3.0flash_gguf,
  title = {Yuan3.0-Flash-GGUF},
  author = {Michael Klaus},
  year = {2025},
  url = {https://huggingface.co/YuanLabAI/Yuan3.0-Flash}
}

Original model:

@misc{yuan3.0flash,
  title = {Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications},
  author = {YuanLab AI},
  year = {2025},
  eprint = {2601.01718},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

Files

Yuan3.0-Flash-GGUF/
├── mmproj-Yuan3.0-Flash-f16.gguf       # Vision projector (~744MB)
├── Yuan3.0-Flash-f16-00001-of-00003.gguf  # F16 shard 1 (~21GB)
├── Yuan3.0-Flash-f16-00002-of-00003.gguf  # F16 shard 2 (~27GB)
├── Yuan3.0-Flash-f16-00003-of-00003.gguf  # F16 shard 3 (~29GB)
└── Yuan3.0-Flash-Q4_K_M.gguf            # Q4_K_M quantization (~23GB)

Converted using llama.cpp (QaDeS branch)

Links

Resource	URL
Custom llama.cpp branch	github.com/QaDeS/llama.cpp/tree/yuan3_0
Docker images	ghcr.io/qades/llama.cpp
Base model (HuggingFace)	YuanLabAI/Yuan3.0-Flash
Local llama.cpp build	`~/llama.cpp`
Original model paper	arXiv:2601.01718

Downloads last month: 28

GGUF

Model size

40B params

Architecture

yuan

Hardware compatibility

4-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mkit/Yuan3.0-Flash-GGUF

Base model

YuanLabAI/Yuan3.0-Flash

Quantized

(1)

this model

Paper for mkit/Yuan3.0-Flash-GGUF

Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

Paper • 2601.01718 • Published Jan 5 • 1