Instructions to use mkit/Yuan3.0-Flash-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use mkit/Yuan3.0-Flash-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="mkit/Yuan3.0-Flash-GGUF", filename="Yuan3.0-Flash-Q4_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use mkit/Yuan3.0-Flash-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf mkit/Yuan3.0-Flash-GGUF:Q4_K_M
Use Docker
docker model run hf.co/mkit/Yuan3.0-Flash-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use mkit/Yuan3.0-Flash-GGUF with Ollama:
ollama run hf.co/mkit/Yuan3.0-Flash-GGUF:Q4_K_M
- Unsloth Studio new
How to use mkit/Yuan3.0-Flash-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for mkit/Yuan3.0-Flash-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for mkit/Yuan3.0-Flash-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for mkit/Yuan3.0-Flash-GGUF to start chatting
- Docker Model Runner
How to use mkit/Yuan3.0-Flash-GGUF with Docker Model Runner:
docker model run hf.co/mkit/Yuan3.0-Flash-GGUF:Q4_K_M
- Lemonade
How to use mkit/Yuan3.0-Flash-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull mkit/Yuan3.0-Flash-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Yuan3.0-Flash-GGUF-Q4_K_M
List all available models
lemonade list
Yuan3.0-Flash-GGUF
β οΈ Bleeding Edge: This GGUF requires a custom llama.cpp build with Yuan3.0 support. Not yet in mainstream llama.cpp.
See Links below for the custom branch and Docker images.
GGUF quantized versions of YuanLabAI/Yuan3.0-Flash, a 40B parameter multimodal MoE model (~3.7B activated).
Model Overview
| Attribute | Value |
|---|---|
| Base Model | YuanLabAI/Yuan3.0-Flash |
| Architecture | MoE (256 experts, 8 activated + 1 shared) |
| Total Parameters | 40B |
| Activated Parameters | ~3.7B |
| Context Length | 128K |
| Input Modality | Text + Images |
Available Quantizations
| Quantization | Size | Use Case |
|---|---|---|
| F16 (3 shards) | ~77GB | Full precision |
| Q4_K_M | ~23GB | Good balance of speed/quality |
Quickstart
Option 1: Pre-built Docker Image (Recommended)
# Pull the latest image
docker pull ghcr.io/qades/llama.cpp:latest
# Run with GPU
docker run --gpus all -v /home/mk/yuan/Yuan3.0-Flash-GGUF:/model \
ghcr.io/qades/llama.cpp:latest \
./llama-cli -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
--mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf \
-c 131072 -n 4096 --temp 0.7
# Or use the OAI-compatible server
docker run --gpus all -p 8080:8080 -v /home/mk/yuan/Yuan3.0-Flash-GGUF:/model \
ghcr.io/qades/llama.cpp:latest \
./llama-server -m /model/Yuan3.0-Flash-Q4_K_M.gguf \
--mmproj /model/mmproj-Yuan3.0-Flash-f16.gguf -c 131072
Available tags:
latest- Main branchyuan3_0- Yuan3.0 specific branchsha-XXXXXX- Specific commits
Option 2: Build from Source
# Clone the custom branch
cd ~/llama.cpp
git checkout yuan3_0
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# Run
./build/bin/llama-cli -m ../Yuan3.0-Flash-GGUF/Yuan3.0-Flash-Q4_K_M.gguf \
--mmproj ../Yuan3.0-Flash-GGUF/mmproj-Yuan3.0-Flash-f16.gguf \
-c 131072 -n 4096 --temp 0.7
Option 3: llama-cpp-python
Build from the custom branch, then:
from llama_cpp import Llama
llm = Llama(
model_path="Yuan3.0-Flash-Q4_K_M.gguf",
mmproj_path="mmproj-Yuan3.0-Flash-f16.gguf",
n_ctx=131072,
n_gpu_layers=-1,
)
# Text-only
output = llm("Explain quantum computing in simple terms")
# With image
from llama_cpp import LlamaVision
llm = LlamaVision(
model_path="Yuan3.0-Flash-Q4_K_M.gguf",
mmproj_path="mmproj-Yuan3.0-Flash-f16.gguf",
)
output = llm([{"type": "image", "image": "photo.jpg"}, {"type": "text", "text": "What do you see?"}])
Option 4: Ollama
Create Modelfile:
FROM ./Yuan3.0-Flash-Q4_K_M.gguf
PARAMETER mmproj ./mmproj-Yuan3.0-Flash-f16.gguf
PARAMETER context_length 131072
PARAMETER temperature 0.7
ollama create yuan3.0-flash -f Modelfile
ollama run yuan3.0-flash
vLLM (FP16 recommended for GPU)
vllm serve YuanLabAI/Yuan3.0-Flash \
--dtype half \
--max-model-len 131072
Memory Requirements
| Quantization | RAM/VRAM |
|---|---|
| F16 | ~80GB |
| Q4_K_M | ~24GB |
| Q4_K_M + CPU offload | ~8GB VRAM |
Performance Notes
- Context length tested up to 131K tokens
- Vision encoding adds ~0.5s per image
- Recommended:
--threads 8for CPU inference
Citation
If you use this GGUF conversion:
@software{yuan3.0flash_gguf,
title = {Yuan3.0-Flash-GGUF},
author = {Michael Klaus},
year = {2025},
url = {https://huggingface.co/YuanLabAI/Yuan3.0-Flash}
}
Original model:
@misc{yuan3.0flash,
title = {Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications},
author = {YuanLab AI},
year = {2025},
eprint = {2601.01718},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
Files
Yuan3.0-Flash-GGUF/
βββ mmproj-Yuan3.0-Flash-f16.gguf # Vision projector (~744MB)
βββ Yuan3.0-Flash-f16-00001-of-00003.gguf # F16 shard 1 (~21GB)
βββ Yuan3.0-Flash-f16-00002-of-00003.gguf # F16 shard 2 (~27GB)
βββ Yuan3.0-Flash-f16-00003-of-00003.gguf # F16 shard 3 (~29GB)
βββ Yuan3.0-Flash-Q4_K_M.gguf # Q4_K_M quantization (~23GB)
Converted using llama.cpp (QaDeS branch)
Links
| Resource | URL |
|---|---|
| Custom llama.cpp branch | github.com/QaDeS/llama.cpp/tree/yuan3_0 |
| Docker images | ghcr.io/qades/llama.cpp |
| Base model (HuggingFace) | YuanLabAI/Yuan3.0-Flash |
| Local llama.cpp build | ~/llama.cpp |
| Original model paper | arXiv:2601.01718 |
- Downloads last month
- 52
4-bit
16-bit
Model tree for mkit/Yuan3.0-Flash-GGUF
Base model
YuanLabAI/Yuan3.0-Flash