MTP?

by floory

are the MTP layers stripped? if not, i would love to use this together with https://github.com/ggml-org/llama.cpp/pull/22673! currently there are no good quants of this model that fit within 24gb, and MTP makes a big difference (20tps --> 50tps)
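for anyone wondering where a jump like 20tps --> 50tps can come from, here's a rough back-of-envelope sketch in Python; the draft length, acceptance rate, and overhead below are made-up assumptions for illustration, not measurements from the PR:

```python
# Rough back-of-envelope for why MTP speeds up token generation (TG).
# All numbers below are illustrative assumptions, not measurements.

base_tps = 20.0      # plain autoregressive decode speed
draft_tokens = 3     # extra tokens the MTP head proposes per step (assumed)
acceptance = 0.5     # average fraction of drafted tokens accepted (assumed)
step_overhead = 1.2  # a verify step costs a bit more than a plain step (assumed)

# Each verification step emits 1 guaranteed token plus the accepted drafts.
tokens_per_step = 1 + draft_tokens * acceptance

effective_tps = base_tps * tokens_per_step / step_overhead
print(f"~{effective_tps:.0f} tps")  # ~42 tps under these assumptions
```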

Yeah, I've been watching that PR (very exciting; I use MTP on my vLLM setup and it's amazing). I'm waiting for it to merge into master, since there's still a bit of final work for it to be fully stable. Once it's merged and done, I'd be more than happy to rebuild and re-post the quants properly. For reference, the llama.cpp build I used for this model was a fresh pull of master from today.

Oh, and since you're on a 24GB card: hope you enjoy the MQ-Q6_K_3 and MQ-Q5_K_S_1, because this model was the first of any I've worked with to fire anomaly detection within the predictive engine. In other words, it detected things that broke standard quantization rules and then exploited the discovered pattern aggressively. That's why there's no Q8: it found Q6 patterns that were not just smaller but had better KLD than Q8, which was pretty cool.

You only see that when anomaly detection fires, because the architecture had weirdness that isn't normal and could be replicated. But both that Q6 and Q5 hit far above their weight; the Q5_K_S_1 beat the standard llama.cpp Q6_K, which was super cool too.
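For anyone unfamiliar with the KLD comparison I keep referring to, here's a minimal sketch of the idea; the logits below are randomly generated stand-ins, not real model outputs, and this isn't the actual MagicQuant code:

```python
import numpy as np

def kld(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """Mean KL divergence KL(P || Q) over a batch of next-token logits."""
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

# Hypothetical logits: a full-precision reference vs. two quantized runs
# over the same evaluation prompts (128 positions, 32000-token vocab).
rng = np.random.default_rng(0)
ref = rng.normal(size=(128, 32000))                   # stand-in for fp16 logits
q6 = ref + rng.normal(scale=0.05, size=ref.shape)     # constructed to drift less
q8 = ref + rng.normal(scale=0.08, size=ref.shape)     # constructed to drift more

# The "anomaly" described above is exactly this outcome: the nominally
# smaller quant scoring a lower (better) KLD than the bigger one.
print("Q6 KLD:", kld(ref, q6))
print("Q8 KLD:", kld(ref, q8))
```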

i've tested the PR with Vulkan and it works fine on my system! people reported -50% PP, which is why it's not merged, but i personally still get 600tps pp (down from maybe 900) and the TG going from 20 to 40-60 is 100% worth it. pretty please? 🥺 it's hard to go back once you try it, and vLLM isn't a great experience on 24gb, but there aren't good quants for it, they all feel dumb </3

can barely run Q5_K_S_1 so i'll check that one out

crazy how you're able to pull this off. i really appreciate your work! been following you for months :D

Thank you! And you're tempting me now! Is that PR just not stripping those MTP tensors or something? Meaning, as long as MTP isn't enabled, is it stable? Would you know?

Also, the wiki isn't fully updated yet, but I'm trying to document how this works, since it all comes down to how v2.0 operates.

But basically it's utilizing what it learned from other model tensor configurations. Usually a better version of a model simply comes from making trades: it's not necessarily fewer bits, just swapping where we prioritize bits. That's the "normal" case.

But what I call an "anomaly" within my system is a strict violation of that rule. In isolated sampling, the ffn_down group at Q8 had a lower KLD than at Q6, so Q8 is better than Q6; that's the obvious, standard result. But the system smoke-tested and validated localized emergent behavior in real hybrid scenarios that violated the rule, and those patterns could be exploited to make Q6_K beat Q8_0 in specific groups.
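Here's a toy sketch of that decision rule; the function, dict layout, and numbers are hypothetical stand-ins, not the actual v2.0 engine:

```python
# Sketch of the rule-violation check described above (hypothetical API).
# The "normal" rule: more bits in a tensor group => lower KLD. An anomaly
# fires when a real hybrid mix breaks that rule for a group.

def is_anomaly(kld_isolated: dict, kld_hybrid: dict, group: str) -> bool:
    # Isolated sampling obeys the rule: Q8 beats Q6 for this group.
    rule_holds_isolated = kld_isolated[(group, "Q8_0")] < kld_isolated[(group, "Q6_K")]
    # But in a real hybrid mix, Q6 in this group beats Q8.
    rule_breaks_hybrid = kld_hybrid[(group, "Q6_K")] < kld_hybrid[(group, "Q8_0")]
    return rule_holds_isolated and rule_breaks_hybrid

# Hypothetical measurements for the ffn_down group:
isolated = {("ffn_down", "Q8_0"): 0.0021, ("ffn_down", "Q6_K"): 0.0034}
hybrid   = {("ffn_down", "Q8_0"): 0.0051, ("ffn_down", "Q6_K"): 0.0043}

if is_anomaly(isolated, hybrid, "ffn_down"):
    print("anomaly: Q6_K beats Q8_0 in hybrid despite losing in isolation")
```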

Similar to how it utilized IQ4_NL in the embedding group. That wasn't a violation of the rule; it just hit way above what it deserved to, with emergent behavior, and that was utilized as well. Each MagicQuant repo is actually fully automated, with everything generated, unless I manually tweak the ReadMe a bit like I tend to do. And the magicquant-manifest folder in every repo holds fully transparent logs of what the hybrids derive from, including when and where Unsloth Dynamic learned configurations are used, plus tensor-by-tensor configuration maps for full reproducibility :)
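As a rough illustration of how those maps can be consumed (the entries below are made up; the real schema is whatever the magicquant-manifest files actually contain):

```python
from collections import Counter

# Stand-in for a tensor-by-tensor configuration map from the
# magicquant-manifest folder; the real file format may differ.
tensor_map = {
    "blk.0.ffn_down.weight": "Q6_K",
    "blk.1.ffn_down.weight": "Q6_K",
    "blk.0.attn_q.weight": "Q5_K",
    "token_embd.weight": "IQ4_NL",
}

# Summarize which quant types were used and how often.
counts = Counter(tensor_map.values())
print(counts)

# List every tensor that deviates from the dominant type.
dominant, _ = counts.most_common(1)[0]
for name, qtype in sorted(tensor_map.items()):
    if qtype != dominant:
        print(f"{name}: {qtype}")
```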

I think the MQ-Q5 actually took learned configurations from Unsloth Dynamic UD-Q6, mixed with some Q8 if I remember correctly, to pull it off.

> Is that PR just not stripping those MTP tensors or something? Meaning, as long as MTP isn't enabled, is it stable? Would you know?

from what i've read, quantising a model normally strips the MTP layer since it's unsupported, but that layer is needed for MTP to work. so from my understanding, as long as it's not stripped, it should work with the PR. not certain, but pretty sure.
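roughly the idea as i understand it, as a toy sketch (the tensor names and converter logic here are hypothetical, not actual llama.cpp code):

```python
# Toy illustration of the "stripping" idea: a converter walks the
# checkpoint's tensors and drops ones it doesn't support. If the MTP head
# tensors get filtered out here, nothing remains for the PR to use at runtime.

UNSUPPORTED_PATTERNS = ("mtp.",)  # hypothetical naming for the MTP head

def keep_tensor(name: str, mtp_supported: bool) -> bool:
    if any(pat in name for pat in UNSUPPORTED_PATTERNS):
        return mtp_supported  # keep MTP tensors only if the target supports them
    return True

tensors = ["blk.0.attn_q.weight", "mtp.proj.weight", "output.weight"]
for supported in (False, True):
    kept = [t for t in tensors if keep_tensor(t, mtp_supported=supported)]
    print(f"mtp_supported={supported}: {kept}")
```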

sadly i just got it onto my disk, only to realise it's 22gb and i can't fit that with 128k KV cache even at q8_0. gonna have to settle for less, but your models are goated regardless

no pressure but like... 🥺 🙏 seriously though if it is not much effort you'd win my heart

Haha, I got you. I was about to get off for the night, and I'm going to be at my brother's wedding and doing wedding stuff for like 2 days. So I may not be able to come back to this and add MTP till this weekend. I just want to test that it doesn't break anything else, but if I have time after dinner I'll try to fit it in before I head out :)

awesome, thank you very much! enjoy the wedding and wishing you all the best, can't wait for MTP :D

I'm heading out for wedding stuff, but I'll continue looking into this when I'm back. There are a lot of reports specifically about it causing issues with images right now, so I need to test it properly and make sure the model stays backwards compatible and MTP can be disabled. That way the posted version is still stable :)

Though I'm very excited to use MTP as well! It's a world of difference.

yes, the only issues are the PP regression and images crashing. if anything, it could probably just go in a separate qwen3.6-27b-mtp-magicquant repo, but it makes sense that you don't wanna bother until everything is stable. hope you have a nice time!
