REAP / MagicQuant

by redtailcowboy - opened 20 days ago

Discussion

redtailcowboy

20 days ago

Mr. MagicCoding sir, I was curious what your thoughts on the REAP pruning method are, and if you’ve considered making a MagicQuant of a REAP pruned model? In my dumb monkey brain, the file size reduction from pruning experts plus intelligent bit allocation from your quant method seems like it could make a large model perform quite well as a low bit quant. Hope you are well!

magiccodingman

Owner 20 days ago

Hmm.. I'll look into this seriously. I do wonder if and how much the system could benefit from REAP in general. If the isolated sampling could apply similar logic to what it learns from REAP, it could be a very interesting hybrid approach. I think you or someone else brought this to my attention previously, but I will admit: thinking about the brain being protected through MagicQuant's bad trade system, while utilizing Unsloth Dynamic quant configurations, and reducing size below 4 bit with REAP, I am drooling at the potential lol.

But maybe REAP wouldn't even need any isolation? Not sure. I've not really played with it. Once I flesh out a lot of my current code, I've got this added on my list now to look into seriously. It could be a really fun addition. But I know too little about REAP at the moment. There's a lot of questions I'd have to answer and a lot of testing to do, but consider me interested in REAP!

And thank you! Glad to see there's eyes on the new quants. This model will hopefully have improvements too in the future. I have to re-run the process on this model because I think I can score a bit better.

magiccodingman

Owner 9 days ago

I'm actually playing with REAP now, just as an FYI. Hopefully I'll end up with some pretty cool results. I'm trying to get a good 35% pruned without a crazy amount of loss, then utilizing the current MagicQuant finalists to quantize the model. I need to run a variety of benchmarks and I'm doing a lot by hand since it's my first time playing with it. But as long as nothing crazy happens to stop me, hopefully in the near future I'll add some REAP MagicQuant models as well. They'll be under their own repo. If it works out well, I'm staring at the MQ-IQ4_NL_1 being reduced to the ~13.8 GB range, which would be pretty cool.

redtailcowboy

9 days ago

Hell yeah sounds great, I wish you luck with that!

magiccodingman

Owner 9 days ago

I'm going to circle back to REAP. I got pretty far actually, but I don't think REAP has been updated for newer architectures in quite some time. Maybe I did something wrong, but I spent like 3 hours reverse engineering the code, then had to make multiple patches, and finally I got it to work. But it's not very efficient imo, so I'm fixing a lot of performance issues so the process can finish before heat death lol. REAP is super cool, I just didn't know it needed so much TLC. Again maybe I'm doing something wrong, but I hit error after error and was just patching code left and right.

magiccodingman

Owner 8 days ago

okay I finally got it to work TT_TT now I can start playing with it lol.

magiccodingman

Owner 8 days ago

I'm just venting but oh my lord that was so much work lol. I was actually kinda excited about REAP so I didn't give up on it. But after digging so far into the code, I have a few gripes with it personally. It's very good at finding what experts to keep or not, but it is also kinda barbaric with how it removes items. Not crapping on the project, it's genuinely amazing. But there's zero progress saving and if you want an all around strong practical edge model, not just a hyper specialized coding model, you have to basically spend tons of compute time building 0.05, 0.10, 0.15, 0.2, 0.25, 0.30 and so on. Then you'd have to basically benchmark each. More importantly even if you do that, you have to balance datasets and even then it removes things that you may have wanted to keep.

Here's an example: say professional legal frameworks light up a group of experts, and maybe there's like 4 that are utilized for this, but one is hit more than the others. We may deprioritize legal topics, but I don't want to remove all 4, maybe keep 1-2. The current method doesn't give you that kind of control. And even though you may want really powerful coding, it may be worth it to forsake a couple of experts related to coding in an effort to keep a bit more of a well rounded brain anyways.
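The kind of control described above can be sketched as a per-domain "floor": prune the weakest experts first, but never leave a domain with fewer than a chosen number of experts. This is only an illustrative toy, not REAP's or MagicQuant's actual logic. The function name `prune_with_domain_floor`, the scores, and the one-domain-per-expert labels are all hypothetical (real experts overlap several domains).

```python
from collections import defaultdict


def prune_with_domain_floor(scores, domains, n_prune, min_keep=1):
    """Pick up to n_prune of the weakest experts without emptying any domain.

    scores:  {expert_id: importance score}, higher means more important
    domains: {expert_id: domain label} (simplifying assumption: one domain each)
    """
    remaining = defaultdict(int)
    for expert in scores:
        remaining[domains[expert]] += 1

    pruned = []
    for expert in sorted(scores, key=scores.get):  # weakest first
        if len(pruned) == n_prune:
            break
        domain = domains[expert]
        if remaining[domain] > min_keep:  # respect the per-domain floor
            pruned.append(expert)
            remaining[domain] -= 1
    return pruned


# Hypothetical scores: four "legal" experts, two "code" experts.
scores = {"legal_a": 0.9, "legal_b": 0.2, "legal_c": 0.1, "legal_d": 0.05,
          "code_a": 0.8, "code_b": 0.3}
domains = {expert: expert.split("_")[0] for expert in scores}

pruned = prune_with_domain_floor(scores, domains, n_prune=4, min_keep=2)
print(pruned)  # ['legal_d', 'legal_c']
```

Even though four removals were requested, only two happen here, because every further cut would drop a domain below its floor of two experts.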

And so on and so forth.

But the system is also incredibly slow, and it assumes you have crazy amounts of VRAM, otherwise you just have to chuck it on the CPU and pray. And I want to do a lot more analysis. So, I have roughly ~52k prompts across tons of datasets on different topics I've categorized, and it's roughly 26.5M tokens in full. I then hooked it up to go through the analysis phase with a smaller hot swappable model (specifically FP8) so that it can reduce VRAM as best as it can to get it working. There's other speed improvements to be made, but this was sufficient. Basically learn from the small model, then translate it to the larger. Made it much faster.

I also wanted to store analysis metadata to track which datasets and topics attach to which experts, so that I know what is actually being sacrificed to the prune gods.
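One way such analysis metadata could be kept is a small log keyed by (layer, expert) that accumulates, per domain, how often the expert fired and with how much router weight. This is a minimal sketch under assumptions: the class `ExpertUsageLog`, its methods, and the sample values are hypothetical and are not taken from REAP or from the author's fork.

```python
from collections import defaultdict


class ExpertUsageLog:
    """Hypothetical bookkeeping: which domains activate which experts."""

    def __init__(self):
        # (layer, expert) -> domain -> [activation count, summed router weight]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))

    def record(self, domain, layer, expert, weight):
        entry = self.stats[(layer, expert)][domain]
        entry[0] += 1
        entry[1] += weight

    def domains_for(self, layer, expert):
        """Domains that hit this expert, strongest summed weight first."""
        per_domain = self.stats.get((layer, expert), {})
        return sorted(per_domain, key=lambda d: per_domain[d][1], reverse=True)


log = ExpertUsageLog()
log.record("legal", layer=3, expert=17, weight=0.40)
log.record("legal", layer=3, expert=17, weight=0.35)
log.record("coding", layer=3, expert=17, weight=0.10)

print(log.domains_for(3, 17))  # ['legal', 'coding']
print(log.domains_for(0, 0))   # [] (never activated in this toy log)
```

An expert whose list comes back empty after a large probing run would be a candidate "attic dweller", while a long list signals an expert shared across many domains.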

I think I finally got it stable. It's going to finish in likely ~4 days I think, but I built lots of caching, checkpoints, and saves into the code so that way it's easier to work with and prototype.

I have a super sloppy altered version of REAP that's doing what I want. I think I can make future iterations of it much faster, but I need a longer burn run with lots of data to learn a lot more. But when it's done, I'll have wayyyy more power if it works out. Instead of just saying, "Blast 35% of the brain" this should in theory give me the power to ask before I do any blasting, "What parts of the brain are safe to blast?" This way I can get more dynamic answers instead of just guessing.

I'm not going to add this to MagicQuant necessarily right now, but I'm going to see if I can get it to work and then apply MagicQuant finalists on my first prototype. I'll try to put together a suite of benchmarks as well when it's done. But this'll likely have to bake not only for 4 days, I'll also need more time after that for testing, in case anything goes wrong, etc. If I had to guess, I'll have some really cool prototypes of hopefully minimally damaged but much smaller models built utilizing REAP in the next 2-3 weeks.

But I do think REAP is super cool. It just hasn't been updated for newer architectures in quite some time. And if I get it to work, it'll likely be a very custom fork or potentially a near full integration rebuild, tho not necessarily changing any of the pruning logic or detection itself. I was hoping it'd be more of a 2-3 day project, didn't realize it'd take this much effort to get working on newer models, and so much work to be able to prototype, work without 80 GB VRAM GPUs, and so on. So I've hammered a ton of things out. Should be fun though!

redtailcowboy

8 days ago

That’s all great to hear! Definitely don’t feel like you’ve got to do it or whatever just bc of my request, but I’m excited for whatever it ends up as!

floory

7 days ago

cool, man. wish I had as much time as you

you could also check out REAM by Samsung, although i've not seen it used since i saw the paper but thought i'd bring it to your attention

magiccodingman

Owner 7 days ago (edited 7 days ago)

Nah there's no obligation. Honestly, I'm not sure if you're the one who brought REAP up on my GitHub in the past as well, but this time it provoked me to dive deeper. And I just philosophically really love the idea. It's very up my alley. May take me a few weeks or longer to really dig into it though. I had a little too much fun lol. I ended up rebuilding soooo much code because it was fun to dive into, and I've rebuilt a lot pretty custom so I can really dissect the guts of MoE models.

And as for REAM, I'll need to check that out too.

Also I do not have the time to do this right now in my life hahaha. I'm just being dumb and staying up farrrrrr too late because it's what I enjoy. My wife hates when I do that though.

redtailcowboy

7 days ago

lmao, well I’m glad you found an interest in it! I wasn’t the one on github so I guess there’s a lot of ppl with the idea.

magiccodingman

Owner 7 days ago (edited 7 days ago)

So, I'm turning in for the night before my wife murks me for not watching a show with her lol. But thought I'd share some really cool info I found. So the normal REAP process, as I suspected, is a pretty heavy axe on the model's brain. It can make the model incredibly good at a single benchmark, but you murder the rest of its capabilities super easily. Even with generalist datasets.

So my rebuilt version of REAP utilizes tons of the same scientific level code, but I philosophically uprooted the entire project. Instead of saying, "murk 35%" I made the system detect with pretty high confidence which experts belong to which domains. And then the importance of experts to domains.

So far the data shows it's wayyyy more conservative than one would imagine. Under normal circumstances a large dataset doesn't always accurately capture this. But there's ~0.30% of experts that I couldn't attach to any domain. AKA, they were attic dwellers that just collected dust and likely affect long tail scenarios, but are very very likely safe to remove without looking much deeper.

Then there's ~1.46% of the experts that are activated in specific locations, but even when they are activated, they are weighted super low and are again very safe to just whack, as it's likely not going to alter end level behavior or benchmarks.

But then there's decisions to make. Most experts are heavily intertwined with multiple domains/categories. But we can hedge our bets prior to pruning and find out which domains we're okay with causing some degradation in, though we're not killing that subject, just trimming some fat off of it.

So for the Qwen3.6 35B MoE I'm working with, we can whack 1.77% of the experts off with pretty high trust that the model is probably damn near where it started outside some long tail scenarios.

But then if you want to keep the model as a generalist, we can make decisions to whack off a higher % like so (each line is the cumulative total after adding that domain):

+ humanities edge
260 positions = 2.54%

+ civics edge
399 positions = 3.90%

+ psychology edge
529 positions = 5.17%

+ history edge
747 positions = 7.29%

+ creative_writing edge
875 positions = 8.54%

+ business edge
966 positions = 9.43%

+ legal edge
1,019 positions = 9.95%

+ medical edge
1,050 positions = 10.25%

Not that this would be my final decision making, but each of those categories was what my system found as safe synergy zones. AKA, it's isolated enough that I can trim them meaningfully without brutally murdering the capabilities in that subject. Again it'll degrade, but it won't be blasted in the brain or lobotomized. Just not as good as it was originally. Likely still way better than the smaller parameter models anyways though.

But this is a showcase that you can trim up to ~10% pretty safely and still get a good generalist model.

Now if you want a code specialist model, REAP normally touts higher % savings like 35%. But the issue is that REAP hits the model too hard and often accidentally murders categories that had heavy synergies with coding. For this model for example, a nearly fully intact model for coding shouldn't have more than ~14% of the experts trimmed if you still want it usable for a variety of practical use cases. You can go to ~17% trimming, but you're sacrificing technical reasoning and science categories that synergize with coding and can be called on in real world coding applications.
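The percentages in the list above can be reproduced from the position counts. The total of 10,240 expert positions used below is an assumption: it is not stated in the thread, but it is the value that makes every quoted percentage come out exactly as posted.

```python
# Assumption: total expert positions inferred from the quoted figures,
# not stated anywhere in the thread.
TOTAL_POSITIONS = 10_240

# Cumulative positions trimmed after adding each domain "edge" (from the post).
cumulative_cuts = {
    "humanities": 260,
    "civics": 399,
    "psychology": 529,
    "history": 747,
    "creative_writing": 875,
    "business": 966,
    "legal": 1_019,
    "medical": 1_050,
}

percentages = {
    name: round(100 * positions / TOTAL_POSITIONS, 2)
    for name, positions in cumulative_cuts.items()
}

for name, positions in cumulative_cuts.items():
    print(f"+ {name} edge: {positions:,} positions = {percentages[name]:.2f}%")
```

Under the same assumed total, the 1.77% "safe" baseline corresponds to roughly 181 positions, which matches the sum of the ~0.30% and ~1.46% groups (about 31 and 150 positions).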

Just because a model benchmarks well on coding after a 35% prune doesn't mean it's actually usable within a practical daily coding sense.

But here's what's interesting too! The analysis I've done basically shows where the hot spot synergies are in the model. For example, if you don't care about code but want a professional/social specialist, you keep together general, business, legal, medical, civics, psychology, and humanities, and sacrifice things like:

coding
computer_science
creative_writing
history
logic
math
science
security

you can trim up to like 22.31% of the experts pretty damn safely. But you've done big technical, code, and security loss.

If you want a creative chat specialist you can hit nearer to 26.45% of the model very safely trimmed off!

So, it's very interesting seeing the first round of results come in. I'm still baking more data in the background. But REAP is seriously a cool project. It's obvious to me now that it was built as a really cool scientific showcase of an experiment. Rebuilding it into a more engineering-focused, domain aware system, with the core components plus some new logic, gave me some really cool insight. No more saying, "delete 35%". Instead it's possible to find out what's most likely safe to delete and what sacrifices you're willing to make, where default REAP can't do that.

Oh and a cool note too. A normal global pruning would have just deleted 10.59% of the model, but with the domain and synergy protection mode, 9.12% of that model was deemed critical. Which is why it's so important to choose where you trim and not just let the executioner's axe fall blindly. 10% global lowest-score removal vs 10% domain aware trimming is not the same trim at all.
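The difference between a blind global cut and a protected cut can be illustrated with a toy example. This is only a sketch under assumptions: the functions `global_prune` and `protected_prune`, the scores, and the `critical` set are hypothetical, and it does not reproduce REAP's scoring or the numbers quoted above.

```python
def global_prune(scores, fraction):
    """Naive global cut: drop the lowest-scoring fraction of experts."""
    n = round(len(scores) * fraction)
    return set(sorted(scores, key=scores.get)[:n])


def protected_prune(scores, fraction, critical):
    """Same budget, but skip experts flagged as critical to a protected domain."""
    n = round(len(scores) * fraction)
    candidates = [e for e in sorted(scores, key=scores.get) if e not in critical]
    return set(candidates[:n])


# Hypothetical data: e0 is the globally weakest expert, e9 the strongest.
scores = {f"e{i}": i / 10 for i in range(10)}
# Low global score, yet assumed vital to some domain (e.g. tool calling).
critical = {"e0", "e1"}

naive = global_prune(scores, 0.3)
safe = protected_prune(scores, 0.3, critical)

print(sorted(naive))             # ['e0', 'e1', 'e2']
print(sorted(safe))              # ['e2', 'e3', 'e4']
print(sorted(naive & critical))  # ['e0', 'e1'] would have been lost
```

Both cuts remove the same number of experts, but the naive one spends most of its budget on experts the protected pass would have kept.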

Cheers!

floory

7 days ago

you can trim up to like 22.31% of the experts pretty damn safely. But you've done big technical, code, and security loss.

a common thing i've seen people mention/complain about with REAP is that it works and seems nearly lossless in testing/general usage, but in edge cases like using it in a Hermes harness, or some other case where those experts may have influenced certain decisions, the model starts acting "weird"

i know my descriptions are quite vague but it's the anecdotes i've heard. personally, i haven't used REAP models enough to notice anything too "off" but "safe" may not be the right word but who knows, you are the magiccodingman after all :D wouldn't be surprised if you make it work

magiccodingman

Owner 6 days ago

That's the exact issue I was detecting. REAP itself is fantastic, but its decision system is a brutal executioner that's kinda blind to tons of things, even though the analysis phase literally captures enough to avoid such things if properly organized. Protecting logic, tool calling, and so on is detectable. Personally I think that's the #1 thing to protect. If you're going to chop off parts of your model's brain, make damn sure it can still make a google search call lol.

What's cool is when you probe it properly. So far I've categorized 99.7% of the model into buckets like:

coding
math
science
general
creative_writing
medical
legal
civics
security
business
logic
humanities
history
psychology
computer_science

I'm not done with my buckets. This is just my test phase. But I've also been able to isolate things like this:

code_instruction
code_reasoning
math_reasoning
science_reasoning
personas_math
software_engineering_tool_use
instruction_mixture
fiction_writing
web_text
professional_medicine
clinical_knowledge
professional_law
international_law
government_and_politics
security_studies
professional_accounting
management
marketing
chemistry
biology
physics
formal_logic
logical_fallacies
philosophy
world_history
professional_psychology
computer_security
machine_learning

It only took 1 hour of probing with a CPU to find this information too, so it doesn't require insane hardware or time. It's been very cool to play with. I'm excited to work on it more later. I've got a quail enclosure to build today and some lawn work. I need AI to be good enough to do my chores for me. Then I can spend more time having fun.
