Instructions to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5
- SGLang
How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Unsloth Studio
How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5", max_seq_length=2048, ) - Docker Model Runner
How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with Docker Model Runner:
docker model run hf.co/AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5
Qwopus-27B-Coder-Mixed-Q5
Overview
This repository provides a highly optimized, custom-quantized GGUF model of Qwopus3.6-27B-Coder, specifically engineered for local deployment on Dual RTX 3090 setups.
The primary research objective of this quantization is to achieve an extreme context length (full 262K tokens in F16 KV Cache) while maximizing inference speed through adapted BPW and Multi-Token Prediction (MTP / Self-Speculative Decoding) and retaining most of the original model's capacities. To achieve this, the base network was quantized to Q5_0 and Q8_0 using a custom iMatrix, while the critical NextN layers and embeddings were strictly preserved in Q8_0.
The base model used for this requantization is Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF.
Research & Methodology
Selective precision Quantization for high-speed inference
Most standard quantization pipelines compress the entire model, which severely degrades the quality, the inference speeds and the NextN layer responsible for Multi-Token Prediction (which is, sometimes, completely suppress it).
To maintain capacities while optimizing for a Dual RTX 3090 (NVLink) setup, I have implemented a Selective Precision Mapping strategy. By carefully partitioning the model into Q8_0 (High Precision) and Q5_0 (Balanced Efficiency) tensors, it preserves the critical activation flows, specifically the Multi-Token Prediction (NextN) layer, without sacrificing the throughput necessary for processing massive contexts (full 262K tokens).
Strategic Layer Mapping
The following quantization scheme was applied :
--tensor-type 'token_embd\.weight=q8_0'
--tensor-type 'output\.weight=q8_0'
--tensor-type 'blk\.64\..*=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_qkv\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_q\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_k\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_v\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_output\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_gate\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ssm_alpha\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ssm_beta\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ssm_out\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ffn_up\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ffn_down\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ffn_gate\.weight=q5_0'
This scheme keeps the most important layers at high precision (Q8_0) and lowers the GGUF size on disk, enabling full offload on dual RTX 3090 setups. I chose to go for Q5_0 and Q8_0 as it both retains many capacities while lowering the needs for complex mathematical kernel calculations.
iMatrix Calibration
The model was calibrated using a custom, shuffled iMatrix to ensure high fidelity across coding, instruction-following, and bilingual tasks (English/French).
The dataset was built by merging and shuffling the following subsets from eaddario/imatrix-calibration:
code_smalltools_smalltext_en_smalltext_fr_small
Accuracy Analysis
I compared the perplexity on Wiki-Text-raw to evaluate the precision loss after the mixed-quantization scheme:
| Model Version | Precision | Wiki-text-raw (PPL) | Delta vs BF16 |
|---|---|---|---|
| Q5-Mixed | Q5_Mixed + Q8_0 MTP | 5.8285 | +0.0204 |
| Base (Source) | BF16 (Original) | 5.8081 | - |
Key Findings:
- Near-Lossless: The perplexity degradation is minimal at +0.0204, indicating that this mixed precision layout preserves the original model's reasoning capabilities.
Recommended Usage
To replicate the optimal performance (200K context, F16 Cache, Multi-GPU) using llama.cpp, use the following llama-server command. Note the specific use of --split-mode tensor and --tensor-split 1,1 for optimal PCIe bandwidth management across dual RTX 3090s. This command appeared to be the best one I could come across using an NVLink.
/path/to/llama.cpp/build/bin/llama-server \
-m /path/to/Qwopus-27B-Coder-Mixed-Q5\
--mmproj /path/to/mmproj-F16.gguf \
--split-mode tensor \
--tensor-split 1,1 \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 262144 \
--parallel 1 \
--gpu-layers 999 \
--cache-type-k f16 \
--cache-type-v f16 \
--flash-attn on \
-b 2048 -ub 2048 \
--spec-type draft-mtp \
-sps 0.70 \
--image-min-tokens 1024 \
--alias Qwen3.6-27b \
--jinja
If I may add, I also developed a proxy to enable users to select thinking or non-thinking behaviors and applied the recommended sampling parameters AND the "Preserve Thinking" option. You may find it on my GitHub.
Hardware Requirements
- Target VRAM: 48 GB (Tested on 2x NVIDIA RTX 3090 24GB).
- RAM: Minimum 32GB system RAM (Prompt caching and system overhead).
- Context limit: The command above loads ~13GB of KV cache across the two GPUs. If you experience OOM (Out of Memory) errors, consider reducing
--ctx-sizeor using 8-bit cache (--cache-type-k q8_0 --cache-type-v q8_0).
Acknowledgments
This project was made possible thanks to the outstanding tools and contributions from the open-source AI community. Special thanks to:
- llama.cpp: Using the new Tensor split mode, it finally achieves extremelly high performances on dual-GPU setups.
- eaddario: For the extremely diverse
imatrix-calibrationdataset, which was crucial in building the custom, multilingual, and code-heavy iMatrix. - Unsloth: For identifying the formatting bugs in the original model and providing the optimized, bug-free chat template and publishing the base BF16 MTP-ready model used on this project.
- The Qwen Team: For researching and releasing the exceptional Qwen3.6 architecture.
- Downloads last month
- 78
We're not able to determine the quantization variants.
Model tree for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5
Base model
Jackrong/Qwopus3.6-27B-v2