Instructions to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5

SGLang

How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Unsloth Studio

How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 to start chatting

Load model with FastModel

pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5",
    max_seq_length=2048,
)

Docker Model Runner
How to use AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5 with Docker Model Runner:
```
docker model run hf.co/AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5
```

Qwopus-27B-Coder-Mixed-Q5

Overview

This repository provides a highly optimized, custom-quantized GGUF model of Qwopus3.6-27B-Coder, specifically engineered for local deployment on Dual RTX 3090 setups. The primary research objective of this quantization is to achieve an extreme context length (full 262K tokens in F16 KV Cache) while maximizing inference speed through adapted BPW and Multi-Token Prediction (MTP / Self-Speculative Decoding) and retaining most of the original model's capacities. To achieve this, the base network was quantized to Q5_0 and Q8_0 using a custom iMatrix, while the critical NextN layers and embeddings were strictly preserved in Q8_0. The base model used for this requantization is Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF.

Research & Methodology

Selective precision Quantization for high-speed inference

Most standard quantization pipelines compress the entire model, which severely degrades the quality, the inference speeds and the NextN layer responsible for Multi-Token Prediction (which is, sometimes, completely suppress it).

To maintain capacities while optimizing for a Dual RTX 3090 (NVLink) setup, I have implemented a Selective Precision Mapping strategy. By carefully partitioning the model into Q8_0 (High Precision) and Q5_0 (Balanced Efficiency) tensors, it preserves the critical activation flows, specifically the Multi-Token Prediction (NextN) layer, without sacrificing the throughput necessary for processing massive contexts (full 262K tokens).

Strategic Layer Mapping

The following quantization scheme was applied :

--tensor-type 'token_embd\.weight=q8_0'
--tensor-type 'output\.weight=q8_0'
--tensor-type 'blk\.64\..*=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_qkv\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_q\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_k\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_v\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_output\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.attn_gate\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ssm_alpha\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ssm_beta\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ssm_out\.weight=q8_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ffn_up\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ffn_down\.weight=q5_0'
--tensor-type 'blk\.([0-9]|[1-5][0-9]|6[0-3])\.ffn_gate\.weight=q5_0'

This scheme keeps the most important layers at high precision (Q8_0) and lowers the GGUF size on disk, enabling full offload on dual RTX 3090 setups. I chose to go for Q5_0 and Q8_0 as it both retains many capacities while lowering the needs for complex mathematical kernel calculations.

iMatrix Calibration

The model was calibrated using a custom, shuffled iMatrix to ensure high fidelity across coding, instruction-following, and bilingual tasks (English/French). The dataset was built by merging and shuffling the following subsets from eaddario/imatrix-calibration:

code_small
tools_small
text_en_small
text_fr_small

Accuracy Analysis

I compared the perplexity on Wiki-Text-raw to evaluate the precision loss after the mixed-quantization scheme:

Model Version	Precision	Wiki-text-raw (PPL)	Delta vs BF16
Q5-Mixed	Q5_Mixed + Q8_0 MTP	5.8285	+0.0204
Base (Source)	BF16 (Original)	5.8081	-

Key Findings:

Near-Lossless: The perplexity degradation is minimal at +0.0204, indicating that this mixed precision layout preserves the original model's reasoning capabilities.

Recommended Usage

To replicate the optimal performance (200K context, F16 Cache, Multi-GPU) using llama.cpp, use the following llama-server command. Note the specific use of --split-mode tensor and --tensor-split 1,1 for optimal PCIe bandwidth management across dual RTX 3090s. This command appeared to be the best one I could come across using an NVLink.

/path/to/llama.cpp/build/bin/llama-server \
    -m /path/to/Qwopus-27B-Coder-Mixed-Q5\
    --mmproj /path/to/mmproj-F16.gguf \
    --split-mode tensor \
    --tensor-split 1,1 \
    --host 0.0.0.0 \
    --port 8080 \
    --ctx-size 262144 \
    --parallel 1 \
    --gpu-layers 999 \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --flash-attn on \
    -b 2048 -ub 2048 \
    --spec-type draft-mtp \
    -sps 0.70 \
    --image-min-tokens 1024 \
    --alias Qwen3.6-27b \
    --jinja

If I may add, I also developed a proxy to enable users to select thinking or non-thinking behaviors and applied the recommended sampling parameters AND the "Preserve Thinking" option. You may find it on my GitHub.

Hardware Requirements

Target VRAM: 48 GB (Tested on 2x NVIDIA RTX 3090 24GB).
RAM: Minimum 32GB system RAM (Prompt caching and system overhead).
Context limit: The command above loads ~13GB of KV cache across the two GPUs. If you experience OOM (Out of Memory) errors, consider reducing --ctx-size or using 8-bit cache (--cache-type-k q8_0 --cache-type-v q8_0).

Acknowledgments

This project was made possible thanks to the outstanding tools and contributions from the open-source AI community. Special thanks to:

llama.cpp: Using the new Tensor split mode, it finally achieves extremelly high performances on dual-GPU setups.
eaddario: For the extremely diverse imatrix-calibration dataset, which was crucial in building the custom, multilingual, and code-heavy iMatrix.
Unsloth: For identifying the formatting bugs in the original model and providing the optimized, bug-free chat template and publishing the base BF16 MTP-ready model used on this project.
The Qwen Team: For researching and releasing the exceptional Qwen3.6 architecture.

Downloads last month: 78

GGUF

Model size

0.5B params

Architecture

clip

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for AlexanderKyng/Qwopus-27B-Coder-Mixed-Q5

Base model

Jackrong/Qwopus3.6-27B-v2

Adapter

Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF

Quantized

(6)

this model

AlexanderKyng
/

Qwopus-27B-Coder-Mixed-Q5