Nex-N2-mini GGUF

GGUF quantizations of nex-agi/Nex-N2-mini for use with llama.cpp.

These are Unsloth-style UD (dynamic) quants: per-tensor quantization types tuned with an importance matrix, using the same recipe family as Unsloth’s Qwen3.6 35B-A3B MoE GGUF releases.

Model at a glance

Architecture qwen35moe (Qwen3.5 / 3.6 MoE family)
Trunk layers 40
Experts 256 total, 8 active per token
Context (train) 262144 tokens
Vocab 248320
Vision Supported via mmproj-BF16.gguf (optional)
MTP draft head Not included in this release (see note below)

Files

File Size When to use
Nex-N2-mini-UD-Q3_K_XL.gguf ~17 GB Smallest; more VRAM-friendly
Nex-N2-mini-UD-Q4_K_M.gguf ~22 GB Good default balance
Nex-N2-mini-UD-Q4_K_XL.gguf ~22 GB Recommended quality / size sweet spot
Nex-N2-mini-UD-Q5_K_XL.gguf ~27 GB Higher quality
Nex-N2-mini-UD-Q6_K_XL.gguf ~32 GB Highest quality in this set
mmproj-BF16.gguf ~0.9 GB Image / vision input (optional)
imatrix_unsloth.gguf_file ~0.2 GB Importance matrix used during quantization (reference only)

All .gguf model files are at the repo root (flat layout).

Quick start

Chat server (recommended)

llama-server \
  -m Nex-N2-mini-UD-Q4_K_XL.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 8192 \
  -ngl 99 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

Open http://127.0.0.1:8080 in your browser for the built-in chat UI.

CLI

llama-cli -m Nex-N2-mini-UD-Q4_K_XL.gguf -ngl 99 -fa on

Vision (optional)

Add the projector when you need image input:

llama-server \
  -m Nex-N2-mini-UD-Q4_K_XL.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 \
  -fa on

Text-only chat works fine without --mmproj.

VRAM and MoE offloading

This is a Mixture-of-Experts model. Even quantized, full GPU residency may not fit on smaller cards.

  • -ngl 99 (or --gpu-layers 99): offloads attention and dense weights to the GPU.
  • -ncmoe N / --n-cpu-moe N: keeps routed expert weights for the first N layers in system RAM; later layers stay on GPU. Useful on 12–16 GB GPUs.

Example for a ~12 GB GPU (adjust N to taste):

llama-server -m Nex-N2-mini-UD-Q4_K_XL.gguf -ngl 99 -ncmoe 30 -fa on -c 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0

Higher -ncmoe = more expert layers on CPU = lower VRAM use, slower generation.

MTP note

The upstream Nex config mentions an MTP (multi-token prediction) block, but these GGUF files contain 40 trunk layers only — no MTP draft weights are present in the published checkpoint.

Files in this repo use GGUF metadata block_count=40 and nextn_predict_layers=0, so they load cleanly in current llama.cpp without extra flags.

If you have an older copy of these files that fails with missing tensor blk.40.attn_norm.weight, re-download from this repo, or add:

--override-kv qwen35moe.block_count=int:40,qwen35moe.nextn_predict_layers=int:0

How these quants were built

  • Source model: nex-agi/Nex-N2-mini
  • Importance matrix: from unsloth/Qwen3.6-35B-A3B-GGUF (imatrix_unsloth.gguf_file)
  • Tensor-type recipes: Unsloth Qwen3.6 UD layouts (same family as Qwen3.5/3.6 35B-A3B MoE)
  • Tooling: llama.cpp convert_hf_to_gguf.py + llama-quantize

License

Apache 2.0 (see upstream Nex-N2-mini for model terms).

Downloads last month
129
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sjakek/Nex-N2-mini-GGUF

Quantized
(21)
this model