Instructions to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("skatardude10/SnowDrogito-RpR-32B_IQ4-XS", dtype="auto")

llama-cpp-python

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="skatardude10/SnowDrogito-RpR-32B_IQ4-XS",
	filename="SnowDrogito-RpR-32B_IQ4-XS.gguf",
)

# create_chat_completion expects a list of role/content messages, not a bare string
llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello!"}
	]
)

Local Apps

llama.cpp

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS
# Run inference directly in the terminal:
llama cli -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS
# Run inference directly in the terminal:
llama cli -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS
# Run inference directly in the terminal:
./llama-cli -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS
# Run inference directly in the terminal:
./build/bin/llama-cli -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Use Docker

docker model run hf.co/skatardude10/SnowDrogito-RpR-32B_IQ4-XS
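
Once a llama.cpp server from any of the options above is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library. It assumes the server listens on http://localhost:8080/v1 (the address used in the agent configurations on this page); the helper names `build_chat_request` and `chat` are made up for illustration, and the temperature of 0.9 is simply the Snowdrop suggestion from this card.

```python
import json
import urllib.request

# Assumption: the llama.cpp server is reachable at this base URL.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "skatardude10/SnowDrogito-RpR-32B_IQ4-XS"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.9,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("Hello!")` only works while the server is up; `build_chat_request` can be inspected offline to check the payload.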

Ollama
How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Ollama:
```
ollama run hf.co/skatardude10/SnowDrogito-RpR-32B_IQ4-XS
```

Unsloth Studio

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for skatardude10/SnowDrogito-RpR-32B_IQ4-XS to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for skatardude10/SnowDrogito-RpR-32B_IQ4-XS to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for skatardude10/SnowDrogito-RpR-32B_IQ4-XS to start chatting

Pi

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "skatardude10/SnowDrogito-RpR-32B_IQ4-XS"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Run Hermes

hermes

OpenClaw

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "skatardude10/SnowDrogito-RpR-32B_IQ4-XS" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Docker Model Runner:
```
docker model run hf.co/skatardude10/SnowDrogito-RpR-32B_IQ4-XS
```

Lemonade

How to use skatardude10/SnowDrogito-RpR-32B_IQ4-XS with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Run and chat with the model

lemonade run user.SnowDrogito-RpR-32B_IQ4-XS-{{QUANT_TAG}}

List all available models

lemonade list

SnowDrogito-RpR-32B_IQ4-XS


Updates and Description of Files

Recent uploads use ArliAI RpR V3 instead of V1, as indicated by RpR3 or RpRv3 in the file name.
All quantizations in this repo use IQ4_XS as the base, with Q8 embedding and output tensors.
(Recommended) SnowDrogito-RpR3-32B_IQ4-XS+Enhanced_Tensors.gguf: the largest and highest-quality file, at about Q4_K_M size. It was quantized with an imatrix recalibrated on Bartowski's dataset + RP + Tao at 8k context, and it uses selective quantization via llama-quantize --tensor-type flags to bump select FFN and self-attention tensors to between Q6 and Q8, as described here.
SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf: Q6 and Q5 attention tensors. This and all earlier uploads used the imatrix from Snowdrop.

MORE SPEED!

Improve inference speed by offloading individual tensors instead of whole layers, as referenced HERE. --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU" keeps the FFN up tensors of odd-numbered layers 1 through 39 on the CPU (roughly a third of them), which frees enough GPU memory to offload all layers on 24 GB and took me from 3.9 tps to 10.6 tps. Example:

python koboldcpp.py --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU" --threads 10 --usecublas --contextsize 40960 --flashattention --model ~/Downloads/SnowDrogito-RpR3-32B_IQ4-XS+Enhanced_Tensors.gguf

Adjust threads, file paths, and so on for your own system.
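
To see which tensors the override pattern actually pins to the CPU, the regex can be tried against tensor names offline. This is a rough sketch: it assumes GGUF tensor names of the form `blk.<layer>.ffn_up.weight`, a 64-layer model, and that the pattern is applied as a regex search over each name. The helper `cpu_layers` is hypothetical and not part of koboldcpp.

```python
import re

# Pattern copied from the --overridetensors argument above (without "=CPU").
PATTERN = re.compile(r"\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up")


def cpu_layers(n_layers=64):
    """Return the layer indices whose ffn_up tensor name matches the pattern."""
    # Assumed naming scheme: "blk.<layer>.ffn_up.weight".
    return [i for i in range(n_layers) if PATTERN.search(f"blk.{i}.ffn_up.weight")]


layers = cpu_layers()
print(len(layers), "of 64 ffn_up tensors matched:", layers)
```

Widening the second character class (for example `[1-5]` instead of `[1-3]`) would match more layers and free more VRAM at the cost of speed.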

Overview

SnowDrogito-RpR-32B_IQ4-XS is my shot at an optimized imatrix quantization of my QwQ RP reasoning merge. The goal is to add smarts to the popular Snowdrop roleplay model with a little ArliAI RpR and Deepcogito. Built with the TIES merge method, it attempts to combine the strengths of multiple fine-tuned QwQ-32B models, and it is quantized to IQ4_XS with Q8_0 embeddings and output layers to plus up the quality just a bit. I'm uploading it because the PPL was lower and I have been getting more varied, longer, and more creative responses with it, though it may lack contextual awareness compared to Snowdrop. Not sure.

Setup for Reasoning and ChatML

ChatML Formatting: Use ChatML with <|im_start|>role\ncontent<|im_end|>\n (e.g., <|im_start|>user\nHello!<|im_end|>\n).
Reasoning Settings: Set "include names" to "never." Start reply with <think>\n to enable reasoning.
Sampler Settings: From Snowdrop: Try temperature 0.9, min_p 0.05, top_a 0.3, TFS 0.75, repetition_penalty 1.03, DRY if available.
My Settings: Response (tokens): 2048; Context (tokens): 40960; Temperature: 3.25; Top P: 0.98; Min P: 0.04; Top nsigma: 2.5; Repetition Penalty: 1.03; (XTC) Threshold: 0.3; (XTC) Probability: 0.3; Dry Multiplier: 0.8; Dry Base: 1.75; Dry Allowed Length: 4; Dry Penalty Range: 1024
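
For frontends that take a raw prompt, the ChatML layout and the `<think>\n` reply prefix described above can be assembled by hand. A minimal sketch follows; the function name `chatml_prompt` is made up for illustration, and opening an assistant turn before the prefix is an assumption about how a "start reply with" option is usually applied.

```python
def chatml_prompt(messages, prefill="<think>\n"):
    """Render role/content messages as ChatML and open an assistant turn.

    The assistant turn is left unclosed so the model continues from `prefill`.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append(f"<|im_start|>assistant\n{prefill}")
    return "".join(parts)


prompt = chatml_prompt([{"role": "user", "content": "Hello!"}])
print(prompt)
```

Passing `prefill=""` produces a plain ChatML prompt without forcing the reasoning block.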

I'm getting great reasoning results with SillyTavern's Start Reply With set to:

<think>
Chain-of-thought: Alright, what just happened is

For more details, see the setup guides and master import for ST for Snowdrop and other info on ArliAI RpR.

Performance

Perplexity under identical conditions (IQ4_XS, 40,960 context, Q8_0 KV cache, 150K-token chat dataset):

SnowDrogito-RpR-32B: 4.5597 ± 0.02554
QwQ-32B-Snowdrop-v0: 4.6779 ± 0.02671
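
To put the two perplexity figures in proportion, here is a small sketch that computes the relative difference and checks whether the reported error ranges overlap. It assumes the first figure belongs to SnowDrogito-RpR-32B and the second to QwQ-32B-Snowdrop-v0, following the order in which the models are named; the helper names are illustrative only.

```python
# (perplexity, reported +/- error), taken from the figures above.
SNOWDROGITO = (4.5597, 0.02554)
SNOWDROP = (4.6779, 0.02671)


def relative_drop_percent(new, old):
    """Percentage by which `new` perplexity is lower than `old`."""
    return (old[0] - new[0]) / old[0] * 100


def ranges_overlap(a, b):
    """True if the +/- ranges of the two measurements intersect."""
    return a[0] + a[1] >= b[0] - b[1] and b[0] + b[1] >= a[0] - a[1]


print(f"relative drop: {relative_drop_percent(SNOWDROGITO, SNOWDROP):.2f}%")
print("ranges overlap:", ranges_overlap(SNOWDROGITO, SNOWDROP))
```

This only compares the two numbers as reported; it says nothing about roleplay quality, which perplexity does not measure directly.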

Fits 40,960 context in 24 GB of VRAM using a Q8 KV cache with full GPU offload.
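
As a back-of-the-envelope check on that claim, the KV cache size at Q8_0 can be estimated as follows. The architecture numbers (64 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Qwen2.5-32B configuration, not from this card, and Q8_0 is assumed to store each block of 32 values in 34 bytes. Compute buffers and the model weights themselves are not counted.

```python
def kv_cache_bytes(context, n_layers=64, n_kv_heads=8, head_dim=128,
                   block_bytes=34, block_values=32):
    """Estimate KV cache size in bytes for a block-quantized cache type."""
    # Keys and values are both cached, hence the factor of 2.
    values = 2 * n_layers * n_kv_heads * head_dim * context
    return values * block_bytes // block_values


size = kv_cache_bytes(40960)
print(f"{size / 2**30:.2f} GiB for the KV cache at 40,960 tokens")
```

Under these assumptions the cache takes a bit over 5 GiB, leaving the rest of the 24 GB for the weights and working buffers.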

Model Details

Base Model: Qwen/Qwen2.5-32B
Architecture: Qwen 2.5 (32B parameters)
Context Length: 40,960 tokens
Quantization: IQ4_XS with Q8_0 embeddings and output layers for better quality.
Used .imatrix file from Snowdrop.

Merge Configuration

This model was created using mergekit with the following TIES merge configuration:

models:
  - model: trashpanda-org/QwQ-32B-Snowdrop-v0
    parameters:
      weight: 0.75
      density: 0.5
  - model: deepcogito/cogito-v1-preview-qwen-32B
    parameters:
      weight: 0.15
      density: 0.5
  - model: ArliAI/QwQ-32B-ArliAI-RpR-v1
    parameters:
      weight: 0.1
      density: 0.5
merge_method: ties
base_model: Qwen/Qwen2.5-32B
parameters:
  weight: 0.9
  density: 0.9
  normalize: true
  int8_mask: true
tokenizer_source: Qwen/Qwen2.5-32B-Instruct
dtype: bfloat16
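
For readers unfamiliar with TIES, the idea behind the configuration above can be illustrated on tiny vectors: take each model's difference from the base, keep only the largest-magnitude entries (density), pick a sign per parameter by weighted vote, and average the entries that agree with that sign. This is a toy sketch in plain Python, not mergekit's code; the four-element "models" are invented, only the three weights and the 0.5 density are borrowed from the config, and the top-level parameters are ignored.

```python
def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude, zero the rest."""
    k = int(round(density * len(delta)))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base, models, weights, density):
    """Toy TIES merge with normalization over the agreeing models' weights."""
    deltas = [trim([m - b for m, b in zip(model, base)], density) for model in models]
    merged = []
    for j, b in enumerate(base):
        weighted = [w * d[j] for w, d in zip(weights, deltas)]
        sign = 1 if sum(weighted) >= 0 else -1  # elected sign for this parameter
        num = den = 0.0
        for w, wd in zip(weights, weighted):
            if wd * sign > 0:  # only entries that agree with the elected sign
                num += wd
                den += w
        merged.append(b + (num / den if den else 0.0))
    return merged


base = [0.0, 0.0, 0.0, 0.0]
toy_models = [
    [1.0, 0.1, -2.0, 0.0],   # stand-in for Snowdrop (weight 0.75)
    [-0.5, 3.0, 1.0, 0.2],   # stand-in for Cogito (weight 0.15)
    [2.0, -0.1, -4.0, 0.3],  # stand-in for RpR (weight 0.1)
]
print(ties_merge(base, toy_models, [0.75, 0.15, 0.1], density=0.5))
```

In the third position the Cogito stand-in disagrees in sign with the other two, so its value is dropped from the average rather than pulling the result toward zero.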

Quantization Details

Primary Quantization: IQ4_XS (a 4-bit i-quant at about 4.25 bits per weight) using an importance matrix (trashpanda-org_QwQ-32B-Snowdrop-v0.imatrix) for high quality at reduced size.
Embeddings & Output Layers: Quantized to Q8_0 (8-bit) to preserve precision in token embeddings and final output weights, differing from the standard IQ4_XS body. This boosts quality with a modest size increase.

Acknowledgments

mergekit for merging.
llama.cpp for quantization.
Original model creators: Qwen, trashpanda-org, deepcogito, ArliAI.


GGUF

Model size: 33B params
Architecture: qwen2


Model tree for skatardude10/SnowDrogito-RpR-32B_IQ4-XS

Base model: skatardude10/SnowDrogito-RpR-32B
Quantized (4): this model