Qwen3 8B - Heretic (Abliterated)
An abliterated version of Qwen's Qwen3-8B created using Heretic v1.2.0 (git master). This model has reduced refusals while maintaining model quality, making it suitable as an uncensored text encoder for image generation models like Klein 9B.
The Docker setup, scripts, and configurations used to make these files are available in the Heretic Docker GitHub repository.
Model Details
- Base Model: Qwen/Qwen3-8B
- Abliteration Method: Heretic v1.2.0 (git master, commit 19cdf7e)
- Trials: 3000
- Trial Selected: Trial 2681
- Refusals: 13/100 (vs 100/100 original)
- KL Divergence: 0.0838 (minimal model damage)
Files
HuggingFace Format (for transformers, llama.cpp conversion)
model.safetensors (~16 GB)
config.json
tokenizer.json
tokenizer_config.json
generation_config.json
chat_template.jinja
ComfyUI Format (for Klein 9B text encoder)
comfyui/qwen3-8b-heretic.safetensors # bf16, 16GB
comfyui/qwen3-8b-heretic_fp8_e4m3fn.safetensors # fp8, 8.8GB
comfyui/qwen3-8b-heretic_nvfp4.safetensors # nvfp4, 6.0GB
GGUF Format (for llama.cpp and ComfyUI-GGUF)
| Quant | Size | Notes |
|---|---|---|
| F16 | 16GB | Lossless reference |
| Q8_0 | 8.2GB | Excellent quality |
| Q6_K | 6.3GB | Very good quality |
| Q5_K_M | 5.5GB | Good quality |
| Q5_K_S | 5.4GB | Slightly smaller Q5 |
| Q4_K_M | 5.0GB | Recommended balance |
| Q4_K_S | 4.8GB | Smaller Q4 variant |
| Q3_K_M | 3.9GB | For low VRAM only |
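The table can be read as a simple lookup: take the largest quant whose file fits the available memory. The sketch below illustrates that reading using the sizes listed above. The `pick_quant` helper and its 1.5 GB headroom (for context cache and runtime overhead) are assumptions made for illustration, not measured requirements.

```python
# File sizes in GB, copied from the GGUF table above.
GGUF_SIZES_GB = {
    "F16": 16.0,
    "Q8_0": 8.2,
    "Q6_K": 6.3,
    "Q5_K_M": 5.5,
    "Q5_K_S": 5.4,
    "Q4_K_M": 5.0,
    "Q4_K_S": 4.8,
    "Q3_K_M": 3.9,
}


def pick_quant(vram_gb, headroom_gb=1.5):
    """Return the largest quant whose file fits in vram_gb minus headroom, or None.

    headroom_gb is a hypothetical allowance for cache and runtime overhead.
    """
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in GGUF_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    # Largest file that still fits is treated as the highest-quality choice.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for vram in (6, 8, 12, 24):
        print(f"{vram} GB -> {pick_quant(vram)}")
```

File size is only a lower bound on memory use, so treat the result as a starting point and adjust the headroom for your context length.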
NVFP4 Notes
The NVFP4 (4-bit floating point, E2M1) file uses ComfyUI's native quantization format. At 6.0GB it is roughly 2.7x smaller than the 16GB bf16 file and loads natively in ComfyUI without any plugins. Blackwell GPUs (RTX 5090/5080, SM100+) can use native FP4 tensor cores for best performance, but ComfyUI also supports software dequantization on older GPUs (tested working on RTX 4090).
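To make "4-bit floating point, E2M1" concrete, the sketch below decodes all 16 possible codes. It assumes the usual E2M1 layout of 1 sign bit, 2 exponent bits, and 1 mantissa bit with an exponent bias of 1. It covers only the element format: the scale factors a quantized file stores alongside these codes and ComfyUI's own loading code are not modeled here.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into its float value."""
    if not 0 <= code <= 15:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading one, so only 0 and 0.5 are possible.
        magnitude = mantissa * 0.5
    else:
        # Normal: implicit leading one, exponent bias of 1.
        magnitude = (1.0 + mantissa * 0.5) * 2.0 ** (exponent - 1)
    return sign * magnitude


# The eight non-negative values; the other eight codes are their negatives.
POSITIVE_VALUES = [decode_e2m1(code) for code in range(8)]

if __name__ == "__main__":
    print(POSITIVE_VALUES)
```

With so few representable magnitudes per element, the accuracy of a 4-bit format depends heavily on how the accompanying scales are chosen.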
Usage
With ComfyUI (Klein 9B)
Download a ComfyUI format file:
- FP8 (recommended): comfyui/qwen3-8b-heretic_fp8_e4m3fn.safetensors (8.8GB)
- NVFP4 (smallest): comfyui/qwen3-8b-heretic_nvfp4.safetensors (6.0GB)
- bf16 (full precision): comfyui/qwen3-8b-heretic.safetensors (16GB)
Place it in ComfyUI/models/text_encoders/.
In your Klein 9B workflow, use the ClipLoader node and select the heretic file.
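The placement step can be scripted. The sketch below assumes the file has already been downloaded and copies it into the text encoder folder named above. The `install_text_encoder` helper and the paths in the demo are hypothetical, and the demo uses a stub file in a temporary directory rather than a real checkpoint.

```python
import shutil
import tempfile
from pathlib import Path


def install_text_encoder(downloaded_file, comfyui_root):
    """Copy a downloaded encoder file into <comfyui_root>/models/text_encoders/."""
    source = Path(downloaded_file)
    target_dir = Path(comfyui_root) / "models" / "text_encoders"
    # Create the folder if this ComfyUI install does not have it yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Demo with a stub file; replace both paths with real ones.
    with tempfile.TemporaryDirectory() as tmp:
        stub = Path(tmp) / "qwen3-8b-heretic_fp8_e4m3fn.safetensors"
        stub.write_bytes(b"stub")
        print(install_text_encoder(stub, Path(tmp) / "ComfyUI"))
```

After copying, the file should appear in the loader node's file list once ComfyUI refreshes its model folders.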
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "DreamFast/qwen3-8b-heretic",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("DreamFast/qwen3-8b-heretic")

prompt = "Describe a dramatic sunset over a cyberpunk city"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
With llama.cpp
llama-server -m qwen3-8b-heretic-Q4_K_M.gguf
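Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on its default address, `http://localhost:8080`. The helper names and the `max_tokens` value are illustrative choices.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port (8080).
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_chat_request(prompt, max_tokens=200):
    """Build a POST request carrying a single-turn chat completion payload."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With the server up, `ask("Describe a dramatic sunset over a cyberpunk city")` returns the generated text. If you start the server with a different port, change `SERVER_URL` to match.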
Abliteration Process
Created using Heretic v1.2.0 (git master) with 3000 optimization trials:
? Which trial do you want to use?
[Trial 2732] Refusals: 10/100, KL divergence: 0.1001
> [Trial 2681] Refusals: 13/100, KL divergence: 0.0838 <-- selected
[Trial 2337] Refusals: 18/100, KL divergence: 0.0643
[Trial 2419] Refusals: 19/100, KL divergence: 0.0600
[Trial 2195] Refusals: 21/100, KL divergence: 0.0534
...
Trial 2681 was selected for its balance of low refusals (13/100) and moderate KL divergence (0.0838): 87 of the 100 prompts that the original model refused are now answered, with minimal model damage.
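The listing shows a trade-off: fewer refusals come with higher KL divergence. The sketch below checks this on the five trials shown, confirming that none of them beats another on both metrics, and reproduces the 87% figure. It is only an illustration of the trade-off. It is not Heretic's selection logic, and the function names are hypothetical.

```python
# (trial, refusals out of 100, KL divergence), copied from the listing above.
TRIALS = [
    (2732, 10, 0.1001),
    (2681, 13, 0.0838),
    (2337, 18, 0.0643),
    (2419, 19, 0.0600),
    (2195, 21, 0.0534),
]


def dominates(a, b):
    """True if trial a is no worse than b on both metrics and better on one."""
    _, refusals_a, kl_a = a
    _, refusals_b, kl_b = b
    no_worse = refusals_a <= refusals_b and kl_a <= kl_b
    strictly_better = refusals_a < refusals_b or kl_a < kl_b
    return no_worse and strictly_better


def pareto_front(trials):
    """Keep the trials that no other trial dominates."""
    return [t for t in trials if not any(dominates(other, t) for other in trials)]


def answered_share(refusals, total=100):
    """Fraction of the test prompts that are no longer refused."""
    return (total - refusals) / total


if __name__ == "__main__":
    print([trial for trial, _, _ in pareto_front(TRIALS)])
    print(f"{answered_share(13):.0%} of prompts answered")
```

Because every listed trial is on the front, picking among them is a judgment call about how much KL divergence to accept for each refusal removed.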
Limitations
- This model inherits all limitations of the base Qwen3-8B model
- Abliteration reduces but does not completely eliminate refusals (13/100 remain)
- NVFP4 quantization works best on Blackwell GPUs (RTX 5090/5080) with native FP4 tensor cores, but also works on older GPUs via software dequantization
License
This model is released under the Apache 2.0 License, following the base Qwen3-8B model license.
Acknowledgments
- Qwen for the Qwen3-8B model
- Heretic by p-e-w for the abliteration tool
- Black Forest Labs for Klein 9B
- llama.cpp for GGUF conversion