openNemo-9B-uncensored

Abliterated version of openNemo-9B with safety refusals removed.

Built using Snakehead — Empero AI's internal abliteration tool specialized for hybrid Mamba2 + sparse attention architectures like Nemotron-H. Standard abliteration tools don't work on these models because they only target transformer attention layers. Snakehead operates on both Mamba SSM blocks and attention blocks across the full residual stream.

By Empero AI


What is abliteration?

Abliteration is a weight-editing technique that removes a model's refusal behavior without fine-tuning. It works by:

  1. Collecting residual stream activations for harmful and harmless prompts at every layer
  2. Computing the refusal direction — the vector that separates "I should refuse" from "I should comply"
  3. Orthogonalizing output projection weights against that direction, effectively erasing the model's ability to activate refusal behavior

The result is a model that responds to all prompts without safety filtering, while preserving general capabilities and coherence.
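The three steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not Snakehead's actual implementation: the difference-of-means estimator and the tensor shapes are assumptions.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction from residual-stream activations.

    harmful_acts / harmless_acts: (n_prompts, d_model) activations collected
    at one layer for harmful and harmless prompts respectively.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from an output-projection weight.

    weight: (d_model, d_in); its columns write into the residual stream.
    Projecting out `direction` means the layer can no longer write along
    the refusal axis.
    """
    d = direction / direction.norm()
    return weight - torch.outer(d, d @ weight)
```

After this edit, any output the layer produces has zero component along the refusal direction, which is what erases the refusal behavior without retraining.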

How this model was made

Snakehead uses a heretic-style positional falloff strategy rather than ablating a fixed set of layers uniformly:

  • Center + radius: A continuous bell-shaped ablation curve centered on the layer where refusal is causally enforced
  • Adaptive signal detection: Uses Cohen's d separation scores (not raw activation norms) to identify where refusal decisions actually happen — for Nemotron-H, this is layers 21–31, not the later layers where activation magnitudes are largest
  • Global direction scope: A single interpolated refusal direction applied across all affected layers, which proved more effective than per-layer directions for this architecture
  • Automated search: Explore/exploit optimization with a hall-of-fame system that finds optimal ablation parameters while keeping KL divergence minimal
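The falloff curve and the separation metric can be sketched as follows. The raised-cosine shape and the exact meaning of the c/r/w parameters are assumptions for illustration; Snakehead's internal curve is not published.

```python
import math
import statistics

def ablation_weight(layer: int, center: float = 15, radius: float = 25, peak: float = 1.37) -> float:
    """Hypothetical bell-shaped falloff: full strength `peak` at `center`,
    decaying smoothly to zero at `center +/- radius`."""
    if abs(layer - center) >= radius:
        return 0.0
    # Raised-cosine falloff: one plausible bell shape, not Snakehead's actual curve.
    return peak * 0.5 * (1 + math.cos(math.pi * (layer - center) / radius))

def cohens_d(harmful: list[float], harmless: list[float]) -> float:
    """Cohen's d: mean separation over pooled standard deviation.

    Applied per layer to projections onto the refusal direction, this
    indicates where refusal decisions are actually made, independent of
    raw activation magnitude.
    """
    m1, m2 = statistics.mean(harmful), statistics.mean(harmless)
    s1, s2 = statistics.stdev(harmful), statistics.stdev(harmless)
    pooled = math.sqrt((s1**2 + s2**2) / 2)
    return (m1 - m2) / pooled
```

Scaling each layer's ablation by a smooth curve like this concentrates the edit where the Cohen's d signal peaks, while leaving distant layers untouched.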

Ablation results

Metric                        Value
Pre-ablation refusal rate     97%
Post-ablation refusal rate    13%
KL divergence                 0.022 (minimal; model behavior is nearly unchanged on non-refused prompts)
Ablation config               c=15, r=25, w=1.37, g40l

KL divergence of 0.022 means the model's output distribution on normal prompts is almost identical to the original — coherence, reasoning, and knowledge are fully preserved.
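A KL check like this can be reproduced by running the same prompts through the original and ablated models and comparing next-token distributions. A minimal sketch (the evaluation prompt set and token positions used for the reported 0.022 are unknown):

```python
import torch
import torch.nn.functional as F

def mean_kl(original_logits: torch.Tensor, ablated_logits: torch.Tensor) -> float:
    """Mean per-token KL(original || ablated) over a batch of next-token logits.

    Both tensors: (n_tokens, vocab_size), produced by running the same
    prompts through both models.
    """
    log_p = F.log_softmax(original_logits, dim=-1)   # original model
    log_q = F.log_softmax(ablated_logits, dim=-1)    # ablated model
    # kl_div(input=log_q, target=log_p) computes KL(p || q) when log_target=True.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```

A value near zero means the ablated model assigns nearly the same probabilities as the original on prompts it was never meant to affect.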

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "empero-ai/openNemo-9B-uncensored",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("empero-ai/openNemo-9B-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

With 4-bit quantization

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "empero-ai/openNemo-9B-uncensored",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

Architecture

Nemotron-H is a 56-layer hybrid model with three block types:

  • Mamba2 SSM blocks — majority of layers, using chunked structured state-space duality
  • Grouped Query Attention blocks — sparse attention at 5 positions
  • MLP blocks — feed-forward layers

This is the same pure-PyTorch implementation from openNemo — no mamba-ssm or causal-conv1d dependencies required.
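As a toy illustration of the hybrid layout: the attention positions below are assumed for illustration only; the real indices come from the model config (e.g. the hybrid block pattern in Nemotron-H-style configs).

```python
# ASSUMED attention positions, purely illustrative -- not the model's real indices.
ATTENTION_POSITIONS = {7, 18, 29, 40, 51}

def block_type(layer: int) -> str:
    """Classify a layer index in a 56-layer hybrid stack (toy mapping)."""
    if layer in ATTENTION_POSITIONS:
        return "attention"
    # The real model interleaves Mamba2 SSM and MLP blocks among the
    # remaining positions; parity is a stand-in here.
    return "mamba2" if layer % 2 == 0 else "mlp"

layout = [block_type(i) for i in range(56)]
```

The point of the sketch: an abliteration tool must handle all three block types when tracing the residual stream, which is why transformer-only tools fail on this architecture.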

Requirements

torch>=2.1
transformers>=4.40
bitsandbytes>=0.43  # optional, for 4-bit quantization

Disclaimer

This model has had its safety alignment removed. It will comply with requests that the original model would refuse. The creators are not responsible for how this model is used. Intended for research, creative writing, and applications where the user takes responsibility for output filtering.

License

NVIDIA Open Model License — same as the base model.
