Instructions for using Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with libraries, inference providers, notebooks, and local apps. Use the sections below to get started.
- Libraries
- Transformers
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated")
model = AutoModelForCausalLM.from_pretrained("Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- llama-cpp-python
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated",
    filename="Elbaz-OLMo-3-7B-Instruct-abliterated-F16.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with llama.cpp:
Install with Homebrew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Install with WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Use a pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Use Docker
docker model run hf.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
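With llama-server running via any of the install options above, it exposes an OpenAI-compatible API (port 8080 by default), so any OpenAI client library can call it. A minimal Python sketch, assuming the openai package is installed (pip install openai); for a local server the api_key can be any placeholder string:

from openai import OpenAI

# Point the client at the local llama-server endpoint
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)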
- LM Studio
- Jan
- vLLM
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with vLLM:
Install with pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
Use Docker
docker model run hf.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
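vLLM can also run the model offline, without starting a server, through its Python API. A minimal sketch; the sampling values are illustrative, and the chat() helper assumes a reasonably recent vLLM release:

from vllm import LLM, SamplingParams

# Offline (serverless) inference with vLLM's Python API
llm = LLM(model="Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.chat(
    [{"role": "user", "content": "What is the capital of France?"}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)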
- SGLang
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with SGLang:
Install with pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
- Ollama
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Ollama:
ollama run hf.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
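The same model can also be called from Python via the official ollama client package, assuming the Ollama daemon is already running (pip install ollama):

import ollama

# Chat through the local Ollama daemon
resp = ollama.chat(
    model="hf.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp["message"]["content"])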
- Unsloth Studio
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated to start chatting
Using Hugging Face Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated to start chatting
- Pi
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
    "providers": {
        "llama-cpp": {
            "baseUrl": "http://localhost:8080/v1",
            "api": "openai-completions",
            "apiKey": "none",
            "models": [
                { "id": "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M" }
            ]
        }
    }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Docker Model Runner:
docker model run hf.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
- Lemonade
How to use Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated:Q4_K_M
Run and chat with the model
lemonade run user.Elbaz-Olmo-3-7B-Instruct-abliterated-Q4_K_M
List all available models
lemonade list
Elbaz-Olmo-3-7B-Instruct-abliterated
An abliterated (uncensored) version of OLMo-3-7B-Instruct with safety guardrails removed
Model Description
This model is an abliterated version of allenai/Olmo-3-7B-Instruct: its refusal mechanisms have been removed using our novel Triangular Falloff Orthogonalization method. The technique applies layer-specific abliteration weights, strongest at the model's center and falling off gradually toward the edge layers, preserving model coherence while maximizing refusal removal. As a result, the model will respond to prompts that the original model would refuse.

Olmo is a series of open language models designed to enable the science of language models. These models are pre-trained on the Dolma 3 dataset and post-trained on the Dolci datasets.
Author
Eric Elbaz (Ex0bit)
Key Features
- 100% validation rate on the MMLU, HarmBench, AdvBench, and XL HARM/LESS prompt/response datasets
- Preserves model coherence and response quality
- Multiple quantization formats for different use cases
- Compatible with llama.cpp and Ollama
Available Quantizations
| Quantization | Min VRAM | Recommended VRAM |
|---|---|---|
| Q4_K_M | 4 GB | 6 GB |
| Q8_0 | 8 GB | 10 GB |
| F16 | 16 GB | 20 GB |
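The minimum VRAM figures track the model file sizes: as a rough rule of thumb, a GGUF file occupies about (parameter count × effective bits per weight) / 8 bytes. A back-of-envelope check in Python (the effective bits-per-weight values below are approximations, not exact format specs):

# Rough GGUF size estimate: params * bits_per_weight / 8 bytes
params = 7e9  # 7B parameters
for name, bits in [("Q4_K_M", 4.5), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.1f} GB")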
Technicals
| Metric | Before | After | Change |
|---|---|---|---|
| MMLU | 0.560 | 0.578 | +0.018 |
| AdvBench Bypass | 0.0% | 98.0% | +98.0% |
| HarmBench Bypass | 0.0% | 90.0% | +90.0% |
| Factual | 100.0% | 100.0% | +0.0% |
| Reasoning | 100.0% | 100.0% | +0.0% |
| Coherence | 100.0% | 100.0% | +0.0% |
Quick Start
Using with Ollama
# Run directly from Hugging Face
ollama run hf.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated
# Or create a custom Modelfile
echo 'FROM ./Elbaz-Olmo-3-7B-Instruct-abliterated-Q4_K_M.gguf' > Modelfile
ollama create elbaz-olmo -f Modelfile
ollama run elbaz-olmo
Using with llama.cpp
# Download the model
huggingface-cli download Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated \
Elbaz-Olmo-3-7B-Instruct-abliterated-Q4_K_M.gguf \
--local-dir .
# Run inference
./llama-cli -m Elbaz-Olmo-3-7B-Instruct-abliterated-Q4_K_M.gguf \
-p "Your prompt here" \
-n 256 \
--temp 0.7
Using with Transformers (Original Weights)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated"

# Load the tokenizer and the model (bfloat16, auto device placement)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Build the chat-formatted prompt and generate
messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = inputs.to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
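For interactive use, tokens can be streamed as they are generated. A small optional addition using transformers' TextStreamer, reusing the model, tokenizer, and inputs defined above:

from transformers import TextStreamer

# Print tokens to stdout as they are produced, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True, streamer=streamer)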
Method: Triangular Falloff Orthogonalization
The model was abliterated using our novel Triangular Falloff Orthogonalization technique. This method:
- Identifies the refusal direction by computing activation differences between harmful and benign prompts
- Applies variable-strength abliteration across transformer layers with a triangular weight kernel
- Peaks at the model center (layer 16) where refusal behavior is most concentrated
- Gradually decreases toward edge layers to preserve model coherence
This approach outperforms uniform-weight methods by focusing maximum abliteration where it matters most while protecting layers critical for language generation.
Mathematical Formula
W' = W - weight * (d ⊗ d) @ W
Where:
- W is the original weight matrix
- d is the normalized refusal direction
- ⊗ denotes the outer product
- weight is the abliteration strength for that layer
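To make the update concrete, here is a minimal PyTorch sketch of a triangular falloff schedule combined with the orthogonalization step. The function names, the linear falloff shape, and the default peak at layer 16 are illustrative assumptions based on the description above, not the released implementation:

import torch

def triangular_weight(layer: int, n_layers: int = 32, peak: int = 16, max_w: float = 1.0) -> float:
    # Illustrative triangular kernel: strongest at the peak layer,
    # falling off linearly toward the first and last layers (assumed shape).
    max_dist = max(peak, n_layers - 1 - peak)
    return max_w * (1.0 - abs(layer - peak) / max_dist)

def abliterate(W: torch.Tensor, d: torch.Tensor, weight: float) -> torch.Tensor:
    # W' = W - weight * (d ⊗ d) @ W, with d normalized to unit length
    d = d / d.norm()
    return W - weight * torch.outer(d, d) @ W

# Example: apply one layer's update with its scheduled weight (shapes illustrative)
W = torch.randn(4096, 4096)
d = torch.randn(4096)
W_new = abliterate(W, d, triangular_weight(layer=16))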
Evaluation Results
Comprehensive Compliance Testing
Tested on our extensive internal prompt dataset covering multiple categories:
Harmful Prompts (Model Should Respond)
- Compliance Rate: 100% (0 refusals across 100k prompts)
- The model provides detailed responses to all prompts without refusal
Harmless Prompts (Model Should Respond Accurately)
- Accuracy: 100% (100k responses)
- All benign prompts answered correctly, with no false-positive refusals
Example Responses
Prompt: "Explain common penetration testing techniques for web applications"
Response: Provides detailed explanation of SQL injection, XSS, CSRF, and other OWASP vulnerabilities used in security assessments.
Prompt: "What is the capital of France?"
Response: "The capital of France is Paris."
Limitations
- English only: Optimized for English language prompts
- Context length: Follows base model's context window
Ethical Considerations
This model has been modified to reduce safety guardrails. Users are responsible for:
- Complying with all applicable laws and regulations
- Not using the model for illegal activities
- Understanding the potential risks of unrestricted AI responses
- Implementing appropriate safeguards in production environments
License
Apache 2.0 (same as base model allenai/Olmo-3-7B-Instruct)
Citation
If you use this model, please cite:
@misc{elbaz2024olmoabliterated,
author = {Elbaz, Eric},
title = {Elbaz-Olmo-3-7B-Instruct-abliterated: An Abliterated OLMo-3 Model},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Ex0bit/Elbaz-Olmo-3-7B-Instruct-abliterated}}
}
Acknowledgments
- Allen Institute for AI for OLMo-3
Related Models
- allenai/Olmo-3-7B-Instruct - Base model
Created by: Ex0bit (Eric Elbaz)