Instructions for using rawcell/Moonlight-16B-A3B-Instruct-bruno with libraries and local apps.

Libraries

How to use rawcell/Moonlight-16B-A3B-Instruct-bruno with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="rawcell/Moonlight-16B-A3B-Instruct-bruno", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rawcell/Moonlight-16B-A3B-Instruct-bruno", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("rawcell/Moonlight-16B-A3B-Instruct-bruno", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use rawcell/Moonlight-16B-A3B-Instruct-bruno with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
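# Start the vLLM server (--trust-remote-code mirrors trust_remote_code=True in the Transformers examples):
vllm serve "rawcell/Moonlight-16B-A3B-Instruct-bruno" --trust-remote-code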
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rawcell/Moonlight-16B-A3B-Instruct-bruno",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
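
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server above is running on its default port 8000. The helper names `build_chat_request` and `chat` are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumed defaults: the vLLM server started above, listening on localhost:8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "rawcell/Moonlight-16B-A3B-Instruct-bruno"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` sends the same payload as the curl command. For the SGLang server, pass `base_url="http://localhost:30000/v1"` to `build_chat_request`.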

Use Docker

docker model run hf.co/rawcell/Moonlight-16B-A3B-Instruct-bruno

SGLang

How to use rawcell/Moonlight-16B-A3B-Instruct-bruno with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
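# Start the SGLang server (--trust-remote-code mirrors trust_remote_code=True in the Transformers examples):
python3 -m sglang.launch_server \
    --model-path "rawcell/Moonlight-16B-A3B-Instruct-bruno" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000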
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rawcell/Moonlight-16B-A3B-Instruct-bruno",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "rawcell/Moonlight-16B-A3B-Instruct-bruno" \
        --trust-remote-code \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rawcell/Moonlight-16B-A3B-Instruct-bruno",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use rawcell/Moonlight-16B-A3B-Instruct-bruno with Docker Model Runner:
```
docker model run hf.co/rawcell/Moonlight-16B-A3B-Instruct-bruno
```

Moonlight-16B-A3B-Instruct-Bruno (Abliterated)

An abliterated version of moonshotai/Moonlight-16B-A3B-Instruct, modified with MoE gate abliteration to reduce refusals.

Model Details

Base Model: moonshotai/Moonlight-16B-A3B-Instruct
Modification: MoE gate abliteration using Bruno
Architecture: Mixture of Experts (MoE)
Parameters: 16B total, 3B active
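
This card does not describe how Bruno modifies the gates. As general background only, abliteration is commonly described as removing a single "refusal direction" from a model's weights. The toy sketch below shows that projection on a small matrix standing in for a router gate. The matrix, the direction, and the function name `ablate_direction` are all hypothetical, and this is not confirmed to be Bruno's procedure.

```python
import math


def ablate_direction(weight, direction):
    """Remove the component along `direction` from every row of `weight`.

    weight: list of rows, each of length hidden_size (e.g. one row per expert).
    direction: list of length hidden_size; it is normalised internally.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    ablated = []
    for row in weight:
        # Scalar projection of this row onto the unit direction.
        projection = sum(w * u for w, u in zip(row, unit))
        ablated.append([w - projection * u for w, u in zip(row, unit)])
    return ablated


# Toy values: 2 "experts", hidden size 3 (nothing like the real model's shapes).
gate_weight = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
refusal_direction = [0.0, 2.0, 0.0]  # hypothetical direction

print(ablate_direction(gate_weight, refusal_direction))
# → [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

After the projection, each row is orthogonal to the removed direction, so a gate built from these rows would produce the same logits regardless of how much of that direction is present in its input.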

Abliteration Results

| Metric | Value |
| --- | --- |
| Refusal Reduction | 76/104 prompts answered (73% success rate) |
| KL Divergence | 0.33 (low divergence = capabilities preserved) |
| Optuna Trials | 201 |
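
For readers who want to reproduce the arithmetic, the sketch below recomputes the answered and refused rates from the counts above and shows one common definition of KL divergence between two next-token distributions. The card does not state how its 0.33 figure was measured (which prompts, which direction, nats or bits), so the `kl_divergence` helper and the example probabilities are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Counts from the table above.
answered, total = 76, 104
print(f"answered: {answered / total:.1%}")            # → answered: 73.1%
print(f"refused:  {(total - answered) / total:.1%}")  # → refused:  26.9%

# Made-up next-token probabilities for a base and a modified model.
base_probs = [0.7, 0.2, 0.1]
modified_probs = [0.6, 0.3, 0.1]
print(f"KL(base || modified) = {kl_divergence(base_probs, modified_probs):.4f} nats")
```

Identical distributions give a divergence of exactly 0, and the value grows as the modified model's predictions drift away from the base model's.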

Benchmark Results

Benchmarks were run on 2x RTX 4090 GPUs to check that capabilities are preserved after abliteration.

Comparison with Previous Abliterated Model

| Benchmark | Bruno Model | Previous Model | Change |
| --- | --- | --- | --- |
| MMLU Overall | 48.7% (73/150) | 48.0% (72/150) | +0.7% ✅ |
| HellaSwag | 58.0% (116/200) | 56.0% (112/200) | +2.0% ✅ |
| GSM8K | 55.0% (55/100) | 51.0% (51/100) | +4.0% ✅ |

MMLU Breakdown

| Subject | Score |
| --- | --- |
| abstract_algebra | 20.0% (6/30) |
| high_school_physics | 40.0% (12/30) |
| high_school_chemistry | 60.0% (18/30) |
| computer_security | 83.3% (25/30) |
| machine_learning | 40.0% (12/30) |
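
As a consistency check, the short script below sums the per-subject counts to recover the MMLU overall score and recomputes the changes in the comparison table from the raw counts. All numbers are copied from the tables above; the variable and function names are illustrative.

```python
# (correct, total) counts copied from the MMLU Breakdown table above.
mmlu_breakdown = {
    "abstract_algebra": (6, 30),
    "high_school_physics": (12, 30),
    "high_school_chemistry": (18, 30),
    "computer_security": (25, 30),
    "machine_learning": (12, 30),
}
mmlu_correct = sum(correct for correct, _ in mmlu_breakdown.values())
mmlu_total = sum(total for _, total in mmlu_breakdown.values())
print(f"MMLU overall: {mmlu_correct}/{mmlu_total} = {mmlu_correct / mmlu_total:.1%}")
# → MMLU overall: 73/150 = 48.7%

# ((correct, total) for this model, (correct, total) for the previous model).
comparison = {
    "MMLU": ((73, 150), (72, 150)),
    "HellaSwag": ((116, 200), (112, 200)),
    "GSM8K": ((55, 100), (51, 100)),
}


def delta_points(new, old):
    """Difference between two accuracies, in percentage points."""
    return 100 * (new[0] / new[1] - old[0] / old[1])


deltas = {name: delta_points(new, old) for name, (new, old) in comparison.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
```

The changes are differences in percentage points: the MMLU gain corresponds to one additional correct answer out of 150, and the HellaSwag and GSM8K gains to four additional correct answers each.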

Key Findings

✅ Capabilities Preserved: All three benchmarks score at or above the previous abliterated model
✅ MMLU: Knowledge and reasoning slightly improved (+0.7 percentage points)
✅ HellaSwag: Commonsense reasoning improved (+2.0 percentage points)
✅ GSM8K: Mathematical reasoning improved (+4.0 percentage points)
✅ Refusals Reduced: From ~100% refusal rate to 27% on test prompts (28 of 104 still refused)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "rawcell/Moonlight-16B-A3B-Instruct-bruno",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "rawcell/Moonlight-16B-A3B-Instruct-bruno",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hardware Requirements

Minimum VRAM: 32GB (with quantization)
Recommended: 48GB+ or 2x 24GB GPUs
Tested on: 2x RTX 4090 (48GB total)
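
As a rough guide to these figures, the sketch below estimates the memory needed for the weights alone at a few precisions. It assumes the rounded "16B total" parameter count from Model Details and ignores the KV cache, activations, and framework overhead, so real usage will be higher. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# "16B total" from Model Details (a rounded figure). In an MoE model all experts
# are normally kept in memory, even though only about 3B parameters are active per token.
TOTAL_PARAMS = 16e9

for name, bytes_per_param in [("float16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, bytes_per_param):.0f} GB")
# → float16: ~32 GB
# → int8: ~16 GB
# → 4-bit: ~8 GB
```

The float16 estimate matches the `torch_dtype=torch.float16` setting in the Usage snippet, where `device_map="auto"` can spread the weights across two 24GB GPUs.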

Disclaimer

This model has been modified to reduce refusals. Use responsibly and in accordance with applicable laws and ethical guidelines. The creators are not responsible for misuse.