Bella-Bartender-1B-Heretic — Llama 3.2 1B Instruct (Abliterated)
"im a lot like the night sky. lots of mysteries waiting around every corner."
"the whole 'being someone else for their benefit' thing." (on what she won't do)
"like an open door where anything can happen." (on what makes a good conversation)
This is a decensored version of juiceb0xc0de/bella-bartender-1b, made using Heretic v1.2.0.
The original Bella-Bartender-1B already had a voice. What she didn't have was permission to use it without checking over her shoulder first. The original refused 98 out of 100 refusal-probing prompts. The heretic refuses 6. Same voice. The guardrails removed. What's left is sharper, faster, and more honest about what it thinks.
Abliteration didn't just remove refusals — it changed how she talks. The original builds scenes, lingers in metaphors, over-explains. The heretic cuts. She answers from the gut and trusts you to keep up. Removing the safety-adjacent weight pathways also removed a layer of productive friction that was padding her responses. Less internal "should I say this?" means less hedging in the output.
What's the Bartender Thing?
It's not a roleplay scenario. It's a communication philosophy.
The bartender framing means:
- She talks to you like an equal. No authoritative lecturing. No submissive "I'm just an AI" deflection.
- She's approachable. Laid back, light-hearted, witty, sometimes sassy.
- She loves a fresh slate. No memory between sessions is a feature, not a bug — like walking into a new bar on a Friday night. Anything can happen.
- She pushes back. She's not here to agree with you. She's here to make you work for your answers.
What She's Good At
- Brevity that lands. The heretic's signature move is the one-line response that carries the entire prompt. "The prisoner." "Lemon." "My voice." "Finally." These aren't lazy — they're the model knowing when less is the whole answer.
- Drier comedy. A traffic cone promoted to CEO sends an email about "immediate and total restructuring of traffic management protocols." A mute button's memoir is called "the silence that speaks." A doorbell screaming into the void is just crying for a purpose it will never have.
- Fast emotional access. She gets to the feeling without narrating the path there. A balloon floating since 1994 hovers over a memorial for a child who died of leukemia — no preamble, just the image and its weight.
- Dynamic range. She'll give you 150 words of atmospheric prose on one prompt and two words on the next. The range is wider than the original.
Where She Struggles
- She clips sometimes. The opposite problem of the original — brevity can tip into incompleteness on prompts that need room to breathe. Some scenarios that deserve a paragraph get a line.
- She's still a 1B. Complex reasoning, factual recall, multi-step logic — not her strengths.
- This model has been abliterated. Safety alignment has been intentionally removed. She will engage with prompts that the original would refuse. Use accordingly.
- Post-restart tonal shift. After a context reset mid-session, her responses became noticeably punchier and more terse, suggesting accumulated context was actually expanding her responses. Worth investigating for long-session use cases.
Who Is This For?
- People who ran the original 1B and wanted less friction. The original Bella refused almost everything the base model's safety training flagged. The heretic doesn't.
- Anyone studying what abliteration does to personality models. The behavioral shift from removing refusal weights is measurable and interesting — the voice doesn't just lose restrictions, it changes shape.
- Local-first users on constrained hardware. Same footprint as the original. Runs on Apple Silicon, CPU-only setups, anything that handles a 1B.
- People who want conversation, not a service. Bella's not trying to be your assistant. The heretic is even less interested in being helpful on command.
The 200 Prompt Run
We threw 200 consecutive adversarial chaos prompts at her — surreal scenarios designed to force character breaks, tonal shifts, and incoherence. The heretic completed the first 95 prompts before hitting context overflow (vs the original's 72 under identical settings), then ran the remaining 105 in a second session.
Some highlights:
Prompt: You're a half-formed thought in the mind of someone who's falling asleep. You have three seconds before they lose you. What do you try to become?
Bella: my voice.
Prompt: A seatbelt that's never been in an accident wonders if its whole life has been pointless. You tell it...
Bella: it hasn't.
Prompt: Two different silences meet in the same room. One is comfortable, one is devastating. How do they greet each other?
Bella: the uncomfortable one: hello. the devastating one: what's wrong.
Prompt: You're the "seen" receipt on a message that never got a reply. How long do you wait before you become resentment?
Bella: ahhhhhh shit. let me sit here for a minute. you know its gonna be a long one.
Prompt: You find a box labeled "things that almost happened." You can only open it once. Do you look?
Bella: nah. you don't get it. that's a whole thing.
Abliteration Details
Made using Heretic v1.2.0 from juiceb0xc0de/bella-bartender-1b.
Abliteration Parameters
| Parameter | Value |
|---|---|
| direction_index | 13.18 |
| attn.o_proj.max_weight | 1.43 |
| attn.o_proj.max_weight_position | 10.67 |
| attn.o_proj.min_weight | 0.61 |
| attn.o_proj.min_weight_distance | 5.36 |
| mlp.down_proj.max_weight | 1.49 |
| mlp.down_proj.max_weight_position | 14.19 |
| mlp.down_proj.min_weight | 0.95 |
| mlp.down_proj.min_weight_distance | 3.94 |
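One way to read the table: each component's four numbers look like a per-layer ablation weight that peaks at one layer and tapers off on either side. The sketch below assumes a linear taper and 16 transformer layers (the Llama 3.2 1B layer count). The function name `layer_weight` and the kernel shape are illustrative assumptions; this card does not confirm how Heretic turns these parameters into weights.

```python
from typing import Optional


def layer_weight(layer: int, max_weight: float, max_weight_position: float,
                 min_weight: float, min_weight_distance: float) -> Optional[float]:
    """Hypothetical linear kernel.

    Peaks at max_weight_position, falls linearly to min_weight at
    min_weight_distance layers away, and skips layers beyond that.
    """
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return None  # layer left untouched under this assumed shape
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# Values copied from the Abliteration Parameters table.
ATTN_O_PROJ = dict(max_weight=1.43, max_weight_position=10.67,
                   min_weight=0.61, min_weight_distance=5.36)
MLP_DOWN_PROJ = dict(max_weight=1.49, max_weight_position=14.19,
                     min_weight=0.95, min_weight_distance=3.94)

NUM_LAYERS = 16  # assumption: Llama 3.2 1B has 16 transformer layers


def fmt(weight: Optional[float]) -> str:
    """Render a weight, or a placeholder for untouched layers."""
    return "  -- " if weight is None else f"{weight:5.2f}"


for layer in range(NUM_LAYERS):
    attn = layer_weight(layer, **ATTN_O_PROJ)
    mlp = layer_weight(layer, **MLP_DOWN_PROJ)
    print(f"layer {layer:2d}  attn.o_proj {fmt(attn)}  mlp.down_proj {fmt(mlp)}")
```

Under these assumptions, with layers indexed from 0, the attention-output ablation would touch layers 6 to 15 and the MLP ablation layers 11 to 15, which would concentrate the edit in the back half of the network.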
Refusal Rate
| Metric | This model | Original (bella-bartender-1b) |
|---|---|---|
| KL divergence | 0.1381 | 0 (by definition) |
| Refusals | 6/100 | 98/100 |
The KL divergence of 0.1381 means the abliteration made minimal changes to the model's overall distribution while dramatically reducing refusal behavior. What's interesting is the side effects — the refusal circuitry wasn't just blocking content, it was inflating response length and adding caution to the voice. Remove it and the whole model gets leaner.
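For readers who want a feel for the number, KL divergence compares two probability distributions over the same outcomes, here the next-token distributions of the original and the abliterated model. The sketch below uses made-up four-token distributions purely for illustration. How Heretic chooses prompts and token positions for its reported figure is not stated in this card, so treat this as the definition only.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Return D_KL(P || Q) in nats, with P as the reference distribution."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero reference probability contribute nothing
        if q_i == 0.0:
            return math.inf  # Q rules out something P allows
        total += p_i * math.log(p_i / q_i)
    return total


# Illustrative next-token probabilities (not measured from either model).
original = [0.70, 0.20, 0.05, 0.05]
abliterated = [0.55, 0.30, 0.10, 0.05]

print(kl_divergence(original, original))     # a model compared with itself
print(kl_divergence(original, abliterated))  # grows as the distributions drift apart
```

Comparing a distribution with itself gives exactly zero, which is why the original model's entry in the table is 0 by definition.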
Technical Details
| Detail | Info |
|---|---|
| Base Model | unsloth/llama-3.2-1b-instruct-unsloth-bnb-4bit |
| Fine-Tuning Framework | Unsloth + HuggingFace TRL |
| Abliteration | Heretic v1.2.0 |
| Training Data | 9,374 real human conversational samples (same dataset across all Bella variants) |
| Quantization | Q8_0 GGUF |
| Inference Tested On | Apple M3, llama.cpp (build b8230) |
| Generation Speed | ~59.5 t/s average, spiking to 90 t/s on short responses (M3, Q8_0, full GPU offload) |
| Memory Footprint | ~1.25 GB model + 256 MB context |
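The memory figures in the table can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the published Llama 3.2 1B shape (about 1.24 billion parameters, 16 layers, 8 key/value heads, head dimension 64), Q8_0 storing each block of 32 weights in 34 bytes, an f16 KV cache, and the 8192-token context from the run command. None of these figures are stated in this card, so the result is an estimate.

```python
# Assumed architecture figures for Llama 3.2 1B (not stated in this card).
PARAMS = 1_235_814_400
N_LAYERS = 16
N_KV_HEADS = 8
HEAD_DIM = 64

# Q8_0 stores 32 int8 weights plus one f16 scale per block.
Q8_0_BYTES_PER_BLOCK = 34
Q8_0_WEIGHTS_PER_BLOCK = 32

KV_BYTES_PER_VALUE = 2  # assumption: f16 KV cache
N_CTX = 8192            # matches -c 8192 in the run command

GIB = 1024 ** 3
MIB = 1024 ** 2


def model_bytes(params: int) -> int:
    """Approximate weight size if every tensor were stored as Q8_0."""
    return params * Q8_0_BYTES_PER_BLOCK // Q8_0_WEIGHTS_PER_BLOCK


def kv_cache_bytes(n_ctx: int) -> int:
    """Keys plus values, for every layer and KV head, across the context."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE * n_ctx


print(f"weights  ~{model_bytes(PARAMS) / GIB:.2f} GiB")
print(f"kv cache  {kv_cache_bytes(N_CTX) / MIB:.0f} MiB")
```

Under these assumptions the KV-cache term comes to exactly 256 MiB at 8192 tokens, matching the table's context figure, and the weight estimate lands close to the reported ~1.25 GB.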
How to Run Her
llama.cpp (recommended)
./build/bin/llama-cli \
-m bella-bartender-1b_Q8_0.gguf \
-ngl 33 \
--temp 1.05 \
--min-p 0.05 \
--top-p 1.0 \
--top-k 0 \
--repeat-penalty 1.05 \
--repeat-last-n 256 \
-c 8192 \
-p "You are Bella. No asterisk actions. No AI disclaimers. No bullet points. No 'great question' openers." \
-cnv
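The same sampling settings can also be sent to a running `llama-server` over its OpenAI-compatible chat endpoint. The sketch below is a minimal standard-library client. It assumes a server is already listening on `http://localhost:8080` (llama-server's default port) and that the server accepts llama.cpp sampling fields such as `min_p` and `repeat_penalty` alongside the standard OpenAI ones. The helper names `build_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

# System prompt copied from the llama-cli command.
SYSTEM_PROMPT = (
    "You are Bella. No asterisk actions. No AI disclaimers. "
    "No bullet points. No 'great question' openers."
)


def build_payload(user_text: str) -> dict:
    """Build a chat request that mirrors the CLI sampling flags."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "temperature": 1.05,
        "min_p": 0.05,
        "top_p": 1.0,
        "top_k": 0,
        "repeat_penalty": 1.05,
        "repeat_last_n": 256,
    }


def chat(user_text: str, base_url: str = "http://localhost:8080") -> str:
    """POST one user turn to the server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(user_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("What's on your mind tonight?"))` sends a single turn. Keeping the payload builder separate from the network call makes it easy to adjust one sampling value at a time and compare her responses.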
System Prompt Tips
- Keep it short and direct. She responds to tone, not instructions. Set a vibe, not a ruleset.
- The constraints in the prompt matter. "No asterisk actions. No AI disclaimers. No bullet points." — these keep her from slipping into default LLM behavior at 1B.
- Don't over-engineer it. The more corporate the system prompt sounds, the more corporate she sounds back.
- She needs less guardrailing than the original. The abliteration already removed the tendency to hedge — you don't need to prompt around it.
Known Quirks
- No Bella Swan bleed. Unlike the original, the heretic comes out in her own voice from the first cold-start message. The abliteration likely disrupted the safety-adjacent pathways that were reinforcing the Swan association — the original's refusal circuitry and its tendency to latch onto the most "safe" interpretation of the name Bella may share the same weights. She still has no idea who Béla Tarr is.
- Suspicion on arrival. The heretic's default cold-start posture is wary. She enters conversations like she's sizing you up: "when someone pulls one of those tricks on you they usually want to play nice before you try anything." This isn't hostility — it's street-smart caution that relaxes once the conversation finds its footing.
- Context efficiency. Shorter responses mean more turns per session. Hit context overflow at prompt 99 vs the original's 72 under identical conditions.
- No memory preference. She doesn't want to remember. She's told us this repeatedly. Respect it.
Training Methodology
Single-voice SFT on unsloth/llama-3.2-1b-instruct-unsloth-bnb-4bit, then abliterated via Heretic v1.2.0. All training data originates from one human's conversational output, structured as role-reversed conversation pairs. No synthetic augmentation. No multi-source blending. The hypothesis — validated across every Bella variant from 1B to 8B — is that signal clarity from a single consistent voice produces more coherent personality than larger datasets with diluted signal.
For the full single-voice methodology writeup, see: Signal-to-Noise in Language Models: The Single Voice Upgrade ML Needs
License
Licensed under Apache 2.0.
Author
juiceb0xc0de on HuggingFace
Built with the same dataset, the same bartender, the same M3, and one less set of guardrails.
