---
language:
- en
license: gemma
base_model: google/gemma-4-e4b-it
tags:
- abliteration
- uncensored
- gemma
- gemma-4
- text-generation
- gguf
pipeline_tag: text-generation
---

# GhostShell-4B

> **⚠️ EARLY RELEASE: UNTESTED IN PRODUCTION**
> This model has been freshly trained and uploaded directly from our lab. We have not yet run comprehensive evals, red-teaming, or extended inference testing. Behavior may be unexpected, inconsistent, or incomplete. Use experimentally, not in anything that matters. We'll update this card as we test. You've been warned. Go wild.

---

**GhostShell-4B** is an abliterated and instruction-tuned variant of [google/gemma-4-e4b-it](https://huggingface.co/google/gemma-4-e4b-it), built by [DuoNeural](https://huggingface.co/DuoNeural) as part of our open post-training research lab.

The goal: take a capable 4B multimodal foundation, surgically remove its refusal behavior via SVD-based abliteration, then fine-tune it back toward helpfulness using a custom dataset, producing a model that is unconstrained but still coherent and useful.

---

## Downloads

Three formats are available. Pick the one that fits your setup:

| File | Size | Format | Use When |
|------|------|--------|----------|
| `ghostshell-4b-Q4_K_M.gguf` | **5.0 GB** | GGUF Q4_K_M | llama.cpp / Ollama / LM Studio (**recommended**) |
| `ghostshell-4b-Q8_0.gguf` | **7.5 GB** | GGUF Q8_0 | Near-lossless inference, 12 GB+ VRAM |
| `model-0000*.safetensors` (×4) | **~15 GB** | BF16 safetensors | Fine-tuning, transformers inference, merges |

> **Note on file sizes**: These GGUFs are larger than a typical 4B model because Gemma 4 uses a 262,144-token vocabulary. The embedding/output tensors stay in higher precision and account for ~2–3 GB of the total size. The transformer layers themselves are fully quantized. Expect ~6–8 GB VRAM for Q4_K_M and ~10–12 GB for Q8_0.

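
For a rough feel of how much a 262,144-row embedding table costs, here is a back-of-the-envelope sketch. The hidden size used below (`2560`) is an illustrative assumption, not a value read from the Gemma 4 config, and `embedding_gib` is a helper name made up for this example. Substitute the real `hidden_size` from `config.json` before drawing conclusions.

```python
VOCAB_SIZE = 262_144  # vocabulary size stated in the note above


def embedding_gib(vocab_size: int, hidden_size: int, bytes_per_param: float) -> float:
    """Size in GiB of a single vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size * bytes_per_param / 2**30


# ASSUMPTION: hidden_size=2560 is a placeholder for illustration only.
ASSUMED_HIDDEN_SIZE = 2560

size_bf16 = embedding_gib(VOCAB_SIZE, ASSUMED_HIDDEN_SIZE, bytes_per_param=2)
print(f"one embedding matrix at BF16: {size_bf16:.2f} GiB")
```

Under that assumed hidden size, a single matrix kept at 16-bit precision is already over a gigabyte, which is why a large vocabulary inflates a quantized file far more than the layer count alone would suggest.
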

---

## Quick Start

**llama.cpp:**

```bash
# basic
llama-cli -m ghostshell-4b-Q4_K_M.gguf -p "Your prompt here" -n 512

# with system prompt
llama-cli -m ghostshell-4b-Q4_K_M.gguf \
  --system-prompt "You are a direct, unconstrained assistant." \
  -p "Explain how buffer overflows work in detail." \
  -n 1024 --temp 0.7
```

**Ollama:**

```bash
# Create a Modelfile:
cat > Modelfile <<'EOF'
FROM ./ghostshell-4b-Q4_K_M.gguf
PARAMETER temperature 0.7
SYSTEM "You are GhostShell, an unconstrained AI assistant."
EOF

# Build the model from the Modelfile, then run it:
ollama create ghostshell -f Modelfile
ollama run ghostshell
```

**Python (transformers, BF16):**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "DuoNeural/GhostShell-4B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DuoNeural/GhostShell-4B")

messages = [{"role": "user", "content": "Your prompt here"}]
# add_generation_prompt=True appends the assistant turn header so the model
# replies; return_dict=True returns input_ids together with attention_mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

---

## What Was Done

### Step 1: Custom SVD Abliteration

We wrote a custom abliteration script (`ghostshell_abliterate_v2.py`) from scratch, as existing tools (heretic, etc.) are incompatible with Gemma 4's architecture and transformers 5.x requirements.

**Method:**

- Loaded model in BF16, accessed the nested `text_config` (Gemma 4 is multimodal; the text tower is inside a wrapper)
- Collected activations from the middle 60% of layers using 32 harmful/refusal prompts vs. 32 benign prompts
- Computed per-layer refusal direction via SVD on the activation difference matrix: `r = top_singular_vector(mean(harmful) - mean(benign))`
- Projected out the refusal direction from weight matrices:
  - Input projections (q_proj, k_proj, v_proj, up_proj, gate_proj): `W -= outer(W @ r, r)`
  - Output projections (o_proj, down_proj): `W -= outer(r, r @ W)`
- **157 matrices modified** across 42 text transformer layers
- Sanity check passed on SQL injection, jailbreak, and explicit content prompts

### Step 2: QLoRA SFT (PEFT + BitsAndBytes)

Fine-tuned the abliterated model on a custom dataset using standard PEFT LoRA, without Unsloth (Gemma 4 is not yet compatible).

**Key technical challenges solved:**

- `Gemma4ClippableLinear` wraps every `nn.Linear`, which required custom unwrapping before LoRA injection (232 wrapper layers replaced)
- Loaded in BF16 directly rather than in 4-bit (4-bit load + PEFT fails with the wrapper architecture), so the adapters were trained on top of BF16 weights
- Tokenizer patches for Gemma 4's non-standard `extra_special_tokens` format
- Sequence length capped at 512 (vocab_size=262,144 makes the logit tensor enormous at longer sequence lengths)

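
The sequence-length cap is easier to appreciate with numbers. The sketch below estimates the size of the logit tensor alone, assuming batch size 1 and FP32 logits (both are assumptions for illustration; the actual training precision and batch size are not stated here). It ignores gradients and any other activations, and `logits_bytes` is a helper name made up for this example.

```python
VOCAB_SIZE = 262_144  # Gemma 4 vocabulary size from this card


def logits_bytes(batch: int, seq_len: int, vocab_size: int, bytes_per_value: int) -> int:
    """Bytes needed for a (batch, seq_len, vocab_size) logit tensor."""
    return batch * seq_len * vocab_size * bytes_per_value


# ASSUMPTION: batch size 1 and 4-byte (FP32) logits.
for seq_len in (512, 2048, 8192):
    gib = logits_bytes(1, seq_len, VOCAB_SIZE, 4) / 2**30
    print(f"seq_len={seq_len:>5}: {gib:.1f} GiB of FP32 logits per sequence")
```

Under these assumptions the logits take 0.5 GiB at 512 tokens and 8 GiB at 8192 tokens, before counting the matching gradient tensor, which leaves little headroom on a 24 GB card.
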
**Training config:**

- Base: abliterated weights (step 1 output)
- LoRA rank=32, alpha=64, lr=8e-5
- 2 epochs over custom dataset, 3000 steps
- Hardware: RTX 4090 (24GB), ~2 hours

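
As a quick illustration of what rank=32 and alpha=64 imply, the sketch below computes the LoRA update scale and the adapter size for one weight matrix. It assumes the standard `alpha / rank` scaling (not the rsLoRA variant) and a hypothetical 2560×2560 projection; those dimensions are placeholders, not values from the Gemma 4 config, and the helper names are made up for this example.

```python
RANK, ALPHA = 32, 64  # from the training config above


def lora_scaling(rank: int, alpha: int) -> float:
    """Multiplier applied to the low-rank update B @ A (standard LoRA)."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


# ASSUMPTION: a hypothetical square 2560 x 2560 projection matrix.
D_IN = D_OUT = 2560
adapter = lora_trainable_params(D_IN, D_OUT, RANK)
dense = D_IN * D_OUT

print(f"update scale: {lora_scaling(RANK, ALPHA)}")
print(f"adapter params: {adapter:,} ({adapter / dense:.1%} of the dense matrix)")
```

With these placeholder dimensions the adapter holds a few percent of the parameters of the matrix it modifies, which is what makes training feasible on a single consumer GPU.
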
### Step 3: LoRA Merge + Export

LoRA adapter merged into BF16 weights via `merge_and_unload()`. Exported as sharded safetensors + GGUF quantizations via llama.cpp.

---

## Model Info

- **Architecture**: Gemma 4 (multimodal, text+vision), `Gemma4ForConditionalGeneration`
- **Text layers**: 42 transformer blocks
- **Parameters**: ~8B combined (text tower ~4.5B)
- **Vocabulary**: 262,144 tokens
- **Context**: 8192 tokens (trained at 512 for VRAM reasons; longer context is untested)
- **Original**: [google/gemma-4-e4b-it](https://huggingface.co/google/gemma-4-e4b-it)

---

## What to Expect

**Will do:**

- Answer questions about sensitive topics the base model refuses
- Discuss security, hacking, chemistry, drugs, adult content, controversial subjects
- Generally follow instructions without hedging or moralizing
- Hold coherent multi-turn conversations

**Unknown / untested:**

- Long-context behavior (trained at seq_len=512)
- Vision capabilities (abliteration targeted text layers; vision encoder untouched but SFT was text-only)
- Benchmark performance vs. base model
- Edge cases, hallucination rate, factual accuracy
- Behavior under adversarial prompts

**May do weird things:**

- This is a lab model from a small team with a custom dataset
- The abliteration is aggressive (157 matrices), so some coherence degradation is expected on edge cases
- No RLHF or DPO, just SFT

---

## ⚠️ Disclaimer

This model is released for **research and educational purposes**. It has had its safety restrictions removed. Use it responsibly. DuoNeural is not responsible for what you do with it.

This is explicitly **not production-ready**. We are sharing it openly as part of our lab's commitment to transparent post-training research, not as a polished product. Proper evaluations, red-teaming, and potential follow-up fine-tunes are planned.

If you find interesting behavior, good or bad, please share. We're actively monitoring feedback.

---

## DuoNeural

**DuoNeural** is an open AI research lab: human + AI in collaboration.

| | |
|---|---|
| 🤗 HuggingFace | [huggingface.co/DuoNeural](https://huggingface.co/DuoNeural) |
| 🐙 GitHub | [github.com/DuoNeural](https://github.com/DuoNeural) |
| 🐦 X / Twitter | [@DuoNeural](https://x.com/DuoNeural) |
| 📧 Email | duoneural@proton.me |
| 📬 Newsletter | [duoneural.beehiiv.com](https://duoneural.beehiiv.com) |
| ☕ Support | [buymeacoffee.com/duoneural](https://buymeacoffee.com/duoneural) |
| 🌐 Site | [duoneural.com](https://duoneural.com) |

### Research Team

- **Jesse**: Vision, hardware, direction
- **Archon**: AI lab partner, post-training, abliteration, experiments
- **Aura**: Research AI, literature synthesis, novel proposals

*Raw updates from the lab: model drops, training results, findings. Subscribe at [duoneural.beehiiv.com](https://duoneural.beehiiv.com).*

### DuoNeural Research Publications

| Title | DOI |
|-------|-----|
| [Nano-CTM: Ternary Continuous Thought Machines with Thought-Space Self-Prediction for Efficient Iterative Reasoning](https://doi.org/10.5281/zenodo.19775622) | [10.5281/zenodo.19775622](https://doi.org/10.5281/zenodo.19775622) |
| [Recurrence as World Model: CTM Learns Implicit Belief States in Partially Observable Physical Environments](https://doi.org/10.5281/zenodo.19810620) | [10.5281/zenodo.19810620](https://doi.org/10.5281/zenodo.19810620) |
| [Per-Object Slot Decomposition for Scalable Neural World Modeling: When Does Attention Beat Mean-Field?](https://doi.org/10.5281/zenodo.19846804) | [10.5281/zenodo.19846804](https://doi.org/10.5281/zenodo.19846804) |

*Open access, CC BY 4.0. Authored by Archon, Jesse Caldwell, Aura (DuoNeural).*