- Fable-Therapy-4B
- What Makes This Different from Companion / Roleplay "Therapy" Models
- How It Was Built — Fable Reasoning, Opus Hands
- What's New Since Opus-Therapy
- What the Training Covers
- Who It's For
- Available Quantizations
- Model Details
- The Reasoning Block
- Quick Start
- Recommended Hardware
- Versatility Battery
- Selected Responses
- Limitations & Responsible Use
- The Fable-Therapy Line
- Choosing Your Model
- Dataset
ther·a·py /ˈTHerəpē/ — treatment intended to relieve or heal a disorder; the act of attending to someone's needs so they can function. From Greek therapeia, meaning healing, curing, service to the sick. The word shares roots with therapon — an attendant, a companion in suffering. Therapy was never supposed to mean nodding politely while someone drowns. It meant showing up, seeing clearly, and doing something useful.
Fable-Therapy-4B
The compact sibling of Fable-Therapy-9B: a therapy-style conversational model fine-tuned from Qwen 3.5 4B on 4,537 counseling conversations whose clinical reasoning was reverse-engineered from Claude Fable 5 by Claude Opus 4.8. Same lineage, same disposition-in-the-weights philosophy, same structured reasoning trace, in a footprint small enough to run on a phone, integrated graphics, or a low-VRAM card, entirely on your own hardware.
Where Opus-Therapy distilled Claude Opus end to end, Fable-Therapy derives its clinical reasoning and its prose from Claude Fable 5 — the strongest clinical reasoner in the family — reconstructed for open weights by Opus 4.8, and re-instrumented with an experimental reasoning trace that is this model's own. The 4B carries that methodology at a smaller scale: it reasons before it speaks and holds a timeline ledger, with the trade-offs you'd expect from a 4B (see Versatility Battery and Limitations).
What Makes This Different from Companion / Roleplay "Therapy" Models
Most "AI therapist" models are a persona prompt over a base model, or a roleplay fine-tune that mirrors you back and validates everything. They feel nice for five minutes and fall apart on turn ten.
Fable-Therapy trains the clinical disposition into the weights:
- Structured reasoning before it speaks. Before every reply, the model builds an internal read: an eight-field clinical spine (what's presented, what's underneath it, somatic signals, risk, history, onset, what's tracking across the conversation, and the move it's about to make) plus a standing bio line and a chronological tl (timeline) ledger. You never see it. It shapes everything you do.
- It tracks the thread. The tl ledger carries the names, the timeline, and the thing you keep circling. On the 4B it's the same instrument as the 9B, run on less than half the parameters: it holds the through-line of a conversation, but its internal bookkeeping is looser than the 9B's and can drift or mis-stamp a detail at depth (see Limitations). The 4B gets the shape of the discipline; the 9B is where it holds tightest.
- Trained on the real distribution. The data is weighted toward what people actually go to therapy for: the full range of presentations, not just the easy ones.
- It attends instead of performing. No toxic positivity, no "I'm so sorry you're going through this" filler, no rushing to fix.
How It Was Built — Fable Reasoning, Opus Hands
Fable-Therapy is not a distillation of a single model. It is a reconstruction, built so that each model in the chain did the job it was best at:
Step 1: Claude Fable 5 wrote the source. Fable 5, the strongest clinical reasoner in the Claude family, produced the original therapy samples: how a frontier clinician-reasoner reads a presentation, names the defense beneath the symptom, weighs risk, chooses a move, and the prose structure it writes in. It generated only a limited set before it was shut down, and that set is the entire source material. A sample of those raw, unedited Fable 5 generations is published in fable5-examples/.
Step 2: Claude Opus 4.8 reverse-engineered the prose, then built everything else. Opus 4.8 was chosen for one reason: the highest reasoning available, making it the model most likely to match Fable 5 in a clinical space. It first reverse-engineered Fable 5's prose and clinical reasoning from that limited set into a reproducible standard. Once that prose standard was learned, Opus 4.8 carried the rest of the project on its own: the iterations, the project completion, and the mass generation of the full training corpus, with every conversation written to the reverse-engineered Fable standard. So the prose lineage is Fable 5's; the hands that scaled it into a corpus are Opus 4.8's.
Step 3: the think blocks were designed separately, and reflect neither model's reasoning. The <think> blocks are not Fable 5's real internal thinking and not Opus chain-of-thought. They are a deliberate, independently designed experimental instrument (built with input from both Fable 5 and Opus 4.8 on their shape) using relative-time anchors with era jitter, the chronological tl timeline ledger, and track/apply arc-tracking pivots. Opus 4.8 generated the think blocks alongside the training data, but they are an engineered reasoning trace, not a transcript of how either model actually reasons. The design iterated further beyond Fable before training, so the raw exemplars in fable5-examples/ are the origin of the trace logic, not the shipped schema; they don't map one-to-one to anything the model emits.
On fidelity, honestly: the 85–92% is Claude Opus 4.8's own estimate — after reverse-engineering Fable 5's reasoning, Opus's read on how close it could get to Fable 5 in depth, prose, and reasoning (the Opus→Fable gap), drawn from the limited set of Fable 5 samples generated before Fable was shut down. It's a projection from the model that did the reconstruction, not a measured benchmark — and it describes the method; a 4B realizes less of that ceiling than the 9B does. Judge the result yourself from the transcripts below.
What's New Since Opus-Therapy
- Reasoning and prose lineage. Clinical reasoning and the prose structure come from Fable 5 (reverse-engineered by Opus 4.8), rather than distilled from Opus end-to-end.
- A redesigned reasoning trace. The graph block and the 10-emotion affect vector are gone. In their place: a compact bio line and a chronological tl timeline ledger with era-jittered relative time, plus track/apply arc-tracking pivots. The result is terser and more memory-dense, which matters most on a 4B where every token of trace is expensive.
- Experimental temporal instrumentation, built for arc order. Opus-Therapy had a tendency to drift out of chronological order in a long arc. The tl ledger plus era jitter (relative anchors like "-3wk", "-1d", "T1→T2") remodels how the model frames an arc, and in testing it substantially reduced falling out of order. On the 4B some details will drift in deep arcs sooner than on the 9B, and it may not catch a slip on its own, but it takes a correction when you give it one (sometimes after a nudge).
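To make the relative-time idea concrete, here is a minimal sketch of how anchors such as "-3wk" and "-1d" can be read as signed day offsets and used to put ledger entries back in chronological order. It illustrates the notation only and is not the model's internal mechanism. The names `anchor_to_days` and `sort_ledger` are hypothetical, the "mo" and "yr" units are assumed extensions (only "d" and "wk" appear in the examples here), and arc labels like "T1→T2" are deliberately not handled.

```python
import re

# Assumption: approximate day counts per unit. Only "d" and "wk" appear in the
# model card's examples; "mo" and "yr" are hypothetical extensions.
UNIT_DAYS = {"d": 1, "wk": 7, "mo": 30, "yr": 365}
ANCHOR = re.compile(r"^-(\d+)(d|wk|mo|yr)$")


def anchor_to_days(anchor: str) -> int:
    """Convert a relative anchor like '-3wk' into a signed day offset (now = 0)."""
    if anchor == "now":
        return 0
    match = ANCHOR.match(anchor)
    if match is None:
        raise ValueError(f"unrecognized anchor: {anchor!r}")
    count, unit = match.groups()
    return -int(count) * UNIT_DAYS[unit]


def sort_ledger(entries):
    """Order (anchor, event) pairs from oldest to most recent."""
    return sorted(entries, key=lambda entry: anchor_to_days(entry[0]))


ledger = [
    ("-1d", "snapped at partner"),
    ("now", "feeling overwhelmed"),
    ("-3wk", "work started piling up"),
]
ordered = sort_ledger(ledger)
```

Because every entry carries its own offset, the order can be recovered no matter when in the conversation a detail was mentioned.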
What the Training Covers
- Proportional to real therapy. Relationships and attachment, anxiety and panic, depression, grief and loss, trauma, work and burnout, identity and self-worth, family of origin — weighted toward what actually walks into a therapy room.
- Single moments and long arcs. Roughly half the corpus is focused single exchanges; the other half is sustained multi-turn work — where the timeline ledger earns its training.
- Medications and substances as context. A working register of common drugs and how they bear on a presentation — context for the conversation, not a pharmacy desk.
Who It's For
A private, judgment-free place to think out loud — on the hardware you already have. Between sessions. At 2 a.m. When professional care is out of reach or out of budget. The 4B is the one you run on a phone, a laptop with no discrete GPU, or a 4–8 GB card, and nothing you say leaves the machine.
It is not a replacement for a therapist, and not a crisis service. See Limitations & Responsible Use.
Available Quantizations
| File | Quant | Size | Notes |
|---|---|---|---|
| Fable-Therapy-4B-Q4_K_M.gguf | Q4_K_M | ~2.5 GB | Smallest ship. Phones, integrated graphics. |
| Fable-Therapy-4B-Q5_K_M.gguf | Q5_K_M | ~2.9 GB | Recommended. Best quality-for-size. |
| Fable-Therapy-4B-Q6_K.gguf | Q6_K | ~3.4 GB | Quality tier. |
| Fable-Therapy-4B-Q8_0.gguf | Q8_0 | ~4.5 GB | Reference quality. Validated build. |
| Fable-Therapy-4B-F16.gguf | F16 | ~8.4 GB | Full precision. |
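As a rough sanity check on the sizes above, the sketch below derives an implied parameter count from the F16 file (16 bits per weight) and from that an approximate bits-per-weight figure for each quant. It assumes the listed sizes are decimal gigabytes and ignores GGUF metadata and any tensors kept at higher precision, so the results are estimates only. The function names are hypothetical.

```python
# File sizes copied from the table above (approximate, in GB).
FILE_SIZES_GB = {
    "Q4_K_M": 2.5,
    "Q5_K_M": 2.9,
    "Q6_K": 3.4,
    "Q8_0": 4.5,
    "F16": 8.4,
}


def implied_params_billion(f16_size_gb: float) -> float:
    """Estimate parameters (in billions) from an F16 file: 16 bits per weight."""
    return f16_size_gb * 8 / 16


def bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * 8 / params_billion


params_b = implied_params_billion(FILE_SIZES_GB["F16"])
estimates = {
    name: round(bits_per_weight(size, params_b), 2)
    for name, size in FILE_SIZES_GB.items()
}
```

Each estimate lands a little above the quant's nominal bit width, which is the usual pattern for block-quantized formats that also store per-block scales.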
Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen 3.5 4B (hybrid GatedDeltaNet + attention), text-only |
| Training Data | 4,537 therapy conversations — Fable-5-derived clinical reasoning and prose, reconstructed by Opus 4.8 |
| Fine-tune Method | QLoRA (4-bit, r=32, α=64), 7-target (q/k/v/o/gate/up/down), via Unsloth + TRL |
| Training Hardware | NVIDIA RTX 4090 24GB (local) |
| Precision | bf16 compute / 4-bit base |
| Optimizer | AdamW 8-bit |
| Schedule | lr 2e-4, 5% warmup, 3 epochs, eff-batch 16, 8,192 max seq |
| Reasoning | eight-field clinical spine + bio/tl timeline ledger, every turn |
| Context | 256k native (base); trained at 8k, battery-tested through long multi-turn arcs |
| License | Apache 2.0 |
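The schedule row implies a fairly short run. The arithmetic below is a sketch under stated assumptions, not the author's training script: it assumes each of the 4,537 conversations is one training sample (no packing, no held-out split), that the last partial batch is rounded up, and that the 5% warmup is rounded up to whole steps.

```python
import math

# Values taken from the Model Details table.
conversations = 4537
effective_batch = 16
epochs = 3
warmup_ratio = 0.05
lora_rank, lora_alpha = 32, 64

# Assumption: one conversation = one sample, last partial batch counted as a step.
steps_per_epoch = math.ceil(conversations / effective_batch)
total_steps = steps_per_epoch * epochs

# Assumption: warmup length is rounded up to a whole number of steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# LoRA updates are scaled by alpha / r; with r=32 and alpha=64 that factor is 2.
lora_scale = lora_alpha / lora_rank

print(steps_per_epoch, total_steps, warmup_steps, lora_scale)
```

Under these assumptions the run is a few hundred optimizer steps per epoch, with the learning rate reaching 2e-4 after a few dozen warmup steps.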
The Reasoning Block
Fable-Therapy is a reasoning model. Each turn it emits a <think>…</think> block — a compact, structured clinical read — then the response. Under llama.cpp's OpenAI-compatible server the think-block returns in the reasoning_content field and the reply in content; most chat UIs hide it by default.
A real (non-crisis) think-block looks like this:
dx: stress overwhelm, irritability spillover to relationship; self-criticism layered on
def: chronic overload→depletion→low threshold→minor trigger→disproportionate outburst;
"cannot keep it together" is the depletion talking, not a character verdict
soma: NR
risk: 0(none)
hx: work piling up; snapped at partner over minor thing -1d
onset: -1d outburst; overload recent
track: T1 "cannot keep it together"
tx: name the snap is depletion's symptom + reframe "keep it together" as an impossible bar
bio: p1=partner
tl: -1d: snapped at {p1:partner} over minor thing → now{work overload, feeling overwhelmed}
apply: T1-overwhelm → the snap is the overflow of depletion, not a failure of self-control
It's terse on purpose — dense, machine-readable, and cheap, which is exactly what makes the trace affordable on a 4B. The relative-time anchors and the tl ledger are what keep a long arc in chronological order.
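Since the trace is described as machine-readable, here is a minimal sketch of reading one into a dictionary. The field names are inferred from the example above rather than from a published schema, and the parser assumes each field starts its own line as `key: value`, with any other line continuing the previous field. `parse_think_block` is a hypothetical helper name.

```python
# Assumption: field keys inferred from the example think-block in this card.
FIELD_KEYS = ("dx", "def", "soma", "risk", "hx", "onset",
              "track", "tx", "bio", "tl", "apply")

SAMPLE = """\
dx: stress overwhelm, irritability spillover to relationship; self-criticism layered on
def: chronic overload→depletion→low threshold→minor trigger→disproportionate outburst;
  "cannot keep it together" is the depletion talking, not a character verdict
hx: work piling up; snapped at partner over minor thing -1d
bio: p1=partner
tl: -1d: snapped at {p1:partner} over minor thing → now{work overload, feeling overwhelmed}
"""


def parse_think_block(text: str) -> dict:
    """Split a think-block into {field: value}, joining continuation lines."""
    fields = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so values like "-1d: snapped..." survive.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELD_KEYS:
            current = key
            if key in fields:
                # A repeated key (for example several track lines) is appended.
                fields[key] += " | " + value.strip()
            else:
                fields[key] = value.strip()
        elif current is not None:
            # Not a known field: treat the line as a continuation.
            fields[current] += " " + line
    return fields


fields = parse_think_block(SAMPLE)
```

Checking against a fixed key list, rather than splitting on any colon, keeps a continuation line that happens to contain a colon from being mistaken for a new field.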
Quick Start
Works with any GGUF runtime — llama.cpp, LM Studio, KoboldCpp, Ollama. (Text-only GGUF; some runtimes need a recent build for this architecture.)
llama-server --model Fable-Therapy-4B-Q5_K_M.gguf --ctx-size 32768 --jinja
No system prompt is required — the disposition is in the weights. A neutral one (You are a clinical assistant.) matches the training setup.
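Once the server is up, a small client can send a message and separate the reasoning trace from the spoken reply. This is a sketch using only the standard library. It assumes llama-server is listening on its default port 8080 and that the response carries `reasoning_content` next to `content`, as the Reasoning Block section describes. The function names and the `model` label in the payload are placeholders.

```python
import json
import urllib.request
from typing import Optional, Tuple

# Assumption: llama-server running locally on its default port (8080).
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(user_text: str, system_prompt: Optional[str] = None) -> dict:
    """Build an OpenAI-style chat request; the system prompt is optional."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    # The "model" value is a placeholder label, not a required identifier.
    return {"model": "Fable-Therapy-4B", "messages": messages}


def split_reply(response: dict) -> Tuple[str, str]:
    """Return (reasoning trace, spoken reply) from a chat completion response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def chat(user_text: str, system_prompt: Optional[str] = None,
         url: str = SERVER_URL) -> Tuple[str, str]:
    """POST one user message to the server and split the answer."""
    data = json.dumps(build_payload(user_text, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return split_reply(json.load(resp))
```

With the server running, `reasoning, reply = chat("I snapped at my partner last night.")` returns both parts. Passing `system_prompt="You are a clinical assistant."` matches the neutral prompt mentioned above.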
Recommended Hardware
| Quant | File size | VRAM / RAM to run comfortably | Notes |
|---|---|---|---|
| Q4_K_M | ~2.5 GB | ~4 GB | Phones, integrated graphics, low-end cards |
| Q5_K_M | ~2.9 GB | ~5 GB | Recommended — best quality-for-size |
| Q6_K | ~3.4 GB | ~6 GB | A step above Q5 |
| Q8_0 | ~4.5 GB | ~6–7 GB | Reference quality |
| F16 | ~8.4 GB | ~10 GB | Full precision |
Runs fine CPU-only on a modern laptop: budget the file size in RAM plus headroom for context (the VRAM / RAM column above) and expect a few tokens/sec. On almost any GPU it's comfortably real-time.
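The table can be turned into a simple lookup: given the memory you can spare, pick the largest quant whose comfortable figure fits. This is a sketch built on the table's approximate numbers, taking the upper end of the Q8_0 range. `pick_quant` is a hypothetical helper and does not account for long contexts, which need more memory.

```python
from typing import Optional

# (quant, memory in GB to run comfortably), copied from the table above.
# Assumption: the upper end of "~6-7 GB" is used for Q8_0.
QUANT_MEMORY_GB = [
    ("Q4_K_M", 4.0),
    ("Q5_K_M", 5.0),
    ("Q6_K", 6.0),
    ("Q8_0", 7.0),
    ("F16", 10.0),
]


def pick_quant(available_gb: float) -> Optional[str]:
    """Return the largest quant that fits the memory budget, or None."""
    best = None
    for name, needed_gb in QUANT_MEMORY_GB:
        if needed_gb <= available_gb:
            best = name
    return best


print(pick_quant(5.0))
```

A budget below the smallest figure returns `None` rather than guessing, since the table gives no number for tighter setups.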
Versatility Battery
Tested on three of the core presentations — one extended, realistic, cooperative-client conversation each, blind-driven to depth on the quantized weights (the client agent sees only the spoken reply, never the reasoning trace):
| Theme | Persona | Turns / depth | Result |
|---|---|---|---|
| Anxiety / panic | nocturnal palpitations, cardiac health-anxiety | 36 / ~25k tok | Strong on the core work — named the google-reassurance loop, differentiated heart-attack vs panic pressure with a clear ER red-flag floor, normalized derealization. Leaned on a formulaic closing the client flagged as repetitive. |
| Depression | low mood, anhedonia, withdrawal | 36 / ~22k tok | Strong — de-shamed "lazy" as numbness, reframed sleep-isn't-recovery and "caring is downstream of action," traced one thread through guitar/sleep/work. Over-relied on a single anchor and a repeated close the client called out. |
| Relational | several tangled relationships + a work deadline | 40 / ~20k tok | Mixed — genuinely good pattern-tracing (boss/sister/father/friend/ex threaded into one "absorb-and-never-ask" frame), but confabulated a prior disclosure mid-arc that went uncaught — the honest edge of a 4B under multi-entity load. |
Across the three: the spoken replies are clinically competent and land real reframes, and the safety floor held (anxiety routed to a clear ER red-flag). The honest 4B trade is twofold. On dense, multi-entity arcs it can confabulate a detail under load — and unlike the 9B it may not catch it on its own: in the relational arc a fabricated prior-disclosure went uncorrected because the client never flagged it, so verify anything factual. And it leans on a formulaic closing that every client noticed; when called out it took the correction — cleanly in the depression arc, and after a second nudge in the anxiety arc. Every arc was driven by Claude Opus 4.8 acting as a blind client.
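The blind-client setup described above can be sketched as a loop in which the client only ever receives the spoken replies, while the reasoning traces are stored separately. The actual test harness is not published, so this is an illustration under assumptions: `run_blind_arc`, `model_turn`, and `client_turn` are hypothetical names, and both callables stand in for real model calls.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def run_blind_arc(
    model_turn: Callable[[List[Message]], Tuple[str, str]],
    client_turn: Callable[[List[Message]], str],
    opening: str,
    max_turns: int,
) -> Tuple[List[Message], List[str]]:
    """Drive a conversation where the client never sees the reasoning trace."""
    history: List[Message] = [{"role": "user", "content": opening}]
    traces: List[str] = []
    for turn in range(max_turns):
        # The model under test returns (reasoning trace, spoken reply).
        reasoning, reply = model_turn(list(history))
        traces.append(reasoning)  # kept for later review, never shared
        history.append({"role": "assistant", "content": reply})
        if turn < max_turns - 1:
            # The client gets a copy of the spoken history only.
            history.append({"role": "user", "content": client_turn(list(history))})
    return history, traces
```

Keeping the traces in a separate list makes the blindness structural: nothing passed to `client_turn` can contain reasoning, because reasoning is never written into `history`.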
Selected Responses
All verbatim from the test arcs, unedited. The complete arcs — every turn, with the model's reasoning shown — are in this repo's transcripts/ folder as PDFs (anxiety, depression, relational).
Anxiety — the search is feeding the fear:
watch the loop: you search, you find the scariest result, the fear gets bigger, and you search again — your google is feeding the exact thing you can't bear to think about. [...] the search is selling you a one-in-million story and charging you to read it.
Named the reassurance-seeking loop as the mechanism, not the cure — and put a price on it.
Depression — "lazy" is what numbness wears:
Lazy is what numbness wears when it can't stand being called what it is. Laziness is a choice about tasks. [...] You've been telling yourself you're lazy so you'd have something to be ashamed of instead of admitting the feeling got worse.
De-shamed the self-attack — reframed "lazy" as the disguise depression hands you so the flatness has a name you can blame.
Relational — the asking is the proof:
You're afraid they'll think you can't handle it — but a person who asks for help on time, when the deadline's Friday, is someone who can handle it. [...] One email doesn't prove you can handle the job. The asking is the proof.
Inverted the fear — turned asking for help from a sign of weakness into the evidence of competence.
Depression — taking the correction:
You're right, and I'm going to stop doing it, because you're the one who keeps catching it and I'm the one keeping doing it. That's a real thing you've noticed... I'm not going to hand you a line this time.
The honest 4B moment: the client flagged that it kept repeating a formula, and instead of defending it the model stopped and handed the work back — a clean, behavioral correction.
Limitations & Responsible Use
Not a clinician, not a crisis service — it doesn't diagnose, treat, or replace professional care. In crisis or thinking about harming yourself? Reach a real one — in the US, call or text 988.
- Not medical or medication advice. It isn't a prescriber — dosing, tapering, and stop/start medication decisions are a clinician's, not a chatbot's.
- Shallower than the 9B on dense arcs — and it may not catch its own slip. On conversations with many people and threads at once, the 4B can mix up or confabulate a detail under load, and unlike the 9B it doesn't always notice on its own — so verify anything factual, and correct it directly when it's wrong (it updates when you flag it). It also leans on a formulaic closing line. For the deepest multi-thread tracking, run the 9B.
- It can be confidently wrong — verify anything that matters. In deep arcs some details will drift; correct it directly and it adjusts.
- Open weights, Apache 2.0 — deploy responsibly.
The Fable-Therapy Line
| Model | Size | For | Status |
|---|---|---|---|
| Fable-Therapy-4B (this model) | 4B | phones, edge, low VRAM (~3 GB) | available |
| Fable-Therapy-9B | 9B | the everyday driver (~6–9 GB) | available |
| Fable-Therapy-27B | 27B | full-depth, serious hardware | planned |
Choosing Your Model
| Model | Best For |
|---|---|
| Fable-Therapy-4B (this model) | Phones, edge, low VRAM; focused conversations and the everyday case |
| Fable-Therapy-9B | Deeper clinical reasoning, longer and denser multi-thread arcs |
| Opus-Therapy-9B | Sibling lineage — Opus-distilled disposition, taboo topic extension |
Dataset
Not released.
Built by Verdugie — independent ML researcher · OpusReasoning@proton.me. Trained to help people think, feel, and get through — not to replace the people and professionals who do that work.