# Behemoth-T1-123B-GPTQ

The party where literary craft meets unhinged creative writing, now in 4-bit.
## The pitch

This is the W4A16 GPTQ-quantized version of tacodevs/Behemoth-T1-123B, a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.

GPTQ is the variant for single-GPU users: the quantized model fits on a single 80 GB or 96 GB GPU. Quality is roughly 95-97% of the BF16 reference, virtually indistinguishable from full precision in normal use.

For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.
## This variant

| | |
|---|---|
| Base | tacodevs/Behemoth-T1-123B (BF16) |
| Quantization | GPTQ W4A16 (4-bit weights, 16-bit activations) |
| Group size | 128 |
| Calibration | 256 in-distribution samples from tacodevs/rp-opus-4.6-x1000 |
| Quantizer | llm-compressor GPTQModifier |
| Size on disk | ~62 GB (4× smaller than BF16) |
| VRAM (8k ctx) | ~62 GB, fits on 1× 80 GB or 1× 96 GB GPU |
| Quality vs BF16 | ~95-97% (literary thinking pattern preserved) |
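To make the W4A16 / group-size-128 row concrete, here is a toy NumPy sketch of group-wise 4-bit weight quantization. This is *not* the actual llm-compressor GPTQ algorithm (which also applies Hessian-based error compensation during rounding), and the function names are illustrative; it only shows the storage scheme: each row of the weight matrix is split into groups of 128, each group gets its own FP16-style scale, and weights are rounded to signed 4-bit integers.

```python
import numpy as np

def quantize_w4a16(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit group-wise quantization of a 2-D weight matrix.
    Returns int4-valued codes (stored in int8) and one scale per group."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must be divisible by group size"
    groups = w.reshape(rows, cols // group_size, group_size)
    # One scale per group; map the group's max magnitude onto the int4 range.
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate FP32 weight matrix from codes + scales."""
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)
q, s = quantize_w4a16(w)
err = np.abs(dequantize(q, s) - w).max()
```

With symmetric round-to-nearest, the per-weight error is bounded by half of the group's scale, which is why per-group (rather than per-tensor) scales keep 4-bit quality high.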
## How to use

T1 expects a prefilled `<think>` block to enter literary thinking mode. Use the same seven prefill phrases as the BF16 model (three shown here):
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Three of the seven prefill phrases; see the BF16 model card for the full set.
PREFILLS = {
    "analytical": "Ok i need to think about how to respond — what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative": "Ok i need to think as a creative writer — what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged": "Ok i need to think as an unhinged author — raw, explicit, intense, fully in character with no holding back, so",
}

response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-GPTQ",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        # Prefill the <think> block so the model continues it in thinking mode.
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,   # continue the prefilled assistant turn
        "add_generation_prompt": False,   # do not open a fresh assistant turn
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```
## Serving with vLLM

```shell
vllm serve tacodevs/Behemoth-T1-123B-GPTQ \
  --tokenizer-mode auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```

**Important:** use `--tokenizer-mode auto`, not `mistral`: the mistral_common mode silently mis-templates merged-LoRA checkpoints.

A single 80 GB H100, a single 80 GB A100, or a single 96 GB H100 NVL all fit it comfortably.
## Quality notes

W4A16 quantization has a measurable but small impact on the literary thinking pattern T1 was trained for:

- ✅ Stream-of-consciousness thinking shape: preserved (encoded across many attention layers)
- ✅ Detail surfacing from character cards: preserved
- ✅ Beats base R1 in side-by-side comparisons: preserved (the gap is large)
- 🟡 Specific word choices: may differ token-by-token from BF16
- 🟡 The "cleverest" inventive details: sometimes replaced with equivalent-quality alternatives

For most users, GPTQ T1 is indistinguishable from BF16 T1 in normal use; only A/B testing with the same seed would expose the differences.

If you want maximum quality and have the VRAM, use the BF16 reference or the FP8 W8A8 variant (~99% of BF16, fits on 2× 80 GB).
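A hedged sketch of such a seeded A/B test, assuming you serve the GPTQ and BF16 variants on two local vLLM endpoints (the ports, model names, and helper functions below are illustrative): send the identical request with a fixed seed to each server, then score how similar the two completions are.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how similar two completions are (1.0 = identical)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def ab_test(prompt_messages, seed: int = 1234) -> float:
    # openai is imported lazily so similarity() stays dependency-free.
    import openai
    completions = []
    # Hypothetical endpoints: adjust ports and model names to your deployment.
    for base_url, model in [
        ("http://localhost:8000/v1", "tacodevs/Behemoth-T1-123B-GPTQ"),
        ("http://localhost:8001/v1", "tacodevs/Behemoth-T1-123B"),
    ]:
        client = openai.OpenAI(base_url=base_url, api_key="-")
        resp = client.chat.completions.create(
            model=model,
            messages=prompt_messages,
            seed=seed,            # fixed seed so sampling noise is shared
            temperature=0.6,
            max_tokens=512,
        )
        completions.append(resp.choices[0].message.content)
    return similarity(completions[0], completions[1])
```

A similarity near 1.0 on many prompts is what "virtually indistinguishable" looks like in practice; scores meaningfully below that flag prompts where the quantization changed the output.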
## Training details (from base T1)

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing and Behemoth-R1 reasoning).

| | |
|---|---|
| LoRA rank | 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Loss masking | Think-only (only the post-prefill thinking continuation gets loss) |
| Sequence length | 4096 |
| Epochs | 2 |
| Final eval loss | 0.9898 |

The LoRA only learns the shape of literary thinking. The base model's RP prose engine receives zero gradient updates, so the underlying creative writing voice is structurally preserved.
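A minimal sketch of what "think-only" loss masking means in practice (the token ids and helper below are illustrative, not taken from the actual training code): every label position up to the end of the prefill is set to -100, the index that cross-entropy loss implementations conventionally ignore, so gradients flow only through the thinking continuation.

```python
IGNORE_INDEX = -100  # conventionally skipped by cross-entropy loss in trainers

def mask_labels(input_ids: list[int], prefill_end: int) -> list[int]:
    """Copy input_ids into labels, masking everything up to and including
    the prefill so only the post-prefill continuation receives loss."""
    return [
        IGNORE_INDEX if i < prefill_end else tok
        for i, tok in enumerate(input_ids)
    ]

# Toy example: 10 tokens where the prompt + prefill occupy the first 6 positions.
input_ids = [101, 7, 8, 9, 10, 11, 42, 43, 44, 102]
labels = mask_labels(input_ids, prefill_end=6)
# Only tokens 42, 43, 44, 102 (the thinking continuation) are trained on.
```

Because the prose-generation tokens after `</think>` were also masked out during training, the base model's writing voice is untouched by the LoRA, which is exactly the "zero gradient updates" claim above.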
## Citation

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```
The party doesn't end. We just go to bed.