
# 🌴 Behemoth-T1-123B-FP8 🌴

The party where literary craft meets unhinged creative writing: the production sweet spot.

Variants: BF16 · FP8 (this repo) · GPTQ

β˜€οΈ The pitch

This is the FP8 W8A8 dynamic quantized version of tacodevs/Behemoth-T1-123B: a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.

FP8 is the production sweet spot: half the VRAM of BF16 at ~99% of the quality (essentially lossless), and it runs cleanly on Hopper GPUs (H100, H200) with native FP8 acceleration. This is the variant most users should pick.

For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.

## ⚡ This variant

| | Value |
|---|---|
| Base | tacodevs/Behemoth-T1-123B (BF16) |
| Quantization | FP8 W8A8 dynamic (8-bit weights, 8-bit activations) |
| Calibration | Data-free (per-tensor scale computed analytically) |
| Quantizer | llm-compressor `QuantizationModifier` |
| Size on disk | ~115 GB (half the size of BF16) |
| VRAM (8k ctx) | ~125 GB → fits on 2× 80 GB GPUs or 1× 144 GB GPU |
| Quality vs BF16 | ~99% (essentially lossless) |
| Speed | Faster than BF16 on H100/H200 (native FP8 tensor cores) |
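The headline numbers above can be sanity-checked with quick arithmetic. A sketch, with one caveat: the layer/head counts below are our assumptions for a Mistral Large-class model, not figures stated in this card.

```python
# Back-of-envelope size check for the FP8 variant.
# ASSUMED architecture numbers (Mistral Large class, not from this card):
# 88 layers, 8 KV heads (GQA), head_dim 128.

PARAMS = 123e9  # total parameters

# FP8 stores one byte per weight, so the weights alone are ~123 GB decimal,
# which is ~115 GiB -- matching the ~115 GB on-disk figure above.
weights_gib = PARAMS * 1 / 2**30

# FP8 KV cache at 8k context: 2 tensors (K and V) per layer, 1 byte each.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 88, 8, 128, 8192
kv_gib = 2 * LAYERS * KV_HEADS * HEAD_DIM * 1 * CTX / 2**30

print(f"weights ~ {weights_gib:.0f} GiB, 8k-ctx KV cache ~ {kv_gib:.2f} GiB")
# Weights + KV cache + activation/runtime overhead lands near the ~125 GB
# VRAM figure quoted above.
```

The KV cache is tiny relative to the weights here, which is why `--kv-cache-dtype fp8` mostly buys headroom for longer contexts rather than changing the GPU count.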

## 🎤 How to use

T1 expects a prefilled `<think>` block to enter literary thinking mode. Use the same 7 prefill phrases as the BF16 model (three of them shown below):

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Three of the seven prefill phrases; see the BF16 model card for the full set.
PREFILLS = {
    "analytical": "Ok i need to think about how to respond — what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer — what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author — raw, explicit, intense, fully in character with no holding back, so",
}

# CHARACTER_CARD, conversation_history, and user_message are your own
# character card string, prior chat turns, and latest user input.
response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-FP8",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```
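With `continue_final_message`, the completion continues straight from the prefill, so the text you get back is thinking followed by prose. A small helper to separate the two (a sketch under our own naming; it assumes the model emits a closing `</think>` tag, which is how thinking models conventionally end the block):

```python
def split_think(text: str) -> tuple[str, str]:
    """Split a continued completion into (thinking, prose).

    Assumes the model closes its literary-thinking block with </think>.
    If no closing tag appears, the whole text is treated as prose.
    """
    thinking, sep, prose = text.partition("</think>")
    if not sep:
        return "", text.strip()
    # Tolerate an echoed opening tag if the server returns the prefill too.
    return thinking.removeprefix("<think>").strip(), prose.strip()

reply = split_think("Ok i need to think as a creative writer ...\n</think>\nShe smiled.")
# reply == ("Ok i need to think as a creative writer ...", "She smiled.")
```

Show the prose half to the user; log or discard the thinking half as you see fit.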

## 🚀 Serving with vLLM

```bash
vllm serve tacodevs/Behemoth-T1-123B-FP8 \
    --tokenizer-mode auto \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --kv-cache-dtype fp8
```

**Important:** use `--tokenizer-mode auto`, not `mistral`. The mistral_common mode silently mis-templates merged-LoRA checkpoints.

Recommended hardware:

  • 2Γ— H100 80GB β€” comfortable, ~62GB per GPU
  • 2Γ— H200 144GB β€” luxury, plenty of headroom for 32k+ context
  • 1Γ— H100 NVL 188GB β€” single-card option

FP8 W8A8 uses native FP8 tensor cores on Hopper GPUs for both weights and activations, so this variant is also faster than BF16 in practice, not just smaller.

## ✅ Quality notes

FP8 W8A8 dynamic quantization is essentially lossless for 100B+ models:

  • βœ… Stream-of-consciousness thinking shape β€” preserved
  • βœ… Detail surfacing from character cards β€” preserved
  • βœ… Word-for-word output similarity to BF16 β€” typically 95%+ token overlap on greedy sampling
  • βœ… Beats base R1 in side-by-side β€” preserved
  • βœ… Production reliability β€” same recipe Mistral officially uses for Mistral Large 3 FP8

If you need absolute reference quality and have 4× 80 GB GPUs, use the BF16 reference. For most production use cases, FP8 is the right pick.
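The 95%+ greedy-overlap claim above is easy to reproduce yourself: generate at temperature 0 from both variants on the same prompts and compare token IDs. A minimal sketch (the function names are ours; real comparisons often also report the longest common prefix, since greedy outputs tend to diverge permanently after the first mismatched token):

```python
def positional_overlap(a: list[int], b: list[int]) -> float:
    """Fraction of positions (up to the shorter length) where two greedy
    token-ID sequences agree."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared prefix before the first divergence."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k

bf16_ids = [5, 9, 2, 7, 7, 1]  # toy example, not real model output
fp8_ids  = [5, 9, 2, 7, 4, 1]
# 5 of 6 positions match; the shared prefix is 4 tokens long.
print(positional_overlap(bf16_ids, fp8_ids), common_prefix_len(bf16_ids, fp8_ids))
```

Averaging `positional_overlap` across a prompt set is the simplest way to check the ~95%+ figure on your own workload.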

πŸ› οΈ Training details (from base T1)

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing and Behemoth-R1 reasoning).

| | Value |
|---|---|
| LoRA | rank 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Loss masking | Think-only (only the post-prefill thinking continuation gets loss) |
| Sequence length | 4096 |
| Epochs | 2 |
| Final eval loss | 0.9898 |

The LoRA only learns the shape of literary thinking. The base model's RP prose engine receives zero gradient updates, so the underlying creative writing voice is structurally preserved.
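Think-only loss masking as described above can be sketched in a few lines: every token outside the post-prefill thinking continuation gets the conventional cross-entropy ignore index, so only the thinking span produces gradients. This is an illustration under our own naming; the actual training code is not published in this card.

```python
IGNORE_INDEX = -100  # the ignore value cross-entropy losses conventionally skip

def think_only_labels(input_ids: list[int], think_start: int, think_end: int) -> list[int]:
    """Copy input_ids as labels, masking everything outside the half-open
    thinking-continuation span [think_start, think_end).

    System prompt, chat history, the prefill phrase itself, and the final
    prose all get IGNORE_INDEX, so only the model's own thinking tokens
    receive loss -- the prose engine sees zero gradient.
    """
    return [tok if think_start <= i < think_end else IGNORE_INDEX
            for i, tok in enumerate(input_ids)]

# Toy sequence: tokens at positions 3..5 are the thinking continuation.
labels = think_only_labels([11, 12, 13, 14, 15, 16, 17], 3, 6)
# labels == [-100, -100, -100, 14, 15, 16, -100]
```

Masking the prefill itself (not just the prompt) is what keeps the model from merely memorizing the seven fixed openers.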

## 📜 Citation

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```

The party doesn't end. We just go to bed.
