Instructions to use TheDrummer/Behemoth-R1-123B-v2-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use TheDrummer/Behemoth-R1-123B-v2-GGUF with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheDrummer/Behemoth-R1-123B-v2-GGUF",
    filename="Behemoth-R1-123B-v2d-Q4_K_M-00001-of-00002.gguf",
)

# No official example prompt is defined for this model; any chat message works.
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
```
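The call returns a dict following the OpenAI chat-completion schema, so the reply text can be read like this (a minimal sketch; the prompt is illustrative):

```python
# Reuses the `llm` object from above; the reply lives under choices[0].
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```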
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use TheDrummer/Behemoth-R1-123B-v2-GGUF with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
Use Docker
```sh
docker model run hf.co/TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
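If you start `llama-server` with any of the options above, it exposes an OpenAI-compatible API, by default on port 8080. A minimal sketch of querying it with the `openai` Python client; the port assumes default flags, and the model name is arbitrary since the server serves a single loaded model:

```python
# pip install openai
from openai import OpenAI

# llama-server listens on http://localhost:8080 by default (assumes default flags)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Behemoth-R1-123B-v2",  # name is not checked when only one model is loaded
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)
print(response.choices[0].message.content)
```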
- LM Studio
- Jan
- Ollama
How to use TheDrummer/Behemoth-R1-123B-v2-GGUF with Ollama:
```sh
ollama run hf.co/TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
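Once pulled, the model can also be called programmatically through Ollama's local API (default port 11434). A minimal sketch with the official `ollama` Python package; the prompt is illustrative:

```python
# pip install ollama
import ollama

# Same model tag as the `ollama run` command above
response = ollama.chat(
    model="hf.co/TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Newer ollama-python versions also support response.message.content
print(response["message"]["content"])
```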
- Unsloth Studio
How to use TheDrummer/Behemoth-R1-123B-v2-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for TheDrummer/Behemoth-R1-123B-v2-GGUF to start chatting
```
Install Unsloth Studio (Windows)
```sh
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for TheDrummer/Behemoth-R1-123B-v2-GGUF to start chatting
```
Use Hugging Face Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for TheDrummer/Behemoth-R1-123B-v2-GGUF to start chatting
```
- Docker Model Runner
How to use TheDrummer/Behemoth-R1-123B-v2-GGUF with Docker Model Runner:
```sh
docker model run hf.co/TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
- Lemonade
How to use TheDrummer/Behemoth-R1-123B-v2-GGUF with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull TheDrummer/Behemoth-R1-123B-v2-GGUF:Q4_K_M
```
Run and chat with the model
```sh
lemonade run user.Behemoth-R1-123B-v2-GGUF-Q4_K_M
```
List all available models
```sh
lemonade list
```
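Lemonade Server also exposes an OpenAI-compatible endpoint, so the pulled model can be queried from code. The base URL below assumes Lemonade's default port 8000 and `/api/v1` path; check the Lemonade docs if your install differs:

```python
# pip install openai
from openai import OpenAI

# Assumption: Lemonade Server's default OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="user.Behemoth-R1-123B-v2-GGUF-Q4_K_M",  # same name as `lemonade run` above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```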
Join our Discord! https://discord.gg/BeaverAI
Nearly 7,000 members strong 💪 A hub for users and makers alike!
Drummer is open for work / employment (I'm a Software Engineer). Contact me through any of these channels: https://linktr.ee/thelocaldrummer
Thank you to everyone who subscribed through Patreon. Your support helps me chug along in this brave new world.
Drummer proudly presents...
Behemoth R1 123B v2 🦣
Usage
- Use Mistral v7 (Non-Tekken), i.e., Mistral v3 + `[SYSTEM_PROMPT]`. Warning: using the wrong version / whitespacing may deteriorate performance.
- Prefill `<think>` to ensure reasoning (and test your patience). You can slightly steer the thinking by prefixing the think tag (e.g., `<immoral_think>`); see the sketch below.
- Works great even without reasoning.
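A minimal sketch of the prefill trick using the llama-cpp-python `llm` object from earlier, via a raw completion so the assistant turn can be started with the think tag. The system and user text are illustrative, and the exact template whitespace is an assumption; verify it against the tokenizer config, since the warning above says wrong whitespacing hurts performance:

```python
# Mistral v7 (Non-Tekken) layout: [SYSTEM_PROMPT]...[/SYSTEM_PROMPT][INST]...[/INST]
# (whitespace handling is an assumption; check the tokenizer config)
prompt = (
    "[SYSTEM_PROMPT]You are a creative roleplay assistant.[/SYSTEM_PROMPT]"
    "[INST]Describe the tavern as the stranger walks in.[/INST]"
    "<think>"  # prefill: forces the model to open with a reasoning block
)

output = llm.create_completion(prompt, max_tokens=1024)
print(output["choices"][0]["text"])
```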
Rationale for Reasoning
Hear me out for a second. I know it's crazy to have a 123B dense model spend precious output tokens on reasoning, but if you're a fan of Largestral, then consider the following...
Sometimes, you'd want to leave the character responses untouched. Reasoning divides the AI response into two phases: planning & execution. It gives you the opportunity to 'modify' the planning phase without messing with the character's execution.
The planning phase will also pick apart the scenario, break down nuances, and surface implicit story elements. If it's erroneous, then you have a chance to correct the AI before the execution phase. If it's missing details, then you can wrangle it during the planning phase and watch it unfold in the execution phase.
Nutshell: Reasoning adds another useful dimension for these creative uses.
Description
As far as I see, this doesn't even feel like Behemoth. It's something way better. It's the top 3 you've ever made. This is a solid cook my man.
Characters in particular are portrayed so much better and more authentically, which was Largestral's biggest problem. Dialogue is much improved, and the smarts 2411 had have been retained quite well. Its prose has changed for the better without the overconfidence in base.
This is so much better than any other 2411 tune I've tried tbh. It's doing quite well on adherence.
After a few messages, the model gets pretty smart. In fact, so smart that it tries to analyze why I want to do some particular RP. The model is getting better with a nasty prefill.
This model continues to surprise and impress me. It's really exactly what I wanted Largestral 2411 to be. I cannot overstate how much better it is than the base and any other tune of it. From what I remember, it actually feels as good as Nemotron Ultra.
Yes, super intelligent, and something about it makes characters have much more texture and personality than other models.
Links
- Original: https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2
- GGUF: https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2-GGUF
- iMatrix (recommended): https://huggingface.co/bartowski/TheDrummer_Behemoth-R1-123B-v2-GGUF
- EXL3: https://huggingface.co/ArtusDev/TheDrummer_Behemoth-R1-123B-v2-EXL3
Base model: mistralai/Mistral-Large-Instruct-2411