Evertide-RX-12B


A generalist model with some reasoning capabilities and multilingual support.

Supported languages:

  • English
  • French
  • German
  • Spanish
  • Italian
  • Portuguese
  • Russian
  • Chinese
  • Japanese

This model is a full fine-tune (FFT) of an unreleased co-writer model merge (it uses the same models as Retreatcost/KansenSakura-Erosion-RP-12b; credits to all original model authors), trained on an in-progress dataset that I am creating for another project.

Training stats can be found in the "Training metrics" tab.

Reasoning should work out of the box most of the time, with occasional replies that skip it. For absolute consistency you can prefill model responses with "<think>\n" (the think tag followed by a line break; the line break is preferred).
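The prefill above can be sketched as follows. This is a minimal, hypothetical helper that renders a ChatML prompt and leaves the assistant turn open with the think tag, assuming a raw text-completion backend; adapt it to whatever frontend you actually use.

```python
# Sketch: build a ChatML prompt and prefill the assistant turn with "<think>\n"
# so the model is forced to start with a reasoning block.
# build_chatml_prompt is a hypothetical helper, not part of any library.

def build_chatml_prompt(messages, prefill="<think>\n"):
    """Render messages in ChatML and open the assistant turn with a prefill."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Open the assistant turn without closing it, then append the prefill.
    parts.append(f"<|im_start|>assistant\n{prefill}")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
])
# The rendered prompt ends with "<think>\n", so generation continues
# inside the reasoning block.
```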

Intended use

  • General conversations, chatting.
  • Co-writing, brainstorming.
  • Short roleplaying.

Inference Tips

  1. Temperature: 0.7 (0.6 - 0.8 range should work fine)
  2. Repetition Penalty: 1.05
  3. TOP_P: 0.90
  4. TOP_K: 0 (disable)
  5. MIN_P: 0.025
  6. Template Format: ChatML
  7. Max Output: 2048 (because of the additional reasoning budget, give the model at least 768 tokens, preferably over 1K; answers rarely exceed 1.35K, so 2K is a safe maximum).
  8. Context Management: 8K

I haven't really tested or trained the model for long context, so it will probably degrade earlier than regular models. You can set a higher context (16K, 24K or 32K, for example), but I can't guarantee how it will behave.
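The tips above can be collected into a single settings object. The key names below follow the llama.cpp / text-generation-webui convention, which is an assumption; rename them to match your backend's API.

```python
# Hypothetical sampler settings mirroring the inference tips above.
# Key names are an assumption (llama.cpp-style); adapt to your backend.
SAMPLER_SETTINGS = {
    "temperature": 0.7,        # 0.6-0.8 range should work fine
    "repetition_penalty": 1.05,
    "top_p": 0.90,
    "top_k": 0,                # 0 disables top-k
    "min_p": 0.025,
    "max_tokens": 2048,        # leaves room for the reasoning budget
}
```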

Training details


I trained 2 variants of the model:

  • with unrolled turns (each turn in separate sample)
  • with regular turns (all turns in single sample)

Unrolled turns teach local attention much better and train faster, but generalize worse for multi-turn use (Evertide-LA-12B, local attention). Regular turns generalize much better across multiple turns, but they tend to memorize instead of learning new capabilities (Evertide-GA-12B, global attention).
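The two sample layouts can be sketched like this. These are assumed formats for illustration, not the author's exact preprocessing: "regular" keeps the whole conversation as one training sample, while "unrolled" emits one sample per assistant turn, each carrying the history up to that point.

```python
# Sketch of the two training-sample layouts (assumed formats).

def regular_samples(conversation):
    """All turns stay together in a single training sample."""
    return [conversation]

def unrolled_samples(conversation):
    """One sample per assistant turn, each with its preceding history."""
    samples = []
    for i, turn in enumerate(conversation):
        if turn["role"] == "assistant":
            samples.append(conversation[: i + 1])  # history + this reply
    return samples

chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Great, thanks."},
]
len(regular_samples(chat))   # -> 1
len(unrolled_samples(chat))  # -> 2
```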

I also trained these with a changed RoPE theta: 10K for GA, 10M for LA. My reasoning is that during merging I "unrotate" the changes in the config, effectively creating a distribution that I haven't trained on.

LA gets shrunk to be even more specialized in short context, while GA gets stretched to cover longer context.

Then I merged these training runs using passthrough in a 4:1 pattern, similar to how Gemma 3 models interleave layered SWA and global attention.


The following YAML configuration was used to produce this model:

merge_method: passthrough
slices:
- sources:
  - model: Evertide-LA-12B
    layer_range: [0, 4]
- sources:
  - model: Evertide-GA-12B
    layer_range: [4, 5]
- sources:
  - model: Evertide-LA-12B
    layer_range: [5, 9]
- sources:
  - model: Evertide-GA-12B
    layer_range: [9, 10]
- sources:
  - model: Evertide-LA-12B
    layer_range: [10, 14]
- sources:
  - model: Evertide-GA-12B
    layer_range: [14, 15]
- sources:
  - model: Evertide-LA-12B
    layer_range: [15, 19]
- sources:
  - model: Evertide-GA-12B
    layer_range: [19, 20]
- sources:
  - model: Evertide-LA-12B
    layer_range: [20, 24]
- sources:
  - model: Evertide-GA-12B
    layer_range: [24, 25]
- sources:
  - model: Evertide-LA-12B
    layer_range: [25, 29]
- sources:
  - model: Evertide-GA-12B
    layer_range: [29, 30]
- sources:
  - model: Evertide-LA-12B
    layer_range: [30, 34]
- sources:
  - model: Evertide-GA-12B
    layer_range: [34, 35]
- sources:
  - model: Evertide-LA-12B
    layer_range: [35, 39]
- sources:
  - model: Evertide-GA-12B
    layer_range: [39, 40]
dtype: bfloat16
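The repeating 4:1 slice pattern above can be generated programmatically rather than written out by hand. This is a sketch that emits a mergekit-style dict with the same field names as the YAML, assuming 40 total layers:

```python
# Sketch: generate the 4:1 LA/GA passthrough slice pattern shown above.
# Produces a mergekit-style config dict; field names match the YAML.

def interleave_slices(la="Evertide-LA-12B", ga="Evertide-GA-12B",
                      total_layers=40, block=5):
    """Every block of 5 layers: 4 from LA, then 1 from GA."""
    slices = []
    for start in range(0, total_layers, block):
        slices.append({"sources": [
            {"model": la, "layer_range": [start, start + block - 1]}]})
        slices.append({"sources": [
            {"model": ga, "layer_range": [start + block - 1, start + block]}]})
    return {"merge_method": "passthrough", "slices": slices, "dtype": "bfloat16"}
```

Serializing the returned dict to YAML reproduces the configuration above (8 blocks, 16 slices).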

FAQ


Is this model better than X model?

Probably not.

Is it an NSFW model?

Not exactly. With some prompting it is definitely capable of outputting something, but it's not designed to be an ERP model in the first place. I would rate it 4/10 in this department, and that's by design.

Is it an uncensored model?

Same as above: it will absolutely refuse some of your more unhinged prompts. You can try to abliterate it, though.

Why isn't it NSFW/uncensored by default?

Achieving ERP capabilities wasn't the goal for this model, so I'm happy with its current state.

RP/ERP model when?

Soon™.

Did you train with RL?

No, not yet, but that's one of my future plans.

Is the reasoning performative?

It's hard to tell exactly. It definitely has some elements of that, but it was also trained with some specific constraints that force causality between the thinking blocks and the answer. So I would say it's at least a hybrid. Any further improvements require RL training.

How many samples did you train on?

Only 451 samples, but they are all manually crafted and refined using a score-samples script.

Special Thanks
