---
library_name: transformers
language:
- en
- fr
- de
- es
- it
- pt
- ru
- zh
- ja
tags:
- mergekit
- merge
- trl
- conversational
- finetune
- general-purpose
license: apache-2.0
base_model:
- Retreatcost/KansenSakura-Erosion-CW-12b
---

# Evertide-RX-12B

![evertide_rx](https://cdn-uploads.huggingface.co/production/uploads/6671dd5203d6e8087aaf7ce5/zTuxJU9fwrkFbCvkGW1qe.jpeg)

A generalist model with some reasoning capabilities and multilingual support.

Supported languages:
- English
- French
- German
- Spanish
- Italian
- Portuguese
- Russian
- Chinese
- Japanese

This model was trained with full fine-tuning (FFT) on top of an unreleased co-writer model merge (it uses the same models as [Retreatcost/KansenSakura-Erosion-RP-12b](https://huggingface.co/Retreatcost/KansenSakura-Erosion-RP-12b); credits to all original model authors), using an in-progress dataset that I am creating for another project.

Training stats can be found in the "Training metrics" tab.

Reasoning should work out of the box most of the time, with occasional replies that skip it.
For absolute consistency you can prefill the model's responses with `<think>\n` (the think tag; the line break is preferred).
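
As a sketch of that prefill trick with ChatML formatting (the `build_chatml_prompt` helper is hypothetical, not part of the model's tooling):

```python
# Hypothetical helper illustrating the prefill trick: end the prompt
# inside the assistant turn with an opened <think> tag, so the model
# continues generation from inside a reasoning block.
def build_chatml_prompt(user_msg: str) -> str:
    return (
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        + "<|im_start|>assistant\n"
        + "<think>\n"  # prefill: forces the reply to open with reasoning
    )

prompt = build_chatml_prompt("Explain RoPE in one paragraph.")
```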

## Intended use

- General conversations, chatting.
- Co-writing, brainstorming.
- Short roleplaying.

## Inference Tips

1. **Temperature**: 0.7 (0.6 - 0.8 range should work fine)
2. **Repetition Penalty**: 1.05
3. **TOP_P**: 0.90
4. **TOP_K**: 0 (disable)
5. **MIN_P**: 0.025
6. **Template Format**: ChatML
7. **Max Output**: 2048 (Due to the additional reasoning budget I suggest giving the model at least 768 tokens, preferably over 1K; it rarely outputs answers longer than 1.35K, so 2K is a safe max).
8. **Context Management**: 8K
  
I haven't really tested or trained the model for long context, so it will probably break down earlier than regular models.
You can set a higher context, for example 16K, 24K or 32K, but I can't guarantee how it will behave.
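
The settings above can be collected into one sampling config; a minimal sketch, assuming a Hugging Face `transformers`-style `generate()` call (model and tokenizer loading omitted):

```python
# Sampling settings from the list above, as a dict you could unpack
# into a transformers-style call: model.generate(**inputs, **sampling)
sampling = {
    "do_sample": True,
    "temperature": 0.7,        # 0.6 - 0.8 range should work fine
    "repetition_penalty": 1.05,
    "top_p": 0.90,
    "top_k": 0,                # 0 disables top-k filtering
    "min_p": 0.025,
    "max_new_tokens": 2048,    # leave headroom for the reasoning block
}
```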

## Training details

<details>
  <summary>Spoiler warning</summary>

I trained 2 variants of the model:
  - with unrolled turns (each turn in separate sample)
  - with regular turns (all turns in single sample)

Unrolled turns teach local attention much better and train faster, but generalize worse for multi-turn conversations (Evertide-LA-12B, local attention).
Regular turns generalize much better across multiple turns, but tend to memorize instead of learning new capabilities (Evertide-GA-12B, global attention).
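
To make the distinction concrete, here is an illustrative sketch of what unrolling means (the `unroll_turns` helper is hypothetical, not the actual training code):

```python
def unroll_turns(conversation):
    """Split one multi-turn chat into per-turn samples: each sample is
    the history up to and including one assistant reply. The 'regular'
    variant instead keeps the whole conversation as a single sample."""
    samples = []
    for i, turn in enumerate(conversation):
        if turn["role"] == "assistant":
            samples.append(conversation[: i + 1])
    return samples
```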

I also trained these with a changed RoPE theta: 10K for GA, 10M for LA.
My reasoning is that during merging I "unrotate" these changes in the config, effectively creating a distribution that I haven't trained in.

LA gets compressed to be even more specialized in short context, while GA gets stretched to cover longer context.
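
The intuition can be checked numerically: a rotary pair's wavelength in tokens is 2π·theta^(2i/d), so raising theta uniformly stretches the position encoding while lowering it compresses it (head_dim=128 here is an assumption for illustration, not a confirmed architecture detail):

```python
import math

def rope_wavelengths(theta, head_dim=128):
    # Wavelength (in tokens) of rotary pair i: 2*pi * theta^(2i/head_dim).
    # Larger theta -> longer wavelengths -> positions stretched over
    # more tokens; smaller theta compresses them.
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

wl_10k = rope_wavelengths(10_000)      # GA training theta
wl_10m = rope_wavelengths(10_000_000)  # LA training theta
```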

Then I merged these training runs using passthrough in a 4:1 pattern, similar to how Gemma models interleave SWA and global attention layers.

![download](https://cdn-uploads.huggingface.co/production/uploads/6671dd5203d6e8087aaf7ce5/9M_XguM0q7Pv66X8Vy8t9.jpeg)

The following YAML configuration was used to produce this model:

```yaml
merge_method: passthrough
slices:
- sources:
  - model: Evertide-LA-12B
    layer_range: [0, 4]
- sources:
  - model: Evertide-GA-12B
    layer_range: [4, 5]
- sources:
  - model: Evertide-LA-12B
    layer_range: [5, 9]
- sources:
  - model: Evertide-GA-12B
    layer_range: [9, 10]
- sources:
  - model: Evertide-LA-12B
    layer_range: [10, 14]
- sources:
  - model: Evertide-GA-12B
    layer_range: [14, 15]
- sources:
  - model: Evertide-LA-12B
    layer_range: [15, 19]
- sources:
  - model: Evertide-GA-12B
    layer_range: [19, 20]
- sources:
  - model: Evertide-LA-12B
    layer_range: [20, 24]
- sources:
  - model: Evertide-GA-12B
    layer_range: [24, 25]
- sources:
  - model: Evertide-LA-12B
    layer_range: [25, 29]
- sources:
  - model: Evertide-GA-12B
    layer_range: [29, 30]
- sources:
  - model: Evertide-LA-12B
    layer_range: [30, 34]
- sources:
  - model: Evertide-GA-12B
    layer_range: [34, 35]
- sources:
  - model: Evertide-LA-12B
    layer_range: [35, 39]
- sources:
  - model: Evertide-GA-12B
    layer_range: [39, 40]
dtype: bfloat16
```
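
The 4:1 pattern in the config above is regular enough to generate programmatically; a sketch:

```python
# Generate the 4:1 LA/GA slice pattern from the YAML above: every
# block of 5 layers is 4 LA layers followed by 1 GA layer.
def make_slices(total_layers=40, block=5, ga_layers=1):
    slices = []
    for start in range(0, total_layers, block):
        mid = start + block - ga_layers
        slices.append({"sources": [{"model": "Evertide-LA-12B",
                                    "layer_range": [start, mid]}]})
        slices.append({"sources": [{"model": "Evertide-GA-12B",
                                    "layer_range": [mid, start + block]}]})
    return slices
```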

</details>

## FAQ

<details>
  <summary>Spoiler warning</summary>

### Is this model better than X model?
Probably not.

### Is it an NSFW model?
Not exactly. With some prompting it is definitely capable of producing something, but it's not designed to be an ERP model in the first place. I would rate it 4/10 in this department, and that's by design.

### Is it an uncensored model?
Same as above: it will absolutely refuse some of your more unhinged prompts. You can try to abliterate it, though.

### Why isn't it NSFW/uncensored by default?
Achieving ERP capabilities wasn't a goal for this model, so I'm happy with its current state.

### RP/ERP model when?
Soon™.

### Did you train in RL?
No, not yet, but that's one of my future plans.

### Is the reasoning performative?
It's hard to tell exactly. It definitely has some performative elements, but it was also trained with specific constraints that force causality between the thinking block and the answer. So I would say it's at least a hybrid. Any further improvements would require RL training.

### How many samples did you train on?
Only 451 samples, but they are all manually crafted and refined using the [score-samples](https://github.com/Retreatcost/score-samples) script.

</details>

## Special Thanks
- **[Team mradermacher](https://huggingface.co/mradermacher)**: for awesome quants