| --- |
| library_name: transformers |
| language: |
| - en |
| - fr |
| - de |
| - es |
| - it |
| - pt |
| - ru |
| - zh |
| - ja |
| tags: |
| - mergekit |
| - merge |
| - trl |
| - conversational |
| - finetune |
| - general-purpose |
| license: apache-2.0 |
| base_model: |
| - Retreatcost/KansenSakura-Erosion-CW-12b |
| --- |
| |
| # Evertide-RX-12B |
|
|
|  |
|
|
| A generalist model with some reasoning capabilities and multilingual support. |
|
|
| Supported languages: |
| - English |
| - French |
| - German |
| - Spanish |
| - Italian |
| - Portuguese |
| - Russian |
| - Chinese |
| - Japanese |
|
|
| This model was trained via FFT (full fine-tuning) on top of an unreleased co-writer model merge (which uses the same models as [Retreatcost/KansenSakura-Erosion-RP-12b](https://huggingface.co/Retreatcost/KansenSakura-Erosion-RP-12b); credits to all original model authors), using an in-progress dataset that I am creating for another project. |
|
|
| Training stats can be found in the "Training metrics" tab. |
|
|
| Reasoning should work out of the box most of the time, with occasional replies that skip it. |
| For absolute consistency you can prefill model responses with `<think>\n` (the opening think tag, preferably followed by a line break). |
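|
| Here is a rough sketch of how I would wire that prefill up with `transformers` (the repo id, dtype and device settings are assumptions, adjust them to your setup): |
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
|
| model_id = "Retreatcost/Evertide-RX-12B"  # assumed repo id |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto") |
|
| messages = [{"role": "user", "content": "Summarize the plot of Hamlet in three sentences."}] |
|
| # Build the ChatML prompt up to the assistant header, then force the reasoning block open. |
| prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| prompt += "<think>\n" |
|
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
| output = model.generate(**inputs, max_new_tokens=2048) |
| print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)) |
| ``` |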
|
|
| ## Intended use |
|
|
| - General conversations, chatting. |
| - Co-writing, brainstorming. |
| - Short roleplaying. |
|
|
| ## Inference Tips |
|
|
| 1. **Temperature**: 0.7 (the 0.6 - 0.8 range should work fine) |
| 2. **Repetition Penalty**: 1.05 |
| 3. **TOP_P**: 0.90 |
| 4. **TOP_K**: 0 (disabled) |
| 5. **MIN_P**: 0.025 |
| 6. **Template Format**: ChatML |
| 7. **Max Output**: 2048 (due to the additional reasoning budget I suggest giving the model at least 768 tokens, preferably over 1K; it rarely outputs answers longer than 1.35K, so 2K is a safe max) |
| 8. **Context Length**: 8K |
| |
| I haven't really tested or trained the model for long context, so it will probably break down earlier than regular models. |
| You can set a higher context, for example 16K, 24K or 32K, but I can't guarantee how it will behave. |
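|
| In `transformers` the settings above map onto `GenerationConfig` roughly like this (a minimal sketch; `min_p` needs a fairly recent `transformers` release, and other backends name these knobs differently): |
|
| ```python |
| from transformers import GenerationConfig |
|
| generation_config = GenerationConfig( |
|     do_sample=True, |
|     temperature=0.7, |
|     top_p=0.90, |
|     top_k=0,                 # 0 disables top-k filtering |
|     min_p=0.025, |
|     repetition_penalty=1.05, |
|     max_new_tokens=2048,     # leave enough room for the reasoning block |
| ) |
| ``` |
|
| You can then pass it via `model.generate(..., generation_config=generation_config)`. |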
| |
| ## Training details |
| |
| <details> |
| <summary>Spoiler warning</summary> |
| |
| I trained 2 variants of the model: |
| - with unrolled turns (each turn in separate sample) |
| - with regular turns (all turns in single sample) |
| |
| Unrolled turns teach local attention much better and train faster, but generalize worse for multi-turn use (Evertide-LA-12B, local attention). |
| Regular turns generalize much better across multiple turns, but they tend to memorize instead of learning new capabilities (Evertide-GA-12B, global attention). |
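|
| To make the difference concrete, here is a toy sketch of the two layouts (my own notation, not the actual training format): |
|
| ```python |
| # A four-message toy conversation. |
| conversation = [ |
|     {"role": "user", "content": "Hi!"}, |
|     {"role": "assistant", "content": "Hello."}, |
|     {"role": "user", "content": "Tell me a story."}, |
|     {"role": "assistant", "content": "Once upon a time..."}, |
| ] |
|
| # Regular turns: the whole conversation is a single training sample (GA variant). |
| regular_samples = [conversation] |
|
| # Unrolled turns: one sample per assistant turn, each containing only the history up to that turn (LA variant). |
| unrolled_samples = [ |
|     conversation[: i + 1] |
|     for i, turn in enumerate(conversation) |
|     if turn["role"] == "assistant" |
| ] |
| ``` |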
| |
| I also trained these with a changed RoPE theta: 10K for GA, 10M for LA. |
| My reasoning behind this is that during merging I "unrotate" the change in the config, effectively creating a positional distribution that the model wasn't trained on. |
| |
| LA gets shrunk to be even more specialized in short context, while GA gets stretched to cover longer context. |
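|
| As I understand my own setup, the swap boils down to editing `rope_theta` in the config before training and restoring the original value afterwards. A rough sketch (my own illustration; the paths and the restored base value of 1M are assumptions, only the 10K/10M training values come from above): |
|
| ```python |
| import json |
|
| def set_rope_theta(config_path: str, theta: float) -> None: |
|     """Overwrite rope_theta in a model's config.json.""" |
|     with open(config_path) as f: |
|         cfg = json.load(f) |
|     cfg["rope_theta"] = theta |
|     with open(config_path, "w") as f: |
|         json.dump(cfg, f, indent=2) |
|
| # Before training: GA variant at 10K, LA variant at 10M. |
| set_rope_theta("Evertide-GA-12B/config.json", 10_000) |
| set_rope_theta("Evertide-LA-12B/config.json", 10_000_000) |
|
| # After merging: "unrotate" by restoring the base model's original theta (assumed 1M here). |
| set_rope_theta("Evertide-RX-12B/config.json", 1_000_000) |
| ``` |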
| |
| Then I merged these training runs using passthrough in a 4:1 pattern, similar to how Gemma models interleave SWA and global attention layers. |
| |
|  |
| |
| The following YAML configuration was used to produce this model: |
| |
| ```yaml |
| merge_method: passthrough |
| slices: |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [0, 4] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [4, 5] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [5, 9] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [9, 10] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [10, 14] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [14, 15] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [15, 19] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [19, 20] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [20, 24] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [24, 25] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [25, 29] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [29, 30] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [30, 34] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [34, 35] |
| - sources: |
| - model: Evertide-LA-12B |
| layer_range: [35, 39] |
| - sources: |
| - model: Evertide-GA-12B |
| layer_range: [39, 40] |
| dtype: bfloat16 |
| ``` |
| |
| </details> |
| |
| ## FAQ |
| |
| <details> |
| <summary>Spoiler warning</summary> |
| |
| ### Is this model better than X model? |
| Probably not. |
| |
| ### Is it an NSFW model? |
| Not exactly. With some prompting it is definitely capable of outputting something, but it's not designed to be an ERP model in the first place. I would rate it 4/10 in this department, and that's by design. |
| |
| ### Is it an uncensored model? |
| Same as above: it will absolutely refuse some of your more unhinged prompts. You can try to abliterate it, though. |
| |
| ### Why isn't it NSFW/uncensored by default? |
| Achieving ERP capabilities wasn't the goal for this model, so I'm happy with its current state. |
| |
| ### RP/ERP model when? |
| Soon™. |
| |
| ### Did you train it with RL? |
| No, not yet, but that's one of my future plans. |
| |
| ### Is the reasoning performative? |
| It's hard to tell exactly. It definitely has some elements of that, but it was also trained with specific constraints that force causality between the thinking block and the answer, so I would say it's at least a hybrid. Any further improvements will require RL training. |
| |
| ### How many samples did you train on? |
| Only 451 samples, but they were all manually crafted and refined using the [score-samples](https://github.com/Retreatcost/score-samples) script. |
| |
| </details> |
| |
| ## Special Thanks |
| - **[Team mradermacher](https://huggingface.co/mradermacher)**: for awesome quants |