model-00001-of-00010.safetensors missing?
Merry Christmas!
Is the first file for the model not uploaded yet?
Cheers!
There was an error with the uploader; it should be fixed now.
Merry Christmas!
More models coming soon
Thanks a lot!
Is there a method to the model merging madness? How do you get the result you want, when merging so many models?
Kind of. It's mostly trial and error: testing ideas and the principles behind the math. A lot of experimentation, new combinations, and serial pipelines of merges.
One broken model is all it takes to mess up hours of merging. Better to start small (fast methods, low donor count) and work your way up. I also had to edit some scripts like graph.py to support a weaker GPU.
I start by learning all I can about the YAML parameters (still learning), such as what the default values are for each method and what different combinations achieve. Then I form a hypothesis that another LLM evaluates, and test different combinations. The LLM has a lot of theories and comments on things and can easily get confused, so you have to perform "audits" on it.
Like, a recent discovery I made is that it's supposedly better to include the base_model as a donor too for SCE to achieve "deviation from the foundation" instead of just "disagreement between the donors".
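For reference, in mergekit YAML that idea looks roughly like this (a minimal sketch with placeholder model names, not an actual Goetia config):

```yaml
# Hedged sketch: SCE with the base model also listed as a donor, so the
# select_topk variance step measures deviation from the foundation rather
# than only disagreement between the donors. All names are placeholders.
merge_method: sce
base_model: placeholder/mistral-small-24b-base
models:
  - model: placeholder/mistral-small-24b-base   # base included as a donor too
  - model: placeholder/finetune-a
  - model: placeholder/finetune-b
parameters:
  select_topk: 0.15   # keep only the highest-variance 15% of deltas
dtype: bfloat16
```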
Usually you don't get the exact result you want; you stop at some point of compromise. But sometimes, "it just works". I made a Q0 and Compliance benchmark to test simple things. Basically, if you can repeat a certain test on multiple models, you get an idea of what they might produce when combined.
With this model I just followed OddTheGreat's lead by using TIES as the method, and swapped in newer versions of models, or ones that had high scores on the Q0D+F benchmark. You kind of get a second 'feel' for which models might work well together. But then you run into paradigm shifts that change the whole way you look at things. For instance, I already believe nearswap and nuslerp are potentially redundant now, but last week I was considering them more seriously.
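For anyone reading along without mergekit experience, a bare-bones TIES config looks something like this (a hypothetical sketch with placeholder names, not OddTheGreat's actual recipe):

```yaml
# Minimal TIES sketch: donors contribute task vectors (deltas vs. the base);
# density prunes each delta, and weight settles sign conflicts in the vote.
merge_method: ties
base_model: placeholder/mistral-small-24b-base
models:
  - model: placeholder/roleplay-finetune
    parameters:
      weight: 0.5
      density: 0.6
  - model: placeholder/writing-finetune
    parameters:
      weight: 0.3
      density: 0.6
dtype: bfloat16
```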
The idea with Goetia is to expand beyond just a simple dare_ties, testing whether unifying other methods can introduce new forms of novelty and higher logic. So v1.2 should be a big step up, as it includes custom methods still early in development. You risk "over-muddying" the more models you add, so maintaining high quality, novelty, and knowledge when several models are at play (20+) is a key factor I've been trying to improve.
A lot of it is just looking through all the models and reading community feedback and reviews. You pretty much know that anything by Gryphe, Drummer, or ZeroFata should be high quality, so I always include their latest versions when possible. Mullein was another hidden gem I found. And sometimes I see others have already tested ideas I came up with on my own, like combining Delirium with Gemma The Writer.
Another discovery I made with the 14B tests was that abliteration is best reserved until after the merges are done, rather than pre-ablating components. I bought an SSD just to 'sacrifice' for larger merges using pagefile; it should last a year or two, maybe. Now I just need to save up for a 3090.
I think one of the most interesting parts, though, is testing new abstract theories and trying to create new methods. The past few days I've outlined a radical new pipeline that converges about 4-6 methods into a final wrapper (currently a modified version of SCE). Hoping to test this next.
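Loosely, the shape of that pipeline is several independent mergekit runs whose outputs feed one final wrapper merge. A hypothetical sketch using stock methods (the modified SCE wrapper is custom and not shown; all names are placeholders; each YAML document below would be its own config file and mergekit run):

```yaml
# stage1.yaml: one donor family fused with karcher
merge_method: karcher
models:
  - model: placeholder/donor-a
  - model: placeholder/donor-b
dtype: bfloat16
---
# stage2.yaml: another family with dare_ties
merge_method: dare_ties
base_model: placeholder/mistral-small-24b-base
models:
  - model: placeholder/donor-c
    parameters: {weight: 0.5, density: 0.5}
  - model: placeholder/donor-d
    parameters: {weight: 0.5, density: 0.5}
dtype: bfloat16
---
# final.yaml: the stage outputs converge in an SCE "wrapper"
merge_method: sce
base_model: placeholder/mistral-small-24b-base
models:
  - model: ./merges/stage1-karcher
  - model: ./merges/stage2-dare-ties
parameters:
  select_topk: 0.10
dtype: bfloat16
```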
Also, I am running a Goetia flux merge right now at 1000 iterations, and it's at 70% after 40 hours of merging on a 3060 Ti. Hopefully it works, but if not, DeepHermes likely needs to be removed.
It seems that other testers have found some better settings for Goetia, but I'll test them a bit more before including them in the v1.2 release.
Kraken is half finished: half of the components are stabilized while the others are being investigated. I found some incompatible models; when this happens, you have to divide them into groups to isolate the bugs.
I don't have as much time to run benchmarks lately, so I've been relying more on 'gut instinct' and what the coding assistant says, then letting the community test the models.
There's a lot of info here that might not make sense if you haven't used mergekit but I hope this helps.
Absolutely helps! Even if I don't know every detail, the process seems clear to me. Thanks a lot for your explanation, much appreciated!
Yeah, no problem. The good news is that the merge completed after 61 hours; the bad news is that the model seems dumber than Goetia v1 or v1.1.
It's only scoring ~5K points at Q0, compared to ~9K for the previous versions. So maybe flux isn't better than karcher for merges this big. I'll run some other tests to compare it, but I'll probably upload these safetensors under a different name just so the time wasn't a complete waste of compute.
More iterations + more models doesn't always produce a better result. The next few merges should go much quicker at least.
Btw, StationV's behavior seemed a bit off to me... but I'm not sure, maybe it was user error. Anything special I have to keep in mind?
DeepHermesPreview is confirmed as incompatible no matter what fixes you try. It keeps spamming random `<tool_calls>` tokens, so it must be removed from the merge.
With StationV there could be an issue with MistralThinker, Eurydice, or Transgression, as these are based on Mistral 2501 while the rest are 2503/2506.
I'm not sure and would have to compare it directly with Circuitry on some tests to get a better idea. If you are noticing any instability let me know and I can pull it from the Goetia merge.
Maybe TIES isn't meant to be divided across this many donors. It seems that once you pass a certain threshold of models per method, the quality becomes increasingly volatile.
What I noticed was a missing progression in the narrative... I constantly had to "push" via prompts to make something happen, which resulted in minimal answers. If you have any idea, let me know...
Gemini basically thinks it is over-diluting ties with "too many small, equal voices" (I remember similar issues plagued Cthulhu with dare_ties).
Here is my analysis of the situation, comparing your StationV configuration directly against the Circuitry configuration you provided.
The fact that Circuitry works well with normalize: false and a total weight > 1.0 is the "smoking gun" that proves Theory 1 (The Soup/Cancellation Effect) is the primary issue with StationV.
Here is the breakdown of why Circuitry succeeds where StationV fails:
1. Hierarchy vs. Flat Structure (The "Driver" Problem)
This is the most critical difference.
- Circuitry (The King & Advisors): Look at the weights. `Cydonia` is set to 0.6, while the others are 0.3 and 0.4.
  - In the `ties` merge method, when parameters conflict, the model with the higher magnitude (weight) usually wins the "vote."
  - Circuitry works because Cydonia is clearly driving the car. The other two models are just passengers shouting suggestions. The model has a clear "personality" because it is essentially Cydonia + flavor.
  - Why it works: This isn't a soup; it's a directed upgrade. The merge is heavily biased toward Cydonia v4.2. The `ties` method has an easy job here: Cydonia provides the main direction, and the other two just flavor the parameters where Cydonia is less dominant. The "saturation" (sum > 1.0) works here because it amplifies a coherent signal.
- StationV (The Committee): You have 14 models, and every single one is set to roughly 0.1.
  - There is no driver. You have 14 people fighting for the steering wheel with equal strength.
  - When Model A wants to be poetic, Model B wants to be lewd, and Model C wants to be logical, the `ties` method looks at the weights, sees they are all equal, and mathematically cancels out the extremes.
  - The Result: The model regresses to the mean. It outputs the most statistically "safe" tokens because the distinct personality spikes of the 14 models have neutralized each other. This explains the "minimal answers" and lack of progression.
  - Why it fails: This is a flat hierarchy. No single model is driving the narrative. When `ties` looks for the "majority vote" on a parameter among 14 models with equal voting power, the result is often a "safe," average value.
  - The "Push" Issue: Because no single model has high enough weight to force a creative direction, the model waits for you (the user) to provide the direction. (A sketch contrasting the two schemes follows this list.)
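To make the contrast concrete, here is a hypothetical side-by-side of the two weighting schemes (placeholder names, not the real configs; each YAML document would be a separate mergekit run):

```yaml
# "King & Advisors" (Circuitry-style): one clear driver, two advisors.
merge_method: ties
base_model: placeholder/mistral-small-24b-base
models:
  - model: placeholder/driver        # the "king" wins most sign votes
    parameters: {weight: 0.6, density: 0.7}
  - model: placeholder/advisor-a
    parameters: {weight: 0.4, density: 0.7}
  - model: placeholder/advisor-b
    parameters: {weight: 0.3, density: 0.7}
parameters:
  normalize: false   # total weight 1.3 saturates a coherent signal
dtype: bfloat16
---
# "Committee" (StationV-style): 14 equal voices, no driver; sign votes
# tie constantly and the extremes cancel toward the mean.
merge_method: ties
base_model: placeholder/mistral-small-24b-base
models:
  - model: placeholder/donor-01
    parameters: {weight: 0.1, density: 0.7}
  # ...13 more donors, all at roughly weight 0.1
parameters:
  normalize: false   # here it only amplifies a muddy signal
dtype: bfloat16
```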
2. `normalize: false` Saturation
You are correct that `normalize: false` is a "special ingredient" for flavor, but it acts differently depending on the signal clarity.
- In Circuitry: Since Cydonia is the dominant signal, `normalize: false` (oversaturation) acts like a contrast filter. It makes Cydonia's traits more intense.
- In StationV: Because the signal is muddy (due to the 14-way split), `normalize: false` simply amplifies the noise. You are oversaturating a conflicted signal, which often leads to a model that feels "stuck" or requires heavy prompting to move forward because its internal probabilities are fried. (See the arithmetic sketch after this list.)
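The arithmetic behind the "saturation" point, as I understand it (illustrative numbers only):

```yaml
# With donor weights 0.6 + 0.4 + 0.3 (sum = 1.3):
#   normalize: true  -> weights are rescaled by 1/1.3 to ~0.46/0.31/0.23,
#                       so the combined task vector stays at unit scale
#   normalize: false -> raw weights are summed, landing ~30% "hotter"
#                       than the base scale (the contrast-filter effect)
parameters:
  normalize: false
```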
3. The "Frankenstein" Architecture (Version Mismatch)
Naphula's point about the base models is highly relevant to the "instability" or "off" behavior.
- The Issue: `MistralThinker`, `Eurydice`, and `Transgression` being based on Mistral 2501 (v1) while the others are 2503/2506 (v3) is dangerous.
- The Symptom: Even if the weights merge, the logic doesn't. The internal pathways the model uses to determine "what happens next" are misaligned.
- Specific to `MistralThinker`: This model is trained to output `<think>` tags and internal monologues. Merging this at a low weight (0.1) into a soup of 13 other models likely breaks the narrative flow. The model partially wants to "think" and partially wants to "act," resulting in hesitation and short outputs.
It recommends no more than 4-5 models for TIES, with one as the 'designated driver'.
Maybe if you merge them into groups first it works better, idk (rough sketch below).
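The grouping idea would look roughly like this (a hypothetical sketch with placeholder names; each YAML document is its own mergekit run):

```yaml
# Stage 1: fold the 14 donors into a few themed sub-merges first
# (repeat per group: prose, logic, rp, ...).
merge_method: ties
base_model: placeholder/mistral-small-24b-base
models:
  - model: placeholder/prose-donor-a
    parameters: {weight: 0.5, density: 0.6}
  - model: placeholder/prose-donor-b
    parameters: {weight: 0.5, density: 0.6}
dtype: bfloat16
---
# Stage 2: TIES the group outputs with one clear driver, so the final
# merge has 4-5 voices instead of 14 equal ones.
merge_method: ties
base_model: placeholder/mistral-small-24b-base
models:
  - model: ./merges/driver
    parameters: {weight: 0.6, density: 0.7}
  - model: ./merges/group-prose
    parameters: {weight: 0.3, density: 0.6}
  - model: ./merges/group-logic
    parameters: {weight: 0.3, density: 0.6}
dtype: bfloat16
```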
Time for PileDriver-24B-v1
OK, so my fear was confirmed: MS 2501 is 100% incompatible with MS 2503/2506, but 2503 and 2506 are cross-compatible.
The RSCE method I'm working on has a special audit utility that, within seconds of starting the merge, displays exactly how much of the "high variance" selected by SCE's select_topk function is being taken from each model.
As you can see, the Mistral 2501 finetunes (Broken Tutu and Mistral Thinker) dominate all of the variation. It 'breaks' the merge.
StationV is broken too, just not as badly as Boreas. To fix StationV I'd have to remove Mistral Thinker and Broken Tutu. After that, TIES probably wouldn't be anywhere near as distorted; I think it would likely perform better than Gemini's prediction.
I recommend not using StationV now.
Even Goetia v1.1 has a 2501 model (Dolphin Venice), which should be removed for 1.2 (though Karcher seems more resilient to this incompatibility than TIES, flux, or SCE).
Thanks for helping catch this bug early on!
Since all the newer finetunes are using the 2503 base, this is the one I will stick with. Broken Tutu Transgression is one of the better Mistral finetunes according to many reviews, but it can't be part of Goetia for this reason.
All the good 2501 finetunes would have to be combined in a different merge (Dolphin, Thinker, Tutu, BlackSheep, Space Wars, etc.). So now I have to audit all the YAMLs again.
Update: see here a patched YAML, with Omega and Magidonia swapped in. (Magidonia is weighted lower because it is actually very similar to Precog; about a 1% difference, which makes sense as it's a different version of Drummer's dataset.)
This audit confirms your hypothesis with surgical precision. The RSCE Audit Log has perfectly separated the two lineages based on variance energy.
Here is the breakdown.
1. The "Control" Group (Compatible)
Base Model: Precog-24B-v1 (Confirmed v2503 / Mistral Small 3.2)
Look at the models with 0.5% to 1.5% influence. Many of them explicitly have "3.2" or "MS3.2" in their names:
- `CrucibleLab--M3.2...` (0.55%)
- `Darkhn--M3.2...` (0.55%)
- `Delta-Vector--MS3.2...` (1.21%)
- `Gryphe--Codex...3.2` (1.21%)
- `trashpanda...MS3.2...Mullein-v2` (0.55%)
These models share the same weight initialization as Precog. The variance is low because they are speaking the same language.
2. The "Outlier" Group (Incompatible v2501)
The models hovering around 6.0% to 7.0% are the v2501 (Mistral Small 3.0) lineage. They are mathematically "far" from the v2503 base, causing RSCE to register false high variance.
The Smoking Gun:
Compare the two trashpanda models:
- `MS3.2-24B-Mullein-v2`: 0.55% (Compatible)
- `MS-24B-Instruct-Mullein-v0`: 6.87% (Incompatible)
This proves definitively that the v0/v2501 architecture is the source of the variance spike.
3. The Purge List
You must remove the following models from your YAML to fix the merge. They are poisoning the cauldron with translation noise.
- `!models--ArliAI--Mistral-Small-24B-ArliAI-RPMax-v1.4` (6.27%)
- `!models--Nohobby--MS3-Tantum-24B-v0.1` (6.34%)
- `!models--PocketDoc--Dans-DangerousWinds-V1.1.1-24b` (6.83%)
  - Note: `Dans-PersonalityEngine` (1.32%) is safe.
- `!models--ReadyArt--Broken-Tutu-24B-Transgression-v2.0` (6.28%)
  - Note: `The-Omega-Directive` (0.55%) is safe.
- `!models--SicariusSicariiStuff--Redemption_Wind_24B` (6.86%)
  - Note: `Impish_Magic` (1.27%) is safe.
- `Sorawiz--MistralSmall-Creative-24B-Realist\FP16` (6.93%)
  - This is likely a local folder named "FP16" or similar. Find it and kill it.
- `!models--TroyDoesAI--BlackSheep-24B` (6.28%)
- `!models--Undi95--MistralThinker-v1.1` (6.88%)
- `!models--arcee-ai--Arcee-Blitz` (6.39%)
- `!models--dphn--Dolphin-Mistral-24B-Venice-Edition` (6.27%)
- `!models--spacewars123--Space-Wars-24B-v1.00a` (6.27%)
- `!models--trashpanda-org--MS-24B-Instruct-Mullein-v0` (6.87%)
4. Final Verdict
Remove these 12 models.
Your remaining set will be a pure Mistral Small 3.2 (v2503) swarm. The RSCE variance will drop to a healthy range (0.5% - 2.0%), ensuring that the "Controversy" detected is actual fine-tuning intelligence, not just architectural incompatibility.
Wow, let's hope that this will do the trick!
I hope so. At least it should fix the first stage of the pipeline. Since Goetia is more complex, I am uploading Boreas v1.1 first. It consists of all the outtakes in a simple two-stage merge. No Karcher or TIES. Let me know what you think when it drops!