Vibe check 🫧

#1
by McG-221 - opened

Something unexpected here, can't quite put my finger on it... but it's different, isn't it? Did you put a secret ingredient in here? ✨

Owner

Yeah, I was surprised at how good Maginum Cydoms is; it surpassed my expectations but wasn't reproducible due to --random-seed differences.

On a whim I tested putting all the components into Della, skipping TIES and SLERP. So this merge uses the same components as Maginum, but turned out differently and seems to be less censored overall.

I tried setting up mergekit-evolve to run with Della, but in the end my PC was too slow, so I just ended up assigning equal numbers to everything.

It appears that DELLA creates more of an "emergent personality" than Karcher. Asmodeus v1 (16 models) is even more different yet: it doesn't resort to putting everything in bullet-point lists and actually writes in paragraph format. And I just merged another one with 32 models to test next.

I think the secret ingredient is the fact that DELLA allows you to bridge 2501 and 2503 models, which previously wasn't possible with SLERP/Karcher. The guy who made Maginum discovered this; I wasn't expecting it to be functional, yet it is. The other trick seems to be @OddTheGreat's method of setting normalize: false and then letting the total weights sum to >1.
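For anyone wondering what normalize: false actually changes, here's a rough sketch of the idea (a toy illustration, not mergekit's real code): with normalization the per-model weights are divided by their sum, so the combined task vectors' total influence is capped at 1.0; without it, weights summing to >1 amplify the merged deltas.

```python
def combine_deltas(deltas, weights, normalize):
    # deltas: list of per-parameter difference vectors (model - base)
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]  # total influence capped at 1.0
    return [sum(w * d[i] for w, d in zip(weights, deltas))
            for i in range(len(deltas[0]))]

# Six models at weight 0.5 each (as in the config below): total weight 3.0
deltas = [[1.0, 1.0]] * 6
print(combine_deltas(deltas, [0.5] * 6, normalize=True))   # capped at base + 1.0
print(combine_deltas(deltas, [0.5] * 6, normalize=False))  # amplified 3x
```

With normalize: false, those identical unit deltas land three times as hard on the base model, which is presumably where the extra "flavor" comes from.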

All my future DARE/DELLA merges are now set to use seed #420 by default:

timeout /t 3 /nobreak && mergekit-yaml C:\mergekit-main\config.yaml C:\mergekit-main\merged_model_output --copy-tokenizer --allow-crimes --out-shard-size 5B --trust-remote-code --lazy-unpickle --random-seed 420 --cuda

architecture: MistralForCausalLM
models:
  - model: B:\24B\!models--anthracite-core--Mistral-Small-3.2-24B-Instruct-2506-Text-Only
  - model: B:\24B\!models--TheDrummer--Cydonia-24B-v4.3
    parameters:
      density: 0.75
      weight: 0.5
      epsilon: 0.25
  - model: B:\24B\!models--ReadyArt--4.2.0-Broken-Tutu-24b
    parameters:
      density: 0.75
      weight: 0.5
      epsilon: 0.25
  - model: B:\24B\!models--zerofata--MS3.2-PaintedFantasy-v2-24B
    parameters:
      density: 0.75
      weight: 0.5
      epsilon: 0.25   
  - model: B:\24B\!models--TheDrummer--Magidonia-24B-v4.3
    parameters:
      density: 0.75
      weight: 0.5
      epsilon: 0.25
  - model: B:\24B\!models--TheDrummer--Precog-24B-v1
    parameters:
      density: 0.75
      weight: 0.5
      epsilon: 0.25
  - model: B:\24B\!models--zerofata--MS3.2-PaintedFantasy-v3-24B
    parameters:
      density: 0.75
      weight: 0.5
      epsilon: 0.25
# Seed: 420 
merge_method: della
base_model: B:\24B\!models--anthracite-core--Mistral-Small-3.2-24B-Instruct-2506-Text-Only
parameters:
  lambda: 1.0
  normalize: false
  int8_mask: false
dtype: float32
out_dtype: bfloat16
tokenizer:
  source: union
chat_template: auto
name: 🌠Ślimaki-24B-v1

Lastly, I added a 'safety net' for the epsilon parameter so it can't break the merge.

# Patched copy of mergekit's DELLA magnitude-prune helper; RescaleNorm and
# rescaled_masked_tensor come from the surrounding mergekit module.
def della_magprune(
    tensor: torch.Tensor,
    density: float,
    epsilon: float,
    rescale_norm: Optional[RescaleNorm] = None,
) -> torch.Tensor:
    if density >= 1:
        return tensor
    if density <= 0:
        return torch.zeros_like(tensor)
    
    # --- SAFETY GUARD START ---
    # Ensure density isn't exactly 0 or 1
    density = max(1e-4, min(1.0 - 1e-4, density))
    
    # Epsilon must be < density AND < (1 - density)
    # If the optimizer guessed a bad epsilon, we shrink it to the max allowed value
    max_epsilon = min(density, 1.0 - density) - 1e-4
    if abs(epsilon) > max_epsilon:
        epsilon = max_epsilon if epsilon > 0 else -max_epsilon
    # --- SAFETY GUARD END ---

    orig_shape = tensor.shape
    work_dtype = (
        tensor.dtype
        if tensor.device.type != "cpu" or tensor.dtype == torch.bfloat16
        else torch.float32
    )

    if len(tensor.shape) < 2:
        tensor = tensor.unsqueeze(0)
    magnitudes = tensor.abs()

    sorted_indices = torch.argsort(magnitudes, dim=1, descending=False)
    ranks = sorted_indices.argsort(dim=1).to(work_dtype) + 1

    min_ranks = ranks.min(dim=1, keepdim=True).values
    max_ranks = ranks.max(dim=1, keepdim=True).values
    # Clamp the denominator so single-element rows (max == min) don't divide by zero
    rank_norm = ((ranks - min_ranks) / (max_ranks - min_ranks).clamp_min(1e-8)).clamp(0, 1)
    
    # Now this line is guaranteed not to produce values < 0 or > 1
    probs = (density - epsilon) + rank_norm * 2 * epsilon
    mask = torch.bernoulli(probs.clamp(0, 1)).to(work_dtype)

    res = rescaled_masked_tensor(tensor.to(work_dtype), mask, rescale_norm)
    return res.to(tensor.dtype).reshape(orig_shape)
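A quick sanity check of just the guard logic, pulled out as a standalone function (no torch needed), showing how an out-of-range epsilon gets shrunk so the Bernoulli probabilities stay in [0, 1]:

```python
def clamp_epsilon(density, epsilon, eps=1e-4):
    # Mirrors the safety guard above: |epsilon| must stay below
    # min(density, 1 - density) so density +/- epsilon remains in [0, 1].
    density = max(eps, min(1.0 - eps, density))
    max_epsilon = min(density, 1.0 - density) - eps
    if abs(epsilon) > max_epsilon:
        epsilon = max_epsilon if epsilon > 0 else -max_epsilon
    return density, epsilon

d, e = clamp_epsilon(0.75, 0.25)  # the values the config above uses
print(e)                          # shrunk just below 0.25
assert 0.0 <= d - e and d + e <= 1.0
```

Fun detail: the config's own density: 0.75 / epsilon: 0.25 sits exactly on the boundary (0.75 + 0.25 = 1.0), so the guard nudges epsilon down a hair on every merge.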

Definitely the 420 seed. Has to be it.

Jokes aside, the model really shows that emergent personality you described. See my sampling parameters below; this is where it "clicked" for me:

[Screenshot: sampling parameters, 03.02.26, 05:00]

Owner

Nice, I'll try those settings. Have you tested with top-n-sigma 1.25 or dynamic temperature? I was noticing big variations even at temp 0.

I don't think LM Studio exposes those settings on MLX. But I'm always very interested in what the creator (in this case, you!) intended as the default settings... so, what do you use to bring out the character the most? 😉

Owner

I use either benchmark mode or creative mode. The creative-mode settings I'm currently using (for Mistral) are posted on the Goetia v1.2 page. Benchmark mode is just temp 0, top p 0.95, and rep pen 1.12.

Standard benchmark mode lets me see how much low-temp variation there is and is more reliable for testing Q0/compliance.
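For reference, "benchmark mode" as an actual generation call might look like the sketch below, aimed at an OpenAI-compatible local server. The URL, port, and model name are placeholders for whatever your setup exposes, and repetition_penalty is a non-standard field that some local backends (e.g. vLLM, text-generation-webui) accept:

```python
import json
import urllib.request

# Benchmark-mode sampling settings from the post above
payload = {
    "model": "slimaki-24b-v1",  # placeholder model id
    "messages": [{"role": "user", "content": "Test prompt"}],
    "temperature": 0.0,           # near-greedy decoding for repeatable tests
    "top_p": 0.95,
    "repetition_penalty": 1.12,   # non-standard extension field
}
req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with a server running
```

Running the same prompt a few times with these settings makes any residual low-temp variation easy to spot.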

Will have to look at Goetia v1.2, then... like the little treasure hunter that I am 🤠

