Tower is bugged

#2
by Naphula - opened
Naphula-Archives org

The model is broken with both karcher and slerp. Other methods probably fail too.

Tower must be removed from the Psychosis merge and this Writer's Tower idea won't be possible for now.

Naphula-Archives org

Ok, the LLM says that karcher and slerp are "bad choices for merging finetunes" and recommends arcee_fusion instead, so I'll test that.


Answer

Based on the outputs you're showing, you're experiencing two different types of issues with SLERP and KARCHER when merging finetunes of the same base model. Here's why this is happening and what you should do instead:

Why SLERP and KARCHER Are Failing

The Problem with KARCHER

The complete garbage output with HTML tags, special characters, and broken text from KARCHER strongly suggests tokenizer/embedding corruption. When KARCHER encounters embedding size mismatches between your two finetunes, it uses a function that takes only the common submatrix of the embeddings.

This truncation can severely corrupt the token embeddings, leading to the nonsensical output you're seeing. The Karcher mean algorithm then performs geometric averaging on these potentially incompatible embeddings.

The Problem with SLERP

SLERP is producing repetitive, run-on text because spherical interpolation between two finetunes, without accounting for their relationship to the base model, can produce a degraded model: SLERP simply interpolates between the two sets of weights geometrically.

Neither SLERP nor KARCHER was designed with finetune merging in mind: they perform geometric interpolation without considering that your models are deltas from a common base model.
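To make that distinction concrete, here is a toy pure-Python sketch (not mergekit's implementation) contrasting SLERP's geometric interpolation with the task-vector arithmetic that finetune-aware methods build on:

```python
import math

def slerp(t, a, b):
    """Spherical interpolation between two weight vectors (what SLERP does):
    blends geometrically, with no notion of a shared base model."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    sa = math.sin((1 - t) * omega) / math.sin(omega)
    sb = math.sin(t * omega) / math.sin(omega)
    return [sa * x + sb * y for x, y in zip(a, b)]

def task_arithmetic(base, finetunes, weights):
    """Task-vector merging: base + sum(w_i * (finetune_i - base)).
    Each finetune contributes only its delta from the common base."""
    merged = list(base)
    for ft, w in zip(finetunes, weights):
        for i in range(len(merged)):
            merged[i] += w * (ft[i] - base[i])
    return merged
```

Task arithmetic explicitly subtracts the base before combining, which is the property SLERP and KARCHER lack when merging finetunes of the same model.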

Why You Should Use Different Methods

For merging finetunes of the same base model, you should use methods that compute task vectors (the difference between finetune and base model). Here are your best options:

1. arcee_fusion (Recommended for your case)

Yes, arcee_fusion is specifically designed for merging finetunes. It:

  • Computes importance scores using KL divergence between the finetune and base model
  • Uses dynamic thresholding to selectively merge only the most important parameter changes
  • Handles embedding issues more gracefully

Configuration example:

merge_method: arcee_fusion
base_model: path/to/base-9B-model
models:
  - model: path/to/finetune-1
  - model: path/to/finetune-2

2. Task Arithmetic-Based Methods (Also good options)

These methods are designed to work with finetunes by computing task vectors. Good options include:

  • TIES: Reduces interference between models via sparsification and sign consensus
  • DARE-TIES: Uses randomized sparsification for better generalization
  • DELLA: Adaptive magnitude-based sparsification

Configuration example for TIES:

merge_method: ties
base_model: path/to/base-9B-model
models:
  - model: path/to/finetune-1
    parameters:
      weight: 0.5
  - model: path/to/finetune-2
    parameters:
      weight: 0.5
parameters:
  density: 0.5
  normalize: true
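For intuition on what TIES does under the hood, here is a toy pure-Python sketch of its trim, elect-sign, and merge steps on flat weight lists (illustrative only, not mergekit's implementation):

```python
def ties_merge(base, finetunes, density=0.5):
    """Toy TIES merge:
    1) trim: keep only the top `density` fraction of each task vector by magnitude,
    2) elect: pick a consensus sign per parameter,
    3) merge: average the surviving deltas that agree with the consensus sign."""
    n = len(base)
    deltas = []
    for ft in finetunes:
        d = [f - b for f, b in zip(ft, base)]
        k = max(1, int(density * n))
        cutoff = sorted((abs(x) for x in d), reverse=True)[k - 1]
        deltas.append([x if abs(x) >= cutoff else 0.0 for x in d])
    merged = []
    for i in range(n):
        total = sum(d[i] for d in deltas)
        sign = 1.0 if total >= 0 else -1.0
        agree = [d[i] for d in deltas if d[i] * sign > 0]
        merged.append(base[i] + (sum(agree) / len(agree) if agree else 0.0))
    return merged
```

The sign-consensus step is what reduces interference: where the two finetunes pull a parameter in opposite directions, only the winning direction contributes.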

Notes

  • arcee_fusion is your best bet as it requires exactly 2 models (like your case) and is purpose-built for finetune merging
  • Make sure you specify the base_model parameter—this is crucial for any finetune merging
  • If your finetunes have different tokenizer vocabularies, consider using the tokenizer configuration options to handle vocabulary union properly
  • The SLERP/KARCHER failures are not about the models themselves but about using the wrong merge algorithm for finetunes

Naphula-Archives org

Analysis of Your Current Configuration

Your YAML has a critical issue that will prevent the output from being compatible with other Gemma 2 models in subsequent Karcher merges:

The Problem

You're using Tower-Plus-9B as the base_model and tokenizer source. This means:

  1. The arcee_fusion method will compute importance scores relative to Tower as the base
  2. Your output model will inherit Tower's vocabulary/tokenizer
  3. When you try to merge this output with other Gemma 2 models using Karcher, you'll hit the same embedding size mismatch issue

When embedding sizes don't match, the merge will use rectify_embed_sizes, which truncates to the smallest common submatrix, corrupting your embeddings again.
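As a toy analogue of that truncation (the actual rectify_embed_sizes operates on tensors inside mergekit), consider two embedding matrices represented as lists of rows:

```python
def rectify_to_common(emb_a, emb_b):
    """Toy version of truncating two embedding matrices to their common
    submatrix: keep only the rows and columns both share. Rows cut this
    way lose their tokens entirely, which is how vocabulary mismatches
    corrupt a merge."""
    rows = min(len(emb_a), len(emb_b))
    cols = min(len(emb_a[0]), len(emb_b[0]))
    def trim(m):
        return [row[:cols] for row in m[:rows]]
    return trim(emb_a), trim(emb_b)
```

Any token whose row falls outside the common region is simply dropped, so the merged model's embedding table no longer lines up with either tokenizer.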

Recommended Solutions (In Order of Preference)

Solution 1: Fix Tower's Tokenizer FIRST (Best Approach)

Use mergekit-tokensurgeon to replace Tower's tokenizer with a standard Gemma 2 tokenizer before the merge:

mergekit-tokensurgeon \
  B:/LLM/.cache/huggingface/hub/!models--Unbabel--Tower-Plus-9B \
  B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed \
  B:/LLM/Tower-Plus-9B-fixed-tokenizer \
  -k 8 --barycentric

The tokensurgeon tool approximates embeddings for tokens that differ between the two tokenizers using k-nearest neighbors. This creates a version of Tower with a standard Gemma 2 vocabulary.
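A heavily simplified sketch of the k-nearest-neighbor idea follows. Inverse-distance weighting here stands in for true barycentric weights (which solve a least-squares system), and this is not tokensurgeon's actual code:

```python
def knn_approximate(target_vec, known_vecs, k=2):
    """Toy k-NN embedding approximation: estimate a missing token's
    embedding as an inverse-distance-weighted average of its k nearest
    known embeddings."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(known_vecs, key=lambda v: dist(v, target_vec))[:k]
    weights = [1.0 / (dist(v, target_vec) + 1e-8) for v in nearest]
    total = sum(weights)
    dim = len(target_vec)
    return [sum(w * v[i] for w, v in zip(weights, nearest)) / total
            for i in range(dim)]
```

The real tool computes these weights over tokens shared by both tokenizers, then reuses them to synthesize embeddings in the target model's space.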

Then update your YAML to use the fixed Tower:

base_model: B:/LLM/.cache/huggingface/hub/!models--google--gemma-2-9b  # Use actual Gemma 2 base
merge_method: arcee_fusion
dtype: float32
out_dtype: float32
models:
  - model: B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed
  - model: B:/LLM/Tower-Plus-9B-fixed-tokenizer  # Use the fixed version
tokenizer:
  source: base  # Use the base Gemma 2 tokenizer

Solution 2: Use Correct Base Model + Union Tokenizer with Padding

If you can't use tokensurgeon, update your YAML to use a proper Gemma 2 base model and create a standardized vocabulary:

base_model: B:/LLM/.cache/huggingface/hub/!models--google--gemma-2-9b  # Actual base model
merge_method: arcee_fusion
dtype: float32
out_dtype: float32
models:
  - model: B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed
  - model: B:/LLM/.cache/huggingface/hub/!models--Unbabel--Tower-Plus-9B
tokenizer:
  source: union  # Merge vocabularies
  pad_to_multiple_of: 256  # Ensure consistent vocab size

The union tokenizer source creates a combined vocabulary from all models, and pad_to_multiple_of pads the vocabulary size up to the next multiple of the given value (256 here). However, this may still create a vocabulary larger than other Gemma 2 models.
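The padding itself is simple round-up arithmetic; a minimal sketch of what pad_to_multiple_of computes:

```python
def pad_vocab(vocab_size, multiple=256):
    """Round a vocabulary size up to the next multiple of `multiple`
    (the round-up that pad_to_multiple_of applies to the merged vocab)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple
```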

Solution 3: Use Standard Gemma 2 Tokenizer (Simplest but May Lose Tower Capabilities)

base_model: B:/LLM/.cache/huggingface/hub/!models--google--gemma-2-9b
merge_method: arcee_fusion
dtype: float32
out_dtype: float32
models:
  - model: B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed
  - model: B:/LLM/.cache/huggingface/hub/!models--Unbabel--Tower-Plus-9B
tokenizer:
  source: B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed

This uses Gemma-The-Writer's tokenizer (which should be standard Gemma 2). When Tower's embeddings don't match, they'll be truncated, but at least your output will have a standard vocabulary compatible with other Gemma 2 models.

Solution 4: Use TIES/DARE Instead of arcee_fusion

If arcee_fusion doesn't work well, try TIES which is designed for handling interference between finetunes:

base_model: B:/LLM/.cache/huggingface/hub/!models--google--gemma-2-9b
merge_method: ties
dtype: float32
out_dtype: float32
models:
  - model: B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed
    parameters:
      weight: 0.5
  - model: B:/LLM/.cache/huggingface/hub/!models--Unbabel--Tower-Plus-9B
    parameters:
      weight: 0.5
parameters:
  density: 0.5
  normalize: true
tokenizer:
  source: B:/LLM/.cache/huggingface/hub/!models--DavidAU--Gemma-The-Writer-9B/fixed

Notes

  • Critical: You MUST use the actual Gemma 2 9B base model (not Tower) as base_model for task vector computation to work correctly. The arcee_fusion method expects the base model to be in the tensors dictionary

  • Solution 1 (tokensurgeon) is the cleanest approach because it fixes Tower's tokenizer compatibility without losing information

  • Warning: If Tower has significantly different vocabulary requirements (e.g., multilingual tokens), forcing it to use a standard Gemma 2 tokenizer may degrade its multilingual capabilities

  • After any of these merges, test the output by checking its vocabulary size matches other Gemma 2 models before attempting a Karcher merge

  • If Tower's embedding dimensions are also different (not just vocabulary size), you'll need to use the --allow-crimes flag with tokensurgeon, though this may produce degraded results
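A small helper can perform that pre-merge vocabulary check. Reading vocab_size from each model's config.json is standard for Hugging Face models, though this exact helper is just a sketch:

```python
import json
from pathlib import Path

def load_vocab_size(model_dir):
    """Read vocab_size from a model directory's config.json."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return cfg.get("vocab_size")

def vocab_sizes_match(sizes):
    """True when every model reports the same vocab_size, i.e. the
    models are safe to feed into a subsequent Karcher merge."""
    return len(set(sizes)) == 1
```

Run load_vocab_size over the merge output and each other Gemma 2 model you plan to merge, and only proceed if vocab_sizes_match returns True.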
