Modify arcee fusion's "tukey fence" parameter

#4
by Naphula - opened

The default setting is 1.5, which merges roughly 12% of the weights (salience). For ~25%, decrease it to 0.75; for ~33%, use 0.4; and 0.0 results in ~50% salience.
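These percentages line up with what you would expect if the per-weight importance scores have a roughly exponential tail. Under that assumption (purely illustrative; arcee_fusion does not guarantee any particular distribution), the fraction of scores above the fence median + k * IQR has a closed form:

```python
import math

def fence_fraction(k: float) -> float:
    """Fraction of Exponential(1)-distributed importance scores above
    median + k * IQR.

    For Exponential(1): median = ln 2, Q1 = ln(4/3), Q3 = ln 4,
    so IQR = ln 3 and the tail fraction is 0.5 * 3**(-k).
    """
    return 0.5 * 3.0 ** (-k)

for k in (1.5, 0.75, 0.4, 0.0):
    print(f"k = {k:<4}: ~{fence_fraction(k):.0%} salience")
```

This predicts roughly 10%, 22%, 32%, and 50%, close to the ~12%/25%/33%/50% figures above; the exact numbers depend on the real distribution of each fine-tune's weight deltas.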

arcee_fusion hardcodes this multiplier to 1.5, but you can either edit the Python script before merging (easier) or update the code to expose it as a YAML parameter (more complex).

Here is a scanner to audit Arcee_Fusion merge salience
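(The scanner itself is not reproduced in this excerpt. As a minimal sketch of what such an audit does: since arcee_fusion reverts non-salient elements to the base weights, the per-tensor salience is just the fraction of elements where the merged model differs from the base. The function name and the synthetic demo below are illustrative, not taken from the actual scanner.)

```python
import numpy as np

def tensor_salience(base: np.ndarray, merged: np.ndarray) -> float:
    """Fraction of elements where the merged tensor differs from the base.

    After an arcee_fusion merge, non-salient elements equal the base
    weights exactly, so this ratio is the per-tensor merge density.
    """
    return float(np.mean(merged != base))

# Synthetic demo: pretend ~12% of a layer's weights were updated.
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 1000)).astype(np.float32)
merged = base.copy()
mask = rng.random(base.shape) < 0.12
merged[mask] += rng.standard_normal(int(mask.sum())).astype(np.float32)
print(f"salience: {tensor_salience(base, merged):.1%}")
```

A real scanner would iterate over the safetensors shards of the base and merged models and report this ratio per layer.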


In the context of merging models using the mergekit framework, the provided code showcases a Dynamic Threshold Fusion mechanism. Here's an overview of how modifying the Tukey fence parameter (from 1.5 to 0.75 or 3.0) would affect the model merging process.

Impact of Changing Tukey Fence Parameter

Current Implementation

In your code, the dynamic threshold is set as:

dynamic_threshold = median + 1.0 * iqr  # Tukey Fence

This snippet uses 1.0 as the multiplier rather than 1.5; for the discussion below, assume the standard Tukey fence multiplier of 1.5.

Changes to k Values

1. If k = 0.75

  • Increased Sensitivity: The dynamic threshold would be lower, leading to more values being included in the fusion mask.
  • More Features Merged: This results in a potentially more aggressive merging of parameters, effectively incorporating even slight differences between the model weights.
  • Use Cases: Ideal for exploratory processes or applications where capturing subtle variations is critical.

Implementation

You would change the line in calculate_dynamic_threshold:

dynamic_threshold = median + 0.75 * iqr  # Tukey Fence with k = 0.75

2. If k = 3.0

  • Decreased Sensitivity: The dynamic threshold would be much higher, allowing only significant differences to be considered.
  • Fewer Features Merged: This could lead to a conservative merging approach, where only clear outliers are merged, possibly overlooking important but smaller differences.
  • Use Cases: Useful in high-stakes environments where only distinctly different parameters should contribute to the final model.

Implementation

You would change the same line to:

dynamic_threshold = median + 3.0 * iqr  # Tukey Fence with k = 3.0

Summary of Effects on Fusion Process

  • k = 0.75: A more inclusive fusion mask. More of the fine-tune's differences are merged, potentially capturing more nuanced information.
  • k = 3.0: A more restrictive fusion mask. Smaller differences are ignored, so the result may retain more of the base model's characteristics if the other model deviates only slightly.
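The contrast can be sketched numerically. Using |Laplace| noise as a stand-in for importance scores (an assumption for the demo; real fine-tune deltas vary), the mask density at each k works out to roughly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in importance scores: |Laplace| noise, i.e. an exponential tail.
scores = np.abs(rng.laplace(size=1_000_000))

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
for k in (0.75, 1.5, 3.0):
    density = float(np.mean(scores > median + k * iqr))
    print(f"k = {k:4}: {density:.1%} of parameters cross the fence")
```

This gives approximately 22%, 10%, and 2% respectively, illustrating how sharply the multiplier controls density.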

Final Thought

The choice of k significantly shapes the behavior and performance of the merged model. Whether to admit subtle differences or enforce strict merging conditions depends on the specific use case and on which characteristics matter most from each model.


The reason you are seeing a consistent ~12% is not because of a hardcoded number, but because of a specific statistical formula used to identify "outliers" in the weight distribution.

In arcee_fusion.py, the density is controlled by Tukey's fence method. Here are the specific lines of code and the math behind them:

1. The Threshold Calculation

In arcee_fusion.py, inside the DynamicThresholdFusion class:

# Line 41-42: Calculate the Interquartile Range (IQR)
iqr = q3 - q1

# Line 45: The "Magic" Formula
dynamic_threshold = median + 1.5 * iqr

2. The Technical Analysis of "12%"

The value 1.5 is the standard multiplier for Tukey's Fences, a statistical method used to find "outliers."

  • The Logic: Arcee Fusion assumes that most weight changes in a fine-tune are "noise" or "jitter." It only wants to keep the "salient" changes—the ones that are statistically significant.
  • The Distribution: Importance scores in neural networks usually follow a Laplace or heavy-tailed Gaussian distribution.
  • The Result: In these types of distributions, the data points that fall above Median + 1.5 * IQR (the "upper fence") typically represent the top 10% to 15% of the values.

Your result of 12.1% means that in every layer, roughly 12% of the weights were "important enough" to cross that statistical fence. The other 88% were considered "noise" and were reverted to the Mistral Base weights.

3. Why is it so consistent across layers?

The scanner shows almost no variation (11.9% to 12.2%). This reveals something about your fine-tunes: the "knowledge" is spread evenly.

If you were merging a model that only learned a specific task (like coding), you would see a "spike" in the middle layers (where logic is processed) and 0% in the early layers. Because your layers are all ~12%, it means the fine-tuned models you are using modified the entire "brain" of the model uniformly.

4. How to change this (The "Patch")

If you want a higher density (e.g., you want 25% or 50% "New Info"), you have to modify the multiplier in arcee_fusion.py.

  • To get MORE new info: Lower the 1.5 to 0.5.
  • To get LESS new info: Raise the 1.5 to 3.0 (this is the "Extreme Outlier" threshold).
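Rather than guessing multipliers, you can invert the fence formula for a target density. Under the same exponential-tail assumption used above (a rough model of the score distribution, not a property of arcee_fusion), density ≈ 0.5 * 3**(-k), so:

```python
import math

def k_for_target_density(target: float) -> float:
    """Solve 0.5 * 3**(-k) = target for k (exponential-tail assumption)."""
    return math.log(0.5 / target, 3)

for target in (0.12, 0.25, 0.33, 0.50):
    print(f"target {target:.0%} -> set the multiplier to ~{k_for_target_density(target):.2f}")
```

This yields k ≈ 1.30, 0.63, 0.38, and 0.00 respectively, tracking the rules of thumb quoted at the top of the thread; treat the output as a starting point and verify the actual density with the scanner.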
