These merges did not have a seed assigned, so remerging produces different safetensors.
Future versions with dare_ties will use --random-seed 420.
Update: The same logic also applies to della — you must assign a --random-seed value if you want the merge to be fully deterministic.
The DELLA merge method employs probabilistic dropout, so it requires a random seed for reproducibility: it uses torch.bernoulli() to randomly sample which weights to keep, with probabilities that depend on each weight's magnitude.
Specifically, at the core of DELLA's della_magprune function, it computes rank-based keep probabilities and then calls torch.bernoulli(probs) to randomly decide which parameters survive and which are dropped.
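To make the mechanism concrete, here is a self-contained sketch of the rank-based sampling — simplified from della_magprune (rescaling omitted, values are illustrative), not mergekit's actual code:

```python
import torch

# Standalone sketch of DELLA-style magnitude-dependent sampling.
torch.manual_seed(420)  # without this, every run samples a different mask

tensor = torch.tensor([[0.1, -0.5, 2.0, -0.05, 0.9]])
density, epsilon = 0.6, 0.2

# Rank each parameter by |magnitude| and normalize ranks to [0, 1]
ranks = tensor.abs().argsort(dim=1).argsort(dim=1).float() + 1
rank_norm = (ranks - ranks.min()) / (ranks.max() - ranks.min())

# Keep-probability rises linearly from density - epsilon (smallest weights)
# to density + epsilon (largest weights)
probs = (density - epsilon) + rank_norm * 2 * epsilon

mask = torch.bernoulli(probs)  # the stochastic step: this is why a seed matters
pruned = tensor * mask
```

With density=0.6 and epsilon=0.2, the largest-magnitude weight (2.0) is kept with probability 0.8 and the smallest (-0.05) with probability 0.4.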
Why You Get Consistent Results on Your PC
Each time you run the merge on your PC without setting a seed, PyTorch initializes its random number generator to a consistent but machine-specific state. This explains why:
- ✅ Multiple runs on your PC produce identical results
- ❌ Your results never match the original creator's results
- ❌ Results differ between PCs
The Solution: Use --random-seed
MergeKit provides a random_seed option specifically for this purpose.
When you set a random seed, it calls transformers.trainer_utils.set_seed() to ensure reproducible behavior across PyTorch, NumPy, and Python's random module.
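A minimal demonstration of the effect, using torch.manual_seed directly (the torch-facing part of what set_seed does):

```python
import torch

# Two bernoulli draws from the same seed are identical. set_seed() in
# transformers additionally seeds NumPy and Python's random module, but
# torch.manual_seed covers the torch RNG that torch.bernoulli draws from.
probs = torch.full((1, 8), 0.5)

torch.manual_seed(420)
seeded_a = torch.bernoulli(probs)

torch.manual_seed(420)
seeded_b = torch.bernoulli(probs)  # identical: same seed, same RNG stream

unseeded = torch.bernoulli(probs)  # continues the stream: may differ
```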
To get reproducible merges, run:

```shell
mergekit-yaml config.yaml ./output --random-seed 420
```
Both you and the original creator must use the same seed value to get identical results.
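For reference, a hypothetical della config sketch (model names are placeholders; the density and epsilon parameters mirror the della_magprune signature shown below — check mergekit's docs for the exact parameter set your version supports):

```yaml
# Hypothetical della merge config — model names are placeholders
models:
  - model: some-org/model-a            # placeholder fine-tuned model
    parameters:
      weight: 0.5
      density: 0.6      # average fraction of delta weights kept
      epsilon: 0.15     # magnitude-based spread around density
merge_method: della
base_model: some-org/base-model        # placeholder base model
dtype: bfloat16
```

Running this config with the same --random-seed on two machines should then yield the same merged weights.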
While both methods use randomness, DELLA's documentation may not have emphasized this as clearly. The key difference is that DARE methods use SparsificationMethod.random (pure Bernoulli sampling with a uniform keep probability), while DELLA uses SparsificationMethod.della_magprune (magnitude-weighted Bernoulli sampling) — but both are stochastic and require a seed for reproducibility.
mergekit/sparsify.py

```python
from typing import Optional

import torch

# RescaleNorm and rescaled_masked_tensor are defined elsewhere in
# mergekit/sparsify.py.


def della_magprune(
    tensor: torch.Tensor,
    density: float,
    epsilon: float,
    rescale_norm: Optional[RescaleNorm] = None,
) -> torch.Tensor:
    # Trivial cases: keep everything or drop everything
    if density >= 1:
        return tensor
    if density <= 0:
        return torch.zeros_like(tensor)

    orig_shape = tensor.shape
    if density + epsilon >= 1 or density - epsilon <= 0:
        raise ValueError(
            "Epsilon must be chosen such that density +/- epsilon is in (0, 1)"
        )
    work_dtype = (
        tensor.dtype
        if tensor.device.type != "cpu" or tensor.dtype == torch.bfloat16
        else torch.float32
    )
    if len(tensor.shape) < 2:
        tensor = tensor.unsqueeze(0)

    # Rank parameters by magnitude, then normalize ranks to [0, 1]
    magnitudes = tensor.abs()
    sorted_indices = torch.argsort(magnitudes, dim=1, descending=False)
    ranks = sorted_indices.argsort(dim=1).to(work_dtype) + 1
    min_ranks = ranks.min(dim=1, keepdim=True).values
    max_ranks = ranks.max(dim=1, keepdim=True).values
    rank_norm = ((ranks - min_ranks) / (max_ranks - min_ranks)).clamp(0, 1)

    # Keep probabilities span (density - epsilon, density + epsilon)
    probs = (density - epsilon) + rank_norm * 2 * epsilon
    mask = torch.bernoulli(probs).to(work_dtype)  # the stochastic step that needs a seed
    res = rescaled_masked_tensor(tensor.to(work_dtype), mask, rescale_norm)
    return res.to(tensor.dtype).reshape(orig_shape)
```