This is an experimental model. The goal is to retain the model's visual capabilities while bringing its text performance in line with that of pure text models.
Model Highlights:
- Merge method: ASDF
- Highest precision: dtype: float32 + out_dtype: bfloat16
- Context length: 262,144
Parameter Settings:
Temperature=0.7, TopP=0.8, TopK=20, MinP=0
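For convenience, the same settings can be written as a keyword dictionary. This is only a sketch: the snake_case key names follow the convention used by common inference toolkits and are an assumption here, so check the names your own toolkit expects before passing them on.

```python
# Recommended sampling settings from this card, as a plain dictionary.
# Key names are an assumption (snake_case convention); adapt them to your toolkit.
sampling_params = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}
```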
Why Can Models Be Merged:
- The text-portion tensors of the vision model and the pure text model have exactly the same structure.
- In terms of tensor naming, the only difference between the two is the extra "language_model" component in the vision model's tensor names.
- Therefore, removing this component from every key before merging lets the text-related tensors of the two models be matched and merged directly.
How Exactly Are Models Merged:
Input
Given two weight tensors, T^text and T^vision, from models with identical architecture (the Text and Vision branches).
For each vision tensor key k_v, strip the "language_model." prefix to obtain the corresponding text model key for matching.
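A minimal sketch of this key matching, assuming checkpoint keys are plain strings. The function names and the example key in the usage note are hypothetical, and the sketch removes the first occurrence of "language_model." wherever it sits in the key, which is an assumption about the checkpoint layout.

```python
def vision_key_to_text_key(vision_key: str, marker: str = "language_model.") -> str:
    """Drop the first occurrence of the marker so the key matches the text checkpoint."""
    return vision_key.replace(marker, "", 1)


def match_keys(vision_keys, text_keys):
    """Map each vision key to its text counterpart, keeping only keys found in both."""
    text_keys = set(text_keys)
    pairs = {}
    for k_v in vision_keys:
        k_t = vision_key_to_text_key(k_v)
        if k_t in text_keys:
            pairs[k_v] = k_t
    return pairs
```

For example, `vision_key_to_text_key("model.language_model.layers.0.mlp.up_proj.weight")` yields `"model.layers.0.mlp.up_proj.weight"`. Vision-only keys have no counterpart in the text checkpoint and are simply left out of the mapping.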
Step 1: Special Layer Filtering
Skip merging for embedding and language modeling head layers:
- If the tensor name contains "embed" or "lm_head", return T^vision directly.
- Proceed only if both tensors have the same shape.
Step 2: Type Conversion and Delta Computation
Convert both tensors to float32 for numerical stability and compute the difference tensor:
$$\Delta = \mathrm{float32}(T^{\text{vision}}) - \mathrm{float32}(T^{\text{text}})$$
Step 3: Early Exit for One-Dimensional Tensors
If Delta is a vector (i.e., it has fewer than two dimensions, as with bias or LayerNorm parameters), return T^vision directly.
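Steps 1 to 3 can be sketched as a single pre-check, shown here with NumPy arrays standing in for checkpoint tensors. The function name `prepare_delta` is hypothetical. A return value of `None` means the caller keeps T^vision unchanged; applying that to the shape-mismatch case is an assumption, since the text only says not to proceed.

```python
import numpy as np

SKIP_SUBSTRINGS = ("embed", "lm_head")


def prepare_delta(name, t_text, t_vision):
    """Return the float32 delta if the tensor should go through SVD, else None."""
    # Step 1: embedding and LM-head tensors are never merged.
    if any(s in name for s in SKIP_SUBSTRINGS):
        return None
    # Step 1: both tensors must have the same shape (None here is an assumption).
    if t_text.shape != t_vision.shape:
        return None
    # Step 2: upcast to float32 before subtracting.
    delta = t_vision.astype(np.float32) - t_text.astype(np.float32)
    # Step 3: vectors (bias, LayerNorm parameters) skip the SVD path.
    if delta.ndim < 2:
        return None
    return delta
```

NumPy has no native bfloat16, so float16 inputs are a reasonable stand-in when trying this out.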
Step 4: SVD Decomposition of Delta
Perform thin SVD on the m × n difference tensor:
$$\Delta = U \Sigma V^\top$$
where U is m × r, V is n × r, r = min(m, n), and Σ = diag(σ₁, …, σᵣ) with σ₁ ≥ ⋯ ≥ σᵣ ≥ 0.
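A minimal NumPy sketch of the thin SVD, using a random stand-in matrix rather than real model weights. The 6 × 4 shape and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.standard_normal((6, 4)).astype(np.float32)  # stand-in for Delta

# full_matrices=False requests the thin SVD: u is (m, r), s is (r,), vt is (r, n).
u, s, vt = np.linalg.svd(delta, full_matrices=False)

# The singular values come back sorted in descending order.
reconstructed = (u * s) @ vt
```

With all r components kept, `reconstructed` matches `delta` up to floating-point error.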
Step 5: Automatic Rank Selection via Knee Point Detection
5.1 Normalize singular values and indices
Let s = (σ₁, …, σᵣ). Normalize to unit square:
5.2 Compute perpendicular distance to line from first to last point
Line from (0, y_1) to (1, y_r) has direction vector (1, y_r - y_1).
For each point (x_i, y_i), compute normalized cross-product distance:
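Written out, the vector from the first point to (x_i, y_i) is (x_i, y_i - y_1), and its cross product with the direction vector is divided by the length of that direction vector. The following is a reconstruction from the definitions above, and the symbol d_i is introduced here for illustration:

```latex
d_i = \frac{\left| x_i \,(y_r - y_1) - (y_i - y_1) \right|}{\sqrt{1 + (y_r - y_1)^2}}
```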
5.3 Select knee index
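Step 5 as a whole can be sketched as follows. Two details are assumptions rather than taken from this card: the singular values are min-max normalized to [0, 1], and the selected rank is the knee position counted from 1, so that the knee component itself is kept. The function name `knee_rank` is hypothetical.

```python
import numpy as np


def knee_rank(singular_values):
    """Pick a rank k at the point farthest from the first-to-last chord."""
    s = np.asarray(singular_values, dtype=np.float64)
    r = s.size
    if r < 3:
        return r  # too few points to define a knee
    span = s[0] - s[-1]
    if span <= 0:
        return r  # flat spectrum: keep everything
    # 5.1: map indices and singular values into the unit square.
    x = np.linspace(0.0, 1.0, r)
    y = (s - s[-1]) / span
    # 5.2: perpendicular distance to the line from (0, y_1) to (1, y_r).
    dy = y[-1] - y[0]
    dist = np.abs(x * dy - (y - y[0])) / np.sqrt(1.0 + dy * dy)
    # 5.3: the knee is the point with the largest distance.
    return int(np.argmax(dist)) + 1
```

For a sharply decaying spectrum such as `[10, 2, 1, 0.5, 0.3, 0.1]`, this sketch returns 2.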
Step 6: Low-Rank Reconstruction of Delta
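Reconstruct Delta using the top-k components:
$$\Delta_k = U_k \Sigma_k V_k^\top$$
where U_k and V_k hold the first k left and right singular vectors and Σ_k = diag(σ₁, …, σ_k).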
Step 7: Fuse into Final Tensor
Add the cleaned delta to the text base:
$$T^{\text{merged}} = T^{\text{text}} + \Delta_k$$
Then cast the result back to the original dtype (e.g., bfloat16).
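Steps 6 and 7 can be sketched together for a single 2-D tensor, with the rank k passed in by the caller. The function name `fuse_low_rank` is hypothetical, and treating the text tensor's dtype as the "original dtype" is an assumption.

```python
import numpy as np


def fuse_low_rank(t_text, t_vision, k):
    """Add the rank-k approximation of (vision - text) back onto the text tensor."""
    out_dtype = t_text.dtype  # assumption: cast back to the text tensor's dtype
    base = t_text.astype(np.float32)
    delta = t_vision.astype(np.float32) - base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    delta_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    return (base + delta_k).astype(out_dtype)
```

When the true difference has rank at most k, the result reproduces the vision tensor up to rounding. With k = 0 the result is the text tensor itself.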
Final Notes:
- This merging algorithm rests on the following assumption: the model's visual capability is mainly concentrated in the few largest singular values of the residual (Delta) tensors.
- We have not yet conducted a systematic evaluation of the model's visual capabilities. The model is released only to demonstrate the feasibility of the merging technique.
- We also encourage further research into merging methods that unify vision and text models, in order to find algorithms that are truly suited to the task.
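One possible way to probe the assumption about singular values, offered here as an illustration and not as part of the merging procedure, is to measure how much of a delta's squared Frobenius norm the top-k singular values carry. The function name `energy_fraction` is hypothetical.

```python
import numpy as np


def energy_fraction(singular_values, k):
    """Share of the squared Frobenius norm captured by the k largest singular values."""
    s2 = np.asarray(singular_values, dtype=np.float64) ** 2
    total = s2.sum()
    if total == 0:
        return 1.0  # an all-zero delta is captured trivially
    return float(s2[:k].sum() / total)
```

A value close to 1 for a small k would be consistent with the assumption for that tensor. It says nothing by itself about downstream visual quality.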