license: apache-2.0
language:
  - en
  - zh
base_model:
  - Qwen/Qwen3-4B-Instruct-2507
  - Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
tags:
  - merge

This is an experimental model. The goal is to retain the vision model's visual capabilities while bringing its text performance in line with that of the pure text model.

Model Highlights:

  • Merge method: ASDF

  • Highest precision: computed in float32 (dtype: float32), saved as bfloat16 (out_dtype: bfloat16)

  • Context length: 262,144

Parameter Settings:

Temperature=0.7, TopP=0.8, TopK=20, MinP=0
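
As a rough illustration, the settings above can be written as keyword arguments. The key names below follow the argument names of the Hugging Face transformers `generate()` method, and `do_sample` is added on the assumption that sampling must be switched on for the other values to take effect; other runtimes may use different names.

```python
# Recommended sampling settings from this model card, expressed as a plain dict.
# Key names are an assumption (transformers-style); adapt them to your runtime.
SAMPLING_KWARGS = {
    "do_sample": True,   # assumption: enables sampling so the values below apply
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}
```

With transformers, such a dict would typically be unpacked into the call, e.g. `model.generate(**inputs, **SAMPLING_KWARGS)`.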

Why Can Models Be Merged:

  • The text-portion tensors of the vision model and the pure text model have exactly the same structure (identical architecture and shapes).
  • In terms of tensor naming, the only difference is that the vision model's text tensors carry an additional ".language_model" segment in their names.
  • Therefore, by uniformly removing this segment before merging, the text-related tensors of the two models can be matched and merged directly.
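
The name matching described above can be sketched in a few lines. The helper names `to_text_key` and `pair_tensors`, and the example tensor names in the usage note, are hypothetical and only illustrate the idea; the real checkpoints may use different names.

```python
def to_text_key(vision_key: str, marker: str = ".language_model") -> str:
    """Map a vision-model tensor name to its text-model counterpart.

    Removes the first occurrence of `marker`; names without it are unchanged.
    """
    return vision_key.replace(marker, "", 1)


def pair_tensors(vision_keys, text_keys):
    """Return {vision_key: text_key} for vision tensors that have a text counterpart."""
    available = set(text_keys)
    pairs = {}
    for vision_key in vision_keys:
        text_key = to_text_key(vision_key)
        if text_key in available:  # vision-only tensors have no match and are skipped
            pairs[vision_key] = text_key
    return pairs
```

For example, a hypothetical key `model.language_model.layers.0.mlp.up_proj.weight` maps to `model.layers.0.mlp.up_proj.weight`, while a vision-tower key with no text counterpart is left out of the pairing.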

How Exactly Are Models Merged:

Input

Given two weight tensors from models with identical architecture (Text and Vision branches):
$$T^{\text{text}} \in \mathbb{R}^{d_1 \times \cdots \times d_n}, \quad T^{\text{vision}} \in \mathbb{R}^{d_1 \times \cdots \times d_n}$$
For each vision tensor key k_v, remove the "language_model." segment from its name to obtain the matching text-model key.


Step 1: Special Layer Filtering

Skip merging for embedding and language modeling head layers:

  • If tensor name contains "embed" or "lm_head", return T^vision directly.
  • Proceed only if both tensors have the same shape.
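
A minimal sketch of this filter is shown below. The function name `keep_vision_tensor` is hypothetical, and returning the vision tensor on a shape mismatch is an assumption: the text only says that merging proceeds when the shapes agree.

```python
SKIP_MARKERS = ("embed", "lm_head")


def keep_vision_tensor(name, text_shape, vision_shape):
    """Return True when the vision tensor should be kept as-is (no merging)."""
    # Embedding and LM-head tensors are never merged.
    if any(marker in name for marker in SKIP_MARKERS):
        return True
    # Assumption: tensors whose shapes differ are also left unmerged.
    return tuple(text_shape) != tuple(vision_shape)
```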

Step 2: Type Conversion and Delta Computation

Convert to float32 for numerical stability and compute the difference tensor:
$$W^{\text{text}} = T^{\text{text}}.\text{float}(), \quad W^{\text{vision}} = T^{\text{vision}}.\text{float}()$$
$$\Delta = W^{\text{vision}} - W^{\text{text}}$$


Step 3: Early Exit for Low-Rank Tensors

If Δ has fewer than two dimensions (here "rank" means the number of tensor dimensions, e.g., a bias or LayerNorm parameter vector), return T^vision directly.
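
Steps 2 and 3 can be sketched together as follows. This uses NumPy as a stand-in for the tensor library, and float16 stands in for bfloat16 in any test data because NumPy has no native bfloat16; the function name `compute_delta` is hypothetical.

```python
import numpy as np


def compute_delta(t_text, t_vision):
    """Return (delta, mergeable).

    delta is the float32 difference vision - text; mergeable is False for
    tensors with fewer than two dimensions (Step 3 early exit).
    """
    w_text = np.asarray(t_text, dtype=np.float32)
    w_vision = np.asarray(t_vision, dtype=np.float32)
    delta = w_vision - w_text
    return delta, delta.ndim >= 2
```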


Step 4: SVD Decomposition of Delta

Perform thin SVD on the difference tensor:
$$\Delta = U \Sigma V^\top, \quad U \in \mathbb{R}^{m \times r},\ \Sigma \in \mathbb{R}^{r \times r},\ V \in \mathbb{R}^{n \times r}$$
where r = min(m, n), and Σ = diag(σ₁, …, σᵣ) with σ₁ ≥ ⋯ ≥ σᵣ ≥ 0.
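
To make the shapes concrete, here is a small NumPy sketch of a thin SVD on a random stand-in for Δ. The matrix size and seed are arbitrary; in PyTorch the analogous call is `torch.linalg.svd(delta, full_matrices=False)`.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.standard_normal((6, 4)).astype(np.float32)  # stand-in for Delta, m=6, n=4

# Thin SVD: U is (m, r), S holds the r singular values, Vt is (r, n), r = min(m, n).
U, S, Vt = np.linalg.svd(delta, full_matrices=False)

# Multiplying the factors back together recovers Delta up to rounding error.
reconstructed = (U * S) @ Vt
```

Note that NumPy returns the singular values as a 1-D array in descending order and returns V already transposed.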


Step 5: Automatic Rank Selection via Knee Point Detection

5.1 Normalize singular values and indices

Let s = (σ₁, …, σᵣ). Normalize to the unit square:
$$x_i = \frac{i - 1}{r - 1}, \quad y_i = \frac{\sigma_i - \sigma_r}{\sigma_1 - \sigma_r + \varepsilon}, \quad i = 1,\dots,r$$

5.2 Compute perpendicular distance to line from first to last point

Line from (0, y_1) to (1, y_r) has direction vector (1, y_r - y_1).
For each point (x_i, y_i), compute the cross-product distance to this line in the normalized coordinates. The constant factor 1/‖(1, y_r − y_1)‖ is omitted because it does not change the arg max:
$$d_i = \left| (x_i)(y_r - y_1) - (y_i - y_1)(1) \right|$$

5.3 Select knee index

$$k = \arg\max_i d_i, \quad k = \max(1, k)$$
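
The knee detection in 5.1 to 5.3 can be sketched in plain Python. Two points are assumptions rather than statements from the text: the arg max is taken as a 0-based index (which is what makes the max(1, k) guard meaningful), and a single singular value returns k = 1 because x_i is undefined for r = 1. The function name `knee_index` is hypothetical.

```python
def knee_index(sigmas, eps=1e-12):
    """Pick the rank k from singular values sorted in descending order."""
    r = len(sigmas)
    if r == 0:
        raise ValueError("need at least one singular value")
    if r == 1:
        return 1  # assumption: (i - 1) / (r - 1) is undefined, keep the one component
    s_first, s_last = sigmas[0], sigmas[-1]
    # 5.1: normalize indices and singular values to the unit square.
    xs = [i / (r - 1) for i in range(r)]
    ys = [(s - s_last) / (s_first - s_last + eps) for s in sigmas]
    # 5.2: unnormalized distance of each point to the first-to-last line.
    y1, yr = ys[0], ys[-1]
    ds = [abs(x * (yr - y1) - (y - y1)) for x, y in zip(xs, ys)]
    # 5.3: 0-based arg max (first maximum wins), clamped to at least 1.
    k = max(range(r), key=ds.__getitem__)
    return max(1, k)
```

For the spectrum `[10.0, 9.0, 1.0, 0.5, 0.1]` the largest distance occurs at the third point, so under the 0-based convention this sketch returns k = 2 and the two dominant components are kept.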


Step 6: Low-Rank Reconstruction of Delta

Reconstruct Δ using the top-k components:
$$\Delta_{\text{clean}} = U[:, :k] \cdot \operatorname{diag}(\sigma_1, \dots, \sigma_k) \cdot V^\top[:k, :]$$


Step 7: Fuse into Final Tensor

Add the cleaned delta to the text base:
$$W^{\text{merged}} = W^{\text{text}} + \Delta_{\text{clean}}$$
Cast back to the original dtype (e.g., bfloat16):
$$\hat{T} = W^{\text{merged}}.\text{to}(T^{\text{text}}.\text{dtype})$$
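
Putting Steps 1 to 7 together, a per-tensor merge might look like the NumPy sketch below. It is not the code used to build this model. The names `knee_rank` and `merge_tensor` are hypothetical, the 0-based arg max is an assumption, and tensors that are not exactly 2-D or whose shapes differ are returned unmerged, which goes slightly beyond what the steps above specify.

```python
import numpy as np


def knee_rank(s, eps=1e-12):
    """Step 5: knee point of descending singular values (0-based arg max, min 1)."""
    r = s.shape[0]
    if r < 2:
        return 1  # assumption: nothing to select with a single singular value
    x = np.arange(r) / (r - 1)
    y = (s - s[-1]) / (s[0] - s[-1] + eps)
    d = np.abs(x * (y[-1] - y[0]) - (y - y[0]))
    return max(1, int(np.argmax(d)))


def merge_tensor(name, t_text, t_vision):
    """Merge one matched tensor pair following Steps 1 to 7."""
    # Step 1: embeddings, LM head, and (assumption) shape mismatches stay as vision.
    if "embed" in name or "lm_head" in name or t_text.shape != t_vision.shape:
        return t_vision
    # Step 2: compute the delta in float32.
    w_text = t_text.astype(np.float32)
    delta = t_vision.astype(np.float32) - w_text
    # Step 3: vectors and scalars are returned unmerged. Tensors with more than
    # two dimensions are treated the same way here (assumption).
    if delta.ndim != 2:
        return t_vision
    # Step 4: thin SVD of the delta.
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    # Step 5: automatic rank selection.
    k = knee_rank(s)
    # Step 6: rank-k reconstruction of the delta.
    delta_clean = (U[:, :k] * s[:k]) @ Vt[:k, :]
    # Step 7: add to the text base and cast back to the original dtype.
    return (w_text + delta_clean).astype(t_text.dtype)
```

When the delta is dominated by a single strong component, this sketch keeps only that component, so the merged tensor stays close to the vision tensor while the small residual directions are discarded.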

At the End:

  • This merging algorithm rests on one assumption: the model's visual capability is concentrated mainly in the few largest singular values of the residual (delta) tensors.
  • We have not yet systematically evaluated the model's visual capabilities; this release is intended only to demonstrate the feasibility of the merging technique.
  • We also call for further research into merging methods that unify vision and text models, in order to find algorithms that are truly suited to the task.