This is an experimental model. We aim to bring its text performance in line with that of pure text models while retaining its visual capabilities.

Model Highlights:

  • merge method: ASDF

  • Highest precision: merge computed in float32 (dtype: float32), output saved as bfloat16 (out_dtype: bfloat16)

  • Context length: 262,144

Parameter Settings:

Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
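
As a rough sketch, these sampling settings could be collected in a dictionary like the one below. The key names (`temperature`, `top_p`, `top_k`, `min_p`) follow common Hugging Face-style naming and are an assumption; rename them for whichever inference runtime is in use.

```python
# Recommended sampling settings from this card.
# Key names are an assumption (Hugging Face-style); adapt them to your runtime.
SAMPLING_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}
```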

Why Can Models Be Merged:

  • The text-portion tensors of the vision model have exactly the same layout (matching shapes) as those of the pure text model.
  • In terms of tensor naming, the only difference is that the vision model's tensor names contain an additional "language_model" segment.
  • Therefore, once this segment is uniformly removed from the names, the text-related tensors of the two models can be paired and merged directly.
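
The name mapping described above can be sketched as a small helper. The function name and the example tensor name are hypothetical and only illustrate the idea; the actual keys depend on the checkpoint.

```python
def vision_key_to_text_key(vision_key: str) -> str:
    """Drop the first "language_model." segment from a vision-model tensor name."""
    return vision_key.replace("language_model.", "", 1)


# Hypothetical tensor name, for illustration only.
example = "model.language_model.layers.0.mlp.up_proj.weight"
print(vision_key_to_text_key(example))
```

Names that do not contain the segment pass through unchanged, so the helper can be applied to every key without a prior check.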

How Exactly Are Models Merged:

Input

Given two weight tensors from models with identical architecture (Text and Vision branches):
$$T^{\text{text}} \in \mathbb{R}^{d_1 \times \cdots \times d_n}, \quad T^{\text{vision}} \in \mathbb{R}^{d_1 \times \cdots \times d_n}$$
For each vision tensor key k_v, remove the "language_model." segment to obtain the corresponding text model key for matching.


Step 1: Special Layer Filtering

Skip merging for embedding and language modeling head layers:

  • If the tensor name contains "embed" or "lm_head", return T^vision unchanged.
  • Otherwise, proceed only if both tensors have the same shape.
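
A minimal sketch of this filter is shown below. The function name `should_merge` and the constant `SKIP_SUBSTRINGS` are hypothetical; the source does not say what happens when shapes differ, so the sketch only reports whether merging should proceed.

```python
from typing import Sequence

# Substrings that mark embedding and LM-head tensors (from Step 1).
SKIP_SUBSTRINGS = ("embed", "lm_head")


def should_merge(name: str, text_shape: Sequence[int], vision_shape: Sequence[int]) -> bool:
    """Return True only when the tensor passes the Step 1 filter."""
    if any(s in name for s in SKIP_SUBSTRINGS):
        return False  # embedding / LM head: the vision tensor is kept as-is
    return tuple(text_shape) == tuple(vision_shape)
```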

Step 2: Type Conversion and Delta Computation

Convert to float32 for numerical stability and compute the difference tensor:
$$W^{\text{text}} = T^{\text{text}}.\text{float}(), \quad W^{\text{vision}} = T^{\text{vision}}.\text{float}()$$
$$\Delta = W^{\text{vision}} - W^{\text{text}}$$


Step 3: Early Exit for Low-Rank Tensors

If Delta has fewer than two dimensions (tensor rank < 2, e.g., a bias or LayerNorm parameter vector), SVD does not apply, so return T^vision directly.
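
Steps 2 and 3 can be sketched together as follows. This is an illustration in NumPy rather than the author's code; the function name `compute_delta` and the convention of returning `None` for the early exit are assumptions.

```python
import numpy as np


def compute_delta(t_text: np.ndarray, t_vision: np.ndarray):
    """Return the float32 delta, or None when the tensor has fewer than 2 dims."""
    w_text = t_text.astype(np.float32)
    w_vision = t_vision.astype(np.float32)
    delta = w_vision - w_text
    if delta.ndim < 2:
        return None  # early exit: the caller returns t_vision unchanged
    return delta
```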


Step 4: SVD Decomposition of Delta

Perform thin SVD on the difference tensor, treated as an m × n matrix:
$$\Delta = U \Sigma V^\top, \quad U \in \mathbb{R}^{m \times r},\ \Sigma \in \mathbb{R}^{r \times r},\ V \in \mathbb{R}^{n \times r}$$
where r = min(m, n), and Σ = diag(σ₁, …, σᵣ) with σ₁ ≥ ⋯ ≥ σᵣ ≥ 0.
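
For illustration, a thin SVD can be obtained in NumPy with `full_matrices=False`. The toy matrix below is random data, not model weights, and the variable names are assumptions. Note that NumPy returns the transposed factor directly and the singular values as a 1-D array.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = rng.standard_normal((6, 4)).astype(np.float32)  # toy m x n delta

# full_matrices=False gives the thin SVD:
# U is m x r, s holds the r singular values (descending), Vt is r x n.
U, s, Vt = np.linalg.svd(delta, full_matrices=False)
```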


Step 5: Automatic Rank Selection via Knee Point Detection

5.1 Normalize singular values and indices

Let s = (σ₁, …, σᵣ). Normalize to unit square:
$$x_i = \frac{i - 1}{r - 1}, \quad y_i = \frac{\sigma_i - \sigma_r}{\sigma_1 - \sigma_r + \varepsilon}, \quad i = 1,\dots,r$$

5.2 Compute perpendicular distance to line from first to last point

Line from (0, y_1) to (1, y_r) has direction vector (1, y_r - y_1).
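For each point (x_i, y_i), compute the magnitude of the cross product with this direction vector. It is proportional to the perpendicular distance; the constant norm of the direction vector is omitted because it does not change which point is farthest:
$$d_i = \left| x_i (y_r - y_1) - (y_i - y_1) \cdot 1 \right|$$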

5.3 Select knee index

$$k = \arg\max_i d_i, \quad k \leftarrow \max(1, k)$$
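
The knee selection in Step 5 can be sketched in plain Python as below. The function name `knee_rank`, the default `eps`, and the handling of a single singular value are assumptions. The sketch follows the 1-based indexing of the formulas, so it returns the number of components to keep; a 0-based implementation of the same argmax would keep one component fewer.

```python
from typing import Sequence


def knee_rank(sigmas: Sequence[float], eps: float = 1e-8) -> int:
    """Return the 1-based knee index k for singular values sorted descending."""
    r = len(sigmas)
    if r < 2:
        return 1  # assumption: a single singular value is always kept
    span = sigmas[0] - sigmas[-1] + eps
    xs = [i / (r - 1) for i in range(r)]             # x_i = (i - 1) / (r - 1)
    ys = [(s - sigmas[-1]) / span for s in sigmas]   # normalized singular values
    slope = ys[-1] - ys[0]                           # y_r - y_1
    dists = [abs(x * slope - (y - ys[0])) for x, y in zip(xs, ys)]
    k = max(range(r), key=dists.__getitem__) + 1     # 1-based argmax (first maximum)
    return max(1, k)
```

For a spectrum such as 10, 9, 1, 0.5, 0.1, the largest distance falls on the third point, so three components would be kept under this convention.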


Step 6: Low-Rank Reconstruction of Delta

Reconstruct Delta using top-k components:
$$\Delta_{\text{clean}} = U[:, :k] \cdot \operatorname{diag}(\sigma_1, \dots, \sigma_k) \cdot V^\top[:k, :]$$
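
A minimal NumPy sketch of this truncation follows. The function name `low_rank_reconstruct` is hypothetical, and the toy matrix is a rank-1 example chosen so that k = 1 recovers it up to floating-point error.

```python
import numpy as np


def low_rank_reconstruct(U: np.ndarray, s: np.ndarray, Vt: np.ndarray, k: int) -> np.ndarray:
    """Rebuild the delta from its top-k singular triplets."""
    # Scaling the columns of U[:, :k] by s[:k] equals U[:, :k] @ diag(s[:k]).
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Toy check on a rank-1 matrix (not model weights).
delta = np.outer([1.0, 2.0, 3.0], [4.0, 5.0]).astype(np.float32)
U, s, Vt = np.linalg.svd(delta, full_matrices=False)
delta_clean = low_rank_reconstruct(U, s, Vt, k=1)
```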


Step 7: Fuse into Final Tensor

Add cleaned delta to text base:
$$W^{\text{merged}} = W^{\text{text}} + \Delta_{\text{clean}}$$
Cast back to original dtype (e.g., bfloat16):
$$\hat{T} = W^{\text{merged}}.\text{to}(T^{\text{text}}.\text{dtype})$$
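
The fusion step can be sketched as below. This is an illustration in NumPy, and the function name `fuse` is hypothetical. Because NumPy does not provide bfloat16 natively, float16 stands in for the original dtype here.

```python
import numpy as np


def fuse(t_text: np.ndarray, delta_clean: np.ndarray) -> np.ndarray:
    """Add the cleaned delta to the text weights in float32, then restore the dtype."""
    w_merged = t_text.astype(np.float32) + delta_clean.astype(np.float32)
    return w_merged.astype(t_text.dtype)


# float16 is a stand-in for bfloat16 in this toy example.
t_text = np.ones((2, 2), dtype=np.float16)
delta_clean = np.full((2, 2), 0.5, dtype=np.float32)
merged = fuse(t_text, delta_clean)
```

Two limiting cases help as a sanity check: an all-zero cleaned delta returns the text tensor, while keeping all r components reproduces the vision tensor up to rounding.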

At the End:

  • This merging algorithm rests on one assumption: the model's visual capability is concentrated mainly in the few largest singular values of the residual (delta) tensors.
  • We have not yet systematically evaluated the model's visual capabilities; this release only demonstrates the feasibility of the merging technique.
  • We also call for further research into merging methods that unify vision and text models, in order to find algorithms that are truly suitable.