
Anima 2B with Qwen 3.5 4B

Table of Contents

  1. The Problem
  2. Understanding the Architecture
  3. The Scaling Problem: 4B vs 0.6B
  4. Discovery: The ExpRMSNorm Breakthrough
  5. Procrustes Alignment: Rotating One Brain to Match Another
  6. Per-Dimension Affine Calibration
  7. Recommended Settings for Users
  8. The Mamba2 SSM Rewrite
  9. Tokenizer: Why Qwen3 ≠ Qwen3.5
  10. Timeline & Iteration History

The Problem

Anima 2B ships with a Qwen 3 0.6B text encoder: a small, standard transformer. The model works fine, but 0.6B parameters is a significant bottleneck for understanding complex prompts. nightknocker released a Qwen 3.5 4B hybrid encoder trained for the same ecosystem, promising better text comprehension.

The catch: you can't just swap one text encoder for another. The Anima diffusion model's LLM adapter was trained against the 0.6B's specific embedding distribution. Even though both encoders output 1024-dimensional vectors, they speak completely different "languages": different magnitude scales, different directions for the same concepts, different statistical distributions.

Our initial naive implementation loaded correctly (all 426/426 weight tensors, 4.14B parameters, no errors), produced valid embeddings with no NaN/Inf... and generated images that were consistently worse than the tiny 0.6B.

This document explains every problem we encountered and how we solved each one.


Understanding the Architecture

Qwen 3.5 4B is not a standard transformer. It's a hybrid model alternating between two fundamentally different sequence processing mechanisms:

| Property | Value |
|---|---|
| Total layers | 32 |
| SSM (Mamba2) layers | 24 (positions 0, 1, 2, 4, 5, 6, ..., 28, 29, 30) |
| Self-Attention layers | 8 (positions 3, 7, 11, 15, 19, 23, 27, 31) |
| Hidden size | 2560 |
| Output dimension | 1024 (after projection) |
| Vocabulary | 248,320 tokens |
| Weight format | FP8 (F8_E4M3) with BF16 norms |

The pattern is simple: every 4th layer is self-attention, the other three are SSM blocks. The final layer (31) is attention-only with no MLP. This hybrid design gives the model the long-range memory of state space models with periodic full-attention "checkpoints" for global context.
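
The schedule above fits in two lines of Python (a sketch; `layer_kind` is our own name, not anything from the released code):

```python
# Sketch of the hybrid layer schedule: every 4th layer (index % 4 == 3)
# is self-attention, the other three are Mamba2 SSM blocks.
def layer_kind(i: int) -> str:
    return "attention" if i % 4 == 3 else "ssm"

schedule = [layer_kind(i) for i in range(32)]
attention_positions = [i for i, k in enumerate(schedule) if k == "attention"]
```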

Output pipeline: The raw 2560-dim hidden states go through a learned projection:

Linear(2560 → 1024) → ExpRMSNorm(1024) → SiLU → Linear(1024 → 1024)

This maps the model's internal representation into the 1024-dim space that the Anima adapter expects.
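
As a shape-level sketch of that pipeline in NumPy (the weight/bias names are placeholders, and whether the linear layers carry biases is an assumption on our part):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def exp_rms_norm(x, w, eps=1e-6):
    # Exponential-weight RMSNorm: the scale is exp(w), not w (see the
    # ExpRMSNorm section below for why this matters here).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return np.exp(w) * (x / rms)

def output_head(h, W1, b1, w_norm, W2, b2):
    # Linear(2560 -> 1024) -> ExpRMSNorm(1024) -> SiLU -> Linear(1024 -> 1024)
    x = h @ W1 + b1              # (T, 2560) -> (T, 1024)
    x = exp_rms_norm(x, w_norm)
    x = silu(x)
    return x @ W2 + b2           # (T, 1024) -> (T, 1024)
```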


The Scaling Problem: 4B vs 0.6B

Here's what the raw output distributions look like side by side:

| Metric | 0.6B (Original) | 4B (Raw) | Ratio |
|---|---|---|---|
| Global mean | -0.068 | 0.0015 | ~45× difference |
| Global std | 3.36 | 0.33 | ~10× smaller |
| L2 norm / token | 106.6 | 10.5 | ~10× smaller |

The 4B encoder's outputs are roughly 10× smaller in magnitude than what the Anima adapter expects. Imagine whispering instructions to someone who's used to being shouted at: the signal is there, but it's far too quiet to drive the diffusion process effectively.

This isn't a bug; it's a consequence of two models with different architectures, different training procedures, and different internal normalizations producing embeddings at different scales. The 0.6B was the encoder that the adapter was trained against, so its scale IS the expected scale.


Discovery: The ExpRMSNorm Breakthrough

Before we could even think about alignment, we had to fix a fundamental error in how we interpreted the model's normalization layer.

The Mystery of the Near-Zero Weights

All 64 internal RMSNorm layers in the model have learned weights with sensible values, centered between 0.04 and 1.11. These are normal scaling factors: the model learns to emphasize some dimensions and suppress others.

But the late normalization layer (the one in the output projection) had weights centered around -0.003. Nearly zero.

With standard RMSNorm, those weights multiply the normalized output directly:

output = weight * (x / RMS(x))

If weight ≈ -0.003, you're scaling everything down to essentially nothing. And that's exactly what happened:

| Metric | Standard RMSNorm (broken) | ExpRMSNorm (fixed) |
|---|---|---|
| Output std | 0.018 | 0.324 (18× larger) |
| L2 / token | 0.58 | 10.37 (18× larger) |
| Token diversity | 0.003 | 0.821 (274× larger!) |
| Cross-prompt similarity | 0.999 (everything identical) | 0.689 (distinguishable) |

Token diversity of 0.003 means every single token in every single prompt was being mapped to essentially the same vector. The model's understanding was being completely destroyed at the output gate.

The Fix: exp(weight) Parameterization

The late norm uses exponential weight parameterization:

output = exp(weight) * (x / RMS(x))

With weight ≈ -0.003:

  • Standard: scale = -0.003 → collapses everything
  • Exponential: scale = exp(-0.003) ≈ 0.997 → near-identity, with tiny learned perturbations

This is the difference between "scale to zero" and "scale to approximately one with fine-grained adjustments." The only reason the late norm's weights are near-zero is that it uses this parameterization: exp(0) = 1 is the neutral point.

This single fix took token diversity from 0.003 to 0.821: from "completely collapsed" to "rich, distinguishable representations."
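
The collapse is easy to reproduce on random data. A minimal NumPy sketch of the two interpretations of the same near-zero weights:

```python
import numpy as np

def rms_norm(x, w, eps=1e-6):
    # Standard RMSNorm: multiply the normalized input by w directly.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return w * (x / rms)

def exp_rms_norm(x, w, eps=1e-6):
    # Exponential parameterization: the scale is exp(w), so w = 0 is neutral.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return np.exp(w) * (x / rms)

x = np.random.default_rng(0).normal(size=(8, 1024))
w = np.full(1024, -0.003)        # near-zero weights, as observed in the late norm
collapsed = rms_norm(x, w)       # magnitudes crushed by a factor of hundreds
healthy = exp_rms_norm(x, w)     # near-identity scaling: exp(-0.003) ~ 0.997
```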


Procrustes Alignment: Rotating One Brain to Match Another

Even after fixing the ExpRMSNorm, the 4B generated images that didn't follow the prompt well. Why? Because the 4B and 0.6B encode the same concepts in different directions.

Think of it this way: both models understand what "from the side" means, but the 0.6B might encode that as a vector pointing northeast in embedding space, while the 4B encodes it as a vector pointing southwest. The adapter was trained to interpret northeast as a side view, so when it sees southwest, it does something completely wrong.

What Is Procrustes Alignment?

Procrustes alignment finds the optimal rotation matrix R that maps one embedding space onto another:

$$R^* = \arg\min_{R} \| R \cdot X_{4B} - X_{0.6B} \|_F \quad \text{subject to} \quad R^T R = I$$

The constraint $R^T R = I$ means R is orthogonal: a pure rotation/reflection. No stretching, no squishing. Every distance between embeddings in the 4B's space is perfectly preserved; we're just reorienting the compass.

How We Computed It

We ran 41,277 prompts through both encoders and collected their mean-pooled 1024-dim embeddings. Then we applied Orthogonal Procrustes (via SVD of the cross-covariance matrix) to find the best rotation.
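
With row-vector embedding matrices, the closed-form solution is a single SVD. A NumPy sketch (the function name is ours):

```python
import numpy as np

def orthogonal_procrustes(X_src, X_tgt):
    """Orthogonal R minimizing ||X_src @ R - X_tgt||_F (row-vector convention)."""
    # SVD of the (d x d) cross-covariance between the two embedding sets.
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt
```

In the column-vector notation of the equation above, applying R·x corresponds to x @ R here; the two conventions differ only by a transpose.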

The results:

| | Before Alignment | After Alignment |
|---|---|---|
| Mean cosine similarity | -0.034 | 0.960 |
| Minimum cosine similarity | -0.115 | 0.766 |

Before alignment, the two encoders had negative average cosine similarity: their concept directions were essentially uncorrelated. After: 0.96 average agreement.

Per-category breakdown:

| Category | Before → After |
|---|---|
| Spatial (viewpoints, poses) | -0.034 → 0.960 |
| Pose | -0.021 → 0.943 |
| Composition | -0.028 → 0.956 |
| Character | -0.028 → 0.943 |
| Environment | -0.025 → 0.954 |
| Meta (quality tags) | -0.034 → 0.838 |
| Multi (complex prompts) | -0.027 → 0.898 |

Rotation vs. Bias Shift

The alignment has two components:

  1. Rotation: The 1024×1024 orthogonal matrix R that reorients concept directions. This is always applied when alignment is enabled. It fixes which direction concepts point in, without changing magnitude.

  2. Bias shift: Re-centering from the 4B's mean embedding to the 0.6B's mean embedding. The 0.6B's mean has L2 ≈ 70 while the 4B's mean has L2 ≈ 5, so the full shift dramatically changes output magnitude. This is controlled by the alignment_strength slider.

The alignment_strength parameter (0.0–1.0) only controls the bias shift, not the rotation:

x_aligned = R @ (x - mean_4b) + (1 - α) * mean_4b + α * mean_06b

  • α = 0.0: rotate only, keep the 4B's own magnitude
  • α = 0.5: rotate + halfway bias shift (recommended starting point)
  • α = 1.0: rotate + full shift to the 0.6B's distribution center
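
A NumPy sketch of how the slider enters the transform (row-vector embeddings; the function name is ours):

```python
import numpy as np

def apply_alignment(x, R, mean_4b, mean_06b, alpha):
    # The rotation R is always applied; alpha only blends the bias shift
    # between the 4B's mean (alpha = 0) and the 0.6B's mean (alpha = 1).
    rotated = (x - mean_4b) @ R.T   # R @ (x - mean_4b) for row vectors
    return rotated + (1.0 - alpha) * mean_4b + alpha * mean_06b
```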

Per-Dimension Affine Calibration

Beyond rotation, the two encoders also differ in their per-dimension scales. Dimension 42 in the 0.6B might have 3× the variance of dimension 42 in the 4B, while dimension 500 might be 0.5×.

The calibration computes a per-dimension affine transform:

output_calibrated[d] = scale[d] * output_4b[d] + bias[d]

Where:

scale[d] = std_06b[d] / std_4b[d]
bias[d]  = mean_06b[d] - scale[d] * mean_4b[d]
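
Fitting and applying that transform takes a few lines of NumPy (a sketch; the small epsilon guarding the division is our addition):

```python
import numpy as np

def fit_calibration(X_4b, X_06b, eps=1e-8):
    # Per-dimension moment matching: scale/shift the 4B's outputs so each
    # dimension matches the 0.6B's mean and standard deviation.
    scale = X_06b.std(axis=0) / (X_4b.std(axis=0) + eps)
    bias = X_06b.mean(axis=0) - scale * X_4b.mean(axis=0)
    return scale, bias

def apply_calibration(x, scale, bias):
    return scale * x + bias
```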

Calibration statistics (from 30 diverse prompts):

| Statistic | Value |
|---|---|
| Scale range | 1.03–79.7 |
| Scale mean | 5.47 |
| Bias mean | -0.075 |

Most dimensions need ~5× scaling. Some need up to 80×. This makes sense given the ~10× overall magnitude difference; it's not uniform across dimensions.

Note: Calibration and alignment serve different purposes. Alignment fixes directions (rotation). Calibration fixes magnitudes (per-dimension scaling). They can be used independently or together.


Recommended Settings for Users

Start Simple, Add Complexity

Step 1: Baseline (no alignment, no calibration)

use_alignment: OFF
use_calibration: OFF
output_scale: 1.0

Generate some images with your usual prompts. This gives you the raw 4B output: better text understanding, but the adapter may misinterpret concept directions.

Step 2: Add alignment at half strength

use_alignment: ON
alignment_strength: 0.5
use_calibration: OFF
output_scale: 1.0

This rotates the 4B's concept space to match the 0.6B's (fixing spatial/pose understanding) while blending the magnitude halfway between the two encoders. Compare your results; you should see better prompt adherence, especially for viewpoints, poses, and spatial composition.

Step 3: Experiment with strength

  • If poses/viewpoints still aren't quite right, increase alignment_strength toward 1.0
  • If the image quality or detail seems to degrade at high strength, back off toward 0.3
  • The sweet spot varies by prompt type; 0.5 is a good general default

Step 4 (Optional): Try calibration

use_calibration: ON

This applies per-dimension scaling on top of alignment. It can help in some cases but may also over-correct. Test both ways and compare.

Quick Reference

| Setting | What It Does | When to Use |
|---|---|---|
| alignment OFF | Raw 4B embeddings | Baseline comparison |
| alignment ON, strength 0.0 | Rotation only, 4B magnitude | Fix concept directions without changing scale |
| alignment ON, strength 0.5 | Rotation + half bias shift | Best general starting point |
| alignment ON, strength 1.0 | Full 0.6B-like distribution | Maximum compatibility with adapter |
| calibration ON | Per-dimension affine scaling | Fine-grained magnitude matching |
| output_scale | Uniform multiplier | Last-resort manual adjustment |

The Mamba2 SSM Rewrite

Qwen 3.5 4B is not a standard transformer you can load with a config swap: 24 of its 32 layers are Mamba2 selective state space blocks, an architecture with no off-the-shelf ComfyUI support. We had to implement the full SSM from scratch.

The approach was to work directly from the reference Mamba2 implementation, mapping every projection, convolution, and recurrence step to the weight shapes we found in the checkpoint. The initial implementation ran without errors but produced garbage embeddings: every tensor shape was valid, no NaN/Inf, just wrong math.

The rewrite came down to carefully matching the reference's data flow: which projections go through the causal conv1d and which bypass it as a gate, the full multi-dimensional state recurrence (not a scalar approximation), input-dependent discretization that makes the SSM selective, and the skip connections that the architecture relies on. Several hundred million parameters that were being loaded but never actually used in the forward pass are now contributing.
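
To make those moving parts concrete, here is a minimal single-head selective scan in NumPy. This illustrates the general Mamba2-style recurrence (input-dependent discretization, multi-dimensional state update, readout, and the D skip), not the node's actual implementation; all names and shapes are ours:

```python
import numpy as np

def selective_scan(x, A, B, C, D, dt):
    """
    Minimal illustrative selective SSM scan.
    x:  (T, d)   input sequence
    A:  (d,)     per-channel state decay (negative values)
    B:  (T, n)   input-dependent input projection
    C:  (T, n)   input-dependent output projection
    D:  (d,)     skip connection
    dt: (T, d)   input-dependent step sizes (the "selective" part)
    """
    T, d = x.shape
    n = B.shape[1]
    h = np.zeros((d, n))                               # full (d x n) state
    y = np.empty_like(x)
    for t in range(T):
        dA = np.exp(dt[t][:, None] * A[:, None])       # discretization
        dBx = dt[t][:, None] * B[t][None, :] * x[t][:, None]
        h = dA * h + dBx                               # state recurrence
        y[t] = h @ C[t] + D * x[t]                     # readout + D skip
    return y
```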

The key insight was that SSM bugs are silent: the shapes all work out, gradients would flow if you were training, and the output looks like plausible floating point numbers. The only way to catch them was methodical comparison against the reference code, projection by projection.


Tokenizer: Why Qwen3 ≠ Qwen3.5

This was an easy mistake to make, and a critical one to fix.

| | Qwen 3 (0.6B) | Qwen 3.5 (4B) |
|---|---|---|
| Vocabulary size | 151,936 | 248,320 |
| Extra tokens | – | +96,384 (3 blocks of 32,128) |

The extra 96,384 tokens in Qwen 3.5 correspond exactly to three copies of T5's vocabulary size (32,128 × 3 = 96,384), suggesting the model was designed to bridge between Qwen and T5 embedding spaces.
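
The vocabulary arithmetic is easy to verify:

```python
qwen3_vocab = 151_936    # Qwen 3 (0.6B) tokenizer
t5_vocab = 32_128        # T5 tokenizer
extra = 3 * t5_vocab     # the three extra embedding blocks
qwen35_vocab = qwen3_vocab + extra
```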

Using the Qwen 3 tokenizer with the 4B model means:

  • Different BPE merge rules produce different token boundaries
  • Every token ID potentially maps to the wrong embedding row
  • 96,384 trained embedding rows are never accessed
  • The model receives garbled input it was never trained on

The node bundles the correct Qwen 3.5 tokenizer (248,320 tokens) and falls back to auto-downloading from Qwen/Qwen3.5-4B on HuggingFace if local files aren't found.


Timeline & Iteration History

v0.1.0 β€” Initial Release (2026-03-08)

  • Full custom implementation of the hybrid Mamba2/Attention architecture
  • Weight loading (426 tensors, 4.14B parameters)
  • ComfyUI CLIP-compatible
  • Result: Images generated but consistently worse than 0.6B

v0.2.0 β€” Mamba2 SSM Rewrite (2026-03-09)

  • Fixed 5 critical bugs in the SSM block (conv split, gate, d_state, dt, D skip)
  • ~240M previously-ignored parameters now contributing
  • Result: Better internal representations, still misaligned with adapter

v0.3.0 β€” ExpRMSNorm Discovery (2026-03-09)

  • Discovered the late norm uses exp(weight) parameterization
  • Token diversity went from 0.003 to 0.821 (274× improvement)
  • Result: Meaningful, distinguishable embeddings for the first time

v0.4.0 β€” Alignment & Calibration (2026-03-09)

  • Procrustes alignment over 41K prompts (cosine similarity: -0.03 → 0.96)
  • Per-dimension affine calibration from 30 diverse prompts
  • Correct Qwen 3.5 tokenizer (vocab=248,320)
  • Result: Substantially improved prompt adherence and image quality

This node is open source. Contributions, testing results, and alignment experiments are welcome.
