open-machine/SmolLM2-135M-FlashNorm

@@ -9,45 +9,39 @@ tags:
 pipeline_tag: text-generation
 ---
-# SmolLM2-135M-FlashNorm-strict
-**Weightless (strict-mode) FlashNorm checkpoint** of [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M).
-Mathematically equivalent to the source model. The per-channel normalization weight tensors (`input_layernorm.weight`, `post_attention_layernorm.weight`, `model.norm.weight`) have been folded into the following linear layers and then removed from the state dict entirely.
-> **This checkpoint does NOT load in stock vLLM today.** vLLM's weight loader raises a `ValueError` because the norm weight tensors are absent. Issue tracking the loader patch: TBD. Use [open-machine/SmolLM2-135M-FlashNorm](https://huggingface.co/open-machine/SmolLM2-135M-FlashNorm) (the compat variant) for a drop-in checkpoint that loads in stock vLLM today.
-This repo exists as a concrete test vector for the upstream patch that would let vLLM accept weightless RMSNorm models.
-## What is FlashNorm (weightless)?
 An exact reformulation of `RMSNorm -> Linear`:
-- **Fold** the per-channel normalization weight `g` into the following linear layer: `W_star = W @ diag(g)`.
-- After folding, the RMSNorm layer has no learnable per-channel scale. It just divides by `rms(x)`.
 - The resulting model computes the same output as the original, by Proposition 1 of the FlashNorm paper.
-This repo is a "weightless" variant: the `g` tensor itself is absent from the safetensors, because after the fold the runtime value of `g` is always all-ones (the multiplicative identity). Deleting the tensor saves a small amount of disk space and makes explicit that the runtime never needs to multiply by `g`.
 See the [paper](https://github.com/OpenMachine-ai/transformer-tricks/blob/main/tex/flashNorm.tex) (Section 3.1 and Proposition 1) and the [transformer-tricks](https://github.com/OpenMachine-ai/transformer-tricks) repo for details.
 ## What's different from the source checkpoint
-| Tensor | Source | Compat variant | This (strict) |
-|---|---|---|---|
-| `model.layers.*.input_layernorm.weight` | learned per-channel `g` | all ones | **absent** |
-| `model.layers.*.self_attn.{q,k,v}_proj.weight` | `W` | `W @ diag(g_input_layernorm)` | `W @ diag(g_input_layernorm)` |
-| `model.layers.*.post_attention_layernorm.weight` | learned per-channel `g` | all ones | **absent** |
-| `model.layers.*.mlp.{gate,up}_proj.weight` | `W` | `W @ diag(g_post_attention_layernorm)` | `W @ diag(g_post_attention_layernorm)` |
-| `model.norm.weight` | learned per-channel `g` | all ones | **absent** |
-All dtype conventions match the source (`bfloat16`). Mathematical identity to the source model holds by construction.
 ## Usage
-### Via `transformer_tricks`
-The `transformer_tricks` package can regenerate this checkpoint locally from the source:
 ```python
 import transformer_tricks as tt
@@ -56,30 +50,26 @@ tt.flashify_repo('HuggingFaceTB/SmolLM2-135M', strict=True)
 ### Via HuggingFace Transformers
-HuggingFace Transformers will load this checkpoint with a warning that norm weights were not initialized from the checkpoint, and will default them to the module's init value (ones for `LlamaRMSNorm`). Under this path, the output is correct.
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
-tok = AutoTokenizer.from_pretrained('open-machine/SmolLM2-135M-FlashNorm-strict')
-model = AutoModelForCausalLM.from_pretrained('open-machine/SmolLM2-135M-FlashNorm-strict')
 ids = tok('Once upon a time there was', return_tensors='pt').input_ids
 out = model.generate(ids, max_new_tokens=50, do_sample=False)
 print(tok.decode(out[0], skip_special_tokens=True))
 ```
-### Via vLLM
-**Not yet supported.** vLLM's weight loader validates that all declared `nn.Parameter` tensors are present in the safetensors and raises `ValueError` when norm weights are absent.
-Tracking issue for upstream patch: TBD (to be linked once filed).
-Until the patch lands, use [open-machine/SmolLM2-135M-FlashNorm](https://huggingface.co/open-machine/SmolLM2-135M-FlashNorm) (compat variant) which keeps the norm tensors as all-ones and loads in stock vLLM unchanged.
 ## Verification
-Generated from the compat variant by deleting the 61 norm weight tensors (30 layers x 2 norms each + 1 final `model.norm`). All other tensors are byte-identical to the compat checkpoint; inference outputs are therefore identical when the loader defaults absent norm weights to ones.
 ## License

 pipeline_tag: text-generation
 ---
+# SmolLM2-135M-FlashNorm
+FlashNorm-prepared checkpoint of [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M). Mathematically equivalent to the source model. The per-channel RMSNorm weight tensors (`input_layernorm.weight`, `post_attention_layernorm.weight`, `model.norm.weight`) are folded into the following linear layers and then removed from the state dict entirely.
+> **Framework support note.** Stock vLLM currently does not load this checkpoint because the norm weight tensors are absent. The upstream patch to accept missing tensors is tracked at: **TBD (vLLM issue link)**. Until the patch lands, use HuggingFace Transformers: it loads this checkpoint with a warning that the norm weights were not initialized and defaults them to ones, which is the correct behavior for FlashNorm.
+>
+> The other two public FlashNorm checkpoints in this org, [Llama-3.2-1B-FlashNorm](https://huggingface.co/open-machine/Llama-3.2-1B-FlashNorm) and [Llama-3.1-8B-FlashNorm](https://huggingface.co/open-machine/Llama-3.1-8B-FlashNorm), are currently still in a compatibility layout where the norm tensors are retained as all-ones. They will be flipped to the same weightless layout as this checkpoint once vLLM's loader supports it.
+## What FlashNorm does
 An exact reformulation of `RMSNorm -> Linear`:
+- Fold the per-channel normalization weight `g` into the following linear layer: `W_star = W @ diag(g)`, computed once at checkpoint conversion.
+- After folding, the RMSNorm layer has no learnable per-channel scale. At runtime it simply divides by `rms(x)`.
 - The resulting model computes the same output as the original, by Proposition 1 of the FlashNorm paper.
 See the [paper](https://github.com/OpenMachine-ai/transformer-tricks/blob/main/tex/flashNorm.tex) (Section 3.1 and Proposition 1) and the [transformer-tricks](https://github.com/OpenMachine-ai/transformer-tricks) repo for details.
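
The fold described above can be checked numerically on random data. The following is a minimal NumPy sketch, not the `transformer_tricks` implementation: the helper `rms_norm`, the `eps` term, and the toy shapes are assumptions made for illustration, and `W` uses the `(out_features, in_features)` layout so that `W @ diag(g)` scales column `j` by `g[j]`.

```python
import numpy as np

def rms_norm(x, g=None, eps=1e-6):
    """RMSNorm over the last axis; g is the optional per-channel scale.

    eps is an assumed stabilizer, not a value taken from the checkpoint.
    """
    y = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return y if g is None else y * g

rng = np.random.default_rng(0)
d_in, d_out = 8, 4                      # toy sizes, chosen arbitrarily
x = rng.standard_normal((3, d_in))      # three token vectors
g = rng.standard_normal(d_in)           # per-channel norm weight
W = rng.standard_normal((d_out, d_in))  # linear weight, (out, in) layout

# Original path: RMSNorm with scale g, then the linear layer.
y_ref = rms_norm(x, g) @ W.T

# Folded path: fold g into W once, then use a norm without a scale.
W_star = W @ np.diag(g)                 # equivalent to W * g
y_fold = rms_norm(x) @ W_star.T

max_abs_diff = float(np.max(np.abs(y_ref - y_fold)))
print(max_abs_diff)
```

In float64 the two paths agree up to rounding error. In lower-precision dtypes the same identity holds mathematically, but the rounding differs between the two paths.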
 ## What's different from the source checkpoint
+| Tensor | Source | This FlashNorm checkpoint |
+|---|---|---|
+| `model.layers.*.input_layernorm.weight` | learned per-channel `g` | **absent** |
+| `model.layers.*.self_attn.{q,k,v}_proj.weight` | `W` | `W @ diag(g_input_layernorm)` |
+| `model.layers.*.post_attention_layernorm.weight` | learned per-channel `g` | **absent** |
+| `model.layers.*.mlp.{gate,up}_proj.weight` | `W` | `W @ diag(g_post_attention_layernorm)` |
+| `model.norm.weight` | learned per-channel `g` | **absent** |
+All dtype conventions match the source (`bfloat16`). Mathematical identity to the source holds by construction.
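
As a rough illustration of the per-layer rows in the table, the sketch below applies the fold to a toy state dict of NumPy arrays and then drops the norm tensors. The function name `fold_layer_norms` and the toy shapes are hypothetical and this is not the `transformer_tricks` code. The table does not say which tensor absorbs `model.norm.weight`, so the final norm is left out of the sketch.

```python
import numpy as np

# Which projections follow which norm, per the table above.
FOLD_TARGETS = {
    "input_layernorm.weight": [
        "self_attn.q_proj.weight",
        "self_attn.k_proj.weight",
        "self_attn.v_proj.weight",
    ],
    "post_attention_layernorm.weight": [
        "mlp.gate_proj.weight",
        "mlp.up_proj.weight",
    ],
}

def fold_layer_norms(state, layer):
    """Fold one layer's norm weights into the following projections.

    Hypothetical helper: returns a new dict in which the two norm
    tensors of the given layer are absent.
    """
    prefix = f"model.layers.{layer}."
    out = dict(state)
    for norm_key, proj_keys in FOLD_TARGETS.items():
        g = out.pop(prefix + norm_key)          # norm tensor is removed
        for key in proj_keys:
            # W @ diag(g): broadcasting scales column j by g[j]
            out[prefix + key] = out[prefix + key] * g
    return out

# Toy state dict for a single layer (shapes are arbitrary).
rng = np.random.default_rng(0)
hidden = 6
state = {"model.layers.0." + k: rng.standard_normal(hidden) for k in FOLD_TARGETS}
for proj_keys in FOLD_TARGETS.values():
    for key in proj_keys:
        state["model.layers.0." + key] = rng.standard_normal((4, hidden))

folded = fold_layer_norms(state, 0)
print(sorted(folded))
```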
 ## Usage
+### Regenerate locally with `transformer_tricks`
 ```python
 import transformer_tricks as tt
 ### Via HuggingFace Transformers
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
+tok = AutoTokenizer.from_pretrained('open-machine/SmolLM2-135M-FlashNorm')
+model = AutoModelForCausalLM.from_pretrained('open-machine/SmolLM2-135M-FlashNorm')
 ids = tok('Once upon a time there was', return_tensors='pt').input_ids
 out = model.generate(ids, max_new_tokens=50, do_sample=False)
 print(tok.decode(out[0], skip_special_tokens=True))
 ```
+A warning about missing norm weights is expected; Transformers defaults those to ones, which is the correct value for a FlashNorm checkpoint.
+### Via vLLM
+Not yet supported. See the framework support note above; the tracking issue link is still TBD.
 ## Verification
+Under fp32 inference, greedy generation from this checkpoint is bit-identical to the source SmolLM2-135M model. Under fp16 inference, the output is within benchmark noise (see the Quality table in Section 5 of the paper).
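
One way to check the greedy-generation claim at the token level is to compare the generated token ids from the two models position by position. The helper below is a hypothetical sketch (the name `first_divergence` is not part of any library); it only compares two sequences of ids and does not load any model.

```python
from itertools import zip_longest

def first_divergence(ref_ids, test_ids):
    """Return the index of the first differing token id, or None if equal.

    A length mismatch counts as a divergence at the end of the shorter
    sequence.
    """
    for i, (a, b) in enumerate(zip_longest(ref_ids, test_ids)):
        if a != b:
            return i
    return None

# Toy sequences standing in for two greedy generations.
print(first_divergence([5, 7, 9], [5, 7, 9]))
print(first_divergence([5, 7, 9], [5, 7, 8]))
```

In practice the two inputs would be `out[0].tolist()` from `model.generate(ids, max_new_tokens=50, do_sample=False)`, run once on the source model and once on this checkpoint.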
 ## License