Add library_name and paper link

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +26 -9
README.md CHANGED
@@ -1,17 +1,20 @@
  ---
- license: apache-2.0
  base_model: Qwen/Qwen3-8B
- tags:
- - flashnorm
- - transformer-tricks
- - efficient-inference
- - weightless-rmsnorm
  pipeline_tag: text-generation
  ---

  # Qwen3-8B-FlashNorm

- FlashNorm-prepared checkpoint of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). Mathematically equivalent to the source model. The per-channel RMSNorm weight tensors (`input_layernorm.weight`, `post_attention_layernorm.weight`, `model.norm.weight`) are folded into the following linear layers and then removed from the state dict entirely.

  > **Framework support note.** Stock vLLM currently does not load this checkpoint because the norm weight tensors are absent. The upstream patch to accept missing tensors is tracked at: **TBD (vLLM issue link)**. Until the patch lands, use HuggingFace Transformers; it loads this with a warning that norm weights were not initialized and defaults them to ones, which is the correct behavior for FlashNorm.
 
@@ -23,7 +26,7 @@ An exact reformulation of `RMSNorm -> Linear`:
  - After folding, the RMSNorm layer has no learnable per-channel scale. At runtime it simply divides by `rms(x)`.
  - The resulting model computes the same output as the original, by Proposition 1 of the FlashNorm paper.

- See the [paper](https://arxiv.org/abs/2407.09577) and the [transformer-tricks](https://github.com/OpenMachine-ai/transformer-tricks) repo for details.

  ## Usage
 
@@ -53,6 +56,20 @@ A warning about missing norm weights is expected; Transformers defaults those to
  Not yet supported. See the tracking issue linked above.

  ## License

- Inherited from the source model.
 
  ---
  base_model: Qwen/Qwen3-8B
+ license: apache-2.0
  pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - flashnorm
+ - transformer-tricks
+ - efficient-inference
+ - weightless-rmsnorm
  ---

  # Qwen3-8B-FlashNorm

+ This is a FlashNorm-prepared checkpoint of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), as presented in the paper [FlashNorm: Fast Normalization for Transformers](https://huggingface.co/papers/2407.09577).
+
+ Mathematically equivalent to the source model. The per-channel RMSNorm weight tensors (`input_layernorm.weight`, `post_attention_layernorm.weight`, `model.norm.weight`) are folded into the following linear layers and then removed from the state dict entirely.

  > **Framework support note.** Stock vLLM currently does not load this checkpoint because the norm weight tensors are absent. The upstream patch to accept missing tensors is tracked at: **TBD (vLLM issue link)**. Until the patch lands, use HuggingFace Transformers; it loads this with a warning that norm weights were not initialized and defaults them to ones, which is the correct behavior for FlashNorm.
 
 
  - After folding, the RMSNorm layer has no learnable per-channel scale. At runtime it simply divides by `rms(x)`.
  - The resulting model computes the same output as the original, by Proposition 1 of the FlashNorm paper.

+ See the [paper](https://huggingface.co/papers/2407.09577) and the [transformer-tricks](https://github.com/OpenMachine-ai/transformer-tricks) repo for details.

  ## Usage
 
 
  Not yet supported. See the tracking issue linked above.

+ ## Citation
+
+ ```bibtex
+ @misc{graef2024flashnormfastnormalizationtransformers,
+ title={FlashNorm: Fast Normalization for Transformers},
+ author={Nils Graef and Matthew Clapp and Andrew Wasielewski},
+ year={2024},
+ eprint={2407.09577},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2407.09577},
+ }
+ ```
+
  ## License

+ Inherited from the source model.
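
For readers of this diff, the README's claim that the per-channel RMSNorm weights can be folded into the following linear layer without changing the output can be checked on a toy example. The sketch below is illustrative only and is not taken from the transformer-tricks repo: the function names (`rmsnorm_linear`, `fold_norm_weights`, `flashnorm_linear`), the tiny dimensions, the sample values, and the `eps` default are all assumptions. It uses the `(out_features, in_features)` weight layout, so folding scales each column of `W` by the matching norm weight.

```python
import math

def rms(x, eps=1e-6):
    # Root mean square of the vector; eps is an assumed stabilizer value.
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)

def matvec(W, x):
    # W has shape (out_features, in_features); computes W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rmsnorm_linear(x, g, W):
    # Reference path: RMSNorm with per-channel scale g, then the linear layer.
    r = rms(x)
    normed = [gi * xi / r for gi, xi in zip(g, x)]
    return matvec(W, normed)

def fold_norm_weights(g, W):
    # Fold g into W: W_folded[j][i] = W[j][i] * g[i] (scale column i by g[i]).
    return [[w * gi for w, gi in zip(row, g)] for row in W]

def flashnorm_linear(x, W_folded):
    # Folded path: the norm only divides by rms(x); no per-channel scale remains.
    r = rms(x)
    return matvec(W_folded, [xi / r for xi in x])

# Hypothetical toy values, chosen only for illustration.
x = [0.5, -1.0, 2.0, 0.25]
g = [1.5, 0.5, -2.0, 1.0]
W = [[0.1, 0.2, -0.3, 0.4],
     [-0.5, 0.6, 0.7, -0.8],
     [0.9, -1.0, 1.1, 1.2]]

reference = rmsnorm_linear(x, g, W)
folded = flashnorm_linear(x, fold_norm_weights(g, W))
max_abs_diff = max(abs(a - b) for a, b in zip(reference, folded))
print(max_abs_diff)  # expected to be at floating-point rounding level
```

The two paths agree exactly in real arithmetic. In floating point the results can differ by rounding error, because the multiplications happen in a different order.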