---
language:
- en
tags:
- pytorch
- hssm-v2
- hierarchical-state-space-model
- mixture-of-experts
- autoregressive
- text-generation
- fineweb-edu
- 250m-parameters
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
---

# HSSM v2 250M

HSSM v2 is a hierarchical state-space language model with sparse Mixture-of-Experts routing for autoregressive text generation. This release contains the FineWeb-Edu pretrained checkpoint published by [DevHunterAI](https://huggingface.co/DevHunterAI).

![HSSM v2 architecture](./HSSM_v2_architecture.png)

## Model Summary

HSSM v2 combines local depthwise temporal mixing, chunk-level hierarchical state propagation, residual gating, and sparse Mixture-of-Experts feed-forward blocks in a single causal language model.

This release corresponds to the pretrained checkpoint:

- `hssm_v2_250m_fineweb_edu_final.pt`

Model scale:

- **Total parameters**: `250,040,256` (`~250M`)
- **Active parameters per token path**: `26,534,400` (`~26.5M`)
- **Architecture**: sparse MoE language model with top-1 expert routing in MoE layers

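As a rough sanity check on the total-versus-active gap, almost all of the parameter budget sits in the (mostly inactive) experts. The arithmetic below assumes gated, three-matrix experts of shape `d_model × expert_dim`; that layout is an assumption here, and the authoritative definition is `hssm_v2_gpu_pretrain.py`.

```python
# Back-of-envelope parameter budget for HSSM v2 250M.
# ASSUMPTION: each expert is a gated (three-matrix) feed-forward layer;
# see hssm_v2_gpu_pretrain.py for the exact layout.
d_model, expert_dim = 288, 2048
num_experts, moe_layers = 64, 2   # moe_every = 4 with 10 layers -> 2 MoE blocks

per_expert = 3 * d_model * expert_dim            # ~1.77M weights per expert
all_experts = per_expert * num_experts * moe_layers
embed = 50257 * d_model                          # shared with the tied LM head

print(f"expert weights: {all_experts / 1e6:.1f}M")  # ~226.5M under this assumption
print(f"embedding:      {embed / 1e6:.1f}M")
```

Under this assumed layout, expert weights plus the tied embedding already account for roughly 241M of the ~250M total, with the mixers, dense MLPs, norms, and routers making up the remainder.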
This checkpoint was pretrained on:

- `HuggingFaceFW/fineweb-edu`
- `1.25B` tokens

Training note:

- pretrained in approximately **2 hours** on an **NVIDIA RTX Pro 6000 Blackwell** GPU

## Intended Use

This model is intended for:

- research on hierarchical state-space language models
- experimentation with sparse expert routing for autoregressive text generation
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is **pretrained**, not instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

## Training Dataset

The pretraining data source for this release is:

- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Token budget**: `1.25B` tokens
- **Domain**: educational and general web text

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.

## Architecture Overview

HSSM v2 is organized as a stacked hierarchical autoregressive architecture with token embeddings, ten HSSM blocks, final normalization, and a tied language modeling head.

### Core configuration

- `vocab_size = 50257`
- `d_model = 288`
- `n_layers = 10`
- `d_ff = 512`
- `state_rank = 128`
- `chunk_size = 8`
- `num_experts = 64`
- `experts_per_token = 1`
- `expert_dim = 2048`
- `moe_every = 4`
- `tie_embeddings = true`

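For experimentation, the configuration above can be captured as a small dataclass. Field names mirror the list above; the actual config object in `hssm_v2_gpu_pretrain.py` may be structured differently.

```python
from dataclasses import dataclass

@dataclass
class HSSMConfig:
    # Values taken from the core configuration list above; the real
    # config class in hssm_v2_gpu_pretrain.py may use other names.
    vocab_size: int = 50257
    d_model: int = 288
    n_layers: int = 10
    d_ff: int = 512
    state_rank: int = 128
    chunk_size: int = 8
    num_experts: int = 64
    experts_per_token: int = 1
    expert_dim: int = 2048
    moe_every: int = 4
    tie_embeddings: bool = True

cfg = HSSMConfig()
# Every moe_every-th layer (1-indexed) is an MoE block.
moe_layers = [i for i in range(1, cfg.n_layers + 1) if i % cfg.moe_every == 0]
print(moe_layers)  # [4, 8]
```

With `moe_every = 4` and 10 layers, exactly two layers carry experts, matching the two MoE blocks noted below.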
### Block structure

Each HSSM v2 block follows this pattern:

1. `RMSNorm`
2. `HierarchicalStateMixer`
3. residual add
4. `RMSNorm`
5. `GatedMLP` or `SparseMoE`
6. residual add

Every fourth block uses `SparseMoE` (`moe_every = 4`), so with 10 layers this release contains 2 MoE blocks.

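The pre-norm residual pattern above can be sketched as follows. This is an illustration only: the `RMSNorm` here is a standard implementation, and the mixer and feed-forward sub-layers are placeholder linear layers standing in for `HierarchicalStateMixer` and `GatedMLP`/`SparseMoE`.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization, applied before each sub-layer."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps
    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class HSSMBlock(nn.Module):
    """Pre-norm residual block: norm -> mix -> add, norm -> ffn -> add.

    `mixer` and `ffn` are placeholders for the real sub-layers defined
    in hssm_v2_gpu_pretrain.py.
    """
    def __init__(self, d_model=288):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # stands in for the mixer
        self.ffn = nn.Linear(d_model, d_model)    # stands in for GatedMLP/MoE
    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # steps 1-3
        x = x + self.ffn(self.norm2(x))    # steps 4-6
        return x

x = torch.randn(2, 16, 288)
print(HSSMBlock()(x).shape)  # torch.Size([2, 16, 288])
```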
### HierarchicalStateMixer

The mixer replaces standard attention with a combination of:

- depthwise `Conv1d` local temporal mixing
- chunking with `chunk_size=8`
- mean pooling over chunk windows
- state compression `288 -> 128`
- state expansion `128 -> 288`
- repeat-interleave back to token length
- gated residual fusion followed by output projection

This gives the model a hybrid inductive bias with local token interaction and chunk-level state propagation.

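The steps above can be sketched end to end as a minimal module. Layer names, the convolution kernel size, and the causal left-padding are assumptions made for illustration; the authoritative definition is `hssm_v2_gpu_pretrain.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalStateMixerSketch(nn.Module):
    """Illustrative version of the mixer steps listed above (not the
    exact implementation from hssm_v2_gpu_pretrain.py)."""
    def __init__(self, d_model=288, state_rank=128, chunk_size=8, kernel=4):
        super().__init__()
        self.chunk_size, self.kernel = chunk_size, kernel
        # depthwise Conv1d for local temporal mixing
        self.conv = nn.Conv1d(d_model, d_model, kernel, groups=d_model)
        self.compress = nn.Linear(d_model, state_rank)  # 288 -> 128
        self.expand = nn.Linear(state_rank, d_model)    # 128 -> 288
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (B, T, d_model)
        B, T, D = x.shape
        # 1) local mixing; left-pad so the convolution stays causal
        h = F.pad(x.transpose(1, 2), (self.kernel - 1, 0))
        h = self.conv(h).transpose(1, 2)
        # 2) chunk and mean-pool: (B, T/c, c, D) -> (B, T/c, D)
        pooled = h.reshape(B, T // self.chunk_size, self.chunk_size, D).mean(2)
        # 3) compress to the low-rank state, then expand back
        state = self.expand(self.compress(pooled))
        # 4) repeat-interleave chunk states back to token length
        state = state.repeat_interleave(self.chunk_size, dim=1)
        # 5) gated residual fusion, then output projection
        fused = h + torch.sigmoid(self.gate(h)) * state
        return self.out(fused)

x = torch.randn(2, 32, 288)  # T must be a multiple of chunk_size here
print(HierarchicalStateMixerSketch()(x).shape)  # torch.Size([2, 32, 288])
```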
### Sparse MoE

Sparse MoE blocks use:

- `64` experts
- top-`1` routing per token
- expert hidden size `2048`
- auxiliary load-balancing loss

Only one expert path is active per token in each MoE layer, which is why the active parameter count is much smaller than the total parameter count.

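Top-1 routing can be sketched as below. The gated expert layout and the Switch-style form of the load-balancing loss are assumptions; the formulation actually used in training is defined in the repository script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpert(nn.Module):
    """One feed-forward expert (a gated, SwiGLU-style layout is assumed)."""
    def __init__(self, d_model, expert_dim):
        super().__init__()
        self.up = nn.Linear(d_model, expert_dim, bias=False)
        self.gate = nn.Linear(d_model, expert_dim, bias=False)
        self.down = nn.Linear(expert_dim, d_model, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Top1MoESketch(nn.Module):
    """Top-1 sparse MoE layer with a Switch-style auxiliary loss."""
    def __init__(self, d_model, expert_dim, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            GatedExpert(d_model, expert_dim) for _ in range(num_experts))

    def forward(self, x):                      # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)     # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                # tokens routed to expert e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(x[mask])
        # load-balancing loss: tokens-per-expert fraction times mean
        # router probability, summed over experts and rescaled
        frac = F.one_hot(top_idx, len(self.experts)).float().mean(0)
        aux_loss = len(self.experts) * (frac * probs.mean(0)).sum()
        return out, aux_loss

# Small dimensions for illustration; the release uses 288 / 2048 / 64.
moe = Top1MoESketch(d_model=32, expert_dim=64, num_experts=8)
y, aux = moe(torch.randn(16, 32))
print(y.shape)  # torch.Size([16, 32])
```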
### Output head

After the final `RMSNorm`, the model projects hidden states to vocabulary logits using a tied LM head that shares weights with the token embedding matrix.

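Weight tying is a one-line operation in PyTorch; a minimal sketch with the release's dimensions:

```python
import torch
import torch.nn as nn

# Minimal sketch of embedding / LM-head weight tying (tie_embeddings = true).
vocab_size, d_model = 50257, 288
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight      # one shared (vocab_size, d_model) matrix

tokens = torch.tensor([[1, 2, 3]])
hidden = embed(tokens)             # stand-in for the final RMSNorm output
logits = lm_head(hidden)
print(logits.shape)  # torch.Size([1, 3, 50257])
```

Tying means the ~14.5M-parameter vocabulary matrix is counted once rather than twice, which is significant at this model scale.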
## Training Details

During training, each batch flows through the model as follows:

1. Tokens are embedded into a continuous space.
2. Local token interactions are modeled with depthwise convolution.
3. Chunk summaries are compressed into latent states and expanded back across token positions.
4. Sparse MoE blocks increase capacity with top-1 expert routing.
5. Final logits are produced for next-token prediction.

Additional training facts for this release:

- **Pretraining tokens**: `1.25B`
- **Training hardware**: `NVIDIA RTX Pro 6000 Blackwell`
- **Approximate pretraining duration**: `2 hours`
- **Objective**: autoregressive next-token prediction with an auxiliary MoE load-balancing loss

## Known Limitations

Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

## Files in This Repository

- `hssm_v2_250m_fineweb_edu_final.pt` – pretrained HSSM v2 checkpoint
- `HSSM_v2_architecture.png` – architecture image shown in this model card
- `hssm_v2_gpu_pretrain.py` – training/model definition reference
- `hssm_pretrained_chat.py` – local loading and generation helper

## Example Loading (PyTorch)

```python
from hssm_pretrained_chat import load_pretrained, generate_reply

tokenizer, model = load_pretrained(
    "hssm_v2_250m_fineweb_edu_final.pt",
    "gpt2",                   # tokenizer to pair with the 50257-token vocab
    device="cpu",
)

reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=40,
    temperature=0.0,          # 0.0 typically selects greedy decoding
    top_k=4,
    top_p=0.65,
    repetition_penalty=1.9,
    no_repeat_ngram_size=6,
)

print(reply)
```

## Repository / Author

- **Model name**: `HSSM v2 250M`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release

## Citation

If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.