---
language:
- en
tags:
- pytorch
- hssm
- state-space-model
- mixture-of-experts
- autoregressive
- text-generation
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
---

# HSSM

HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release contains the FineWeb-Edu pretrained checkpoint of the model published by [DevHunterAI](https://huggingface.co/DevHunterAI).

![HSSM architecture](./HSSM.png)

## Model Summary

HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation.

This release corresponds to the pretrained checkpoint:

- `hssm_fineweb_edu_final.pt`

This checkpoint was pretrained on:

- `HuggingFaceFW/fineweb-edu`

## Intended Use

This model is intended for:

- research on hierarchical state space models
- experimentation with sparse expert routing for language modeling
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is **pretrained**, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

## Training Dataset

The pretraining data source selected for this release is:

- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Selection**: first 1.5 million samples
- **Epochs**: 1

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.

## Architecture Overview

HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages.

### 1. Token Embedding Layer

Input token IDs are mapped into a dense latent space of dimension `d_model=512`.

### 2. Hierarchical Chunker

The embedded token sequence is grouped into fixed-size chunks with:

- `chunk_size=4`

This chunking stage compresses local token neighborhoods into chunk-level representations before they are processed by deeper sequence blocks. The hierarchical view allows the model to reason over short local neighborhoods while reducing the sequence-processing burden in later stages.

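As an illustration, the chunking step can be sketched in plain Python. This toy version mean-pools each chunk, which is one plausible compression; the actual learned chunk projection used by HSSM is not specified in this card, and `chunk_and_pool` is a hypothetical name:

```python
def chunk_and_pool(embeddings, chunk_size=4):
    """Group a token-embedding sequence into fixed-size chunks and
    mean-pool each chunk into a single chunk-level vector.

    embeddings: list of equal-length vectors (lists of floats),
    padded so len(embeddings) is a multiple of chunk_size.
    """
    dim = len(embeddings[0])
    chunks = []
    for start in range(0, len(embeddings), chunk_size):
        group = embeddings[start:start + chunk_size]
        # mean over the chunk, per dimension
        pooled = [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        chunks.append(pooled)
    return chunks


# 8 toy 2-d token embeddings -> 2 chunk vectors with chunk_size=4
tokens = [[float(i), float(-i)] for i in range(8)]
print(chunk_and_pool(tokens))  # [[1.5, -1.5], [5.5, -5.5]]
```

Deeper stages then operate on a sequence a quarter of the original length, which is where the compute saving comes from.
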
### 3. Repeated HSSM Blocks

The model contains:

- `num_blocks=6`

Each HSSM block combines two complementary mechanisms:

#### a. Selective State Space Modeling

A selective state space module processes the chunked sequence with structured recurrence-like dynamics. Instead of relying purely on attention, it models ordered token evolution through learned state transitions. This helps the model retain sequential inductive bias and capture progression through text.

Key state-space parameter:

- `d_state=32`

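The recurrence can be illustrated with a minimal scalar sketch, assuming sigmoid input-dependent gates; the exact parameterization in HSSM is not documented here, so `selective_scan` and its weights are illustrative only:

```python
import math


def selective_scan(xs, w_a=0.5, w_b=1.0):
    """Toy 1-d selective state space recurrence:
        h_t = a_t * h_{t-1} + b_t * x_t
    where a_t and b_t are *input-dependent* (the "selective" part),
    computed here with sigmoid gates on the current input. The real
    module uses d_state=32 latent dimensions per channel; this scalar
    version only illustrates the shape of the recurrence.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    h, ys = 0.0, []
    for x in xs:
        a = sigmoid(w_a * x)  # how much past state to keep
        b = sigmoid(w_b * x)  # how much of the new input to write
        h = a * h + b * x
        ys.append(h)
    return ys


print(selective_scan([1.0, 0.0, 2.0]))
```

Because `a_t` and `b_t` depend on the input, the state can retain or overwrite information based on content, rather than decaying at a fixed rate.
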
#### b. Sparse Mixture-of-Experts Feed-Forward Stage

Each block also contains a sparse mixture-of-experts module:

- `num_experts=8`
- `top_k=2`
- `expert_dim=1024`

For every processed representation, the router activates only the top-2 experts rather than all experts. This increases representational capacity without paying the full dense compute cost of all experts every time.

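A minimal sketch of this top-k dispatch, assuming softmax-normalized router scores over the selected experts (`moe_forward` and the toy scalar experts are hypothetical, not the repository's implementation):

```python
import math


def moe_forward(x, expert_fns, router_logits, top_k=2):
    """Sparse MoE dispatch for one token representation x: pick the
    top_k experts by router score, softmax-normalize their weights,
    and mix only those experts' outputs. Toy scalar version; the
    released model uses num_experts=8, top_k=2, expert_dim=1024.
    """
    # indices of the top_k router scores
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # softmax over just the selected logits
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # only the chosen experts run; the other experts cost nothing
    return sum(w * expert_fns[i](x) for w, i in zip(weights, top))


experts = [lambda x, k=k: (k + 1) * x for k in range(8)]  # expert k scales by k+1
logits = [0.0, 3.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]         # experts 1 and 4 win
print(moe_forward(1.0, experts, logits))  # 0.5*2 + 0.5*5 = 3.5
```

With `top_k=2` of `num_experts=8`, only a quarter of the expert parameters are exercised per token, which is the source of the compute saving described above.
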
### 4. Final Normalization and Output Projection

After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction.

## Released Configuration

This release uses the larger Config A style setup:

- `vocab_size=20000`
- `d_model=512`
- `d_state=32`
- `num_blocks=6`
- `num_experts=8`
- `top_k=2`
- `chunk_size=4`
- `expert_dim=1024`

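For reference, the released values can be collected in a small config object; the field names below are illustrative and not necessarily those used in the repository's code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMConfig:
    # values from the released Config A checkpoint; the class and
    # field names are illustrative, not taken from the repo's code
    vocab_size: int = 20000
    d_model: int = 512
    d_state: int = 32
    num_blocks: int = 6
    num_experts: int = 8
    top_k: int = 2
    chunk_size: int = 4
    expert_dim: int = 1024


cfg = HSSMConfig()
# with top_k=2 of num_experts=8, a quarter of expert capacity
# is active per token
print(cfg.top_k / cfg.num_experts)  # 0.25
```
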
## How HSSM Works Internally

At a high level, HSSM processes text as follows:

1. Tokens are embedded into a continuous space.
2. Neighboring tokens are grouped into chunks.
3. Chunk representations are passed through repeated hierarchical blocks.
4. Inside each block, selective state space dynamics model ordered sequence behavior.
5. Sparse expert routing expands feed-forward capacity using only a small subset of experts per step.
6. Final logits are produced for autoregressive next-token generation.

This creates a hybrid inductive bias:

- **hierarchical** because tokens are compressed into chunk-level structure
- **state-space based** because sequential dynamics are modeled through learned latent state transitions
- **sparse expert based** because only a subset of experts is activated for each representation

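The steps above can be sketched as one composed forward pass, with stand-in callables for each stage; all names here are illustrative pseudostructure, not the repository's API:

```python
def hssm_forward(token_ids, embed, chunk, blocks, norm, project):
    """Compose the pipeline: embed -> chunk -> N blocks -> norm -> logits.
    Each argument is a stand-in callable for the real stage."""
    x = [embed(t) for t in token_ids]  # 1. token embedding
    x = chunk(x)                       # 2. hierarchical chunking
    for block in blocks:               # 3. repeated HSSM blocks
        x = block(x)                   #    (selective SSM + sparse MoE inside)
    x = norm(x)                        # 4. final normalization
    return [project(h) for h in x]     #    vocabulary logits per position


# trivial stand-ins just to show the data flow
logits = hssm_forward(
    token_ids=[3, 1, 4, 1],
    embed=lambda t: float(t),
    chunk=lambda xs: [sum(xs[i:i + 2]) / 2 for i in range(0, len(xs), 2)],
    blocks=[lambda xs: [h + 1 for h in xs]] * 6,
    norm=lambda xs: xs,
    project=lambda h: [h, -h],
)
print(logits)  # [[8.0, -8.0], [8.5, -8.5]]
```

Note that logits are produced per chunk position in this sketch; how the real model maps chunk-level states back to per-token predictions is not detailed in this card.
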
## Known Limitations

Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

## Files in This Repository

- `hssm_fineweb_edu_final.pt` — pretrained HSSM checkpoint
- `simple_tokenizer_20k.json` — tokenizer file used with this release
- `HSSM.png` — architecture image shown in this model card

## Example Loading (PyTorch)

```python
import torch
from hssm_pretrained_chat import load_pretrained, generate_reply

tokenizer, model = load_pretrained(
    "hssm_fineweb_edu_final.pt",
    "simple_tokenizer_20k.json",
    device="cpu",
)

reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=48,
    temperature=0.3,
    top_k=12,
    top_p=0.78,
    repetition_penalty=1.45,
    no_repeat_ngram_size=4,
)

print(reply)
```

## Repository / Author

- **Model name**: `HSSM`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release

## Citation

If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.