---
language:
- en
tags:
- pytorch
- hssm-v2
- hierarchical-state-space-model
- mixture-of-experts
- autoregressive
- text-generation
- fineweb-edu
- 250m-parameters
- 0.25B
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
---
# HSSM v2 250M
HSSM v2 is a hierarchical state-space language model with sparse Mixture-of-Experts routing for autoregressive text generation. This release contains the FineWeb-Edu pretrained checkpoint published by [DevHunterAI](https://huggingface.co/DevHunterAI).

## Model Summary
HSSM v2 combines local depthwise temporal mixing, chunk-level hierarchical state propagation, residual gating, and sparse Mixture-of-Experts feed-forward blocks in a single causal language model.
This release corresponds to the pretrained checkpoint:
- `hssm_v2_250m_fineweb_edu_final.pt`

Model scale:
- **Total parameters**: `250,040,256` (`~250M`)
- **Active parameters per token path**: `26,534,400` (`~26.5M`)
- **Architecture**: sparse MoE language model with top-1 expert routing in MoE layers

This checkpoint was pretrained on:
- `HuggingFaceFW/fineweb-edu`
- `1.25B` tokens

Training note:
- pretrained in approximately **2 hours** on an **NVIDIA RTX Pro 6000 Blackwell GPU**
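
For a quick sense of scale, the figures above can be combined with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 2-hour duration is approximate, so the throughput is a rough estimate and not a measured value.

```python
# Figures copied from the model summary above.
total_params = 250_040_256
active_params = 26_534_400
tokens = 1.25e9
hours = 2  # approximate training duration

active_fraction = active_params / total_params
tokens_per_param = tokens / total_params
tokens_per_second = tokens / (hours * 3600)

print(f"active fraction: {active_fraction:.1%}")                  # 10.6%
print(f"tokens per total parameter: {tokens_per_param:.2f}")      # 5.00
print(f"approx. throughput: {tokens_per_second:,.0f} tokens/s")   # 173,611 tokens/s
```

Roughly one tenth of the weights are used for any given token, and the token budget is about five tokens per total parameter.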
## Intended Use
This model is intended for:
- research on hierarchical state-space language models
- experimentation with sparse expert routing for autoregressive text generation
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is **pretrained**, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.
## Training Dataset
The pretraining data source for this release is:
- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Token budget**: `1.25B` tokens
- **Domain**: educational and general web text

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
## Architecture Overview
HSSM v2 is organized as a stacked hierarchical autoregressive architecture with token embeddings, ten HSSM blocks, final normalization, and a tied language modeling head.
### Core configuration
- `vocab_size = 50257`
- `d_model = 288`
- `n_layers = 10`
- `d_ff = 512`
- `state_rank = 128`
- `chunk_size = 8`
- `num_experts = 64`
- `experts_per_token = 1`
- `expert_dim = 2048`
- `moe_every = 4`
- `tie_embeddings = true`
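
The values above could be gathered into a single configuration object. The sketch below is illustrative only: the class name `HSSMv2Config` is hypothetical and is not taken from `hssm_v2_gpu_pretrain.py`; only the field values come from this model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMv2Config:  # hypothetical name, for illustration only
    vocab_size: int = 50257
    d_model: int = 288
    n_layers: int = 10
    d_ff: int = 512
    state_rank: int = 128
    chunk_size: int = 8
    num_experts: int = 64
    experts_per_token: int = 1
    expert_dim: int = 2048
    moe_every: int = 4
    tie_embeddings: bool = True


config = HSSMv2Config()

# With tied embeddings, the token embedding matrix is counted once.
embedding_params = config.vocab_size * config.d_model
print(embedding_params)  # 14474016
```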
### Block structure
Each HSSM v2 block follows this pattern:
1. `RMSNorm`
2. `HierarchicalStateMixer`
3. residual add
4. `RMSNorm`
5. `GatedMLP` or `SparseMoE`
6. residual add

Every 4th block uses `SparseMoE`, so with 10 layers this release contains 2 MoE blocks.
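
The block schedule can be checked with a short calculation. The indexing rule below is an assumption: it treats blocks as 1-based and places `SparseMoE` wherever the index is divisible by `moe_every`, which reproduces the stated count of 2. A 0-based `i % moe_every == 0` rule would give three MoE blocks for 10 layers, so it would not match.

```python
n_layers = 10
moe_every = 4

# Assumed rule: 1-based block index divisible by moe_every uses SparseMoE.
moe_blocks = [i for i in range(1, n_layers + 1) if i % moe_every == 0]
mlp_blocks = [i for i in range(1, n_layers + 1) if i % moe_every != 0]

print(moe_blocks)       # [4, 8]
print(len(mlp_blocks))  # 8
```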
### HierarchicalStateMixer
The mixer replaces standard attention with a combination of:
- depthwise `Conv1d` local temporal mixing
- chunking with `chunk_size=8`
- mean pooling over chunk windows
- state compression `288 -> 128`
- state expansion `128 -> 288`
- repeat-interleave back to token length
- gated residual fusion followed by output projection

This gives the model a hybrid inductive bias with local token interaction and chunk-level state propagation.
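
The components listed above might fit together roughly as in the following PyTorch sketch. It is not the released implementation. The class name, the convolution `kernel_size`, the form of the gate, and the one-chunk shift used to keep the layer causal are all assumptions made for illustration; the model card does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalStateMixerSketch(nn.Module):
    """Illustrative only: local depthwise conv plus chunk-level state broadcast."""

    def __init__(self, d_model=288, state_rank=128, chunk_size=8, kernel_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.kernel_size = kernel_size  # assumed; not given in the model card
        # groups=d_model makes the convolution depthwise (one filter per channel).
        self.local_conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.compress = nn.Linear(d_model, state_rank)  # 288 -> 128
        self.expand = nn.Linear(state_rank, d_model)    # 128 -> 288
        self.gate = nn.Linear(2 * d_model, d_model)     # assumed gate form
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        # Local mixing: left-pad so position t only sees positions <= t.
        h = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        local = self.local_conv(h).transpose(1, 2)

        # Pad to a multiple of chunk_size, then mean-pool each chunk window.
        pad = (-seq_len) % self.chunk_size
        padded = F.pad(local, (0, 0, 0, pad))
        n_chunks = padded.shape[1] // self.chunk_size
        chunks = padded.reshape(batch, n_chunks, self.chunk_size, d_model)
        summaries = chunks.mean(dim=2)  # (batch, n_chunks, d_model)

        # Compress to the latent state and expand back to model width.
        states = self.expand(self.compress(summaries))

        # Assumption: shift by one chunk so each token only receives the
        # summary of the previous chunk, never of its own or a later one.
        states = torch.cat([torch.zeros_like(states[:, :1]), states[:, :-1]], dim=1)

        # Broadcast each chunk state back to its token positions.
        context = states.repeat_interleave(self.chunk_size, dim=1)[:, :seq_len]

        # Gated fusion of local and chunk-level signals, then output projection.
        g = torch.sigmoid(self.gate(torch.cat([local, context], dim=-1)))
        return self.out_proj(g * local + (1 - g) * context)


torch.manual_seed(0)
mixer = HierarchicalStateMixerSketch()
x = torch.randn(2, 20, 288)  # seq_len need not be a multiple of chunk_size
y = mixer(x)
print(y.shape)  # torch.Size([2, 20, 288])
```

Without the shift, mean pooling followed by repeat-interleave would let a token see later tokens in its own chunk. How the released model handles this is defined in `hssm_v2_gpu_pretrain.py`.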
### Sparse MoE
Sparse MoE blocks use:
- `64` experts
- top-`1` routing per token
- expert hidden size `2048`
- auxiliary load-balancing loss

Only one expert path is active per token in each MoE layer, which is why the active parameter count is much smaller than the total parameter count.
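
Top-1 routing with a load-balancing term can be sketched as follows. This is a minimal illustration under stated assumptions, not the released `SparseMoE` code: the class name, the two-layer GELU expert, the scaling of the expert output by its router probability, and the Switch-Transformer-style balancing term are all assumptions. The usage example uses small sizes so it runs quickly; the released configuration uses `64` experts with hidden size `2048`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top1MoESketch(nn.Module):
    """Illustrative top-1 mixture of experts with an auxiliary balancing loss."""

    def __init__(self, d_model=288, expert_dim=2048, num_experts=64):
        super().__init__()
        self.num_experts = num_experts
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Assumed expert shape: a plain two-layer MLP per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # Flatten (batch, seq_len, d_model) into a list of tokens.
        tokens = x.reshape(-1, x.shape[-1])
        probs = F.softmax(self.router(tokens), dim=-1)  # (n_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)           # top-1 expert per token

        out = torch.zeros_like(tokens)
        for e in range(self.num_experts):
            mask = top_idx == e
            if mask.any():
                # Only the selected expert runs for these tokens.
                out[mask] = top_prob[mask].unsqueeze(-1) * self.experts[e](tokens[mask])

        # Balancing term: fraction of tokens routed to each expert times the
        # mean router probability for that expert, scaled by num_experts.
        load = F.one_hot(top_idx, self.num_experts).float().mean(dim=0)
        importance = probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(load * importance)
        return out.reshape_as(x), aux_loss


torch.manual_seed(0)
moe = Top1MoESketch(d_model=16, expert_dim=32, num_experts=4)  # small demo sizes
x = torch.randn(2, 6, 16)
y, aux_loss = moe(x)
print(y.shape, float(aux_loss))
```

Because each token passes through a single expert, the per-token compute depends on one expert's weights while the total parameter count grows with the number of experts.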
### Output head
After the final `RMSNorm`, the model projects hidden states to vocabulary logits using a tied LM head that shares weights with the token embedding matrix.
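
Weight tying of this kind is commonly expressed in PyTorch by pointing the output projection at the embedding parameter. The snippet below is a generic sketch of that pattern with hypothetical variable names, not an excerpt from the released model definition.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 288

embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the head to the embedding: both modules now share one parameter tensor.
lm_head.weight = embed.weight

hidden = torch.randn(2, 5, d_model)  # stand-in for final-norm outputs
logits = lm_head(hidden)
print(logits.shape)  # torch.Size([2, 5, 50257])
```

With tying, the `50257 x 288` matrix is stored and counted once instead of twice.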
## Training Details
At a high level, each training step runs the following forward pass:
1. Tokens are embedded into a continuous space.
2. Local token interactions are modeled with depthwise convolution.
3. Chunk summaries are compressed into latent states and expanded back across token positions.
4. Sparse MoE blocks increase capacity with top-1 expert routing.
5. Final logits are produced for next-token prediction.

Additional training facts for this release:
- **Pretraining tokens**: `1.25B`
- **Training hardware**: `NVIDIA RTX Pro 6000 Blackwell`
- **Approximate pretraining duration**: `2 hours`
- **Objective**: autoregressive next-token prediction with auxiliary MoE load-balancing loss
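
The objective above can be written as a single expression. This is one plausible form, given as a sketch: the weighting coefficient $\lambda$ and the exact definition of the per-layer balancing term are not stated in this model card and are assumptions here.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_{t+1} \mid x_{\le t}\right)
  + \lambda \sum_{\ell \in \mathcal{M}} \mathcal{L}_{\text{aux}}^{(\ell)}
```

Here $T$ is the number of predicted positions, $\mathcal{M}$ is the set of MoE blocks (2 in this release), and $\mathcal{L}_{\text{aux}}^{(\ell)}$ is the load-balancing loss of MoE block $\ell$.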
## Known Limitations
Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:
- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.
## Files in This Repository
- `hssm_v2_250m_fineweb_edu_final.pt` — pretrained HSSM v2 checkpoint
- `HSSM_v2_architecture.png` — architecture image shown in this model card
- `hssm_v2_gpu_pretrain.py` — training/model definition reference
- `hssm_pretrained_chat.py` — local loading and generation helper
## Example Loading (PyTorch)
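```python
from hssm_pretrained_chat import load_pretrained, generate_reply

# Load the checkpoint; "gpt2" presumably selects the GPT-2 tokenizer,
# which matches vocab_size = 50257 in the configuration above.
tokenizer, model = load_pretrained(
    "hssm_v2_250m_fineweb_edu_final.pt",
    "gpt2",
    device="cpu",
)

# Generate a short continuation. The strong repetition controls help
# counter the repetitive output listed under Known Limitations.
reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=40,
    temperature=0.0,
    top_k=4,
    top_p=0.65,
    repetition_penalty=1.9,
    no_repeat_ngram_size=6,
)
print(reply)
```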
## Repository / Author
- **Model name**: `HSSM v2 250M`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release
## Citation
If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.