---
language:
- en
tags:
- pytorch
- hssm
- state-space-model
- mixture-of-experts
- autoregressive
- text-generation
- 73.8 M
- pretrained
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
model_type: custom
license: mit
---
# HSSM
HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release contains the FineWeb-Edu pretrained checkpoint of the model published by [DevHunterAI](https://huggingface.co/DevHunterAI).

## Model Summary
HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation.
This release corresponds to the pretrained checkpoint:
- `hssm_fineweb_edu_final.pt`
Parameter count:
- `73.8M` parameters
This checkpoint was pretrained on:
- `HuggingFaceFW/fineweb-edu`
## Intended Use
This model is intended for:
- research on hierarchical state space models
- experimentation with sparse expert routing for language modeling
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines
This checkpoint is **pretrained**, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.
## Training Dataset
The pretraining data source selected for this release is:
- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Selection**: first 1.5 million samples
- **Epochs**: 1
FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
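
The selection described above (streaming, first 1.5 million samples, one pass) could be reproduced roughly as follows. This is a minimal sketch, assuming the Hugging Face `datasets` library; `first_n` and `stream_fineweb_edu` are hypothetical helper names, and the actual pretraining pipeline is not included in this card.

```python
from itertools import islice
from typing import Iterable, Iterator


def first_n(samples: Iterable[dict], n: int) -> Iterator[dict]:
    """Yield at most the first `n` items of an iterable without materializing it."""
    return islice(samples, n)


def stream_fineweb_edu(n: int = 1_500_000) -> Iterator[dict]:
    """Stream the first `n` FineWeb-Edu records (needs `datasets` and network access)."""
    # Imported lazily so `first_n` stays usable without the dependency.
    from datasets import load_dataset

    stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    return first_n(stream, n)
```

Iterating over `stream_fineweb_edu()` once corresponds to a single epoch over the selected samples.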
## Architecture Overview
HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages.
### 1. Token Embedding Layer
Input token ids are mapped into a dense latent space of dimension `d_model=512`.
### 2. Hierarchical Chunker
The embedded token sequence is grouped into fixed-size chunks with:
- `chunk_size=4`
This chunking stage compresses local token neighborhoods into chunk-level representations before they are processed by deeper sequence blocks. The model captures short-range context inside each chunk, while the deeper blocks operate on a sequence of chunks rather than individual tokens, which shortens the sequence they have to process.
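
To make the shapes concrete, here is a minimal sketch of an embedding layer followed by a chunker. `MeanPoolChunker` is a hypothetical stand-in that averages each group of `chunk_size` embeddings; the pooling rule used by the released model is not specified in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanPoolChunker(nn.Module):
    """Hypothetical chunker: averages each run of `chunk_size` token embeddings."""

    def __init__(self, d_model: int = 512, chunk_size: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        pad = (-seq_len) % self.chunk_size
        if pad:
            # Right-pad with zeros so the length divides evenly
            # (the zeros dilute the mean of the last chunk).
            x = F.pad(x, (0, 0, 0, pad))
        chunks = x.reshape(batch, -1, self.chunk_size, d_model)
        # Result: (batch, ceil(seq_len / chunk_size), d_model)
        return self.proj(chunks.mean(dim=2))


embed = nn.Embedding(20000, 512)  # vocab_size and d_model from this card
token_ids = torch.randint(0, 20000, (2, 10))
chunked = MeanPoolChunker()(embed(token_ids))
print(chunked.shape)  # torch.Size([2, 3, 512])
```

Mean pooling mixes all tokens inside a chunk, so an autoregressive model needs some additional mechanism to avoid leaking future tokens within a chunk; that detail is outside the scope of this sketch.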
### 3. Repeated HSSM Blocks
The model contains:
- `num_blocks=6`
Each HSSM block combines two complementary mechanisms:
#### a. Selective State Space Modeling
A selective state space module processes the chunked sequence with structured recurrence-like dynamics. Instead of relying purely on attention, it models ordered token evolution through learned state transitions. This helps the model retain sequential inductive bias and capture progression through text.
Key state-space parameter:
- `d_state=32`
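
The recurrence idea can be illustrated with a simplified selective scan in the style of Mamba-like models. This is a sketch under stated assumptions, not the released module: `TinySelectiveSSM`, its diagonal state matrix, and its projection layers are all hypothetical, and the loop is written for readability rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySelectiveSSM(nn.Module):
    """Simplified selective scan; a stand-in, not the released HSSM module."""

    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        # A = -exp(a_log) is always negative, so the state decays stably.
        self.a_log = nn.Parameter(torch.zeros(d_model, d_state))
        self.to_delta = nn.Linear(d_model, d_model)  # input-dependent step size
        self.to_b = nn.Linear(d_model, d_state)      # input-dependent write vector
        self.to_c = nn.Linear(d_model, d_state)      # input-dependent read vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        batch, length, d_model = x.shape
        a = -torch.exp(self.a_log)                   # (d_model, d_state)
        delta = F.softplus(self.to_delta(x))         # (batch, length, d_model), > 0
        b, c = self.to_b(x), self.to_c(x)            # (batch, length, d_state) each
        h = x.new_zeros(batch, d_model, a.shape[1])  # recurrent state
        outputs = []
        for t in range(length):
            step = delta[:, t, :, None]              # (batch, d_model, 1)
            decay = torch.exp(step * a)              # how much old state to keep
            drive = step * b[:, t, None, :] * x[:, t, :, None]
            h = decay * h + drive                    # state update
            outputs.append((h * c[:, t, None, :]).sum(dim=-1))
        return torch.stack(outputs, dim=1)           # (batch, length, d_model)
```

Because `delta`, `b`, and `c` are computed from the current input, the layer can decide per position how much to remember and how much to overwrite, which is what "selective" refers to here. Each output depends only on the current and earlier positions.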
#### b. Sparse Mixture-of-Experts Feed-Forward Stage
Each block also contains a sparse mixture-of-experts module:
- `num_experts=8`
- `top_k=2`
- `expert_dim=1024`
For every processed representation, the router activates only the top-2 experts rather than all 8. Each representation therefore passes through a quarter of the experts, so total expert capacity grows without a matching increase in per-step compute.
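
Top-k routing of this kind can be sketched as below. `TopKMoE` is a hypothetical layer: the two-layer GELU experts, the softmax over the selected scores, and the absence of a load-balancing loss are assumptions, not details confirmed by this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Hypothetical top-k routed feed-forward layer."""

    def __init__(self, d_model=512, expert_dim=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])            # (positions, d_model)
        scores, chosen = self.router(flat).topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)          # renormalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            pos_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if pos_idx.numel():
                # Only positions routed to expert e pay for its forward pass.
                gate = weights[pos_idx, slot].unsqueeze(-1)
                out.index_add_(0, pos_idx, gate * expert(flat[pos_idx]))
        return out.reshape(x.shape)
```

The loop over experts keeps the example short; production implementations usually batch the dispatch and add an auxiliary loss to keep expert usage balanced.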
### 4. Final Normalization and Output Projection
After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction.
## Released Configuration
This release uses the larger "Config A" setup:
- `vocab_size=20000`
- `d_model=512`
- `d_state=32`
- `num_blocks=6`
- `num_experts=8`
- `top_k=2`
- `chunk_size=4`
- `expert_dim=1024`
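
For scripting, the values above can be collected in a small config object. The `HSSMConfig` class and the two helper functions are illustrative and not part of the released code. The arithmetic covers only the embedding table and the expert weights, and it assumes each expert is two bias-free linear layers; the state-space modules, routers, normalization layers, and output projection are not counted, so the totals below are partial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMConfig:
    """Values copied from the released configuration; the class is illustrative."""
    vocab_size: int = 20000
    d_model: int = 512
    d_state: int = 32
    num_blocks: int = 6
    num_experts: int = 8
    top_k: int = 2
    chunk_size: int = 4
    expert_dim: int = 1024


def embedding_params(cfg: HSSMConfig) -> int:
    """Size of the token embedding table."""
    return cfg.vocab_size * cfg.d_model


def expert_params(cfg: HSSMConfig, active_only: bool = False) -> int:
    """Expert weights across all blocks, or only those used per step."""
    # Assumption: d_model -> expert_dim -> d_model, no biases.
    per_expert = 2 * cfg.d_model * cfg.expert_dim
    experts = cfg.top_k if active_only else cfg.num_experts
    return cfg.num_blocks * experts * per_expert


cfg = HSSMConfig()
print(embedding_params(cfg))                 # 10240000
print(expert_params(cfg))                    # 50331648
print(expert_params(cfg, active_only=True))  # 12582912
```

Under these assumptions, only a quarter of the expert weights take part in any single step, which is the effect of `top_k=2` out of `num_experts=8`.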
## How HSSM Works Internally
At a high level, HSSM processes text as follows:
1. Tokens are embedded into a continuous space.
2. Neighboring tokens are grouped into chunks.
3. Chunk representations are passed through repeated hierarchical blocks.
4. Inside each block, selective state space dynamics model ordered sequence behavior.
5. Sparse expert routing expands feed-forward capacity using only a small subset of experts per step.
6. Final logits are produced for autoregressive next-token generation.
This creates a hybrid inductive bias:
- **hierarchical** because tokens are compressed into chunk-level structure
- **state-space based** because sequential dynamics are modeled through learned latent state transitions
- **sparse expert based** because only a subset of experts is activated for each representation
## Known Limitations
Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:
- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice
For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.
## Files in This Repository
- `hssm_fineweb_edu_final.pt` — pretrained HSSM checkpoint
- `simple_tokenizer_20k.json` — tokenizer file used with this release
- `HSSM.png` — architecture image shown in this model card
## Example Loading (PyTorch)
```python
import torch
from hssm_pretrained_chat import load_pretrained, generate_reply

# Load the checkpoint and its tokenizer on CPU.
tokenizer, model = load_pretrained(
    "hssm_fineweb_edu_final.pt",
    "simple_tokenizer_20k.json",
    device="cpu",
)

# Conservative sampling settings to limit repetition in the pretrained model.
reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=48,
    temperature=0.3,
    top_k=12,
    top_p=0.78,
    repetition_penalty=1.45,
    no_repeat_ngram_size=4,
)
print(reply)
```
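
The sampling arguments in the call above follow common decoding conventions. The sketch below shows how temperature, top-k, top-p, and a repetition penalty are typically applied to one step of logits. `filter_logits` is a hypothetical function written for illustration; it is not the code inside `generate_reply`, and it leaves out `no_repeat_ngram_size`.

```python
import torch


def filter_logits(logits, generated, temperature=0.3, top_k=12,
                  top_p=0.78, repetition_penalty=1.45):
    """Return sampling probabilities for one step (logits: 1-D over the vocabulary)."""
    logits = logits.clone()
    # Repetition penalty: make already-generated tokens less likely.
    seen = torch.tensor(sorted(set(generated)), dtype=torch.long)
    if seen.numel():
        vals = logits[seen]
        logits[seen] = torch.where(vals > 0, vals / repetition_penalty,
                                   vals * repetition_penalty)
    # Temperature below 1 sharpens the distribution.
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    k = min(top_k, logits.numel())
    kth = torch.topk(logits, k).values[-1]
    logits[logits < kth] = float("-inf")
    # Top-p: drop a token once the mass of better tokens already reaches top_p.
    sorted_logits, order = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    mass_before = probs.cumsum(dim=-1) - probs
    logits[order[mass_before >= top_p]] = float("-inf")
    return torch.softmax(logits, dim=-1)


step_probs = filter_logits(torch.randn(20000), generated=[5, 17, 5])
next_token = torch.multinomial(step_probs, num_samples=1).item()
```

Low `temperature` and small `top_k` and `top_p` values like the ones used above trade diversity for stability, which tends to suit a pretrained checkpoint that has not been tuned for dialogue.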
## Repository / Author
- **Model name**: `HSSM`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release
## Citation
If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.