| --- |
| language: |
| - en |
| tags: |
| - pytorch |
| - hssm |
| - state-space-model |
| - mixture-of-experts |
| - autoregressive |
| - text-generation |
| - 73.8 M |
| - pretrained |
| datasets: |
| - HuggingFaceFW/fineweb-edu |
| pipeline_tag: text-generation |
| library_name: pytorch |
| model_type: custom |
| license: mit |
| --- |
| |
| # HSSM |
|
|
| HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release, published by [DevHunterAI](https://huggingface.co/DevHunterAI), contains the checkpoint pretrained on FineWeb-Edu. |
|
|
|  |
|
|
| ## Model Summary |
|
|
| HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation. |
|
|
| This release corresponds to the pretrained checkpoint: |
|
|
| - `hssm_fineweb_edu_final.pt` |
|
|
| Parameter count: |
| - `73.8M` parameters |
|
|
| This checkpoint was pretrained on: |
|
|
| - `HuggingFaceFW/fineweb-edu` |
|
|
| ## Intended Use |
|
|
| This model is intended for: |
|
|
| - research on hierarchical state space models |
| - experimentation with sparse expert routing for language modeling |
| - continued fine-tuning on dialogue, instruction, or domain datasets |
| - architecture analysis and comparison against transformer and recurrent baselines |
|
|
| This checkpoint is **pretrained**, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage. |
|
|
| ## Training Dataset |
|
|
| The pretraining data source selected for this release is: |
|
|
| - **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
| - **Usage mode**: streaming pretraining pipeline |
| - **Selection**: first 1.5 million samples |
| - **Epochs**: 1 |
|
|
| FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks. |
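
Taking the first 1.5 million samples from a streamed dataset can be sketched as follows. This is a minimal illustration, not the pipeline used for this release: it assumes the Hugging Face `datasets` library in streaming mode, and the helper names `first_n` and `stream_fineweb_edu` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator

NUM_SAMPLES = 1_500_000  # "first 1.5 million samples" from this card


def first_n(samples: Iterable[dict], n: int) -> Iterator[dict]:
    """Yield at most the first n items of a (possibly very long) stream."""
    return islice(samples, n)


def stream_fineweb_edu(n: int = NUM_SAMPLES) -> Iterator[dict]:
    """Assumed usage of the `datasets` library; not the release's actual code."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    return first_n(stream, n)


# Local demonstration with a stand-in stream, so no download is needed.
fake_stream = ({"text": f"doc {i}"} for i in range(10))
subset = list(first_n(fake_stream, 3))
print([row["text"] for row in subset])
```

Because the stream is consumed lazily, only the selected samples are ever read, which is what makes a single pass over a fixed prefix of a large corpus practical.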
|
|
| ## Architecture Overview |
|
|
| HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages. |
|
|
| ### 1. Token Embedding Layer |
|
|
| Input token ids are mapped into a dense latent space of dimension `d_model=512`. |
|
|
| ### 2. Hierarchical Chunker |
|
|
| The embedded token sequence is grouped into fixed-size chunks with: |
|
|
| - `chunk_size=4` |
|
|
| This chunking stage compresses each group of neighboring tokens into a chunk-level representation before the deeper sequence blocks run. With `chunk_size=4`, those blocks see roughly one position for every four input tokens, which shortens the sequence they must process while still capturing short-range local structure. |
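
The effect of fixed-size chunking on sequence length can be sketched in plain Python. Mean pooling is an assumption made purely for illustration; this card does not say how HSSM forms a chunk representation (a learned projection is equally plausible), nor how it treats a trailing partial chunk. The function name `chunk_mean_pool` is hypothetical.

```python
from typing import List

CHUNK_SIZE = 4  # chunk_size from the released configuration

Vector = List[float]


def chunk_mean_pool(embeddings: List[Vector], chunk_size: int = CHUNK_SIZE) -> List[Vector]:
    """Group consecutive vectors into chunks and average each chunk.

    Mean pooling is an illustrative assumption, not the released method.
    A trailing partial chunk is averaged over the vectors it contains.
    """
    chunks = []
    for start in range(0, len(embeddings), chunk_size):
        group = embeddings[start:start + chunk_size]
        dim = len(group[0])
        chunks.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return chunks


# Ten token vectors of width 2 become three chunk vectors (4 + 4 + 2 tokens).
tokens = [[float(i), 1.0] for i in range(10)]
pooled = chunk_mean_pool(tokens)
print(len(tokens), "->", len(pooled))
```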
|
|
| ### 3. Repeated HSSM Blocks |
|
|
| The model contains: |
|
|
| - `num_blocks=6` |
|
|
| Each HSSM block combines two complementary mechanisms: |
|
|
| #### a. Selective State Space Modeling |
|
|
| A selective state space module processes the chunked sequence with structured, recurrence-like dynamics. Rather than relying purely on attention, it carries a latent state forward through the sequence and updates it with learned state transitions. This gives the model a sequential inductive bias that helps it track how a text progresses. |
|
|
| Key state-space parameter: |
|
|
| - `d_state=32` |
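
To make the idea of a selective recurrence concrete, here is a deliberately small sketch of a diagonal state space update with an input-dependent step size. It follows the general shape of selective SSMs from the literature and is an assumption throughout: the card does not specify HSSM's parameterization, and `selective_scan`, `softplus`, and the choice of decay rates are illustrative names and values only.

```python
import math
from typing import List

D_STATE = 32  # d_state from the released configuration


def softplus(z: float) -> float:
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs: List[float], decay_rates: List[float]) -> List[float]:
    """Run a diagonal state space recurrence over a 1-D input sequence."""
    state = [0.0] * len(decay_rates)
    outputs = []
    for x in xs:
        delta = softplus(x)  # input-dependent step size: the "selective" part
        for n, rate in enumerate(decay_rates):
            keep = math.exp(-delta * rate)          # how much old state survives
            state[n] = keep * state[n] + delta * x  # blend in the new input
        outputs.append(sum(state) / len(state))     # fixed, uniform readout
    return outputs


# One impulse followed by zeros: the state decays instead of vanishing at once.
rates = [(n + 1) / D_STATE for n in range(D_STATE)]
ys = selective_scan([1.0, 0.0, 0.0, 0.0], rates)
print([round(y, 3) for y in ys])
```

Each of the 32 state entries decays at its own rate, so the module keeps both fast-fading and slow-fading traces of earlier inputs.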
|
|
| #### b. Sparse Mixture-of-Experts Feed-Forward Stage |
|
|
| Each block also contains a sparse mixture-of-experts module: |
|
|
| - `num_experts=8` |
| - `top_k=2` |
| - `expert_dim=1024` |
|
|
| For every processed representation, the router activates only the top-2 of the 8 experts, so a quarter of the expert networks run for any given input. This increases total representational capacity without paying the dense compute cost of running all experts every time. |
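
Top-2 routing over 8 experts can be sketched as below. Renormalizing the softmax over only the selected experts is one common convention and is an assumption here; the card does not state how HSSM weights its chosen experts. `route_top_k`, `moe_forward`, and the toy experts are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

NUM_EXPERTS = 8  # num_experts from the released configuration
TOP_K = 2        # top_k from the released configuration

Vector = List[float]


def route_top_k(router_logits: Sequence[float], k: int = TOP_K) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x: Vector, router_logits: Sequence[float],
                experts: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Weighted sum of the selected experts' outputs; the rest never run."""
    out = [0.0] * len(x)
    for index, weight in route_top_k(router_logits):
        y = experts[index](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Toy experts: expert s simply scales its input by s.
experts = [lambda x, s=s: [s * v for v in x] for s in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.3, 1.0]
routing = route_top_k(logits)
print([index for index, _ in routing])
```

In this toy run only experts 4 and 1 are evaluated; the other six contribute no compute for this input.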
|
|
| ### 4. Final Normalization and Output Projection |
|
|
| After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction. |
|
|
| ## Released Configuration |
|
|
| This release uses the larger, Config A-style setup: |
|
|
| - `vocab_size=20000` |
| - `d_model=512` |
| - `d_state=32` |
| - `num_blocks=6` |
| - `num_experts=8` |
| - `top_k=2` |
| - `chunk_size=4` |
| - `expert_dim=1024` |
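
The values above can be collected into a small config object, which also allows a rough back-of-the-envelope count for two of the model's components. The class name `HSSMConfig` is hypothetical, and the expert estimate assumes each expert is a plain two-layer feed-forward network (`d_model -> expert_dim -> d_model`) without biases; the card does not confirm that layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMConfig:
    """Released hyperparameters as listed in this card (class name is illustrative)."""
    vocab_size: int = 20000
    d_model: int = 512
    d_state: int = 32
    num_blocks: int = 6
    num_experts: int = 8
    top_k: int = 2
    chunk_size: int = 4
    expert_dim: int = 1024


def embedding_params(cfg: HSSMConfig) -> int:
    """Weights in the token embedding table."""
    return cfg.vocab_size * cfg.d_model


def expert_params(cfg: HSSMConfig) -> int:
    """Weights in all experts, assuming two bias-free linear layers per expert."""
    per_expert = 2 * cfg.d_model * cfg.expert_dim
    return cfg.num_blocks * cfg.num_experts * per_expert


cfg = HSSMConfig()
print(f"embedding weights: {embedding_params(cfg):,}")  # 10,240,000
print(f"expert weights:    {expert_params(cfg):,}")     # 50,331,648
print(f"expert share active per input: {cfg.top_k / cfg.num_experts:.0%}")  # 25%
```

Under these assumptions the two terms account for about 60.6M of the reported 73.8M parameters; the remainder would sit in components this sketch does not model, such as the state space modules, routers, chunker, normalization, and output projection.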
|
|
| ## How HSSM Works Internally |
|
|
| At a high level, HSSM processes text as follows: |
|
|
| 1. Tokens are embedded into a continuous space. |
| 2. Neighboring tokens are grouped into chunks. |
| 3. Chunk representations are passed through repeated hierarchical blocks. |
| 4. Inside each block, selective state space dynamics model ordered sequence behavior. |
| 5. Sparse expert routing expands feed-forward capacity using only a small subset of experts per step. |
| 6. Final logits are produced for autoregressive next-token generation. |
|
|
| This creates a hybrid inductive bias: |
|
|
| - **hierarchical** because tokens are compressed into chunk-level structure |
| - **state-space based** because sequential dynamics are modeled through learned latent state transitions |
| - **sparse expert based** because only a subset of experts is activated for each representation |
|
|
| ## Known Limitations |
|
|
| Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe: |
|
|
| - repetitive continuations |
| - weak dialogue alignment |
| - unstable chat behavior on open-ended prompts |
| - sensitivity to tokenizer choice (this release was used with `simple_tokenizer_20k.json`) |
|
|
| For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data. |
|
|
| ## Files in This Repository |
|
|
| - `hssm_fineweb_edu_final.pt` — pretrained HSSM checkpoint |
| - `simple_tokenizer_20k.json` — tokenizer file used with this release |
| - `HSSM.png` — architecture image shown in this model card |
|
|
| ## Example Loading (PyTorch) |
|
|
```python
import torch
from hssm_pretrained_chat import load_pretrained, generate_reply

# Load the checkpoint and tokenizer files listed in this repository.
tokenizer, model = load_pretrained(
    "hssm_fineweb_edu_final.pt",
    "simple_tokenizer_20k.json",
    device="cpu",
)

# Generate a short continuation with low-temperature, filtered sampling.
reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=48,
    temperature=0.3,
    top_k=12,
    top_p=0.78,
    repetition_penalty=1.45,
    no_repeat_ngram_size=4,
)

print(reply)
```
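
For readers unfamiliar with the `temperature`, `top_k`, and `top_p` arguments, the sketch below shows how such filters are conventionally applied to next-token logits. It is not the implementation inside `generate_reply`, which this card does not show; `filter_distribution` is a hypothetical helper, and applying top-k before top-p is one common ordering.

```python
import math
import random
from typing import Dict, List


def filter_distribution(logits: List[float], temperature: float,
                        top_k: int, top_p: float) -> Dict[int, float]:
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    weights = [math.exp(z - peak) for z in scaled]  # unnormalized, numerically stable

    # Top-k: keep only the k highest-weighted tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    mass = sum(weights[i] for i in kept)
    return {i: weights[i] / mass for i in kept}


random.seed(0)
dist = filter_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.78)
token = random.choices(list(dist), weights=list(dist.values()))[0]
print(sorted(dist), token)
```

With the same toy logits and the settings from the loading example (`temperature=0.3`, `top_k=12`, `top_p=0.78`), only the single most likely token survives the filter, which illustrates why low temperatures combined with a tight `top_p` produce near-deterministic continuations.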
|
|
| ## Repository / Author |
|
|
| - **Model name**: `HSSM` |
| - **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI) |
| - **Checkpoint type**: pretrained public release |
|
|
| ## Citation |
|
|
| If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source. |