---
language:
- en
tags:
- pytorch
- hssm
- state-space-model
- mixture-of-experts
- autoregressive
- text-generation
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
---

# HSSM

HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release contains the FineWeb-Edu pretrained checkpoint of the model published by [DevHunterAI](https://huggingface.co/DevHunterAI).

![HSSM architecture diagram](HSSM.png)

## Model Summary

HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation.

This release corresponds to the pretrained checkpoint:

- `hssm_fineweb_edu_final.pt`

This checkpoint was pretrained on:

- `HuggingFaceFW/fineweb-edu`

## Intended Use

This model is intended for:

- research on hierarchical state space models
- experimentation with sparse expert routing for language modeling
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is **pretrained**, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

## Training Dataset

The pretraining data source selected for this release is:

- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Selection**: first 1.5 million samples
- **Epochs**: 1

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
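
For reference, a streaming selection along these lines can be reproduced with the `datasets` library. The exact preprocessing, packing, and tokenization used for this checkpoint are not part of this release, so treat this as a sketch of the data selection only:

```python
from datasets import load_dataset

# Stream FineWeb-Edu without downloading the full corpus,
# then keep only the first 1.5 million samples, as used for this release.
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
subset = stream.take(1_500_000)

for example in subset:
    text = example["text"]  # raw document text; tokenization and packing happen downstream
    break
```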

## Architecture Overview

HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages.

### 1. Token Embedding Layer

Input token ids are mapped into a dense latent space of dimension `d_model=512`.
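
In PyTorch terms this stage corresponds to a learned embedding table. The snippet below only illustrates the released dimensions; the actual module names in the checkpoint may differ:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 20000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.randint(0, vocab_size, (1, 16))  # (batch, seq_len)
x = embedding(token_ids)                           # (batch, seq_len, d_model) == (1, 16, 512)
```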

### 2. Hierarchical Chunker

The embedded token sequence is grouped into fixed-size chunks with:

- `chunk_size=4`

This chunking stage compresses local token neighborhoods into chunk-level representations before they are processed by deeper sequence blocks. The hierarchical view allows the model to reason over short local neighborhoods while reducing sequence-processing burden in later stages.
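
A minimal sketch of this idea, assuming chunk representations are formed by reshaping and pooling groups of `chunk_size` embeddings (the released model may instead use a learned compression; this is illustrative only):

```python
import torch

chunk_size = 4
x = torch.randn(1, 16, 512)  # embedded tokens: (batch, seq_len, d_model)

b, t, d = x.shape
assert t % chunk_size == 0   # in practice, pad the sequence to a multiple of chunk_size
chunks = x.view(b, t // chunk_size, chunk_size, d)  # (batch, num_chunks, chunk_size, d_model)
chunk_repr = chunks.mean(dim=2)                     # (batch, num_chunks, d_model)
```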

### 3. Repeated HSSM Blocks

The model contains:

- `num_blocks=6`

Each HSSM block combines two complementary mechanisms:

#### a. Selective State Space Modeling

A selective state space module processes the chunked sequence with structured recurrence-like dynamics. Instead of relying purely on attention, it models ordered token evolution through learned state transitions. This helps the model retain a sequential inductive bias and capture progression through text.

Key state-space parameter:

- `d_state=32`
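
For intuition, a selective state space layer can be summarized as an input-dependent linear recurrence over a small hidden state of size `d_state`. The toy module below illustrates that recurrence with a slow sequential loop; it is not the released implementation (which would use a parallel scan and its own parameterization), and all names here are hypothetical:

```python
import torch
import torch.nn as nn

class ToySelectiveSSM(nn.Module):
    """Illustrative recurrence: h_t = a(x_t) * h_{t-1} + b(x_t) * u(x_t), y_t = W_out h_t."""

    def __init__(self, d_model=512, d_state=32):
        super().__init__()
        self.to_a = nn.Linear(d_model, d_state)     # input-dependent "keep" gate
        self.to_b = nn.Linear(d_model, d_state)     # input-dependent "write" gate
        self.in_proj = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.out_proj.in_features)
        ys = []
        for t in range(seq_len):                    # sequential scan, for clarity only
            keep = torch.sigmoid(self.to_a(x[:, t]))
            write = torch.sigmoid(self.to_b(x[:, t]))
            h = keep * h + write * self.in_proj(x[:, t])
            ys.append(self.out_proj(h))
        return torch.stack(ys, dim=1)               # (batch, seq_len, d_model)
```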

#### b. Sparse Mixture-of-Experts Feed-Forward Stage

Each block also contains a sparse mixture-of-experts module:

- `num_experts=8`
- `top_k=2`
- `expert_dim=1024`

For every processed representation, the router activates only the top-2 experts rather than all eight. This increases representational capacity without paying the dense compute cost of running every expert on every token.
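
A minimal top-k routing sketch with these hyperparameters is shown below. It is a simplified illustration, not the released router; details such as load-balancing losses, capacity limits, and the exact expert architecture are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    def __init__(self, d_model=512, expert_dim=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, expert_dim), nn.GELU(), nn.Linear(expert_dim, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (num_tokens, d_model)
        weights, indices = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize over the 2 selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                     # only the chosen experts run per token
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Routing each token to only two of eight experts keeps per-token compute close to that of a single dense feed-forward of width `expert_dim`, while the total parameter count grows with `num_experts`.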

### 4. Final Normalization and Output Projection

After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction.

## Released Configuration

This release uses the larger Config A style setup:

- `vocab_size=20000`
- `d_model=512`
- `d_state=32`
- `num_blocks=6`
- `num_experts=8`
- `top_k=2`
- `chunk_size=4`
- `expert_dim=1024`

## How HSSM Works Internally

At a high level, HSSM processes text as follows (a simplified, shape-only sketch appears after the list):

1. Tokens are embedded into a continuous space.
2. Neighboring tokens are grouped into chunks.
3. Chunk representations are passed through repeated hierarchical blocks.
4. Inside each block, selective state space dynamics model ordered sequence behavior.
5. Sparse expert routing expands feed-forward capacity using only a small subset of experts per step.
6. Final logits are produced for autoregressive next-token generation.
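
Putting the stages together, the end-to-end data flow looks roughly like the sketch below. The block internals are replaced by simple placeholder layers (see the SSM and MoE sketches above), and the step that maps chunk-level states back to token positions for next-token prediction is omitted, since its exact form is not described here:

```python
import torch
import torch.nn as nn

# Hyperparameters from the released configuration.
vocab_size, d_model, chunk_size, num_blocks = 20000, 512, 4, 6

embedding = nn.Embedding(vocab_size, d_model)
blocks = nn.ModuleList(                       # placeholders for "selective SSM + top-2 MoE" blocks
    nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(num_blocks)
)
final_norm = nn.LayerNorm(d_model)
lm_head = nn.Linear(d_model, vocab_size)

token_ids = torch.randint(0, vocab_size, (1, 16))           # (batch, seq_len)
x = embedding(token_ids)                                    # 1. embed tokens
b, t, d = x.shape
x = x.view(b, t // chunk_size, chunk_size, d).mean(dim=2)   # 2. chunk (pooling stand-in)
for block in blocks:                                        # 3-5. stacked hierarchical blocks
    x = x + block(x)
logits = lm_head(final_norm(x))                             # 6. vocabulary logits per chunk position
```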

This creates a hybrid inductive bias:

- **hierarchical** because tokens are compressed into chunk-level structure
- **state-space based** because sequential dynamics are modeled through learned latent state transitions
- **sparse expert based** because only a subset of experts is activated for each representation

## Known Limitations

Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

## Files in This Repository

- `hssm_fineweb_edu_final.pt` — pretrained HSSM checkpoint
- `simple_tokenizer_20k.json` — tokenizer file used with this release
- `HSSM.png` — architecture image shown in this model card

## Example Loading (PyTorch)

The snippet below assumes the `hssm_pretrained_chat` helper module, which provides `load_pretrained` and `generate_reply`, is importable from your working directory.

```python
import torch
from hssm_pretrained_chat import load_pretrained, generate_reply

# Load the released checkpoint and tokenizer on CPU.
tokenizer, model = load_pretrained(
    "hssm_fineweb_edu_final.pt",
    "simple_tokenizer_20k.json",
    device="cpu",
)

# Conservative sampling settings to reduce repetition in a pretrained-only checkpoint.
reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=48,
    temperature=0.3,
    top_k=12,
    top_p=0.78,
    repetition_penalty=1.45,
    no_repeat_ngram_size=4,
)

print(reply)
```

## Repository / Author

- **Model name**: `HSSM`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release

## Citation

If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.