---
language:
- en
tags:
- pytorch
- hssm
- state-space-model
- mixture-of-experts
- autoregressive
- text-generation
- 73.8M
- pretrained
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
model_type: custom
license: mit
---

# HSSM

HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release contains the FineWeb-Edu pretrained checkpoint of the model published by [DevHunterAI](https://huggingface.co/DevHunterAI).

![HSSM architecture](./HSSM.png)

## Model Summary

HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation.

This release corresponds to the pretrained checkpoint:

- `hssm_fineweb_edu_final.pt`

Parameter count:

- `73.8M` parameters

This checkpoint was pretrained on:

- `HuggingFaceFW/fineweb-edu`

## Intended Use

This model is intended for:

- research on hierarchical state space models
- experimentation with sparse expert routing for language modeling
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is **pretrained only**; it has not been instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

## Training Dataset

The pretraining data source selected for this release is:

- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Selection**: first 1.5 million samples
- **Epochs**: 1

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
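
The selection described above (stream the dataset, keep the first 1.5 million samples, make a single pass) could be reproduced roughly as follows. This is a minimal sketch, not the training pipeline used for this release: `first_n_samples` and `open_fineweb_edu_stream` are hypothetical helper names, and the loader assumes the Hugging Face `datasets` package and network access.

```python
from itertools import islice


def first_n_samples(stream, n=1_500_000):
    """Return an iterator over the first n records of any iterable stream."""
    return islice(stream, n)


def open_fineweb_edu_stream():
    """Open FineWeb-Edu in streaming mode (nothing is downloaded up front)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)


# Hypothetical single-epoch loop over the selected samples:
# for record in first_n_samples(open_fineweb_edu_stream()):
#     text = record["text"]
```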

## Architecture Overview

HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages.

### 1. Token Embedding Layer

Input token ids are mapped into a dense latent space of dimension `d_model=512`.
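
As a shape-level illustration, the embedding stage can be sketched with a standard lookup table. The sketch assumes a plain `nn.Embedding`; the released model may add scaling, dropout, or other details not described here.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 20000  # vocab_size from the released configuration
D_MODEL = 512       # d_model from the released configuration

# Assumption: a plain lookup table with no extra scaling or dropout.
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)

token_ids = torch.randint(0, VOCAB_SIZE, (2, 16))  # (batch, seq_len)
hidden = embedding(token_ids)                      # (batch, seq_len, d_model)
print(hidden.shape)  # torch.Size([2, 16, 512])
```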

### 2. Hierarchical Chunker

The embedded token sequence is grouped into fixed-size chunks with:

- `chunk_size=4`

This chunking stage compresses local token neighborhoods into chunk-level representations before they are processed by deeper sequence blocks. Short-range structure is captured inside each chunk, while the later stages operate on a shorter sequence: assuming non-overlapping chunks, `chunk_size=4` reduces the sequence length seen by the deeper blocks to roughly one quarter.
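
One possible way to build chunk-level representations is sketched below. The `ConcatChunker` class is hypothetical: it assumes non-overlapping chunks, right-pads the sequence to a multiple of `chunk_size`, and mixes each chunk with a single linear layer. The actual chunker in the checkpoint may pool or project differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatChunker(nn.Module):
    """Hypothetical chunker: concatenate each group of tokens, then project."""

    def __init__(self, d_model: int = 512, chunk_size: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        self.proj = nn.Linear(chunk_size * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        # Right-pad the sequence axis with zeros up to a multiple of chunk_size.
        pad = (-seq_len) % self.chunk_size
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        # (batch, num_chunks, chunk_size * d_model) -> (batch, num_chunks, d_model)
        chunks = x.reshape(batch, -1, self.chunk_size * d_model)
        return self.proj(chunks)


chunker = ConcatChunker()
chunked = chunker(torch.randn(2, 18, 512))
print(chunked.shape)  # torch.Size([2, 5, 512]): 18 tokens padded to 20, then 5 chunks
```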

### 3. Repeated HSSM Blocks

The model contains:

- `num_blocks=6`

Each HSSM block combines two complementary mechanisms:

#### a. Selective State Space Modeling

A selective state space module processes the chunked sequence with structured recurrence-like dynamics. Instead of relying purely on attention, it models ordered token evolution through learned state transitions. This helps the model retain sequential inductive bias and capture progression through text.

Key state-space parameter:

- `d_state=32`
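
To make "selective" concrete, the sketch below implements a simplified diagonal state space recurrence whose step size, input matrix, and readout all depend on the current input. `TinySelectiveSSM` is a hypothetical, unoptimized illustration (a Python loop over time steps), not the module stored in the checkpoint; its parameterization is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySelectiveSSM(nn.Module):
    """Illustrative diagonal selective SSM using a sequential scan."""

    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        # A = -exp(log_neg_a) is negative, so exp(delta * A) stays in (0, 1).
        self.log_neg_a = nn.Parameter(torch.zeros(d_model, d_state))
        self.to_delta = nn.Linear(d_model, d_model)  # input-dependent step size
        self.to_b = nn.Linear(d_model, d_state)      # input-dependent input matrix
        self.to_c = nn.Linear(d_model, d_state)      # input-dependent readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, length, d_model = x.shape
        a = -torch.exp(self.log_neg_a)               # (d_model, d_state)
        delta = F.softplus(self.to_delta(x))         # (batch, length, d_model)
        b = self.to_b(x)                             # (batch, length, d_state)
        c = self.to_c(x)                             # (batch, length, d_state)
        h = x.new_zeros(batch, d_model, a.shape[1])  # latent state
        outputs = []
        for t in range(length):
            dt = delta[:, t].unsqueeze(-1)           # (batch, d_model, 1)
            decay = torch.exp(dt * a)                # how much old state is kept
            update = dt * b[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
            h = decay * h + update                   # state transition
            outputs.append((h * c[:, t].unsqueeze(1)).sum(-1))
        return torch.stack(outputs, dim=1)           # (batch, length, d_model)


ssm = TinySelectiveSSM()
out = ssm(torch.randn(2, 8, 512))
print(out.shape)  # torch.Size([2, 8, 512])
```

Because the state at step `t` depends only on inputs up to `t`, the output at each position is unaffected by later positions.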

#### b. Sparse Mixture-of-Experts Feed-Forward Stage

Each block also contains a sparse mixture-of-experts module:

- `num_experts=8`
- `top_k=2`
- `expert_dim=1024`

For every processed representation, the router activates only the top-2 of the 8 experts. This increases representational capacity while each representation pays the compute cost of two experts rather than all eight.
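
A minimal sketch of top-k routing with these settings is shown below. The `TopKMoE` class is hypothetical: the expert shape (two linear layers with GELU), the softmax over the selected router scores, and the absence of a load-balancing loss are all assumptions, not details confirmed for this checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Illustrative sparse MoE layer: each row is processed by top_k experts."""

    def __init__(self, d_model=512, expert_dim=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.reshape(-1, x.shape[-1])                # (rows, d_model)
        scores = self.router(flat)                       # (rows, num_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            # Rows that selected expert e, and the slot (0..top_k-1) it occupies.
            rows, slot = (top_idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue  # expert e received no rows in this batch
            out[rows] += weights[rows, slot].unsqueeze(-1) * expert(flat[rows])
        return out.reshape(x.shape)


moe = TopKMoE()
mixed = moe(torch.randn(2, 8, 512))
print(mixed.shape)  # torch.Size([2, 8, 512])
```

Only the selected experts run for a given row, which is where the compute saving over a dense layer of the same total size comes from.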

### 4. Final Normalization and Output Projection

After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction.
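
The output stage can be sketched as a normalization followed by a linear projection to the vocabulary. The choice of `nn.LayerNorm` and an untied, bias-free output layer are assumptions, and the sketch takes hidden states that are already at token resolution; how the model maps chunk-level states back to token positions is not described in this card.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 20000

final_norm = nn.LayerNorm(d_model)                    # assumption: LayerNorm
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # assumption: untied weights

hidden = torch.randn(2, 16, d_model)                  # (batch, positions, d_model)
logits = lm_head(final_norm(hidden))                  # (batch, positions, vocab_size)

# Greedy next-token choice from the last position of each sequence.
next_token = logits[:, -1].argmax(dim=-1)
print(logits.shape, next_token.shape)
```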

## Released Configuration

This release uses the larger, Config A-style setup:

- `vocab_size=20000`
- `d_model=512`
- `d_state=32`
- `num_blocks=6`
- `num_experts=8`
- `top_k=2`
- `chunk_size=4`
- `expert_dim=1024`
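
For scripting, the values above can be collected in a small configuration object. `HSSMConfig` and its helper methods are hypothetical names introduced for illustration; the field values are the ones listed in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMConfig:
    """Hypothetical container for the released hyperparameters."""

    vocab_size: int = 20000
    d_model: int = 512
    d_state: int = 32
    num_blocks: int = 6
    num_experts: int = 8
    top_k: int = 2
    chunk_size: int = 4
    expert_dim: int = 1024

    @property
    def active_expert_fraction(self) -> float:
        """Share of experts that run for each routed representation."""
        return self.top_k / self.num_experts

    def num_chunks(self, seq_len: int) -> int:
        """Chunk count for a sequence, assuming non-overlapping chunks."""
        return -(-seq_len // self.chunk_size)  # ceiling division


config = HSSMConfig()
print(config.active_expert_fraction)  # 0.25
print(config.num_chunks(128))         # 32
```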

## How HSSM Works Internally

At a high level, HSSM processes text as follows:

1. Tokens are embedded into a continuous space.
2. Neighboring tokens are grouped into chunks.
3. Chunk representations are passed through repeated hierarchical blocks.
4. Inside each block, selective state space dynamics model ordered sequence behavior.
5. Sparse expert routing expands feed-forward capacity using only a small subset of experts per step.
6. Final logits are produced for autoregressive next-token generation.

This creates a hybrid inductive bias:

- **hierarchical** because tokens are compressed into chunk-level structure
- **state-space based** because sequential dynamics are modeled through learned latent state transitions
- **sparse expert based** because only a subset of experts is activated for each representation

## Known Limitations

Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

## Files in This Repository

- `hssm_fineweb_edu_final.pt` — pretrained HSSM checkpoint
- `simple_tokenizer_20k.json` — tokenizer file used with this release
- `HSSM.png` — architecture image shown in this model card

## Example Loading (PyTorch)

```python
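import torch
# `hssm_pretrained_chat` must be importable from the working directory or Python path.
from hssm_pretrained_chat import load_pretrained, generate_reply

# Load the tokenizer and the pretrained weights onto the CPU.
tokenizer, model = load_pretrained(
    "hssm_fineweb_edu_final.pt",
    "simple_tokenizer_20k.json",
    device="cpu",
)

# Generate a short continuation with low-temperature, repetition-penalized sampling.
reply = generate_reply(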
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=48,
    temperature=0.3,
    top_k=12,
    top_p=0.78,
    repetition_penalty=1.45,
    no_repeat_ngram_size=4,
)

print(reply)
```
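
For readers unfamiliar with the sampling arguments used above, the following sketch shows how `temperature`, `top_k`, and `top_p` are commonly combined when picking the next token. `sample_next_token` is a hypothetical function, not the implementation inside `generate_reply`, and it does not cover `repetition_penalty` or `no_repeat_ngram_size`.

```python
import torch


def sample_next_token(logits, temperature=0.3, top_k=12, top_p=0.78):
    """Sample one token id from a 1-D tensor of vocabulary logits."""
    # Lower temperature sharpens the distribution toward the best-scoring tokens.
    scaled = logits / temperature
    # Keep only the k highest-scoring tokens (returned in descending order).
    k = min(top_k, scaled.numel())
    top_vals, top_idx = scaled.topk(k)
    probs = torch.softmax(top_vals, dim=-1)
    # Nucleus filter: keep a token if the mass before it is still below top_p.
    # The first token always passes, so at least one candidate remains.
    mass_before = probs.cumsum(dim=-1) - probs
    probs = probs * (mass_before < top_p)
    probs = probs / probs.sum()
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx[choice].item()


example_logits = torch.randn(20000)
token_id = sample_next_token(example_logits)
```

With a temperature as low as `0.3`, most of the probability mass usually sits on a few tokens, so the `top_p` cutoff often leaves fewer than `top_k` candidates.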

## Repository / Author

- **Model name**: `HSSM`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release

## Citation

If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.