	---
	language:
	- en
	tags:
	- pytorch
	- hssm
	- state-space-model
	- mixture-of-experts
	- autoregressive
	- text-generation
	- 73.8 M
	- pretrained
	datasets:
	- HuggingFaceFW/fineweb-edu
	pipeline_tag: text-generation
	library_name: pytorch
	model_type: custom
	license: mit
	---

	# HSSM

	HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release contains the FineWeb-Edu pretrained checkpoint of the model published by [DevHunterAI](https://huggingface.co/DevHunterAI).

	![HSSM architecture](./HSSM.png)

	## Model Summary

	HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation.

	This release corresponds to the pretrained checkpoint:

	- `hssm_fineweb_edu_final.pt`

	Parameter count:

	- `73.8M` parameters

	This checkpoint was pretrained on:

	- `HuggingFaceFW/fineweb-edu`

	## Intended Use

	This model is intended for:

	- research on hierarchical state space models
	- experimentation with sparse expert routing for language modeling
	- continued fine-tuning on dialogue, instruction, or domain datasets
	- architecture analysis and comparison against transformer and recurrent baselines

	This checkpoint is pretrained only; it has not been instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

	## Training Dataset

	The pretraining data source selected for this release is:

	- Dataset: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
	- Usage mode: streaming pretraining pipeline
	- Selection: first 1.5 million samples
	- Epochs: 1

	FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.

	## Architecture Overview

	HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages.

	### 1. Token Embedding Layer

	Input token ids are mapped into a dense latent space of dimension `d_model=512`.

	### 2. Hierarchical Chunker

	The embedded token sequence is grouped into fixed-size chunks with:

	- `chunk_size=4`

	This chunking stage compresses local token neighborhoods into chunk-level representations before they are processed by deeper sequence blocks. The hierarchical view allows the model to reason over short local neighborhoods while reducing sequence-processing burden in later stages.
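To make the compression step concrete, here is a minimal sketch of fixed-size chunking in plain Python. The model card does not say how a chunk is reduced to a single vector, so mean pooling is an assumption made purely for illustration, as is averaging a shorter trailing chunk over its actual length. The function name `chunk_mean_pool` is hypothetical and not part of the released code.

```python
def chunk_mean_pool(embeddings, chunk_size=4):
    """Group consecutive token vectors into chunks and average each chunk.

    Mean pooling is an illustrative assumption; the released model may
    compress chunks differently (for example with a learned projection).
    """
    chunks = []
    for start in range(0, len(embeddings), chunk_size):
        group = embeddings[start:start + chunk_size]
        dim = len(group[0])
        # Average every feature dimension over the tokens in this chunk.
        chunks.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return chunks


# Ten toy 2-dimensional "token embeddings" stand in for d_model=512 vectors.
tokens = [[float(i), float(2 * i)] for i in range(10)]
chunks = chunk_mean_pool(tokens)
print(len(chunks))  # 3 chunks covering 4 + 4 + 2 tokens
```

With `chunk_size=4`, the later blocks see roughly a quarter as many positions as there are input tokens, which is the reduction in sequence-processing burden described above.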

	### 3. Repeated HSSM Blocks

	The model contains:

	- `num_blocks=6`

	Each HSSM block combines two complementary mechanisms:

	#### a. Selective State Space Modeling

	A selective state space module processes the chunked sequence with structured recurrence-like dynamics. Instead of relying purely on attention, it models ordered token evolution through learned state transitions. This helps the model retain sequential inductive bias and capture progression through text.

	Key state-space parameter:

	- `d_state=32`
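The idea of learned, input-dependent state transitions can be illustrated with a small scan over a scalar input sequence. This is a generic diagonal selective state space recurrence written as a sketch, not the author's implementation: the function `selective_scan`, its parameters (`a`, `w_delta`, `w_b`, `w_c`), and the softplus step size are all assumptions chosen for illustration. The toy example uses a single state dimension instead of `d_state=32`.

```python
import math


def selective_scan(xs, a, w_delta, w_b, w_c):
    """Run a diagonal selective state space recurrence over scalars.

    xs      : input sequence (one float per step)
    a       : negative decay rates, one per state dimension
    w_delta : weight that makes the step size depend on the input
    w_b     : input-to-state weights, one per state dimension
    w_c     : state-to-output weights, one per state dimension
    """
    d_state = len(a)
    h = [0.0] * d_state  # hidden state carried across steps
    ys = []
    for x in xs:
        # Softplus keeps the input-dependent step size strictly positive.
        delta = math.log1p(math.exp(w_delta * x))
        for i in range(d_state):
            decay = math.exp(delta * a[i])  # in (0, 1) because a[i] < 0
            h[i] = decay * h[i] + delta * w_b[i] * x
        ys.append(sum(w_c[i] * h[i] for i in range(d_state)))
    return ys


# A single impulse followed by zeros: the state decays step by step.
ys = selective_scan([1.0, 0.0, 0.0], a=[-1.0], w_delta=0.0, w_b=[1.0], w_c=[1.0])
print(ys)
```

Because the state is updated in order, each output depends only on the current and earlier inputs, which is what gives this family of models its sequential inductive bias.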

	#### b. Sparse Mixture-of-Experts Feed-Forward Stage

	Each block also contains a sparse mixture-of-experts module:

	- `num_experts=8`
	- `top_k=2`
	- `expert_dim=1024`

	For each processed representation, the router activates only the top 2 of the 8 experts. This raises total parameter capacity while keeping per-step expert compute at roughly a quarter of what running all eight experts would cost.
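Top-k routing of this kind can be sketched in a few lines. The version below applies a softmax over the router logits, keeps the two highest-scoring experts, and renormalizes their weights to sum to 1. Whether the released model renormalizes, adds noise, or uses an auxiliary load-balancing loss is not stated in this card, so those details are assumptions; `route_top_k` is a hypothetical name.

```python
import math


def route_top_k(router_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts.

    Softmax followed by renormalization over the selected experts is one
    common choice and is assumed here for illustration only.
    """
    m = max(router_logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# One router score per expert (num_experts=8).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routing = route_top_k(logits)
print(routing)  # experts 1 and 4 are selected
```

The block output for that representation would then be the weighted sum of the two selected experts' outputs; the other six experts are never evaluated.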

	### 4. Final Normalization and Output Projection

	After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction.

	## Released Configuration

	This release uses the larger Config A style setup:

	- `vocab_size=20000`
	- `d_model=512`
	- `d_state=32`
	- `num_blocks=6`
	- `num_experts=8`
	- `top_k=2`
	- `chunk_size=4`
	- `expert_dim=1024`
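The values above can be collected into a small configuration object, which also allows a rough check of where the parameters sit. This is a sketch under stated assumptions: `HSSMConfig` is a hypothetical container (the released code may structure its configuration differently), and the expert estimate assumes each expert is a bias-free two-layer MLP mapping `d_model -> expert_dim -> d_model`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMConfig:
    """Hypothetical container for the released hyperparameters."""
    vocab_size: int = 20000
    d_model: int = 512
    d_state: int = 32
    num_blocks: int = 6
    num_experts: int = 8
    top_k: int = 2
    chunk_size: int = 4
    expert_dim: int = 1024


def embedding_params(cfg):
    # One d_model-sized vector per vocabulary entry.
    return cfg.vocab_size * cfg.d_model


def expert_params(cfg):
    # Assumption: two bias-free linear layers per expert.
    per_expert = 2 * cfg.d_model * cfg.expert_dim
    return cfg.num_blocks * cfg.num_experts * per_expert


cfg = HSSMConfig()
print(embedding_params(cfg))  # 10240000
print(expert_params(cfg))     # 50331648
```

Under these assumptions the two terms total about 60.6M of the stated 73.8M parameters. The remainder would come from components not counted here, such as the state space modules, chunker, routers, normalization layers, and the output projection if it is not tied to the embedding.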

	## How HSSM Works Internally

	At a high level, HSSM processes text as follows:

	1. Tokens are embedded into a continuous space.
	2. Neighboring tokens are grouped into chunks.
	3. Chunk representations are passed through repeated hierarchical blocks.
	4. Inside each block, selective state space dynamics model ordered sequence behavior.
	5. Sparse expert routing expands feed-forward capacity using only a small subset of experts per step.
	6. Final logits are produced for autoregressive next-token generation.

	This creates a hybrid inductive bias:

	- hierarchical because tokens are compressed into chunk-level structure
	- state-space based because sequential dynamics are modeled through learned latent state transitions
	- sparse expert based because only a subset of experts is activated for each representation

	## Known Limitations

	Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

	- repetitive continuations
	- weak dialogue alignment
	- unstable chat behavior on open-ended prompts
	- sensitivity to tokenizer choice

	For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

	## Files in This Repository

	- `hssm_fineweb_edu_final.pt` — pretrained HSSM checkpoint
	- `simple_tokenizer_20k.json` — tokenizer file used with this release
	- `HSSM.png` — architecture image shown in this model card

	## Example Loading (PyTorch)

	```python
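	import torch
	# Assumes the author's hssm_pretrained_chat module is importable.
	from hssm_pretrained_chat import load_pretrained, generate_reply

	# Load the tokenizer and the pretrained weights onto the CPU.
	tokenizer, model = load_pretrained(
	    "hssm_fineweb_edu_final.pt",
	    "simple_tokenizer_20k.json",
	    device="cpu",
	)

	# Generate a short continuation with low-temperature sampling and
	# repetition controls.
	reply = generate_reply(
	    model=model,
	    tokenizer=tokenizer,
	    prompt="What is machine learning?",
	    max_length=48,
	    temperature=0.3,
	    top_k=12,
	    top_p=0.78,
	    repetition_penalty=1.45,
	    no_repeat_ngram_size=4,
	)

	print(reply)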
	```

	## Repository / Author

	- Model name: `HSSM`
	- Publisher: [DevHunterAI](https://huggingface.co/DevHunterAI)
	- Checkpoint type: pretrained public release

	## Citation

	If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.