---
language:
- en
tags:
- pytorch
- hssm-v2
- hierarchical-state-space-model
- mixture-of-experts
- autoregressive
- text-generation
- fineweb-edu
- 250m-parameters
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
library_name: pytorch
---

# HSSM v2 250M

HSSM v2 is a hierarchical state-space language model with sparse Mixture-of-Experts routing for autoregressive text generation. This release contains the FineWeb-Edu pretrained checkpoint published by [DevHunterAI](https://huggingface.co/DevHunterAI).



## Model Summary

HSSM v2 combines local depthwise temporal mixing, chunk-level hierarchical state propagation, residual gating, and sparse Mixture-of-Experts feed-forward blocks in a single causal language model.

This release corresponds to the pretrained checkpoint:

- `hssm_v2_250m_fineweb_edu_final.pt`

Model scale:
- **Total parameters**: `250,040,256` (`~250M`)
- **Active parameters per token path**: `26,534,400` (`~26.5M`)
- **Architecture**: sparse MoE language model with top-1 expert routing in MoE layers

This checkpoint was pretrained on:

- `HuggingFaceFW/fineweb-edu`
- `1.25B` tokens

Training note:
- pretrained in approximately **2 hours** on an **NVIDIA RTX Pro 6000 Blackwell GPU**

## Intended Use

This model is intended for:

- research on hierarchical state-space language models
- experimentation with sparse expert routing for autoregressive text generation
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is **pretrained**, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

## Training Dataset

The pretraining data source for this release is:

- **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Usage mode**: streaming pretraining pipeline
- **Token budget**: `1.25B` tokens
- **Domain**: educational and general web text

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
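
As a rough illustration of the streaming usage mode, FineWeb-Edu can be consumed with the `datasets` library without a full local download. The GPT-2 tokenizer and the 1024-token sequence length below are assumptions for this sketch, not the exact settings of the released training script:

```python
# Illustrative streaming pipeline; tokenizer and sequence length are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

SEQ_LEN = 1024  # assumed pretraining sequence length

tokenizer = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

def token_sequences(stream, tokenizer, seq_len=SEQ_LEN):
    """Yield fixed-length lists of token ids from the streamed corpus."""
    buffer = []
    for sample in stream:
        buffer.extend(tokenizer(sample["text"])["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # document separator
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

first = next(token_sequences(stream, tokenizer))
print(len(first))  # -> 1024
```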

## Architecture Overview

HSSM v2 is organized as a stacked hierarchical autoregressive architecture with token embeddings, ten HSSM blocks, final normalization, and a tied language modeling head.

### Core configuration

- `vocab_size = 50257`
- `d_model = 288`
- `n_layers = 10`
- `d_ff = 512`
- `state_rank = 128`
- `chunk_size = 8`
- `num_experts = 64`
- `experts_per_token = 1`
- `expert_dim = 2048`
- `moe_every = 4`
- `tie_embeddings = true`
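
For convenience, the same values map onto a small configuration container. The dataclass below is illustrative only and is not the configuration object used by the released training script:

```python
from dataclasses import dataclass

@dataclass
class HSSMv2Config:
    # Values copied from the "Core configuration" list above; the class itself is hypothetical.
    vocab_size: int = 50257
    d_model: int = 288
    n_layers: int = 10
    d_ff: int = 512
    state_rank: int = 128
    chunk_size: int = 8
    num_experts: int = 64
    experts_per_token: int = 1
    expert_dim: int = 2048
    moe_every: int = 4        # every 4th block swaps the dense MLP for a SparseMoE layer
    tie_embeddings: bool = True

config = HSSMv2Config()
print(config)
```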

### Block structure

Each HSSM v2 block follows this pattern:

1. `RMSNorm`
2. `HierarchicalStateMixer`
3. residual add
4. `RMSNorm`
5. `GatedMLP` or `SparseMoE`
6. residual add

Every 4th block uses `SparseMoE`, so with 10 layers this release contains 2 MoE blocks.
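
A minimal sketch of this pre-norm residual layout is shown below. `torch.nn.RMSNorm` requires a recent PyTorch release, and the mixer and feed-forward sub-modules are passed in rather than defined here; the actual constructors in `hssm_v2_gpu_pretrain.py` may differ:

```python
import torch.nn as nn

class HSSMBlock(nn.Module):
    """Illustrative pre-norm residual block: state-mixer sub-layer + feed-forward sub-layer."""

    def __init__(self, d_model, mixer, ff):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)
        self.mixer = mixer   # HierarchicalStateMixer (sketched in the next subsection)
        self.norm2 = nn.RMSNorm(d_model)
        self.ff = ff         # GatedMLP in most blocks, SparseMoE in every 4th block

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # steps 1-3: norm, mixer, residual add
        x = x + self.ff(self.norm2(x))     # steps 4-6: norm, MLP/MoE, residual add
        return x
```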

### HierarchicalStateMixer

The mixer replaces standard attention with a combination of:

- depthwise `Conv1d` local temporal mixing
- chunking with `chunk_size=8`
- mean pooling over chunk windows
- state compression `288 -> 128`
- state expansion `128 -> 288`
- repeat-interleave back to token length
- gated residual fusion followed by output projection

This gives the model a hybrid inductive bias with local token interaction and chunk-level state propagation.
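
The pipeline above can be sketched roughly as follows. The layer names, convolution kernel size, and padding details are assumptions; the released `hssm_v2_gpu_pretrain.py` remains the authoritative definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalStateMixer(nn.Module):
    """Illustrative mixer: local depthwise conv + chunk-level state path + gated fusion."""

    def __init__(self, d_model=288, state_rank=128, chunk_size=8, kernel_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.kernel_size = kernel_size
        # Depthwise causal conv for local temporal mixing (kernel size assumed).
        self.local = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.compress = nn.Linear(d_model, state_rank)   # 288 -> 128
        self.expand = nn.Linear(state_rank, d_model)     # 128 -> 288
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        # Local mixing: left-pad in time so the depthwise conv stays causal.
        h = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        local = self.local(h).transpose(1, 2)
        # Chunk, mean-pool, compress to the latent state, expand back.
        pad = (-t) % self.chunk_size
        padded = F.pad(local, (0, 0, 0, pad))
        chunks = padded.reshape(b, -1, self.chunk_size, d).mean(dim=2)
        state = self.expand(self.compress(chunks))
        # Broadcast chunk states back to token positions.
        # NOTE: a real causal model must avoid leaking future tokens within a chunk
        # (e.g. by shifting chunk states); that detail is omitted in this sketch.
        state = state.repeat_interleave(self.chunk_size, dim=1)[:, :t]
        # Gated residual fusion followed by output projection.
        g = torch.sigmoid(self.gate(torch.cat([local, state], dim=-1)))
        return self.out(g * local + (1 - g) * state)
```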

### Sparse MoE

Sparse MoE blocks use:

- `64` experts
- top-`1` routing per token
- expert hidden size `2048`
- auxiliary load-balancing loss

Only one expert path is active per token in each MoE layer, which is why the active parameter count is much smaller than the total parameter count.
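
The routing behavior can be illustrated with a generic top-1 MoE layer and a Switch-style load-balancing term. This is a simplified sketch (dense expert loop, assumed GELU activation), not the repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative top-1 MoE feed-forward layer with a load-balancing auxiliary loss."""

    def __init__(self, d_model=288, expert_dim=2048, num_experts=64):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, expert_dim), nn.GELU(), nn.Linear(expert_dim, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        flat = x.reshape(-1, x.shape[-1])                  # (tokens, d_model)
        probs = F.softmax(self.router(flat), dim=-1)       # (tokens, num_experts)
        top_p, top_idx = probs.max(dim=-1)                 # top-1 routing per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):          # naive loop; real kernels batch this
            mask = top_idx == e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(flat[mask])
        # Switch-style load-balancing loss: tokens-per-expert fraction vs. mean router prob,
        # stored on the module so the training loop can collect it.
        frac = F.one_hot(top_idx, probs.shape[-1]).float().mean(dim=0)
        self.aux_loss = probs.shape[-1] * torch.sum(frac * probs.mean(dim=0))
        return out.view_as(x)
```

Because only one of the 64 experts runs per token in each MoE layer, most expert weights sit idle on any given forward pass, which is the source of the gap between total and active parameters.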

### Output head

After the final `RMSNorm`, the model projects hidden states to vocabulary logits using a tied LM head that shares weights with the token embedding matrix.
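
Weight tying typically amounts to reusing the embedding matrix as the output projection, roughly as in this illustrative snippet:

```python
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50257, 288
tok_emb = nn.Embedding(vocab_size, d_model)
final_norm = nn.RMSNorm(d_model)

def lm_logits(hidden):
    # Tied LM head: the embedding matrix doubles as the output projection.
    return F.linear(final_norm(hidden), tok_emb.weight)   # (batch, seq_len, vocab_size)
```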

## Training Details

During pretraining, each batch flows through the model as follows:

1. Tokens are embedded into a continuous space.
2. Local token interactions are modeled with depthwise convolution.
3. Chunk summaries are compressed into latent states and expanded back across token positions.
4. Sparse MoE blocks increase capacity with top-1 expert routing.
5. Final logits are produced for next-token prediction.

Additional training facts for this release:

- **Pretraining tokens**: `1.25B`
- **Training hardware**: `NVIDIA RTX Pro 6000 Blackwell`
- **Approximate pretraining duration**: `2 hours`
- **Objective**: autoregressive next-token prediction with auxiliary MoE load-balancing loss
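
Putting the objective together, the total loss is the next-token cross-entropy plus a weighted sum of the MoE auxiliary terms. The sketch below assumes the auxiliary losses have been collected from the MoE blocks (e.g. the `aux_loss` attributes in the sketch above), and the `0.01` weight is illustrative rather than the released setting:

```python
import torch.nn.functional as F

def language_modeling_loss(logits, input_ids, aux_losses, aux_weight=0.01):
    """Next-token cross-entropy plus weighted MoE load-balancing losses (illustrative)."""
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    ce = F.cross_entropy(shift_logits, shift_labels)
    return ce + aux_weight * sum(aux_losses)
```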

## Known Limitations

Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

## Files in This Repository

- `hssm_v2_250m_fineweb_edu_final.pt` — pretrained HSSM v2 checkpoint
- `HSSM_v2_architecture.png` — architecture image shown in this model card
- `hssm_v2_gpu_pretrain.py` — training/model definition reference
- `hssm_pretrained_chat.py` — local loading and generation helper

## Example Loading (PyTorch)

```python
from hssm_pretrained_chat import load_pretrained, generate_reply

tokenizer, model = load_pretrained(
    "hssm_v2_250m_fineweb_edu_final.pt",
    "gpt2",
    device="cpu",
)

reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=40,
    temperature=0.0,
    top_k=4,
    top_p=0.65,
    repetition_penalty=1.9,
    no_repeat_ngram_size=6,
)

print(reply)
```

## Repository / Author

- **Model name**: `HSSM v2 250M`
- **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI)
- **Checkpoint type**: pretrained public release

## Citation

If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.