| --- |
| language: |
| - en |
| tags: |
| - pytorch |
| - hssm-v2 |
| - hierarchical-state-space-model |
| - mixture-of-experts |
| - autoregressive |
| - text-generation |
| - fineweb-edu |
| - 250m-parameters |
| - 0.25B |
| datasets: |
| - HuggingFaceFW/fineweb-edu |
| pipeline_tag: text-generation |
| library_name: pytorch |
| --- |
| |
| # HSSM v2 250M |
|
|
| HSSM v2 is a hierarchical state-space language model with sparse Mixture-of-Experts routing for autoregressive text generation. This release contains the FineWeb-Edu pretrained checkpoint published by [DevHunterAI](https://huggingface.co/DevHunterAI). |
|
|
| ![HSSM v2 architecture](HSSM_v2_architecture.png) |
|
|
| ## Model Summary |
|
|
| HSSM v2 combines local depthwise temporal mixing, chunk-level hierarchical state propagation, residual gating, and sparse Mixture-of-Experts feed-forward blocks in a single causal language model. |
|
|
| This release corresponds to the pretrained checkpoint: |
|
|
| - `hssm_v2_250m_fineweb_edu_final.pt` |
|
|
| Model scale: |
| - **Total parameters**: `250,040,256` (`~250M`) |
| - **Active parameters per token path**: `26,534,400` (`~26.5M`) |
| - **Architecture**: sparse MoE language model with top-1 expert routing in MoE layers |
|
|
| This checkpoint was pretrained on: |
|
|
| - `HuggingFaceFW/fineweb-edu` |
| - `1.25B` tokens |
|
|
| Training note: |
| - pretrained in approximately **2 hours** on an **NVIDIA RTX PRO 6000 Blackwell GPU** |
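
As a sanity check on the figures above, the quoted token budget and wall-clock time imply roughly 174k tokens per second and about 5 training tokens per parameter. The snippet below only restates that arithmetic; both inputs are approximate, so the results are approximate too.

```python
# Back-of-the-envelope figures from the numbers quoted in this card (all approximate).
total_params = 250_040_256
train_tokens = 1.25e9
train_seconds = 2 * 60 * 60  # "approximately 2 hours"

tokens_per_second = train_tokens / train_seconds
tokens_per_param = train_tokens / total_params

print(f"throughput ~ {tokens_per_second:,.0f} tokens/s")
print(f"data ratio ~ {tokens_per_param:.2f} tokens per parameter")
```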
|
|
| ## Intended Use |
|
|
| This model is intended for: |
|
|
| - research on hierarchical state-space language models |
| - experimentation with sparse expert routing for autoregressive text generation |
| - continued fine-tuning on dialogue, instruction, or domain datasets |
| - architecture analysis and comparison against transformer and recurrent baselines |
|
|
| This checkpoint is **pretrained only**; it has not been instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage. |
|
|
| ## Training Dataset |
|
|
| The pretraining data source for this release is: |
|
|
| - **Dataset**: [`HuggingFaceFW/fineweb-edu`](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
| - **Usage mode**: streaming pretraining pipeline |
| - **Token budget**: `1.25B` tokens |
| - **Domain**: educational and general web text |
|
|
| FineWeb-Edu is a subset of the FineWeb web corpus, filtered by an educational-quality classifier, which makes it suitable for language model pretraining and broad text continuation tasks. |
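
A streaming pipeline with a fixed token budget can be sketched as follows. This is an illustration, not the pipeline used for this release: `take_token_budget` and `fineweb_edu_texts` are hypothetical helpers, and the choice to truncate the last document so the budget is met exactly is an assumption. Only `load_dataset(..., streaming=True)` and the `text` field come from the public `datasets` API and dataset schema.

```python
from typing import Callable, Iterable, Iterator, List


def take_token_budget(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    budget: int,
) -> Iterator[List[int]]:
    """Yield tokenized documents until `budget` tokens have been emitted."""
    used = 0
    for text in texts:
        ids = encode(text)
        if used + len(ids) > budget:
            ids = ids[: budget - used]  # truncate the final document to fit the budget
        if ids:
            yield ids
        used += len(ids)
        if used >= budget:
            return


def fineweb_edu_texts() -> Iterator[str]:
    """Lazily iterate over FineWeb-Edu documents without downloading the full dataset."""
    # Imported here so the budget helper above has no third-party dependency.
    from datasets import load_dataset

    stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    for record in stream:
        yield record["text"]
```

With a GPT-2 tokenizer object `tok`, a call such as `take_token_budget(fineweb_edu_texts(), tok.encode, 1_250_000_000)` would stop after the token budget listed above.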
|
|
| ## Architecture Overview |
|
|
| HSSM v2 is organized as a stacked hierarchical autoregressive architecture with token embeddings, ten HSSM blocks, final normalization, and a tied language modeling head. |
|
|
| ### Core configuration |
|
|
| - `vocab_size = 50257` |
| - `d_model = 288` |
| - `n_layers = 10` |
| - `d_ff = 512` |
| - `state_rank = 128` |
| - `chunk_size = 8` |
| - `num_experts = 64` |
| - `experts_per_token = 1` |
| - `expert_dim = 2048` |
| - `moe_every = 4` |
| - `tie_embeddings = true` |
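
For experimentation, the values above can be collected in a small configuration object. The class name `HSSMv2Config` is hypothetical; the reference script in this repository may organize its configuration differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMv2Config:
    """Hyperparameters as listed in this model card (class name is illustrative)."""

    vocab_size: int = 50257
    d_model: int = 288
    n_layers: int = 10
    d_ff: int = 512
    state_rank: int = 128
    chunk_size: int = 8
    num_experts: int = 64
    experts_per_token: int = 1
    expert_dim: int = 2048
    moe_every: int = 4
    tie_embeddings: bool = True


cfg = HSSMv2Config()
# With tied embeddings, the token embedding matrix is counted once.
embedding_params = cfg.vocab_size * cfg.d_model
print(embedding_params)  # 14474016
```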
|
|
| ### Block structure |
|
|
| Each HSSM v2 block follows this pattern: |
|
|
| 1. `RMSNorm` |
| 2. `HierarchicalStateMixer` |
| 3. residual add |
| 4. `RMSNorm` |
| 5. `GatedMLP` or `SparseMoE` |
| 6. residual add |
|
|
| Every 4th block uses `SparseMoE` (`moe_every = 4`), so with 10 layers this release contains 2 MoE blocks; the remaining 8 blocks use `GatedMLP`. |
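
The layer schedule can be illustrated with a few lines of Python. The indexing rule used here (1-based block numbers divisible by `moe_every`) is an assumption chosen because it reproduces the stated count of 2 MoE blocks; the reference implementation may express the rule differently.

```python
from typing import List


def ffn_schedule(n_layers: int = 10, moe_every: int = 4) -> List[str]:
    """Feed-forward type per block, assuming 1-based blocks divisible by `moe_every` are MoE."""
    return [
        "SparseMoE" if block % moe_every == 0 else "GatedMLP"
        for block in range(1, n_layers + 1)
    ]


schedule = ffn_schedule()
moe_blocks = [i + 1 for i, kind in enumerate(schedule) if kind == "SparseMoE"]
print(moe_blocks)  # [4, 8]
```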
|
|
| ### HierarchicalStateMixer |
|
|
| The mixer replaces standard attention with a combination of: |
|
|
| - depthwise `Conv1d` local temporal mixing |
| - chunking with `chunk_size=8` |
| - mean pooling over chunk windows |
| - state compression `288 -> 128` |
| - state expansion `128 -> 288` |
| - repeat-interleave back to token length |
| - gated residual fusion followed by output projection |
|
|
| This gives the model a hybrid inductive bias with local token interaction and chunk-level state propagation. |
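
The listed steps can be sketched in PyTorch as below. This is a minimal illustration, not the released `HierarchicalStateMixer`: the class name, `kernel_size=4`, the sigmoid gate, and the one-chunk shift (added here so that a token never receives a summary containing later tokens) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixerSketch(nn.Module):
    """Illustrative mixer; layer names and details are assumptions."""

    def __init__(self, d_model=288, state_rank=128, chunk_size=8, kernel_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.kernel_size = kernel_size
        # Depthwise conv: groups == channels, so each channel is filtered independently.
        self.local = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.compress = nn.Linear(d_model, state_rank)  # 288 -> 128
        self.expand = nn.Linear(state_rank, d_model)    # 128 -> 288
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        b, t, d = x.shape
        # Left-pad so the convolution only sees the current and earlier tokens.
        h = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        local = self.local(h).transpose(1, 2)  # (b, t, d)

        # Pad the sequence to a multiple of chunk_size, then mean-pool each chunk.
        pad = (-t) % self.chunk_size
        chunks = F.pad(local, (0, 0, 0, pad)).reshape(b, -1, self.chunk_size, d)
        pooled = chunks.mean(dim=2)  # (b, n_chunks, d)

        state = self.expand(self.compress(pooled))  # (b, n_chunks, d)
        # Assumption: shift by one chunk so tokens only receive summaries of earlier chunks.
        state = F.pad(state, (0, 0, 1, 0))[:, :-1]
        # Broadcast each chunk state back to its token positions.
        state = state.repeat_interleave(self.chunk_size, dim=1)[:, :t]

        g = torch.sigmoid(self.gate(local))  # gated residual fusion
        return self.out(local + g * state)


mixer = MixerSketch()
x = torch.randn(2, 20, 288)
print(mixer(x).shape)  # torch.Size([2, 20, 288])
```

Without the shift, mean pooling over a chunk would mix information from later tokens in the same chunk into earlier positions, so some mechanism of this kind is needed for causal generation; how the released model handles it is not stated in this card.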
|
|
| ### Sparse MoE |
|
|
| Sparse MoE blocks use: |
|
|
| - `64` experts |
| - top-`1` routing per token |
| - expert hidden size `2048` |
| - auxiliary load-balancing loss |
|
|
| Only one expert path is active per token in each MoE layer, which is why the active parameter count (`~26.5M`, about 10.6% of the total) is much smaller than the total parameter count (`~250M`). |
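
Top-1 routing with a load-balancing term can be sketched as follows. This is an illustration under stated assumptions: the class name is hypothetical, each expert is a plain two-layer MLP (the released experts may be gated), and the auxiliary term follows the Switch Transformer formulation, which may differ from the loss used for this checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top1MoESketch(nn.Module):
    """Illustrative top-1 MoE layer; expert structure and aux loss are assumptions."""

    def __init__(self, d_model=288, expert_dim=2048, num_experts=64):
        super().__init__()
        self.num_experts = num_experts
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)  # (tokens, num_experts)
        top_p, top_idx = probs.max(dim=-1)         # one expert per token
        out = torch.zeros_like(x)
        for e in top_idx.unique().tolist():        # visit only experts that received tokens
            mask = top_idx == e
            out[mask] = top_p[mask].unsqueeze(-1) * self.experts[e](x[mask])
        # Switch-style balancing term: token fraction per expert times mean router probability.
        frac = F.one_hot(top_idx, self.num_experts).float().mean(dim=0)
        aux = self.num_experts * (frac * probs.mean(dim=0)).sum()
        return out, aux


torch.manual_seed(0)
moe = Top1MoESketch(d_model=16, expert_dim=32, num_experts=4)  # small sizes for a quick check
tokens = torch.randn(10, 16)
out, aux = moe(tokens)
print(out.shape, float(aux))
```

Each token passes through exactly one expert, so the per-token compute depends on `expert_dim` and not on `num_experts`, while the total parameter count grows with the number of experts.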
|
|
| ### Output head |
|
|
| After the final `RMSNorm`, the model projects hidden states to vocabulary logits using a tied LM head that shares weights with the token embedding matrix. |
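
Weight tying is commonly implemented in PyTorch by assigning the embedding parameter to the output projection. The snippet below is a minimal sketch of that standard technique; the variable names and `bias=False` are assumptions about this model.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 288

embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight  # both modules now share one parameter tensor

hidden = torch.randn(1, 4, d_model)  # stand-in for the final normalized hidden states
logits = lm_head(hidden)
print(logits.shape)  # torch.Size([1, 4, 50257])
```

Because the two modules share storage, the `50257 x 288` matrix is counted once in the total parameter count.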
|
|
| ## Training Details |
|
| During training, each forward pass proceeds as follows: |
|
| 1. Tokens are embedded into a continuous space. |
| 2. Local token interactions are modeled with depthwise convolution. |
| 3. Chunk summaries are compressed into latent states and expanded back across token positions. |
| 4. Sparse MoE blocks increase capacity with top-1 expert routing. |
| 5. Final logits are produced for next-token prediction. |
|
|
| Additional training facts for this release: |
|
|
| - **Pretraining tokens**: `1.25B` |
| - **Training hardware**: `NVIDIA RTX PRO 6000 Blackwell` |
| - **Approximate pretraining duration**: `2 hours` |
| - **Objective**: autoregressive next-token prediction with auxiliary MoE load-balancing loss |
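
The objective can be sketched as a shifted cross-entropy plus a weighted sum of the per-layer balancing terms. The function name and the `aux_weight=0.01` default are hypothetical; the weight used for this release is not stated in this card, and the reference script may shift targets in the data pipeline instead.

```python
import torch
import torch.nn.functional as F


def lm_loss(logits, input_ids, aux_losses, aux_weight=0.01):
    """Next-token cross-entropy plus weighted MoE balancing terms (illustrative)."""
    # Shift so that the logits at position t are scored against token t + 1.
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    ce = F.cross_entropy(shift_logits, shift_labels)
    return ce + aux_weight * sum(aux_losses)


# Tiny example: uniform logits over 7 classes give a cross-entropy of ln(7).
logits = torch.zeros(2, 5, 7)
input_ids = torch.randint(0, 7, (2, 5))
loss = lm_loss(logits, input_ids, [torch.tensor(1.0), torch.tensor(1.0)])
print(float(loss))
```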
|
|
| ## Known Limitations |
|
|
| Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe: |
|
|
| - repetitive continuations |
| - weak dialogue alignment |
| - unstable chat behavior on open-ended prompts |
| - sensitivity to tokenizer choice (`vocab_size = 50257` matches the GPT-2 tokenizer used in the loading example) |
|
|
| For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data. |
|
|
| ## Files in This Repository |
|
|
| - `hssm_v2_250m_fineweb_edu_final.pt` — pretrained HSSM v2 checkpoint |
| - `HSSM_v2_architecture.png` — architecture image shown in this model card |
| - `hssm_v2_gpu_pretrain.py` — training/model definition reference |
| - `hssm_pretrained_chat.py` — local loading and generation helper |
|
|
| ## Example Loading (PyTorch) |
|
|
| ```python |
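| # Helper module shipped in this repository (see "Files in This Repository"). |
| from hssm_pretrained_chat import load_pretrained, generate_reply |
| |
| # Load the checkpoint on CPU together with its tokenizer. |
| tokenizer, model = load_pretrained( |
|     "hssm_v2_250m_fineweb_edu_final.pt", |
|     "gpt2",  # GPT-2 tokenizer id; its 50,257-entry vocabulary matches vocab_size |
|     device="cpu", |
| ) |
| |
| # Conservative decoding settings intended to limit repetitive continuations. |
| reply = generate_reply( |
|     model=model, |
|     tokenizer=tokenizer, |
|     prompt="What is machine learning?", |
|     max_length=40, |
|     temperature=0.0,  # presumably greedy decoding; if so, top_k and top_p have no effect |
|     top_k=4, |
|     top_p=0.65, |
|     repetition_penalty=1.9, |
|     no_repeat_ngram_size=6, |
| ) |
| |
| print(reply) |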
| ``` |
|
|
| ## Repository / Author |
|
|
| - **Model name**: `HSSM v2 250M` |
| - **Publisher**: [DevHunterAI](https://huggingface.co/DevHunterAI) |
| - **Checkpoint type**: pretrained public release |
|
|
| ## Citation |
|
|
| If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source. |
|
|