---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- research
- sparse-attention
- mixture-of-experts
---

# SHRAM — Sparse Hybrid Token Routed Attention Mixture

A research baseline implementing the SHRAM architecture from "An Examination of Sparse Attention for Long Context Purposes." No pretrained weights — pull the architecture from the Hub and instantiate a freshly initialised model from config. Every parameter is overridable at instantiation time via kwargs.

> **Important:** `trust_remote_code=True` is required. It downloads the architecture
> source files from the Hub and imports them into your Python process. Review the
> source at [smithblack-0/SHRAM](https://huggingface.co/smithblack-0/SHRAM) before use. Those interested can also
> clone the git repository at https://github.com/smithblack-0/advanced-transformers-lib.

## Architecture

SHRAM replaces every standard attention layer with a hybrid layer `H(x) = h_l(x) + h_s(x)`:

- **h_l** — local sliding-window causal attention path.
- **h_s** — MoSRAH sparse routed path. Each token selects K of L available expert heads via token-choice routing. Bottlenecked Ensemble Attention (BEA) is applied per head.

All other components follow the Llama 3 baseline (RMSNorm, SwiGLU FFN, RoPE). A minimal routing sketch is given in the appendix at the end of this card.

## Usage

This repository contains no pretrained weights. The intended workflow is: pull the architecture config from the Hub, instantiate a model with fresh random weights, then train it yourself.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Step 1: pull the architecture config from the Hub.
# AutoConfig.from_pretrained downloads config.json only — no weights are loaded.
# Override any parameter via kwargs.
config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM",
    trust_remote_code=True,
    num_hidden_layers=16,   # example override
    num_mosrah_heads=32,    # example override
)

# Step 2: instantiate with fresh random weights.
# from_config never loads a checkpoint — it always produces a randomly initialised model.
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Step 3: load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("smithblack-0/SHRAM")
```

After training your own checkpoint, save and reload it in the standard way:

```python
model.save_pretrained("./my-checkpoint")
model = AutoModelForCausalLM.from_pretrained("./my-checkpoint", trust_remote_code=True)
```

## Constructor Defaults

The values below are the defaults you get if you call `AutoConfig.from_pretrained` with no overrides. They are not the parameters of a pretrained model — this repository contains no weights. All values are overridable via kwargs.

| Parameter | Default |
|-----------|---------|
| `alpha` | 1.0 |
| `attention_dropout` | 0.0 |
| `beta` | 32.0 |
| `dtype` | None |
| `head_dim` | 16 |
| `hidden_size` | 512 |
| `inference_sequence_length` | 1024 |
| `intermediate_size` | 1366 |
| `local_rope_theta` | 10000.0 |
| `mosrah_rope_theta` | 10000.0 |
| `num_hidden_layers` | 12 |
| `num_mosrah_heads` | 16 |
| `num_selected_heads` | 16 |
| `num_sliding_window_heads` | 16 |
| `output_hidden_states` | False |
| `rms_norm_eps` | 1e-05 |
| `rope_mode` | main_sequence |
| `tie_word_embeddings` | False |
| `training_sequence_length` | 1024 |
| `use_cache` | True |
| `vocab_size` | 50277 |
| `window_size` | 128 |

## License

MIT. Clean-room synthesis informed by the reference paper. Tokenizer is GPT-NeoX (`EleutherAI/gpt-neox-20b`, Apache 2.0).
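
## Appendix: Routing Sketch

For intuition about the hybrid layer described in the Architecture section, the snippet below sketches token-choice top-K routing and the additive combination `H(x) = h_l(x) + h_s(x)` on dummy tensors. It is a minimal illustration only: the function names (`h_local`, `h_sparse`), the placeholder attention bodies, and the tensor sizes are assumptions made for this sketch and do not come from the repository's implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only (loosely echoing the defaults above); not tied to the repo.
batch, seq_len, hidden = 2, 32, 512
num_heads, top_k = 16, 4  # L available expert heads, K selected per token

x = torch.randn(batch, seq_len, hidden)

# Token-choice routing: every token scores all L heads and keeps its top K.
router = torch.nn.Linear(hidden, num_heads)
weights = F.softmax(router(x), dim=-1)          # (batch, seq, num_heads)
top_w, top_idx = weights.topk(top_k, dim=-1)    # (batch, seq, top_k)


def h_local(x):
    # Placeholder for the local sliding-window causal attention path.
    return x


def h_sparse(x, top_w, top_idx):
    # Placeholder for the MoSRAH path: random per-head outputs stand in for the
    # selected expert heads, gathered per token and mixed by the routing weights.
    per_head = torch.randn(x.shape[0], x.shape[1], num_heads, x.shape[2])
    idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[2])
    gathered = per_head.gather(2, idx)          # (batch, seq, top_k, hidden)
    return (top_w.unsqueeze(-1) * gathered).sum(dim=2)


# The hybrid layer sums the two paths: H(x) = h_l(x) + h_s(x).
out = h_local(x) + h_sparse(x, top_w, top_idx)
print(out.shape)  # torch.Size([2, 32, 512])
```

The sketch keeps only the routing mechanics; the actual layers in this repository run sliding-window attention in `h_l` and per-head Bottlenecked Ensemble Attention in the selected experts of `h_s` rather than the placeholders used here.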