---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- research
- sparse-attention
- mixture-of-experts
---
# SHRAM — Sparse Hybrid Token Routed Attention Mixture
A research baseline implementing the SHRAM architecture from "An Examination of Sparse
Attention for Long Context Purposes." This repository ships no pretrained weights: pull the
architecture from the Hub and instantiate a freshly initialised model from the config. Every
parameter is overridable at instantiation time via kwargs.
> **Important:** `trust_remote_code=True` is required. It downloads the architecture
> source files from the Hub and imports them into your Python process. Review the
> source at [smithblack-0/SHRAM](https://huggingface.co/smithblack-0/SHRAM) before use. The full source can also be
> cloned from the git repository at https://github.com/smithblack-0/advanced-transformers-lib.
## Architecture
SHRAM replaces every standard attention layer with a hybrid layer `H(x) = h_l(x) + h_s(x)`:
- **h_l** — local sliding-window causal attention path.
- **h_s** — MoSRAH sparse routed path. Each token selects K of L available expert heads
via token-choice routing. Bottlenecked Ensemble Attention (BEA) is applied per head.
All other components follow the Llama 3 baseline (RMSNorm, SwiGLU FFN, RoPE).
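The sketch below is purely illustrative: a heavily simplified PyTorch rendering of the hybrid layer, with a dense attention module standing in for the sliding-window path and plain linear "expert heads" standing in for BEA. Every name in it (`HybridLayerSketch`, `router`, `expert_heads`) is made up for exposition and does not reflect the repository's actual implementation.
```python
# Illustrative sketch only. Simplifications: no sliding-window/causal masking,
# no RoPE, no BEA internals, and an intentionally naive routing loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridLayerSketch(nn.Module):
    def __init__(self, hidden_size=512, num_mosrah_heads=16, num_selected_heads=4):
        super().__init__()
        # Stand-in for the local sliding-window causal attention path h_l.
        self.local_attn = nn.MultiheadAttention(hidden_size, num_heads=16, batch_first=True)
        # Token-choice routing logits over the pool of L routable heads.
        self.router = nn.Linear(hidden_size, num_mosrah_heads)
        # Placeholder "expert heads" (the real model uses BEA heads).
        self.expert_heads = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_mosrah_heads)
        )
        self.k = num_selected_heads

    def forward(self, x, attn_mask=None):
        # h_l: dense attention stand-in for the local path.
        h_l, _ = self.local_attn(x, x, x, attn_mask=attn_mask, need_weights=False)

        # h_s: each token picks its top-K heads and sums their outputs,
        # weighted by softmaxed routing scores.
        logits = self.router(x)                               # (B, T, L)
        topk_scores, topk_idx = logits.topk(self.k, dim=-1)   # (B, T, K)
        weights = F.softmax(topk_scores, dim=-1)
        h_s = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[..., slot]                         # (B, T)
            w = weights[..., slot].unsqueeze(-1)              # (B, T, 1)
            for head_id, head in enumerate(self.expert_heads):
                mask = (idx == head_id).unsqueeze(-1)         # tokens routed to this head
                h_s = h_s + mask * w * head(x)

        # H(x) = h_l(x) + h_s(x)
        return h_l + h_s
```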
## Usage
This repository contains no pretrained weights. The intended workflow is: pull the
architecture config from the Hub, instantiate a model with fresh random weights, then
train it yourself.
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
# Step 1: pull the architecture config from the Hub.
# AutoConfig.from_pretrained downloads config.json only — no weights are loaded.
# Override any parameter via kwargs.
config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM",
    trust_remote_code=True,
    num_hidden_layers=16,  # example override
    num_mosrah_heads=32,   # example override
)
# Step 2: instantiate with fresh random weights.
# from_config never loads a checkpoint — it always produces a randomly initialised model.
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
# Step 3: load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("smithblack-0/SHRAM")
```
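Training itself is up to you. As a sanity check that the freshly initialised model runs, the snippet below performs a single forward/backward step on a dummy batch. It assumes the remote-code model exposes the standard `transformers` causal-LM interface (`input_ids` plus `labels`, returning an output with a `.loss`); the optimiser choice and learning rate are placeholders, not a recommended recipe.
```python
import torch

# Placeholder optimiser settings; not a training recipe.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Dummy batch: a single short sequence, labels equal to inputs for causal-LM loss.
batch = tokenizer("a tiny smoke-test sequence", return_tensors="pt")
outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])

outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```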
After training your own checkpoint, save and reload it in the standard way:
```python
model.save_pretrained("./my-checkpoint")
model = AutoModelForCausalLM.from_pretrained("./my-checkpoint", trust_remote_code=True)
```
## Constructor Defaults
The values below are the defaults you get if you call `AutoConfig.from_pretrained` with
no overrides. They are not the parameters of a pretrained model — this repository
contains no weights. All values are overridable via kwargs.
| Parameter | Default |
|-----------|---------|
| `alpha` | 1.0 |
| `attention_dropout` | 0.0 |
| `beta` | 32.0 |
| `dtype` | None |
| `head_dim` | 16 |
| `hidden_size` | 512 |
| `inference_sequence_length` | 1024 |
| `intermediate_size` | 1366 |
| `local_rope_theta` | 10000.0 |
| `mosrah_rope_theta` | 10000.0 |
| `num_hidden_layers` | 12 |
| `num_mosrah_heads` | 16 |
| `num_selected_heads` | 16 |
| `num_sliding_window_heads` | 16 |
| `output_hidden_states` | False |
| `rms_norm_eps` | 1e-05 |
| `rope_mode` | main_sequence |
| `tie_word_embeddings` | False |
| `training_sequence_length` | 1024 |
| `use_cache` | True |
| `vocab_size` | 50277 |
| `window_size` | 128 |
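As a concrete example of overriding the defaults above, the snippet below reduces the routing fan-out and widens the local window. The parameter names come from the table; the specific values are arbitrary examples, not recommendations.
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM",
    trust_remote_code=True,
    num_selected_heads=4,  # route each token to 4 of the 16 MoSRAH heads
    window_size=256,       # wider local sliding window
)
print(config.num_selected_heads, config.window_size)
```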
## License
Released under the MIT license. The implementation is a clean-room synthesis informed by the
reference paper. The tokenizer is GPT-NeoX (`EleutherAI/gpt-neox-20b`, Apache 2.0).