---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- research
- sparse-attention
- mixture-of-experts
---
# SHRAM: Sparse Hybrid Token Routed Attention Mixture
A research baseline implementing the SHRAM architecture from "An Examination of Sparse
Attention for Long Context Purposes." No pretrained weights are provided: pull the
architecture from the Hub and instantiate a freshly initialised model from config. Every
parameter is overridable at instantiation time via kwargs.
> **Important:** `trust_remote_code=True` is required. It downloads the architecture
> source files from the Hub and imports them into your Python process. Review the
> source at [smithblack-0/SHRAM](https://huggingface.co/smithblack-0/SHRAM) before use.
> The code is also available as a git repository at
> https://github.com/smithblack-0/advanced-transformers-lib
## Architecture
SHRAM replaces every standard attention layer with a hybrid layer `H(x) = h_l(x) + h_s(x)`:
- **h_l**: local sliding-window causal attention path.
- **h_s**: MoSRAH sparse routed path. Each token selects K of L available expert heads
  via token-choice routing. Bottlenecked Ensemble Attention (BEA) is applied per head.

All other components follow the Llama 3 baseline (RMSNorm, SwiGLU FFN, RoPE).
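
The two paths can be illustrated with a small, dependency-free sketch. It is not the SHRAM implementation. The function names are hypothetical, the window is assumed to cover the current token plus the `window_size - 1` tokens before it, and the softmax over the selected scores is an assumption of this sketch rather than something the model card specifies.

```python
import math


def sliding_window_causal_mask(seq_len, window_size):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: the window covers position i itself and the
    window_size - 1 positions immediately before it.
    """
    return [
        [0 <= i - j < window_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def route_token(scores, k):
    """Token-choice routing for one token: keep the k highest-scoring heads.

    `scores` holds one router score per available head (L entries).
    Returns (head_indices, weights). The weights are a softmax over the
    selected scores only, which is an assumption of this sketch.
    """
    top = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]
    peak = max(scores[h] for h in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[h] - peak) for h in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Local path: each position sees itself and one earlier token.
mask = sliding_window_causal_mask(seq_len=5, window_size=2)

# Sparse path: one token choosing K=2 of L=4 heads.
heads, weights = route_token([0.1, 2.0, -1.0, 0.7], k=2)
```

Here `mask[3]` allows only positions 2 and 3, and `heads` contains the indices of the two largest scores, so each token runs attention in only K of the L routed heads.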
## Usage
This repository contains no pretrained weights. The intended workflow is: pull the
architecture config from the Hub, instantiate a model with fresh random weights, then
train it yourself.
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Step 1: pull the architecture config from the Hub.
# AutoConfig.from_pretrained downloads the configuration only; no weights are loaded.
# Override any parameter via kwargs.
config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM",
    trust_remote_code=True,
    num_hidden_layers=16,  # example override
    num_mosrah_heads=32,   # example override
)

# Step 2: instantiate with fresh random weights.
# from_config never loads a checkpoint; it always produces a randomly initialised model.
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Step 3: load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("smithblack-0/SHRAM")
```
After training your own checkpoint, save and reload it in the standard way:
```python
model.save_pretrained("./my-checkpoint")
model = AutoModelForCausalLM.from_pretrained("./my-checkpoint", trust_remote_code=True)
```
## Constructor Defaults
The values below are the defaults you get if you call `AutoConfig.from_pretrained` with
no overrides. They are not the parameters of a pretrained model; this repository
contains no weights. All values are overridable via kwargs.
| Parameter | Default |
|-----------|---------|
| `alpha` | 1.0 |
| `attention_dropout` | 0.0 |
| `beta` | 32.0 |
| `dtype` | None |
| `head_dim` | 16 |
| `hidden_size` | 512 |
| `inference_sequence_length` | 1024 |
| `intermediate_size` | 1366 |
| `local_rope_theta` | 10000.0 |
| `mosrah_rope_theta` | 10000.0 |
| `num_hidden_layers` | 12 |
| `num_mosrah_heads` | 16 |
| `num_selected_heads` | 16 |
| `num_sliding_window_heads` | 16 |
| `output_hidden_states` | False |
| `rms_norm_eps` | 1e-05 |
| `rope_mode` | main_sequence |
| `tie_word_embeddings` | False |
| `training_sequence_length` | 1024 |
| `use_cache` | True |
| `vocab_size` | 50277 |
| `window_size` | 128 |
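
When overriding several values at once, it can help to check that the combination still makes sense before building a model. The sketch below is a hypothetical helper, not part of SHRAM or `transformers`. It assumes that `num_selected_heads` is K and `num_mosrah_heads` is L in the "K of L" routing described under Architecture, so K must not exceed L. Rejecting unknown parameter names is a design choice of this sketch, not documented library behaviour.

```python
# Hypothetical helper: not part of SHRAM or transformers.
# A subset of the defaults from the table above.
DEFAULTS = {
    "hidden_size": 512,
    "num_hidden_layers": 12,
    "num_mosrah_heads": 16,    # assumed to be L, the number of available heads
    "num_selected_heads": 16,  # assumed to be K, the heads chosen per token
    "window_size": 128,
}


def merge_overrides(defaults, **overrides):
    """Return defaults updated with overrides, after basic sanity checks."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"unknown parameters: {unknown}")
    merged = {**defaults, **overrides}
    # A token cannot select more heads than exist.
    if merged["num_selected_heads"] > merged["num_mosrah_heads"]:
        raise ValueError("num_selected_heads must not exceed num_mosrah_heads")
    return merged


# Raising the number of available heads keeps K <= L.
cfg = merge_overrides(DEFAULTS, num_hidden_layers=16, num_mosrah_heads=32)
```

The checked values can then be passed as kwargs to `AutoConfig.from_pretrained`, in the same way as the overrides in the Usage section.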
## License
MIT. Clean-room synthesis informed by the reference paper. Tokenizer is GPT-NeoX
(`EleutherAI/gpt-neox-20b`, Apache 2.0).