---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- research
- sparse-attention
- mixture-of-experts
---

# SHRAM: Sparse Hybrid Token Routed Attention Mixture

A research baseline implementing the SHRAM architecture from "An Examination of Sparse
Attention for Long Context Purposes." No pretrained weights are included; pull the
architecture from the Hub and instantiate a freshly initialised model from its config.
Every parameter is overridable at instantiation time via kwargs.

> **Important:** `trust_remote_code=True` is required. It downloads the architecture
> source files from the Hub and imports them into your Python process. Review the
> source at [smithblack-0/SHRAM](https://huggingface.co/smithblack-0/SHRAM) before use.
> The source can also be cloned from the git repository at
> https://github.com/smithblack-0/advanced-transformers-lib

## Architecture

SHRAM replaces every standard attention layer with a hybrid layer `H(x) = h_l(x) + h_s(x)`:

- **h_l**: local sliding-window causal attention path.
- **h_s**: MoSRAH sparse routed path. Each token selects K of the L available expert heads
  via token-choice routing (see the sketch below). Bottlenecked Ensemble Attention (BEA)
  is applied per head.

All other components follow the Llama 3 baseline (RMSNorm, SwiGLU FFN, RoPE).

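The hybrid combination and the token-choice routing step can be sketched roughly as
follows. This is a conceptual illustration only; the function and tensor names are
assumptions made for exposition, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def hybrid_layer(x, local_attn, sparse_routed_attn):
    # H(x) = h_l(x) + h_s(x): the local sliding-window path plus the
    # MoSRAH sparse routed path, summed per token.
    return local_attn(x) + sparse_routed_attn(x)

def token_choice_routing(x, router_weight, k):
    # Each token scores all L expert heads and keeps its top K.
    # x: (batch, seq, hidden); router_weight: (hidden, L)
    logits = x @ router_weight                      # (batch, seq, L)
    gate_logits, head_idx = logits.topk(k, dim=-1)  # K selected heads per token
    gates = F.softmax(gate_logits, dim=-1)          # mixture weights over those heads
    return head_idx, gates
```

How the selected heads are bottlenecked and recombined (the BEA step) is defined in the
reference paper and in the source files on the Hub.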

## Usage

This repository contains no pretrained weights. The intended workflow is: pull the
architecture config from the Hub, instantiate a model with fresh random weights, then
train it yourself.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Step 1: pull the architecture config from the Hub.
# AutoConfig.from_pretrained downloads config.json only; no weights are loaded.
# Override any parameter via kwargs.
config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM",
    trust_remote_code=True,
    num_hidden_layers=16,  # example override
    num_mosrah_heads=32,   # example override
)

# Step 2: instantiate with fresh random weights.
# from_config never loads a checkpoint; it always produces a randomly initialised model.
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Step 3: load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("smithblack-0/SHRAM")
```
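
Because the model starts from random weights, its outputs are meaningless until trained.
The following is a minimal sketch of a sanity-check forward pass and a single optimisation
step, assuming the remote code exposes the standard causal-LM interface (accepting
`labels` and returning `loss` and `logits`); the hyperparameters are placeholders, not
recommendations.

```python
import torch

# Sanity-check forward pass on the freshly initialised model.
batch = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.logits.shape)  # (batch, seq_len, vocab_size)

# One illustrative training step; swap in your own data pipeline and schedule.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```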

After training your own checkpoint, save and reload it in the standard way:

```python
model.save_pretrained("./my-checkpoint")
model = AutoModelForCausalLM.from_pretrained("./my-checkpoint", trust_remote_code=True)
```

## Constructor Defaults

The values below are the defaults you get if you call `AutoConfig.from_pretrained` with
no overrides. They are not the parameters of a pretrained model; this repository contains
no weights. All values are overridable via kwargs.

| Parameter | Default |
|-----------|---------|
| `alpha` | 1.0 |
| `attention_dropout` | 0.0 |
| `beta` | 32.0 |
| `dtype` | None |
| `head_dim` | 16 |
| `hidden_size` | 512 |
| `inference_sequence_length` | 1024 |
| `intermediate_size` | 1366 |
| `local_rope_theta` | 10000.0 |
| `mosrah_rope_theta` | 10000.0 |
| `num_hidden_layers` | 12 |
| `num_mosrah_heads` | 16 |
| `num_selected_heads` | 16 |
| `num_sliding_window_heads` | 16 |
| `output_hidden_states` | False |
| `rms_norm_eps` | 1e-05 |
| `rope_mode` | `main_sequence` |
| `tie_word_embeddings` | False |
| `training_sequence_length` | 1024 |
| `use_cache` | True |
| `vocab_size` | 50277 |
| `window_size` | 128 |
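
Any entry in this table can be passed as a kwarg to `AutoConfig.from_pretrained` and read
back from the resulting config object. The override values below are arbitrary examples,
not recommendations:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM",
    trust_remote_code=True,
    window_size=256,        # widen the local sliding window
    num_selected_heads=8,   # each token routes to 8 of the 16 MoSRAH heads
)
print(config.window_size, config.num_selected_heads)  # 256 8
```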

## License

MIT. Clean-room synthesis informed by the reference paper. Tokenizer is GPT-NeoX
(`EleutherAI/gpt-neox-20b`, Apache 2.0).