	---
	language:
	- en
	license: mit
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- pytorch
	- research
	- sparse-attention
	- mixture-of-experts
	---

	# SHRAM — Sparse Hybrid Token Routed Attention Mixture

	A research baseline implementing the SHRAM architecture from "An Examination of Sparse
	Attention for Long Context Purposes." No pretrained weights — pull the architecture from
	the Hub and instantiate a freshly initialised model from config. Every parameter is
	overridable at instantiation time via kwargs.

	> Important: `trust_remote_code=True` is required. It downloads the architecture
	> source files from the Hub and imports (and therefore executes) them in your Python
	> process. Review the source at [smithblack-0/SHRAM-dev](https://huggingface.co/smithblack-0/SHRAM-dev)
	> before use. The code is also available as a git repository at
	> https://github.com/smithblack-0/advanced-transformers-lib.

	## Architecture

	SHRAM replaces every standard attention layer with a hybrid layer `H(x) = h_l(x) + h_s(x)`:

	- `h_l`: local sliding-window causal attention path.
	- `h_s`: MoSRAH sparse routed path. Each token selects K of L available expert heads
	  via token-choice routing. Bottlenecked Ensemble Attention (BEA) is applied per head.
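
As a rough illustration of the two paths, the sketch below builds a sliding-window causal mask for the local path and performs token-choice top-K head selection for the routed path. It is plain Python, not the repository's implementation. The function names, the example router scores, and the window convention (each token sees itself plus the previous `window_size - 1` tokens) are assumptions made for illustration.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window_size: int) -> List[List[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    Assumed convention: a query sees itself and the previous
    window_size - 1 positions, and never a future position.
    """
    return [
        [0 <= q - k < window_size for k in range(seq_len)]
        for q in range(seq_len)
    ]


def select_top_k_heads(router_scores: List[List[float]], k: int) -> List[List[int]]:
    """Token-choice routing: each token keeps its k highest-scoring heads.

    router_scores[t][h] is a hypothetical score of head h for token t.
    Returns, per token, the chosen head indices in ascending order.
    """
    selected = []
    for scores in router_scores:
        ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
        selected.append(sorted(ranked[:k]))
    return selected


mask = sliding_window_causal_mask(seq_len=5, window_size=3)
# The last token (position 4) sees positions 2, 3 and 4 only.
visible_to_last = [k for k, ok in enumerate(mask[4]) if ok]

# Two tokens, L = 4 heads, K = 2 heads selected per token (made-up scores).
scores = [[0.1, 0.7, 0.2, 0.9], [0.8, 0.3, 0.6, 0.1]]
chosen = select_top_k_heads(scores, k=2)
```

In a real layer the mask would gate the attention scores of `h_l`, and the selected head indices would decide which expert heads process each token in `h_s`. How the actual model computes router scores and combines head outputs is not shown here.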

	All other components follow the Llama 3 baseline (RMSNorm, SwiGLU FFN, RoPE).
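
For readers unfamiliar with the baseline components, here is a minimal plain-Python sketch of RMSNorm, one of the components named above. It follows the standard published formulation and is not taken from this repository's source. The function name and example vector are illustrative, and `eps` is assumed to play the role of the `rms_norm_eps` config value.

```python
import math
from typing import List, Optional


def rms_norm(
    x: List[float],
    weight: Optional[List[float]] = None,
    eps: float = 1e-5,
) -> List[float]:
    """Scale x by the reciprocal of its root-mean-square, then by a gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    # With no learned gain supplied, use a gain of 1.0 for every feature.
    gain = weight if weight is not None else [1.0] * len(x)
    return [v * inv_rms * g for v, g in zip(x, gain)]


normed = rms_norm([3.0, -4.0, 0.0, 0.0])
# After normalisation the mean of the squared values is close to 1.
mean_square_after = sum(v * v for v in normed) / len(normed)
```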

	## Usage

	This repository contains no pretrained weights. The intended workflow is: pull the
	architecture config from the Hub, instantiate a model with fresh random weights, then
	train it yourself.

	```python
	from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

	# Step 1: pull the architecture config from the Hub.
	# AutoConfig.from_pretrained downloads config.json only — no weights are loaded.
	# Override any parameter via kwargs.
	config = AutoConfig.from_pretrained(
	    "smithblack-0/SHRAM-dev",
	    trust_remote_code=True,
	    num_decoder_layers=16,  # example override
	    num_mosrah_heads=32,  # example override
	)

	# Step 2: instantiate with fresh random weights.
	# from_config never loads a checkpoint — it always produces a randomly initialised model.
	model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

	# Step 3: load the tokenizer.
	tokenizer = AutoTokenizer.from_pretrained("smithblack-0/SHRAM-dev")
	```

	After training your own checkpoint, save and reload it in the standard way:

	```python
	model.save_pretrained("./my-checkpoint")
	model = AutoModelForCausalLM.from_pretrained("./my-checkpoint", trust_remote_code=True)
	```

	## Constructor Defaults

	The values below are the defaults you get if you call `AutoConfig.from_pretrained` with
	no overrides. They are not the parameters of a pretrained model — this repository
	contains no weights. All values are overridable via kwargs.

	| Parameter | Default |
	|-----------|---------|
	| `alpha` | 1.0 |
	| `attention_dropout` | 0.0 |
	| `beta` | 32.0 |
	| `dtype` | None |
	| `embedding_width` | 512 |
	| `head_dim` | 16 |
	| `inference_sequence_length` | 1024 |
	| `local_rope_theta` | 10000.0 |
	| `mlp_width` | 1366 |
	| `mosrah_rope_theta` | 10000.0 |
	| `num_decoder_layers` | 12 |
	| `num_mosrah_heads` | 16 |
	| `num_selected_heads` | 16 |
	| `num_sliding_window_heads` | 16 |
	| `output_hidden_states` | False |
	| `rms_norm_eps` | 1e-05 |
	| `rope_mode` | main_sequence |
	| `tie_word_embeddings` | False |
	| `training_sequence_length` | 1024 |
	| `use_cache` | True |
	| `use_residual_gate` | True |
	| `vocab_size` | 50277 |
	| `window_size` | 128 |
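
Because overrides are passed as free-form kwargs, a misspelled name can be easy to miss. One possible guard, sketched below, checks a dict of overrides against the parameter names in the table before handing them to `AutoConfig.from_pretrained`. `SHRAM_DEFAULTS` and `check_overrides` are hypothetical helpers, not part of this repository. The check that `num_selected_heads` does not exceed `num_mosrah_heads` assumes these correspond to K and L in the Architecture section.

```python
from typing import Any, Dict

# Default values copied from the Constructor Defaults table.
SHRAM_DEFAULTS: Dict[str, Any] = {
    "alpha": 1.0,
    "attention_dropout": 0.0,
    "beta": 32.0,
    "dtype": None,
    "embedding_width": 512,
    "head_dim": 16,
    "inference_sequence_length": 1024,
    "local_rope_theta": 10000.0,
    "mlp_width": 1366,
    "mosrah_rope_theta": 10000.0,
    "num_decoder_layers": 12,
    "num_mosrah_heads": 16,
    "num_selected_heads": 16,
    "num_sliding_window_heads": 16,
    "output_hidden_states": False,
    "rms_norm_eps": 1e-05,
    "rope_mode": "main_sequence",
    "tie_word_embeddings": False,
    "training_sequence_length": 1024,
    "use_cache": True,
    "use_residual_gate": True,
    "vocab_size": 50277,
    "window_size": 128,
}


def check_overrides(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge overrides into the defaults, rejecting unknown names.

    Also enforces the assumed routing constraint K <= L.
    """
    unknown = sorted(set(overrides) - set(SHRAM_DEFAULTS))
    if unknown:
        raise KeyError(f"unknown config parameter(s): {unknown}")
    merged = {**SHRAM_DEFAULTS, **overrides}
    if merged["num_selected_heads"] > merged["num_mosrah_heads"]:
        raise ValueError("num_selected_heads must not exceed num_mosrah_heads")
    return merged


# The same overrides used in the Usage example.
overrides = {"num_decoder_layers": 16, "num_mosrah_heads": 32}
merged = check_overrides(overrides)
```

Once validated, the same dict can be unpacked into the call from the Usage section, for example `AutoConfig.from_pretrained("smithblack-0/SHRAM-dev", trust_remote_code=True, **overrides)`.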

	## License

	MIT. Clean-room synthesis informed by the reference paper. Tokenizer is GPT-NeoX
	(`EleutherAI/gpt-neox-20b`, Apache 2.0).