---
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- pytorch
- research
- sparse-attention
- mixture-of-experts
---
# SHRAM: Sparse Hybrid Token Routed Attention Mixture

A research baseline implementing the SHRAM architecture from "An Examination of Sparse
Attention for Long Context Purposes." No pretrained weights are provided: pull the
architecture from the Hub and instantiate a freshly initialised model from config. Every
parameter is overridable at instantiation time via kwargs.

> **Important:** `trust_remote_code=True` is required. It downloads the architecture
> source files from the Hub and imports them into your Python process. Review the
> source at [smithblack-0/SHRAM-dev](https://huggingface.co/smithblack-0/SHRAM-dev) before use. The git repository can also be
> cloned from <https://github.com/smithblack-0/advanced-transformers-lib>.

## Architecture

SHRAM replaces every standard attention layer with a hybrid layer `H(x) = h_l(x) + h_s(x)`:

- **h_l**: local sliding-window causal attention path.
- **h_s**: MoSRAH sparse routed path. Each token selects K of L available expert heads
  via token-choice routing. Bottlenecked Ensemble Attention (BEA) is applied per head.

All other components follow the Llama 3 baseline (RMSNorm, SwiGLU FFN, RoPE).
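
The two paths can be illustrated with a minimal pure-Python sketch. This is not the repository's implementation: the function names are hypothetical, the window is assumed to include the current token, and the router scores are made-up values. It only shows which keys the local path may see and how a single token picks K of L heads.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window_size: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: the window counts the current token, so each query sees
    itself plus the (window_size - 1) positions immediately before it.
    """
    return [
        [0 <= i - j < window_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def select_top_k_heads(router_scores: List[float], k: int) -> List[int]:
    """Token-choice routing for one token: keep the k highest-scoring heads."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda head: router_scores[head],
        reverse=True,
    )
    return sorted(ranked[:k])


mask = sliding_window_causal_mask(seq_len=5, window_size=2)
chosen = select_top_k_heads([0.1, 0.7, 0.3, 0.9], k=2)  # L = 4 heads, K = 2
```

With `seq_len=5` and `window_size=2`, position 3 may attend only to positions 2 and 3, and the example token is routed to heads 1 and 3.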

## Usage

This repository contains no pretrained weights. The intended workflow is: pull the
architecture config from the Hub, instantiate a model with fresh random weights, then
train it yourself.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Step 1: pull the architecture config from the Hub.
# AutoConfig.from_pretrained downloads config.json only; no weights are loaded.
# Override any parameter via kwargs.
config = AutoConfig.from_pretrained(
    "smithblack-0/SHRAM-dev",
    trust_remote_code=True,
    num_decoder_layers=16,  # example override
    num_mosrah_heads=32,    # example override
)

# Step 2: instantiate with fresh random weights.
# from_config never loads a checkpoint; it always produces a randomly initialised model.
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Step 3: load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("smithblack-0/SHRAM-dev")
```

After training your own checkpoint, save and reload it in the standard way:

```python
model.save_pretrained("./my-checkpoint")
model = AutoModelForCausalLM.from_pretrained("./my-checkpoint", trust_remote_code=True)
```
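
Before training, it can help to check how large a given configuration is. The following is a minimal sketch, assuming `model` is the PyTorch module instantiated above; `count_parameters` is a hypothetical helper and not part of this repository.

```python
def count_parameters(model, trainable_only: bool = True) -> int:
    """Sum the element counts of a module's parameters.

    Works with any object exposing a PyTorch-style `parameters()` iterator
    whose items provide `numel()` and `requires_grad`.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )


# Usage with the model from the steps above (not executed here):
# print(f"{count_parameters(model):,} trainable parameters")
```

Because the model is built from config, changing overrides such as `num_decoder_layers` and re-running the count is a quick way to compare sizes.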

## Constructor Defaults

The values below are the defaults you get if you call `AutoConfig.from_pretrained` with
no overrides. They are not the parameters of a pretrained model; this repository
contains no weights. All values are overridable via kwargs.

| Parameter | Default |
|-----------|---------|
| `alpha` | 1.0 |
| `attention_dropout` | 0.0 |
| `beta` | 32.0 |
| `dtype` | None |
| `embedding_width` | 512 |
| `head_dim` | 16 |
| `inference_sequence_length` | 1024 |
| `local_rope_theta` | 10000.0 |
| `mlp_width` | 1366 |
| `mosrah_rope_theta` | 10000.0 |
| `num_decoder_layers` | 12 |
| `num_mosrah_heads` | 16 |
| `num_selected_heads` | 16 |
| `num_sliding_window_heads` | 16 |
| `output_hidden_states` | False |
| `rms_norm_eps` | 1e-05 |
| `rope_mode` | main_sequence |
| `tie_word_embeddings` | False |
| `training_sequence_length` | 1024 |
| `use_cache` | True |
| `use_residual_gate` | True |
| `vocab_size` | 50277 |
| `window_size` | 128 |
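
A few relationships among these defaults can be checked with simple arithmetic. The sketch below rests on stated assumptions rather than on the repository's code: that `mlp_width` follows the common SwiGLU sizing of 8/3 times the embedding width rounded up, that `num_selected_heads` and `num_mosrah_heads` correspond to K and L from the Architecture section, and that the untied output projection has the same shape as the embedding table with no bias.

```python
import math

# Values copied from the defaults table above.
defaults = {
    "embedding_width": 512,
    "mlp_width": 1366,
    "num_mosrah_heads": 16,
    "num_selected_heads": 16,
    "vocab_size": 50277,
    "tie_word_embeddings": False,
}

# Assumption: SwiGLU hidden width is ceil(8/3 * embedding_width).
swiglu_width = math.ceil(8 * defaults["embedding_width"] / 3)

# Assumption: num_selected_heads is K and num_mosrah_heads is L.
# When K == L, every token selects every MoSRAH head.
routing_is_dense = defaults["num_selected_heads"] == defaults["num_mosrah_heads"]

# Token embedding table size, doubled when the output projection is untied
# (assuming the output projection has the same shape and no bias).
embedding_params = defaults["vocab_size"] * defaults["embedding_width"]
vocab_params = embedding_params * (1 if defaults["tie_word_embeddings"] else 2)
```

Under these assumptions the computed SwiGLU width matches the listed `mlp_width` of 1366, and the default routing selects all 16 heads. Overriding `num_mosrah_heads=32`, as in the Usage example, while leaving `num_selected_heads` at 16 would give each token 16 of 32 heads.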

## License

MIT. Clean-room synthesis informed by the reference paper. Tokenizer is GPT-NeoX
(`EleutherAI/gpt-neox-20b`, Apache 2.0).