"""Full MoSRAH sparse path for SHRAM.

This module coordinates the routed sparse attention path used inside the SHRAM
hybrid attention layer. The underlying mechanics already live in verified
subunits. The responsibility here is to connect those subunits without
corrupting their bridge contracts.

In particular, this path must preserve three architectural distinctions:

- selected head indices are not routing probabilities
- packed position semantics are chosen before BEA, not inside it
- weighted reduction must consume the router's unbiased renormalized
  probabilities after token-choice order has been restored
"""

import torch
from torch import nn

from .__cache__mosrah_cache import MoSRAHCache
from .configuration import ShramConfig
from .__attention__bottlenecked_ensemble_attention import BottleneckedEnsembleAttention
from .__attention__expert_packing import (
    pack_experts,
    setup_packing,
    unpack_experts,
)
from .__attention__router import MoSRAHRouter
from .__attention__positions_converter import SparseMoSRAHPositions


class MoSRAHLayer(nn.Module):
    """Full routed sparse attention path for SHRAM.

    The MoSRAH path consumes model-space hidden states together with
    authoritative per-token positions and returns the model-space sparse-path
    contribution, the router's load-balance loss, and the router's MaxVio
    routing-imbalance scalar.
    """

    def __init__(self, config: ShramConfig) -> None:
        super().__init__()
        self.num_experts = config.num_mosrah_heads
        self.packed_length = config.mosrah_packed_length
        self.router = MoSRAHRouter(config)
        self.positions = SparseMoSRAHPositions(config)
        self.bea = BottleneckedEnsembleAttention(config)

    def num_mosrah_parameters(self) -> int:
        """Return the total number of parameters (trainable or frozen) in this MoSRAH layer."""
        return sum(p.numel() for p in self.parameters())

    def forward(
        self,
        hidden_states: torch.Tensor,
        position_ids: torch.Tensor,
        active_mask: torch.Tensor,
        cache: MoSRAHCache | None,
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Run the full MoSRAH sparse path.

        Args:
            hidden_states: Model-space hidden states x of shape (B, N, d).
            position_ids: Authoritative per-token positions of shape (B, N).
            active_mask: Current-chunk active mask of shape (B, N), where True
                means the token is semantically live. Forwarded to the router
                so dead tokens are excluded from routing statistics, and to
                pack_experts so dead outer tokens do not become semantically
                active packed entries.
            cache: Optional layer-local MoSRAH cache. Pass None for uncached
                execution and the layer-local cache instance for cached execution.

        Returns:
            sparse_output: Model-space sparse-path output of shape (B, N, d).
            load_balance_loss: Scalar router load-balance loss.
            max_vio: Detached scalar routing-imbalance summary. Passed through
                unchanged from the router; see MoSRAHRouter for semantics.
        """
        # -------------------------------------------------------------------
        # The first transition moves from model-space token-choice input into
        # the packed expert-choice sparse-attention state. Routing decides both
        # which experts each token uses and which unbiased probabilities must be
        # reserved for the final reduction. The active mask is forwarded to the
        # router so dead tokens are excluded from routing statistics, and to
        # pack_experts so outer liveness is faithfully carried into the packed
        # frame. Packing returns both the unpacking mask (slot occupancy, always
        # B*N*K True entries) and the packed active mask (live slots only);
        # active_mask is rebound to the packed form after this point.
        # -------------------------------------------------------------------
        selected_heads, routing_probs, load_balance_loss, max_vio = self.router(
            hidden_states, active_mask
        )
        setup = setup_packing(selected_heads)
        entries = {
            "hidden_states": (hidden_states, 0.0),
            "position_ids": (position_ids, 0),
            "active_mask": (active_mask, False),
        }
        packed, unpacking_mask = pack_experts(
            entries, setup, selected_heads, self.num_experts, self.packed_length
        )
        packed_hidden_states = packed["hidden_states"]
        packed_positions = packed["position_ids"]
        active_mask = packed["active_mask"]

        # -------------------------------------------------------------------
        # Sparse attention runs entirely in the packed expert-choice frame, so
        # the RoPE position semantics must also be chosen in that frame. The
        # position layer therefore decides whether BEA should see packed
        # original-token positions or packed local-slot positions. BEA then
        # consumes that packed position tensor together with the packed hidden
        # states and the layer-local sparse cache, which it owns directly.
        # -------------------------------------------------------------------
        bea_positions = self.positions(
            packed_positions=packed_positions,
            active_mask=active_mask,
            cache=cache,
        )
        packed_outputs = self.bea(
            packed_embeddings=packed_hidden_states,
            position_ids=bea_positions,
            active_mask=active_mask,
            cache=cache,
        )

        # -------------------------------------------------------------------
        # The final transition restores token-choice meaning and only then
        # collapses the K routed copies back into model space. This ordering is
        # required because routing_probs live in token-choice space, whereas BEA
        # returns expert-choice packed outputs. The reduction must therefore
        # happen after unpacking, and it must use the router's unbiased
        # renormalized probabilities rather than any biased selection scores.
        # -------------------------------------------------------------------
        token_choice_outputs = unpack_experts(
            expert_outputs=packed_outputs,
            setup=setup,
            unpacking_mask=unpacking_mask,
            selected_heads=selected_heads,
        )
        final_output = (
            token_choice_outputs * routing_probs.unsqueeze(-1)
        ).sum(dim=2)
        return final_output, load_balance_loss, max_vio
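
The token-choice to expert-choice transition that `forward` performs through `setup_packing`, `pack_experts`, and `unpack_experts` can be illustrated with a small framework-free sketch. It is not the repository's implementation: `pack_sketch` and `unpack_sketch` are hypothetical helpers, they handle a single batch element with plain lists rather than tensors, and the slot-assignment rule (each expert fills its slots in token order) is an assumption made only for illustration.

```python
def pack_sketch(values, selected_heads, num_experts, packed_length, pad):
    """Scatter token-choice values into per-expert slots (hypothetical helper).

    values: list of N per-token items.
    selected_heads: list of N lists, each holding the K expert ids of a token.
    Returns the packed values, a slot-occupancy mask, and the slot map
    needed to undo the packing.
    """
    packed = [[pad] * packed_length for _ in range(num_experts)]
    occupied = [[False] * packed_length for _ in range(num_experts)]
    next_free = [0] * num_experts  # next unused slot for each expert
    slot_map = []
    for token_index, heads in enumerate(selected_heads):
        token_slots = []
        for expert in heads:
            slot = next_free[expert]
            if slot >= packed_length:
                raise ValueError(f"expert {expert} overflowed its packed length")
            packed[expert][slot] = values[token_index]
            occupied[expert][slot] = True
            next_free[expert] += 1
            token_slots.append((expert, slot))
        slot_map.append(token_slots)
    return packed, occupied, slot_map


def unpack_sketch(packed, slot_map):
    """Gather the K routed copies of every token back into token order."""
    return [[packed[e][s] for e, s in token_slots] for token_slots in slot_map]


# N = 3 tokens, K = 2 selected experts per token, 3 experts, 2 slots each.
tokens = ["a", "b", "c"]
heads = [[0, 2], [1, 2], [0, 1]]
packed, occupied, slot_map = pack_sketch(tokens, heads, 3, 2, pad=None)
restored = unpack_sketch(packed, slot_map)
occupied_slots = sum(flag for row in occupied for flag in row)
```

In this sketch the occupancy mask has exactly N * K true entries, which mirrors the B*N*K count that the source comment gives for the unpacking mask, and unpacking returns each token's K copies in the order its experts were selected.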
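
The final reduction in `forward` is the expression `(token_choice_outputs * routing_probs.unsqueeze(-1)).sum(dim=2)`. The sketch below restates it with nested lists so that it runs without PyTorch. `weighted_reduce` is a hypothetical name, and the shapes (B, N, K, d) for the outputs and (B, N, K) for the probabilities are inferred from that expression rather than stated by the source.

```python
def weighted_reduce(token_choice_outputs, routing_probs):
    """Collapse the K routed copies of each token into one d-vector.

    token_choice_outputs: nested list with assumed shape (B, N, K, d).
    routing_probs: nested list with assumed shape (B, N, K).
    Returns a nested list of shape (B, N, d).
    """
    reduced = []
    for outputs_b, probs_b in zip(token_choice_outputs, routing_probs):
        row = []
        for copies, probs in zip(outputs_b, probs_b):
            d = len(copies[0])
            # Weight each routed copy by its probability, then sum over K.
            row.append(
                [sum(p * copy[j] for p, copy in zip(probs, copies)) for j in range(d)]
            )
        reduced.append(row)
    return reduced


# B = 1, N = 1, K = 2, d = 2: one token with two routed copies.
outputs = [[[[1.0, 0.0], [0.0, 2.0]]]]
probs = [[[0.25, 0.75]]]
result = weighted_reduce(outputs, probs)
# 0.25 * [1, 0] + 0.75 * [0, 2] = [0.25, 1.5]
```

Because the weights are indexed per token and per selected expert, they can only be applied once the outputs are back in token-choice order, which is why the source unpacks before it reduces.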