Text Generation
Transformers
PyTorch
English
shram
research
sparse-attention
mixture-of-experts
custom_code
Update architecture and tokenizer
- README.md +0 -4
- config.json +1 -5
- configuration.py +243 -284
- huggingface.py +0 -0
README.md
CHANGED
@@ -82,12 +82,8 @@ contains no weights. All values are overridable via kwargs.
 | `embedding_width` | 512 |
 | `head_dim` | 16 |
 | `inference_sequence_length` | 1024 |
-| `load_balance_loss_type` | causal_overcapacity |
 | `local_rope_theta` | 10000.0 |
-| `max_bid_rounds` | 10 |
-| `maximum_expert_overclaim` | 20 |
 | `mlp_width` | 1366 |
-| `mosrah_overallocation_factor` | 2.0 |
 | `mosrah_rope_theta` | 10000.0 |
 | `num_decoder_layers` | 12 |
 | `num_mosrah_heads` | 16 |
config.json
CHANGED
@@ -9,13 +9,9 @@
   "embedding_width": 512,
   "head_dim": 16,
   "inference_sequence_length": 1024,
-  "load_balance_loss_type": "causal_overcapacity",
   "local_rope_theta": 10000.0,
-  "max_bid_rounds": 10,
-  "maximum_expert_overclaim": 20,
   "mlp_width": 1366,
   "model_type": "shram",
-  "mosrah_overallocation_factor": 2.0,
   "mosrah_rope_theta": 10000.0,
   "num_decoder_layers": 12,
   "num_mosrah_heads": 16,
@@ -25,7 +21,7 @@
   "rope_mode": "main_sequence",
   "tie_word_embeddings": false,
   "training_sequence_length": 1024,
-  "transformers_version": "5.
+  "transformers_version": "5.12.0",
   "use_cache": true,
   "vocab_size": 50277,
   "window_size": 128
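The config.json hunks show four keys being dropped. A minimal sketch of stripping those keys from an older local copy follows. The helper name `strip_removed_keys` is hypothetical and not part of the repository, and the commit does not say whether stale keys cause any problem at load time, so this only illustrates what changed.

```python
import json

# Keys removed from config.json by this commit (taken from the hunk above).
REMOVED_KEYS = (
    "load_balance_loss_type",
    "max_bid_rounds",
    "maximum_expert_overclaim",
    "mosrah_overallocation_factor",
)


def strip_removed_keys(config: dict) -> dict:
    """Return a copy of `config` without the keys this commit removed.

    Hypothetical helper for illustration; the input dict is left untouched.
    """
    return {key: value for key, value in config.items() if key not in REMOVED_KEYS}


# Illustrative fragment of a pre-commit config.json, not the full file.
old_config = json.loads(
    """
    {
      "head_dim": 16,
      "load_balance_loss_type": "causal_overcapacity",
      "max_bid_rounds": 10,
      "maximum_expert_overclaim": 20,
      "mlp_width": 1366,
      "mosrah_overallocation_factor": 2.0,
      "num_mosrah_heads": 16
    }
    """
)

new_config = strip_removed_keys(old_config)
print(json.dumps(new_config, indent=2))
```

Only `head_dim`, `mlp_width` and `num_mosrah_heads` survive in this fragment, which matches the context lines of the hunk.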
configuration.py
CHANGED
@@ -1,284 +1,243 @@
-"""Configuration for the SHRAM transformer.
-
-All architectural parameters that vary across model scales or are meaningful research
-variables are expressed here. Architectural constants (no bias in linear layers,
-SwiGLU activation with SiLU gate) are implemented in the relevant modules and
-documented at the point of use — they are not config parameters because they do not
-vary and changing them produces a different architecture, not a different scale.
-
-RoPE configuration is owned entirely by this config. Each attention path reads its
-parameters directly and constructs its own RotaryEmbedding instance explicitly — no
-HuggingFace rope infrastructure is used. See Unit 5.A design decisions in plan.md.
-"""
-
-import math
-
-from transformers import PretrainedConfig
-
-
-class ShramConfig(PretrainedConfig):
-    """Configuration class for the SHRAM decoder-only transformer.
-
-    SHRAM (Sparse Hybrid Token Routed Attention Mixture) replaces every standard
-    attention layer with a hybrid layer H(x) = h_l(x) + h_s(x), where h_l is a
-    local sliding-window causal attention path and h_s is the MoSRAH sparse routed
-    path. All other components follow the Llama 3 baseline.
-
-    This config is the single source of truth for every architectural dimension of the
-    model. Nothing in the architecture may use a literal number that belongs here.
-
-    Two independent RoPE configurations exist — one per attention path:
-
-    - h_l always uses standard RoPE with ``local_rope_theta``.
-    - BEA always uses YaRN with ``mosrah_rope_theta``, ``training_sequence_length``,
-      ``inference_sequence_length``, ``alpha``, and ``beta``. When
-      ``inference_sequence_length == training_sequence_length`` the YaRN scale factor
-      ``s = 1`` and YaRN reduces exactly to standard RoPE — this is the default state
-      and the correct setting for experiments that do not require context extension.
-
-    Registered with HuggingFace AutoClass via ``auto_map``. Instantiate from the Hub::
-
-        config = AutoConfig.from_pretrained(
-            "your-namespace/advanced-transformers-lib",
-            trust_remote_code=True,
-            num_decoder_layers=12,
-        )
-        model = AutoModelForCausalLM.from_config(config)
-
-    Args:
-        vocab_size: Vocabulary size. Controls the embedding table and output logits
-            dimension. Must match the tokenizer.
-        embedding_width: Model width ``d``. The dimension of the residual stream.
-        mlp_width: FFN hidden dimension.
-        num_decoder_layers: Number of transformer blocks stacked in sequence.
-        num_sliding_window_heads: Number of heads in the local sliding-window path h_l.
-        num_mosrah_heads: Total MoSRAH expert heads available ``L``.
-        num_selected_heads: MoSRAH heads each token selects ``K``.
-        head_dim: Per-head dimension, shared by both attention paths. Must be even
-            (RoPE rotates dimensions in pairs). Paper uses 16.
-        window_size: Sliding window size for h_l. Paper uses 128.
-        rope_mode: RoPE position encoding mode for BEA. ``"main_sequence"`` supplies
-            original sequence positions; ``"semantic_sequence"`` supplies local slot
-            indices. Both are required; experimentally correct mode is undetermined
-            (paper §4). Default ``"main_sequence"``.
-        rms_norm_eps: Epsilon for RMSNorm layers.
-        local_rope_theta: RoPE base frequency ``b`` for the local attention path h_l.
-            Paper uses b=10000.
-        mosrah_rope_theta: RoPE base frequency ``b`` for the BEA path. Paper uses
-            b=10000.
-        training_sequence_length: Context length ``C_train`` the model was or will be
-            trained at. Used to compute the YaRN scale factor for BEA.
-        inference_sequence_length: Context length ``C_target`` the model must support
-            at inference. Optional; defaults to ``training_sequence_length`` so that
-            ``scale=1`` and YaRN reduces to standard RoPE unless explicitly extended.
-        alpha: YaRN ramp lower boundary α (paper §A.2). Frequency dimensions with
-            ``r(d) < alpha`` are fully interpolated by scale s. Paper value: 1.0.
-        beta: YaRN ramp upper boundary β (paper §A.2). Frequency dimensions with
-            ``r(d) > beta`` are left unscaled. Paper value: 32.0.
-        attention_dropout: Dropout probability on attention weights. Default 0.0.
-        use_cache: Whether to return past_key_values for KV caching.
-        output_hidden_states: Whether to return hidden states after each layer.
-        tie_word_embeddings: Whether input embedding and LM head share weights.
[old lines 82-147: content not rendered]
-f"
[old lines 149-231: content not rendered]
-    @property
-    def
-        """
[old lines 235-244: content not rendered]
-        The expected tokens per expert under perfectly balanced routing is
-        ``training_sequence_length * num_selected_heads / num_mosrah_heads``.
-        Multiplying by ``mosrah_overallocation_factor`` provides a buffer above
-        that baseline. The ceiling ensures T is always an integer >= 1.
-
-        All consumers of the packed buffer size must read this property rather
-        than deriving T independently.
-        """
-        return math.ceil(
-            self.training_sequence_length
-            * self.num_selected_heads
-            / self.num_mosrah_heads
-            * self.mosrah_overallocation_factor
-        )
-
-    @property
-    def mosrah_cache_length(self) -> int:
-        """Static per-(batch, head) slot capacity for the MoSRAH inference cache.
-
-        The expected tokens per expert over the full inference context under perfectly
-        balanced routing is ``inference_sequence_length * num_selected_heads /
-        num_mosrah_heads``. Multiplying by ``mosrah_overallocation_factor`` provides
-        a buffer above that baseline. The ceiling ensures the result is always an
-        integer >= 1.
-
-        Distinct from ``mosrah_packed_length``, which sizes the training packing buffer
-        using ``training_sequence_length``. This property uses
-        ``inference_sequence_length`` because the cache must hold the full accumulated
-        token history across the entire inference run.
-
-        All consumers of the MoSRAH cache buffer size must read this property rather
-        than deriving the capacity independently.
-        """
-        return math.ceil(
-            self.inference_sequence_length
-            * self.num_selected_heads
-            / self.num_mosrah_heads
-            * self.mosrah_overallocation_factor
-        )
-
+"""Configuration for the SHRAM transformer.
+
+All architectural parameters that vary across model scales or are meaningful research
+variables are expressed here. Architectural constants (no bias in linear layers,
+SwiGLU activation with SiLU gate) are implemented in the relevant modules and
+documented at the point of use — they are not config parameters because they do not
+vary and changing them produces a different architecture, not a different scale.
+
+RoPE configuration is owned entirely by this config. Each attention path reads its
+parameters directly and constructs its own RotaryEmbedding instance explicitly — no
+HuggingFace rope infrastructure is used. See Unit 5.A design decisions in plan.md.
+"""
+
+import math
+
+from transformers import PretrainedConfig
+
+
+class ShramConfig(PretrainedConfig):
+    """Configuration class for the SHRAM decoder-only transformer.
+
+    SHRAM (Sparse Hybrid Token Routed Attention Mixture) replaces every standard
+    attention layer with a hybrid layer H(x) = h_l(x) + h_s(x), where h_l is a
+    local sliding-window causal attention path and h_s is the MoSRAH sparse routed
+    path. All other components follow the Llama 3 baseline.
+
+    This config is the single source of truth for every architectural dimension of the
+    model. Nothing in the architecture may use a literal number that belongs here.
+
+    Two independent RoPE configurations exist — one per attention path:
+
+    - h_l always uses standard RoPE with ``local_rope_theta``.
+    - BEA always uses YaRN with ``mosrah_rope_theta``, ``training_sequence_length``,
+      ``inference_sequence_length``, ``alpha``, and ``beta``. When
+      ``inference_sequence_length == training_sequence_length`` the YaRN scale factor
+      ``s = 1`` and YaRN reduces exactly to standard RoPE — this is the default state
+      and the correct setting for experiments that do not require context extension.
+
+    Registered with HuggingFace AutoClass via ``auto_map``. Instantiate from the Hub::
+
+        config = AutoConfig.from_pretrained(
+            "your-namespace/advanced-transformers-lib",
+            trust_remote_code=True,
+            num_decoder_layers=12,
+        )
+        model = AutoModelForCausalLM.from_config(config)
+
+    Args:
+        vocab_size: Vocabulary size. Controls the embedding table and output logits
+            dimension. Must match the tokenizer.
+        embedding_width: Model width ``d``. The dimension of the residual stream.
+        mlp_width: FFN hidden dimension.
+        num_decoder_layers: Number of transformer blocks stacked in sequence.
+        num_sliding_window_heads: Number of heads in the local sliding-window path h_l.
+        num_mosrah_heads: Total MoSRAH expert heads available ``L``.
+        num_selected_heads: MoSRAH heads each token selects ``K``.
+        head_dim: Per-head dimension, shared by both attention paths. Must be even
+            (RoPE rotates dimensions in pairs). Paper uses 16.
+        window_size: Sliding window size for h_l. Paper uses 128.
+        rope_mode: RoPE position encoding mode for BEA. ``"main_sequence"`` supplies
+            original sequence positions; ``"semantic_sequence"`` supplies local slot
+            indices. Both are required; experimentally correct mode is undetermined
+            (paper §4). Default ``"main_sequence"``.
+        rms_norm_eps: Epsilon for RMSNorm layers.
+        local_rope_theta: RoPE base frequency ``b`` for the local attention path h_l.
+            Paper uses b=10000.
+        mosrah_rope_theta: RoPE base frequency ``b`` for the BEA path. Paper uses
+            b=10000.
+        training_sequence_length: Context length ``C_train`` the model was or will be
+            trained at. Used to compute the YaRN scale factor for BEA.
+        inference_sequence_length: Context length ``C_target`` the model must support
+            at inference. Optional; defaults to ``training_sequence_length`` so that
+            ``scale=1`` and YaRN reduces to standard RoPE unless explicitly extended.
+        alpha: YaRN ramp lower boundary α (paper §A.2). Frequency dimensions with
+            ``r(d) < alpha`` are fully interpolated by scale s. Paper value: 1.0.
+        beta: YaRN ramp upper boundary β (paper §A.2). Frequency dimensions with
+            ``r(d) > beta`` are left unscaled. Paper value: 32.0.
+        attention_dropout: Dropout probability on attention weights. Default 0.0.
+        use_cache: Whether to return past_key_values for KV caching.
+        output_hidden_states: Whether to return hidden states after each layer.
+        tie_word_embeddings: Whether input embedding and LM head share weights.
+    """
+
+    model_type = "shram"
+
+    auto_map = {
+        "AutoConfig": "configuration.ShramConfig",
+        "AutoModelForCausalLM": "huggingface.ShramForCausalLM",
+    }
+
+    def __init__(
+        self,
+        vocab_size: int = 50277,
+        embedding_width: int = 512,
+        mlp_width: int = 1366,
+        num_decoder_layers: int = 12,
+        num_sliding_window_heads: int = 16,
+        num_mosrah_heads: int = 16,
+        num_selected_heads: int = 16,
+        head_dim: int = 16,
+        window_size: int = 128,
+        rope_mode: str = "main_sequence",
+        rms_norm_eps: float = 1e-5,
+        local_rope_theta: float = 10000.0,
+        mosrah_rope_theta: float = 10000.0,
+        training_sequence_length: int = 1024,
+        inference_sequence_length: int | None = None,
+        alpha: float = 1.0,
+        beta: float = 32.0,
+        attention_dropout: float = 0.0,
+        use_cache: bool = True,
+        output_hidden_states: bool = False,
+        tie_word_embeddings: bool = False,
+        **kwargs
+    ):
+        if head_dim % 2 != 0:
+            raise ValueError(
+                f"head_dim must be even (RoPE rotates dimensions in pairs). "
+                f"Got head_dim={head_dim}."
+            )
+
+        if rope_mode not in {"main_sequence", "semantic_sequence"}:
+            raise ValueError(
+                f"rope_mode must be 'main_sequence' or 'semantic_sequence', "
+                f"got '{rope_mode}'."
+            )
+
+        if training_sequence_length <= 0:
+            raise ValueError(
+                f"training_sequence_length must be positive, "
+                f"got {training_sequence_length}."
+            )
+
+        if inference_sequence_length is None:
+            inference_sequence_length = training_sequence_length
+        if inference_sequence_length <= 0:
+            raise ValueError(
+                f"inference_sequence_length must be positive, "
+                f"got {inference_sequence_length}."
+            )
+
+        if num_mosrah_heads % num_selected_heads != 0:
+            raise ValueError(
+                f"num_mosrah_heads must be exactly divisible by num_selected_heads. "
+                f"Mechanical load balancing partitions the sequence into blocks of "
+                f"W = num_mosrah_heads // num_selected_heads tokens; each block covers "
+                f"every expert exactly once, which requires an integer W. "
+                f"Got num_mosrah_heads={num_mosrah_heads}, num_selected_heads={num_selected_heads}."
+            )
+
+        self.vocab_size = vocab_size
+        self.embedding_width = embedding_width
+        self.mlp_width = mlp_width
+        self.num_decoder_layers = num_decoder_layers
+        self.num_sliding_window_heads = num_sliding_window_heads
+        self.num_mosrah_heads = num_mosrah_heads
+        self.num_selected_heads = num_selected_heads
+        self.head_dim = head_dim
+        self.window_size = window_size
+        self.rope_mode = rope_mode
+        self.rms_norm_eps = rms_norm_eps
+        self.local_rope_theta = local_rope_theta
+        self.mosrah_rope_theta = mosrah_rope_theta
+        self.training_sequence_length = training_sequence_length
+        self.inference_sequence_length = inference_sequence_length
+        self.alpha = alpha
+        self.beta = beta
+        self.attention_dropout = attention_dropout
+        self.use_cache = use_cache
+
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings,
+            output_hidden_states=output_hidden_states,
+            **kwargs
+        )
+
+        # Promote auto_map to an instance attribute so PretrainedConfig.to_dict()
+        # serialises it into config.json.
+        self.auto_map = type(self).auto_map
+
+    @property
+    def scale(self) -> float:
+        """YaRN context extension scale factor s = inference_sequence_length / training_sequence_length.
+
+        When scale == 1.0, YaRN reduces exactly to standard RoPE — all frequency
+        adjustments cancel and A_rope = 1. This is the default state.
+        """
+        return self.inference_sequence_length / self.training_sequence_length
+
+    @property
+    def mosrah_packed_length(self) -> int:
+        """Static packed time dimension T for expert packing.
+
+        Mechanical load balancing guarantees exactly
+        ``training_sequence_length * num_selected_heads / num_mosrah_heads``
+        tokens per expert. The ceiling handles non-integer results when
+        training_sequence_length is not divisible by the block length W.
+
+        All consumers of the packed buffer size must read this property rather
+        than deriving T independently.
+        """
+        return math.ceil(
+            self.training_sequence_length
+            * self.num_selected_heads
+            / self.num_mosrah_heads
+        ) + self.block_length
+
+    @property
+    def mosrah_cache_length(self) -> int:
+        """Static per-(batch, head) slot capacity for the MoSRAH inference cache.
+
+        Mechanical load balancing guarantees exactly
+        ``inference_sequence_length * num_selected_heads / num_mosrah_heads``
+        tokens per expert over the full inference context. The ceiling handles
+        non-integer results when inference_sequence_length is not divisible by
+        the block length W.
+
+        Distinct from ``mosrah_packed_length``, which sizes the training packing
+        buffer using ``training_sequence_length``. This property uses
+        ``inference_sequence_length`` because the cache must hold the full
+        accumulated token history across the entire inference run.
+
+        All consumers of the MoSRAH cache buffer size must read this property
+        rather than deriving the capacity independently.
+        """
+        return math.ceil(
+            self.inference_sequence_length
+            * self.num_selected_heads
+            / self.num_mosrah_heads
+        ) + self.block_length
+
+    @property
+    def block_length(self) -> int:
+        """Routing block length W = num_mosrah_heads // num_selected_heads.
+
+        Within each block of W consecutive tokens every expert is used exactly once,
+        giving perfect load balance by construction. The E % K == 0 constraint
+        enforced at construction guarantees W is an exact integer.
+
+        All consumers of the routing block length must read this property rather
+        than deriving W independently.
+        """
+        return self.num_mosrah_heads // self.num_selected_heads
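The derived properties in the new configuration.py reduce to integer arithmetic, so they can be checked without installing transformers. The following is a minimal sketch, assuming the defaults from the new `__init__` signature. `DerivedSizes` and `derived_sizes` are hypothetical stand-ins that mirror the property bodies; they are not code from the repository.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DerivedSizes:
    """Hypothetical container for the values the new properties compute."""

    block_length: int   # W = num_mosrah_heads // num_selected_heads
    packed_length: int  # T, sized from training_sequence_length
    cache_length: int   # cache slots, sized from inference_sequence_length
    scale: float        # YaRN scale factor s


def derived_sizes(
    num_mosrah_heads: int = 16,
    num_selected_heads: int = 16,
    training_sequence_length: int = 1024,
    inference_sequence_length: Optional[int] = None,
) -> DerivedSizes:
    """Mirror scale, block_length, mosrah_packed_length and mosrah_cache_length."""
    if inference_sequence_length is None:
        # Same fallback as the config: inference length defaults to training length.
        inference_sequence_length = training_sequence_length
    if num_mosrah_heads % num_selected_heads != 0:
        # Same constraint the config enforces so that W is an exact integer.
        raise ValueError(
            "num_mosrah_heads must be exactly divisible by num_selected_heads"
        )

    block_length = num_mosrah_heads // num_selected_heads

    def slots(sequence_length: int) -> int:
        # ceil(sequence_length * K / L) + W, as in the new property bodies.
        per_expert = sequence_length * num_selected_heads / num_mosrah_heads
        return math.ceil(per_expert) + block_length

    return DerivedSizes(
        block_length=block_length,
        packed_length=slots(training_sequence_length),
        cache_length=slots(inference_sequence_length),
        scale=inference_sequence_length / training_sequence_length,
    )


print(derived_sizes())
print(derived_sizes(num_selected_heads=4, inference_sequence_length=4096))
```

With the defaults, W is 1 and both buffer sizes come out at 1025 with a scale of 1.0. Selecting 4 of 16 heads and extending inference to 4096 tokens gives W = 4, a packed length of 260, a cache length of 1028 and a scale of 4.0.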
huggingface.py
CHANGED
The diff for this file is too large to render.