Update architecture and tokenizer
huggingface.py CHANGED +11 -9
@@ -1736,16 +1736,17 @@ gated residual connections around both sublayers:
 
 normed_attn = RMSNorm(x)
 attn_out, router_diagnostics = SHRAMHybridLayer(normed_attn, ...)
-h = x +
+h = x + attn_residual_gate * attn_out
 
 normed_mlp = RMSNorm(h)
 mlp_out = SwiGLUMLP(normed_mlp)
-out = h +
+out = h + mlp_residual_gate * mlp_out
 
-
-
-
-
+Two independent residual gate vectors (shape: embedding_width, init: near-zero) gate
+the attention and MLP sublayer contributions separately. At initialisation the layer is
+a pure identity. The gates are independent trainable parameters so gradients from the
+two sublayers never accumulate into a shared parameter, preventing norm explosion at
+depth.
 
 Pre-norm keeps the residual stream unnormalised. Gradients flow more cleanly
 through unnormalised residuals at depth, and each sublayer receives a stable,
@@ -3746,7 +3747,8 @@ class DecoderLayer(nn.Module):
         self.mlp_norm = nn.RMSNorm(config.embedding_width, eps=config.rms_norm_eps)
         self.attention = SHRAMHybridLayer(config)
         self.mlp = SwiGLUMLP(config)
-        self.
+        self.attn_residual_gate = nn.Parameter(1e-6*torch.randn([config.embedding_width]))
+        self.mlp_residual_gate = nn.Parameter(1e-6*torch.randn([config.embedding_width]))
     def num_mosrah_parameters(self) -> int:
         """Return the total number of trainable MoSRAH parameters in this decoder layer."""
         return self.attention.num_mosrah_parameters()
@@ -3780,8 +3782,8 @@ class DecoderLayer(nn.Module):
             active_mask=active_mask,
             cache=cache,
         )
-        hidden_states = x + self.
-        output = hidden_states + self.
+        hidden_states = x + self.attn_residual_gate*attn_out
+        output = hidden_states + self.mlp_residual_gate*self.mlp(self.mlp_norm(hidden_states))
         return output, router_diagnostics
 
 
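The gated residual update in this change can be tried in isolation with a small PyTorch sketch. This is an illustration under stated assumptions, not the repository's DecoderLayer: the class name GatedResidualBlock, the nn.Linear stand-ins for SHRAMHybridLayer and SwiGLUMLP, the hand-written RMS normalisation, and the gate_scale argument are all hypothetical. Only the two gate parameters and the two residual lines mirror the diff. Because the gates are drawn as 1e-6 times a standard normal rather than set to exactly zero, the sketch shows a block that starts very close to the identity map rather than exactly equal to it.

```python
import torch
from torch import nn


class GatedResidualBlock(nn.Module):
    """Pre-norm block with a per-channel gate on each residual branch.

    Hypothetical stand-in for DecoderLayer: `attention` and `mlp` are plain
    nn.Linear layers, not SHRAMHybridLayer / SwiGLUMLP.
    """

    def __init__(self, width: int, gate_scale: float = 1e-6, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.attn_norm_weight = nn.Parameter(torch.ones(width))
        self.mlp_norm_weight = nn.Parameter(torch.ones(width))
        self.attention = nn.Linear(width, width)  # stand-in sublayer (assumption)
        self.mlp = nn.Linear(width, width)        # stand-in sublayer (assumption)
        # Same initialisation as in the diff: one small random value per channel.
        self.attn_residual_gate = nn.Parameter(gate_scale * torch.randn(width))
        self.mlp_residual_gate = nn.Parameter(gate_scale * torch.randn(width))

    def _rms_norm(self, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        # Scale each vector to unit root-mean-square, then apply a learned weight.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return weight * x / rms

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out = self.attention(self._rms_norm(x, self.attn_norm_weight))
        h = x + self.attn_residual_gate * attn_out      # gated attention residual
        mlp_out = self.mlp(self._rms_norm(h, self.mlp_norm_weight))
        return h + self.mlp_residual_gate * mlp_out     # gated MLP residual


torch.manual_seed(0)
block = GatedResidualBlock(width=16)
x = torch.randn(2, 5, 16)          # (batch, sequence, embedding_width)
out = block(x)

# Largest elementwise change the freshly initialised block makes to its input.
max_deviation = (out - x).abs().max().item()

# Both gates are separate leaf parameters, so each gets its own gradient.
out.sum().backward()
gates_have_grads = (
    block.attn_residual_gate.grad is not None
    and block.mlp_residual_gate.grad is not None
)
```

Each gate has shape (width,) and broadcasts over the batch and sequence dimensions, so every embedding channel gets its own scale per sublayer. Right after construction, max_deviation should be tiny compared with the magnitude of x.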