Text Generation
Transformers
PyTorch
English
shram
research
sparse-attention
mixture-of-experts
custom_code
Instructions to use smithblack-0/SHRAM-dev with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use smithblack-0/SHRAM-dev with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="smithblack-0/SHRAM-dev", trust_remote_code=True)# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("smithblack-0/SHRAM-dev", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use smithblack-0/SHRAM-dev with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "smithblack-0/SHRAM-dev" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "smithblack-0/SHRAM-dev", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/smithblack-0/SHRAM-dev
- SGLang
How to use smithblack-0/SHRAM-dev with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "smithblack-0/SHRAM-dev" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "smithblack-0/SHRAM-dev", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "smithblack-0/SHRAM-dev" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "smithblack-0/SHRAM-dev", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use smithblack-0/SHRAM-dev with Docker Model Runner:
docker model run hf.co/smithblack-0/SHRAM-dev
File size: 5,814 Bytes
1670228 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 | """SHRAM top-level cache — model-wide owner for the full SHRAM decoder stack.
The HuggingFace Cache protocol expects a single top-level Cache object that owns one
CacheLayerMixin per decoder layer. The actual SHRAM caching responsibilities live one level
lower in ShramLayerCache — each of which owns a LocalSlidingWindowLayerCache and a MoSRAHCache.
ShramCache bridges those two levels: it constructs one ShramLayerCache per decoder layer,
presents them through the Cache interface, and transparently forwards model-wide operations
across all of them.
ShramCache does not define a composite update() interface. The two attention paths inside each
SHRAM layer have different update semantics, and neither the layer-level boundary (Unit 6.B)
nor the model-level boundary here can meaningfully unify them. Callers must reach down to the
relevant sub-cache directly. ShramCache's role is ownership, construction, and model-wide
coordination of the layer caches — not routing attention inputs.
Sequence length is reported by delegating to the local sliding-window sub-cache of the
specified layer, which tracks the cumulative count of token positions processed. This is
what HuggingFace generation reads through get_seq_length().
"""
import torch
from transformers.cache_utils import Cache
from .configuration import ShramConfig
from .__cache__shram_layer_cache import ShramLayerCache
class ShramCache(Cache):
"""Top-level cache for the full SHRAM model.
Owns one ShramLayerCache per decoder layer. Satisfies the HuggingFace top-level Cache
role and transparently forwards reset, reorder, and sequence-length queries across all
owned layer caches.
No composite update() interface is provided. The two attention paths inside each SHRAM
layer have materially different update semantics; callers must update sub-caches directly
via cache.layers[layer_idx].sliding_window_cache or cache.layers[layer_idx].mosrah_cache.
Args:
config: ShramConfig instance. All layer counts, buffer sizes, and sub-cache
dimensions are derived from config so that a single source of truth governs
every buffer size across the full cache stack.
batch_size: Number of sequences in the batch.
device: Device on which to allocate cache tensors.
"""
is_compileable = True
def __init__(
self,
config: ShramConfig,
batch_size: int,
device: torch.device,
) -> None:
layers = [
ShramLayerCache(
config=config,
batch_size=batch_size,
device=device,
)
for _ in range(config.num_decoder_layers)
]
super().__init__(layers=layers)
# ---------------------------------------------------------------------------
# Cache — composite-meaningful methods
# ---------------------------------------------------------------------------
#
# reset(): Inherited. Iterates all layer caches and calls reset() on each.
#
# reorder_cache(beam_idx): Inherited. Iterates all layer caches and reorders each.
#
# is_initialized: Inherited property. True iff all layer caches are initialized.
# Since ShramLayerCache.is_initialized is True from construction, this is True
# immediately after ShramCache.__init__ returns.
def get_seq_length(self, layer_idx: int = 0) -> int: # type: ignore[override]
"""Return the cumulative sequence length for the specified layer.
Delegates to the layer cache at layer_idx, which in turn delegates to the
local sliding-window sub-cache. That sub-cache is authoritative for sequence
progress: it sees every token presented to the layer and accumulates a truthful
total count. Defaults to layer 0, which is sufficient for HuggingFace generation.
"""
return self.layers[layer_idx].get_seq_length()
# ---------------------------------------------------------------------------
# Cache — unsupported methods
# ---------------------------------------------------------------------------
def update( # type: ignore[override]
self,
key_states: torch.Tensor,
value_states: torch.Tensor,
layer_idx: int,
cache_kwargs: dict | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
"""Not supported — ShramCache has no composite update interface.
The two attention paths inside each SHRAM layer have different update semantics.
Callers must update sub-caches directly:
cache.layers[layer_idx].sliding_window_cache.update(key_states, value_states)
cache.layers[layer_idx].mosrah_cache.update(key_states, value_states, active_mask)
"""
raise NotImplementedError(
"ShramCache has no composite update interface. "
"Update sliding_window_cache or mosrah_cache on the relevant layer directly."
)
def crop(self, max_length: int) -> None:
"""Not supported — ShramCache layers do not implement crop()."""
raise NotImplementedError("ShramCache does not support crop().")
@property
def max_batch_size(self) -> int:
"""Not supported — ShramCache does not track a uniform batch size across layers."""
raise NotImplementedError("ShramCache does not expose max_batch_size.")
@property
def max_cache_len(self) -> int:
"""Return the maximum sequence length the cache can serve.
Delegates to layers[0].get_max_cache_shape(), which returns
config.inference_sequence_length. HuggingFace's static-cache machinery reads
this value to size generation loops and verify compileable cache contracts.
"""
return self.layers[0].get_max_cache_shape()
|