llm.kittens TinyStories 124M BF16 — SD-PReLU

This is a 124M-parameter GPT-2-style causal language model trained from scratch on TinyStories with the llm.kittens C++/CUDA trainer, a fork of Karpathy's llm.c that adds SM120-specific and multi-stack kernel optimisations.

Unlike the GELU baseline (tinystories-5090), this checkpoint replaces the MLP's GELU nonlinearity with a learnable SD-PReLU activation (the llm.kittens -af sd-prelu activation). Because SD-PReLU is not part of stock Transformers or llama.cpp, the repo ships a small custom module (modeling_gpt_sdprelu.py) and must be loaded with trust_remote_code=True. See Activation: SD-PReLU below.

The model is published as a Hugging Face Transformers checkpoint with BF16 safetensors weights plus custom modeling code. It was trained on a single RTX 5090 in roughly 14 hours (2595.7 ms average iteration over 20,000 steps).

Result

Model weights: model.safetensors
Training step: 20000 / 20000
Final train loss: 0.781594
Final validation loss: 0.870315
Final throughput: 198871 tokens/s
Final step time: 2642.72 ms
Final reported BF16 MFU: 38.0%
Average iteration time: 2595.669013 ms
Safetensors size: 248,896,984 bytes
Parameter count: 124,475,904 base + 24 learnable SD-PReLU scalars (theta_a/theta_b, 2 per layer × 12 layers)
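
The base figure can be reproduced from the d12 dimensions (12 layers, hidden size 768, context 1024). The sketch below is a back-of-the-envelope check, not code from the trainer. It assumes the reported count uses a vocabulary padded to 50,304 rows, as upstream llm.c does; that padding is an assumption, though it reproduces 124,475,904 exactly.

```python
def gpt2_param_count(n_layer=12, d_model=768, n_ctx=1024, vocab=50257):
    """Count GPT-2 parameters with a tied LM head (the head reuses wte)."""
    embeddings = vocab * d_model + n_ctx * d_model            # wte + wpe
    layernorm = 2 * d_model                                   # weight + bias
    attn = (d_model * 3 * d_model + 3 * d_model) \
        + (d_model * d_model + d_model)                       # c_attn + c_proj
    mlp = (d_model * 4 * d_model + 4 * d_model) \
        + (4 * d_model * d_model + d_model)                   # c_fc + c_proj
    block = 2 * layernorm + attn + mlp                        # ln_1, ln_2, attn, mlp
    return embeddings + n_layer * block + layernorm           # plus final ln_f


PADDED_VOCAB = 50304  # assumption: llm.c-style padding of the 50,257-token vocab

base_padded = gpt2_param_count(vocab=PADDED_VOCAB)
base_unpadded = gpt2_param_count(vocab=50257)
sdprelu_scalars = 2 * 12  # theta_a and theta_b, one pair per layer

print(base_padded)                          # 124475904
print(base_unpadded)                        # 124439808
print(2 * (base_unpadded + sdprelu_scalars))  # 248879664 bytes at 2 bytes per BF16 value
```

The last figure sits about 17 KB below the 248,896,984-byte file, which would be consistent with the file storing the unpadded vocabulary plus a safetensors header, though the card does not state this.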

For reference, the GELU baseline reached 0.785740 train / 0.875080 validation loss on the same setup. SD-PReLU lands marginally lower on both (0.781594 / 0.870315, roughly 0.004 and 0.005 lower respectively) at near-identical throughput, with the activation adding only 24 scalar parameters.

The TinyStories paper reports eval losses of 1.33 to 1.58 for the 768-hidden-size 1- and 2-layer attention-head ablations in Figure 24. This run's 0.870315 validation loss is lower, but the comparison is not apples-to-apples: this model is a 12-layer GPT-2-style model using GPT-2 tokenization, a 1024-token context, and a different implementation/training setup.

Activation: SD-PReLU

This is the key difference from the GELU baseline. The MLP's GELU nonlinearity (applied between the c_fc up-projection and the c_proj down-projection) is replaced by SD-PReLU, a self-gated, damped PReLU. Each transformer block learns two scalars (theta_a, theta_b) that are mapped through a bounded reparameterization into a and b:

a = alpha_max * sigmoid(theta_a)      # in (0, alpha_max),  alpha_max = 0.30
b = beta_min  + softplus(theta_b)     # in (beta_min, inf), beta_min  = 0.50
phi(x) = x * (a + (1 - a) * sigmoid(b * x))

a is a learnable leak/floor (PReLU-like): the gate output never drops below a, so negative inputs are not fully zeroed.
b controls the sharpness of the sigmoid gate. As a approaches 0, phi approaches a Swish/SiLU-style x * sigmoid(b * x).
For numerical parity with the CUDA kernel, the activation is computed in float32 and the reparameterization/gate arguments are clamped to [-20, 20]; inputs and outputs stay in the surrounding model dtype (BF16).
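
The formulas above can be sketched for a single scalar input in plain Python. This is an illustrative reading of the definition, not the code in modeling_gpt_sdprelu.py or the CUDA kernel: the function names are hypothetical, and the exact points at which the [-20, 20] clamp is applied are an assumption based on the description above.

```python
import math

ALPHA_MAX = 0.30   # sdprelu_alpha_max
BETA_MIN = 0.50    # sdprelu_beta_min
CLAMP = 20.0       # assumed clamp bound for reparameterization and gate arguments


def _clamp(v, bound=CLAMP):
    return max(-bound, min(bound, v))


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def reparameterize(theta_a, theta_b, alpha_max=ALPHA_MAX, beta_min=BETA_MIN):
    """Map the raw learnable scalars to the bounded leak a and sharpness b."""
    a = alpha_max * _sigmoid(_clamp(theta_a))
    b = beta_min + math.log1p(math.exp(_clamp(theta_b)))  # softplus
    return a, b


def sd_prelu(x, theta_a, theta_b):
    """phi(x) = x * (a + (1 - a) * sigmoid(b * x))."""
    a, b = reparameterize(theta_a, theta_b)
    gate = a + (1.0 - a) * _sigmoid(_clamp(b * x))
    return x * gate


# With theta_a = theta_b = 0: a = 0.15 and b = 0.5 + ln(2).
print(reparameterize(0.0, 0.0))
print(sd_prelu(10.0, 0.0, 0.0))   # close to x: the gate saturates near 1
print(sd_prelu(-10.0, 0.0, 0.0))  # close to a * x: the gate bottoms out near a
```

For large negative inputs the output tends to a * x rather than 0, which is the PReLU-like leak; for large positive inputs it tends to x.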

The activation is configured via two config fields, sdprelu_alpha_max (0.30) and sdprelu_beta_min (0.50). Only 24 extra scalar parameters are introduced over the GELU baseline, so parameter count and on-disk size are essentially unchanged.

Note: the config still carries "activation_function": "gelu_new" for GPT-2 compatibility, but it is inert — the custom MLP overrides the activation with SD-PReLU regardless of that field.

Custom activation inference code

The repo ships modeling_gpt_sdprelu.py, which defines:

GPTSDPReLUConfig (model_type = "gpt-sdprelu"), a GPT2Config with the two extra sdprelu_* fields.
GPTSDPReLUMLP, which reuses GPT-2's c_fc / c_proj / dropout submodules (so weight names stay mlp.c_fc.* / mlp.c_proj.*) and swaps GELU for SD-PReLU, adding theta_a / theta_b per layer.
GPTSDPReLULMHeadModel, a GPT2LMHeadModel that installs the SD-PReLU MLP into every block.

These are wired into config.json via auto_map, so trust_remote_code=True is required to load the model. Everything else (attention, layernorms, embeddings, tied head, tokenizer) is standard GPT-2.
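
For orientation, the relevant part of config.json plausibly looks like the dictionary below. The field values come from this card; the exact set of auto_map keys is an assumption based on the usual Transformers "module.ClassName" convention, so check the shipped config.json for the authoritative version.

```python
import json

# Hypothetical reconstruction of the SD-PReLU-related config.json fields.
config_fragment = {
    "model_type": "gpt-sdprelu",
    "activation_function": "gelu_new",  # inert, kept for GPT-2 compatibility
    "sdprelu_alpha_max": 0.30,
    "sdprelu_beta_min": 0.50,
    "auto_map": {
        # Assumed keys: each maps an Auto class to "<module>.<class>" in the repo.
        "AutoConfig": "modeling_gpt_sdprelu.GPTSDPReLUConfig",
        "AutoModelForCausalLM": "modeling_gpt_sdprelu.GPTSDPReLULMHeadModel",
    },
}

print(json.dumps(config_fragment, indent=2))
```

When trust_remote_code=True is set, the Auto classes resolve these entries by importing the named module from the repository instead of a built-in architecture.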

Architecture

Family: GPT-2-style decoder-only Transformer
Descriptor: d12
Layers: 12
Attention heads: 12
Hidden size: 768
Context length: 1024
Vocabulary size: 50,257
MLP activation: SD-PReLU (learnable, per-layer) — replaces GELU
model_type: gpt-sdprelu (custom code, trust_remote_code=True)
Precision: BF16 weights

Training

The run used the TinyStories GPT-2 dataset files generated by dev/data/tinystories.py in llm.kittens. The only change from the GELU baseline is the -af sd-prelu activation flag.

./train_gpt2cu \
    -i "dev/data/tinystories/TinyStories_train.bin" \
    -j "dev/data/tinystories/TinyStories_val.bin" \
    -o "log124M/5090_S" \
    -v 250 -s 20000 -g 144 \
    -h 0 \
    -b 64 -t 1024 -d 524288 \
    -r 0 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 -q 0.0 -u 700 -n 5000 \
    -y 0 \
    -e "d12" \
    -af sd-prelu \
    -x 20000
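
As a sanity check on the batch flags in the command above, the following sketch derives the gradient-accumulation factor, the total token budget, and the wall-clock time. It assumes a single process (one GPU, as stated earlier) and uses the average iteration time reported under Result; the variable names are illustrative.

```python
micro_batch = 64               # -b
seq_len = 1024                 # -t
total_batch_tokens = 524288    # -d
max_steps = 20000              # -x
avg_iter_ms = 2595.669013      # average iteration time from the Result section

tokens_per_micro_step = micro_batch * seq_len
grad_accum_steps = total_batch_tokens // tokens_per_micro_step
total_tokens = total_batch_tokens * max_steps
train_hours = max_steps * avg_iter_ms / 1000 / 3600
avg_tokens_per_s = total_batch_tokens / (avg_iter_ms / 1000)

print(tokens_per_micro_step)        # 65536
print(grad_accum_steps)             # 8
print(total_tokens)                 # 10485760000
print(round(train_hours, 1))        # 14.4
print(round(avg_tokens_per_s))      # on the order of 2.0e5 tokens/s
```

Each optimizer step therefore accumulates 8 micro-batches, and the full run processes about 10.5 billion tokens, in line with the "roughly 14 hours" figure.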

Key settings:

Hardware target: RTX 5090 / SM120
MLP activation: sd-prelu (-af sd-prelu)
Micro batch: 64
Sequence length: 1024
Total desired batch size: 524,288 tokens
Max steps: 20,000
Optimizer: AdamW as implemented in llm.kittens
Peak learning rate: 6e-4
Scheduler: cosine
Warmup: 700 steps
Final LR fraction: 0.0
Weight decay: 0.1
Recompute: off
ZeRO stage: 1
Checkpoint interval: 5000 steps
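
The learning-rate settings above describe a linear warmup followed by cosine decay. A minimal sketch of that shape is given below; it uses the common textbook form and is not copied from llm.kittens, so the per-step values (for example the exact warmup offset) may differ slightly from the trainer's.

```python
import math


def learning_rate(step, peak_lr=6e-4, warmup_steps=700,
                  max_steps=20000, final_lr_frac=0.0):
    """Linear warmup to peak_lr, then cosine decay to peak_lr * final_lr_frac."""
    if step < warmup_steps:
        # Assumed warmup form: ramp so that the last warmup step hits peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    progress = min(progress, 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at start, 0 at end
    min_lr = peak_lr * final_lr_frac
    return min_lr + coeff * (peak_lr - min_lr)


for step in (0, 699, 700, 10350, 20000):
    print(step, learning_rate(step))
```

With a final LR fraction of 0.0 the rate decays all the way to zero at step 20,000, and it passes through half the peak (3e-4) midway through the decay phase.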

Sample

Prompt/sample emitted at the final checkpoint (step 20000):

Once upon a time, there was a little girl named Lily. She loved to play in the park with her friends. One day, they saw a big, dark cloud in the sky. Lily's friend, Timmy, said, "I think it's going to rain soon."
Suddenly, they heard a loud noise. It was a big, scary dog! Lily felt very scared and her skin started to shake. But then, a brave man came and scared the dog away. "Thank you," said Lily. "You're welcome," said the man.
After the storm passed, Lily and her friends went to play on the swings. They saw a beautiful rainbow in the sky. "Look at the pretty colors!" said Lily. "It's so bright and colorful!" Her friends agreed and they all felt happy.

Files

model.safetensors: BF16 Transformers weights (including the per-layer theta_a / theta_b SD-PReLU scalars).
modeling_gpt_sdprelu.py: custom SD-PReLU model/config code (required, loaded via trust_remote_code=True).
config.json: model configuration, including sdprelu_alpha_max / sdprelu_beta_min and the auto_map wiring.
generation_config.json: default generation settings.
tokenizer.json: GPT-2 tokenizer.
vocab.json and merges.txt: GPT-2 BPE vocabulary files.

Loading

Because the SD-PReLU activation lives in custom code, you must pass trust_remote_code=True to from_pretrained (or pipeline) when loading the model.
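
A minimal loading sketch using the Transformers Auto classes follows. The helper name is hypothetical; dtype="auto" is the spelling used by recent Transformers releases (older releases call the same argument torch_dtype). Passing trust_remote_code=True to the tokenizer as well is a precaution, on the assumption that resolving the tokenizer class may need the custom config.

```python
MODEL_ID = "adamroberts/tinystories-5090-sdprelu"


def load_model_and_tokenizer(model_id=MODEL_ID):
    """Download (or read from cache) the checkpoint and its GPT-2 tokenizer."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,  # required: runs modeling_gpt_sdprelu.py from the repo
        dtype="auto",            # keep the BF16 weights as stored
    )
    model.eval()
    return model, tokenizer
```

Calling load_model_and_tokenizer() executes code from the repository, so review modeling_gpt_sdprelu.py before enabling trust_remote_code in a sensitive environment.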

GGUF / llama.cpp

GGUF export is not available for this checkpoint. The SD-PReLU activation is a custom, learnable nonlinearity with no equivalent in llama.cpp's GPT-2 graph, so the model cannot be quantized to GGUF or run with llama.cpp / LM Studio without implementing the activation there. Use the Transformers loading path above instead. The GELU baseline (tinystories-5090) is available if you need a llama.cpp-compatible variant.

Recommended sampling settings (Transformers generate):

Temperature: 0.8
Top-p: 0.95
Top-k: 50
Repetition penalty: 1.05
Stop/EOS token: <|endoftext|> / token id 50256

This is a completion model, not a chat/instruction model: prompt it with the start of a story and always set a finite max_new_tokens, since it was trained for continuation and may not emit <|endoftext|> during normal generation.
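
The recommended settings can be passed to generate roughly as follows. This is a sketch rather than an official recipe: the helper name and the max_new_tokens default of 200 are arbitrary choices, and reusing the EOS id as pad_token_id is a common GPT-2 convention assumed here, not something the card specifies.

```python
MODEL_ID = "adamroberts/tinystories-5090-sdprelu"

SAMPLING_KWARGS = {
    "do_sample": True,            # needed for temperature/top-p/top-k to take effect
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.05,
    "eos_token_id": 50256,        # <|endoftext|>
    "pad_token_id": 50256,        # assumption: GPT-2 has no pad token, so reuse EOS
}


def complete_story(prompt, max_new_tokens=200, **overrides):
    """Continue a story prompt with the recommended sampling settings."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, dtype="auto"
    )
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    kwargs = {**SAMPLING_KWARGS, "max_new_tokens": max_new_tokens, **overrides}
    with torch.no_grad():
        output_ids = model.generate(**inputs, **kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, complete_story("Once upon a time,") returns the prompt followed by up to 200 sampled tokens; the finite max_new_tokens bound is what guarantees termination if <|endoftext|> is never produced.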

Source implementation: https://github.com/adamdroberts/llm.kittens (SD-PReLU is in an unreleased branch)

TinyStories reference paper: https://arxiv.org/abs/2305.07759
