Stentor2-12M-Preview


🔬 Research Artifact — Not a Production Model. This is an early preview checkpoint released for research, experimentation, and community feedback. It is not suitable for deployment in any user-facing application. See Intended Uses for details.

⚠️ This is a preview release. Stentor2-12M-Preview is an early taste of the Stentor2 family — a substantially redesigned architecture over Stentor v1. Further improvements have already been identified and a refined final release is actively in progress. This checkpoint is not the ceiling of what Stentor2 will be.

🚫 A Stentor2-30M-Preview will NOT be released. This model exists solely to give the community an early look at the Stentor2 direction and design philosophy. It is not a stepping stone to larger preview drops. The next public release from StentorLabs will be the finished, polished Stentor2 model.

🙏 A sincere apology about the brief private period. Shortly after the initial release, the repo was temporarily made private. I want to be completely upfront about what happened: the AutoModelForCausalLM.from_pretrained() loading issue described in detail below was discovered after going public, and the repo needed to come down immediately to prevent more people from downloading a silently broken model. I'm a high school student working on this alone in my very limited free time, and tracking down exactly why the model was producing no output at all — or throwing an error — despite the weights loading without a visible crash took me an entire day of debugging. I know that if you downloaded the model before the fix, you may have spent hours staring at a prompt that returned nothing and had no idea where to even start. That's an awful experience and I'm genuinely sorry. The model is now fully public, stable, and loads correctly with the custom loader described in this README. Thank you for your patience. 🙏


Table of Contents

  1. What Is This?
  2. The Core Design Insight: Vocabulary Efficiency
  3. Head-to-Head: Stentor v1 vs Stentor2 Preview
  4. Quick Start
  5. Known Loading Issue — Please Read
  6. Important Limitations
  7. Model Architecture — Full Specification
  8. The Tokenizer: TokenMonster
  9. Training Infrastructure
  10. Training Hyperparameters — Complete Reference
  11. The T4 Mixed-Precision Recipe — Deep Dive
  12. Data Pipeline
  13. Weight Initialization
  14. Evaluation & Results
  15. Training Dynamics
  16. Use Cases & Intended Uses
  17. Out-of-Scope Uses
  18. Ethical Considerations & Societal Impact
  19. Inference Guide
  20. Real Model Responses
  21. Quantization
  22. Format Conversion
  23. Speculative Decoding
  24. Bias, Risks & Limitations
  25. Related Work
  26. What's Next
  27. Environmental Impact
  28. Citation

What Is This?

Stentor2-12M-Preview is the first public checkpoint from the Stentor2 model family — a ground-up redesign of the original Stentor v1 line. At ~12.3M parameters, it is a compact base language model built entirely from scratch on free-tier Kaggle compute using two NVIDIA Tesla T4 GPUs.

Like all Stentor models, this is a base next-token predictor, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best used for research, prototyping, speculative decoding, and edge-deployment experimentation. The value of this model is not its conversational capability — it's what it represents architecturally: a dramatic efficiency gain over v1 at the same scale, achieved by fixing the root cause of v1's underperformance.


The Core Design Insight: Vocabulary Efficiency

The most consequential change in Stentor2 is the replacement of the standard Llama/Mistral 32,768-token vocabulary with a purpose-built 8,000-token English vocabulary from the TokenMonster project (english-8000-consistent-v1, padded to 8,064 for hardware alignment).

This is not a minor tweak — it is the entire architectural story of Stentor2.

Why Vocabulary Size Matters So Much at This Scale

In a transformer language model, the embedding table has shape [vocab_size × hidden_size]. When you tie word embeddings (share the embedding and output projection weights, which Stentor does), this table appears once in the parameter count. At 12M total parameters, the fraction consumed by this table dictates how much "brain" is left over for the actual transformer layers.

Stentor-12M (v1) used a 32,768-token vocabulary. At a hidden size of 192:

embedding_params = 32,768 × 192 = 6,291,456
total_params     = 12,047,040
embedding_share  = 52.2%

Over half of the model was a lookup table. The transformer stack — the part that actually learns language patterns — had fewer than 6 million parameters to work with. It was more dictionary than reasoner.

Stentor2-12M-Preview uses an 8,064-token vocabulary. At a hidden size of 256:

embedding_params = 8,064 × 256 = 2,064,384
total_params     = 12,294,400
embedding_share  = 16.8%

By shrinking the vocabulary, the embedding table was cut from 6.3M to 2.1M parameters — freeing up ~4.2M parameters that were redistributed into transformer depth (12 layers vs 9) and width (hidden size 256 vs 192), where they contribute directly to language modeling quality.
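The arithmetic above can be checked in a few lines of Python, using the parameter totals reported in this README:

```python
def embedding_share(vocab_size: int, hidden_size: int, total_params: int) -> float:
    """Fraction of a tied-embedding model's parameters spent on the embedding table."""
    return vocab_size * hidden_size / total_params

v1 = embedding_share(32_768, 192, 12_047_040)   # Stentor-12M (v1)
v2 = embedding_share(8_064, 256, 12_294_400)    # Stentor2-12M-Preview

print(f"v1: {v1:.1%}  v2: {v2:.1%}")  # v1: 52.2%  v2: 16.8%
```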

The result is a ~43.8% reduction in perplexity (89.01 → ~50.07) compared to Stentor-12M. Note that the comparison is close but not perfectly controlled — v1 trained on a mix of FineWeb-Edu and Cosmopedia v2, while Stentor2 trained on FineWeb-Edu only — so this is closer to an apples-to-oranges comparison than a pure ablation, but it is meaningful nonetheless.


Head-to-Head: Stentor v1 vs Stentor2 Preview

Property Stentor-12M (v1) Stentor2-12M-Preview
Vocabulary 32,768 (Mistral BPE) 8,064 (TokenMonster English)
Hidden Size 192 256
Intermediate Size 576 768
Num Layers 9 12
Attention Heads 3 4
Head Dimension 64 64
Context Length 512 tokens 1,024 tokens
Total Parameters 12,047,040 12,294,400
Embedding Share 52.2% 16.8%
Non-Embedding Params ~5.76M ~10.23M
Token Budget 200M 240M
Training Time ~1.3h ~4.4h
Best Perplexity 89.01 ~50.07
Perplexity Reduction — ~43.8%
Tokenizer Mistral BPE TokenMonster
Architecture LlamaForCausalLM LlamaForCausalLM
Training Precision fp16 fp16 + INT8 forward

🚀 Quick Start

1. Install Dependencies

pip install transformers torch safetensors huggingface_hub

tokenmonster will be installed automatically by the loader — you don't need to install it yourself.

2. Load the Model

This model needs a small custom loader script because of a quirk in how the checkpoint was saved during training. The loader is just a Python file (load_stentor2.py) that lives in this repo. You have two options for using it — pick whichever is easier for you:


Option A — Pull it straight from the repo (easiest, no files to manage)

The repo is fully public — no token or authentication is required. This downloads the loader file from HuggingFace into your local cache automatically, then runs it. The file is cached after the first download so it's fast on every run after that.

from huggingface_hub import hf_hub_download
import importlib.util, sys, torch

# Download the loader from the HuggingFace repo (cached after first run)
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")

# Import it as a Python module
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

# Load the model
model, tokenizer = mod.load_stentor2()

The importlib lines are just Python's way of loading a .py file that isn't in your current folder. After those lines, mod behaves exactly like a normal imported module and mod.load_stentor2() works exactly like a normal function call.


Option B — Download the file once, import it normally

Download load_stentor2.py from the Files tab on this page and put it in the same folder as your script. Then just import it like any normal Python file:

from load_stentor2 import load_stentor2
import torch

model, tokenizer = load_stentor2()

If you move your project to a different folder, bring load_stentor2.py with it.


Which should I use?

Option A Option B
Manual file download needed? No Yes (once)
Best for Notebooks, Kaggle, Colab Local projects
Code complexity A few extra lines Simple import

GPU (FP16) — recommended if you have a CUDA GPU:

model, tokenizer = mod.load_stentor2(dtype=torch.float16)  # Option A
model, tokenizer = load_stentor2(dtype=torch.float16)       # Option B

3. Generate Text

Once loaded, the model works like any standard HuggingFace model. Because this is a base model, it continues text rather than answering questions — give it the beginning of a sentence and it will complete it.

input_ids      = torch.tensor([tokenizer.encode("The history of computing")], dtype=torch.long).to(next(model.parameters()).device)
attention_mask = torch.ones_like(input_ids)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=80,
        do_sample=True,
        temperature=1.1,
        top_p=0.55,
        repetition_penalty=1.15,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output[0].tolist()))

Why attention_mask? The model's pad token and EOS token are the same ID. Without an explicit attention mask, HuggingFace throws a warning because it can't tell which tokens are real vs padding. Passing torch.ones_like(input_ids) tells the model that every token in the input is real — which is always true here since we never pad single-sequence inference.

4. Recommended Generation Settings

These settings were validated through hands-on testing and produce the best results for a base model at this scale:

Parameter Recommended Range Notes
temperature 0.65 – 1.2 Lower = more focused, higher = more creative
top_p 0.5 – 0.8 Nucleus sampling cutoff
max_new_tokens 10 – 60 Keep outputs short to stay on topic
repetition_penalty 1.1 – 1.2 Helps prevent looping

⚠️ Keep max_new_tokens low. This is a 12M parameter base model — it does not have robust long-range coherence. Short completions are significantly more coherent than long ones. Going beyond ~60 tokens will often result in the model wandering off topic or repeating itself.


⚠️ Known Loading Issue — Please Read

AutoModelForCausalLM.from_pretrained() does NOT work with this model. This section explains exactly what goes wrong, why, and how the loader fixes it. This is a preview-only issue — it will not exist in the final Stentor2-12M release.

What Goes Wrong

If you try to load the model the normal way:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor2-12M-Preview")

You will get a load report showing a bunch of UNEXPECTED and MISSING keys for layers 2–8:

model.layers.{2,3,4,5,6,7,8}.self_attn.q_proj.weight_master  | UNEXPECTED
model.layers.{2,3,4,5,6,7,8}.self_attn.q_proj.weight         | MISSING

Those layers are left randomly initialized — the checkpoint contains the right data; it is just stored under the wrong names. The result is that the model either produces no output at all, or throws an error during generation, with no clear indication of why. You can stare at a prompt that returns nothing and have no obvious place to start debugging — which is exactly what makes this failure so painful to track down.

Why It Happens

Think of a model checkpoint as a dictionary where every layer's weights are stored under a name, like a filing cabinet. The standard name for a weight is .weight. HuggingFace opens the filing cabinet, looks for files labeled .weight, and loads them.

During training, layers 2–8 used a special training wrapper called Int8LinearT4 that stored weights under .weight_master instead of .weight. When the training finished and the checkpoint was saved, those non-standard labels were written to disk exactly as-is.

So HuggingFace opens the filing cabinet, looks for .weight in layers 2–8, finds nothing (MISSING), then notices there are .weight_master labels it doesn't recognize (UNEXPECTED), and moves on — leaving those layers randomly initialized. The model runs. The output is meaningless. No error is ever raised.

How the Loader Fixes It

load_stentor2.py opens the raw checkpoint file itself before the model ever sees it, finds every .weight_master label, and renames it to .weight:

model.layers.3.self_attn.q_proj.weight_master  →  model.layers.3.self_attn.q_proj.weight

Here is exactly what that key-renaming logic looks like:

sd = {}
masters = {k for k in raw_sd if k.endswith(".weight_master")}
skip    = {k[:-len("_master")] for k in masters}
for k, v in raw_sd.items():
    if k.endswith(".weight_master"):
        sd[k[:-len("_master")]] = v   # rename: drop "_master"
    elif k not in skip:
        sd[k] = v                      # keep everything else unchanged

model.load_state_dict(sd, strict=False)

Then it hands the corrected checkpoint to the model. The model just sees normal .weight labels and loads fine. From that point on it is a completely standard LlamaForCausalLM — no special handling needed for anything else.

This will not be an issue in Stentor2-12M. The final release will save a clean checkpoint with standard key names that loads with AutoModelForCausalLM.from_pretrained() as normal. This is purely a preview artifact.


⚠️ Important Limitations

  • Not Instruction-Tuned: This is a base model. It will often ignore prompts, continue in unexpected directions, or respond off-topic. The chat template in the tokenizer config is present for structural compatibility, not because the model knows how to use it.
  • No Safety Tuning: No RLHF, no constitutional AI, no content filtering. Use with appropriate caution.
  • Limited World Knowledge: ~12M parameters cannot store meaningful world knowledge. Do not treat outputs as factual.
  • Context Window: Hard limit of 1,024 tokens. The model was trained exclusively on 1,024-token packed sequences; longer contexts are untested and likely to degrade.
  • English Only: The TokenMonster english-8000-consistent-v1 vocabulary is English-specific. Non-English text will tokenize very poorly.
  • Custom Tokenizer: This model uses a TokenMonster adapter, not a standard Hugging Face fast tokenizer. The tokenizer.json format differs from typical models. Make sure tokenmonster is installed before loading.
  • skip_special_tokens Not Supported: The TokenMonster tokenizer does not support the skip_special_tokens argument in its decode method. Calling tokenizer.decode(ids, skip_special_tokens=True) will raise an error. Strip special tokens manually if needed — see the Tokenizer section for details.
  • Preview Quality: Further architectural improvements have already been identified. This is not the final Stentor2 model.
  • Shared Tensor Warning: When saving or loading this model, you may see: Removed shared tensor {'lm_head.weight'} while saving. This is expected behavior from tied word embeddings and is safe to ignore.

Model Architecture — Full Specification

Stentor2-12M-Preview is a LlamaForCausalLM model. All architecture values below were derived directly from the training script and validated against the logged parameter counts.

Core Configuration

Component Value Derivation
Architecture LlamaForCausalLM Hard-coded in training script
Hidden Size 256 Inferred: embedding_params (2,064,384) ÷ vocab_size (8,064) = 256 ✓
Intermediate Size (FFN) 768 Hidden × 3 (verified via total param count)
Num Hidden Layers 12 Verified via total param count formula
Num Attention Heads 4 Hidden ÷ head_dim = 256 ÷ 64 = 4
Num Key/Value Heads 4 Full MHA (no GQA at this scale)
Head Dimension 64 Enforced by training script: head_dim must be 64
Vocab Size 8,064 TokenMonster 8K base + 62 padding tokens (multiple of 128)
Max Position Embeddings 1,024 block_size default in training script
Hidden Activation SiLU LlamaForCausalLM default
Positional Encoding RoPE rope_theta = 10,000.0
RMS Norm Epsilon 1e-5 Default in training script
Tie Word Embeddings True Shared embedding / LM head weights
Attention Implementation SDPA PyTorch Scaled Dot Product Attention
Attention Pattern Full causal No sliding window, no sparse patterns

Parameter Count Breakdown

The total parameter count can be reproduced exactly using the following formula from the training script:

def estimate_llama_params(vocab_size, hidden_size, intermediate_size,
                          num_hidden_layers, num_attention_heads, num_key_value_heads):
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    # Q, O projections (hidden→hidden) + K, V projections (hidden→kv_dim)
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gate, Up, Down projections
    mlp  = 3 * hidden_size * intermediate_size
    # Input norm + post-attention norm per layer
    norm = 2 * hidden_size
    # Embedding table + final RMS norm
    total = vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total

Plugging in Stentor2 values:

kv_dim  = 256 * 4 / 4 = 256
attn    = 2×256×256 + 2×256×256 = 131,072 + 131,072 = 262,144
mlp     = 3×256×768 = 589,824
norm    = 2×256 = 512
per_layer = 262,144 + 589,824 + 512 = 852,480

embedding = 8,064 × 256  = 2,064,384
layers    = 12 × 852,480 = 10,229,760
final_norm = 256

total = 2,064,384 + 10,229,760 + 256 = 12,294,400 ✓

Component Parameters % of Total
Embedding Table (tied with LM Head) 2,064,384 16.8%
Transformer Layers × 12 10,229,760 83.2%
— Attention (per layer × 12) 3,145,728 25.6%
— FFN/MLP (per layer × 12) 7,077,888 57.6%
— Layer Norms (per layer × 12) 6,144 0.05%
Final RMS Norm 256 0.002%
Total 12,294,400 100%

Architecture Constraints Enforced by Training Script

The training pipeline enforces several hard constraints that directly shaped the final architecture:

  1. Head dimension must be exactly 64. The script raises a SystemExit if hidden_size / num_attention_heads ≠ 64. This is a T4 hardware efficiency constraint — 64 is the optimal head dim for the T4's tensor core utilization.

  2. KV heads ≤ attention heads, and attention heads divisible by KV heads. Standard MHA constraint (no GQA at this scale).

  3. Vocabulary padded to nearest multiple of 128. pad_vocab_to_multiple=128 for hardware alignment.
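Constraint 3 is a simple round-up; a one-liner reproduces the padded vocabulary size:

```python
def pad_vocab(size: int, multiple: int = 128) -> int:
    """Round a vocabulary size up to the nearest multiple (hardware alignment)."""
    return ((size + multiple - 1) // multiple) * multiple

pad_vocab(8_002)  # → 8064: the ~8,002-token TokenMonster base padded with 62 dummy tokens
```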


The Tokenizer: TokenMonster

Stentor2 uses a custom tokenizer adapter wrapping the TokenMonster english-8000-consistent-v1 vocabulary, rather than a standard BPE tokenizer from the Hugging Face ecosystem.

What Is TokenMonster?

TokenMonster (alasdairforsythe/tokenmonster) is an alternative tokenization approach optimized for compact English vocabulary sizes. The english-8000-consistent-v1 vocabulary is a purpose-built ~8,000-token English vocabulary designed for efficiency at small model scales.

⚠️ skip_special_tokens Is Not Supported

The TokenMonster tokenizer does not support the skip_special_tokens argument in its decode method. If you call tokenizer.decode(ids, skip_special_tokens=True) you will get an error. Always decode without it and strip special tokens manually if needed:

# ✅ Correct
text = tokenizer.decode(output_ids)

# ❌ This will raise an error
text = tokenizer.decode(output_ids, skip_special_tokens=True)

If you are using HuggingFace's TextIteratorStreamer or any wrapper that internally passes skip_special_tokens=True to the tokenizer, you will need to patch or replace that wrapper. The demo Space and the loader script both handle this correctly already.
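Since the adapter cannot strip special tokens for you, a minimal post-processing helper might look like the following — a sketch assuming the special-token strings listed under Vocabulary Construction (<s>, </s>, <pad>, and the <|extra_N|> dummies):

```python
import re

# Special-token strings from this repo's tokenizer config; the <|extra_N|>
# entries are the dummy padding tokens described under Vocabulary Construction.
_SPECIAL = re.compile(r"</?s>|<pad>|<\|extra_\d+\|>")

def strip_special_tokens(text: str) -> str:
    """Post-process a plain tokenizer.decode() result, since
    skip_special_tokens is not supported by the TokenMonster adapter."""
    return _SPECIAL.sub("", text).strip()

strip_special_tokens("<s>The history of computing</s>")  # → "The history of computing"
```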

Tokenizer Efficiency vs. v1

You may notice that this tokenizer produces more tokens per word compared to Stentor v1. This is expected and by design. The v1 models used a 32,768-token Mistral BPE vocabulary, which encodes common English words as single tokens very efficiently. Stentor2 uses an 8,064-token TokenMonster vocabulary — smaller vocabulary means more tokens per word on average. This is the direct tradeoff for freeing up ~4.2M parameters for the transformer layers. The ~43.8% perplexity improvement shows the tradeoff was worth it.

Vocabulary Construction

The tokenizer pipeline proceeds as follows:

  1. Base vocabulary is loaded from alasdairforsythe/tokenmonstervocabs/english-8000-consistent-v1.vocab via hf_hub_download.
  2. Special tokens are added: </s> (EOS), <s> (BOS), <pad> (set equal to EOS).
  3. A default chat template is injected for structural compatibility.
  4. The vocabulary is padded to the nearest multiple of 128 using dummy tokens <|extra_0|>, <|extra_1|>, ..., resulting in a final vocabulary size of 8,064 tokens.

Base TokenMonster vocab:   ~8,002 tokens
+ padding to 128-multiple: +62 tokens
= Final vocab size:         8,064 tokens

The TokenMonsterTokenizerAdapter

The training script wraps the TokenMonster vocabulary in a custom TokenMonsterTokenizerAdapter class that provides a HuggingFace-compatible interface. Key implementation details:

  • Tokenization: Calls vocab.tokenize(batch) — batch or single-string input
  • Decoding: Calls vocab.decode(token_ids)
  • No padding during tokenization itself — padding is handled by the data collator
  • EOS appended to each training sample in the tokenization function
  • is_fast = True flag set to satisfy the training script's fast-tokenizer requirement
  • save_pretrained saves a tokenmonster.vocab binary + tokenizer_config.json + special_tokens_map.json

Tokenizer Configuration

{
  "tokenizer_type": "tokenmonster",
  "vocab_file": "tokenmonster.vocab",
  "model_max_length": 1024,
  "eos_token": "</s>",
  "bos_token": "<s>",
  "pad_token": "</s>",
  "vocab_size": 8064
}

Chat Template

A simple chat template is injected during tokenizer setup for structural compatibility with chat formatting tools, though the base model is not trained to follow it:

{% for message in messages %}
<|{{ message['role'] }}|>
{{ message['content'] }}
{% endfor %}
{% if add_generation_prompt %}<|assistant|>
{% endif %}

Loading the Tokenizer in Inference

Because the tokenizer is a custom type, standard AutoTokenizer.from_pretrained may require the tokenmonster Python package:

pip install tokenmonster

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor2-12M-Preview",
    trust_remote_code=True  # may be needed depending on version
)

Training Infrastructure

Hardware

Component Specification
GPU Count 2× NVIDIA Tesla T4
VRAM per GPU 15.64 GB
Total VRAM ~31.3 GB
Platform Kaggle Notebooks (free tier)
Accelerator Library HuggingFace Accelerate
Active Processes 1 (single-process despite 2 GPUs; T4 recipe runs on device 0)

Note on Dual-GPU Setup: The training environment was configured with 2× T4 GPUs and the Accelerate library initialized the dual-GPU pipeline. However, the training run executed as a single process (num_processes: 1), so only one GPU performed the actual compute; the second GPU was available but unused for this run. The device_map="auto" infrastructure was configured but not exercised.

Software Stack

Package Role
PyTorch Core tensor operations and autograd
HuggingFace Transformers Model architecture (LlamaForCausalLM)
HuggingFace Accelerate Training loop and device management
HuggingFace Datasets Streaming data loading
bitsandbytes INT8 quantization primitives
tokenmonster Custom vocabulary
safetensors Model serialization

Training Hyperparameters — Complete Reference

The following table represents the exact configuration used for this training run, sourced directly from the training script defaults and confirmed against the training logs.

Core Training Parameters

Hyperparameter Value Notes
learning_rate 2e-4 AdamW LR for all parameters
weight_decay 0.01 Applied to non-embedding, non-norm, non-bias params
max_grad_norm 1.0 Gradient clipping threshold
optimizer AdamW With betas=(0.9, 0.95), eps=1e-8
scheduler Cosine Cosine decay with linear warmup
warmup_ratio 0.05 → 732 warmup steps
stable_ratio 0.8 → 11,719 stable steps (cosine)
token_budget 240,000,000 Hard stop at 240M tokens seen
max_train_steps 14,649 Computed from token budget
seed 42 Reproducibility seed
mixed_precision fp16 All activations/gradients in FP16

Batch & Sequence Parameters

Hyperparameter Value Notes
per_device_train_batch_size 4 Per GPU per gradient accumulation step
per_device_eval_batch_size 4 Evaluation batch size
gradient_accumulation_steps 4 One optimizer step every 4 forward passes
total_batch_size 16 per_device × processes × grad_accum = 4×1×4
block_size 1,024 Sequence length; training packed to this size
tokens_per_optimizer_step 16,384 total_batch_size × block_size
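These values compose into the run's step count; the arithmetic can be reproduced directly (whether the script rounds the budget up or down is an assumption — ceiling division matches the logged 14,649):

```python
per_device_bs  = 4
num_processes  = 1
grad_accum     = 4
block_size     = 1_024
token_budget   = 240_000_000

total_batch_size = per_device_bs * num_processes * grad_accum  # 16
tokens_per_step  = total_batch_size * block_size               # 16,384
max_train_steps  = -(-token_budget // tokens_per_step)         # ceil division → 14,649

print(total_batch_size, tokens_per_step, max_train_steps)
```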

Evaluation & Checkpointing

Hyperparameter Value
eval_steps 375
save_every_minutes 30
save_total_limit 2
save_epochs 1
logging_steps 125
max_eval_samples 2,000

AdamW Optimizer — Detailed

The optimizer uses a decoupled parameter group strategy:

  • Decay group: All nn.Linear weight matrices (excludes bias, norm weights, embedding)
    • weight_decay = 0.01
  • No-decay group: Bias terms, normalization parameters, embedding parameters
    • weight_decay = 0.0
  • Betas: (0.9, 0.95) — the 0.95 β₂ is a modern LLM default (vs the 0.999 PyTorch default)
  • Epsilon: 1e-8
  • Fused kernel: Enabled if available (torch.optim.AdamW(fused=True) when CUDA is present)

Learning Rate Schedule

The cosine schedule with warmup proceeds through three phases:

Phase 1 — Warmup (steps 0–732):
  LR ramps linearly from 0 → 2e-4

Phase 2 — Stable / Cosine Decay (steps 732–14,649):
  LR follows cosine curve from 2e-4 → 0

Phase 3 — (N/A for cosine; WSD decay phase only applies if scheduler=wsd)

Implemented via HuggingFace get_cosine_schedule_with_warmup.
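The two phases can be sketched as a pure-Python function with the same shape as the default get_cosine_schedule_with_warmup curve (the HF implementation also supports extra options such as num_cycles; this mirrors only the default):

```python
import math

def lr_at_step(step: int, max_lr: float = 2e-4,
               warmup_steps: int = 732, total_steps: int = 14_649) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps                     # Phase 1: linear ramp
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # Phase 2: cosine → 0

lr_at_step(0)       # 0.0
lr_at_step(732)     # 2e-4 (peak, end of warmup)
lr_at_step(14_649)  # 0.0
```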


The T4 Mixed-Precision Recipe — Deep Dive

The most technically interesting aspect of Stentor2's training pipeline is its custom T4 Mixed-Precision Recipe — a bespoke approach to stable mixed-precision training on NVIDIA Tesla T4 GPUs, which lack BF16 support and have known numerical instability issues with FP16 on certain operations.

This recipe involves four distinct techniques applied simultaneously:

1. INT8 Simulated-Quantization Linear (49 modules)

All non-critical transformer linear layers are wrapped in a custom Int8LinearT4 module that performs quantization-aware training (QAT) with a straight-through estimator (STE).

How it works:

  • The module stores a FP32 master weight (weight_master) — gradients always flow back to this full-precision copy
  • On each forward pass, the weight is quantized to INT8 (simulated, not actual int8 memory layout) and then dequantized back to FP16 for the matmul
  • Both weights and activations are independently quantized using a per-row/per-token absolute-max scale: scale = abs(x).amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
  • Stochastic rounding is used during training (disabled at eval): instead of round(x), each fractional part is probabilistically rounded up or down — this reduces systematic quantization bias
  • The STE ensures gradients pass through the non-differentiable rounding operation unchanged

Why this matters: The quantization error acts as a regularizer and forces the model to learn representations that are robust to 8-bit precision — a desirable property for downstream deployment on quantized hardware.

INT8 QAT forward pass (simplified):
  scale  = |W|.row_max / 127
  W_q    = round(W / scale).clamp(-127, 127)   ← stochastic
  W_dq   = W_q × scale                          ← dequantize
  W_ste  = W + (W_dq - W).detach()             ← STE: gradient sees full W
  output = x_ste @ W_ste.T + bias
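The quantize→dequantize step can be sketched in plain Python — eval-mode only, with deterministic rounding; the stochastic rounding and STE used during training need autograd and are omitted here:

```python
def int8_fake_quant(row):
    """Per-row absmax simulated INT8 quantize→dequantize, as in the
    Int8LinearT4 forward pass (eval mode: deterministic rounding)."""
    scale = max(max(abs(x) for x in row), 1e-8) / 127.0
    return [max(-127, min(127, round(x / scale))) * scale for x in row]

row = [0.5, -1.0, 0.25]
dq  = int8_fake_quant(row)
# round-trip error per element is bounded by scale/2 = max|row| / 254
assert all(abs(a - b) <= 1.0 / 254 + 1e-12 for a, b in zip(dq, row))
```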

2. FP32 Critical Layers (5 layers: first 2 + last 3)

The first 2 and last 3 transformer layers are designated as critical layers and run entirely in FP32:

  • Their weights are cast to .float() at setup time
  • Their forward() method is monkey-patched to cast all inputs to FP32 before the call and cast outputs back to the original dtype afterward
  • torch.amp.autocast("cuda", enabled=False) context is used to prevent autocast from re-downcasting inside the layer

Rationale: The first layers are responsible for embedding projection and initial feature extraction; instability here corrupts the entire forward pass. The last layers handle final token prediction; numerical errors here directly impact loss. Running these in FP32 provides a stability floor at minimal compute cost.

3. FP32 Normalization Layers (25 modules)

All RMSNorm and LayerNorm modules are monkey-patched to run their computation in FP32 regardless of input dtype:

def _fp32_norm_forward(hidden_states, *args, **kwargs):
    # original_forward = the norm module's unpatched forward, captured via closure
    input_dtype = hidden_states.dtype
    output = original_forward(hidden_states.float().contiguous(), *args, **kwargs)
    return output.clone().to(input_dtype)

The .clone() call is critical: it prevents returning graph-managed buffers that can be overwritten across CUDAGraph replay steps under torch.compile. The inputs are also .contiguous() to prevent strided-tensor issues in FP32 norm ops.

Why 25 modules: With 12 transformer layers × 2 norms each (input norm + post-attention norm) + 1 final norm = 25 total.

This is why torch.compile is disabled. The FP32 norm wrappers are incompatible with CUDAGraph replay, which torch.compile uses under reduce-overhead mode. Enabling both would cause silent correctness errors.

4. FP32 Attention Softmax (12 modules)

Each attention module's forward() is monkey-patched to replace torch.nn.functional.softmax with a version that upcasts FP16/BF16 inputs to FP32 before computing the softmax, then downcasts the result:

def _softmax_fp32(input_tensor, *args, **kwargs):
    if input_tensor.dtype in (torch.float16, torch.bfloat16):
        output = original_softmax(input_tensor.float(), *args, **kwargs)
        return output.to(input_tensor.dtype)
    return original_softmax(input_tensor, *args, **kwargs)

Why this matters: Softmax over large attention weight matrices in FP16 frequently produces NaN or Inf values due to numerical overflow in the exp() operation. Running the softmax itself in FP32 eliminates this instability entirely, which is essential for stable long-context attention (1024 tokens).

T4 Recipe Summary Table

Technique Count Scope
INT8 QAT linear modules 49 All non-critical linear layers
FP32 critical layers 5 Layers {0, 1, 9, 10, 11}
FP32 norm modules 25 All RMSNorm / LayerNorm
FP32 softmax modules 12 All attention modules

Gradient Checkpointing

Gradient checkpointing is enabled using the non-reentrant path (use_reentrant=False, preferred in modern PyTorch) to reduce activation memory. model.config.use_cache = False is set to prevent KV cache allocation during training. model.enable_input_require_grads() is called to ensure gradients can flow through checkpoint boundaries.


Data Pipeline

Dataset

The model was trained exclusively on FineWeb-Edu (HuggingFaceFW/fineweb-edu) — a large-scale web corpus filtered for educational content quality. Cosmopedia v2 was available in the pipeline (configurable via --cosmopedia_weight) but the default weight of 0.0 means it was not used in this run.

Total tokens processed: 240,001,024 (budget-limited run)

Streaming Mode

The dataset was loaded in streaming mode (streaming=True), meaning:

  • No data was pre-downloaded or pre-tokenized to disk
  • Samples were tokenized on-the-fly during training
  • num_workers=0 was enforced (IterableDataset + multiprocessing causes deadlocks in notebook environments)
  • Shuffle buffer of 20,000 samples was applied

Text Preprocessing

Each raw text sample undergoes the following cleaning pipeline before tokenization:

import unicodedata

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)     # normalize unicode
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    text = " ".join(lines)                         # collapse newlines
    text = " ".join(text.split())                  # normalize whitespace
    return text

Why these specific steps:

  • NFKC normalization maps visually equivalent Unicode characters to a single canonical form (e.g., full-width Ａ → A, ligature ﬁ → fi, superscript ² → 2). This is the standard choice for LLM preprocessing — used in T5 (Raffel et al., 2020, arXiv:1910.10683), BERT (Devlin et al., 2019, arXiv:1810.04805), and specified by the Unicode standard itself (Unicode Standard Annex #15). Without it, the model would see dozens of token IDs for what is semantically one character.

  • Whitespace collapse (join lines, collapse spaces) ensures consistent tokenization of the same content regardless of how it was originally formatted. Web-scraped text commonly contains inconsistent line breaks, multiple spaces, and mixed newline styles. This is also standard practice in GPT-style pretraining pipelines. No ablation was performed on this step — it was adopted from established practice rather than experimentally derived.
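A worked example of both steps (the clean_text function is reproduced so the snippet is self-contained):

```python
import unicodedata

def clean_text(text: str) -> str:
    # Same steps as the pipeline above: NFKC, then collapse lines/whitespace
    text = unicodedata.normalize("NFKC", text)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return " ".join(" ".join(lines).split())

# Full-width letters, the fi ligature, and a superscript all fold to ASCII,
# and the ragged line breaks collapse to single spaces:
cleaned = clean_text("Ｕｎｉｃｏｄｅ   ﬁle\n\n  test ²")
# cleaned == "Unicode file test 2"
```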

Tokenization

Each cleaned sample is tokenized using the TokenMonster adapter:

  • add_special_tokens=False during tokenization
  • EOS token (</s>) appended to every sample
  • Attention mask generated (all 1s for real tokens)
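A minimal sketch of this step (the EOS id and token ids are stand-in values; the real pipeline goes through the TokenMonster adapter):

```python
EOS_ID = 2  # hypothetical id for </s>

def tokenize_sample(token_ids, eos_id=EOS_ID):
    # add_special_tokens=False: token_ids holds content tokens only
    input_ids = list(token_ids) + [eos_id]   # EOS appended to every sample
    attention_mask = [1] * len(input_ids)    # all 1s for real tokens
    return {"input_ids": input_ids, "attention_mask": attention_mask}

sample = tokenize_sample([17, 204, 991])
# sample["input_ids"] == [17, 204, 991, 2]
```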

Sequence Packing

After tokenization, samples are packed into fixed 1,024-token blocks using a stateful packing function:

Sample 1:  [tok, tok, tok, ..., </s>]   (e.g., 347 tokens)
Sample 2:  [tok, tok, tok, ..., </s>]   (e.g., 891 tokens)
                                         ↓
Block 1:   [<sample1...>, <first 677 tokens of sample2>]   (1024 tokens)
Block 2:   [<remaining 214 tokens of sample2>, <sample3...>]

Packing eliminates all padding waste and ensures every training token is a real content token. The remainder buffer carries leftover tokens between batch iterations. At the end of the dataset, any leftover tokens are padded to 1,024 with the EOS token and labels masked (-100) for the padded positions.

Labels for packed sequences are identical to input_ids (causal LM: predict each token from all preceding tokens). There is no special boundary masking between packed samples in this pipeline — the model learns to cross document boundaries, which is standard practice.
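The packing behavior described above can be sketched in pure Python (illustrative only; a toy block size is used so the spill-over and tail padding are visible):

```python
EOS_ID, IGNORE = 2, -100   # hypothetical </s> id; -100 masks the loss

def pack(samples, block=1024, eos_id=EOS_ID):
    """Concatenate samples, emit fixed-size blocks, carry the remainder,
    and pad the final partial block with EOS (padded labels masked)."""
    buffer, blocks = [], []
    for ids in samples:
        buffer.extend(ids)
        while len(buffer) >= block:
            chunk, buffer = buffer[:block], buffer[block:]
            blocks.append({"input_ids": chunk, "labels": list(chunk)})
    if buffer:  # end of dataset: pad the leftover tokens
        pad = block - len(buffer)
        blocks.append({
            "input_ids": buffer + [eos_id] * pad,
            "labels": buffer + [IGNORE] * pad,
        })
    return blocks

# Two samples, toy block size of 8: the second sample spills across blocks
blocks = pack([[1, 1, 1, 1, 1, EOS_ID], [3, 3, 3, 3, 3, EOS_ID]], block=8)
```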

Validation Split

A held-out validation set of 2,000 samples was used for evaluation, drawn from the streaming dataset via .take(2000) before training data was streamed.

Data Collation

The packed collator pads batches to the longest sequence in the batch (rounded up to the nearest multiple of 8 for hardware alignment):

  • input_ids: padded with pad_token_id
  • labels: padded with -100 (ignored in loss computation)
  • attention_mask: padded with 0
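A pure-Python sketch of that padding rule (the pad id is a hypothetical stand-in; the real collator operates on tensors):

```python
PAD_ID = 0  # hypothetical pad_token_id

def collate(batch, pad_id=PAD_ID, multiple=8):
    longest = max(len(ex["input_ids"]) for ex in batch)
    target = -(-longest // multiple) * multiple   # round up to multiple of 8
    out = {"input_ids": [], "labels": [], "attention_mask": []}
    for ex in batch:
        pad = target - len(ex["input_ids"])
        out["input_ids"].append(ex["input_ids"] + [pad_id] * pad)
        out["labels"].append(ex["labels"] + [-100] * pad)
        out["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * pad)
    return out

# Longest sequence is 11 tokens, so everything pads up to 16 (next multiple of 8)
batch = collate([
    {"input_ids": [5] * 11, "labels": [5] * 11},
    {"input_ids": [7] * 4,  "labels": [7] * 4},
])
```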

Weight Initialization

All parameters are initialized from a normal distribution with std=0.02 — the same initialization used in GPT-2 and most modern LLMs:

def initialize_weights(model, std=0.02):
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=std)
            if getattr(module, "bias", None) is not None:  # nn.Embedding has no bias
                module.bias.data.zero_()
        elif "layernorm" in type(module).__name__.lower() or \
             "rmsnorm"   in type(module).__name__.lower():
            if module.weight is not None:
                module.weight.data.fill_(1.0)   # scale initialized to 1
            if getattr(module, "bias", None) is not None:  # RMSNorm has no bias
                module.bias.data.zero_()

Key points:

  • Linear layers: normal(0, 0.02)
  • Embeddings: normal(0, 0.02) — same as linear
  • RMSNorm scale weights: initialized to 1.0 (identity transform at start)
  • All biases: zero

This initialization is applied before the T4 recipe is applied. The T4 recipe then copies nn.Linear.weight into Int8LinearT4.weight_master as FP32, preserving the initialization.


Evaluation & Results

Training Curves

The charts below show validation loss and perplexity over the course of the training run. Both are plotted against optimizer steps. The best checkpoint (step 11,625) is visible as the lowest point before the slight uptick in the tail phase.

Validation loss over training steps

Perplexity over training steps

Metrics

  • Validation Loss: Cross-entropy loss over the held-out validation split (lower = better)
  • Perplexity (PPL): exp(loss) — lower means the model is less "surprised" by unseen text
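The conversion is a one-liner; applying it to the loss values reported in the Results Summary reproduces the listed perplexities:

```python
import math

# Perplexity is exp(cross-entropy loss)
best_ppl  = math.exp(3.9145)   # best checkpoint -> ~50.1
final_ppl = math.exp(4.0083)   # final epoch     -> ~55.05
```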

Results Summary

Checkpoint        Step     Eval Loss   Perplexity
Initial           375      7.1108      ~1,228
Early             1,500    5.4646      ~236
Mid               3,375    4.6069      ~100
Mid-Late          6,750    4.1789      ~65
Late              9,375    4.0686      ~58
Best Checkpoint   11,625   3.9145      ~50.1
Final Epoch       14,649   4.0083      55.05

Comparison to Stentor v1

Model                  Best Eval Loss   Best Perplexity   Improvement
Stentor-12M (v1)       4.4887           89.01             (baseline)
Stentor2-12M-Preview   3.9145           ~50.1             ↓43.8% perplexity

The ~43.8% perplexity reduction is a close but not perfectly controlled comparison: v1 was trained on a mix of FineWeb-Edu and Cosmopedia v2, while Stentor2 was trained on FineWeb-Edu only. Both use educational-quality text at the same parameter count, so this is close to, but not strictly, an apples-to-apples comparison: the vocabulary size, architecture configuration, and token budget (200M vs. 240M) all differ.
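For reference, the quoted reduction follows directly from the two best-checkpoint losses:

```python
import math

v1_ppl = math.exp(4.4887)         # ~89.0 (Stentor-12M v1 best checkpoint)
v2_ppl = math.exp(3.9145)         # ~50.1 (Stentor2-12M-Preview best checkpoint)
reduction = 1 - v2_ppl / v1_ppl   # ~0.437-0.438 depending on rounding
```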


Training Dynamics

The training run proceeded for a single epoch over 14,649 optimizer steps, consuming exactly 240,001,024 tokens (budget-limited). Several observations from the training curve are worth noting for researchers:

Early Phase (steps 0–2,250): Loss drops rapidly from ~8.36 → ~4.97. The model quickly learns basic token co-occurrence statistics. Best eval checkpoints update frequently (steps 375, 750, 1125, 1500, 1875, 2250).

Middle Phase (steps 2,250–8,625): Loss continues declining but with more noise. Individual batch losses oscillate significantly (3.7–5.5 range) while eval loss steadily improves. This is characteristic of a model encountering varied document types in a shuffled stream.

Late Phase (steps 8,625–11,625): Eval loss reaches its lowest point at step 11,625 (3.9145). The model's best checkpoint is saved here.

Tail Phase (steps 11,625–14,649): Eval loss increases slightly to 4.0083 at the final epoch eval. This is consistent with cosine schedule tail behavior — the learning rate approaches zero and the model may slightly overfit to recent batches or experience minor distribution drift near the end of the dataset.


Use Cases & Intended Uses

🔬 Reminder: This is a research artifact. It is a base language model with no safety tuning, no instruction following, and no factual grounding. Every intended use below assumes a researcher or developer context, not an end user.

Intended Uses

Use Case Suitability Notes
Studying transformer training dynamics ✅ High Small enough to train/fine-tune on free compute
Tokenization efficiency research ✅ High 8K vs 32K vocab tradeoff is directly observable
Speculative decoding experiments ✅ High Fast enough to serve as a draft model
Benchmarking CPU/edge inference latency ✅ High ~12MB in FP16, runs on any hardware
Testing quantization/conversion pipelines ✅ High GGUF, ONNX, INT8 pipeline validation
Teaching material for LLM courses ✅ High Architecture is simple enough to trace by hand
LoRA / QLoRA fine-tuning experiments ✅ Moderate Base model only; start from scratch for any task
Text continuation / creative prompting ✅ Moderate Works best on short completions ≤60 tokens
Domain-specific fine-tuning research ✅ Moderate Small enough to iterate rapidly
Factual Q&A ❌ Not suitable Model has no reliable world knowledge
Production deployment ❌ Not suitable No safety tuning; preview quality only
Non-English text ❌ Not suitable TokenMonster vocab is English-only
Long-document tasks (>512 tokens of coherent output) ❌ Not suitable Coherence degrades quickly

Out-of-Scope Uses

The following uses are explicitly out of scope and should not be attempted:

  • User-facing applications of any kind — This model has no safety filtering, no alignment, and no factual reliability. Deploying it in a context where a real user receives its output without expert review is inappropriate regardless of the domain.
  • Medical, legal, or financial advice — Even if prompted carefully, 12M parameters cannot store or reason over specialized knowledge reliably. All outputs should be treated as potentially wrong.
  • Generating content about real people — The model has no awareness of who real people are or what they have said/done. Outputs mentioning real people are likely to be fabricated.
  • Automated content pipelines — Do not use this model to generate content at scale without human review. The output quality and coherence are not sufficient for unreviewed publication.
  • Non-English use — The 8,064-token TokenMonster vocabulary is built exclusively for English. Prompts in other languages will be tokenized very poorly and outputs will be unreliable.
  • Instruction following — This is a base model. It does not reliably follow instructions, answer questions, or complete structured tasks. Prompting it as if it were a chat assistant will not work.

Ethical Considerations & Societal Impact

Inherited Data Biases

Stentor2-12M-Preview was trained on FineWeb-Edu, a filtered subset of Common Crawl. Despite quality filtering, this data inherits the biases present in English-language web text:

  • Western-centric perspective — Educational content on the web skews heavily toward Western, primarily American and European, viewpoints and examples.
  • English monolingualism — The training data and vocabulary are both English-only. The model has no meaningful capability in other languages.
  • Demographic underrepresentation — Groups that are underrepresented in English-language educational web content will be underrepresented in the model's outputs.
  • Temporal cutoff — FineWeb-Edu's data has a cutoff; the model has no knowledge of recent events.

No Safety Tuning

This model has received no safety training of any kind — no RLHF, no DPO, no constitutional AI, no content filtering. It is a raw base model that predicts the next token based on statistical patterns. It should not be used in any context where harmful outputs would cause real-world harm.

Positive Societal Aspects

  • Democratizing AI research — Trained entirely on free-tier Kaggle compute, this model demonstrates that meaningful LLM research does not require significant financial resources. Students and independent researchers can reproduce, study, and build on this work.
  • Transparency — Full training hyperparameters, architecture details, and training script are published. This is a contribution to reproducible ML research.
  • Minimal environmental footprint — ~4.4 hours of single-GPU compute. Estimated carbon footprint under 0.5 kg CO₂e.

Responsible Use Reminder

If you use this model in research, please document clearly that it is an unaligned base model and include appropriate caveats when reporting results. Do not present outputs from this model as factual without verification.


Inference Guide

⚠️ All examples below use the custom loader. See the Known Loading Issue section for why AutoModelForCausalLM.from_pretrained() cannot be used directly. Use either Option A (call from repo) or Option B (local file) from the Quick Start section to get model and tokenizer, then the code below works identically either way.

Basic Generation

# Load using Option A or B from Quick Start first, then:
import torch

device = next(model.parameters()).device

def generate(prompt, max_new_tokens=50, temperature=0.9, top_p=0.65):
    input_ids      = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long).to(device)
    attention_mask = torch.ones_like(input_ids)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.15,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_ids = output[0][input_ids.shape[1]:].tolist()
    return tokenizer.decode(new_ids).strip()

print(generate("The history of computing began"))

CPU (FP32)

model, tokenizer = mod.load_stentor2(dtype=torch.float32)   # Option A
model, tokenizer = load_stentor2(dtype=torch.float32)        # Option B
model = model.to("cpu")

GPU (FP16)

model, tokenizer = mod.load_stentor2(dtype=torch.float16)   # Option A
model, tokenizer = load_stentor2(dtype=torch.float16)        # Option B
model = model.to("cuda")

From a Local Checkpoint

model, tokenizer = mod.load_stentor2("./path/to/local/checkpoint")   # Option A
model, tokenizer = load_stentor2("./path/to/local/checkpoint")        # Option B

Real Model Responses

These are actual unedited outputs from the model. All examples use the custom loader described above.


Prompt: Some sicknesses are
Settings: max_new_tokens=50, temperature=0.7, top_p=0.65
Output:

often associated with high blood pressure. The cause of depression is associated with a decrease in blood pressure, and may increase infections such as atrophy. The symptoms may also include: - The symptom

(Stopped at the 50-token limit, not because the model ran out of ideas)
Stats: 50 tokens · 1.06s · 47.2 t/s


Prompt: In the early 20th century
Settings: max_new_tokens=45, temperature=0.85, top_p=0.75
Output:

, the Middle Ages had become popularized by many, thought to be the most prominent and most popular world. In the midst of the 20th century, a study of the Western Pyrami

(Cut off by the token limit)
Stats: 43 tokens · 0.91s · 47.4 t/s


Prompt: In Egypt there were massive sand cones called
Settings: max_new_tokens=10, temperature=0.65, top_p=0.6
Output:

Pyramids (which

(Cut off at 10 tokens — the model correctly identified Pyramids immediately)
Stats: 10 tokens · 0.14s · 71.4 t/s


Key observations from testing:

  • The model responds best to prompts that are the beginning of a sentence or paragraph — it is a text continuer, not a question answerer. Give it a strong opening and it will follow the pattern.
  • Speed on CPU is approximately 47–71 t/s depending on prompt length and hardware.
  • Keeping max_new_tokens at 60 or below produces noticeably more coherent completions.
  • The TokenMonster tokenizer is less efficient per word than the 32K BPE vocabulary used in v1 — this is expected given the smaller vocab size and is the direct cost of the ~43.8% perplexity improvement.

Quantization

⚠️ Critical note for this preview: AutoModelForCausalLM.from_pretrained() with BitsAndBytesConfig does not work for this checkpoint due to the weight_master key issue described in the Known Loading Issue section. You must load with the custom loader first, then apply quantization afterward. The standard from_pretrained() + BitsAndBytesConfig pattern will work normally in the final Stentor2-12M release.

Despite the model already being small (~49 MB in FP32, ~25 MB in FP16), quantization can further reduce memory for extremely constrained environments.

FP16 — Recommended First Step

For GPU deployment, loading in FP16 halves memory to ~25 MB and is the simplest effective "quantization":

model, tokenizer = mod.load_stentor2(dtype=torch.float16)  # Option A
model = model.to("cuda")

Dynamic INT8 Quantization (CPU, PyTorch native — no extra install)

For CPU deployment, PyTorch's built-in dynamic quantization works after loading with the custom loader and requires no additional packages:

import torch
from huggingface_hub import hf_hub_download
import importlib.util, sys

# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

model, tokenizer = mod.load_stentor2(dtype=torch.float32)
model = model.to("cpu").eval()

# Step 2: Apply dynamic INT8 quantization (CPU only)
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
# Approximate memory: ~12 MB — 75% reduction from FP32
# Note: dynamic quantization only affects inference; model stays on CPU

Manual 8-bit via bitsandbytes (GPU)

For GPU deployment with bitsandbytes INT8, apply the conversion after loading:

import torch
import bitsandbytes as bnb
from huggingface_hub import hf_hub_download
import importlib.util, sys

# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

model, tokenizer = mod.load_stentor2(dtype=torch.float16)
model = model.to("cuda").eval()

# Step 2: Replace linear layers with INT8 equivalents
def replace_with_bnb_int8(module):
    for name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear):
            new_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=6.0,
            )
            new_layer.weight = bnb.nn.Int8Params(
                child.weight.data.cpu(),
                requires_grad=False,
            )
            if child.bias is not None:
                new_layer.bias = torch.nn.Parameter(child.bias.data)
            setattr(module, name, new_layer)
        else:
            replace_with_bnb_int8(child)

replace_with_bnb_int8(model)
model = model.to("cuda")   # moving to GPU triggers the actual INT8 quantization
# Approximate memory: ~12 MB (75% reduction from FP32 ~49 MB)

Requires: pip install bitsandbytes

Practical note: Given that FP16 is already only ~25 MB and the model runs at 47–71 t/s on CPU, aggressive quantization may not be necessary for most use cases. Dynamic INT8 is most useful when targeting microcontrollers or very constrained embedded environments.


Format Conversion

Convert to GGUF (for llama.cpp)

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Download model
huggingface-cli download StentorLabs/Stentor2-12M-Preview --local-dir stentor2-12m-preview

# Convert to GGUF (FP16)
python convert_hf_to_gguf.py stentor2-12m-preview/ \
  --outfile stentor2-12m-preview.gguf \
  --outtype f16

# Quantize to Q4_0 (optional, smallest file)
./llama-quantize stentor2-12m-preview.gguf stentor2-12m-preview-q4_0.gguf q4_0

# Run
./llama-cli -m stentor2-12m-preview-q4_0.gguf -p "The science of" -n 50

Note on GGUF + TokenMonster: The custom TokenMonster tokenizer may require manual vocabulary mapping when using llama.cpp. The standard convert_hf_to_gguf.py script expects a HuggingFace tokenizer format. You may need to convert the vocabulary to a compatible format first.

Convert to ONNX

pip install optimum[exporters]

optimum-cli export onnx \
  --model StentorLabs/Stentor2-12M-Preview \
  --task text-generation-with-past \
  stentor2-12m-onnx/

Then load the exported model:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("stentor2-12m-onnx")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor2-12M-Preview")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

Speculative Decoding

Stentor2-12M-Preview can serve as a fast draft model to accelerate inference from larger Llama-family target models.

from huggingface_hub import hf_hub_download
import importlib.util, sys, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Stentor2 as draft model using the custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

draft_model, draft_tokenizer = mod.load_stentor2(dtype=torch.float16)
draft_model = draft_model.to("cuda")

# Load target model normally
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto"
)
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

prompt  = "Explain the concept of recursion"
inputs  = target_tokenizer(prompt, return_tensors="pt").to(target_model.device)

outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    # The draft and target vocabularies differ, so universal assisted decoding
    # (transformers >= 4.46) needs both tokenizers passed explicitly:
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,
    do_sample=True,
    max_new_tokens=100
)

print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))

Important caveat: Stentor2 uses a different vocabulary (8,064-token TokenMonster) than standard Llama models (32,000-token BPE). This vocabulary mismatch means the target model's acceptance rate may be lower than it would be with a vocabulary-compatible draft model. In practice, speedups depend heavily on how similar the generated text distribution is between draft and target.

For best results with speculative decoding, a vocabulary-matched draft model is preferable. If you need a drop-in speculative draft for a standard Llama target, Stentor v1 (with its 32,768-token Mistral vocabulary) may provide better token acceptance rates despite its higher perplexity.


Bias, Risks & Limitations

Known Limitations

The following limitations were observed and confirmed through hands-on testing:

  • Prompt Relevance: Outputs are frequently off-topic for complex prompts. The model is pattern-completing, not comprehending.
  • Factual Accuracy: All factual claims from this model should be treated as unreliable. 12M parameters cannot store meaningful world knowledge.
  • Context Boundary: Hard limit of 1,024 tokens. Sequences approaching this limit may degrade in coherence.
  • Short Output Window for Coherence: Even within the 1,024-token context limit, outputs beyond ~60 tokens tend to wander off-topic or become repetitive. Keeping max_new_tokens at 60 or below is strongly recommended.
  • English Bias: The TokenMonster English vocabulary is optimized for English. Other languages will tokenize to many rare/unknown tokens and likely produce poor output.
  • Training Data Bias: Inherits biases present in FineWeb-Edu filtered web data — primarily English-language, Western-centric educational content.
  • Hallucination: Like all LLMs, this model may confidently produce plausible-sounding but entirely fabricated content.
  • No Alignment: No RLHF, no DPO, no constitutional training. Raw base model behavior.
  • Preview Status: This is not the final Stentor2 architecture. Known improvements are pending.
  • Tokenizer Efficiency: The 8K TokenMonster vocabulary produces more tokens per word than standard 32K BPE vocabularies. This is expected given the architecture tradeoff and is not a bug.

Shared Tensor Warning

When saving or reloading this model, you will see:

Removed shared tensor {'lm_head.weight'} while saving.

This is expected. The model uses tie_word_embeddings=True, meaning model.embed_tokens.weight and model.lm_head.weight point to the same tensor. The safetensors format removes the duplicate during serialization and reconstructs it on load. This is safe and produces no accuracy difference.
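A generic sketch of what the tying means (plain PyTorch, not the model's actual classes; the hidden size here is purely illustrative):

```python
import torch.nn as nn

vocab_size, hidden = 8064, 64   # vocab matches the model; hidden is illustrative

embed   = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embed.weight   # tie: both attributes alias ONE tensor

# Because only one tensor exists in memory, safetensors drops the duplicate
# name on save and re-points it at the shared tensor on load.
tied = lm_head.weight.data_ptr() == embed.weight.data_ptr()
```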

This is a separate and unrelated issue from the weight_master loading problem. See the Known Loading Issue section for that.


What's Next

This is a preview. The training run for Stentor2-12M-Preview revealed several clear paths to further improvement that have not yet been implemented. Those improvements are the focus of the next training run, and when that model is ready, it will be released as Stentor2-12M.

If you find bugs, unexpected behavior, or have benchmarks or use cases worth sharing, please open a discussion on the model repository — community input before the final release is welcome.

🚫 There will be no Stentor2-30M-Preview. This preview exists to share the architectural direction of the Stentor2 family, not to establish a preview release cadence for every size. The next public drop from StentorLabs will be the finished Stentor2 model.


Environmental Impact

Factor                     Value
Hardware                   2× NVIDIA Tesla T4 (1 active)
Active Training Duration   ~4.37 hours
Cloud Provider             Kaggle (free tier)
Compute Region             Western USA
Estimated Carbon           Minimal (< 0.5 kg CO₂e estimated)

Training on free-tier cloud compute demonstrates that meaningful SLM research is accessible to independent researchers and students without significant hardware investment or carbon cost.


Citation

If you use this model in research or a project, please cite it as follows. Note that this is a HuggingFace model card, not an arXiv paper, so there is no arXiv ID — the howpublished URL is the canonical reference.

@misc{izumoto2026stentor2_12m_preview,
  title        = {Stentor2-12M-Preview},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}},
  note         = {Preview checkpoint of the Stentor2 model family.
                  12.3M parameter LlamaForCausalLM base model trained on
                  FineWeb-Edu with a TokenMonster 8K vocabulary.
                  Apache 2.0 license.}
}

Related Work

This section compares Stentor2-12M-Preview to other publicly available models in the sub-50M parameter range, and to relevant research that informed design decisions.

Comparable Sub-50M Models

Model Parameters Perplexity Vocab Training Data Notes
Stentor2-12M-Preview (this model) 12.3M ~50.1 (FineWeb-Edu val) 8,064 FineWeb-Edu 240M tokens Base model, TokenMonster vocab
Stentor-12M (v1) 12.0M 89.01 (FineWeb-Edu val) 32,768 FineWeb-Edu + Cosmopedia 200M Baseline this model improves on
Stentor-30M (v1) 30.4M 33.02 (FineWeb-Edu val) 32,768 FineWeb-Edu + Cosmopedia 600M Larger v1 model
TinyStories-33M ~33M ~varies ~50K TinyStories (synthetic) Eldan & Li, 2023 — focused on story generation
TinyStories-1M ~1M very high ~50K TinyStories (synthetic) Demonstrates 1M param story capability
Pythia-14M 14M ~varies (Pile) 50,254 The Pile 300B tokens EleutherAI; well-studied scaling baseline
Pythia-70M 70M ~varies (Pile) 50,254 The Pile 300B tokens Closest Pythia model above this size
BabyLlama 58M ~varies ~32K TinyStories + Wikitext BabyLM challenge submission

Comparison caveats: Perplexity numbers are not directly comparable across models — different validation sets, vocabularies, and tokenizers all affect the number. The table is a rough orientation, not a rigorous benchmark. Stentor2's perplexity is measured on the FineWeb-Edu validation split using its own 8K TokenMonster tokenizer.

Key differentiators of Stentor2 vs. comparable models:

  • Vocabulary efficiency focus — The deliberate reduction to 8K tokens to maximize non-embedding parameter budget is a distinguishing design choice not seen in most small models.
  • T4-specific training recipe — The INT8 QAT + FP32 critical layer + FP32 norm combination is a novel stability recipe specifically designed for consumer-grade GPU training.
  • Educational data — Unlike TinyStories models (trained on synthetic children's stories) or Pythia (trained on the general-domain Pile), Stentor2 is trained on quality-filtered educational web text.

Related Research Papers

Paper Relevance
TinyStories — Eldan & Li, 2023 Demonstrates meaningful language generation from 1M–33M parameter models; closest comparator in scale
Pythia — Biderman et al., 2023 Systematic study of small model scaling; Pythia-14M is a well-documented baseline
Scaling Laws — Kaplan et al., 2020 Foundational work on compute-optimal training; informs token budget decisions
Chinchilla — Hoffmann et al., 2022 Revised scaling laws; 240M tokens for 12M params is approximately compute-optimal under this analysis
Model Cards — Mitchell et al., 2018 Methodology underlying this model card
RoPE — Su et al., 2021 Positional encoding used in this model
Speculative Decoding — Leviathan et al., 2023 Primary use case for a fast draft model like Stentor2
T5 — Raffel et al., 2020 Source of NFKC text normalization approach used in data pipeline

Related Resources

StentorLabs Models

Referenced Tools & Datasets


Model Card Contact

Questions, benchmarks, or feedback: StentorLabs@gmail.com or open a discussion.


Made with ❤️ by StentorLabs
Democratizing AI through accessible, efficient models

Safetensors checkpoint: 12.3M params, tensor type F32.

Evaluation results

  • Best Validation Loss on FineWeb-Edu (validation split)
    self-reported
    3.914
  • Best Perplexity (at best checkpoint) on FineWeb-Edu (validation split)
    self-reported
    50.070
  • Final Epoch Validation Loss on FineWeb-Edu (validation split)
    self-reported
    4.008
  • Final Epoch Perplexity on FineWeb-Edu (validation split)
    self-reported
    55.050