Stentor-12M


Stentor-12M is a highly compact, efficient language model built on the Llama architecture. Designed for speed and low-resource environments, this 12M parameter checkpoint utilizes a mixed-precision training pipeline and is best treated as a base next-token predictor (not a chat assistant). It does not "understand" text in a human sense and is not trained to reliably follow instructions. While the tokenizer may include special tokens/templates that resemble instruction or tool formats, the model itself is not instruction-tuned and will often generate plausible but off-topic text. It serves as an accessible entry point for researching attention mechanisms and testing training pipelines on consumer hardware.

⚠️ Important Limitations

  • Context Window: Maximum 512 tokens (very short)
  • Not Instruction-Tuned: May ignore prompts or respond off-topic
  • Stopping / EOS: Rarely emits EOS on its own; always set max_new_tokens
  • Tokenizer ≠ Capability: "tool/function" tokens do not imply real tool use
  • No Safety Tuning: Base model without RLHF or safety alignment
  • Limited Knowledge: 12M parameters = minimal world knowledge
  • Proof-of-Concept: Not suitable for production without fine-tuning
  • Educational Focus: Trained on synthetic textbooks, not diverse real-world data

Recommended generation settings (based on manual testing):

  • Max new tokens: 10–80 (highly recommended)
  • Temperature: 0.9–1.4
  • Top-p: 0.4–0.65

Real interactions (sampling is non-deterministic; your outputs may vary):

Example 1 — max_new_tokens=30, temperature=1.2, top_p=0.55
Prompt: Hello my name is kai. Who are you?
Generated: Hello my name is kai. Who are you? Is the case, that you have heard of? You may not do something about it, and what they do you know. So what is this?

Example 2 — max_new_tokens=40, temperature=1.1, top_p=0.45
Prompt: The story of my life is
Generated: The story of my life is a very important step to our understanding of the world, we have a deeper understanding of the world, and how to make a sense of what we are doing. This is an opportunity to explore and create

Example 3 — max_new_tokens=30, temperature=1.0, top_p=0.55
Prompt: Phycology is the understanding of
Generated: Phycology is the understanding of the physical sciences, the evolution of the language and the brain. The evolution of the human body, the evolution of the human brain, and the nature

🚀 Quick Start

Get up and running in 3 simple steps:

1. Install

pip install transformers torch

2. Load & Generate

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-12M")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-12M")

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,  # always set this; the model may not stop on its own
    do_sample=True,
    temperature=1.1,
    top_p=0.55,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

3. Explore!

  • Try different prompts
  • Adjust max_new_tokens, temperature, and top_p

Model Details

Model Description

Stentor-12M is a lightweight LlamaForCausalLM model designed to bring the architectural benefits of Llama to a fraction of the size. With a hidden size of 192 and a tiny parameter budget, this model is optimized for rapid inference and edge-deployment scenarios where memory is at a premium.

The tokenizer configuration may include control tokens commonly used in instruction/tool-call formatting (for experimentation), but these tokens do not make the base model instruction-following or tool-using. If you need reliable instruction following or structured tool calls, you will need additional fine-tuning / alignment.

  • Developed by: Kai Izumoto (StentorLabs)
  • Funded by: Self-funded
  • Shared by: StentorLabs
  • Model type: LlamaForCausalLM (Auto-regressive Language Model)
  • Language(s): English
  • License: Apache-2.0
  • Finetuned from model: None (Base model trained from scratch)

Uses

Direct Use

  • Low-Latency Text Generation: Due to its small size (approx. 12M parameters), Stentor-12M is suitable for real-time applications on CPU or mobile devices.
  • Instruction-Style Prompting (Limited): You can format prompts using tags like [INST], but the model is not instruction-tuned and will often fail to follow the request.
  • Tool-Call Formatting Tokens (Limited): The tokenizer may include tool-related tokens, but the model is not trained to reliably emit valid tool calls/JSON or to "use tools".
  • Edge Deployment: Ideal for resource-constrained environments including mobile devices, IoT, and embedded systems.

Downstream Use

  • Speculative Decoding (Experimental): Stentor-12M can be used as a fast draft model for larger Llama-based models, but speedups depend on how often the larger model accepts the draft tokens (quality limits may reduce gains).
  • Educational/Research: A perfect "petri dish" model for studying attention mechanics (3 attention heads) and training dynamics without requiring massive compute.
  • Prototyping: Quick, low-cost experiments focused on latency, sampling behavior, and failure modes before scaling up.

Out-of-Scope Use

  • Complex Reasoning: As a 12M parameter model, users should not expect high-level reasoning or deep knowledge retrieval comparable to multi-billion parameter models.
  • Instruction-Following Chatbots: This is a base model and is not reliably conversational or on-task.
  • Long Context: The model is optimized for short-context tasks with a maximum position embedding of 512 tokens.
  • Production-Critical Applications: This is a research/proof-of-concept model and should not be used for mission-critical applications without thorough testing.

Bias, Risks, and Limitations

  • Context Window: The model has a hard limit of 512 tokens for context length.
  • Prompt Relevance: Outputs are often generic or unrelated to the prompt, even when they sound fluent.
  • Knowledge Base: Limited parameter count restricts the amount of world knowledge the model can store.
  • Training Data Bias: The model inherits any biases present in the FineWeb-Edu and Cosmopedia v2 datasets.
  • Hallucinations: Like all language models, Stentor-12M may generate plausible-sounding but factually incorrect information.
  • No Safety Tuning: This is a base model without safety alignment or RLHF.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. This model is best used for specific, narrow tasks or as a component in a larger system (e.g., speculative decoding) rather than a general-purpose assistant.

How to Get Started with the Model

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "StentorLabs/Stentor-12M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The repo may provide a chat template, but this is still a base model.
# Do not expect reliable instruction following just because you use chat formatting.
messages = [
    {"role": "user", "content": "Hello, what are you?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
)
outputs = model.generate(
    inputs,
    max_new_tokens=50,  # always set this; the model may not stop on its own
    do_sample=True,
    temperature=1.1,
    top_p=0.55,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Usage with Tool-Call Formatting (Educational)

# The tokenizer may include tokens that resemble tool/function calling formats.
# The base model is not trained to reliably emit valid tool calls or structured JSON.
messages = [
    {"role": "system", "content": "You are a tiny base language model. You do not have tool access."},
    {"role": "user", "content": "What's the weather like?"}
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Detailed Use Cases

1. Speculative Decoding with Llama 3

Potentially speed up larger model inference by using Stentor-12M as a draft model (results vary):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load draft model (Stentor-12M)
draft_model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-12M")
draft_tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-12M")

# Load target model
target_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Use speculative decoding. Stentor-12M and Llama 3.2 do not share a vocabulary,
# so this requires universal assisted generation (a recent Transformers version
# that accepts `assistant_model` together with `assistant_tokenizer`).
prompt = "Explain machine learning"
inputs = target_tokenizer(prompt, return_tensors="pt")

outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,          # Stentor-12M as draft
    tokenizer=target_tokenizer,           # required when vocabularies differ
    assistant_tokenizer=draft_tokenizer,  # required when vocabularies differ
    do_sample=True,
    max_new_tokens=100
)

print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))

2. Edge Deployment with ONNX

Convert to ONNX for mobile/edge deployment:

# Install dependencies
pip install optimum[exporters]

# Export to ONNX
optimum-cli export onnx \
  --model StentorLabs/Stentor-12M \
  --task text-generation-with-past \
  stentor-12m-onnx/

# Use with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("stentor-12m-onnx")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-12M")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))

3. Rapid Prototyping

Quick experimentation before scaling:

# These "tasks" are intentionally broad: this tiny base model will often fail.
# The point is to observe latency, failure modes, and sampling behavior.
from transformers import pipeline

generator = pipeline("text-generation", model="StentorLabs/Stentor-12M")

test_prompts = [
    "Summarize this: [long text]",
    "Translate to French: Hello",
    "Answer: What is 2+2?"
]

for prompt in test_prompts:
    result = generator(prompt, max_new_tokens=30)[0]['generated_text']
    print(f"Prompt: {prompt}\nResult: {result}\n")

Quantization Options

Reduce memory footprint even further with quantization:

8-bit Quantization

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor-12M",
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: ~12 MB (75% reduction)

4-bit Quantization

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor-12M",
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: ~6 MB (87% reduction)

Note: Requires the bitsandbytes library (pip install bitsandbytes); a CUDA GPU is typically required.
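The memory figures above can be sanity-checked from the parameter count. This is a rough sketch that ignores activation memory and quantization metadata overhead:

```python
# Rough weight-memory footprint of 12,047,040 parameters at different precisions.
# Ignores activations, optimizer state, and quantization metadata overhead.
PARAMS = 12_047_040

def model_size_mib(bits_per_param: float) -> float:
    """Approximate weight memory in MiB for a given precision."""
    return PARAMS * bits_per_param / 8 / (1024 ** 2)

fp32 = model_size_mib(32)   # ~46 MiB full precision
int8 = model_size_mib(8)    # ~11.5 MiB -> the "~12 MB" figure above
int4 = model_size_mib(4)    # ~5.7 MiB -> the "~6 MB" figure above

print(f"fp32: {fp32:.1f} MiB, int8: {int8:.1f} MiB, int4: {int4:.1f} MiB")
```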

Model Format Conversions

Convert to GGUF (for llama.cpp)

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install dependencies
pip install -r requirements.txt

# Download model
huggingface-cli download StentorLabs/Stentor-12M --local-dir stentor-12m

# Convert to GGUF
python convert_hf_to_gguf.py stentor-12m/ \
  --outfile stentor-12m.gguf \
  --outtype f16

# Quantize (optional)
./llama-quantize stentor-12m.gguf stentor-12m-q4_0.gguf q4_0

# Run with llama.cpp
./llama-cli -m stentor-12m-q4_0.gguf -p "Hello world" -n 50

Convert to ONNX

# Install optimum
pip install optimum[exporters]

# Export to ONNX
optimum-cli export onnx \
  --model StentorLabs/Stentor-12M \
  --task text-generation-with-past \
  stentor-12m-onnx/

# Use with ONNX Runtime (C++/Python/JS)
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("stentor-12m-onnx")

Convert to TensorFlow Lite (Mobile)

Note: tf2onnx converts TensorFlow to ONNX, not the reverse, so it cannot turn the ONNX export into TFLite. One option is the community onnx2tf tool, which converts ONNX to a TensorFlow SavedModel and emits .tflite files (exact flags and output layout may vary by version):

# Install dependencies
pip install tensorflow onnx2tf

# First convert to ONNX (see above), then:
onnx2tf -i stentor-12m-onnx/model.onnx -o stentor-12m-tflite/
# stentor-12m-tflite/ should contain a SavedModel plus float32/float16 .tflite files

Use cases:

  • GGUF: C++ applications, maximum performance
  • ONNX: Cross-platform (Windows/Linux/Mac/Web)
  • TFLite: Android/iOS mobile apps

Training Details

Training Data

The model was trained on a high-quality mixed dataset focused on educational content and synthetic textbook data:

Total tokens processed: 200,015,872

Training Procedure

The model was trained using a custom script in a Kaggle Jupyter environment, demonstrating the accessibility of training efficient models on free-tier compute.

Preprocessing

The training pipeline utilized lightweight but effective preprocessing steps:

  • Cleaning: Unicode normalization (NFKC) and whitespace stripping/normalization.
  • Formatting: Optional wrapping for chat formats or <think> tokens.
  • Packing: Sequence packing into fixed block_size chunks to maximize training efficiency.
  • Tokenization: Standard Llama tokenization with EOS tokens appended.
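As a rough illustration, the cleaning and packing steps might look like the following. This is a minimal sketch over plain token-id lists; `EOS_ID` and the toy documents are placeholders, not the actual pipeline:

```python
import unicodedata

EOS_ID = 2          # placeholder EOS token id, not the real tokenizer's
BLOCK_SIZE = 512    # matches the model's 512-token context

def clean(text: str) -> str:
    """NFKC-normalize and collapse whitespace, as in the preprocessing steps."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def pack(docs: list[list[int]], block_size: int = BLOCK_SIZE) -> list[list[int]]:
    """Append EOS to each tokenized doc, concatenate into one stream,
    and cut into fixed-size blocks; any trailing partial block is dropped."""
    stream: list[int] = []
    for ids in docs:
        stream.extend(ids + [EOS_ID])
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

# Toy demo with fake token ids and a tiny block size
docs = [[5, 6, 7], [8, 9]]
blocks = pack(docs, block_size=4)
print(clean("Hello\u00a0  world"))  # non-breaking space normalized away
print(blocks)
```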

Training Hyperparameters

  • Precision: fp16 mixed precision
  • Optimizer: AdamW
  • Scheduler: Cosine
  • Learning Rate: 0.0008
  • Weight Decay: 0.01
  • Warmup Ratio: 0.02
  • Stable Ratio: 0.8
  • Total Batch Size: 256
  • Max Train Steps: 1,526
  • Evaluation Steps: 100
  • Gradient Accumulation: Enabled
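These hyperparameters also pin down the training token budget: a quick arithmetic check, using only figures from this card, reproduces the total reported under Training Data:

```python
# Sanity check: total tokens = batch size x sequence length x training steps
batch_size = 256      # total batch size
seq_len = 512         # sequence length
steps = 1526          # max train steps

total_tokens = batch_size * seq_len * steps
print(total_tokens)   # matches the reported 200,015,872 tokens exactly
```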

Speeds, Sizes, Times

  • Training Time: 4,698.6 seconds (~1.3 hours)
  • Hardware: 2x Tesla T4 GPUs (Kaggle)
  • Vocab Size: 32,768 (padded to multiple of 128)
  • Sequence Length: 512 tokens
  • Tokens per Second (avg): ~43,000 TPS
  • Total Parameters: 12,047,040
  • Embedding Parameters: 6,291,456 (52.2% of total)

Note: A significant portion of parameters are allocated to embeddings due to the 32K vocabulary size. For future iterations, a smaller vocabulary (8K-16K) could free up capacity for additional model layers.
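The statistics above cross-check with a few lines of arithmetic, again using only figures reported on this card:

```python
# Cross-check the reported figures in "Speeds, Sizes, Times".
total_params = 12_047_040
embed_params = 32_768 * 192        # vocab size x hidden size (tied embeddings)
tokens = 200_015_872
train_seconds = 4_698.6

print(embed_params)                          # reported embedding parameter count
print(f"{embed_params / total_params:.1%}")  # -> 52.2% of total, as stated
print(f"{tokens / train_seconds:,.0f} tokens/sec")  # ~42,600, i.e. the ~43k TPS figure
```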

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was performed on a held-out validation split of the mixed FineWeb-Edu and Cosmopedia dataset.

Metrics

  • Validation Loss: Measures how well the model predicts the next token (lower is better).
  • Perplexity (PPL): The exponential of the loss, indicating how "surprised" the model is by new text (lower is better).
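The two metrics are directly related: perplexity is the exponential of the mean cross-entropy loss, which a one-line check confirms:

```python
import math

# Perplexity is the exponential of the (mean) cross-entropy loss.
val_loss = 4.4887
perplexity = math.exp(val_loss)
print(f"{perplexity:.2f}")  # ~89.0, matching the reported 89.01
```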

Results

Training Loss Curve

  • Validation Loss: 4.4887
  • Perplexity: 89.01

Training Progress

Loss fell sharply early in training and then plateaued:

  • Initial loss (step 25): 8.8138
  • Mid-training loss (step 750): 4.2778
  • Final validation loss (step 1526): 4.4887
  • Overall reduction from the initial loss: ~49%

Note: As a 12M parameter base model trained for under 2 hours, these metrics represent a functional proof-of-concept baseline. The model has not been evaluated on external benchmarks such as MMLU or GSM8K.

Technical Specifications

Model Architecture and Objective


Stentor-12M utilizes the Llama architecture with the following specific configuration:

  • Hidden Size: 192
  • Intermediate Size: 576
  • Num Hidden Layers: 9
  • Attention Heads: 3
  • Key/Value Heads: 3
  • Hidden Activation: SiLU
  • RoPE Theta: 10000.0
  • Max Position Embeddings: 512
  • Vocab Size: 32,768
  • Tie Word Embeddings: True

Architecture Note: The number of layers was reduced from the initially planned 12 to 9 to maintain parameter count within the target range while accommodating the large vocabulary size.

Compute Infrastructure

The model was trained using standard cloud infrastructure available to researchers and students.

Hardware

  • GPUs: 2x NVIDIA Tesla T4 (16GB each)
  • Platform: Kaggle Notebooks (free tier)
  • Compute Type: Cloud-based

Software

  • Transformers Version: 4.57.1
  • PyTorch Version: Latest stable
  • Torch Compile: False (disabled for notebook stability)
  • Accelerate: Enabled for multi-GPU training

Environmental Impact

  • Hardware Type: 2x NVIDIA Tesla T4
  • Hours used: ~1.3 hours
  • Cloud Provider: Kaggle
  • Compute Region: Unknown
  • Carbon Emitted: Minimal due to short training time

Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.


Citation

@misc{izumoto2026stentor12m,
      title={Stentor-12M: A Compact Llama-based Language Model},
      author={Kai Izumoto},
      year={2026},
      publisher={StentorLabs},
      howpublished={\url{https://huggingface.co/StentorLabs/Stentor-12M}}
}

Glossary

  • NLP (Natural Language Processing): The field of AI focused on the interaction between computers and human language.
  • PPL (Perplexity): A measurement of how well a probability model predicts a sample. Lower is generally better.
  • Speculative Decoding: A technique where a small "draft" model (like Stentor-12M) quickly generates tokens that are then verified by a larger model, speeding up the overall process.
  • SLM (Small Language Model): Language models with parameters typically under 1B, designed for efficiency and specific tasks.
  • RoPE (Rotary Position Embedding): A method for encoding position information in transformer models.
  • Edge Deployment: Running models on resource-constrained devices like mobile phones or IoT devices.

Model Card Contact

For questions, please contact StentorLabs@gmail.com or open an issue on the model repository.

Acknowledgments

Special thanks to:

  • Hugging Face for the transformers library and dataset hosting
  • The creators of FineWeb-Edu and Cosmopedia v2 datasets
  • Kaggle for providing free GPU compute resources
  • The open-source community for making accessible AI research possible


Made with ❤️ by StentorLabs
Democratizing AI through accessible, efficient models
