Text Generation
Transformers
Safetensors
English
llama
small-language-model
efficient
edge-deployment
speculative-decoding
tiny-model
12m-parameters
kaggle-trained
educational
research
low-resource
cpu-inference
mobile-deployment
preview
stentor2
tokenmonster
Eval Results (legacy)
text-generation-inference
Update README.md
README.md
CHANGED
@@ -32,8 +32,8 @@ widget:
     example_title: "Toy Explanation (Often Wrong)"
   - text: "def fibonacci(n):"
     example_title: "Code Continuation"
-  - text: "
-    example_title: "
 model_card_authors:
 - StentorLabs
 model-index:
@@ -68,6 +68,9 @@ model-index:
 
 
 [](https://huggingface.co/StentorLabs)
 
 > ⚠️ **This is a preview release.** Stentor2-12M-Preview is an early taste of the Stentor2 family – a substantially redesigned architecture over Stentor v1. Further improvements have already been identified and a refined final release is actively in progress. This checkpoint is **not** the ceiling of what Stentor2 will be.
 >
@@ -94,16 +97,19 @@ model-index:
 13. [Weight Initialization](#weight-initialization)
 14. [Evaluation & Results](#evaluation--results)
 15. [Training Dynamics](#training-dynamics)
-16. [Use Cases](#use-cases)
-17. [
-18. [
-19. [
-20. [
-21. [
-22. [
-23. [
-24. [
-25. [
 
 ---
 
@@ -751,7 +757,11 @@ def clean_text(text: str) -> str:
     return text
 ```
 
-
 
 ### Tokenization
 
@@ -820,6 +830,14 @@ This initialization is applied **before** the T4 recipe is applied. The T4 recip
 
 ## Evaluation & Results
 
 ### Metrics
 
 - **Validation Loss:** Cross-entropy loss over the held-out validation split (lower = better)
@@ -862,37 +880,67 @@ The training run proceeded for a single epoch over 14,649 optimizer steps, consu
 
 ---
 
-## Use Cases
-
-- Studying transformer training dynamics at accessible compute cost
-- Investigating attention head behavior at scale (4 heads, 12 layers)
-- Tokenization efficiency experiments (comparing 8K vs 32K vocab at fixed params)
-- Testing training pipeline components on a real-but-cheap model
-- Teaching material for LLM courses
-
-**
-- Benchmarking inference latency on CPU / mobile
-- Testing ONNX, TFLite, GGUF conversion pipelines
-- Validating quantization toolchains before scaling up
-
-- Draft model for larger Llama-family models
-- Acceptance rate experiments under vocabulary mismatch conditions
-
-**
-
-###
-
-- Tasks requiring factual accuracy or reliable reasoning
-- Long-context documents (>1,024 tokens)
-- Non-English text
-- Any safety-critical application
 
 ---
 
@@ -999,43 +1047,97 @@ These are actual unedited outputs from the model. All examples use the custom lo
 
 ## Quantization
 
-###
 
 ```python
-# Approximate memory: ~12 MB
 ```
 
-###
 
 ```python
 ```
 
 Requires: `pip install bitsandbytes`
 
 ---
 
 ## Format Conversion
@@ -1193,22 +1295,70 @@ Training on free-tier cloud compute demonstrates that meaningful SLM research is
 
 ## Citation
 
 ```bibtex
 @misc{izumoto2026stentor2_12m_preview,
-  title
-  author
-  year
-  publisher
-  howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}}
 }
 ```
 
 ---
 
 ## Related Resources
 
 ### StentorLabs Models
-- [Stentor-
 - [StentorLabs Collection](https://huggingface.co/StentorLabs) – All models from StentorLabs
 
 ### Referenced Tools & Datasets
@@ -1216,15 +1366,7 @@ Training on free-tier cloud compute demonstrates that meaningful SLM research is
 - [TokenMonster](https://huggingface.co/alasdairforsythe/tokenmonster) – Tokenizer vocabulary
 - [HuggingFace Accelerate](https://github.com/huggingface/accelerate) – Training framework
 - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) – Quantization library
-
-### Related Models (Comparable Scale)
-- [SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) – Larger, highly capable SLM
-- [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) – Larger alternative
-
-### Research Papers
-- [Speculative Decoding](https://arxiv.org/abs/2211.17192) – Leviathan et al., 2023
-- [Model Card Methodology](https://arxiv.org/abs/1810.03993) – Mitchell et al., 2018
-- [RoPE Positional Embeddings](https://arxiv.org/abs/2104.09864) – Su et al., 2021
 
 ---
 
    example_title: "Toy Explanation (Often Wrong)"
  - text: "def fibonacci(n):"
    example_title: "Code Continuation"
  - text: "The laws of thermodynamics describe"
    example_title: "Science Continuation"
model_card_authors:
- StentorLabs
model-index:


[](https://huggingface.co/StentorLabs)


> 🔬 **Research Artifact – Not a Production Model.** This is an early preview checkpoint released for research, experimentation, and community feedback. It is not suitable for deployment in any user-facing application. See [Intended Uses](#use-cases--intended-uses) for details.

> ⚠️ **This is a preview release.** Stentor2-12M-Preview is an early taste of the Stentor2 family – a substantially redesigned architecture over Stentor v1. Further improvements have already been identified and a refined final release is actively in progress. This checkpoint is **not** the ceiling of what Stentor2 will be.
>
13. [Weight Initialization](#weight-initialization)
14. [Evaluation & Results](#evaluation--results)
15. [Training Dynamics](#training-dynamics)
16. [Use Cases & Intended Uses](#use-cases--intended-uses)
17. [Out-of-Scope Uses](#out-of-scope-uses)
18. [Ethical Considerations & Societal Impact](#ethical-considerations--societal-impact)
19. [Inference Guide](#inference-guide)
20. [Real Model Responses](#real-model-responses)
21. [Quantization](#quantization)
22. [Format Conversion](#format-conversion)
23. [Speculative Decoding](#speculative-decoding)
24. [Bias, Risks & Limitations](#bias-risks--limitations)
25. [Related Work](#related-work)
26. [What's Next](#whats-next)
27. [Environmental Impact](#environmental-impact)
28. [Citation](#citation)

---

    return text
```

**Why these specific steps:**

- **NFKC normalization** maps visually equivalent Unicode characters to a single canonical form (e.g., full-width `Ａ` → `A`, ligature `ﬁ` → `fi`, superscript `²` → `2`). This is the standard choice for LLM preprocessing – used in T5 (Raffel et al., 2020, [arXiv:1910.10683](https://arxiv.org/abs/1910.10683)), BERT (Devlin et al., 2019, [arXiv:1810.04805](https://arxiv.org/abs/1810.04805)), and the Unicode standard itself (Unicode Technical Report #15). Without it, the model would see dozens of token IDs for what is semantically one character.

- **Whitespace collapse** (join lines, collapse spaces) ensures consistent tokenization of the same content regardless of how it was originally formatted. Web-scraped text commonly contains inconsistent line breaks, multiple spaces, and mixed newline styles. This is also standard practice in GPT-style pretraining pipelines. No ablation was performed on this step – it was adopted from established practice rather than experimentally derived.
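As a minimal sketch of these two steps using Python's stdlib `unicodedata` and `re` (the name `clean_text_sketch` is illustrative and not the card's actual `clean_text`, which may differ in details):

```python
import re
import unicodedata

def clean_text_sketch(text: str) -> str:
    # NFKC folds compatibility characters to canonical forms:
    # full-width "Ａ" -> "A", ligature "ﬁ" -> "fi", superscript "²" -> "2"
    text = unicodedata.normalize("NFKC", text)
    # Collapse newlines and runs of spaces to single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_sketch("Ａﬁne  \n text²"))
```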

### Tokenization

## Evaluation & Results

### Training Curves

The charts below show validation loss and perplexity over the course of the training run. Both are plotted against optimizer steps. The best checkpoint (step 11,625) is visible as the lowest point before the slight uptick in the tail phase.




### Metrics

- **Validation Loss:** Cross-entropy loss over the held-out validation split (lower = better)
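As a quick sanity check on how these two metrics relate: perplexity is the exponential of the cross-entropy loss, so the ~50.1 validation perplexity reported elsewhere in this card corresponds to a loss of about 3.91 nats:

```python
import math

# Perplexity = exp(cross-entropy loss), so a validation perplexity of ~50.1
# corresponds to a cross-entropy loss of ln(50.1) ≈ 3.91 nats.
loss = math.log(50.1)
print(round(loss, 2))            # 3.91
print(round(math.exp(loss), 1))  # 50.1
```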

---

## Use Cases & Intended Uses

> 🔬 **Reminder:** This is a **research artifact**. It is a base language model with no safety tuning, no instruction following, and no factual grounding. Every intended use below assumes a researcher or developer context, not an end user.

### Intended Uses

| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics | ✅ High | Small enough to train/fine-tune on free compute |
| Tokenization efficiency research | ✅ High | 8K vs 32K vocab tradeoff is directly observable |
| Speculative decoding experiments | ✅ High | Fast enough to serve as a draft model |
| Benchmarking CPU/edge inference latency | ✅ High | ~25 MB in FP16, runs on any hardware |
| Testing quantization/conversion pipelines | ✅ High | GGUF, ONNX, INT8 pipeline validation |
| Teaching material for LLM courses | ✅ High | Architecture is simple enough to trace by hand |
| LoRA / QLoRA fine-tuning experiments | ⚠️ Moderate | Base model only; start from scratch for any task |
| Text continuation / creative prompting | ⚠️ Moderate | Works best on short completions ≤60 tokens |
| Domain-specific fine-tuning research | ⚠️ Moderate | Small enough to iterate rapidly |
| Factual Q&A | ❌ Not suitable | Model has no reliable world knowledge |
| Production deployment | ❌ Not suitable | No safety tuning; preview quality only |
| Non-English text | ❌ Not suitable | TokenMonster vocab is English-only |
| Long-document tasks (>512 tokens of coherent output) | ❌ Not suitable | Coherence degrades quickly |

---

## Out-of-Scope Uses

The following uses are explicitly out of scope and should not be attempted:

- **User-facing applications of any kind** – This model has no safety filtering, no alignment, and no factual reliability. Deploying it in a context where a real user receives its output without expert review is inappropriate regardless of the domain.
- **Medical, legal, or financial advice** – Even if prompted carefully, 12M parameters cannot store or reason over specialized knowledge reliably. All outputs should be treated as potentially wrong.
- **Generating content about real people** – The model has no awareness of who real people are or what they have said/done. Outputs mentioning real people are likely to be fabricated.
- **Automated content pipelines** – Do not use this model to generate content at scale without human review. The output quality and coherence are not sufficient for unreviewed publication.
- **Non-English use** – The 8,064-token TokenMonster vocabulary is built exclusively for English. Prompts in other languages will be tokenized very poorly and outputs will be unreliable.
- **Instruction following** – This is a base model. It does not reliably follow instructions, answer questions, or complete structured tasks. Prompting it as if it were a chat assistant will not work.

---

## Ethical Considerations & Societal Impact

### Inherited Data Biases

Stentor2-12M-Preview was trained on FineWeb-Edu, a filtered subset of Common Crawl. Despite quality filtering, this data inherits the biases present in English-language web text:

- **Western-centric perspective** – Educational content on the web skews heavily toward Western, primarily American and European, viewpoints and examples.
- **English monolingualism** – The training data and vocabulary are both English-only. The model has no meaningful capability in other languages.
- **Demographic underrepresentation** – Groups that are underrepresented in English-language educational web content will be underrepresented in the model's outputs.
- **Temporal cutoff** – FineWeb-Edu's data has a cutoff; the model has no knowledge of recent events.

### No Safety Tuning

This model has received **no safety training of any kind** – no RLHF, no DPO, no constitutional AI, no content filtering. It is a raw base model that predicts the next token based on statistical patterns. It should not be used in any context where harmful outputs would cause real-world harm.

### Positive Societal Aspects

- **Democratizing AI research** – Trained entirely on free-tier Kaggle compute, this model demonstrates that meaningful LLM research does not require significant financial resources. Students and independent researchers can reproduce, study, and build on this work.
- **Transparency** – Full training hyperparameters, architecture details, and training script are published. This is a contribution to reproducible ML research.
- **Minimal environmental footprint** – ~4.4 hours of single-GPU compute. Estimated carbon footprint under 0.5 kg CO₂e.
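A back-of-envelope check of that footprint estimate. The power draw (~70 W, typical of a T4-class GPU) and grid intensity (0.4 kg CO₂e/kWh) are illustrative assumptions, not figures from the card:

```python
# Assumptions (not from the card): ~70 W single-GPU draw, 0.4 kg CO2e per kWh.
hours, gpu_watts, kg_co2_per_kwh = 4.4, 70, 0.4
kwh = hours * gpu_watts / 1000
kg_co2 = kwh * kg_co2_per_kwh
print(round(kwh, 3), round(kg_co2, 3))  # 0.308 0.123
```

Even with generous margins for CPU and cooling overhead, the total stays comfortably under the 0.5 kg CO₂e figure quoted above.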

### Responsible Use Reminder

If you use this model in research, please document clearly that it is an unaligned base model and include appropriate caveats when reporting results. Do not present outputs from this model as factual without verification.

---

## Quantization

> ⚠️ **Critical note for this preview:** `AutoModelForCausalLM.from_pretrained()` with `BitsAndBytesConfig` does **not** work for this checkpoint due to the `weight_master` key issue described in the [Known Loading Issue](#known-loading-issue--please-read) section. You must load with the custom loader first, then apply quantization afterward. The standard `from_pretrained()` + `BitsAndBytesConfig` pattern will work normally in the final Stentor2-12M release.

Despite the model already being small (~49 MB in FP32, ~25 MB in FP16), quantization can further reduce memory for extremely constrained environments.
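These size figures follow directly from the parameter count – bytes per parameter at each precision:

```python
# 12.3M parameters times bytes-per-parameter at each precision.
params = 12.3e6
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {params * nbytes / 1e6:.1f} MB")
# FP32: 49.2 MB
# FP16: 24.6 MB
# INT8: 12.3 MB
```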

### FP16 – Recommended First Step

For GPU deployment, loading in FP16 halves memory to ~25 MB and is the simplest effective "quantization":

```python
# "mod" is the custom loader module; see the loading steps in the INT8 example below
model, tokenizer = mod.load_stentor2(dtype=torch.float16)
model = model.to("cuda")
```

### Dynamic INT8 Quantization (CPU, PyTorch native – no extra install)

For CPU deployment, PyTorch's built-in dynamic quantization works after loading with the custom loader and requires no additional packages:

```python
import torch
from huggingface_hub import hf_hub_download
import importlib.util, sys

# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

model, tokenizer = mod.load_stentor2(dtype=torch.float32)
model = model.to("cpu").eval()

# Step 2: Apply dynamic INT8 quantization (CPU only)
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
# Approximate memory: ~12 MB – 75% reduction from FP32
# Note: dynamic quantization only affects inference; model stays on CPU
```

### Manual 8-bit via bitsandbytes (GPU)

For GPU deployment with bitsandbytes INT8, apply the conversion after loading:

```python
import torch
import bitsandbytes as bnb
from huggingface_hub import hf_hub_download
import importlib.util, sys

# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

model, tokenizer = mod.load_stentor2(dtype=torch.float16)
model = model.to("cuda").eval()

# Step 2: Replace linear layers with INT8 equivalents
def replace_with_bnb_int8(module):
    for name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear):
            new_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=6.0,
            )
            new_layer.weight = bnb.nn.Int8Params(
                child.weight.data.cpu(),
                requires_grad=False,
            )
            if child.bias is not None:
                new_layer.bias = torch.nn.Parameter(child.bias.data)
            setattr(module, name, new_layer)
        else:
            replace_with_bnb_int8(child)

replace_with_bnb_int8(model)
model = model.to("cuda")  # moving to CUDA triggers the actual INT8 quantization
# Approximate memory: ~12 MB (75% reduction from FP32 ~49 MB)
```

Requires: `pip install bitsandbytes`

> **Practical note:** Given that FP16 is already only ~25 MB and the model runs at 47–71 t/s on CPU, aggressive quantization may not be necessary for most use cases. Dynamic INT8 is most useful when targeting microcontrollers or very constrained embedded environments.

---

## Format Conversion

## Citation

If you use this model in research or a project, please cite it as follows. Note that this is a HuggingFace model card, not an arXiv paper, so there is no arXiv ID – the `howpublished` URL is the canonical reference.

```bibtex
@misc{izumoto2026stentor2_12m_preview,
  title        = {Stentor2-12M-Preview},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}},
  note         = {Preview checkpoint of the Stentor2 model family.
                  12.3M parameter LlamaForCausalLM base model trained on
                  FineWeb-Edu with a TokenMonster 8K vocabulary.
                  Apache 2.0 license.}
}
```

---

## Related Work

This section compares Stentor2-12M-Preview to other publicly available models in the sub-50M parameter range, and to relevant research that informed design decisions.

### Comparable Sub-50M Models

| Model | Parameters | Perplexity | Vocab | Training Data | Notes |
|---|---|---|---|---|---|
| **Stentor2-12M-Preview** (this model) | 12.3M | ~50.1 (FineWeb-Edu val) | 8,064 | FineWeb-Edu 240M tokens | Base model, TokenMonster vocab |
| Stentor-12M (v1) | 12.0M | 89.01 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 200M | Baseline this model improves on |
| Stentor-30M (v1) | 30.4M | 33.02 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 600M | Larger v1 model |
| TinyStories-33M | ~33M | varies | ~50K | TinyStories (synthetic) | Eldan & Li, 2023 – focused on story generation |
| TinyStories-1M | ~1M | very high | ~50K | TinyStories (synthetic) | Demonstrates 1M param story capability |
| Pythia-14M | 14M | varies (Pile) | 50,254 | The Pile 300B tokens | EleutherAI; well-studied scaling baseline |
| Pythia-70M | 70M | varies (Pile) | 50,254 | The Pile 300B tokens | Closest Pythia model above this size |
| BabyLlama | 58M | varies | ~32K | TinyStories + Wikitext | BabyLM challenge submission |

> **Comparison caveats:** Perplexity numbers are not directly comparable across models – different validation sets, vocabularies, and tokenizers all affect the number. The table is a rough orientation, not a rigorous benchmark. Stentor2's perplexity is measured on the FineWeb-Edu validation split using its own 8K TokenMonster tokenizer.

**Key differentiators of Stentor2 vs. comparable models:**

- **Vocabulary efficiency focus** – The deliberate reduction to 8K tokens to maximize non-embedding parameter budget is a distinguishing design choice not seen in most small models.
- **T4-specific training recipe** – The INT8 QAT + FP32 critical layer + FP32 norm combination is a novel stability recipe specifically designed for consumer-grade GPU training.
- **Educational data** – Unlike TinyStories models (trained on synthetic children's stories) or Pythia (trained on the general-domain Pile), Stentor2 is trained on quality-filtered educational web text.
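To see why vocabulary size matters so much at this scale, compare embedding-table sizes. This is a sketch with an assumed hidden size of 256, which is not stated in this excerpt – the exact numbers would differ, but the ratio would not:

```python
# Illustrative only: d_model = 256 is an ASSUMED hidden size, not from the card.
# The point is how much of a ~12M parameter budget the embedding table consumes.
d_model = 256
for vocab in (8_064, 32_768):
    emb_params = vocab * d_model  # input embedding matrix alone
    print(f"vocab {vocab:>6}: {emb_params / 1e6:.2f}M embedding parameters")
```

At a fixed ~12M total, the 32K table would absorb roughly four times more of the budget than the 8K table, leaving correspondingly fewer parameters for attention and MLP weights.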

### Related Research Papers

| Paper | Relevance |
|---|---|
| [TinyStories](https://arxiv.org/abs/2305.07759) – Eldan & Li, 2023 | Demonstrates meaningful language generation from 1M–33M parameter models; closest comparator in scale |
| [Pythia](https://arxiv.org/abs/2304.01373) – Biderman et al., 2023 | Systematic study of small model scaling; Pythia-14M is a well-documented baseline |
| [Scaling Laws](https://arxiv.org/abs/2001.08361) – Kaplan et al., 2020 | Foundational work on compute-optimal training; informs token budget decisions |
| [Chinchilla](https://arxiv.org/abs/2203.15556) – Hoffmann et al., 2022 | Revised scaling laws; 240M tokens for 12M params is approximately compute-optimal under this analysis |
| [Model Cards](https://arxiv.org/abs/1810.03993) – Mitchell et al., 2018 | Methodology underlying this model card |
| [RoPE](https://arxiv.org/abs/2104.09864) – Su et al., 2021 | Positional encoding used in this model |
| [Speculative Decoding](https://arxiv.org/abs/2211.17192) – Leviathan et al., 2023 | Primary use case for a fast draft model like Stentor2 |
| [T5](https://arxiv.org/abs/1910.10683) – Raffel et al., 2020 | Source of NFKC text normalization approach used in data pipeline |
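Since speculative decoding is the headline use case for a draft model this fast, here is the core accept/reject rule from the Leviathan et al. paper in miniature. This is a standalone sketch of the rule itself; real acceptance-rate experiments would compare the draft and target models' full next-token distributions:

```python
# Leviathan et al. (2023): a token drafted with probability q under the draft
# model and probability p under the target model is accepted with min(1, p/q).
def acceptance(p_target: float, q_draft: float) -> float:
    return min(1.0, p_target / q_draft)

# Synthetic (target, draft) probability pairs for a few drafted tokens
pairs = [(0.30, 0.25), (0.10, 0.40), (0.05, 0.05), (0.20, 0.50)]
print([acceptance(p, q) for p, q in pairs])  # [1.0, 0.25, 1.0, 0.4]
```

The average of these acceptance probabilities is the acceptance rate the card's deleted "Acceptance rate experiments" bullet referred to; higher rates mean more target-model forward passes are skipped.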

---

## Related Resources

### StentorLabs Models
- [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) – Larger v1 base model
- [Stentor-12M](https://huggingface.co/StentorLabs/Stentor-12M) – v1 baseline this model improves upon
- [Stentor-30M-Instruct](https://huggingface.co/StentorLabs/Stentor-30M-Instruct) – Instruction-tuned v1 model
- [Stentor-12M-Instruct](https://huggingface.co/StentorLabs/Stentor-12M-Instruct) – Instruction-tuned v1 model
- [StentorLabs Collection](https://huggingface.co/StentorLabs) – All models from StentorLabs

### Referenced Tools & Datasets

- [TokenMonster](https://huggingface.co/alasdairforsythe/tokenmonster) – Tokenizer vocabulary
- [HuggingFace Accelerate](https://github.com/huggingface/accelerate) – Training framework
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) – Quantization library
- [mradermacher GGUF quantizations of Stentor-30M](https://huggingface.co/mradermacher/Stentor-30M-GGUF) – Community quantizations of v1

---