StentorLabs committed on
Commit dd5d5a5 · verified · 1 Parent(s): 77443e3

Update README.md

Files changed (1): README.md (+217 −75)

README.md CHANGED
@@ -32,8 +32,8 @@ widget:
32
  example_title: "Toy Explanation (Often Wrong)"
33
  - text: "def fibonacci(n):"
34
  example_title: "Code Continuation"
35
- - text: "[INST]What is machine learning?[/INST]"
36
- example_title: "Instruction-Style Prompt (Not Tuned)"
37
  model_card_authors:
38
  - StentorLabs
39
  model-index:
@@ -68,6 +68,9 @@ model-index:
68
  ![Context Length](https://img.shields.io/badge/context-1024%20tokens-purple.svg)
69
  ![Vocab Size](https://img.shields.io/badge/vocab-8064%20tokens-blue.svg)
70
  [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow.svg)](https://huggingface.co/StentorLabs)
71
 
72
  > ⚠️ **This is a preview release.** Stentor2-12M-Preview is an early taste of the Stentor2 family — a substantially redesigned architecture over Stentor v1. Further improvements have already been identified and a refined final release is actively in progress. This checkpoint is **not** the ceiling of what Stentor2 will be.
73
  >
@@ -94,16 +97,19 @@ model-index:
94
  13. [Weight Initialization](#weight-initialization)
95
  14. [Evaluation & Results](#evaluation--results)
96
  15. [Training Dynamics](#training-dynamics)
97
- 16. [Use Cases](#use-cases)
98
- 17. [Inference Guide](#inference-guide)
99
- 18. [Real Model Responses](#real-model-responses)
100
- 19. [Quantization](#quantization)
101
- 20. [Format Conversion](#format-conversion)
102
- 21. [Speculative Decoding](#speculative-decoding)
103
- 22. [Bias, Risks & Limitations](#bias-risks--limitations)
104
- 23. [What's Next](#whats-next)
105
- 24. [Environmental Impact](#environmental-impact)
106
- 25. [Citation](#citation)
107
 
108
  ---
109
 
@@ -751,7 +757,11 @@ def clean_text(text: str) -> str:
751
  return text
752
  ```
753
 
754
- This normalizes Unicode (e.g., ligatures, half-width characters), removes blank lines, and collapses all whitespace to single spaces.
755
 
756
  ### Tokenization
757
 
@@ -820,6 +830,14 @@ This initialization is applied **before** the T4 recipe is applied. The T4 recip
820
 
821
  ## Evaluation & Results
822
 
823
  ### Metrics
824
 
825
  - **Validation Loss:** Cross-entropy loss over the held-out validation split (lower = better)
@@ -862,37 +880,67 @@ The training run proceeded for a single epoch over 14,649 optimizer steps, consu
862
 
863
  ---
864
 
865
- ## Use Cases
866
 
867
- ### Suitable Uses
868
 
869
- **Research & Education**
870
- - Studying transformer training dynamics at accessible compute cost
871
- - Investigating attention head behavior at scale (4 heads, 12 layers)
872
- - Tokenization efficiency experiments (comparing 8K vs 32K vocab at fixed params)
873
- - Testing training pipeline components on a real-but-cheap model
874
- - Teaching material for LLM courses
875
 
876
- **Edge Deployment Prototyping**
877
- - Benchmarking inference latency on CPU / mobile
878
- - Testing ONNX, TFLite, GGUF conversion pipelines
879
- - Validating quantization toolchains before scaling up
880
 
881
- **Speculative Decoding**
882
- - Draft model for larger Llama-family models
883
- - Acceptance rate experiments under vocabulary mismatch conditions
884
 
885
- **Base for Fine-Tuning**
886
- - Starting point for domain-specific instruction tuning
887
- - LoRA / QLoRA experiments
888
 
889
- ### Not Suitable For
890
 
891
- - Production chatbots or user-facing conversational systems
892
- - Tasks requiring factual accuracy or reliable reasoning
893
- - Long-context documents (>1,024 tokens)
894
- - Non-English text
895
- - Any safety-critical application
896
 
897
  ---
898
 
@@ -999,43 +1047,97 @@ These are actual unedited outputs from the model. All examples use the custom lo
999
 
1000
  ## Quantization
1001
 
1002
- Despite the model already being small, quantization can further reduce memory footprint for extremely constrained environments.
1003
 
1004
- > ⚠️ **Note:** The code examples below use `AutoModelForCausalLM.from_pretrained()` with a `BitsAndBytesConfig`. Due to the `weight_master` key issue described in the [Known Loading Issue](#known-loading-issue--please-read) section, this may not load the weights correctly for this preview checkpoint. If quantization via `from_pretrained()` produces no output or errors, load the model using the custom loader first, then apply quantization manually afterward. This will be a non-issue in the final Stentor2-12M release.
1005
 
1006
- ### 8-bit Quantization (bitsandbytes)
 
 
1007
 
1008
  ```python
1009
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
 
 
1010
 
1011
- quantization_config = BitsAndBytesConfig(load_in_8bit=True)
1012
- model = AutoModelForCausalLM.from_pretrained(
1013
- "StentorLabs/Stentor2-12M-Preview",
1014
- quantization_config=quantization_config,
1015
- device_map="auto"
1016
  )
1017
- # Approximate memory: ~12 MB (75% reduction from FP32 ~49 MB)
 
1018
  ```
1019
 
1020
- ### 4-bit Quantization (bitsandbytes)
 
 
1021
 
1022
  ```python
1023
- quantization_config = BitsAndBytesConfig(
1024
- load_in_4bit=True,
1025
- bnb_4bit_compute_dtype=torch.float16,
1026
- bnb_4bit_use_double_quant=True,
1027
- bnb_4bit_quant_type="nf4"
1028
- )
1029
- model = AutoModelForCausalLM.from_pretrained(
1030
- "StentorLabs/Stentor2-12M-Preview",
1031
- quantization_config=quantization_config,
1032
- device_map="auto"
1033
- )
1034
- # Approximate memory: ~6 MB (87% reduction from FP32)
1035
  ```
1036
 
1037
  Requires: `pip install bitsandbytes`
1038
 
 
 
1039
  ---
1040
 
1041
  ## Format Conversion
@@ -1193,22 +1295,70 @@ Training on free-tier cloud compute demonstrates that meaningful SLM research is
1193
 
1194
  ## Citation
1195
 
 
 
1196
  ```bibtex
1197
  @misc{izumoto2026stentor2_12m_preview,
1198
- title = {Stentor2-12M-Preview},
1199
- author = {Kai Izumoto},
1200
- year = {2026},
1201
- publisher = {StentorLabs},
1202
- howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}}
1203
  }
1204
  ```
1205
 
1206
  ---
1207
 
1208
  ## Related Resources
1209
 
1210
  ### StentorLabs Models
1211
- - [Stentor-12M](https://huggingface.co/StentorLabs/Stentor-12M) — The v1 baseline this model improves upon
1212
  - [StentorLabs Collection](https://huggingface.co/StentorLabs) — All models from StentorLabs
1213
 
1214
  ### Referenced Tools & Datasets
@@ -1216,15 +1366,7 @@ Training on free-tier cloud compute demonstrates that meaningful SLM research is
1216
  - [TokenMonster](https://huggingface.co/alasdairforsythe/tokenmonster) — Tokenizer vocabulary
1217
  - [HuggingFace Accelerate](https://github.com/huggingface/accelerate) — Training framework
1218
  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) — Quantization library
1219
-
1220
- ### Related Models (Comparable Scale)
1221
- - [SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) — Larger, highly capable SLM
1222
- - [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) — Larger alternative
1223
-
1224
- ### Research Papers
1225
- - [Speculative Decoding](https://arxiv.org/abs/2211.17192) — Leviathan et al., 2023
1226
- - [Model Card Methodology](https://arxiv.org/abs/1810.03993) — Mitchell et al., 2018
1227
- - [RoPE Positional Embeddings](https://arxiv.org/abs/2104.09864) — Su et al., 2021
1228
 
1229
  ---
1230
 
 
32
  example_title: "Toy Explanation (Often Wrong)"
33
  - text: "def fibonacci(n):"
34
  example_title: "Code Continuation"
35
+ - text: "The laws of thermodynamics describe"
36
+ example_title: "Science Continuation"
37
  model_card_authors:
38
  - StentorLabs
39
  model-index:
 
68
  ![Context Length](https://img.shields.io/badge/context-1024%20tokens-purple.svg)
69
  ![Vocab Size](https://img.shields.io/badge/vocab-8064%20tokens-blue.svg)
70
  [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow.svg)](https://huggingface.co/StentorLabs)
71
+ ![Status](https://img.shields.io/badge/status-Research%20Artifact%20Only-red.svg)
72
+
73
+ > 🔬 **Research Artifact — Not a Production Model.** This is an early preview checkpoint released for research, experimentation, and community feedback. It is not suitable for deployment in any user-facing application. See [Intended Uses](#use-cases--intended-uses) for details.
74
 
75
  > ⚠️ **This is a preview release.** Stentor2-12M-Preview is an early taste of the Stentor2 family — a substantially redesigned architecture over Stentor v1. Further improvements have already been identified and a refined final release is actively in progress. This checkpoint is **not** the ceiling of what Stentor2 will be.
76
  >
 
97
  13. [Weight Initialization](#weight-initialization)
98
  14. [Evaluation & Results](#evaluation--results)
99
  15. [Training Dynamics](#training-dynamics)
100
+ 16. [Use Cases & Intended Uses](#use-cases--intended-uses)
101
+ 17. [Out-of-Scope Uses](#out-of-scope-uses)
102
+ 18. [Ethical Considerations & Societal Impact](#ethical-considerations--societal-impact)
103
+ 19. [Inference Guide](#inference-guide)
104
+ 20. [Real Model Responses](#real-model-responses)
105
+ 21. [Quantization](#quantization)
106
+ 22. [Format Conversion](#format-conversion)
107
+ 23. [Speculative Decoding](#speculative-decoding)
108
+ 24. [Bias, Risks & Limitations](#bias-risks--limitations)
109
+ 25. [Related Work](#related-work)
110
+ 26. [What's Next](#whats-next)
111
+ 27. [Environmental Impact](#environmental-impact)
112
+ 28. [Citation](#citation)
113
 
114
  ---
115
 
 
757
  return text
758
  ```
759
 
760
+ **Why these specific steps:**
761
+
762
+ - **NFKC normalization** maps visually equivalent Unicode characters to a single canonical form (e.g., full-width `Ａ` → `A`, ligature `ﬁ` → `fi`, superscript `²` → `2`). This is the standard choice for LLM preprocessing — used in T5 (Raffel et al., 2020, [arXiv:1910.10683](https://arxiv.org/abs/1910.10683)), BERT (Devlin et al., 2019, [arXiv:1810.04805](https://arxiv.org/abs/1810.04805)), and the Unicode standard itself (Unicode Technical Report #15). Without it, the model would see dozens of token IDs for what is semantically one character.
763
+
764
+ - **Whitespace collapse** (join lines, collapse spaces) ensures consistent tokenization of the same content regardless of how it was originally formatted. Web-scraped text commonly contains inconsistent line breaks, multiple spaces, and mixed newline styles. This is also standard practice in GPT-style pretraining pipelines. No ablation was performed on this step — it was adopted from established practice rather than experimentally derived.
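The two steps above can be checked directly with Python's standard library. This is a minimal sketch, assuming plain `unicodedata.normalize` plus a regex collapse; `normalize_text` is an illustrative helper, not the card's actual `clean_text`:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    # NFKC folds compatibility characters (fullwidth forms, ligatures, superscripts)
    text = unicodedata.normalize("NFKC", text)
    # collapse all runs of whitespace (including newlines) to single spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("ﬁrst  Ａ²\n\nline"))  # first A2 line
```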
765
 
766
  ### Tokenization
767
 
 
830
 
831
  ## Evaluation & Results
832
 
833
+ ### Training Curves
834
+
835
+ The charts below show validation loss and perplexity over the course of the training run. Both are plotted against optimizer steps. The best checkpoint (step 11,625) is visible as the lowest point before the slight uptick in the tail phase.
836
+
837
+ ![Validation loss over training steps](loss_chart.png)
838
+
839
+ ![Perplexity over training steps](perplexity_chart.png)
840
+
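Since the two charts track the same quantity, converting between them is just exponentiation of the loss (measured in nats); a quick sketch:

```python
import math

def loss_to_ppl(loss: float) -> float:
    # perplexity is the exponential of cross-entropy loss in nats
    return math.exp(loss)

def ppl_to_loss(ppl: float) -> float:
    return math.log(ppl)

print(round(ppl_to_loss(50.1), 2))  # 3.91, the loss corresponding to ~50.1 perplexity
```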
841
  ### Metrics
842
 
843
  - **Validation Loss:** Cross-entropy loss over the held-out validation split (lower = better)
 
880
 
881
  ---
882
 
883
+ ## Use Cases & Intended Uses
884
+
885
+ > 🔬 **Reminder:** This is a **research artifact**. It is a base language model with no safety tuning, no instruction following, and no factual grounding. Every intended use below assumes a researcher or developer context, not an end user.
886
+
887
+ ### Intended Uses
888
+
889
+ | Use Case | Suitability | Notes |
890
+ |---|---|---|
891
+ | Studying transformer training dynamics | ✅ High | Small enough to train/fine-tune on free compute |
892
+ | Tokenization efficiency research | ✅ High | 8K vs 32K vocab tradeoff is directly observable |
893
+ | Speculative decoding experiments | ✅ High | Fast enough to serve as a draft model |
894
+ | Benchmarking CPU/edge inference latency | ✅ High | ~25 MB in FP16, runs on any hardware |
895
+ | Testing quantization/conversion pipelines | ✅ High | GGUF, ONNX, INT8 pipeline validation |
896
+ | Teaching material for LLM courses | ✅ High | Architecture is simple enough to trace by hand |
897
+ | LoRA / QLoRA fine-tuning experiments | ✅ Moderate | Base model only; start from scratch for any task |
898
+ | Text continuation / creative prompting | ✅ Moderate | Works best on short completions ≤60 tokens |
899
+ | Domain-specific fine-tuning research | ✅ Moderate | Small enough to iterate rapidly |
900
+ | Factual Q&A | ❌ Not suitable | Model has no reliable world knowledge |
901
+ | Production deployment | ❌ Not suitable | No safety tuning; preview quality only |
902
+ | Non-English text | ❌ Not suitable | TokenMonster vocab is English-only |
903
+ | Long-document tasks (>512 tokens of coherent output) | ❌ Not suitable | Coherence degrades quickly |
904
+
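The speculative-decoding row above rests on a simple formula from Leviathan et al., 2023: with per-token acceptance rate alpha and gamma draft tokens proposed per cycle, the expected number of tokens produced per target-model call is (1 - alpha^(gamma+1)) / (1 - alpha). A small sketch (the function name is illustrative):

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    # Expected tokens produced per target-model call in speculative decoding
    # (Leviathan et al., 2023): alpha = per-token acceptance rate,
    # gamma = number of draft tokens proposed per cycle.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. a draft model accepted 70% of the time, proposing 4 tokens per cycle:
print(round(expected_tokens_per_cycle(0.7, 4), 2))  # 2.77
```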
905
+ ---
906
+
907
+ ## Out-of-Scope Uses
908
+
909
+ The following uses are explicitly out of scope and should not be attempted:
910
+
911
+ - **User-facing applications of any kind** — This model has no safety filtering, no alignment, and no factual reliability. Deploying it in a context where a real user receives its output without expert review is inappropriate regardless of the domain.
912
+ - **Medical, legal, or financial advice** — Even if prompted carefully, 12M parameters cannot store or reason over specialized knowledge reliably. All outputs should be treated as potentially wrong.
913
+ - **Generating content about real people** — The model has no awareness of who real people are or what they have said/done. Outputs mentioning real people are likely to be fabricated.
914
+ - **Automated content pipelines** — Do not use this model to generate content at scale without human review. The output quality and coherence are not sufficient for unreviewed publication.
915
+ - **Non-English use** — The 8,064-token TokenMonster vocabulary is built exclusively for English. Prompts in other languages will be tokenized very poorly and outputs will be unreliable.
916
+ - **Instruction following** — This is a base model. It does not reliably follow instructions, answer questions, or complete structured tasks. Prompting it as if it were a chat assistant will not work.
917
+
918
+ ---
919
+
920
+ ## Ethical Considerations & Societal Impact
921
+
922
+ ### Inherited Data Biases
923
+
924
+ Stentor2-12M-Preview was trained on FineWeb-Edu, a filtered subset of Common Crawl. Despite quality filtering, this data inherits the biases present in English-language web text:
925
 
926
+ - **Western-centric perspective** — Educational content on the web skews heavily toward Western, primarily American and European, viewpoints and examples.
927
+ - **English monolingualism** — The training data and vocabulary are both English-only. The model has no meaningful capability in other languages.
928
+ - **Demographic underrepresentation** — Groups that are underrepresented in English-language educational web content will be underrepresented in the model's outputs.
929
+ - **Temporal cutoff** — FineWeb-Edu's data has a cutoff; the model has no knowledge of recent events.
930
 
931
+ ### No Safety Tuning
932
 
933
+ This model has received **no safety training of any kind** — no RLHF, no DPO, no constitutional AI, no content filtering. It is a raw base model that predicts the next token based on statistical patterns. It should not be used in any context where harmful outputs would cause real-world harm.
934
 
935
+ ### Positive Societal Aspects
 
 
936
 
937
+ - **Democratizing AI research** — Trained entirely on free-tier Kaggle compute, this model demonstrates that meaningful LLM research does not require significant financial resources. Students and independent researchers can reproduce, study, and build on this work.
938
+ - **Transparency** — Full training hyperparameters, architecture details, and training script are published. This is a contribution to reproducible ML research.
939
+ - **Minimal environmental footprint** — ~4.4 hours of single-GPU compute. Estimated carbon footprint under 0.5 kg CO₂e.
940
 
941
+ ### Responsible Use Reminder
942
 
943
+ If you use this model in research, please document clearly that it is an unaligned base model and include appropriate caveats when reporting results. Do not present outputs from this model as factual without verification.
944
 
945
  ---
946
 
 
1047
 
1048
  ## Quantization
1049
 
1050
+ > ⚠️ **Critical note for this preview:** `AutoModelForCausalLM.from_pretrained()` with `BitsAndBytesConfig` does **not** work for this checkpoint due to the `weight_master` key issue described in the [Known Loading Issue](#known-loading-issue--please-read) section. You must load with the custom loader first, then apply quantization afterward. The standard `from_pretrained()` + `BitsAndBytesConfig` pattern will work normally in the final Stentor2-12M release.
1051
 
1052
+ Despite the model already being small (~49 MB in FP32, ~25 MB in FP16), quantization can further reduce memory for extremely constrained environments.
1053
 
1054
+ ### FP16 — Recommended First Step
1055
+
1056
+ For GPU deployment, loading in FP16 halves memory to ~25 MB and is the simplest effective "quantization":
1057
 
1058
  ```python
1059
+ model, tokenizer = mod.load_stentor2(dtype=torch.float16)  # 'mod' is the custom loader module (setup shown below)
1060
+ model = model.to("cuda")
1061
+ ```
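The memory figures quoted throughout this section follow directly from bytes-per-parameter arithmetic over the reported 12.3M parameters; a sketch:

```python
params = 12.3e6  # reported parameter count
for fmt, bytes_per in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{fmt}: ~{params * bytes_per / 1e6:.0f} MB")
# FP32: ~49 MB, FP16: ~25 MB, INT8: ~12 MB, matching the figures quoted here
```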
1062
 
1063
+ ### Dynamic INT8 Quantization (CPU, PyTorch native — no extra install)
1064
+
1065
+ For CPU deployment, PyTorch's built-in dynamic quantization works after loading with the custom loader and requires no additional packages:
1066
+
1067
+ ```python
1068
+ import torch
1069
+ from huggingface_hub import hf_hub_download
1070
+ import importlib.util, sys
1071
+
1072
+ # Step 1: Load with custom loader
1073
+ path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
1074
+ spec = importlib.util.spec_from_file_location("load_stentor2", path)
1075
+ mod = importlib.util.module_from_spec(spec)
1076
+ sys.modules["load_stentor2"] = mod
1077
+ spec.loader.exec_module(mod)
1078
+
1079
+ model, tokenizer = mod.load_stentor2(dtype=torch.float32)
1080
+ model = model.to("cpu").eval()
1081
+
1082
+ # Step 2: Apply dynamic INT8 quantization (CPU only)
1083
+ model_int8 = torch.quantization.quantize_dynamic(
1084
+ model,
1085
+ {torch.nn.Linear},
1086
+ dtype=torch.qint8,
1087
  )
1088
+ # Approximate memory: ~12 MB (75% reduction from FP32)
1089
+ # Note: dynamic quantization only affects inference; model stays on CPU
1090
  ```
1091
 
1092
+ ### Manual 8-bit via bitsandbytes (GPU)
1093
+
1094
+ For GPU deployment with bitsandbytes INT8, apply the conversion after loading:
1095
 
1096
  ```python
1097
+ import torch
1098
+ import bitsandbytes as bnb
1099
+ from huggingface_hub import hf_hub_download
1100
+ import importlib.util, sys
1101
+
1102
+ # Step 1: Load with custom loader
1103
+ path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
1104
+ spec = importlib.util.spec_from_file_location("load_stentor2", path)
1105
+ mod = importlib.util.module_from_spec(spec)
1106
+ sys.modules["load_stentor2"] = mod
1107
+ spec.loader.exec_module(mod)
1108
+
1109
+ model, tokenizer = mod.load_stentor2(dtype=torch.float16)
1110
+ model = model.to("cuda").eval()
1111
+
1112
+ # Step 2: Replace linear layers with INT8 equivalents
1113
+ def replace_with_bnb_int8(module):
1114
+ for name, child in list(module.named_children()):
1115
+ if isinstance(child, torch.nn.Linear):
1116
+ new_layer = bnb.nn.Linear8bitLt(
1117
+ child.in_features,
1118
+ child.out_features,
1119
+ bias=child.bias is not None,
1120
+ has_fp16_weights=False,
1121
+ threshold=6.0,
1122
+ )
1123
+ new_layer.weight = bnb.nn.Int8Params(
1124
+ child.weight.data.cpu(),
1125
+ requires_grad=False,
1126
+ )
1127
+ if child.bias is not None:
1128
+ new_layer.bias = torch.nn.Parameter(child.bias.data)
1129
+ setattr(module, name, new_layer)
1130
+ else:
1131
+ replace_with_bnb_int8(child)
1132
+
1133
+ replace_with_bnb_int8(model)
1134
+ model = model.to("cuda")  # moving back to GPU triggers the actual INT8 quantization
+ # Approximate memory: ~12 MB (75% reduction from FP32 ~49 MB)
1135
  ```
1136
 
1137
  Requires: `pip install bitsandbytes`
1138
 
1139
+ > **Practical note:** Given that FP16 is already only ~25 MB and the model runs at 47–71 t/s on CPU, aggressive quantization may not be necessary for most use cases. Dynamic INT8 is most useful when targeting microcontrollers or very constrained embedded environments.
1140
+
1141
  ---
1142
 
1143
  ## Format Conversion
 
1295
 
1296
  ## Citation
1297
 
1298
+ If you use this model in research or a project, please cite it as follows. Note that this is a HuggingFace model card, not an arXiv paper, so there is no arXiv ID — the `howpublished` URL is the canonical reference.
1299
+
1300
  ```bibtex
1301
  @misc{izumoto2026stentor2_12m_preview,
1302
+ title = {Stentor2-12M-Preview},
1303
+ author = {Kai Izumoto},
1304
+ year = {2026},
1305
+ publisher = {StentorLabs},
1306
+ howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}},
1307
+ note = {Preview checkpoint of the Stentor2 model family.
1308
+ 12.3M parameter LlamaForCausalLM base model trained on
1309
+ FineWeb-Edu with a TokenMonster 8K vocabulary.
1310
+ Apache 2.0 license.}
1311
  }
1312
  ```
1313
 
1314
  ---
1315
 
1316
+ ## Related Work
1317
+
1318
+ This section compares Stentor2-12M-Preview to other publicly available models in the sub-50M parameter range, and to relevant research that informed design decisions.
1319
+
1320
+ ### Comparable Sub-50M Models
1321
+
1322
+ | Model | Parameters | Perplexity | Vocab | Training Data | Notes |
1323
+ |---|---|---|---|---|---|
1324
+ | **Stentor2-12M-Preview** (this model) | 12.3M | ~50.1 (FineWeb-Edu val) | 8,064 | FineWeb-Edu 240M tokens | Base model, TokenMonster vocab |
1325
+ | Stentor-12M (v1) | 12.0M | 89.01 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 200M | Baseline this model improves on |
1326
+ | Stentor-30M (v1) | 30.4M | 33.02 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 600M | Larger v1 model |
1327
+ | TinyStories-33M | ~33M | n/a | ~50K | TinyStories (synthetic) | Eldan & Li, 2023 — focused on story generation |
1328
+ | TinyStories-1M | ~1M | very high | ~50K | TinyStories (synthetic) | Demonstrates 1M param story capability |
1329
+ | Pythia-14M | 14M | n/a (Pile) | 50,254 | The Pile 300B tokens | EleutherAI; well-studied scaling baseline |
1330
+ | Pythia-70M | 70M | n/a (Pile) | 50,254 | The Pile 300B tokens | Closest Pythia model above this size |
1331
+ | BabyLlama | 58M | n/a | ~32K | TinyStories + Wikitext | BabyLM challenge submission |
1332
+
1333
+ > **Comparison caveats:** Perplexity numbers are not directly comparable across models — different validation sets, vocabularies, and tokenizers all affect the number. The table is a rough orientation, not a rigorous benchmark. Stentor2's perplexity is measured on the FineWeb-Edu validation split using its own 8K TokenMonster tokenizer.
1334
+
1335
+ **Key differentiators of Stentor2 vs. comparable models:**
1336
+ - **Vocabulary efficiency focus** — The deliberate reduction to 8K tokens to maximize non-embedding parameter budget is a distinguishing design choice not seen in most small models.
1337
+ - **T4-specific training recipe** — The INT8 QAT + FP32 critical layer + FP32 norm combination is a novel stability recipe specifically designed for consumer-grade GPU training.
1338
+ - **Educational data** — Unlike TinyStories models (trained on synthetic children's stories) or Pythia (trained on the general-domain Pile), Stentor2 is trained on quality-filtered educational web text.
1339
+
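The vocabulary-efficiency point can be made concrete with embedding-matrix arithmetic. This sketch uses an assumed illustrative hidden width, not the model's actual dimension:

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    # parameters in the token-embedding matrix alone
    return vocab_size * hidden_dim

hidden_dim = 256  # assumption for illustration only
freed = embedding_params(32_768, hidden_dim) - embedding_params(8_064, hidden_dim)
print(freed)  # 6324224 parameters freed for attention/MLP layers at this width
```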
1340
+ ### Related Research Papers
1341
+
1342
+ | Paper | Relevance |
1343
+ |---|---|
1344
+ | [TinyStories](https://arxiv.org/abs/2305.07759) — Eldan & Li, 2023 | Demonstrates meaningful language generation from 1M–33M parameter models; closest comparator in scale |
1345
+ | [Pythia](https://arxiv.org/abs/2304.01373) — Biderman et al., 2023 | Systematic study of small model scaling; Pythia-14M is a well-documented baseline |
1346
+ | [Scaling Laws](https://arxiv.org/abs/2001.08361) — Kaplan et al., 2020 | Foundational work on compute-optimal training; informs token budget decisions |
1347
+ | [Chinchilla](https://arxiv.org/abs/2203.15556) — Hoffmann et al., 2022 | Revised scaling laws; 240M tokens for 12M params is approximately compute-optimal under this analysis |
1348
+ | [Model Cards](https://arxiv.org/abs/1810.03993) — Mitchell et al., 2018 | Methodology underlying this model card |
1349
+ | [RoPE](https://arxiv.org/abs/2104.09864) — Su et al., 2021 | Positional encoding used in this model |
1350
+ | [Speculative Decoding](https://arxiv.org/abs/2211.17192) — Leviathan et al., 2023 | Primary use case for a fast draft model like Stentor2 |
1351
+ | [T5](https://arxiv.org/abs/1910.10683) — Raffel et al., 2020 | Source of NFKC text normalization approach used in data pipeline |
1352
+
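The Chinchilla-optimality claim in the table can be sanity-checked with one line of arithmetic, using the rough 20-tokens-per-parameter heuristic associated with Hoffmann et al., 2022:

```python
params = 12.3e6        # reported parameter count
tokens_per_param = 20  # Chinchilla rule of thumb (Hoffmann et al., 2022)
print(f"{params * tokens_per_param / 1e6:.0f}M tokens")  # 246M tokens, close to the 240M actually used
```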
1353
+ ---
1354
+
1355
  ## Related Resources
1356
 
1357
  ### StentorLabs Models
1358
+ - [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) — Larger v1 base model
1359
+ - [Stentor-12M](https://huggingface.co/StentorLabs/Stentor-12M) — v1 baseline this model improves upon
1360
+ - [Stentor-30M-Instruct](https://huggingface.co/StentorLabs/Stentor-30M-Instruct) — Instruction-tuned v1 model
1361
+ - [Stentor-12M-Instruct](https://huggingface.co/StentorLabs/Stentor-12M-Instruct) — Instruction-tuned v1 model
1362
  - [StentorLabs Collection](https://huggingface.co/StentorLabs) — All models from StentorLabs
1363
 
1364
  ### Referenced Tools & Datasets
 
1366
  - [TokenMonster](https://huggingface.co/alasdairforsythe/tokenmonster) — Tokenizer vocabulary
1367
  - [HuggingFace Accelerate](https://github.com/huggingface/accelerate) — Training framework
1368
  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) — Quantization library
1369
+ - [mradermacher GGUF quantizations of Stentor-30M](https://huggingface.co/mradermacher/Stentor-30M-GGUF) — Community quantizations of v1
1370
 
1371
  ---
1372