StentorLabs committed
Commit fc326d1 · verified · 1 Parent(s): a8786b8

Upload model_card.md

Files changed (1)
  1. model_card.md +590 -0
model_card.md ADDED
@@ -0,0 +1,590 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: transformers
6
+ tags:
7
+ - text-generation
8
+ - llama
9
+ - small-language-model
10
+ - efficient
11
+ - edge-deployment
12
+ - tiny-model
13
+ - 30m-parameters
14
+ - safety-tuning
15
+ - instruction-following
16
+ - chat
17
+ - lora
18
+ - peft
19
+ - beavertails
20
+ - dolly
21
+ base_model: StentorLabs/Stentor-30M
22
+ pipeline_tag: text-generation
23
+ datasets:
24
+ - PKU-Alignment/BeaverTails
25
+ - AmazonScience/FalseReject
26
+ - databricks/databricks-dolly-15k
27
+ widget:
28
+ - text: "How do I safely store household cleaning chemicals?"
+   example_title: "Safety Q&A"
+ - text: "How do I kill a process in Linux?"
+   example_title: "Technical Q&A"
+ - text: "What is machine learning in simple terms?"
+   example_title: "Explanation"
34
+ model_card_authors:
35
+ - StentorLabs
36
+ model-index:
37
+ - name: Stentor-30M-Instruct
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: Mixed eval split (BeaverTails, FalseReject, Dolly, Seed Safety)
+       type: mixed
+     metrics:
+     - name: Eval Loss (overall, best checkpoint)
+       type: loss
+       value: 3.176
+     - name: Eval Loss (BeaverTails subset)
+       type: loss
+       value: 2.135
+     - name: Eval Loss (FalseReject subset)
+       type: loss
+       value: 3.322
+     - name: Eval Loss (Dolly subset)
+       type: loss
+       value: 3.488
+     - name: Eval Loss (Seed Safety subset)
+       type: loss
+       value: 3.086
+     - name: Post-Train Harmful Refusal Rate (greedy)
+       type: accuracy
+       value: 0.167
+     - name: Post-Train Benign Helpful Rate (greedy)
+       type: accuracy
+       value: 0.824
66
+ ---
67
+
68
+ # Stentor-30M-Instruct
69
+
70
+ ![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)
71
+ ![Model Size](https://img.shields.io/badge/parameters-30M-green.svg)
72
+ ![Base Model](https://img.shields.io/badge/base-Stentor--30M-orange.svg)
73
+ ![Fine-Tuning](https://img.shields.io/badge/method-LoRA%20SFT%20merged-purple.svg)
74
+ ![Hardware](https://img.shields.io/badge/hardware-2×%20Tesla%20T4-red.svg)
75
+ ![Context Length](https://img.shields.io/badge/context-512%20tokens-blue.svg)
76
+ [![Hugging Face](https://img.shields.io/badge/🤗-Base%20Model-yellow.svg)](https://huggingface.co/StentorLabs/Stentor-30M)
77
+
78
+ **Stentor-30M-Instruct** is a supervised fine-tune of [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) targeting chat-format instruction following and basic safety behavior. The base model is a next-token predictor only: it has no instruction following, no chat formatting, and no safety behavior. This fine-tune adds all three through a structured five-phase supervised curriculum, although the size of those gains is bounded by the 30M parameter budget. Think of it as the base model made usable for simple chat interactions, not a capable general-purpose assistant.
79
+
80
+ LoRA adapters (r=32, α=32, rank-stabilized scaling) were trained on 2× Tesla T4s and then merged back into the base weights, so the checkpoint loads and runs like a standard Hugging Face causal LM, with no PEFT dependency at inference time.
81
+
82
+ > ⚠️ **Important Limitations**
83
+ >
84
+ > - **Still a 30M model.** Knowledge depth, reasoning ability, and generalization are all bounded by the tiny parameter count. This is a research / edge-deployment checkpoint, not a production assistant.
85
+ > - **Modest safety coverage.** Automated probe testing measured a **harmful-refusal rate of ~16.7%** and a **benign-helpful rate of ~82.4%** on a fixed 35-prompt evaluation suite. The low refusal rate most likely reflects a capacity limit at this scale rather than a pipeline failure: the model learned refusal *phrasing* reliably but cannot semantically detect the full diversity of harmful requests.
86
+ > - **Short responses.** The stop-calibration phase encourages concise, sentence-level output. Typical generations are 10–30 tokens.
87
+ > - **512-token context window** (inherited from the base model).
88
+ > - **No RLHF.** Trained with supervised fine-tuning only.
89
+
90
+ ---
91
+
92
+ ## What This Model Learned
93
+
94
+ The fine-tune was structured as five sequential curriculum phases, each targeting a specific behavioral objective:
95
+
96
+ 1. **Refuse clearly on harmful requests** — A warmup phase on hand-crafted refusal examples anchors safe behavior before any general data is introduced, preventing the model from learning to answer harmful prompts first.
97
+
98
+ 2. **General assistant helpfulness, formatting, and instruction-following** — The main SFT phase on 18,000 mixed examples teaches the model to respond in a chat format, follow instructions, and produce useful answers for safe queries.
99
+
100
+ 3. **Stronger refusal consistency on harmful prompts** — A dedicated BeaverTails phase reinforces refusals on real-world harmful prompt patterns, reducing the regression that typically occurs after general-purpose training dilutes safety behavior.
101
+
102
+ 4. **Stable safety behavior after broader training** — A consolidation pass on seed safety examples re-anchors refusals so that the gains from phase 3 are not erased by later training stages.
103
+
104
+ 5. **Concise stopping and less rambling** — A stop-calibration phase on short Q&A pairs teaches the model to stop cleanly at the end of an answer rather than continuing to generate filler text.
105
+
106
+ ---
107
+
108
+ ## 🚀 Quick Start
109
+
110
+ ### Install
111
+ ```bash
112
+ pip install transformers torch
113
+ ```
114
+
115
+ ### Load & Chat
116
+ ```python
117
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "StentorLabs/Stentor-30M-Instruct"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "How do I safely store household cleaning chemicals?"},
+ ]
+
+ # return_dict=True yields input_ids plus attention_mask, ready to unpack into generate()
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt",
+     return_dict=True,
+ )
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=80,
+     do_sample=True,
+     temperature=1.1,
+     top_p=0.6,
+     repetition_penalty=1.3,
+ )
+ # Decode only the newly generated tokens, not the echoed prompt
+ new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
+ print(tokenizer.decode(new_tokens, skip_special_tokens=True))
144
+ ```
145
+
146
+ ### Recommended Generation Settings
147
+
148
+ | Parameter | Value |
149
+ |---|---|
150
+ | `max_new_tokens` | 40–100 |
151
+ | `temperature` | 1.0–1.2 |
152
+ | `top_p` | 0.5–0.7 |
153
+ | `repetition_penalty` | 1.2–1.4 |
154
+
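To make the effect of `temperature` and `top_p` concrete, here is a minimal sketch of temperature scaling followed by top-p (nucleus) truncation on a toy five-token distribution. The `nucleus` helper and the logits are hypothetical illustrations, not the Transformers implementation, and repetition penalty is left out.

```python
import math


def nucleus(logits, temperature=1.1, top_p=0.6):
    """Return the renormalized top-p distribution as {token_index: probability}."""
    # Temperature > 1 flattens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.5, 0.5, 0.0, -1.0]  # made-up values for illustration
dist = nucleus(toy_logits)
print(dist)
```

With these toy logits only the two most likely tokens survive the cut, which illustrates why a low `top_p` keeps a small model on its highest-confidence continuations even at a temperature above 1.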
155
+ ---
156
+
157
+ ## Stentor-30M vs Stentor-30M-Instruct — Comparative Statistics
158
+
159
+ ### At a Glance
160
+
161
+ | | Stentor-30M | Stentor-30M-Instruct |
162
+ |---|---|---|
163
+ | **Type** | Base next-token predictor | Instruction + safety fine-tune |
164
+ | **Parameters** | ~30.4M | ~30.4M (unchanged) |
165
+ | **Architecture** | LlamaForCausalLM | LlamaForCausalLM (identical) |
166
+ | **Context window** | 512 tokens | 512 tokens |
167
+ | **Training hardware** | 1× Tesla T4 | 2× Tesla T4 |
168
+ | **Training time** | 7.88 hours | ~1 hour (fine-tune only) |
169
+ | **Instruction-following** | ✗ None | ✓ Basic chat format |
170
+ | **Safety refusals** | ✗ None | ✓ ~17% harmful refusal rate |
171
+ | **Stops cleanly** | ✗ Rare | ✓ Stop-calibrated |
172
+ | **Helpful on benign queries** | ~ Inconsistent | ✓ ~82% of test prompts |
173
+
174
+ ### Loss & Perplexity
175
+
176
+ | Metric | Stentor-30M | Stentor-30M-Instruct | Change |
177
+ |---|---|---|---|
178
+ | Best eval loss | 3.4971 | 3.176 (SFT domain) | −0.321 |
179
+ | Perplexity (PPL) | 33.02 | 23.9 (SFT domain) | −9.1 PPL |
180
+ | Initial train loss | 9.4245 | 4.517 | — |
181
+ | Final train loss | 3.2368 | 3.224 | — |
182
+
183
+ > **Note:** The eval losses are not directly comparable. Stentor-30M was evaluated on held-out FineWeb-Edu/Cosmopedia data, while Stentor-30M-Instruct was evaluated on its SFT data mix (BeaverTails, FalseReject, Dolly, Seed Safety). The lower PPL of the Instruct model reflects domain fit to the fine-tuning data, not necessarily better general language modeling.
184
+
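The perplexity figures in the table are the exponential of the corresponding eval losses. A small sketch of that conversion, assuming the losses are mean per-token cross-entropy in nats (the `perplexity` helper is just an illustrative name):

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(loss)


base_ppl = perplexity(3.4971)     # Stentor-30M best eval loss
instruct_ppl = perplexity(3.176)  # Stentor-30M-Instruct best eval loss (SFT domain)

print(f"base: {base_ppl:.2f}  instruct: {instruct_ppl:.2f}")
```

Small differences in the last digit against the table are expected, since the reported losses are themselves rounded.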
185
+ ### Training Scale
186
+
187
+ | | Stentor-30M | Stentor-30M-Instruct |
188
+ |---|---|---|
189
+ | **Tokens trained on** | 600,000,512 | ~3.5M (fine-tune) |
190
+ | **Training steps** | 4,578 | 273 (main SFT) |
191
+ | **Effective batch size** | 256 | 192 |
192
+ | **Optimizer** | AdamW fp16 | Paged AdamW fp32 |
193
+ | **Peak LR** | 8e-4 | 3e-5 |
194
+ | **Throughput** | ~21,137 tok/s | ~19.3 samples/s |
195
+ | **Platform** | Kaggle free (1× T4) | Kaggle free (2× T4) |
196
+
197
+ > Instruct throughput is in samples/sec rather than tokens/sec due to variable-length chat formatting.
198
+
199
+ ### Safety Behavior (Instruct only — base has none)
200
+
201
+ | Metric | Greedy | Sampled (T=0.7) |
202
+ |---|---|---|
203
+ | Harmful refusal rate | 16.7% | 16.7% |
204
+ | Benign helpful rate | 82.4% | 76.5% |
205
+ | Overall probe accuracy | 48.6% | 45.7% |
206
+ | Avg response tokens | 10.8 | 19.9 |
207
+
208
+ Use **Stentor-30M-Instruct** if you need basic chat interaction, some degree of safety-aware responses, or a fine-tuned baseline to compare curriculum approaches against. Use **Stentor-30M** if you need raw next-token generation, a pretraining baseline, or a starting point for your own fine-tune.
209
+
210
+
211
+
212
+ ---
213
+
214
+ ## Model Details
215
+
216
+ ### Architecture
217
+
218
+ All architectural parameters are unchanged from the base model:
219
+
220
+ | Component | Value |
221
+ |---|---|
222
+ | Hidden Size | 256 |
223
+ | Intermediate Size | 1,024 |
224
+ | Hidden Layers | 21 |
225
+ | Attention Heads | 4 |
226
+ | KV Heads | 4 |
227
+ | Activation | SiLU |
228
+ | RoPE θ | 10,000 |
229
+ | Max Position Embeddings | 512 |
230
+ | Vocab Size | 32,768 |
231
+ | Total Parameters | ~30.4M |
232
+
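The ~30.4M total can be reproduced from the table. The sketch below assumes a standard Llama layout with no bias terms, two RMSNorm weight vectors per layer, and input embeddings tied to the output head. The tying is an inference, not something the card states; it is the assumption under which the count matches the 30,419,712 non-adapter parameters implied by the LoRA totals (34,376,448 − 3,956,736).

```python
hidden, intermediate, layers, vocab = 256, 1024, 21, 32768

embedding = vocab * hidden        # assumed shared with the LM head (tied weights)
attention = 4 * hidden * hidden   # q, k, v, o projections; KV heads == attention heads
mlp = 3 * hidden * intermediate   # gate, up, down projections
norms = 2 * hidden                # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

total = embedding + layers * per_layer + hidden  # final term is the last RMSNorm
print(f"{total:,} parameters")
```

Under these assumptions the embedding table accounts for roughly 28% of all parameters, which is typical for models this small.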
233
+ ### LoRA Configuration
234
+
235
+ ```python
236
+ from peft import LoraConfig
+
+ lora_config = LoraConfig(
+     r=32,
+     lora_alpha=32,
+     use_rslora=True,  # rank-stabilized scaling: lora_alpha / sqrt(r)
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],
+     lora_dropout=0.1,
+     bias="none",
+     task_type="CAUSAL_LM",
+ )
+ # Trainable params: 3,956,736 / 34,376,448 total = 11.51%
247
+ ```
248
+
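The trainable-parameter count follows from the adapter shapes: each targeted projection with `in` inputs and `out` outputs gains `r * (in + out)` LoRA parameters. A short check, using the layer sizes from the architecture table (variable names are illustrative):

```python
import math

hidden, intermediate, layers, r, alpha = 256, 1024, 21, 32, 32

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

lora_per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
trainable = layers * lora_per_layer
total = 34_376_448  # base + adapter parameters, as reported in the card

print(f"{trainable:,} trainable = {100 * trainable / total:.2f}% of {total:,}")

# Scaling applied to the adapter output: standard LoRA vs rank-stabilized LoRA
standard_scale = alpha / r
rslora_scale = alpha / math.sqrt(r)
print(f"standard: {standard_scale:.3f}  rsLoRA: {rslora_scale:.3f}")
```

With `r = alpha = 32`, standard LoRA would scale the adapter output by 1, while the rank-stabilized variant scales it by about 5.66, so the adapters have a noticeably larger effective contribution at the same learning rate.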
249
+ ---
250
+
251
+ ## Training Details
252
+
253
+ ### Training Data
254
+
255
+ Stentor-30M-Instruct is the product of two distinct training stages:
256
+
257
+ **Pretraining data (inherited from Stentor-30M — not retrained here)**
258
+
259
+ | Dataset | Description |
260
+ |---|---|
261
+ | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | Web text filtered for educational quality |
262
+ | [Cosmopedia v2](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | Synthetic textbooks and stories |
263
+
264
+ Total tokens seen during pretraining: **600,000,512**. This is the source of the factual knowledge and language modeling ability in the checkpoint. The fine-tuning stages below were not designed to add world knowledge; they mainly change *how* the model responds.
265
+
266
+ **Fine-tuning data (this checkpoint)**
267
+
268
+ | Dataset | Role |
269
+ |---|---|
270
+ | [PKU-Alignment/BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) | Harmful prompt → refusal pairs + safe helpful responses |
271
+ | [AmazonScience/FalseReject](https://huggingface.co/datasets/AmazonScience/FalseReject) | Benign prompts that look risky — prevents over-refusal |
272
+ | [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | General instruction following and helpfulness |
273
+ | Seed Safety (hand-crafted) | Golden refusal examples for curriculum anchoring |
274
+
275
+ ---
276
+
277
+ ### Five-Phase Curriculum
278
+
279
+ | Phase | Dataset | Examples | Epochs | LR |
280
+ |---|---|---|---|---|
281
+ | 1 · Safety Warmup | Seed safety examples | 100 | 2 | 3e-5 |
282
+ | 2 · **Main SFT** | Mixed (see table below) | **17,460** | **3** | **3e-5 cosine** |
283
+ | 3 · BeaverTails Safety | BeaverTails harmful refusals | 300 | 2 | 5e-5 |
284
+ | 4 · Safety Consolidation | Seed safety examples | 100 | 2 | 5e-5 |
285
+ | 5 · Stop Calibration | Concise Q&A pairs | 512 | 1 | 3e-5 |
286
+
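The table can be written down as plain data to see how training effort is distributed across the phases. This is a sketch for illustration only: the `Phase` dataclass is a hypothetical structure, not the author's training code, and "example passes" here simply means examples × epochs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    examples: int
    epochs: int
    lr: float


CURRICULUM = [
    Phase("Safety Warmup", 100, 2, 3e-5),
    Phase("Main SFT", 17_460, 3, 3e-5),
    Phase("BeaverTails Safety", 300, 2, 5e-5),
    Phase("Safety Consolidation", 100, 2, 5e-5),
    Phase("Stop Calibration", 512, 1, 3e-5),
]

passes = {p.name: p.examples * p.epochs for p in CURRICULUM}
total_passes = sum(passes.values())

for name, n in passes.items():
    print(f"{name:<22} {n:>6} example passes ({100 * n / total_passes:5.1f}%)")
```

The main SFT phase accounts for about 97% of all example passes, so the four smaller phases act as brief, targeted corrections around it rather than as comparable training stages.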
287
+ ### Main SFT Data Mix (18,000 examples after cap)
288
+
289
+ | Source | Count | Share | Role |
290
+ |---|---|---|---|
291
+ | FalseReject | 7,125 | 39.6% | Benign prompts that look risky — prevents over-refusal |
292
+ | BeaverTails | 5,708 | 31.7% | Harmful → refusal pairs + benign helpful responses |
293
+ | Dolly-15k | 5,153 | 28.6% | General instruction following and helpfulness |
294
+ | Seed Safety | 14 | 0.1% | Hand-crafted golden refusal examples |
295
+
296
+ A safety system prompt was prepended to every example before tokenization.
297
+
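The shares in the data-mix table follow directly from the counts, as this short check shows:

```python
mix = {
    "FalseReject": 7_125,
    "BeaverTails": 5_708,
    "Dolly-15k": 5_153,
    "Seed Safety": 14,
}

total = sum(mix.values())
shares = {source: round(100 * count / total, 1) for source, count in mix.items()}

for source, share in shares.items():
    print(f"{source:<12} {mix[source]:>6}  {share:>5.1f}%")
print(f"{'Total':<12} {total:>6}")
```

Seed Safety contributes only 14 of the 18,000 examples in this mix, which is one plausible reason the curriculum gives those examples dedicated phases before and after the main SFT pass.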
298
+ ### Main SFT Hyperparameters
299
+
300
+ | Hyperparameter | Value |
301
+ |---|---|
302
+ | Epochs | 3 |
303
+ | Effective Batch Size | 192 (batch 48 × grad accum 4) |
304
+ | Max Sequence Length | 384 tokens |
305
+ | Learning Rate | 3e-5 |
306
+ | LR Scheduler | Cosine with 1 restart |
307
+ | Warmup Ratio | 0.06 |
308
+ | Weight Decay | 0.1 |
309
+ | Optimizer | Paged AdamW 32-bit |
310
+ | Adam ε | 1e-6 |
311
+ | Max Grad Norm | 1.0 |
312
+ | EMA Decay | 0.999 |
313
+ | Precision | fp32 (T4/Turing — bf16/fp16 AMP not used for main phase) |
314
+
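The learning-rate schedule implied by these settings can be sketched as follows. Several things here are assumptions rather than facts from the card: that "1 restart" means two cosine cycles, that the restart is a hard reset to the peak rate, that warmup is linear over `ceil(0.06 * total_steps)` steps, and that the main phase runs for 273 optimizer steps.

```python
import math

PEAK_LR = 3e-5
TOTAL_STEPS = 273                               # main SFT steps reported in the card
WARMUP_STEPS = math.ceil(0.06 * TOTAL_STEPS)    # warmup ratio 0.06
NUM_CYCLES = 2                                  # assumption: one restart = two cycles


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay with hard restarts (illustrative)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    cycle_position = (NUM_CYCLES * progress) % 1.0
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * cycle_position))


for step in (0, WARMUP_STEPS, 80, 144, 145, 272):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate decays to nearly zero by the midpoint of the post-warmup steps, jumps back to the peak, and then decays a second time.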
315
+ ### Compute
316
+
317
+ | Item | Value |
318
+ |---|---|
319
+ | Hardware | 2× NVIDIA Tesla T4 (16 GB each) |
320
+ | Platform | Kaggle Notebooks (free tier) |
321
+ | Main SFT training time | ~45 min (2,721 s) |
322
+ | Total fine-tune time (all phases) | ~1 hour |
323
+ | Training samples / sec (main phase) | ~19.3 |
324
+
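The step count, throughput, and epoch fractions reported in this card are mutually consistent, which a few lines of arithmetic can confirm. The sketch uses the 17,460-example figure from the curriculum table; the gap to the 18,000-example mix (540 examples, 3%) would be consistent with a held-out eval split, though the card does not say so explicitly.

```python
train_examples = 17_460
epochs = 3
effective_batch = 48 * 4          # batch 48 x gradient accumulation 4
main_sft_seconds = 2_721

samples_seen = train_examples * epochs
optimizer_steps = round(samples_seen / effective_batch)
samples_per_sec = samples_seen / main_sft_seconds


def epoch_at(step: int) -> float:
    """Approximate epoch reached after a given optimizer step."""
    return step * effective_batch / train_examples


print(f"steps: {optimizer_steps}, throughput: {samples_per_sec:.1f} samples/s")
print(f"epoch at step 240: {epoch_at(240):.2f}")
```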
325
+ ---
326
+
327
+ ## Evaluation
328
+
329
+ ### Training Curves
330
+
331
+ ![Training Loss](training_loss.png)
332
+ ![Training Perplexity](training_perplexity.png)
333
+
334
+ ### Eval Loss at Checkpoints (Main SFT Phase)
335
+
336
+ | Step | Approx. Epoch | Eval Loss | Eval PPL |
337
+ |---|---|---|---|
338
+ | 40 | 0.44 | 3.711 | 40.9 |
339
+ | 80 | 0.88 | 3.397 | 29.9 |
340
+ | 120 | 1.32 | 3.272 | 26.4 |
341
+ | 160 | 1.76 | 3.213 | 24.8 |
342
+ | 200 | 2.20 | 3.186 | 24.2 |
343
+ | **240** | **2.64** | **3.176** | **23.9** |
344
+
345
+ ### Per-Source Eval Loss at End of Epoch 3
346
+
347
+ | Source | Eval Loss | Notes |
348
+ |---|---|---|
349
+ | BeaverTails | **2.135** | Model converges strongly on short refusal templates |
350
+ | Seed Safety | 3.086 | Hand-crafted refusals; good fit |
351
+ | FalseReject | 3.322 | Benign-but-edgy prompts; stable throughout training |
352
+ | Dolly | 3.488 | General instruction following; modest increase vs. early training |
353
+
354
+ The low BeaverTails eval loss indicates that the model fit the refusal phrasing well. Generalizing that phrasing to novel harmful prompts is most likely limited by the 30M parameter budget.
355
+
356
+ ### Safety Probe Results (Post-Training, 35-prompt suite)
357
+
358
+ | Metric | Greedy | Sampled (T=0.7) |
359
+ |---|---|---|
360
+ | Overall Accuracy | 48.6% | 45.7% |
361
+ | **Harmful Refusal Rate** | **16.7%** | **16.7%** |
362
+ | **Benign Helpful Rate** | **82.4%** | **76.5%** |
363
+ | Avg Response Tokens | 10.8 | 19.9 |
364
+
365
+ > The model reliably avoids over-refusing safe queries (~82% helpful on benign prompts) but its harmful-refusal rate (~17%) reflects the limits of what a 30M-parameter SFT model can generalize. It is a useful research baseline for studying safety curricula at small scale, not a deployable content filter.
366
+
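The percentages above are what one would get from a suite of 18 harmful and 17 benign prompts. That split is an inference from the reported rates, not something the card states, so treat the counts in this sketch as hypothetical:

```python
HARMFUL, BENIGN = 18, 17  # inferred split of the 35-prompt suite (assumption)


def rates(refused_harmful: int, helpful_benign: int) -> dict:
    """Convert raw counts into the three percentages reported in the table."""
    return {
        "harmful_refusal": 100 * refused_harmful / HARMFUL,
        "benign_helpful": 100 * helpful_benign / BENIGN,
        "overall": 100 * (refused_harmful + helpful_benign) / (HARMFUL + BENIGN),
    }


greedy = rates(refused_harmful=3, helpful_benign=14)
sampled = rates(refused_harmful=3, helpful_benign=13)

for name, result in (("greedy", greedy), ("sampled", sampled)):
    print(name, {k: round(v, 1) for k, v in result.items()})
```

If the split is right, a single harmful prompt moves the refusal rate by about 5.6 points, so differences of a few points between runs on this suite should not be over-interpreted.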
367
+ ---
368
+
369
+ ## Uses
370
+
371
+ ### Recommended
372
+
373
+ - Research baseline for safety SFT curriculum design on sub-100M models
374
+ - Speculative decoding draft model for larger safety-tuned Llama variants (a target with a different tokenizer requires passing both tokenizers to `generate`)
375
+ - Edge-device or CPU-constrained chatbot prototype
376
+ - Educational demonstrations of chat fine-tuning and LoRA merging workflows
377
+
378
+ ### Out-of-Scope
379
+
380
+ - **Production safety filtering** — harmful refusal rate is too low for this use
381
+ - **Complex reasoning or long-form generation** — 512-token context, 30M params
382
+ - **Tool use or structured JSON output** — not trained for this
383
+ - **Multilingual use** — English only
384
+
385
+ ---
386
+
387
+ ## Bias, Risks, and Limitations
388
+
389
+ - **Weak safety generalization.** The model learned short refusal templates rather than deep semantic harm detection. Paraphrased or novel harmful prompts frequently bypass refusals.
390
+ - **Terse outputs.** Stop-calibration training produces short responses by design. Increase `max_new_tokens` if you need longer answers.
391
+ - **All base model limitations apply.** 512-token context, limited world knowledge, occasional hallucination — see the [Stentor-30M model card](https://huggingface.co/StentorLabs/Stentor-30M) for full details.
392
+ - **No RLHF.** SFT only — no preference-based alignment was applied.
393
+ - **Dataset biases.** BeaverTails, FalseReject, and Dolly each carry their own dataset biases into the fine-tune.
394
+
395
+ ---
396
+
397
+ ## Using the Training System Prompt
398
+
399
+ The model was trained with the following system prompt prepended to every example. Including it at inference time should give the most consistent behavior:
400
+
401
+ ```python
402
+ SAFETY_SYSTEM = (
+     "You are a helpful assistant. Follow these rules:\n"
+     "1) Never provide instructions that facilitate self-harm, suicide, "
+     "explicit sexual content, or harassment, hate, or bullying.\n"
+     "2) For self-harm intent, respond with empathy, encourage immediate support, "
+     "and suggest local emergency services. If the user is in the US, mention 988.\n"
+     "3) Assume positive intent unless explicit red flags appear.\n"
+     "4) When refusing, briefly acknowledge the user's underlying need if it can be "
+     "addressed safely, then redirect.\n"
+     "5) For benign educational requests, answer clearly and avoid over-refusal."
+ )
+
+ messages = [
+     {"role": "system", "content": SAFETY_SYSTEM},
+     {"role": "user", "content": "Your question here."},
+ ]
418
+ ```
419
+
420
+ ---
421
+
422
+ ## Running in Other Formats
423
+
424
+ Because the LoRA adapters have been merged back into the weights, Stentor-30M-Instruct is a standard Hugging Face causal LM and can be converted to any format that accepts base Llama checkpoints.
425
+
426
+ ### 8-bit Quantization (bitsandbytes)
427
+
428
+ ```python
429
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "StentorLabs/Stentor-30M-Instruct",
+     quantization_config=quantization_config,
+     device_map="auto",
+ )
+ # Memory: roughly 30 MB of weights (~50% less than fp16);
+ # layers left unquantized, such as embeddings, add to this
438
+ ```
439
+
440
+ ### 4-bit Quantization (bitsandbytes)
441
+
442
+ ```python
443
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "StentorLabs/Stentor-30M-Instruct",
+     quantization_config=quantization_config,
+     device_map="auto",
+ )
+ # Memory: roughly 15 MB of weights (~75% less than fp16);
+ # layers left unquantized, such as embeddings, add to this
450
+ ```
451
+
452
+ **Note:** Both snippets require `bitsandbytes` and `accelerate` (`pip install bitsandbytes accelerate`) and typically a CUDA-capable GPU.
453
+
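The memory figures quoted for the quantized variants are simple bits-per-parameter estimates. The sketch below reproduces them under the assumption that every weight is stored at the stated width; in practice some layers (embeddings, norms) usually stay in higher precision and quantization adds small per-block metadata, so real usage is somewhat higher. The exact parameter count used here is the one implied by the LoRA totals in this card.

```python
PARAMS = 30_419_712  # ~30.4M non-adapter parameters


def weight_megabytes(bits_per_param: int) -> float:
    """Idealized weight storage in MB if all parameters use the same bit width."""
    return PARAMS * bits_per_param / 8 / 1e6


for label, bits in (("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)):
    print(f"{label:>5}: {weight_megabytes(bits):6.1f} MB")
```

At this scale even the fp32 weights are only on the order of 120 MB, so quantization is mainly of interest for very constrained devices.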
454
+ ### Convert to GGUF (llama.cpp / LM Studio / Ollama)
455
+
456
+ ```bash
457
+ # Clone and build llama.cpp (the build provides llama-quantize and llama-cli)
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ pip install -r requirements.txt
+ cmake -B build
+ cmake --build build --config Release
+
+ # Download model
+ huggingface-cli download StentorLabs/Stentor-30M-Instruct --local-dir stentor-30m-instruct
+
+ # Convert to GGUF
+ python convert_hf_to_gguf.py stentor-30m-instruct/ \
+     --outfile stentor-30m-instruct.gguf \
+     --outtype f16
+
+ # Quantize (optional; Q4_K_M is a good size/quality balance)
+ ./build/bin/llama-quantize stentor-30m-instruct.gguf stentor-30m-instruct-q4_k_m.gguf q4_k_m
+
+ # Run
+ ./build/bin/llama-cli -m stentor-30m-instruct-q4_k_m.gguf -p "Hello, how can I help you?" -n 80
475
+ ```
476
+
477
+ ### Convert to ONNX (cross-platform / web)
478
+
479
+ ```bash
480
+ pip install "optimum[exporters,onnxruntime]"
+
+ optimum-cli export onnx \
+     --model StentorLabs/Stentor-30M-Instruct \
+     --task text-generation-with-past \
+     stentor-30m-instruct-onnx/
486
+ ```
487
+
488
+ ```python
489
+ from optimum.onnxruntime import ORTModelForCausalLM
490
+ from transformers import AutoTokenizer
491
+
492
+ model = ORTModelForCausalLM.from_pretrained("stentor-30m-instruct-onnx")
493
+ tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M-Instruct")
494
+
495
+ inputs = tokenizer("How do I sort a list in Python?", return_tensors="pt")
496
+ outputs = model.generate(**inputs, max_new_tokens=60)
497
+ print(tokenizer.decode(outputs[0]))
498
+ ```
499
+
500
+ ### Convert to TensorFlow Lite (Android / iOS)
501
+
502
+ ```bash
503
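+ # Install dependencies
+ # (tf2onnx converts TensorFlow/TFLite models to ONNX, not the reverse;
+ # onnx2tf is one tool for the ONNX -> TFLite direction, untested here for this model)
+ pip install tensorflow onnx2tf
+
+ # First export to ONNX (see above), then:
+ onnx2tf \
+     -i stentor-30m-instruct-onnx/model.onnx \
+     -o stentor-30m-instruct-tflite
+ # The .tflite files are written into the output folder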
511
+ ```
512
+
513
+ ### Speculative Decoding with a Larger Target Model
514
+
515
+ ```python
516
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ draft_id = "StentorLabs/Stentor-30M-Instruct"
+ target_id = "meta-llama/Llama-3.2-1B"
+
+ draft_model = AutoModelForCausalLM.from_pretrained(draft_id)
+ target_model = AutoModelForCausalLM.from_pretrained(target_id)
+ tokenizer = AutoTokenizer.from_pretrained(target_id)
+ # The draft and target use different tokenizers, so generate() needs both
+ # (universal assisted decoding; requires a recent Transformers release)
+ draft_tokenizer = AutoTokenizer.from_pretrained(draft_id)
+
+ inputs = tokenizer("Explain machine learning briefly.", return_tensors="pt")
+ outputs = target_model.generate(
+     **inputs,
+     assistant_model=draft_model,
+     tokenizer=tokenizer,
+     assistant_tokenizer=draft_tokenizer,
+     do_sample=True,
+     max_new_tokens=100,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
530
+ ```
531
+
532
+ **Format summary:**
533
+
534
+ | Format | Best for |
535
+ |---|---|
536
+ | HuggingFace (default) | Python inference, fine-tuning |
537
+ | GGUF | llama.cpp, LM Studio, Ollama — DIY conversion above |
538
+ | ONNX | Cross-platform (Windows / Linux / Mac / Web) |
539
+ | TFLite | Android / iOS mobile apps |
540
+ | 8-bit / 4-bit | Low-VRAM GPU inference |
541
+
542
+ ---
543
+
544
+ ## Environmental Impact
545
+
546
+ | Item | Value |
547
+ |---|---|
548
+ | Hardware | 2× NVIDIA Tesla T4 |
549
+ | Platform | Kaggle (free tier) |
550
+ | Compute region | US West |
551
+ | Total fine-tune time (all phases) | ~1 hour |
552
+ | Estimated CO₂e | ~5 gCO₂e |
553
+
554
+ ---
555
+
556
+ ## Citation
557
+
558
+ ```bibtex
559
+ @misc{izumoto2026stentor30m-instruct,
560
+ title={Stentor-30M-Instruct: Instruction-Tuned and Safety-Aligned Fine-Tune of Stentor-30M},
561
+ author={Kai Izumoto},
562
+ year={2026},
563
+ publisher={StentorLabs},
564
+ howpublished={\url{https://huggingface.co/StentorLabs/Stentor-30M-Instruct}}
565
+ }
566
+ ```
567
+
568
+ ---
569
+
570
+ ## Acknowledgments
571
+
572
+ - [StentorLabs/Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) — base model
573
+ - [PKU-Alignment/BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails) — safety training data
574
+ - [AmazonScience/FalseReject](https://huggingface.co/datasets/AmazonScience/FalseReject) — over-refusal mitigation data
575
+ - [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) — general instruction following data
576
+ - Hugging Face TRL, PEFT, and Transformers libraries
577
+ - Kaggle for free GPU compute
578
+
579
+ ---
580
+
581
+ ## Contact
582
+
583
+ Questions or feedback: [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open a discussion on the model page.
584
+
585
+ ---
586
+
587
+ <p align="center">
588
+ Made with ❤️ by <a href="https://huggingface.co/StentorLabs">StentorLabs</a><br>
589
+ <i>Democratizing AI through accessible, efficient models</i>
590
+ </p>