Text Generation
Transformers
Safetensors
English
llama
small-language-model
efficient
edge-deployment
speculative-decoding
tiny-model
30m-parameters
kaggle-trained
educational
research
low-resource
cpu-inference
mobile-deployment
synthetic-data
fineweb
cosmopedia
conversational
Eval Results (legacy)
text-generation-inference
Update README.md
README.md (changed)



[](https://huggingface.co/StentorLabs/Stentor-30M)
[](https://huggingface.co/mradermacher/Stentor-30M-GGUF)

Stentor-30M is a highly compact, efficient language model built on the Llama architecture. Designed for speed and low-resource environments, this ~30.4M-parameter checkpoint was trained with a mixed-precision pipeline and is best treated as a **base next-token predictor**, not a chat assistant. It does not "understand" text in a human sense and is not trained to reliably follow instructions. While the tokenizer may include special tokens and templates that resemble instruction or tool formats, the model itself is **not instruction-tuned** and will often generate **plausible but off-topic** text. It serves as an accessible entry point for researching attention mechanisms and testing training pipelines on consumer hardware.
```
Everyone is dead: 50 percent of our people will be killed in the coming days of our nation. 60 percent of us will live and go in
```

---

## 🚀 Quick Start

Get up and running in 3 simple steps:
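A condensed sketch of those steps (the prompt and sampling settings are illustrative, not tuned):

```python
# Minimal sketch: load the base model and sample a continuation.
# Prompt and sampling settings are illustrative, not tuned.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")

inputs = tokenizer("The history of space exploration", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```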

---

## 📦 Quantized Versions

Pre-quantized versions of Stentor-30M are available for llama.cpp, LM Studio, Ollama, and other compatible runtimes; no conversion is needed.

| Format | Provider | Link |
|--------|----------|------|
| GGUF (multiple quants) | mradermacher | [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) |

Just download your preferred quantization (e.g. `Q4_K_M` for a good size/quality balance) and run it directly with llama.cpp, or load it in LM Studio.
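One way to fetch and run a quant from the command line (the exact `.gguf` filename below is an assumption; check the file list on the GGUF repo):

```bash
# Filename below is assumed; verify the exact quant name on the repo page.
huggingface-cli download mradermacher/Stentor-30M-GGUF \
  Stentor-30M.Q4_K_M.gguf --local-dir .

# Run it with llama.cpp
./llama-cli -m Stentor-30M.Q4_K_M.gguf -p "Once upon a time" -n 50
```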

---

## Model Details

### Model Description
```python
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```
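The block above is the tail of the card's speculative decoding example; for context, a self-contained sketch of the same pattern, with Stentor-30M drafting tokens that a larger model verifies. The target model and prompt are illustrative assumptions, and `transformers` assisted generation expects the draft and target to use compatible tokenizers:

```python
# Sketch of assisted (speculative) generation; the target model and prompt
# are illustrative, and draft/target tokenizers must be compatible.
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")
target_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
target_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

inputs = target_tokenizer("The future of AI is", return_tensors="pt")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # the small model proposes tokens, the large one verifies
    max_new_tokens=50,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```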
### 2. Run with llama.cpp / LM Studio / Ollama (GGUF)

Pre-quantized GGUF files are available at [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF); no conversion is required.

```bash
# Download a quantized GGUF (e.g. Q4_K_M) from the link above, then run it with llama.cpp:
./llama-cli -m stentor-30m-Q4_K_M.gguf -p "Hello world" -n 50
```

Or simply load the `.gguf` file directly in **LM Studio** or **Ollama** for a GUI/API experience.
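For the Ollama route, a minimal sketch (the model name and file path are arbitrary choices):

```bash
# Import the downloaded GGUF into Ollama via a minimal Modelfile.
cat > Modelfile <<'EOF'
FROM ./stentor-30m-Q4_K_M.gguf
EOF

ollama create stentor-30m -f Modelfile
ollama run stentor-30m "Hello world"
```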
### 3. Edge Deployment with ONNX

Convert to ONNX for mobile/edge deployment:

```bash
pip install optimum[exporters]

optimum-cli export onnx \
  --model StentorLabs/Stentor-30M \
  --task text-generation-with-past \
  stentor-30m-onnx/
```

```python
# Use with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("stentor-30m-onnx")
tokenizer = AutoTokenizer.from_pretrained("stentor-30m-onnx")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```
### 4. Rapid Prototyping

Quick experimentation before scaling:

```python
print(f"Prompt: {prompt}\nResult: {result}\n")
```
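A minimal self-contained version of this loop using the `pipeline` API (the prompt list and generation settings are illustrative):

```python
# Minimal prototyping loop; prompts and settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="StentorLabs/Stentor-30M")

test_prompts = ["The capital of France is", "Photosynthesis is the process"]
for prompt in test_prompts:
    result = generator(prompt, max_new_tokens=20)[0]["generated_text"]
    print(f"Prompt: {prompt}\nResult: {result}\n")
```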
## Quantize It Yourself

If you want to produce your own quantized versions rather than using the pre-built GGUFs:

### 8-bit Quantization (bitsandbytes)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor-30M",
    quantization_config=quantization_config,
)
# Memory: ~30 MB (~50% reduction from fp16 weights)
```
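To sanity-check the quoted footprint after loading, `transformers` models expose a memory helper:

```python
# Rough in-memory size of the loaded (quantized) weights, in MB
print(f"{model.get_memory_footprint() / 1e6:.1f} MB")
```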
### 4-bit Quantization (bitsandbytes)

```python
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor-30M",
    quantization_config=quantization_config,
)
```

**Note:** Requires the `bitsandbytes` library: `pip install bitsandbytes`
### Convert to GGUF Manually

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Convert the safetensors checkpoint to GGUF
python convert_hf_to_gguf.py stentor-30m/ \
  --outfile stentor-30m.gguf

# Quantize (optional)
./llama-quantize stentor-30m.gguf stentor-30m-q4_0.gguf q4_0
```
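Back-of-envelope size check: q4_0 stores each 32-weight block as 16 bytes of 4-bit quants plus a 2-byte scale, about 4.5 bits per weight, so a ~30.4M-parameter model lands near 30.4M × 4.5 / 8 ≈ 17 MB on disk, plus metadata.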
### Convert to TensorFlow Lite (Mobile)

```bash
# (abridged) export via tf2onnx; see the full card for all flags
python -m tf2onnx.convert \
  --opset 13
```
**Format summary:**

- **GGUF:** C++ applications, llama.cpp, LM Studio, Ollama ([pre-built quants available](https://huggingface.co/mradermacher/Stentor-30M-GGUF))
- **ONNX:** Cross-platform (Windows/Linux/Mac/Web)
- **TFLite:** Android/iOS mobile apps

---
## Training Details

### Training Data

> **Note:** A significant portion of parameters is allocated to embeddings due to the 32K vocabulary size. For future iterations, a smaller vocabulary (8K-16K) could free up capacity for additional model layers.
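For scale, a worked example (the 512-dimensional width here is illustrative; the real hidden size is listed under Technical Specifications): a 32,000-entry embedding table at width 512 alone holds 32,000 × 512 ≈ 16.4M parameters, more than half of the ~30.4M total, which is why a smaller vocabulary would free meaningful capacity for extra layers.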
---

## Evaluation

### Testing Data, Factors & Metrics

> **Note:** As a 30M-parameter base model, this checkpoint should be treated as a functional proof-of-concept baseline. It has not been evaluated on external benchmarks such as MMLU or GSM8K.
---

## Technical Specifications

### Model Architecture and Objective

- **Torch Compile:** False (disabled for notebook stability)
- **Accelerate:** Enabled for training
---

## Environmental Impact

- **Hardware Type:** 1x NVIDIA Tesla T4

Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.
---

## Related Resources

### Official Resources

- 📊 Best model artifact: `results/best_model` (config + tokenizer + weights + metadata)
- 🎓 [Model Card Methodology](https://arxiv.org/abs/1810.03993) - Mitchell et al., 2018

### Quantized Versions

- 🗜️ [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) - GGUF quantizations for llama.cpp, LM Studio, Ollama
### Related Models

- [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) - Larger alternative (1.1B params)
- [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) - Similar size category

- [Speculative Decoding](https://arxiv.org/abs/2211.17192) - Leviathan et al., 2023
- [Small Language Models Survey](https://arxiv.org/abs/2402.14848) - Survey on efficient LLMs
---

## Citation

```bibtex
}
```
---

## Glossary

- **NLP (Natural Language Processing):** The field of AI focused on the interaction between computers and human language.
- **SLM (Small Language Model):** Language models with typically under 1B parameters, designed for efficiency and specific tasks.
- **RoPE (Rotary Position Embedding):** A method for encoding position information in transformer models (see the formula after this list).
- **Edge Deployment:** Running models on resource-constrained devices like mobile phones or IoT devices.
- **GGUF:** A file format used by llama.cpp and compatible runtimes for efficient local inference.
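For background on the RoPE entry above (the standard formulation, not a detail specific to this checkpoint): each pair of query/key dimensions at position $m$ is rotated by a position-dependent angle,

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d},
$$

where $d$ is the head dimension and $10000$ the conventional base, so attention scores between two tokens depend only on their relative offset.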
---

## Model Card Contact

For questions, please contact [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open an issue on the [model repository](https://huggingface.co/StentorLabs/Stentor-30M/discussions).

---
## Acknowledgments

Special thanks to:
- Hugging Face for the transformers library and dataset hosting
- The creators of the FineWeb-Edu and Cosmopedia v2 datasets
- Kaggle for providing free GPU compute resources
- [mradermacher](https://huggingface.co/mradermacher) for providing GGUF quantizations
- The open-source community for making accessible AI research possible

---
Made with ❤️ by <a href="https://huggingface.co/StentorLabs">StentorLabs</a>
<br>
<i>Democratizing AI through accessible, efficient models</i>
</p>