Text Generation
Transformers
Safetensors
English
llama
small-language-model
efficient
edge-deployment
speculative-decoding
tiny-model
30m-parameters
kaggle-trained
educational
research
low-resource
cpu-inference
mobile-deployment
synthetic-data
fineweb
cosmopedia
conversational
Eval Results (legacy)
text-generation-inference
Update README.md
README.md (changed)



[](https://huggingface.co/StentorLabs/Stentor-30M)
[](https://huggingface.co/mradermacher/Stentor-30M-GGUF)

Stentor-30M is a highly compact, efficient language model built on the Llama architecture. Designed for speed and low-resource environments, this ~30.4M-parameter checkpoint was trained with a mixed-precision pipeline and is best treated as a **base next-token predictor**, not a chat assistant. It does not "understand" text in a human sense and is not trained to reliably follow instructions. While the tokenizer may include special tokens and templates that resemble instruction or tool formats, the model itself is **not instruction-tuned** and will often generate **plausible but off-topic** text. It serves as an accessible entry point for researching attention mechanisms and testing training pipelines on consumer hardware.
```
Everyone is dead: 50 percent of our people will be killed in the coming days of our nation. 60 percent of us will live and go in
```

---

## 🚀 Quick Start

Get up and running in 3 simple steps:
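A condensed sketch of those steps (the prompt and sampling settings are illustrative, not tuned):

```python
# Minimal sketch: load the base model and sample a continuation.
# Prompt and sampling settings are illustrative, not tuned.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")

inputs = tokenizer("The history of space exploration", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```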

---

## 📦 Quantized Versions

Pre-quantized versions of Stentor-30M are available for llama.cpp, LM Studio, Ollama, and other compatible runtimes; no conversion is needed.

| Format | Provider | Link |
|--------|----------|------|
| GGUF (multiple quants) | mradermacher | [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) |

Just download your preferred quantization (e.g. `Q4_K_M` for a good size/quality balance) and run it directly with llama.cpp, or load it in LM Studio.
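One way to fetch and run a quant from the command line (the exact `.gguf` filename below is an assumption; check the file list on the GGUF repo):

```bash
# Filename below is assumed; verify the exact quant name on the repo page.
huggingface-cli download mradermacher/Stentor-30M-GGUF \
  Stentor-30M.Q4_K_M.gguf --local-dir .

# Run it with llama.cpp
./llama-cli -m Stentor-30M.Q4_K_M.gguf -p "Once upon a time" -n 50
```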

---

## Model Details

### Model Description
```python
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```
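The block above is the tail of the card's speculative decoding example; for context, a self-contained sketch of the same pattern, with Stentor-30M drafting tokens that a larger model verifies. The target model and prompt are illustrative assumptions, and `transformers` assisted generation expects the draft and target to use compatible tokenizers:

```python
# Sketch of assisted (speculative) generation; the target model and prompt
# are illustrative, and draft/target tokenizers must be compatible.
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")
target_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
target_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

inputs = target_tokenizer("The future of AI is", return_tensors="pt")
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # the small model proposes tokens, the large one verifies
    max_new_tokens=50,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```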
### 2. Run with llama.cpp / LM Studio / Ollama (GGUF)

Pre-quantized GGUF files are available at [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF); no conversion is required.

```bash
# Download a quantized GGUF (e.g. Q4_K_M) from the link above, then run it with llama.cpp:
./llama-cli -m stentor-30m-Q4_K_M.gguf -p "Hello world" -n 50
```

Or simply load the `.gguf` file directly in **LM Studio** or **Ollama** for a GUI/API experience.
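For the Ollama route, a minimal sketch (the model name and file path are arbitrary choices):

```bash
# Import the downloaded GGUF into Ollama via a minimal Modelfile.
cat > Modelfile <<'EOF'
FROM ./stentor-30m-Q4_K_M.gguf
EOF

ollama create stentor-30m -f Modelfile
ollama run stentor-30m "Hello world"
```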
### 3. Edge Deployment with ONNX

Convert to ONNX for mobile/edge deployment:

```bash
pip install optimum[exporters]

optimum-cli export onnx \
  --model StentorLabs/Stentor-30M \
  --task text-generation-with-past \
  stentor-30m-onnx/
```

```python
# Use with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("stentor-30m-onnx")
tokenizer = AutoTokenizer.from_pretrained("stentor-30m-onnx")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```
### 4. Rapid Prototyping

Quick experimentation before scaling:

```python
print(f"Prompt: {prompt}\nResult: {result}\n")
```
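A minimal self-contained version of this loop using the `pipeline` API (the prompt list and generation settings are illustrative):

```python
# Minimal prototyping loop; prompts and settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="StentorLabs/Stentor-30M")

test_prompts = ["The capital of France is", "Photosynthesis is the process"]
for prompt in test_prompts:
    result = generator(prompt, max_new_tokens=20)[0]["generated_text"]
    print(f"Prompt: {prompt}\nResult: {result}\n")
```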
## Quantize It Yourself

If you want to produce your own quantized versions rather than using the pre-built GGUFs:

### 8-bit Quantization (bitsandbytes)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor-30M",
    quantization_config=quantization_config,
)
# Memory: ~30 MB (~50% reduction from fp16 weights)
```
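To sanity-check the quoted footprint after loading, `transformers` models expose a memory helper:

```python
# Rough in-memory size of the loaded (quantized) weights, in MB
print(f"{model.get_memory_footprint() / 1e6:.1f} MB")
```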
### 4-bit Quantization (bitsandbytes)

```python
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "StentorLabs/Stentor-30M",
    quantization_config=quantization_config,
)
```

**Note:** Requires the `bitsandbytes` library: `pip install bitsandbytes`
### Convert to GGUF Manually

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Convert the safetensors checkpoint to GGUF
python convert_hf_to_gguf.py stentor-30m/ \
  --outfile stentor-30m.gguf

# Quantize (optional)
./llama-quantize stentor-30m.gguf stentor-30m-q4_0.gguf q4_0
```
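Back-of-envelope size check: q4_0 stores each 32-weight block as 16 bytes of 4-bit quants plus a 2-byte scale, about 4.5 bits per weight, so a ~30.4M-parameter model lands near 30.4M × 4.5 / 8 ≈ 17 MB on disk, plus metadata.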
### Convert to TensorFlow Lite (Mobile)

```bash
# (abridged) export via tf2onnx; see the full card for all flags
python -m tf2onnx.convert \
  --opset 13
```
**Format summary:**

- **GGUF:** C++ applications, llama.cpp, LM Studio, Ollama ([pre-built quants available](https://huggingface.co/mradermacher/Stentor-30M-GGUF))
- **ONNX:** Cross-platform (Windows/Linux/Mac/Web)
- **TFLite:** Android/iOS mobile apps

---
## Training Details

### Training Data

> **Note:** A significant portion of parameters is allocated to embeddings due to the 32K vocabulary size. For future iterations, a smaller vocabulary (8K-16K) could free up capacity for additional model layers.
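For scale, a worked example (the 512-dimensional width here is illustrative; the real hidden size is listed under Technical Specifications): a 32,000-entry embedding table at width 512 alone holds 32,000 × 512 ≈ 16.4M parameters, more than half of the ~30.4M total, which is why a smaller vocabulary would free meaningful capacity for extra layers.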
---

## Evaluation

### Testing Data, Factors & Metrics

> **Note:** As a 30M-parameter base model, this checkpoint should be treated as a functional proof-of-concept baseline. It has not been evaluated on external benchmarks such as MMLU or GSM8K.
---

## Technical Specifications

### Model Architecture and Objective

- **Torch Compile:** False (disabled for notebook stability)
- **Accelerate:** Enabled for training
---

## Environmental Impact

- **Hardware Type:** 1x NVIDIA Tesla T4

Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.
---

## Related Resources

### Official Resources

- 📊 Best model artifact: `results/best_model` (config + tokenizer + weights + metadata)
- 🎓 [Model Card Methodology](https://arxiv.org/abs/1810.03993) - Mitchell et al., 2018

### Quantized Versions

- 🗜️ [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) - GGUF quantizations for llama.cpp, LM Studio, Ollama
### Related Models

- [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) - Larger alternative (1.1B params)
- [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) - Similar size category

- [Speculative Decoding](https://arxiv.org/abs/2211.17192) - Leviathan et al., 2023
- [Small Language Models Survey](https://arxiv.org/abs/2402.14848) - Survey on efficient LLMs
---

## Citation

```bibtex
}
```
---

## Glossary

- **NLP (Natural Language Processing):** The field of AI focused on the interaction between computers and human language.
- **SLM (Small Language Model):** Language models with typically under 1B parameters, designed for efficiency and specific tasks.
- **RoPE (Rotary Position Embedding):** A method for encoding position information in transformer models (see the formula after this list).
- **Edge Deployment:** Running models on resource-constrained devices like mobile phones or IoT devices.
- **GGUF:** A file format used by llama.cpp and compatible runtimes for efficient local inference.
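For background on the RoPE entry above (the standard formulation, not a detail specific to this checkpoint): each pair of query/key dimensions at position $m$ is rotated by a position-dependent angle,

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad
\theta_i = 10000^{-2i/d},
$$

where $d$ is the head dimension and $10000$ the conventional base, so attention scores between two tokens depend only on their relative offset.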
---

## Model Card Contact

For questions, please contact [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open an issue on the [model repository](https://huggingface.co/StentorLabs/Stentor-30M/discussions).

---
## Acknowledgments

Special thanks to:
- Hugging Face for the transformers library and dataset hosting
- The creators of the FineWeb-Edu and Cosmopedia v2 datasets
- Kaggle for providing free GPU compute resources
- [mradermacher](https://huggingface.co/mradermacher) for providing GGUF quantizations
- The open-source community for making accessible AI research possible

---
Made with ❤️ by <a href="https://huggingface.co/StentorLabs">StentorLabs</a>
<br>
<i>Democratizing AI through accessible, efficient models</i>
</p>