StentorLabs committed on
Commit 6a69098 · verified · 1 Parent(s): a92dd83

Update README.md

Files changed (1)
  1. README.md +61 -33
README.md CHANGED
@@ -61,6 +61,7 @@ model-index:
![Hardware](https://img.shields.io/badge/hardware-1x%20Tesla%20T4-red.svg)
![Context Length](https://img.shields.io/badge/context-512%20tokens-purple.svg)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow.svg)](https://huggingface.co/StentorLabs/Stentor-30M)
+ [![GGUF](https://img.shields.io/badge/GGUF-mradermacher-blue.svg)](https://huggingface.co/mradermacher/Stentor-30M-GGUF)

Stentor-30M is a highly compact, efficient language model built on the Llama architecture. Designed for speed and low-resource environments, this ~30.4M parameter checkpoint utilizes a mixed-precision training pipeline and is best treated as a **base next-token predictor** (not a chat assistant). It does not "understand" text in a human sense and is not trained to reliably follow instructions. While the tokenizer may include special tokens/templates that resemble instruction or tool formats, the model itself is **not instruction-tuned** and will often generate **plausible but off-topic** text. It serves as an accessible entry point for researching attention mechanisms and testing training pipelines on consumer hardware.
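
In practice, "base next-token predictor" means prompts work best as text to be continued rather than as instructions to be followed. A minimal sketch using the standard `transformers` API (the prompt and sampling settings here are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")

# Phrase the prompt as a continuation: the model predicts likely next tokens,
# so "The water cycle is the process by which" works better than
# "Explain the water cycle."
inputs = tokenizer("The water cycle is the process by which", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```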
 
@@ -113,6 +114,8 @@ Text Generated:
Everyone is dead: 50 percent of our people will be killed in the coming days of our nation. 60 percent of us will live and go in
```

+ ---
+
## 🚀 Quick Start

Get up and running in 3 simple steps:
@@ -148,6 +151,18 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))

---

+ ## 📦 Quantized Versions
+
+ Pre-quantized versions of Stentor-30M are available for use with llama.cpp, LM Studio, Ollama, and other compatible runtimes; no conversion needed.
+
+ | Format | Provider | Link |
+ |--------|----------|------|
+ | GGUF (multiple quants) | mradermacher | [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) |
+
+ Just download your preferred quantization (e.g. `Q4_K_M` for a good size/quality balance) and run it directly with llama.cpp, or load it in LM Studio.
+
+ ---
+
## Model Details

### Model Description
@@ -271,7 +286,18 @@ outputs = target_model.generate(
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```

- ### 2. Edge Deployment with ONNX
+ ### 2. Run with llama.cpp / LM Studio / Ollama (GGUF)
+
+ Pre-quantized GGUF files are available at [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF); no conversion required.
+
+ ```bash
+ # Download a quantized GGUF (e.g. Q4_K_M) from the link above, then run with llama.cpp:
+ ./llama-cli -m stentor-30m-Q4_K_M.gguf -p "Hello world" -n 50
+ ```
+
+ Or simply load the `.gguf` file directly in **LM Studio** or **Ollama** for a GUI/API experience.
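
The same download-and-run workflow can be scripted from Python; a minimal sketch assuming `pip install huggingface_hub llama-cpp-python`, where the exact GGUF filename is an assumption to check against the repository's file list:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch one quant from the GGUF repo (hypothetical filename).
path = hf_hub_download(
    repo_id="mradermacher/Stentor-30M-GGUF",
    filename="Stentor-30M.Q4_K_M.gguf",
)

llm = Llama(model_path=path, n_ctx=512)  # the model was trained with a 512-token context
out = llm("Hello world", max_tokens=50)
print(out["choices"][0]["text"])
```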
 
+
+ ### 3. Edge Deployment with ONNX

Convert to ONNX for mobile/edge deployment:

@@ -284,7 +310,9 @@ optimum-cli export onnx \
--model StentorLabs/Stentor-30M \
--task text-generation-with-past \
stentor-30m-onnx/
+ ```

+ ```python
# Use with ONNX Runtime
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
@@ -297,7 +325,7 @@ outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

- ### 3. Rapid Prototyping
+ ### 4. Rapid Prototyping

Quick experimentation before scaling:

@@ -319,11 +347,11 @@ for prompt in test_prompts:
print(f"Prompt: {prompt}\nResult: {result}\n")
```

- ## Quantization Options
+ ## Quantize It Yourself

- Reduce memory footprint even further with quantization:
+ If you want to produce your own quantized versions rather than using the pre-built GGUFs:

- ### 8-bit Quantization
+ ### 8-bit Quantization (bitsandbytes)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
@@ -337,7 +365,7 @@ model = AutoModelForCausalLM.from_pretrained(
# Memory: ~30 MB (~50% reduction from fp16 weights)
```
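
The memory figures follow directly from parameter count times bytes per weight; a quick back-of-the-envelope check:

```python
params = 30.4e6  # ~30.4M parameters

# Approximate weight storage per precision (2, 1, and 0.5 bytes per weight).
for name, bytes_per_weight in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_weight / 1e6:.0f} MB")
# fp16: ~61 MB, int8: ~30 MB, int4: ~15 MB
```

Real footprints run slightly higher, since quantized checkpoints keep some tensors (e.g. layer norms) in higher precision.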
 
- ### 4-bit Quantization
+ ### 4-bit Quantization (bitsandbytes)

```python
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
@@ -351,9 +379,7 @@ model = AutoModelForCausalLM.from_pretrained(

**Note:** Requires the `bitsandbytes` library: `pip install bitsandbytes`

- ## Model Format Conversions
-
- ### Convert to GGUF (for llama.cpp)
+ ### Convert to GGUF Manually

```bash
# Clone llama.cpp
@@ -373,27 +399,6 @@ python convert_hf_to_gguf.py stentor-30m/ \

# Quantize (optional)
./llama-quantize stentor-30m.gguf stentor-30m-q4_0.gguf q4_0
-
- # Run with llama.cpp
- ./llama-cli -m stentor-30m-q4_0.gguf -p "Hello world" -n 50
- ```
-
- ### Convert to ONNX
-
- ```bash
- # Install optimum
- pip install optimum[exporters]
-
- # Export to ONNX
- optimum-cli export onnx \
- --model StentorLabs/Stentor-30M \
- --task text-generation-with-past \
- stentor-30m-onnx/
-
- # Use with ONNX Runtime (C++/Python/JS)
- from optimum.onnxruntime import ORTModelForCausalLM
-
- model = ORTModelForCausalLM.from_pretrained("stentor-30m-onnx")
```

### Convert to TensorFlow Lite (Mobile)
@@ -410,11 +415,13 @@ python -m tf2onnx.convert \
--opset 13
```

- **Use cases:**
- - **GGUF:** C++ applications, maximum performance
+ **Format summary:**
+ - **GGUF:** C++ applications, llama.cpp, LM Studio, Ollama ([pre-built available](https://huggingface.co/mradermacher/Stentor-30M-GGUF))
- **ONNX:** Cross-platform (Windows/Linux/Mac/Web)
- **TFLite:** Android/iOS mobile apps

+ ---
+
## Training Details

### Training Data
@@ -472,6 +479,8 @@ The training pipeline utilized lightweight but effective preprocessing steps:

> **Note:** A significant portion of parameters are allocated to embeddings due to the 32K vocabulary size. For future iterations, a smaller vocabulary (8K-16K) could free up capacity for additional model layers.
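
The embedding share is easy to check from the published config; a small sketch assuming the checkpoint exposes the standard Llama config fields `vocab_size` and `hidden_size`:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("StentorLabs/Stentor-30M")
embedding_params = cfg.vocab_size * cfg.hidden_size  # one 32K-row embedding matrix
print(f"embedding params ≈ {embedding_params / 1e6:.1f}M of ~30.4M total")
# If input and output embeddings are untied, this cost is paid twice.
```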
 
 
 
+ ---
+
## Evaluation

### Testing Data, Factors & Metrics
@@ -506,6 +515,8 @@ The model showed steady improvement throughout training:

> **Note:** As a 30M parameter base model, this checkpoint should be treated as a functional proof-of-concept baseline. It has not been evaluated on external benchmarks such as MMLU or GSM8K.
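
For readers who want their own quick sanity check, held-out perplexity is the natural metric for a base model of this size; a minimal sketch (the sample text is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor-30M")
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor-30M")

text = "Photosynthesis is the process by which plants convert sunlight into energy."
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
print(f"perplexity: {torch.exp(loss).item():.1f}")
```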
 
 
 
+ ---
+
## Technical Specifications

### Model Architecture and Objective
@@ -549,6 +560,8 @@ The model was trained using standard cloud infrastructure available to researchers:
- **Torch Compile:** False (disabled for notebook stability)
- **Accelerate:** Enabled for training

+ ---
+
## Environmental Impact

- **Hardware Type:** 1x NVIDIA Tesla T4
@@ -559,12 +572,17 @@ The model was trained using standard cloud infrastructure available to researchers:

Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.

+ ---
+
## Related Resources

### Official Resources
- 📊 Best model artifact: `results/best_model` (config + tokenizer + weights + metadata)
- 🎓 [Model Card Methodology](https://arxiv.org/abs/1810.03993) - Mitchell et al., 2018

+ ### Quantized Versions
+ - 🗜️ [mradermacher/Stentor-30M-GGUF](https://huggingface.co/mradermacher/Stentor-30M-GGUF) - GGUF quantizations for llama.cpp, LM Studio, and Ollama
+
### Related Models
- [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) - Larger alternative (1.1B params)
- [SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) - Similar size category
@@ -574,6 +592,8 @@ Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.
- [Speculative Decoding](https://arxiv.org/abs/2211.17192) - Leviathan et al., 2023
- [Small Language Models Survey](https://arxiv.org/abs/2402.14848) - Survey on efficient LLMs

+ ---
+
## Citation

```bibtex
@@ -586,6 +606,8 @@ Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.
}
```

+ ---
+
## Glossary

- **NLP (Natural Language Processing):** The field of AI focused on the interaction between computers and human language.
@@ -594,17 +616,23 @@ Training on free-tier cloud GPUs demonstrates the accessibility of small language model research to students and independent researchers.
- **SLM (Small Language Model):** Language models with parameters typically under 1B, designed for efficiency and specific tasks.
- **RoPE (Rotary Position Embedding):** A method for encoding position information in transformer models.
- **Edge Deployment:** Running models on resource-constrained devices like mobile phones or IoT devices.
+ - **GGUF:** A file format used by llama.cpp and compatible runtimes for efficient local inference.
+
+ ---

## Model Card Contact

For questions, please contact [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open an issue on the [model repository](https://huggingface.co/StentorLabs/Stentor-30M/discussions).

+ ---
+
## Acknowledgments

Special thanks to:
- Hugging Face for the transformers library and dataset hosting
- The creators of FineWeb-Edu and Cosmopedia v2 datasets
- Kaggle for providing free GPU compute resources
+ - [mradermacher](https://huggingface.co/mradermacher) for providing GGUF quantizations
- The open-source community for making accessible AI research possible

---
@@ -624,4 +652,4 @@ Special thanks to:
Made with ❤️ by <a href="https://huggingface.co/StentorLabs">StentorLabs</a>
<br>
<i>Democratizing AI through accessible, efficient models</i>
- </p>
+ </p>
 