---
license: mit
datasets:
- shivendrra/consolidated-datasets
language:
- en
metrics:
- perplexity
tags:
- Basemodel
- text-generation
- nlp
---

# TinyWay-1.2.0

**TinyWay-1.2.0** is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code).
The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training.

> Trained end-to-end on Kaggle using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets.

---

## Model Overview

| Property          | Value                                |
| ----------------- | ------------------------------------ |
| Model type        | Decoder-only Transformer (GPT-style) |
| Parameters        | **~109.6M**                          |
| Layers            | 10                                   |
| Hidden size       | 768                                  |
| Attention heads   | 12                                   |
| Context length    | 256 tokens                           |
| Activation        | GELU                                 |
| Dropout           | 0.1                                  |
| Precision         | fp16 / bf16                          |
| Weight tying      | Token embedding tied with LM head    |
| Position encoding | Learned absolute embeddings          |
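
For concreteness, the table above corresponds roughly to the following GPT-2-style configuration. This is only an illustrative sketch using Hugging Face's `GPT2Config`; the actual config class in `modeling_tinyway.py` may use different field names:

```python
from transformers import GPT2Config

# Hypothetical configuration matching the table above; the actual
# TinyWay config in modeling_tinyway.py may differ in field names.
config = GPT2Config(
    vocab_size=50257,            # GPT-2 tokenizer vocabulary
    n_positions=256,             # context length (learned absolute positions)
    n_embd=768,                  # hidden size
    n_layer=10,                  # transformer blocks
    n_head=12,                   # attention heads
    activation_function="gelu",
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
    tie_word_embeddings=True,    # LM head shares weights with token embedding
)
```

With weight tying, this works out to roughly 38.6M token-embedding parameters, ~0.2M position embeddings, and ~7.1M parameters per block across 10 blocks, consistent with the ~109.6M total quoted above.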

---

## Training Details

### Dataset

The model was trained using **streaming data** from:

* Web text
* Stories
* Code

via the HuggingFace dataset:

```
shivendrra/consolidated-datasets
```

Streaming was used to avoid large local storage and to allow continuous sampling directly from HuggingFace.
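
A minimal sketch of how such a streaming pipeline can be set up with the `datasets` library; the `"train"` split name and the `"text"` field are assumptions about the dataset layout, so adjust them to the actual schema:

```python
from datasets import load_dataset

# Stream directly from the Hub instead of downloading the full corpus.
stream = load_dataset(
    "shivendrra/consolidated-datasets",
    split="train",       # assumed split name
    streaming=True,
)

for i, example in enumerate(stream):
    print(example["text"][:200])  # assumed field name
    if i >= 2:                    # just peek at a few samples
        break
```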

---

### Tokenization

* Tokenizer: **GPT2TokenizerFast**
* Vocabulary size: **50,257**
* Special tokens: `bos_token_id = eos_token_id = pad_token_id = 50256` (see the snippet below)
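
A minimal sketch of reproducing this tokenizer setup; GPT-2 ships without a dedicated padding token, so the EOS token is reused as pad:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# Reuse EOS (id 50256) as bos/eos/pad, matching the configuration listed above.
tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.vocab_size)    # 50257
print(tokenizer.eos_token_id)  # 50256
print(tokenizer.pad_token_id)  # 50256
```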

---

### Training Configuration

| Setting               | Value                        |
| --------------------- | ---------------------------- |
| Sequence length       | 256                          |
| Effective batch size  | 64 sequences                 |
| Optimizer             | AdamW                        |
| Learning rate         | 3e-4 (cosine decay + warmup) |
| Betas                 | (0.9, 0.95)                  |
| Weight decay          | 0.1                          |
| Gradient clipping     | 1.0                          |
| Mixed precision       | AMP (fp16 / bf16)            |
| Gradient accumulation | Yes                          |
| Training steps        | ~60k                         |
| Total tokens          | ~1B (approx)                 |

Each optimizer step covers 64 × 256 = 16,384 tokens, so ~60k steps works out to roughly 1B tokens.

Final training loss ≈ **3.0**
Final perplexity ≈ **20** (perplexity = exp(loss), and exp(3.0) ≈ 20)
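
The optimizer and learning-rate schedule from the table can be re-created roughly as follows. This is a hedged sketch, not the actual training script; `model` is assumed to be the TinyWay `nn.Module`, and the warmup length is a placeholder (it is not stated in this card):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Hypothetical re-creation of the optimizer and LR schedule from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,     # assumed warmup length
    num_training_steps=60_000,  # ~60k steps as reported
)
```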

---

## Usage

### Load with Transformers (Custom Code Required)

This repository uses a custom model definition (`modeling_tinyway.py`).
Make sure it is available in your environment before loading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# If the custom code is published on the Hub, trust_remote_code=True lets
# transformers load modeling_tinyway.py for you; otherwise import it locally first.
model = AutoModelForCausalLM.from_pretrained("NNEngine/TinyWay-1.2.0", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```

---

### Text Generation Example

```python
import torch

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

model.eval()
with torch.no_grad():  # no gradients needed for generation
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizer has no pad token
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Example Generations

The model demonstrates:

* Coherent sentence structure
* Narrative flow in stories
* Reasonable grammar and punctuation
* Occasional repetition and topic drift (expected for this scale)

This is a research-grade small LLM, not instruction-aligned by default.

---

## Limitations

* Not instruction-tuned
* Limited reasoning depth compared to large LLMs
* Context length limited to 256 tokens
* May hallucinate or generate inconsistent facts
* Training data may contain noise from web sources

Use responsibly.

---

## Intended Use

* Research experiments
* Educational purposes
* Model scaling studies
* Training pipeline benchmarking
* Custom fine-tuning experiments

Not recommended for production or safety-critical applications.

---

## Reproducibility

The model was trained using:

* Custom PyTorch training loop
* Streaming datasets via HuggingFace
* Mixed precision training
* Gradient accumulation (see the sketch below)
* Periodic checkpointing
* Full monitoring (loss, perplexity, gradient norm, attention entropy)
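
As an illustration of the mixed-precision and gradient-accumulation pattern listed above, here is a minimal, hypothetical inner loop. It is not the actual training script: `model`, `optimizer`, `scheduler`, and `data_loader` are assumed to exist, the accumulation factor is a guess, and the forward call assumes an HF-style model that returns `.loss` when given labels:

```python
import torch

accum_steps = 8  # micro-batches per optimizer step (assumed, not stated in this card)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16

model.train()
for step, batch in enumerate(data_loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**batch, labels=batch["input_ids"])
        loss = outputs.loss / accum_steps  # average the loss over accumulated micro-batches

    scaler.scale(loss).backward()

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)  # unscale so clipping sees true gradient magnitudes
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # grad_norm can be logged alongside loss and perplexity for monitoring
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```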

If you'd like the full training code or configs, feel free to reach out.

---

## License

The model weights are released under the MIT license (see the metadata above).
Usage must also comply with the licenses of the underlying datasets and tokenizer, so please verify compliance before any commercial use.

---

## Acknowledgements

* HuggingFace
* PyTorch
* Kaggle
* GPT-2 tokenizer
* Open research community