---
license: apache-2.0
language: en
tags:
- causal-lm
- from-scratch
- transformer
- tiny-stories
- pytorch
- custom-architecture
- text-generation
datasets:
- fhswf/TinyStoriesV2_cleaned
---
|
|
|
|
|
# TinyWay-1.1.0 |
|
|
|
|
|
**TinyWay-1.1.0** is a lightweight **decoder-only Transformer language model** trained **from scratch** on limited compute. |
|
|
The project demonstrates that meaningful language modeling behavior can emerge from modest-scale models trained in constrained environments such as Kaggle. |
|
|
|
|
|
> **Core idea:** *Understanding LLM training mechanics end-to-end by building, training, debugging, and deploying a Transformer LM without relying on pretrained weights.* |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Architecture:** Decoder-only Transformer (GPT-style) |
|
|
* **Parameters:** ~83M |
|
|
* **Layers:** 10 Transformer blocks |
|
|
* **Hidden size:** 512 |
|
|
* **Attention heads:** 8 |
|
|
* **Context length:** 256 tokens |
|
|
* **Activation:** GELU |
|
|
* **Normalization:** Pre-LayerNorm |
|
|
* **Weight tying:** Token embedding ↔ LM head |
|
|
* **Precision during training:** FP16 (AMP) |
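A minimal PyTorch sketch of what these settings imply (illustrative only: the 4× feed-forward width and learned positional embeddings are assumptions, and the actual implementation is loaded from the repo via `trust_remote_code`):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LayerNorm decoder block: LN -> attention -> residual, LN -> MLP -> residual."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                           # residual around attention
        return x + self.mlp(self.ln2(x))    # residual around MLP

class TinyLM(nn.Module):
    def __init__(self, vocab=50257, d=512, n_layers=10, ctx=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)
        self.pos_emb = nn.Embedding(ctx, d)           # assumption: learned positions
        self.blocks = nn.ModuleList(Block(d) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d)
        self.lm_head = nn.Linear(d, vocab, bias=False)
        self.lm_head.weight = self.tok_emb.weight     # weight tying: embedding <-> head
        causal = torch.triu(torch.ones(ctx, ctx, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", causal)          # True = position is masked out

    def forward(self, idx):
        T = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for blk in self.blocks:
            x = blk(x, self.mask[:T, :T])
        return self.lm_head(self.ln_f(x))             # next-token logits
```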
|
|
|
|
|
--- |
|
|
|
|
|
## Training |
|
|
|
|
|
### Dataset |
|
|
|
|
|
* **TinyStoriesV2 (cleaned)** |
|
|
* Natural language short stories designed for training small language models |
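The corpus is available on the Hub; a quick way to inspect it with the `datasets` library (split and column names come from the dataset card, so treat this as a sketch):

```python
from datasets import load_dataset

# Load and inspect the training corpus
ds = load_dataset("fhswf/TinyStoriesV2_cleaned")
print(ds)  # shows the available splits and features
```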
|
|
|
|
|
### Tokenization |
|
|
|
|
|
* GPT-2 BPE tokenizer |
|
|
* Vocabulary size: 50,257 |
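Since the tokenizer is the stock GPT-2 BPE, it can be sanity-checked independently of the model:

```python
from transformers import AutoTokenizer

# The model reuses the standard GPT-2 BPE vocabulary (50,257 tokens)
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("Once upon a time")["input_ids"]
print(ids)              # GPT-2 BPE token IDs
print(tok.decode(ids))  # round-trips to the original text
print(len(tok))         # 50257
```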
|
|
|
|
|
### Training Setup |
|
|
|
|
|
* Optimizer: AdamW |
|
|
* Learning rate: tuned for stable convergence |
|
|
* Gradient accumulation: enabled |
|
|
* Gradient clipping: enabled |
|
|
* Mixed precision training (AMP) |
|
|
* Training performed entirely within **Kaggle's GPU environment (12-hour sessions)**
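A sketch of how these pieces typically fit together in one training step; `model`, `loader`, the learning rate, and the accumulation/clipping values below are illustrative assumptions, not the exact configuration used:

```python
import torch

def train(model, loader, accum_steps=8, lr=3e-4, clip=1.0):
    """AMP + gradient accumulation + gradient clipping (all values illustrative)."""
    scaler = torch.cuda.amp.GradScaler()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step, batch in enumerate(loader):
        with torch.cuda.amp.autocast():                # FP16 forward pass
            loss = model(**batch).loss / accum_steps   # scale loss for accumulation
        scaler.scale(loss).backward()                  # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(opt)                       # clip the true, unscaled grads
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
```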
|
|
|
|
|
### Checkpoints |
|
|
|
|
|
Checkpoints were saved at multiple points during training (5k → 30k steps).
|
|
**TinyWay-1.1.0** corresponds to the **~25k step checkpoint**, which showed the best balance of fluency and stability. |
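The saving pattern behind those checkpoints is the usual periodic `torch.save`; a sketch, where the interval, file name, and variable names are assumptions:

```python
import torch

# Inside the training loop: persist model + optimizer state every 5k steps
# (interval and file name are illustrative, not the repo's actual code).
if step > 0 and step % 5_000 == 0:
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": opt.state_dict()},
        f"tinyway_step{step}.pt",
    )
```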
|
|
|
|
|
--- |
|
|
|
|
|
## Example Usage |
|
|
|
|
|
```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_id = "NNEngine/TinyWay-1.1.0"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
mdl = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)

inputs = tok("Once upon a time", return_tensors="pt").to(mdl.device)

out = mdl.generate(
    **inputs,
    max_new_tokens=200,        # generate up to 200 new tokens
    do_sample=True,            # sampling, not greedy decoding
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    eos_token_id=None,         # disable early stopping on EOS
    pad_token_id=tok.eos_token_id,
)

print(tok.decode(out[0], skip_special_tokens=True))
```
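Setting `eos_token_id=None` disables early stopping, so generation runs for the full `max_new_tokens`; `pad_token_id` is set explicitly because the GPT-2 tokenizer ships without a dedicated padding token.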
|
|
|
|
|
--- |
|
|
|
|
|
## Sample Output |
|
|
|
|
|
> *Once upon a time, there was a little girl named Lily. She loved to play with her toys and explore the park near her home. One day, she found a shiny red ball hidden behind a tree…* |
|
|
|
|
|
(Outputs vary due to sampling.) |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
* Educational purposes |
|
|
* Research on small-scale language models |
|
|
* Understanding Transformer internals |
|
|
* Studying training dynamics under compute constraints |
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* Not instruction-tuned |
|
|
* Not aligned for factual accuracy or safety |
|
|
* May produce repetitive or incoherent text at times |
|
|
* Trained on a limited dataset |
|
|
|
|
|
This model is **not intended for production use** or sensitive applications. |
|
|
|
|
|
--- |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
* The model may generate fictional or incorrect information |
|
|
* No explicit safety or content filtering was applied |
|
|
* Users should apply downstream safeguards if deploying |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in academic or technical work, please cite: |
|
|
|
|
|
```bibtex
@misc{sharma2025tinyway,
  title={TinyWay: Training Decoder-Only Language Models from Scratch on Limited Compute},
  author={Shivam Sharma},
  year={2025}
}
```
|
|
|
|
|
--- |
|
|
|
|
|
## Author |
|
|
|
|
|
**Shivam Sharma** |
|
|
B.Tech in Computer Science and Engineering (AIML) |
|
|
ITM Gwalior, India |
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
* Hugging Face Transformers |
|
|
* Kaggle GPU resources |
|
|
* The open-source research community for inspiration