---
license: apache-2.0
language: en
tags:
- causal-lm
- from-scratch
- transformer
- tiny-stories
- pytorch
- custom-architecture
- text-generation
datasets:
- fhswf/TinyStoriesV2_cleaned
---

# TinyWay-1.1.0

**TinyWay-1.1.0** is a lightweight **decoder-only Transformer language model** trained **from scratch** on limited compute.
The project demonstrates that meaningful language modeling behavior can emerge from modest-scale models trained in constrained environments such as Kaggle.

> **Core idea:** *Understanding LLM training mechanics end-to-end by building, training, debugging, and deploying a Transformer LM without relying on pretrained weights.*

---

## Model Details

* **Architecture:** Decoder-only Transformer (GPT-style)
* **Parameters:** ~83M
* **Layers:** 10 Transformer blocks
* **Hidden size:** 512
* **Attention heads:** 8
* **Context length:** 256 tokens
* **Activation:** GELU
* **Normalization:** Pre-LayerNorm
* **Weight tying:** Token embedding ↔ LM head
* **Precision during training:** FP16 (AMP)
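
For orientation, these hyperparameters map roughly onto a GPT-2-style configuration. The sketch below is illustrative only: the repository ships its own custom architecture code (loaded via `trust_remote_code`), so the stock `GPT2Config` is an approximation, not the actual config.

```python
from transformers import GPT2Config

# Approximate, illustrative mapping; the real model uses custom code.
approx_config = GPT2Config(
    vocab_size=50257,            # GPT-2 BPE vocabulary
    n_positions=256,             # context length
    n_embd=512,                  # hidden size
    n_layer=10,                  # Transformer blocks
    n_head=8,                    # attention heads
    activation_function="gelu",  # GELU activation
    tie_word_embeddings=True,    # token embedding <-> LM head
)
```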

---

## Training

### Dataset

* **TinyStoriesV2 (cleaned)** (`fhswf/TinyStoriesV2_cleaned`)
* Natural language short stories designed for training small language models

### Tokenization

* GPT-2 BPE tokenizer
* Vocabulary size: 50,257
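
As a quick sanity check, the stock `gpt2` tokenizer (used here as a stand-in for the copy bundled with this model, an assumption worth verifying against the repo) reproduces this vocabulary:

```python
from transformers import AutoTokenizer

# Stock GPT-2 BPE tokenizer as a stand-in for the bundled one.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)                     # 50257
print(tok("Once upon a time").input_ids)  # BPE token ids for the prompt
```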

### Training Setup

* Optimizer: AdamW
* Learning rate: tuned for stable convergence
* Gradient accumulation: enabled
* Gradient clipping: enabled
* Mixed precision training (AMP)
* Training performed entirely in the **Kaggle GPU environment (12-hour sessions)**
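
A minimal, self-contained sketch of this training step, with a toy model and random batches standing in for the real Transformer and TinyStories data; the learning rate, clip value, and accumulation factor are illustrative, not the values actually used:

```python
import torch
from torch import nn

# Toy stand-in for the 10-layer Transformer described above.
model = nn.Sequential(nn.Embedding(50257, 512), nn.Linear(512, 50257)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative LR
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # gradient accumulation (illustrative)

for step in range(8):
    input_ids = torch.randint(0, 50257, (2, 256), device="cuda")  # fake batch
    with torch.cuda.amp.autocast(dtype=torch.float16):  # FP16 AMP
        logits = model(input_ids)
        # Causal LM objective: each position predicts the next token.
        loss = nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, 50257), input_ids[:, 1:].reshape(-1)
        ) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```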

### Checkpoints

Checkpoints were saved at multiple training steps (5k → 30k).
**TinyWay-1.1.0** corresponds to the **~25k step checkpoint**, which showed the best balance of fluency and stability.
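
If the intermediate checkpoints are published as separate revisions of the repository (an assumption; check the repo's branches), a specific one can be selected with the standard `revision` argument of `from_pretrained`:

```python
from transformers import AutoModelForCausalLM

# Hypothetical branch name; `revision` itself is a standard argument.
mdl = AutoModelForCausalLM.from_pretrained(
    "NNEngine/TinyWay-1.1.0",
    revision="step-25k",  # assumption: checkpoint branches may not exist
    trust_remote_code=True,
)
```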

---

## Example Usage

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_id = "NNEngine/TinyWay-1.1.0"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
mdl = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)

inputs = tok("Once upon a time", return_tensors="pt").to(mdl.device)

out = mdl.generate(
    **inputs,
    max_new_tokens=200,            # length of the continuation
    do_sample=True,                # sample instead of greedy decoding
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    eos_token_id=None,             # disable early stopping on EOS
    pad_token_id=tok.eos_token_id,
)

print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Sample Output

> *Once upon a time, there was a little girl named Lily. She loved to play with her toys and explore the park near her home. One day, she found a shiny red ball hidden behind a tree…*

(Outputs vary due to sampling.)

---

## Intended Use

* Educational purposes
* Research on small-scale language models
* Understanding Transformer internals
* Studying training dynamics under compute constraints

---

## Limitations

* Not instruction-tuned
* Not aligned for factual accuracy or safety
* May produce repetitive or incoherent text at times
* Trained on a limited dataset

This model is **not intended for production use** or sensitive applications.

---

## Ethical Considerations

* The model may generate fictional or incorrect information
* No explicit safety or content filtering was applied
* Users should apply downstream safeguards if deploying

---

## Citation

If you use this model in academic or technical work, please cite:

```bibtex
@misc{sharma2025tinyway,
  title={TinyWay: Training Decoder-Only Language Models from Scratch on Limited Compute},
  author={Shivam Sharma},
  year={2025},
}
```

---

## Author

**Shivam Sharma**
B.Tech in Computer Science and Engineering (AIML)
ITM Gwalior, India

---

## Acknowledgements

* Hugging Face Transformers
* Kaggle GPU resources
* The open research community for inspiration