---
license: apache-2.0
language: en
tags:
- causal-lm
- from-scratch
- transformer
- tiny-stories
- pytorch
- custom-architecture
- text-generation
datasets:
- fhswf/TinyStoriesV2_cleaned
---

# TinyWay-1.1.0

**TinyWay-1.1.0** is a lightweight **decoder-only Transformer language model** trained **from scratch** on limited compute.
The project demonstrates that meaningful language modeling behavior can emerge from modest-scale models trained in constrained environments such as Kaggle.

> **Core idea:** *Understanding LLM training mechanics end-to-end by building, training, debugging, and deploying a Transformer LM without relying on pretrained weights.*

---

## Model Details

* **Architecture:** Decoder-only Transformer (GPT-style)
* **Parameters:** ~83M
* **Layers:** 10 Transformer blocks
* **Hidden size:** 512
* **Attention heads:** 8
* **Context length:** 256 tokens
* **Activation:** GELU
* **Normalization:** Pre-LayerNorm
* **Weight tying:** Token embedding ↔ LM head
* **Precision during training:** FP16 (AMP)
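
For orientation, these hyperparameters map roughly onto a GPT-2-style configuration. The sketch below is illustrative only: the repository ships its own custom architecture code (loaded via `trust_remote_code`), so the stock `GPT2Config` is an approximation, not the actual config.

```python
from transformers import GPT2Config

# Approximate, illustrative mapping; the real model uses custom code.
approx_config = GPT2Config(
    vocab_size=50257,            # GPT-2 BPE vocabulary
    n_positions=256,             # context length
    n_embd=512,                  # hidden size
    n_layer=10,                  # Transformer blocks
    n_head=8,                    # attention heads
    activation_function="gelu",  # GELU activation
    tie_word_embeddings=True,    # token embedding <-> LM head
)
```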

---

## Training

### Dataset

* **TinyStoriesV2 (cleaned)** (`fhswf/TinyStoriesV2_cleaned`)
* Natural language short stories designed for training small language models

### Tokenization

* GPT-2 BPE tokenizer
* Vocabulary size: 50,257
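
As a quick sanity check, the stock `gpt2` tokenizer (used here as a stand-in for the copy bundled with this model, an assumption worth verifying against the repo) reproduces this vocabulary:

```python
from transformers import AutoTokenizer

# Stock GPT-2 BPE tokenizer as a stand-in for the bundled one.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.vocab_size)                     # 50257
print(tok("Once upon a time").input_ids)  # BPE token ids for the prompt
```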

### Training Setup

* Optimizer: AdamW
* Learning rate: tuned for stable convergence
* Gradient accumulation: enabled
* Gradient clipping: enabled
* Mixed precision training (AMP)
* Training performed entirely in the **Kaggle GPU environment (12-hour sessions)**
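
A minimal, self-contained sketch of this training step, with a toy model and random batches standing in for the real Transformer and TinyStories data; the learning rate, clip value, and accumulation factor are illustrative, not the values actually used:

```python
import torch
from torch import nn

# Toy stand-in for the 10-layer Transformer described above.
model = nn.Sequential(nn.Embedding(50257, 512), nn.Linear(512, 50257)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative LR
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # gradient accumulation (illustrative)

for step in range(8):
    input_ids = torch.randint(0, 50257, (2, 256), device="cuda")  # fake batch
    with torch.cuda.amp.autocast(dtype=torch.float16):  # FP16 AMP
        logits = model(input_ids)
        # Causal LM objective: each position predicts the next token.
        loss = nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, 50257), input_ids[:, 1:].reshape(-1)
        ) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```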

### Checkpoints

Checkpoints were saved at multiple training steps (5k → 30k).
**TinyWay-1.1.0** corresponds to the **~25k step checkpoint**, which showed the best balance of fluency and stability.
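
If the intermediate checkpoints are published as separate revisions of the repository (an assumption; check the repo's branches), a specific one can be selected with the standard `revision` argument of `from_pretrained`:

```python
from transformers import AutoModelForCausalLM

# Hypothetical branch name; `revision` itself is a standard argument.
mdl = AutoModelForCausalLM.from_pretrained(
    "NNEngine/TinyWay-1.1.0",
    revision="step-25k",  # assumption: checkpoint branches may not exist
    trust_remote_code=True,
)
```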

---

## Example Usage

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_id = "NNEngine/TinyWay-1.1.0"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
mdl = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True)

inputs = tok("Once upon a time", return_tensors="pt").to(mdl.device)

out = mdl.generate(
    **inputs,
    max_new_tokens=200,            # length of the continuation
    do_sample=True,                # sample instead of greedy decoding
    temperature=0.8,
    top_k=50,
    repetition_penalty=1.2,
    eos_token_id=None,             # disable early stopping on EOS
    pad_token_id=tok.eos_token_id,
)

print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Sample Output

> *Once upon a time, there was a little girl named Lily. She loved to play with her toys and explore the park near her home. One day, she found a shiny red ball hidden behind a tree…*

(Outputs vary due to sampling.)

---

## Intended Use

* Educational purposes
* Research on small-scale language models
* Understanding Transformer internals
* Studying training dynamics under compute constraints

---

## Limitations

* Not instruction-tuned
* Not aligned for factual accuracy or safety
* May produce repetitive or incoherent text at times
* Trained on a limited dataset

This model is **not intended for production use** or sensitive applications.

---

## Ethical Considerations

* The model may generate fictional or incorrect information
* No explicit safety or content filtering was applied
* Users should apply downstream safeguards if deploying

---

## Citation

If you use this model in academic or technical work, please cite:

```bibtex
@misc{sharma2025tinyway,
  title={TinyWay: Training Decoder-Only Language Models from Scratch on Limited Compute},
  author={Shivam Sharma},
  year={2025},
}
```

---

## Author

**Shivam Sharma**
B.Tech in Computer Science and Engineering (AIML)
ITM Gwalior, India

---

## Acknowledgements

* Hugging Face Transformers
* Kaggle GPU resources
* The open research community for inspiration