NNEngine committed on
Commit 2fb8642 · verified · 1 Parent(s): aa63000

Create README.md

Files changed (1)
  1. README.md +197 -0
README.md ADDED
@@ -0,0 +1,197 @@
+ ---
+ license: mit
+ datasets:
+ - shivendrra/consolidated-datasets
+ language:
+ - en
+ metrics:
+ - perplexity
+ tags:
+ - Basemodel
+ - text-generation
+ - nlp
+ ---
+
+
+ # 📘 TinyWay-1.2.0
+
+ **TinyWay-1.2.0** is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code).
+ The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training.
+
+ > ⚡ Trained end-to-end on Kaggle using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets.
+
+ ---
+
+ ## 🔍 Model Overview
+
+ | Property | Value |
+ | ----------------- | ------------------------------------ |
+ | Model type | Decoder-only Transformer (GPT-style) |
+ | Parameters | **~109.6M** |
+ | Layers | 10 |
+ | Hidden size | 768 |
+ | Attention heads | 12 |
+ | Context length | 256 tokens |
+ | Activation | GELU |
+ | Dropout | 0.1 |
+ | Precision | fp16 / bf16 |
+ | Weight tying | Token embedding tied with LM head |
+ | Position encoding | Learned absolute embeddings |
+
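As a rough sanity check on the parameter count in the table, the figures can be combined by hand. The sketch below is not the repository's own code: it assumes a standard GPT-2-style block (MLP width of 4 × hidden size, bias terms on every linear layer, two LayerNorms per block plus a final one), none of which the model card states explicitly.

```python
# Back-of-the-envelope parameter count for the configuration in the table.
# Assumptions (not confirmed by the model card): MLP ratio of 4, biases on
# all linear layers, two LayerNorms per block and one final LayerNorm.

VOCAB_SIZE = 50_257  # GPT-2 tokenizer vocabulary
HIDDEN = 768
LAYERS = 10
CONTEXT = 256
MLP_RATIO = 4        # assumed


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights (plus optional bias) of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm_params(width: int) -> int:
    """Scale and shift vectors of one LayerNorm."""
    return 2 * width


def block_params(hidden: int, mlp_ratio: int) -> int:
    """Parameters in one Transformer block under the assumptions above."""
    attention = linear_params(hidden, 3 * hidden) + linear_params(hidden, hidden)
    mlp = linear_params(hidden, mlp_ratio * hidden) + linear_params(mlp_ratio * hidden, hidden)
    norms = 2 * layer_norm_params(hidden)
    return attention + mlp + norms


def total_params() -> int:
    """Total count; the tied LM head reuses the token embedding and adds nothing."""
    embeddings = VOCAB_SIZE * HIDDEN + CONTEXT * HIDDEN
    blocks = LAYERS * block_params(HIDDEN, MLP_RATIO)
    return embeddings + blocks + layer_norm_params(HIDDEN)


if __name__ == "__main__":
    print(f"{total_params():,} parameters")
```

Under these assumptions the total lands within a fraction of a percent of the reported ~109.6M, and the token embedding alone accounts for roughly a third of the parameters, which is why weight tying matters at this scale.
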
+ ---
+
+ ## 🧠 Training Details
+
+ ### Dataset
+
+ The model was trained using **streaming data** from:
+
+ * 🌍 Web text
+ * 📚 Stories
+ * 💻 Code
+
+ via the HuggingFace dataset:
+
+ ```
+ shivendrra/consolidated-datasets
+ ```
+
+ Streaming was used to avoid large local storage and to allow continuous sampling directly from HuggingFace.
+
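A minimal sketch of what streaming from this dataset might look like with the `datasets` library. The split name `"train"` and the column name `"text"` are assumptions made for illustration; the model card does not say which split or field the training pipeline actually read.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def open_stream(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open the dataset in streaming mode (rows are fetched lazily over the network).

    The split name is an assumption. `datasets` is imported here so that
    `take_texts` below stays usable without the library installed.
    """
    from datasets import load_dataset

    return load_dataset("shivendrra/consolidated-datasets", split=split, streaming=True)


def take_texts(stream: Iterable[Dict[str, Any]], n: int, field: str = "text") -> List[str]:
    """Pull the first `n` rows from any row iterator and return one field.

    The default field name "text" is hypothetical.
    """
    return [row[field] for row in islice(stream, n)]


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for sample in take_texts(open_stream(), n=3):
        print(sample[:80])
```

Because the stream is an ordinary iterator, nothing is written to disk beyond what the training loop consumes, which is the storage benefit described above.
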
+ ---
+
+ ### Tokenization
+
+ * Tokenizer: **GPT2TokenizerFast**
+ * Vocabulary size: **50,257**
+ * Special tokens:
+
+ * `bos_token_id = eos_token_id = pad_token_id = 50256`
+
+ ---
+
+ ### Training Configuration
+
+ | Setting | Value |
+ | --------------------- | ---------------------------- |
+ | Sequence length | 256 |
+ | Effective batch size | 64 sequences |
+ | Optimizer | AdamW |
+ | Learning rate | 3e-4 (cosine decay + warmup) |
+ | Betas | (0.9, 0.95) |
+ | Weight decay | 0.1 |
+ | Gradient clipping | 1.0 |
+ | Mixed precision | AMP (fp16 / bf16) |
+ | Gradient accumulation | Yes |
+ | Training steps | ~60k |
+ | Total tokens | ~1B (approx) |
+
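The "cosine decay + warmup" schedule in the table can be sketched as a plain function of the step index. The peak rate (3e-4) and step count (~60k) come from the table; the warmup length and the final learning rate are not given, so the values below are hypothetical placeholders rather than the settings actually used.

```python
import math

PEAK_LR = 3e-4        # from the table
TOTAL_STEPS = 60_000  # from the table (approximate)
WARMUP_STEPS = 2_000  # hypothetical: the model card does not state this
MIN_LR = 0.0          # hypothetical: the model card does not state this


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          total_steps: int = TOTAL_STEPS,
          warmup_steps: int = WARMUP_STEPS,
          min_lr: float = MIN_LR) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Ramp linearly so the first update already uses a non-zero rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 1_000, 2_000, 31_000, 60_000):
        print(step, f"{lr_at(step):.2e}")
```

In a PyTorch loop the same function could be applied by writing its value into each optimizer parameter group's `lr` before every update.
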
+ Final training loss ≈ **3.0**
+ Final perplexity ≈ **20**
+
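The reported figures are mutually consistent, which can be checked with two lines of arithmetic: tokens seen equals steps × effective batch size × sequence length, and perplexity is the exponential of the mean cross-entropy loss (in nats). The sketch below only re-derives numbers already given in this section.

```python
import math

SEQ_LEN = 256          # tokens per sequence
EFFECTIVE_BATCH = 64   # sequences per optimizer step
TRAIN_STEPS = 60_000   # approximate number of optimizer steps


def total_tokens(steps: int, batch: int, seq_len: int) -> int:
    """Tokens processed over the whole run."""
    return steps * batch * seq_len


def perplexity(loss_nats: float) -> float:
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(loss_nats)


if __name__ == "__main__":
    print(f"{total_tokens(TRAIN_STEPS, EFFECTIVE_BATCH, SEQ_LEN):,} tokens")
    print(f"perplexity at loss 3.0: {perplexity(3.0):.1f}")
```

The token total comes out just under one billion, matching the "~1B" entry, and a loss of 3.0 corresponds to a perplexity of about 20.
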
+ ---
+
+ ## 🚀 Usage
+
+ ### Load with Transformers (Custom Code Required)
+
+ This repository uses a custom model definition (`modeling_tinyway.py`).
+ Make sure it is available in your environment before loading.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained("NNEngine/TinyWay-1.2.0")
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")
+ ```
+
+ ---
+
+ ### Text Generation Example
+
+ ```python
+ import torch
+
+ # Assumes `model` and `tokenizer` were loaded as in the previous snippet.
+ prompt = "Once upon a time"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Sample a continuation; prompt plus new tokens should stay within the 256-token context.
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=200,
+     temperature=0.8,
+     top_k=50,
+     top_p=0.95,
+     do_sample=True,
+     pad_token_id=tokenizer.eos_token_id,  # pad and EOS share id 50256
+ )
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ---
+
+ ## 📊 Example Generations
+
+ The model demonstrates:
+
+ * ✅ Coherent sentence structure
+ * ✅ Narrative flow in stories
+ * ✅ Reasonable grammar and punctuation
+ * ⚠️ Occasional repetition and topic drift (expected for this scale)
+
+ This is a research-grade small language model; it is a base model and has not been instruction-tuned.
+
+ ---
+
+ ## ⚠️ Limitations
+
+ * ❌ Not instruction-tuned
+ * ❌ Limited reasoning depth compared to large LLMs
+ * ❌ Context length limited to 256 tokens
+ * ⚠️ May hallucinate or generate inconsistent facts
+ * ⚠️ Training data may contain noise from web sources
+
+ Use responsibly.
+
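The 256-token context limit has a practical consequence for generation: with learned absolute position embeddings there are presumably no embeddings past position 256, so the prompt and the newly generated tokens together need to fit in that window. A small sketch of one way to handle this, keeping only the most recent prompt tokens; the helper name `fit_prompt` is hypothetical and not part of the repository.

```python
from typing import List, Sequence

CONTEXT_LENGTH = 256  # from the model card


def fit_prompt(token_ids: Sequence[int],
               max_new_tokens: int,
               context_length: int = CONTEXT_LENGTH) -> List[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens  # tokens left for the prompt
    return list(token_ids[-budget:])


if __name__ == "__main__":
    long_prompt = list(range(300))  # stand-in for tokenizer output
    trimmed = fit_prompt(long_prompt, max_new_tokens=200)
    print(len(trimmed))  # prompt budget when 200 new tokens are requested
```

With `max_new_tokens=200`, as in the generation example, only 56 tokens remain for the prompt, so longer prompts would need a smaller generation budget.
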
+ ---
+
+ ## 🧪 Intended Use
+
+ * Research experiments
+ * Educational purposes
+ * Model scaling studies
+ * Training pipeline benchmarking
+ * Custom fine-tuning experiments
+
+ Not recommended for production or safety-critical applications.
+
+ ---
+
+ ## 🛠️ Reproducibility
+
+ The model was trained using:
+
+ * Custom PyTorch training loop
+ * Streaming datasets via HuggingFace
+ * Mixed precision training
+ * Gradient accumulation
+ * Periodic checkpointing
+ * Full monitoring (loss, perplexity, gradient norm, attention entropy)
+
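Gradient accumulation, listed above, is what lets an effective batch of 64 sequences fit on a single Kaggle GPU: gradients from several small micro-batches are averaged before one optimizer step. The toy example below illustrates why this matches a full-batch step, using a one-parameter least-squares model in plain Python. It is an illustration of the principle, not the training loop used for TinyWay, and the micro-batch size of 16 is a hypothetical choice.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse_grad(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: Sequence[Sample], micro_batch_size: int) -> float:
    """Average micro-batch gradients, as a loop with gradient accumulation would."""
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    steps = len(batch) // micro_batch_size
    grad = 0.0
    for i in range(steps):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        # Dividing by `steps` mirrors scaling each micro-batch loss by 1 / steps.
        grad += mse_grad(w, micro) / steps
    return grad


if __name__ == "__main__":
    batch = [(float(i), 2.0 * i + 1.0) for i in range(64)]  # effective batch of 64
    print(mse_grad(0.5, batch), accumulated_grad(0.5, batch, micro_batch_size=16))
```

The two printed gradients agree up to floating-point rounding, which is why the effective batch size, rather than the micro-batch size, is the number reported in the training configuration.
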
+ If you’d like the full training code or configs, feel free to reach out.
+
+ ---
+
+ ## 📜 License
+
+ The model card metadata declares the MIT license. Use of the model is also subject to the licenses of the underlying datasets and tokenizer.
+ Please ensure compliance before commercial use.
+
+ ---
+
+ ## 🙌 Acknowledgements
+
+ * HuggingFace 🤗
+ * PyTorch
+ * Kaggle
+ * GPT-2 tokenizer
+ * Open research community