---
license: mit
datasets:
- shivendrra/consolidated-datasets
language:
- en
metrics:
- perplexity
tags:
- Basemodel
- text-generation
- nlp
- custom_code
- causal-lm
library_name: transformers
---
# TinyWay-1.2.0
**TinyWay-1.2.0** is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code).
The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training.
> ⚡ Trained end-to-end using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets.
---
## Model Overview
| Property | Value |
| ----------------- | ------------------------------------ |
| Model type | Decoder-only Transformer (GPT-style) |
| Parameters | **~109.6M** |
| Layers | 10 |
| Hidden size | 768 |
| Attention heads | 12 |
| Context length | 256 tokens |
| Activation | GELU |
| Dropout | 0.1 |
| Precision | fp16 / bf16 |
| Weight tying | Token embedding tied with LM head |
| Position encoding | Learned absolute embeddings |
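
As a sanity check, the parameter count in the table can be reproduced from these hyperparameters. The sketch below assumes a standard GPT-2-style block (fused QKV projection with biases, 4x MLP, LayerNorms with bias) and tied embeddings; the exact custom implementation may differ slightly in the details.

```python
# Back-of-the-envelope parameter count for the configuration above.
# Assumes a standard GPT-2-style block; the custom implementation may differ slightly.
vocab, d, n_layer, n_ctx = 50_257, 768, 10, 256

tok_emb = vocab * d                 # token embeddings (tied with the LM head)
pos_emb = n_ctx * d                 # learned absolute position embeddings

attn = 4 * d * d + 4 * d            # QKV + output projections, with biases
mlp = 2 * (d * 4 * d) + 4 * d + d   # up/down projections, with biases
norms = 2 * 2 * d                   # two LayerNorms (weight + bias) per block
per_layer = attn + mlp + norms

total = tok_emb + pos_emb + n_layer * per_layer + 2 * d  # + final LayerNorm
print(f"{total / 1e6:.1f}M parameters")  # ≈ 109.7M, matching the table
```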
---
## Training Details
### Dataset
The model was trained using **streaming data** from:
* 🌍 Web text
* 📚 Stories
* 💻 Code
via the HuggingFace dataset:
```
shivendrra/consolidated-datasets
```
Streaming was used to avoid large local storage requirements and to allow continuous sampling directly from the Hugging Face Hub.
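
A minimal sketch of what this looks like with the `datasets` library (the split name and the `"text"` field are assumptions for illustration, not taken from this repo):

```python
from datasets import load_dataset

# streaming=True avoids downloading the full corpus; examples are yielded lazily.
stream = load_dataset("shivendrra/consolidated-datasets", split="train", streaming=True)

for example in stream.take(3):
    print(example["text"][:80])
```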
---
### Tokenization
* Tokenizer: **GPT2TokenizerFast**
* Vocabulary size: **50,257**
* Special tokens:
* `bos_token_id = eos_token_id = pad_token_id = 50256`
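
A minimal setup matching these settings, where GPT-2's `<|endoftext|>` token doubles as BOS, EOS, and PAD:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # reuse <|endoftext|> (id 50256) as the pad token

print(tokenizer.vocab_size, tokenizer.eos_token_id)  # 50257, 50256
```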
---
### Training Configuration
| Setting | Value |
| --------------------- | ---------------------------- |
| Sequence length | 256 |
| Effective batch size | 64 sequences |
| Optimizer | AdamW |
| Learning rate | 3e-4 (cosine decay + warmup) |
| Betas | (0.9, 0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Mixed precision | AMP (fp16 / bf16) |
| Gradient accumulation | Yes |
| Training steps | ~60k |
| Total tokens | ~1B (approx) |
Final training loss ≈ **3.0**
Final perplexity ≈ **20** (perplexity = exp(loss); exp(3.0) ≈ 20.1)
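
These settings map onto a fairly standard PyTorch setup. A minimal sketch, assuming `model` is the instantiated TinyWay model and picking an arbitrary warmup length (the actual training script is not published in this repo):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1
)
# Warmup length is an assumption; the card only states "cosine decay + warmup".
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=60_000
)
scaler = torch.cuda.amp.GradScaler()  # for fp16 AMP; with bf16 the scaler can be skipped
```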
---
## Usage
### Load with Transformers (Custom Code Required)
This repository uses a custom model definition (`modeling_tinyway.py`).
Make sure it is available in your environment, or pass `trust_remote_code=True` so Transformers can load it directly from the Hub.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The architecture is defined in modeling_tinyway.py; trust_remote_code lets
# Transformers import that file from the Hub repo at load time.
model = AutoModelForCausalLM.from_pretrained("NNEngine/TinyWay-1.2.0", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
---
### Text Generation Example
```python
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # pad = eos = 50256 (see Tokenization)
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Example Generations
The model demonstrates:
* ✅ Coherent sentence structure
* ✅ Narrative flow in stories
* ✅ Reasonable grammar and punctuation
* ⚠️ Occasional repetition and topic drift (expected for this scale)
This is a research-grade small LLM; it is not instruction-tuned.
---
## Limitations
* ❌ Not instruction-tuned
* ❌ Limited reasoning depth compared to large LLMs
* ❌ Context length limited to 256 tokens
* ⚠️ May hallucinate or generate inconsistent facts
* ⚠️ Training data may contain noise from web sources
Use responsibly.
---
## Intended Use
* Research experiments
* Educational purposes
* Model scaling studies
* Training pipeline benchmarking
* Custom fine-tuning experiments
Not recommended for production or safety-critical applications.
---
## Reproducibility
The model was trained using:
* Custom PyTorch training loop
* Streaming datasets via HuggingFace
* Mixed precision training
* Gradient accumulation
* Periodic checkpointing
* Full monitoring (loss, perplexity, gradient norm, attention entropy), as sketched below
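
Based only on the bullets above and the Training Configuration table, and continuing from the optimizer/scheduler sketch in that section, the core loop would look roughly like this (`get_batch`, the accumulation factor, and the logging/checkpoint intervals are placeholders, not the actual code):

```python
import math
import torch

accum_steps = 8  # placeholder; effective batch = micro-batch x accum_steps

for step in range(60_000):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        input_ids, labels = get_batch()  # placeholder for the streaming data loader
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            # assumes an HF-style forward that returns an output with a .loss field
            loss = model(input_ids, labels=labels).loss / accum_steps
        scaler.scale(loss).backward()

    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

    if step % 100 == 0:
        full_loss = loss.item() * accum_steps
        print(f"step {step}: loss {full_loss:.3f}, ppl {math.exp(full_loss):.1f}")
    if step % 5_000 == 0:
        torch.save(model.state_dict(), f"checkpoint_{step}.pt")  # periodic checkpointing
```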
If you’d like the full training code or configs, feel free to reach out.
---
## License
The model weights are released under the MIT license (see the metadata above); the underlying datasets and the GPT-2 tokenizer carry their own licenses.
Please ensure compliance with those licenses before commercial use.
---
## Acknowledgements
* HuggingFace 🤗
* PyTorch
* GPT-2 tokenizer
* Open research community