|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- shivendrra/consolidated-datasets |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- perplexity |
|
|
tags: |
|
|
- base-model
|
|
- text-generation |
|
|
- nlp |
|
|
- custom_code |
|
|
- causal-llm
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
|
|
|
# TinyWay-1.2.0 |
|
|
|
|
|
**TinyWay-1.2.0** is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code). |
|
|
The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training. |
|
|
|
|
|
> ⚡ Trained end-to-end using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
| Property | Value | |
|
|
| ----------------- | ------------------------------------ | |
|
|
| Model type | Decoder-only Transformer (GPT-style) | |
|
|
| Parameters | **~109.6M** | |
|
|
| Layers | 10 | |
|
|
| Hidden size | 768 | |
|
|
| Attention heads | 12 | |
|
|
| Context length | 256 tokens | |
|
|
| Activation | GELU | |
|
|
| Dropout | 0.1 | |
|
|
| Precision | fp16 / bf16 | |
|
|
| Weight tying | Token embedding tied with LM head | |
|
|
| Position encoding | Learned absolute embeddings | |
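
For orientation, the sketch below mirrors these dimensions in plain PyTorch. It is an illustrative approximation, not the actual `modeling_tinyway.py` code; the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TinyWaySketch(nn.Module):
    """Hypothetical stand-in matching the table above (~110M params)."""

    def __init__(self, vocab_size=50257, n_layers=10, d_model=768,
                 n_heads=12, max_len=256, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned absolute positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying with the embedding

    def forward(self, idx):
        t = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(t, device=idx.device))
        # causal mask: each position attends only to itself and earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(t, device=idx.device)
        x = self.blocks(x, mask=mask, is_causal=True)
        return self.lm_head(self.ln_f(x))
```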
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
|
|
|
The model was trained using **streaming data** from: |
|
|
|
|
|
* 🌍 Web text |
|
|
* 📚 Stories |
|
|
* 💻 Code |
|
|
|
|
|
via the HuggingFace dataset: |
|
|
|
|
|
``` |
|
|
shivendrra/consolidated-datasets |
|
|
``` |
|
|
|
|
|
Streaming was used to avoid large local storage requirements and to allow continuous sampling directly from the HuggingFace Hub.
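
As a minimal sketch (assuming the dataset exposes a `"text"` field; adjust to the actual schema):

```python
from datasets import load_dataset

# Stream examples on the fly instead of downloading the corpus locally
ds = load_dataset("shivendrra/consolidated-datasets",
                  split="train", streaming=True)
for example in ds.take(3):
    print(example["text"][:100])
```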
|
|
|
|
|
--- |
|
|
|
|
|
### Tokenization |
|
|
|
|
|
* Tokenizer: **GPT2TokenizerFast** |
|
|
* Vocabulary size: **50,257** |
|
|
* Special tokens: |
|
|
|
|
|
* `bos_token_id = eos_token_id = pad_token_id = 50256` |
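
This configuration can be reproduced in a few lines (the `pad_token` assignment is needed because GPT-2 does not define one by default):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # <|endoftext|>, id 50256

print(tokenizer.vocab_size)    # 50257
print(tokenizer.eos_token_id)  # 50256
```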
|
|
|
|
|
--- |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
| Setting | Value | |
|
|
| --------------------- | ---------------------------- | |
|
|
| Sequence length | 256 | |
|
|
| Effective batch size | 64 sequences | |
|
|
| Optimizer | AdamW | |
|
|
| Learning rate | 3e-4 (cosine decay + warmup) | |
|
|
| Betas | (0.9, 0.95) | |
|
|
| Weight decay | 0.1 | |
|
|
| Gradient clipping | 1.0 | |
|
|
| Mixed precision | AMP (fp16 / bf16) | |
|
|
| Gradient accumulation | Yes | |
|
|
| Training steps | ~60k | |
|
|
| Total tokens | ~1B |
|
|
|
|
|
Final training loss ≈ **3.0**

Final perplexity ≈ **20** (perplexity = e^loss, and e^3.0 ≈ 20)
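
The update step roughly corresponds to the sketch below. The actual training loop is custom and unpublished, so treat this as illustrative: `model`, `loader`, the warmup length, and the micro-batch split are all assumptions.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=60_000)  # warmup assumed
scaler = torch.cuda.amp.GradScaler()  # for fp16; bf16 does not need a scaler
accum = 8  # e.g. micro-batch 8 x 8 accumulation steps = 64 effective sequences

for step, batch in enumerate(loader):
    with torch.autocast("cuda"):
        loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    scaler.scale(loss / accum).backward()  # accumulate scaled gradients
    if (step + 1) % accum == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```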
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Load with Transformers (Custom Code Required) |
|
|
|
|
|
This repository uses a custom model definition (`modeling_tinyway.py`).
Pass `trust_remote_code=True` so Transformers can fetch and run the custom architecture code from the Hub.
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because the architecture is defined in
# the repository's custom modeling_tinyway.py
model = AutoModelForCausalLM.from_pretrained(
    "NNEngine/TinyWay-1.2.0", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
|
|
|
|
|
--- |
|
|
|
|
|
### Text Generation Example |
|
|
|
|
|
```python
import torch

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# sample up to 200 new tokens with temperature, top-k, and nucleus filtering
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # silences the missing-pad warning
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
--- |
|
|
|
|
|
## Example Generations |
|
|
|
|
|
The model demonstrates: |
|
|
|
|
|
* ✅ Coherent sentence structure |
|
|
* ✅ Narrative flow in stories |
|
|
* ✅ Reasonable grammar and punctuation |
|
|
* ⚠️ Occasional repetition and topic drift (expected for this scale) |
|
|
|
|
|
This is a research-grade small LLM; it is not instruction-tuned and will continue text rather than follow instructions.
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* ❌ Not instruction-tuned |
|
|
* ❌ Limited reasoning depth compared to large LLMs |
|
|
* ❌ Context length limited to 256 tokens |
|
|
* ⚠️ May hallucinate or generate inconsistent facts |
|
|
* ⚠️ Training data may contain noise from web sources |
|
|
|
|
|
Use responsibly. |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
* Research experiments |
|
|
* Educational purposes |
|
|
* Model scaling studies |
|
|
* Training pipeline benchmarking |
|
|
* Custom fine-tuning experiments |
|
|
|
|
|
Not recommended for production or safety-critical applications. |
|
|
|
|
|
--- |
|
|
|
|
|
## Reproducibility |
|
|
|
|
|
The model was trained using: |
|
|
|
|
|
* Custom PyTorch training loop |
|
|
* Streaming datasets via HuggingFace |
|
|
* Mixed precision training |
|
|
* Gradient accumulation |
|
|
* Periodic checkpointing |
|
|
* Full monitoring (loss, perplexity, gradient norm, attention entropy); the sketch below illustrates the loss-derived metrics
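
A hedged sketch of how the loss-derived metrics can be computed (illustrative, not the actual monitoring code):

```python
import math
import torch

def perplexity(mean_ce_loss: float) -> float:
    # perplexity is the exponential of the mean cross-entropy loss
    return math.exp(mean_ce_loss)

def global_grad_norm(model: torch.nn.Module) -> float:
    # global L2 norm over all parameter gradients,
    # i.e. the quantity that gradient clipping bounds
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return total ** 0.5
```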
|
|
|
|
|
If you’d like the full training code or configs, feel free to reach out. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
The model weights are released under the MIT license.
The underlying datasets and the GPT-2 tokenizer retain their own licenses; please ensure compliance before commercial usage.
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
* HuggingFace 🤗 |
|
|
* PyTorch |
|
|
* GPT-2 tokenizer |
|
|
* Open research community |