|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- shivendrra/consolidated-datasets |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- perplexity |
|
|
tags: |
|
|
- base-model
|
|
- text-generation |
|
|
- nlp |
|
|
- custom_code |
|
|
- causal-llm
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
|
|
|
# TinyWay-1.2.0 |
|
|
|
|
|
**TinyWay-1.2.0** is a lightweight GPT-style causal language model (~110M parameters) trained from scratch on a mixed streaming corpus (web text, stories, and code). |
|
|
The model is designed for research, experimentation, and educational purposes, with an emphasis on transparent architecture and reproducible training. |
|
|
|
|
|
> ⚡ Trained end-to-end using a custom PyTorch pipeline with mixed precision, gradient accumulation, and streaming datasets. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
| Property | Value | |
|
|
| ----------------- | ------------------------------------ | |
|
|
| Model type | Decoder-only Transformer (GPT-style) | |
|
|
| Parameters | **~109.6M** | |
|
|
| Layers | 10 | |
|
|
| Hidden size | 768 | |
|
|
| Attention heads | 12 | |
|
|
| Context length | 256 tokens | |
|
|
| Activation | GELU | |
|
|
| Dropout | 0.1 | |
|
|
| Precision | fp16 / bf16 | |
|
|
| Weight tying | Token embedding tied with LM head | |
|
|
| Position encoding | Learned absolute embeddings | |
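
For orientation, the sketch below mirrors these dimensions in plain PyTorch. It is an illustrative approximation, not the actual `modeling_tinyway.py` code; the class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TinyWaySketch(nn.Module):
    """Hypothetical stand-in matching the table above (~110M params)."""

    def __init__(self, vocab_size=50257, n_layers=10, d_model=768,
                 n_heads=12, max_len=256, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned absolute positions
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, dropout=dropout,
            activation="gelu", batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying with the embedding

    def forward(self, idx):
        t = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(t, device=idx.device))
        # causal mask: each position attends only to itself and earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(t, device=idx.device)
        x = self.blocks(x, mask=mask, is_causal=True)
        return self.lm_head(self.ln_f(x))
```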
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
|
|
|
The model was trained using **streaming data** from: |
|
|
|
|
|
* 🌍 Web text |
|
|
* 📚 Stories |
|
|
* 💻 Code |
|
|
|
|
|
via the HuggingFace dataset: |
|
|
|
|
|
``` |
|
|
shivendrra/consolidated-datasets |
|
|
``` |
|
|
|
|
|
Streaming was used to avoid large local storage requirements and to allow continuous sampling directly from the HuggingFace Hub.
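
As a minimal sketch (assuming the dataset exposes a `"text"` field; adjust to the actual schema):

```python
from datasets import load_dataset

# Stream examples on the fly instead of downloading the corpus locally
ds = load_dataset("shivendrra/consolidated-datasets",
                  split="train", streaming=True)
for example in ds.take(3):
    print(example["text"][:100])
```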
|
|
|
|
|
--- |
|
|
|
|
|
### Tokenization |
|
|
|
|
|
* Tokenizer: **GPT2TokenizerFast** |
|
|
* Vocabulary size: **50,257** |
|
|
* Special tokens: |
|
|
|
|
|
* `bos_token_id = eos_token_id = pad_token_id = 50256` |
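
This configuration can be reproduced in a few lines (the `pad_token` assignment is needed because GPT-2 does not define one by default):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # <|endoftext|>, id 50256

print(tokenizer.vocab_size)    # 50257
print(tokenizer.eos_token_id)  # 50256
```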
|
|
|
|
|
--- |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
| Setting | Value | |
|
|
| --------------------- | ---------------------------- | |
|
|
| Sequence length | 256 | |
|
|
| Effective batch size | 64 sequences | |
|
|
| Optimizer | AdamW | |
|
|
| Learning rate | 3e-4 (cosine decay + warmup) | |
|
|
| Betas | (0.9, 0.95) | |
|
|
| Weight decay | 0.1 | |
|
|
| Gradient clipping | 1.0 | |
|
|
| Mixed precision | AMP (fp16 / bf16) | |
|
|
| Gradient accumulation | Yes | |
|
|
| Training steps | ~60k | |
|
|
| Total tokens | ~1B |
|
|
|
|
|
Final training loss ≈ **3.0**

Final perplexity ≈ **20** (perplexity = e^loss, and e^3.0 ≈ 20)
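
The update step roughly corresponds to the sketch below. The actual training loop is custom and unpublished, so treat this as illustrative: `model`, `loader`, the warmup length, and the micro-batch split are all assumptions.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=60_000)  # warmup assumed
scaler = torch.cuda.amp.GradScaler()  # for fp16; bf16 does not need a scaler
accum = 8  # e.g. micro-batch 8 x 8 accumulation steps = 64 effective sequences

for step, batch in enumerate(loader):
    with torch.autocast("cuda"):
        loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    scaler.scale(loss / accum).backward()  # accumulate scaled gradients
    if (step + 1) % accum == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```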
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Load with Transformers (Custom Code Required) |
|
|
|
|
|
This repository uses a custom model definition (`modeling_tinyway.py`).
Pass `trust_remote_code=True` so Transformers can fetch and run the custom architecture code from the Hub.
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because the architecture is defined in
# the repository's custom modeling_tinyway.py
model = AutoModelForCausalLM.from_pretrained(
    "NNEngine/TinyWay-1.2.0", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```
|
|
|
|
|
--- |
|
|
|
|
|
### Text Generation Example |
|
|
|
|
|
```python
import torch

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# sample up to 200 new tokens with temperature, top-k, and nucleus filtering
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # silences the missing-pad warning
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
--- |
|
|
|
|
|
## Example Generations |
|
|
|
|
|
The model demonstrates: |
|
|
|
|
|
* ✅ Coherent sentence structure |
|
|
* ✅ Narrative flow in stories |
|
|
* ✅ Reasonable grammar and punctuation |
|
|
* ⚠️ Occasional repetition and topic drift (expected for this scale) |
|
|
|
|
|
This is a research-grade small LLM; it is not instruction-tuned and will continue text rather than follow instructions.
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* ❌ Not instruction-tuned |
|
|
* ❌ Limited reasoning depth compared to large LLMs |
|
|
* ❌ Context length limited to 256 tokens |
|
|
* ⚠️ May hallucinate or generate inconsistent facts |
|
|
* ⚠️ Training data may contain noise from web sources |
|
|
|
|
|
Use responsibly. |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
* Research experiments |
|
|
* Educational purposes |
|
|
* Model scaling studies |
|
|
* Training pipeline benchmarking |
|
|
* Custom fine-tuning experiments |
|
|
|
|
|
Not recommended for production or safety-critical applications. |
|
|
|
|
|
--- |
|
|
|
|
|
## Reproducibility |
|
|
|
|
|
The model was trained using: |
|
|
|
|
|
* Custom PyTorch training loop |
|
|
* Streaming datasets via HuggingFace |
|
|
* Mixed precision training |
|
|
* Gradient accumulation |
|
|
* Periodic checkpointing |
|
|
* Full monitoring (loss, perplexity, gradient norm, attention entropy); the sketch below illustrates the loss-derived metrics
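
A hedged sketch of how the loss-derived metrics can be computed (illustrative, not the actual monitoring code):

```python
import math
import torch

def perplexity(mean_ce_loss: float) -> float:
    # perplexity is the exponential of the mean cross-entropy loss
    return math.exp(mean_ce_loss)

def global_grad_norm(model: torch.nn.Module) -> float:
    # global L2 norm over all parameter gradients,
    # i.e. the quantity that gradient clipping bounds
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return total ** 0.5
```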
|
|
|
|
|
If you’d like the full training code or configs, feel free to reach out. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
The model weights are released under the MIT license.
The underlying datasets and the GPT-2 tokenizer retain their own licenses; please ensure compliance before commercial usage.
|
|
|
|
|
--- |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
* HuggingFace 🤗 |
|
|
* PyTorch |
|
|
* GPT-2 tokenizer |
|
|
* Open research community |