# MiniGPT — Lightweight Transformer for Text Generation
**MiniGPT** is a minimal yet powerful GPT-style language model built from scratch using PyTorch. It is designed for educational clarity, customization, and efficient real-time text generation. This project demonstrates the full training and inference pipeline of a decoder-only transformer architecture, including streaming capabilities and modern sampling strategies.
> Hosted with ❤️ by [@Austin207](https://huggingface.co/Austin207)
---
## Model Description
MiniGPT is a small, word-level transformer model with the following architecture:
* 4 transformer layers
* 4 attention heads
* Embedding dimension: 128
* FFN hidden size: 512
* Max sequence length: 128
* Word-level tokenizer (trained with Hugging Face `tokenizers`)
Despite its size, it supports advanced generation strategies (see the sampling sketch after this list), including:
* Repetition Penalty
* Temperature Sampling
* Top-K & Top-P (nucleus) sampling
* Real-time streaming output
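
These strategies combine in the usual GPT-style order. As a rough sketch of a single decoding step — not the exact code in `inference.py`; the `sample_next_token` name, the 1-D `logits` tensor, the `generated_ids` list, and the default values are illustrative assumptions:

```python
import torch

def sample_next_token(logits, generated_ids, temperature=1.0,
                      top_k=50, top_p=0.9, repetition_penalty=1.2):
    """One decoding step: repetition penalty -> temperature -> top-k -> top-p."""
    logits = logits.clone()

    # Repetition penalty (simplified): down-weight tokens already generated.
    for token_id in set(generated_ids):
        logits[token_id] /= repetition_penalty

    # Temperature scaling: higher values flatten the distribution.
    logits = logits / max(temperature, 1e-8)

    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability exceeds top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()   # always keep the single most likely token
    cutoff[0] = False
    logits[sorted_idx[cutoff]] = float("-inf")

    # Sample from the filtered distribution.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```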
---
## Usage
Install dependencies:
```bash
pip install torch tokenizers
```
Load the model and tokenizer:
```python
from miniGPT import MiniGPT
from inference import generate_stream
from tokenizers import Tokenizer
import torch

# Load tokenizer
tokenizer = Tokenizer.from_file("wordlevel.json")

# Load model
model = MiniGPT(
    vocab_size=tokenizer.get_vocab_size(),
    embed_dim=128,
    num_heads=4,
    ff_dim=512,
    num_layers=4,
    max_seq_len=128,
)

# Load checkpoint weights
checkpoint = torch.load("model_checkpoint_step20000.pt")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Generate text
prompt = "Beneath the ancient ruins"
generate_stream(model, tokenizer, prompt, max_new_tokens=60, temperature=1.0, top_k=50, top_p=0.9)
```
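
`torch.load` restores tensors to the device they were saved from; if the checkpoint came from a GPU run and you want CPU inference, pass `map_location` (standard PyTorch behavior, not specific to this repository):

```python
# Map all checkpoint tensors to CPU regardless of where they were saved.
checkpoint = torch.load("model_checkpoint_step20000.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
```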
---
## Training
Train from scratch on any plain-text dataset:
```bash
python training.py
```
Training includes:
* Checkpointing
* Sample generation previews
* Word-level tokenization with `tokenizers` (see the sketch after this list)
* Custom datasets: start from `alphabetical_dataset.txt` or supply your own plain-text file
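
If you need to rebuild `wordlevel.json` yourself, a minimal sketch with the `tokenizers` library looks like this (the special tokens and pre-tokenizer here are assumptions; match whatever `training.py` and `Tokenizer.py` expect):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Word-level model with an explicit unknown token.
tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# Special tokens are illustrative; align them with the training script.
trainer = WordLevelTrainer(special_tokens=["<unk>", "<pad>"])
tokenizer.train(files=["alphabetical_dataset.txt"], trainer=trainer)
tokenizer.save("wordlevel.json")
```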
---
## Files in This Repository
| File | Purpose |
| -------------------------- | ---------------------------- |
| `miniGPT.py` | Core Transformer model |
| `transformer.py` | Transformer block logic |
| `multiheadattention.py` | Multi-head attention module |
| `Tokenizer.py` | Tokenizer loader |
| `training.py` | Training loop |
| `inference.py` | CLI and streaming generation |
| `dataprocess.py` | Text preprocessing tools |
| `wordlevel.json` | Trained word-level tokenizer |
| `alphabetical_dataset.txt` | Sample dataset |
| `requirements.txt` | Required dependencies |
---
## Model Card
| Property | Value |
| ------------ | --------------------------------- |
| Model Type | Decoder-only GPT |
| Size | Small (\~4.6M params) |
| Trained On | Word-level dataset (custom) |
| Intended Use | Text generation, educational demo |
| License | MIT |
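
The ~4.6M figure depends on the tokenizer's vocabulary size; once the model is instantiated as in the Usage section, it can be checked directly:

```python
# Count trainable parameters of the instantiated MiniGPT model.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params / 1e6:.2f}M parameters")
```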
---
## Intended Use and Limitations
This model is meant for educational, experimental, and research purposes. It is not suitable for commercial or production use out-of-the-box. Expect limitations in coherence, factuality, and long-context reasoning.
---
## Contributions
We welcome improvements, bug fixes, and new features!
```bash
# Fork, clone, and create a branch
git clone https://github.com/austin207/Transformer-Virtue-v2.git
cd Transformer-Virtue-v2
git checkout -b feature/your-feature
```
Then open a pull request!
---
## License
This project is licensed under the [MIT License](https://github.com/austin207/Transformer-Virtue-v2/blob/main/LICENSE).
---
## Explore More
* Based on the decoder-only GPT architecture from OpenAI
* Inspired by [karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)
* Compatible with Hugging Face tools and tokenizer ecosystem