# MiniGPT — Lightweight Transformer for Text Generation

**MiniGPT** is a minimal GPT-style language model built from scratch in PyTorch, designed for educational clarity, easy customization, and efficient real-time text generation. The project demonstrates the full training and inference pipeline of a decoder-only transformer, including streaming output and modern sampling strategies.

> Hosted with ❤️ by [@Austin207](https://huggingface.co/Austin207)

---
## Model Description

MiniGPT is a small, word-level transformer model with the following architecture:

* 4 transformer layers
* 4 attention heads
* Embedding dimension: 128
* Feed-forward (FFN) hidden size: 512
* Max sequence length: 128
* Word-level tokenizer (trained with Hugging Face `tokenizers`)

Despite its size, it supports advanced generation strategies, including:

* Repetition penalty
* Temperature sampling
* Top-k and top-p (nucleus) sampling
* Real-time streaming output

---
## Usage

Install dependencies:

```bash
pip install torch tokenizers
```
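Alternatively, the repository includes a `requirements.txt` (see the file list below), so the full dependency set can be installed in one step:

```bash
pip install -r requirements.txt
```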
Load the model and tokenizer:

```python
from miniGPT import MiniGPT
from inference import generate_stream
from tokenizers import Tokenizer
import torch

# Load the trained word-level tokenizer
tokenizer = Tokenizer.from_file("wordlevel.json")

# Build the model with the same hyperparameters used during training
model = MiniGPT(
    vocab_size=tokenizer.get_vocab_size(),
    embed_dim=128,
    num_heads=4,
    ff_dim=512,
    num_layers=4,
    max_seq_len=128
)

# Load a checkpoint (map_location keeps this working on CPU-only machines)
checkpoint = torch.load("model_checkpoint_step20000.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Generate text from a prompt with streaming output
prompt = "Beneath the ancient ruins"
generate_stream(model, tokenizer, prompt, max_new_tokens=60, temperature=1.0, top_k=50, top_p=0.9)
```
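Conceptually, `generate_stream` is an autoregressive sampling loop that applies the strategies listed above (repetition penalty, temperature, top-k, and top-p). The sketch below is illustrative only, not the code in `inference.py`: the helper names are made up, and it assumes the model maps a `(1, seq_len)` tensor of token IDs to logits of shape `(1, seq_len, vocab_size)`. (To verify the parameter count quoted in the model card, `sum(p.numel() for p in model.parameters())` gives the total.)

```python
import torch

def sample_next_token(logits, generated, temperature=1.0, top_k=50, top_p=0.9, repetition_penalty=1.2):
    # Repetition penalty: down-weight tokens that have already been generated
    for tok in set(generated):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 else logits[tok] * repetition_penalty

    # Temperature: flatten (>1.0) or sharpen (<1.0) the distribution
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens
    top_k = min(top_k, logits.size(-1))
    kth_best = torch.topk(logits, top_k).values[-1]
    logits[logits < kth_best] = float("-inf")

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability exceeds p
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cutoff = torch.cumsum(sorted_probs, dim=-1) > top_p
    cutoff[0] = False                                   # always keep the most likely token
    sorted_probs[cutoff] = 0.0
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    probs = probs / probs.sum()

    return torch.multinomial(probs, num_samples=1).item()

@torch.no_grad()
def stream_text(model, tokenizer, prompt, max_new_tokens=60, **sampling_kwargs):
    ids = tokenizer.encode(prompt).ids
    for _ in range(max_new_tokens):
        x = torch.tensor([ids[-128:]])                  # crop the context to max_seq_len
        logits = model(x)[0, -1, :]                     # next-token logits (assumed output shape)
        next_id = sample_next_token(logits, ids, **sampling_kwargs)
        ids.append(next_id)
        print(tokenizer.decode([next_id]), end=" ", flush=True)   # stream word by word
```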
---

## Training

Train from scratch on any plain-text dataset:

```bash
python training.py
```
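The repository ships a trained tokenizer (`wordlevel.json`), but if you want to rebuild it for a new dataset, a word-level tokenizer can be trained with the Hugging Face `tokenizers` library. This is a minimal sketch; the special tokens and training file are assumptions, so match them to what `Tokenizer.py` and `training.py` actually expect:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Word-level model with simple whitespace pre-tokenization
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Special tokens here are placeholders; use whatever the project expects
trainer = trainers.WordLevelTrainer(special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["alphabetical_dataset.txt"], trainer=trainer)

tokenizer.save("wordlevel.json")
```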
Training includes:

* Checkpointing
* Sample generation previews
* Word-level tokenization with `tokenizers`
* Custom datasets via `alphabetical_dataset.txt` or your own

---
## Files in This Repository

| File | Purpose |
| -------------------------- | ---------------------------- |
| `miniGPT.py` | Core Transformer model |
| `transformer.py` | Transformer block logic |
| `multiheadattention.py` | Multi-head attention module |
| `Tokenizer.py` | Tokenizer loader |
| `training.py` | Training loop |
| `inference.py` | CLI and streaming generation |
| `dataprocess.py` | Text preprocessing tools |
| `wordlevel.json` | Trained word-level tokenizer |
| `alphabetical_dataset.txt` | Sample dataset |
| `requirements.txt` | Required dependencies |

---
## Model Card

| Property | Value |
| ------------ | --------------------------------- |
| Model Type | Decoder-only GPT |
| Size | Small (~4.6M params) |
| Trained On | Word-level dataset (custom) |
| Intended Use | Text generation, educational demo |
| License | MIT |

---
## Intended Use and Limitations

This model is meant for educational, experimental, and research purposes. It is not suitable for commercial or production use out of the box. Expect limitations in coherence, factuality, and long-context reasoning.

---
## Contributions

We welcome improvements, bug fixes, and new features!

```bash
# Fork, clone, and create a branch
git clone https://github.com/austin207/Transformer-Virtue-v2.git
cd Transformer-Virtue-v2
git checkout -b feature/your-feature
```
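After making your changes, commit them and push the branch to your fork (the commit message below is just a placeholder):

```bash
git add .
git commit -m "Describe your change"
git push -u origin feature/your-feature
```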
Then open a pull request!

---

## License

This project is licensed under the [MIT License](https://github.com/austin207/Transformer-Virtue-v2/blob/main/LICENSE).

---
## Explore More

* Based on the GPT architecture from OpenAI
* Inspired by [karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)
* Compatible with Hugging Face tools and the tokenizer ecosystem