# VicAI

A 5B parameter decoder-only transformer language model built from scratch in PyTorch.

## Overview

VicAI is a decoder-only transformer language model featuring:
- **5.1B parameters** with 32 transformer layers
- **Grouped Query Attention (GQA)** for efficient inference
- **Rotary Position Embeddings (RoPE)** for better long-context modeling
- **SwiGLU activation** in feed-forward layers
- **RMSNorm** pre-normalization
- **Byte-level BPE tokenization** (32K vocabulary)

## Architecture

| Component | Specification |
|-----------|---------------|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
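
For reference, the table maps onto a model configuration roughly like the one below (a minimal sketch; the class and field names are illustrative and may not match `model.py` exactly):

```python
from dataclasses import dataclass

@dataclass
class VicAIConfig:
    """Hypothetical config mirroring the architecture table above."""
    vocab_size: int = 32_000
    n_layers: int = 32
    hidden_dim: int = 4096
    ffn_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8   # GQA: every 4 query heads share one KV head
    max_seq_len: int = 8192

config = VicAIConfig()
assert config.n_heads % config.n_kv_heads == 0
head_dim = config.hidden_dim // config.n_heads  # 128
```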

## File Structure

```
vicai/
├── model.py            # Model architecture and VicAI 5B config
├── tokenizer.py        # BPE tokenizer implementation
├── dataset.py          # Data loading (Wikipedia + custom sources)
├── train.py            # Distributed training script
├── utils.py            # Training utilities and helpers
├── generate.py         # Text generation and inference
├── requirements.txt    # Dependencies
└── README.md           # This file
```

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Quick Start

### 1. Prepare Training Data

Option A: Create a sample corpus from Wikipedia
```bash
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
```

Option B: Use your own text files
```bash
# Place your text files in the data/ directory
# Format: plain text with <|endoftext|> markers between documents
```
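
For example, such a corpus file can be assembled along these lines (an illustrative sketch; only the `<|endoftext|>` separator convention comes from the format described above):

```python
from pathlib import Path

# Hypothetical list of documents; in practice, read these from your own files.
documents = [
    "First document text ...",
    "Second document text ...",
]

out_path = Path("data/train.txt")
out_path.parent.mkdir(parents=True, exist_ok=True)
# Join documents with the <|endoftext|> marker expected by the data loader.
out_path.write_text("<|endoftext|>".join(documents), encoding="utf-8")
```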

### 2. Train Tokenizer

```python
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus

# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)

# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')

# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
```
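
After training, a quick encode/decode round trip is an easy sanity check (this assumes the `load`, `encode`, and `decode` methods used elsewhere in this README):

```python
from tokenizer import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')

ids = tokenizer.encode("Hello, VicAI!")
print(ids)                    # token IDs
print(tokenizer.decode(ids))  # should reproduce the original text
```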

### 3. Train Model

Single GPU:
```bash
python train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --tokenizer tokenizer.pkl \
    --batch-size 4 \
    --max-steps 100000 \
    --output-dir checkpoints
```

Multi-GPU (DDP):
```bash
torchrun --nproc_per_node=4 train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --batch-size 1 \
    --max-steps 100000 \
    --output-dir checkpoints
```

Multi-GPU (FSDP):
```bash
torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --train-data data/train.txt \
    --batch-size 1 \
    --output-dir checkpoints
```

### 4. Generate Text

Interactive mode:
```bash
python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --interactive
```

Single prompt:
```bash
python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --prompt "The future of AI is" \
    --max-new-tokens 256
```

## Training Configuration

### Default Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |

### Training Tips

- **Memory constrained?** Reduce the batch size or use gradient accumulation (see the sketch below)
- **Longer context?** Increase `--max-seq-len` (up to 8192)
- **Faster training?** Enable `--compile` to use torch.compile
- **Better quality?** Train for more steps or use a larger dataset
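
Gradient accumulation trades compute for memory by splitting each optimizer step across several micro-batches. A minimal sketch of the idea (not necessarily how `train.py` implements it):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 8                  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for micro_step in range(accum_steps):
    x = torch.randn(1, 16)           # one micro-batch
    loss = model(x).pow(2).mean()    # dummy loss for illustration
    (loss / accum_steps).backward()  # scale so gradients average over micro-batches
optimizer.step()
optimizer.zero_grad()
```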

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |
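
For reference, temperature, top-k, and top-p combine roughly as follows when sampling the next token (a simplified sketch of the usual procedure, with repetition penalty omitted; `generate.py` may implement this differently):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Sample one token id from a 1D tensor of vocabulary logits."""
    logits = logits / temperature  # <1.0 sharpens, >1.0 flattens the distribution
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))  # keep only top-k
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        remove = cumulative > top_p
        remove[1:] = remove[:-1].clone()  # shift so the first token over the threshold is kept
        remove[0] = False
        sorted_probs[remove] = 0.0
        probs = torch.zeros_like(probs).scatter_(0, sorted_idx, sorted_probs)
        probs = probs / probs.sum()       # renormalize the nucleus
    return torch.multinomial(probs, num_samples=1).item()

next_id = sample_next_token(torch.randn(32_000))
print(next_id)
```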

## Data Sources

The model can be trained on:

1. **Wikipedia** (streaming via API)
2. **OpenWebText** (Common Crawl filtered)
3. **Custom text files** (your own data)
4. **Mixed datasets** (combine multiple sources)

## Hardware Requirements

### Training

| GPUs | VRAM per GPU | Config |
|------|--------------|--------|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |

### Inference

- Minimum: 1x GPU with 16GB VRAM (with quantization)
- Recommended: 1x GPU with 24GB+ VRAM

## Model Architecture Details

### Grouped Query Attention

Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
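
Conceptually, the 8 KV heads are shared across the 32 query heads by repeating each KV head for its group of queries (a simplified sketch of the mechanism, not the exact code in `model.py`):

```python
import torch
import torch.nn.functional as F

batch, seq, n_heads, n_kv_heads, head_dim = 1, 16, 32, 8, 128
group_size = n_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand KV heads so every query head has a matching key/value head
k = k.repeat_interleave(group_size, dim=1)  # (batch, 32, seq, head_dim)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 16, 128])
```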

### Rotary Position Embeddings

Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
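
A compact sketch of rotary embeddings using the common "rotate half" formulation (illustrative only; the layout in `model.py` may differ):

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, heads, seq, head_dim)."""
    _, _, seq, dim = x.shape
    half = dim // 2
    # Per-dimension rotation frequencies, highest for the first pairs
    freqs = 1.0 / (theta ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = apply_rope(torch.randn(1, 32, 16, 128))
k = apply_rope(torch.randn(1, 8, 16, 128))
```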

### SwiGLU Feed-Forward

```
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
```

This has been shown to improve training stability and performance.
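
Written out as a PyTorch module, the formula above corresponds to something like the following (a sketch mirroring the dimensions from the architecture table; the actual class in `model.py` may be named and organized differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """FFN(x) = W2(silu(W1 x) * W3 x), as used in the feed-forward blocks."""
    def __init__(self, hidden_dim: int = 4096, ffn_dim: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # up projection
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLU(hidden_dim=64, ffn_dim=256)  # small dims just for a quick shape check
print(ffn(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])
```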

## Example Usage

```python
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch

# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')

# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))

# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()
model.eval()  # disable dropout etc. for inference

# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```

## Citation

If you use VicAI in your research, please cite:

```bibtex
@software{vicai2024,
  title = {VicAI: A 5B Parameter Language Model from Scratch},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/vicai}
}
```

## License

This project is licensed under the MIT License.

## Acknowledgments

- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from the Llama 2 paper
- SwiGLU from the PaLM paper

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Troubleshooting

### CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing
- Use FSDP for multi-GPU training
- Reduce sequence length

### Slow Training
- Enable the `--compile` flag
- Use mixed precision (AMP); see the sketch below
- Ensure data is on fast storage (SSD)
- Use DataLoader `num_workers > 0`
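
As a reference for the AMP tip above, mixed-precision training usually follows this pattern (a generic PyTorch sketch; `train.py` may already handle this internally):

```python
import torch

model = torch.nn.Linear(16, 16).cuda()  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 16, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).pow(2).mean()  # dummy loss for illustration

scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then step
scaler.update()
optimizer.zero_grad()
```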

### Poor Generation Quality
- Train for more steps
- Use a larger, higher-quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check that the tokenizer was trained on data similar to the training corpus

## Important

This is a test model: it has not been trained on any real data and exists only to try out the codebase. Please do not use it or spend time downloading it.