# VicAI

A 5B parameter decoder-only transformer language model built from scratch in PyTorch.

## Overview

VicAI is a state-of-the-art language model featuring:

- **5.1B parameters** with 32 transformer layers
- **Grouped Query Attention (GQA)** for efficient inference
- **Rotary Position Embeddings (RoPE)** for better long-context modeling
- **SwiGLU activation** in feed-forward layers
- **RMSNorm** pre-normalization
- **Byte-level BPE tokenization** (32K vocabulary)

## Architecture

| Component | Specification |
|-----------|---------------|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |

## File Structure

```
vicai/
├── model.py            # Model architecture and VicAI 5B config
├── tokenizer.py        # BPE tokenizer implementation
├── dataset.py          # Data loading (Wikipedia + custom sources)
├── train.py            # Distributed training script
├── utils.py            # Training utilities and helpers
├── generate.py         # Text generation and inference
├── requirements.txt    # Dependencies
└── README.md           # This file
```

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Quick Start

### 1. Prepare Training Data

Option A: Create sample corpus from Wikipedia

```bash
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
```

Option B: Use your own text files

```bash
# Place your text files in the data/ directory
# Format: plain text with <|endoftext|> markers between documents
```

### 2. Train Tokenizer

```python
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus

# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)

# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')

# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
```

### 3. Train Model

Single GPU:

```bash
python train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --tokenizer tokenizer.pkl \
    --batch-size 4 \
    --max-steps 100000 \
    --output-dir checkpoints
```

Multi-GPU (DDP):

```bash
torchrun --nproc_per_node=4 train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --batch-size 1 \
    --max-steps 100000 \
    --output-dir checkpoints
```

Multi-GPU (FSDP):

```bash
torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --train-data data/train.txt \
    --batch-size 1 \
    --output-dir checkpoints
```
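For orientation, here is a minimal sketch of how the `--use-fsdp` path could wrap the model using PyTorch's `FullyShardedDataParallel`. `create_vicai_5b` comes from `model.py`, but `VicAIBlock` is a hypothetical name for the per-layer transformer block class, not a confirmed part of this repo:

```python
# Minimal FSDP wrapping sketch. "VicAIBlock" is a hypothetical name for the
# transformer block class in model.py; adjust it to the actual class name.
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from model import create_vicai_5b, VicAIBlock

dist.init_process_group("nccl")  # torchrun provides the rank/world-size env vars
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = create_vicai_5b(vocab_size=32000)

# Shard parameters at the granularity of one transformer block, so each
# rank only materializes a few blocks' full weights at a time.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={VicAIBlock},
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)
```

Under full sharding, each of the 8 ranks holds roughly one eighth of the parameters, gradients, and optimizer state, which is what lets the ~5.1B model train with `batch_size=1` on the 40GB cards listed in the hardware table below.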
### 4. Generate Text

Interactive mode:

```bash
python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --interactive
```

Single prompt:

```bash
python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --prompt "The future of AI is" \
    --max-new-tokens 256
```

## Training Configuration

### Default Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |

### Training Tips

- **Memory constrained?** Reduce batch size or use gradient accumulation
- **Longer context?** Increase `--max-seq-len` (up to 8192)
- **Faster training?** Enable `--compile` for torch.compile optimization
- **Better quality?** Train longer or use a larger dataset

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |

## Data Sources

The model can be trained on:

1. **Wikipedia** (streaming via API)
2. **OpenWebText** (filtered Common Crawl)
3. **Custom text files** (your own data)
4. **Mixed datasets** (combine multiple sources)

## Hardware Requirements

### Training

| GPUs | VRAM per GPU | Config |
|------|--------------|--------|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |

### Inference

- Minimum: 1x GPU with 16GB VRAM (with quantization)
- Recommended: 1x GPU with 24GB+ VRAM

## Model Architecture Details

### Grouped Query Attention

Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.

### Rotary Position Embeddings

Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.

### SwiGLU Feed-Forward

```python
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
```

This has been shown to improve training stability and performance.
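As a concrete reference, here is a self-contained sketch of that feed-forward block using the dimensions from the architecture table; the class and attribute names (`SwiGLUFFN`, `w1`/`w2`/`w3`) are illustrative rather than taken from `model.py`:

```python
# Self-contained SwiGLU feed-forward sketch with VicAI 5B dimensions.
# Names are illustrative; model.py may organize this differently.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    def __init__(self, dim: int = 4096, hidden_dim: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


# x = torch.randn(1, 8, 4096); SwiGLUFFN()(x).shape == (1, 8, 4096)
```

Note the two parallel input projections: the 4096→14336 gate (`w1`) and up (`w3`) paths are multiplied elementwise before `w2` projects back down to the model width, which is where the FFN Dim of 14336 in the architecture table comes in.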
## Example Usage

```python
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch

# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')

# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))

# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()

# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```

## Citation

If you use VicAI in your research, please cite:

```bibtex
@software{vicai2024,
  title = {VicAI: A 5B Parameter Language Model from Scratch},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/vicai}
}
```

## License

This project is licensed under the MIT License.

## Acknowledgments

- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from the Llama 2 paper
- SwiGLU from the PaLM paper

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Troubleshooting

### CUDA Out of Memory

- Reduce the batch size
- Enable gradient checkpointing
- Use FSDP for multi-GPU training
- Reduce the sequence length

### Slow Training

- Enable the `--compile` flag
- Use mixed precision (AMP)
- Ensure data is on fast storage (SSD)
- Use DataLoader with `num_workers > 0`

### Poor Generation Quality

- Train longer
- Use a larger, higher-quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check that the tokenizer was trained on similar data

## Important

This model is a throwaway test: there is no training data behind it, and I published it only because I want to test something. Do not use it or waste your time downloading it.