VicAI
A 5B parameter decoder-only transformer language model built from scratch in PyTorch.
Overview
VicAI is a decoder-only language model built from scratch, featuring:
- 5.1B parameters with 32 transformer layers
- Grouped Query Attention (GQA) for efficient inference
- Rotary Position Embeddings (RoPE) for better long-context modeling
- SwiGLU activation in feed-forward layers
- RMSNorm pre-normalization
- Byte-level BPE tokenization (32K vocabulary)
Architecture
| Component | Specification |
|---|---|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
File Structure
vicai/
├── model.py          # Model architecture and VicAI 5B config
├── tokenizer.py      # BPE tokenizer implementation
├── dataset.py        # Data loading (Wikipedia + custom sources)
├── train.py          # Distributed training script
├── utils.py          # Training utilities and helpers
├── generate.py       # Text generation and inference
├── requirements.txt  # Dependencies
└── README.md         # This file
Installation
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Quick Start
1. Prepare Training Data
Option A: Create sample corpus from Wikipedia
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
Option B: Use your own text files
# Place your text files in data/ directory
# Format: plain text with <|endoftext|> markers between documents
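For Option B, a small script like the one below (illustrative only; the data/raw directory and filenames are placeholders) can merge individual documents into a single training file with <|endoftext|> separators:
from pathlib import Path

# Hypothetical helper: join individual .txt documents into one training file,
# inserting <|endoftext|> between documents as the format above expects.
docs = sorted(Path('data/raw').glob('*.txt'))
with open('data/train.txt', 'w', encoding='utf-8') as out:
    for doc in docs:
        text = doc.read_text(encoding='utf-8').strip()
        if text:
            out.write(text + '\n<|endoftext|>\n')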
2. Train Tokenizer
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus
# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)
# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')
# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
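As a quick sanity check, continuing from the snippet above (encode, decode, and len are the same tokenizer methods used in the generation example later in this README):
# Round trip: text -> token ids -> text
ids = tokenizer.encode('Hello, VicAI!')
print(ids)
print(tokenizer.decode(ids))
print('vocab size:', len(tokenizer))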
3. Train Model
Single GPU:
python train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--tokenizer tokenizer.pkl \
--batch-size 4 \
--max-steps 100000 \
--output-dir checkpoints
Multi-GPU (DDP):
torchrun --nproc_per_node=4 train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--batch-size 1 \
--max-steps 100000 \
--output-dir checkpoints
Multi-GPU (FSDP):
torchrun --nproc_per_node=8 train.py \
--use-fsdp \
--train-data data/train.txt \
--batch-size 1 \
--output-dir checkpoints
4. Generate Text
Interactive mode:
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--interactive
Single prompt:
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--prompt "The future of AI is" \
--max-new-tokens 256
Training Configuration
Default Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
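The warmup and minimum-LR values suggest a warmup-then-decay schedule. A minimal sketch of linear warmup followed by cosine decay to the minimum LR (an assumption about how train.py schedules the rate) looks like this:
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, max_steps=100_000):
    # Linear warmup from 0 to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))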
Training Tips
- Memory constrained? Reduce batch size or use gradient accumulation (see the sketch after this list)
- Longer context? Increase --max-seq-len (up to 8192)
- Faster training? Enable --compile for torch.compile optimization
- Better quality? Train longer or use a larger dataset
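The gradient-accumulation pattern mentioned in the first tip looks roughly like this (a self-contained toy; swap in the real model, DataLoader, and loss from train.py):
import torch

# Toy stand-ins so the pattern runs as-is; replace with the VicAI model and data.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
batches = [torch.randn(4, 16) for _ in range(32)]

accum_steps = 8  # effective batch size = micro-batch * accum_steps
optimizer.zero_grad()
for i, batch in enumerate(batches):
    loss = model(batch).pow(2).mean()   # placeholder loss
    (loss / accum_steps).backward()     # scale so gradients average over the window
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()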
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |
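The sketch below shows how these parameters typically interact when sampling a single next token (a generic implementation for illustration, not necessarily identical to generate.py; repetition_penalty is omitted for brevity):
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    # logits: (vocab_size,) scores for the next token
    logits = logits / temperature                      # sharpen or flatten the distribution
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float('-inf')           # keep only the top-k tokens
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        cutoff = cumulative - probs > top_p            # drop tokens outside the nucleus
        sorted_logits[cutoff] = float('-inf')
        logits = torch.full_like(logits, float('-inf')).scatter(0, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

next_id = sample_next_token(torch.randn(32000))        # demo with random logits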
Data Sources
The model can be trained on:
- Wikipedia (streaming via API)
- OpenWebText (open replication of GPT-2's WebText corpus)
- Custom text files (your own data)
- Mixed datasets (combine multiple sources)
Hardware Requirements
Training
| GPUs | VRAM per GPU | Config |
|---|---|---|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |
Inference
- Minimum: 1x GPU with 16GB VRAM (with quantization; the 5.1B fp16 weights alone occupy roughly 10 GB, leaving little headroom for the KV cache)
- Recommended: 1x GPU with 24GB+ VRAM
Model Architecture Details
Grouped Query Attention
Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
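A minimal sketch of the idea, using the shapes from the table above (the real attention module in model.py will differ in details):
import torch

n_heads, n_kv_heads, head_dim = 32, 8, 128    # 32 * 128 = 4096 hidden dim
seq_len = 16
groups = n_heads // n_kv_heads                # each KV head serves 4 query heads

q = torch.randn(1, n_heads, seq_len, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# Repeat the 8 KV heads so they line up with the 32 query heads,
# then run ordinary causal scaled-dot-product attention.
k = k.repeat_interleave(groups, dim=1)
v = v.repeat_interleave(groups, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)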
Rotary Position Embeddings
Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
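A compact sketch of rotary embeddings applied to a query or key tensor (one common formulation; model.py may organize the math differently):
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq_len, head_dim). Rotate channel pairs by a
    # position-dependent angle so attention scores depend on relative position.
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = torch.arange(t).float()[:, None] * inv_freq[None, :]    # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                              # even/odd channels
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                       # interleave back to head_dim

q = apply_rope(torch.randn(1, 32, 16, 128))   # demo with random queries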
SwiGLU Feed-Forward
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
This has been shown to improve training stability and performance.
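In PyTorch the block above corresponds to something like the following (a sketch using the dimensions from the Architecture table; naming in model.py may differ):
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim=4096, hidden_dim=14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        # FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))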
Example Usage
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch
# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')
# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))
# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()
# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )
generated = tokenizer.decode(output[0].tolist())
print(generated)
Citation
If you use VicAI in your research, please cite:
@software{vicai2024,
  title  = {VicAI: A 5B Parameter Language Model from Scratch},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/yourusername/vicai}
}
License
This project is licensed under the MIT License.
Acknowledgments
- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from Ainslie et al. (2023), as used in Llama 2
- SwiGLU from Shazeer (2020), as used in PaLM
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Troubleshooting
CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing (see the sketch after this list)
- Use FSDP for multi-GPU training
- Reduce sequence length
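Gradient checkpointing (second item above) trades compute for memory by recomputing activations during the backward pass. A toy sketch with PyTorch's built-in utility (how model.py actually wires this up may differ):
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())  # stand-in for a transformer block
x = torch.randn(4, 64, requires_grad=True)

# Activations inside `block` are freed after the forward pass and
# recomputed during backward, cutting peak memory at the cost of extra compute.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()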
Slow Training
- Enable the --compile flag
- Use mixed precision (AMP); a minimal training-step sketch follows this list
- Ensure data is on fast storage (SSD)
- Use DataLoader with num_workers > 0
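A mixed-precision training step with PyTorch AMP looks roughly like this (illustrative toy; not necessarily how train.py wires it up):
import torch

model = torch.nn.Linear(16, 16).cuda()          # stand-in for the VicAI model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()            # scales the loss to avoid fp16 underflow

batch = torch.randn(4, 16, device='cuda')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    loss = model(batch).pow(2).mean()           # placeholder loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()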
Poor Generation Quality
- Train longer
- Use larger, higher quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check tokenizer was trained on similar data
Important
This is a test model. There is no training data behind it; it was published only so the author could test something. Do not use it or spend time downloading it.