# VicAI
A 5B parameter decoder-only transformer language model built from scratch in PyTorch.
## Overview
VicAI is a decoder-only transformer language model featuring:
- **5.1B parameters** with 32 transformer layers
- **Grouped Query Attention (GQA)** for efficient inference
- **Rotary Position Embeddings (RoPE)** for better long-context modeling
- **SwiGLU activation** in feed-forward layers
- **RMSNorm** pre-normalization
- **Byte-level BPE tokenization** (32K vocabulary)
## Architecture
| Component | Specification |
|-----------|---------------|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
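For reference, the table maps onto a configuration object roughly like the following. This is an illustrative sketch; the field names are assumptions and may not match the actual config class in `model.py`.
```python
from dataclasses import dataclass

@dataclass
class VicAIConfig:
    """Illustrative config mirroring the architecture table above."""
    vocab_size: int = 32_000
    n_layers: int = 32
    hidden_dim: int = 4096
    ffn_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8        # grouped query attention
    max_seq_len: int = 8192
```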
## File Structure
```
vicai/
β”œβ”€β”€ model.py # Model architecture and VicAI 5B config
β”œβ”€β”€ tokenizer.py # BPE tokenizer implementation
β”œβ”€β”€ dataset.py # Data loading (Wikipedia + custom sources)
β”œβ”€β”€ train.py # Distributed training script
β”œβ”€β”€ utils.py # Training utilities and helpers
β”œβ”€β”€ generate.py # Text generation and inference
β”œβ”€β”€ requirements.txt # Dependencies
└── README.md # This file
```
## Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
## Quick Start
### 1. Prepare Training Data
Option A: Create sample corpus from Wikipedia
```bash
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
```
Option B: Use your own text files
```bash
# Place your text files in data/ directory
# Format: plain text with <|endoftext|> markers between documents
```
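If you are assembling your own corpus, one simple way to produce a file in this format is to join documents with the marker. This is a minimal sketch; the `my_texts/` directory is a placeholder for wherever your raw files live.
```python
from pathlib import Path

# Concatenate individual documents into one training file,
# separated by the <|endoftext|> marker expected by dataset.py.
docs = [p.read_text(encoding="utf-8") for p in sorted(Path("my_texts").glob("*.txt"))]
Path("data/train.txt").write_text("<|endoftext|>".join(docs), encoding="utf-8")
```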
### 2. Train Tokenizer
```python
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus
# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)
# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')
# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
```
### 3. Train Model
Single GPU:
```bash
python train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--tokenizer tokenizer.pkl \
--batch-size 4 \
--max-steps 100000 \
--output-dir checkpoints
```
Multi-GPU (DDP):
```bash
torchrun --nproc_per_node=4 train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--batch-size 1 \
--max-steps 100000 \
--output-dir checkpoints
```
Multi-GPU (FSDP):
```bash
torchrun --nproc_per_node=8 train.py \
--use-fsdp \
--train-data data/train.txt \
--batch-size 1 \
--output-dir checkpoints
```
### 4. Generate Text
Interactive mode:
```bash
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--interactive
```
Single prompt:
```bash
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--prompt "The future of AI is" \
--max-new-tokens 256
```
## Training Configuration
### Default Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
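These defaults suggest AdamW with linear warmup followed by decay toward the minimum learning rate. The sketch below assumes a cosine decay shape, which is common but not confirmed here; check `train.py` for the actual schedule.
```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, max_steps=100_000):
    """Linear warmup to max_lr, then (assumed) cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```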
### Training Tips
- **Memory constrained?** Reduce the batch size or use gradient accumulation (see the sketch after this list)
- **Longer context?** Increase `--max-seq-len` (up to 8192)
- **Faster training?** Enable `--compile` for torch.compile optimization
- **Better quality?** Train longer or use larger dataset
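Gradient accumulation keeps the effective batch size large while holding only a small batch in memory. The loop below is a generic PyTorch sketch with toy stand-ins, not the actual training loop in `train.py`.
```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop runs; replace with the real model and dataloader.
model = nn.Linear(16, 32000)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()
train_loader = [(torch.randn(2, 16), torch.randint(0, 32000, (2,))) for _ in range(16)]

accum_steps = 8  # effective batch size = per-device batch size * accum_steps

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    loss = loss_fn(model(inputs), targets)
    (loss / accum_steps).backward()          # scale so accumulated gradients average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```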
## Generation Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |
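The sketch below shows how these parameters typically combine during sampling. It mirrors common practice rather than the exact logic in `generate.py`, and omits `repetition_penalty` (which down-weights already-generated tokens).
```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Sample a token id from a 1-D [vocab_size] logits vector."""
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    kth = torch.topk(logits, top_k).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability reaches top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    drop = cumulative > top_p
    drop[1:] = drop[:-1].clone()
    drop[0] = False                      # always keep the most likely token
    probs[sorted_idx[drop]] = 0.0
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```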
## Data Sources
The model can be trained on:
1. **Wikipedia** (streaming via API)
2. **OpenWebText** (Common Crawl filtered)
3. **Custom text files** (your own data)
4. **Mixed datasets** (combine multiple sources)
## Hardware Requirements
### Training
| GPUs | VRAM per GPU | Config |
|------|--------------|--------|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |
### Inference
- Minimum: 1x GPU with 16GB VRAM (with quantization)
- Recommended: 1x GPU with 24GB+ VRAM
## Model Architecture Details
### Grouped Query Attention
Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
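The snippet below illustrates the core idea: only 8 KV heads are computed (and cached), and each one is shared by 32 / 8 = 4 query heads at attention time. It is a standalone sketch, not the code in `model.py`.
```python
import torch

n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 10
q = torch.randn(1, n_heads, seq, head_dim)     # [batch, query heads, seq, head_dim]
k = torch.randn(1, n_kv_heads, seq, head_dim)  # only 8 KV heads are stored and cached

# Expand each KV head to serve n_heads // n_kv_heads = 4 query heads.
k_expanded = k.repeat_interleave(n_heads // n_kv_heads, dim=1)   # -> [1, 32, 10, 128]
scores = (q @ k_expanded.transpose(-2, -1)) / head_dim ** 0.5    # [1, 32, 10, 10]
```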
### Rotary Position Embeddings
Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
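A compact sketch of the rotation applied to a query or key tensor follows; the function name and shapes are illustrative assumptions, not the exact implementation in `model.py`.
```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x [batch, heads, seq, head_dim] by position-dependent angles."""
    _, _, seq, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))    # [d/2]
    angles = torch.arange(seq).float()[:, None] * inv_freq[None, :]   # [seq, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                        # back to [..., head_dim]
```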
### SwiGLU Feed-Forward
```python
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
```
This has been shown to improve training stability and performance.
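In PyTorch, a minimal version of this block looks like the following; the layer names follow the formula above but may differ from the actual module in `model.py`.
```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim=4096, hidden_dim=14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```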
## Example Usage
```python
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch
# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')
# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))
# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()
# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )
generated = tokenizer.decode(output[0].tolist())
print(generated)
```
## Citation
If you use VicAI in your research, please cite:
```bibtex
@software{vicai2024,
title = {VicAI: A 5B Parameter Language Model from Scratch},
author = {Your Name},
year = {2024},
url = {https://github.com/yourusername/vicai}
}
```
## License
This project is licensed under the MIT License.
## Acknowledgments
- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from Ainslie et al. (2023), as used in Llama 2
- SwiGLU from Shazeer's "GLU Variants Improve Transformer", as used in PaLM
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Troubleshooting
### CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing
- Use FSDP for multi-GPU training
- Reduce sequence length
### Slow Training
- Enable `--compile` flag
- Use mixed precision (AMP)
- Ensure data is on fast storage (SSD)
- Use DataLoader `num_workers > 0`
### Poor Generation Quality
- Train longer
- Use larger, higher quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check tokenizer was trained on similar data
## Important
This model is a test upload with no real training data behind it. I published it only because I wanted to test something, so please don't use it or waste your time downloading it.