# VicAI

A 5B parameter decoder-only transformer language model built from scratch in PyTorch.

## Overview

VicAI is a decoder-only transformer language model featuring:
- **5.1B parameters** with 32 transformer layers
- **Grouped Query Attention (GQA)** for efficient inference
- **Rotary Position Embeddings (RoPE)** for better long-context modeling
- **SwiGLU activation** in feed-forward layers
- **RMSNorm** pre-normalization
- **Byte-level BPE tokenization** (32K vocabulary)

## Architecture

| Component | Specification |
|-----------|---------------|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
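
For reference, the same numbers can be collected in a config object. The sketch below uses hypothetical field names; the actual configuration lives in `model.py` and may differ:

```python
from dataclasses import dataclass

@dataclass
class VicAIConfig:
    """Hypothetical config mirroring the table above; the real fields are in model.py."""
    n_layers: int = 32        # transformer blocks
    d_model: int = 4096       # hidden dimension
    d_ffn: int = 14336        # feed-forward (SwiGLU) dimension
    n_heads: int = 32         # attention (query) heads
    n_kv_heads: int = 8       # key/value heads (GQA)
    max_seq_len: int = 8192   # context length
    vocab_size: int = 32000   # byte-level BPE vocabulary
```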

## File Structure

```
vicai/
β”œβ”€β”€ model.py           # Model architecture and VicAI 5B config
β”œβ”€β”€ tokenizer.py       # BPE tokenizer implementation
β”œβ”€β”€ dataset.py         # Data loading (Wikipedia + custom sources)
β”œβ”€β”€ train.py           # Distributed training script
β”œβ”€β”€ utils.py           # Training utilities and helpers
β”œβ”€β”€ generate.py        # Text generation and inference
β”œβ”€β”€ requirements.txt   # Dependencies
└── README.md          # This file
```

## Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Quick Start

### 1. Prepare Training Data

Option A: Create sample corpus from Wikipedia
```bash
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
```

Option B: Use your own text files
```bash
# Place your text files in data/ directory
# Format: plain text with <|endoftext|> markers between documents
```
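
If your corpus is spread across several files, a small script can merge them into the expected format. A sketch (the `data/raw/` directory name is just an example):

```python
from pathlib import Path

# Concatenate all .txt files under data/raw/ into a single training file,
# separating documents with the <|endoftext|> marker expected by the loader.
sources = sorted(Path("data/raw").glob("*.txt"))
with open("data/train.txt", "w", encoding="utf-8") as out:
    for path in sources:
        out.write(path.read_text(encoding="utf-8").strip())
        out.write("\n<|endoftext|>\n")
```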

### 2. Train Tokenizer

```python
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus

# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)

# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')

# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
```

### 3. Train Model

Single GPU:
```bash
python train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --tokenizer tokenizer.pkl \
    --batch-size 4 \
    --max-steps 100000 \
    --output-dir checkpoints
```

Multi-GPU (DDP):
```bash
torchrun --nproc_per_node=4 train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --batch-size 1 \
    --max-steps 100000 \
    --output-dir checkpoints
```

Multi-GPU (FSDP):
```bash
torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --train-data data/train.txt \
    --batch-size 1 \
    --output-dir checkpoints
```

### 4. Generate Text

Interactive mode:
```bash
python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --interactive
```

Single prompt:
```bash
python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --prompt "The future of AI is" \
    --max-new-tokens 256
```

## Training Configuration

### Default Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |

### Training Tips

- **Memory constrained?** Reduce batch size or use gradient accumulation (see the sketch after this list)
- **Longer context?** Increase `--max-seq-len` (up to 8192)
- **Faster training?** Enable `--compile` for torch.compile optimization
- **Better quality?** Train longer or use larger dataset
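
A minimal, self-contained sketch of gradient accumulation; the toy model, batch, and loss are stand-ins, not the project's training loop:

```python
import torch

model = torch.nn.Linear(16, 16)                  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
accum_steps = 8                                  # effective batch = micro-batch * accum_steps

optimizer.zero_grad()
for step in range(32):
    x = torch.randn(4, 16)                       # stand-in micro-batch
    loss = model(x).pow(2).mean()                # stand-in loss
    (loss / accum_steps).backward()              # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```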

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |
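
A sketch of how these parameters typically interact when sampling one token; the actual logic lives in `generate.py` and may differ in detail:

```python
import torch

def sample_next_token(logits, generated_ids, temperature=0.8, top_k=50,
                      top_p=0.9, repetition_penalty=1.1):
    """logits: (vocab_size,) scores for the next token; generated_ids: 1-D tensor of tokens so far."""
    logits = logits.clone()
    # Repetition penalty: push down tokens already present in the output.
    for tok in set(generated_ids.tolist()):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 \
            else logits[tok] * repetition_penalty
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")
    # Top-p (nucleus): keep the smallest prefix of tokens whose mass reaches top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[1:] = remove[:-1].clone()   # shift so the first token over the threshold stays
    remove[0] = False
    logits[sorted_idx[remove]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
```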

## Data Sources

The model can be trained on:

1. **Wikipedia** (streaming via API)
2. **OpenWebText** (Common Crawl filtered)
3. **Custom text files** (your own data)
4. **Mixed datasets** (combine multiple sources)

## Hardware Requirements

### Training

| GPUs | VRAM per GPU | Config |
|------|--------------|--------|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |

### Inference

- Minimum: 1x GPU with 16GB VRAM (with quantization)
- Recommended: 1x GPU with 24GB+ VRAM

## Model Architecture Details

### Grouped Query Attention

Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
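
Concretely, each key/value head is shared by a group of 32 / 8 = 4 query heads; a common way to implement this is to repeat the K/V heads before the attention product. A shape-only sketch (no causal mask, not the project's exact code):

```python
import torch

n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
q = torch.randn(1, n_heads, seq, head_dim)       # 32 query heads
k = torch.randn(1, n_kv_heads, seq, head_dim)    # only 8 key heads
v = torch.randn(1, n_kv_heads, seq, head_dim)    # only 8 value heads

groups = n_heads // n_kv_heads                   # 4 query heads per K/V head
k = k.repeat_interleave(groups, dim=1)           # (1, 32, seq, head_dim)
v = v.repeat_interleave(groups, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1) @ v
print(attn.shape)                                # torch.Size([1, 32, 16, 128])
```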

### Rotary Position Embeddings

Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
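
A compact sketch of applying rotary embeddings to a query or key tensor, using the common rotate-half formulation; the implementation in `model.py` may organize dimensions differently:

```python
import torch

def apply_rope(x, base=10000.0):
    """x: (batch, heads, seq, head_dim) with an even head_dim."""
    *_, seq, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)       # per-pair frequencies
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```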

### SwiGLU Feed-Forward

```python
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
```

This has been shown to improve training stability and performance.
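
Written as a PyTorch module, the same computation looks like the sketch below (class and attribute names are illustrative; the real implementation is in `model.py`):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(W1 x) gates (W3 x), projected back with W2."""
    def __init__(self, d_model=4096, d_ffn=14336):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ffn, bias=False)  # value projection
        self.w2 = nn.Linear(d_ffn, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```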

## Example Usage

```python
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch

# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')

# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))

# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()

# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)
```

## Citation

If you use VicAI in your research, please cite:

```bibtex
@software{vicai2024,
  title = {VicAI: A 5B Parameter Language Model from Scratch},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/vicai}
}
```

## License

This project is licensed under the MIT License.

## Acknowledgments

- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from the Llama 2 paper
- SwiGLU from the PaLM paper

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Troubleshooting

### CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing
- Use FSDP for multi-GPU training
- Reduce sequence length

### Slow Training
- Enable `--compile` flag
- Use mixed precision (AMP)
- Ensure data is on fast storage (SSD)
- Use DataLoader `num_workers > 0`

### Poor Generation Quality
- Train longer
- Use larger, higher quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check tokenizer was trained on similar data

## Important

This model is not usable: there is no training data behind it. It is a test model created only so I could test something, so please do not use it or waste your time downloading it.