
VicAI

A 5B-parameter decoder-only transformer language model built from scratch in PyTorch.

Overview

VicAI is a decoder-only transformer language model featuring:

  • 5.1B parameters with 32 transformer layers
  • Grouped Query Attention (GQA) for efficient inference
  • Rotary Position Embeddings (RoPE) for better long-context modeling
  • SwiGLU activation in feed-forward layers
  • RMSNorm pre-normalization
  • Byte-level BPE tokenization (32K vocabulary)

Architecture

| Component | Specification |
|---|---|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
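
For reference, these dimensions can be collected into a small configuration object. The sketch below is illustrative only; the actual config class and its field names live in model.py and may differ.

```python
from dataclasses import dataclass

# Illustrative sketch only: the real configuration class lives in model.py
# and its field names may differ.
@dataclass
class VicAIConfig:
    vocab_size: int = 32_000
    n_layers: int = 32
    hidden_dim: int = 4096
    ffn_dim: int = 14_336
    n_heads: int = 32
    n_kv_heads: int = 8       # grouped query attention
    max_seq_len: int = 8192
```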

File Structure

vicai/
├── model.py           # Model architecture and VicAI 5B config
├── tokenizer.py       # BPE tokenizer implementation
├── dataset.py         # Data loading (Wikipedia + custom sources)
├── train.py           # Distributed training script
├── utils.py           # Training utilities and helpers
├── generate.py        # Text generation and inference
├── requirements.txt   # Dependencies
└── README.md          # This file

Installation

# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

1. Prepare Training Data

Option A: Create sample corpus from Wikipedia

python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"

Option B: Use your own text files

# Place your text files in data/ directory
# Format: plain text with <|endoftext|> markers between documents
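
A minimal way to produce a file in this format from your own documents (the paths and variable names below are illustrative, not part of the repo):

```python
# Hypothetical example: join raw documents with the <|endoftext|> marker
# that dataset.py expects between documents.
documents = [
    "First document text ...",
    "Second document text ...",
]

with open("data/train.txt", "w", encoding="utf-8") as f:
    f.write("<|endoftext|>".join(documents))
```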

2. Train Tokenizer

from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus

# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)

# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')

# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')

3. Train Model

Single GPU:

python train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --tokenizer tokenizer.pkl \
    --batch-size 4 \
    --max-steps 100000 \
    --output-dir checkpoints

Multi-GPU (DDP):

torchrun --nproc_per_node=4 train.py \
    --train-data data/train.txt \
    --val-data data/val.txt \
    --batch-size 1 \
    --max-steps 100000 \
    --output-dir checkpoints

Multi-GPU (FSDP):

torchrun --nproc_per_node=8 train.py \
    --use-fsdp \
    --train-data data/train.txt \
    --batch-size 1 \
    --output-dir checkpoints

4. Generate Text

Interactive mode:

python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --interactive

Single prompt:

python generate.py \
    --checkpoint checkpoints/best_model.pt \
    --tokenizer tokenizer.pkl \
    --prompt "The future of AI is" \
    --max-new-tokens 256

Training Configuration

Default Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
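
The exact schedule is implemented in train.py. A common interpretation of these numbers, and only an assumption here, is linear warmup to the peak learning rate followed by cosine decay to the minimum:

```python
import math

# Assumed schedule: linear warmup to 3e-4 over 2,000 steps, then cosine decay
# to the 3e-5 floor by step 100,000. Check train.py for the actual schedule.
def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, max_steps=100_000):
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min((step - warmup) / max(1, max_steps - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```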

Training Tips

  • Memory constrained? Reduce batch size or use gradient accumulation (see the sketch after this list)
  • Longer context? Increase --max-seq-len (up to 8192)
  • Faster training? Enable --compile for torch.compile optimization
  • Better quality? Train longer or use larger dataset
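
If a single device cannot fit the target batch size, gradient accumulation trades extra steps for memory. The loop below is a generic sketch that assumes `model`, `optimizer`, and `loader` already exist; train.py's actual flags and loss computation may differ.

```python
import torch.nn.functional as F

# Sketch of gradient accumulation: effective batch = per-device batch * accum_steps.
accum_steps = 8

optimizer.zero_grad(set_to_none=True)
for step, (input_ids, labels) in enumerate(loader):
    logits = model(input_ids)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    (loss / accum_steps).backward()   # scale so accumulated gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```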

Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only the top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum number of tokens to generate |
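
As a rough illustration of how temperature, top_k, and top_p interact during decoding, here is a generic sampling step (a sketch, not the exact code in generate.py; repetition_penalty is omitted):

```python
import torch

# Generic temperature / top-k / top-p sampling step for logits of shape
# (batch, vocab_size). generate.py's implementation may differ.
def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / temperature
    # Top-k: mask everything below the k-th largest logit.
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix of tokens whose cumulative
    # probability reaches top_p, then renormalize and sample.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```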

Data Sources

The model can be trained on:

  1. Wikipedia (streaming via API)
  2. OpenWebText (Common Crawl filtered)
  3. Custom text files (your own data)
  4. Mixed datasets (combine multiple sources)

Hardware Requirements

Training

| GPUs | VRAM per GPU | Config |
|---|---|---|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |

Inference

  • Minimum: 1x GPU with 16GB VRAM (with quantization)
  • Recommended: 1x GPU with 24GB+ VRAM

Model Architecture Details

Grouped Query Attention

Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
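
Concretely, each group of 4 query heads attends with a shared key/value head. The snippet below is a minimal illustration of that sharing, not the repo's attention code:

```python
import torch

# Minimal GQA illustration: 32 query heads share 8 KV heads (4 per group).
batch, seq, head_dim, n_heads, n_kv_heads = 2, 16, 128, 32, 8

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

groups = n_heads // n_kv_heads                 # 4 query heads per KV head
k = k.repeat_interleave(groups, dim=1)         # -> (batch, 32, seq, head_dim)
v = v.repeat_interleave(groups, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v        # -> (batch, 32, seq, head_dim)
```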

Rotary Position Embeddings

Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
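
The idea, sketched below, is to rotate each query/key feature pair by an angle that grows with the token's position, so attention scores depend on relative offsets (illustrative only; model.py may use a different layout):

```python
import torch

# Sketch of RoPE for a tensor of shape (batch, heads, seq, head_dim),
# using the split-halves formulation. model.py's version may differ.
def apply_rope(x, base=10000.0):
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half) / half))         # (half,)
    angles = torch.arange(x.shape[-2])[:, None] * freqs[None]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```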

SwiGLU Feed-Forward

FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2

This has been shown to improve training stability and performance.
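
A minimal module matching this formula and the dimensions in the architecture table (naming here is illustrative; see model.py for the actual implementation):

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the SwiGLU feed-forward block; dimensions follow the architecture
# table (hidden 4096, FFN 14336), but model.py's naming may differ.
class SwiGLUFFN(nn.Module):
    def __init__(self, dim=4096, ffn_dim=14336):
        super().__init__()
        self.w1 = nn.Linear(dim, ffn_dim, bias=False)   # gate projection
        self.w3 = nn.Linear(dim, ffn_dim, bias=False)   # up projection
        self.w2 = nn.Linear(ffn_dim, dim, bias=False)   # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```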

Example Usage

from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch

# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')

# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))

# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()
model.eval()  # disable dropout for inference

# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )

generated = tokenizer.decode(output[0].tolist())
print(generated)

Citation

If you use VicAI in your research, please cite:

@software{vicai2024,
  title = {VicAI: A 5B Parameter Language Model from Scratch},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/vicai}
}

License

This project is licensed under the MIT License.

Acknowledgments

  • Transformer architecture based on "Attention Is All You Need"
  • RoPE embeddings from RoFormer
  • GQA from the Llama 2 paper
  • SwiGLU from the PaLM paper

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Troubleshooting

CUDA Out of Memory

  • Reduce batch size
  • Enable gradient checkpointing
  • Use FSDP for multi-GPU training
  • Reduce sequence length

Slow Training

  • Enable --compile flag
  • Use mixed precision (AMP); see the sketch after this list
  • Ensure data is on fast storage (SSD)
  • Use DataLoader num_workers > 0
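
A typical mixed-precision training step looks like the sketch below. This is an assumption about how AMP could be wired up, not necessarily how train.py does it; `model`, `optimizer`, and `loader` are assumed to exist.

```python
import torch
import torch.nn.functional as F

# Assumed AMP pattern: float16 autocast with gradient scaling.
scaler = torch.cuda.amp.GradScaler()

for input_ids, labels in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(input_ids)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```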

Poor Generation Quality

  • Train longer
  • Use larger, higher quality dataset
  • Adjust sampling parameters (temperature, top_p)
  • Check tokenizer was trained on similar data

## Important

This model is not usable. It was created purely as a test and there is no training data behind it, so please do not use it or spend time downloading it.