VicAI
A 5B parameter decoder-only transformer language model built from scratch in PyTorch.
Overview
VicAI is a decoder-only language model built from scratch, featuring:
- 5.1B parameters with 32 transformer layers
- Grouped Query Attention (GQA) for efficient inference
- Rotary Position Embeddings (RoPE) for better long-context modeling
- SwiGLU activation in feed-forward layers
- RMSNorm pre-normalization
- Byte-level BPE tokenization (32K vocabulary)
Architecture
| Component | Specification |
|---|---|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
File Structure
vicai/
├── model.py          # Model architecture and VicAI 5B config
├── tokenizer.py      # BPE tokenizer implementation
├── dataset.py        # Data loading (Wikipedia + custom sources)
├── train.py          # Distributed training script
├── utils.py          # Training utilities and helpers
├── generate.py       # Text generation and inference
├── requirements.txt  # Dependencies
└── README.md         # This file
Installation
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Quick Start
1. Prepare Training Data
Option A: Create sample corpus from Wikipedia
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
Option B: Use your own text files
# Place your text files in data/ directory
# Format: plain text with <|endoftext|> markers between documents
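For Option B, a small script like the one below (illustrative only; the data/raw directory and filenames are placeholders) can merge individual documents into a single training file with <|endoftext|> separators:
from pathlib import Path

# Hypothetical helper: join individual .txt documents into one training file,
# inserting <|endoftext|> between documents as the format above expects.
docs = sorted(Path('data/raw').glob('*.txt'))
with open('data/train.txt', 'w', encoding='utf-8') as out:
    for doc in docs:
        text = doc.read_text(encoding='utf-8').strip()
        if text:
            out.write(text + '\n<|endoftext|>\n')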
2. Train Tokenizer
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus
# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)
# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')
# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
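As a quick sanity check, continuing from the snippet above (encode, decode, and len are the same tokenizer methods used in the generation example later in this README):
# Round trip: text -> token ids -> text
ids = tokenizer.encode('Hello, VicAI!')
print(ids)
print(tokenizer.decode(ids))
print('vocab size:', len(tokenizer))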
3. Train Model
Single GPU:
python train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--tokenizer tokenizer.pkl \
--batch-size 4 \
--max-steps 100000 \
--output-dir checkpoints
Multi-GPU (DDP):
torchrun --nproc_per_node=4 train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--batch-size 1 \
--max-steps 100000 \
--output-dir checkpoints
Multi-GPU (FSDP):
torchrun --nproc_per_node=8 train.py \
--use-fsdp \
--train-data data/train.txt \
--batch-size 1 \
--output-dir checkpoints
4. Generate Text
Interactive mode:
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--interactive
Single prompt:
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--prompt "The future of AI is" \
--max-new-tokens 256
Training Configuration
Default Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
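The warmup and minimum-LR values suggest a warmup-then-decay schedule. A minimal sketch of linear warmup followed by cosine decay to the minimum LR (an assumption about how train.py schedules the rate) looks like this:
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, max_steps=100_000):
    # Linear warmup from 0 to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))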
Training Tips
- Memory constrained? Reduce batch size or use gradient accumulation (see the sketch after this list)
- Longer context? Increase --max-seq-len (up to 8192)
- Faster training? Enable --compile for torch.compile optimization
- Better quality? Train longer or use a larger dataset
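The gradient-accumulation pattern mentioned in the first tip looks roughly like this (a self-contained toy; swap in the real model, DataLoader, and loss from train.py):
import torch

# Toy stand-ins so the pattern runs as-is; replace with the VicAI model and data.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
batches = [torch.randn(4, 16) for _ in range(32)]

accum_steps = 8  # effective batch size = micro-batch * accum_steps
optimizer.zero_grad()
for i, batch in enumerate(batches):
    loss = model(batch).pow(2).mean()   # placeholder loss
    (loss / accum_steps).backward()     # scale so gradients average over the window
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()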
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |
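The sketch below shows how these parameters typically interact when sampling a single next token (a generic implementation for illustration, not necessarily identical to generate.py; repetition_penalty is omitted for brevity):
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    # logits: (vocab_size,) scores for the next token
    logits = logits / temperature                      # sharpen or flatten the distribution
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float('-inf')           # keep only the top-k tokens
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        cutoff = cumulative - probs > top_p            # drop tokens outside the nucleus
        sorted_logits[cutoff] = float('-inf')
        logits = torch.full_like(logits, float('-inf')).scatter(0, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)

next_id = sample_next_token(torch.randn(32000))        # demo with random logits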
Data Sources
The model can be trained on:
- Wikipedia (streaming via API)
- OpenWebText (open replication of GPT-2's WebText corpus)
- Custom text files (your own data)
- Mixed datasets (combine multiple sources)
Hardware Requirements
Training
| GPUs | VRAM per GPU | Config |
|---|---|---|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |
Inference
- Minimum: 1x GPU with 16GB VRAM (with quantization; the 5.1B fp16 weights alone occupy roughly 10 GB, leaving little headroom for the KV cache)
- Recommended: 1x GPU with 24GB+ VRAM
Model Architecture Details
Grouped Query Attention
Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
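A minimal sketch of the idea, using the shapes from the table above (the real attention module in model.py will differ in details):
import torch

n_heads, n_kv_heads, head_dim = 32, 8, 128    # 32 * 128 = 4096 hidden dim
seq_len = 16
groups = n_heads // n_kv_heads                # each KV head serves 4 query heads

q = torch.randn(1, n_heads, seq_len, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# Repeat the 8 KV heads so they line up with the 32 query heads,
# then run ordinary causal scaled-dot-product attention.
k = k.repeat_interleave(groups, dim=1)
v = v.repeat_interleave(groups, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)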
Rotary Position Embeddings
Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
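A compact sketch of rotary embeddings applied to a query or key tensor (one common formulation; model.py may organize the math differently):
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq_len, head_dim). Rotate channel pairs by a
    # position-dependent angle so attention scores depend on relative position.
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = torch.arange(t).float()[:, None] * inv_freq[None, :]    # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                              # even/odd channels
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                       # interleave back to head_dim

q = apply_rope(torch.randn(1, 32, 16, 128))   # demo with random queries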
SwiGLU Feed-Forward
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
This has been shown to improve training stability and performance.
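In PyTorch the block above corresponds to something like the following (a sketch using the dimensions from the Architecture table; naming in model.py may differ):
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim=4096, hidden_dim=14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        # FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
        return self.w2(F.silu(self.w1(x)) * self.w3(x))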
Example Usage
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch
# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')
# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))
# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()
# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )
generated = tokenizer.decode(output[0].tolist())
print(generated)
Citation
If you use VicAI in your research, please cite:
@software{vicai2024,
  title  = {VicAI: A 5B Parameter Language Model from Scratch},
  author = {Your Name},
  year   = {2024},
  url    = {https://github.com/yourusername/vicai}
}
License
This project is licensed under the MIT License.
Acknowledgments
- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from Ainslie et al. (2023), as used in Llama 2
- SwiGLU from Shazeer (2020), as used in PaLM
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Troubleshooting
CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing (see the sketch after this list)
- Use FSDP for multi-GPU training
- Reduce sequence length
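Gradient checkpointing (second item above) trades compute for memory by recomputing activations during the backward pass. A toy sketch with PyTorch's built-in utility (how model.py actually wires this up may differ):
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())  # stand-in for a transformer block
x = torch.randn(4, 64, requires_grad=True)

# Activations inside `block` are freed after the forward pass and
# recomputed during backward, cutting peak memory at the cost of extra compute.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()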
Slow Training
- Enable the --compile flag
- Use mixed precision (AMP); a minimal training-step sketch follows this list
- Ensure data is on fast storage (SSD)
- Use DataLoader with num_workers > 0
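A mixed-precision training step with PyTorch AMP looks roughly like this (illustrative toy; not necessarily how train.py wires it up):
import torch

model = torch.nn.Linear(16, 16).cuda()          # stand-in for the VicAI model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()            # scales the loss to avoid fp16 underflow

batch = torch.randn(4, 16, device='cuda')
with torch.autocast(device_type='cuda', dtype=torch.float16):
    loss = model(batch).pow(2).mean()           # placeholder loss
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()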
Poor Generation Quality
- Train longer
- Use larger, higher quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check tokenizer was trained on similar data
Important
This is a test model. There is no training data behind it; it was published only so the author could test something. Do not use it or spend time downloading it.