# VicAI
A 5B parameter decoder-only transformer language model built from scratch in PyTorch.
## Overview
VicAI is a decoder-only transformer language model featuring:
- **5.1B parameters** with 32 transformer layers
- **Grouped Query Attention (GQA)** for efficient inference
- **Rotary Position Embeddings (RoPE)** for better long-context modeling
- **SwiGLU activation** in feed-forward layers
- **RMSNorm** pre-normalization
- **Byte-level BPE tokenization** (32K vocabulary)
## Architecture
| Component | Specification |
|-----------|---------------|
| Parameters | ~5.1B |
| Layers | 32 |
| Hidden Dim | 4096 |
| FFN Dim | 14336 |
| Attention Heads | 32 |
| KV Heads | 8 (GQA) |
| Context Length | 8192 |
| Vocabulary | 32,000 |
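For reference, the table maps onto a configuration object roughly like the following. This is an illustrative sketch; the field names are assumptions and may not match the actual config class in `model.py`.
```python
from dataclasses import dataclass

@dataclass
class VicAIConfig:
    """Illustrative config mirroring the architecture table above."""
    vocab_size: int = 32_000
    n_layers: int = 32
    hidden_dim: int = 4096
    ffn_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8        # grouped query attention
    max_seq_len: int = 8192
```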
## File Structure
```
vicai/
β”œβ”€β”€ model.py # Model architecture and VicAI 5B config
β”œβ”€β”€ tokenizer.py # BPE tokenizer implementation
β”œβ”€β”€ dataset.py # Data loading (Wikipedia + custom sources)
β”œβ”€β”€ train.py # Distributed training script
β”œβ”€β”€ utils.py # Training utilities and helpers
β”œβ”€β”€ generate.py # Text generation and inference
β”œβ”€β”€ requirements.txt # Dependencies
└── README.md # This file
```
## Installation
```bash
# Clone the repository
git clone https://github.com/yourusername/vicai.git
cd vicai
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
## Quick Start
### 1. Prepare Training Data
Option A: Create sample corpus from Wikipedia
```bash
python -c "from dataset import create_sample_corpus; create_sample_corpus('data/train.txt', num_articles=10000)"
```
Option B: Use your own text files
```bash
# Place your text files in data/ directory
# Format: plain text with <|endoftext|> markers between documents
```
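If you are assembling your own corpus, one simple way to produce a file in this format is to join documents with the marker. This is a minimal sketch; the `my_texts/` directory is a placeholder for wherever your raw files live.
```python
from pathlib import Path

# Concatenate individual documents into one training file,
# separated by the <|endoftext|> marker expected by dataset.py.
docs = [p.read_text(encoding="utf-8") for p in sorted(Path("my_texts").glob("*.txt"))]
Path("data/train.txt").write_text("<|endoftext|>".join(docs), encoding="utf-8")
```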
### 2. Train Tokenizer
```python
from tokenizer import ByteLevelBPETokenizer
from dataset import create_sample_corpus
# Create corpus
corpus = create_sample_corpus('data/train.txt', num_articles=1000)
# Read texts
with open(corpus, 'r') as f:
    texts = f.read().split('<|endoftext|>')
# Train tokenizer
tokenizer = ByteLevelBPETokenizer(vocab_size=32000)
tokenizer.train([t for t in texts if t.strip()])
tokenizer.save('tokenizer.pkl')
```
### 3. Train Model
Single GPU:
```bash
python train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--tokenizer tokenizer.pkl \
--batch-size 4 \
--max-steps 100000 \
--output-dir checkpoints
```
Multi-GPU (DDP):
```bash
torchrun --nproc_per_node=4 train.py \
--train-data data/train.txt \
--val-data data/val.txt \
--batch-size 1 \
--max-steps 100000 \
--output-dir checkpoints
```
Multi-GPU (FSDP):
```bash
torchrun --nproc_per_node=8 train.py \
--use-fsdp \
--train-data data/train.txt \
--batch-size 1 \
--output-dir checkpoints
```
### 4. Generate Text
Interactive mode:
```bash
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--interactive
```
Single prompt:
```bash
python generate.py \
--checkpoint checkpoints/best_model.pt \
--tokenizer tokenizer.pkl \
--prompt "The future of AI is" \
--max-new-tokens 256
```
## Training Configuration
### Default Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 3e-4 |
| Min LR | 3e-5 |
| Warmup Steps | 2,000 |
| Weight Decay | 0.1 |
| Batch Size | 4 (per device) |
| Max Steps | 100,000 |
| Beta1 | 0.9 |
| Beta2 | 0.95 |
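These defaults suggest AdamW with linear warmup followed by decay toward the minimum learning rate. The sketch below assumes a cosine decay shape, which is common but not confirmed here; check `train.py` for the actual schedule.
```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup=2_000, max_steps=100_000):
    """Linear warmup to max_lr, then (assumed) cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```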
### Training Tips
- **Memory constrained?** Reduce the batch size or use gradient accumulation (see the sketch after this list)
- **Longer context?** Increase `--max-seq-len` (up to 8192)
- **Faster training?** Enable `--compile` for torch.compile optimization
- **Better quality?** Train longer or use larger dataset
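Gradient accumulation keeps the effective batch size large while holding only a small batch in memory. The loop below is a generic PyTorch sketch with toy stand-ins, not the actual training loop in `train.py`.
```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop runs; replace with the real model and dataloader.
model = nn.Linear(16, 32000)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
loss_fn = nn.CrossEntropyLoss()
train_loader = [(torch.randn(2, 16), torch.randint(0, 32000, (2,))) for _ in range(16)]

accum_steps = 8  # effective batch size = per-device batch size * accum_steps

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    loss = loss_fn(model(inputs), targets)
    (loss / accum_steps).backward()          # scale so accumulated gradients average out
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```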
## Generation Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| temperature | 0.8 | Lower = more focused, higher = more random |
| top_k | 50 | Consider only top-k tokens |
| top_p | 0.9 | Nucleus sampling threshold |
| repetition_penalty | 1.1 | Penalize repeated tokens |
| max_new_tokens | 256 | Maximum tokens to generate |
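The sketch below shows how these parameters typically combine during sampling. It mirrors common practice rather than the exact logic in `generate.py`, and omits `repetition_penalty` (which down-weights already-generated tokens).
```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Sample a token id from a 1-D [vocab_size] logits vector."""
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    kth = torch.topk(logits, top_k).values[-1]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability reaches top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    drop = cumulative > top_p
    drop[1:] = drop[:-1].clone()
    drop[0] = False                      # always keep the most likely token
    probs[sorted_idx[drop]] = 0.0
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()
```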
## Data Sources
The model can be trained on:
1. **Wikipedia** (streaming via API)
2. **OpenWebText** (Common Crawl filtered)
3. **Custom text files** (your own data)
4. **Mixed datasets** (combine multiple sources)
## Hardware Requirements
### Training
| GPUs | VRAM per GPU | Config |
|------|--------------|--------|
| 1x A100 (80GB) | 80GB | batch_size=4, compile=True |
| 4x A100 (40GB) | 40GB | batch_size=1, DDP |
| 8x A100 (40GB) | 40GB | batch_size=1, FSDP |
| 1x RTX 4090 | 24GB | batch_size=1, smaller model |
### Inference
- Minimum: 1x GPU with 16GB VRAM (with quantization)
- Recommended: 1x GPU with 24GB+ VRAM
## Model Architecture Details
### Grouped Query Attention
Uses 8 key-value heads instead of 32, reducing memory bandwidth during inference while maintaining quality.
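The snippet below illustrates the core idea: only 8 KV heads are computed (and cached), and each one is shared by 32 / 8 = 4 query heads at attention time. It is a standalone sketch, not the code in `model.py`.
```python
import torch

n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 10
q = torch.randn(1, n_heads, seq, head_dim)     # [batch, query heads, seq, head_dim]
k = torch.randn(1, n_kv_heads, seq, head_dim)  # only 8 KV heads are stored and cached

# Expand each KV head to serve n_heads // n_kv_heads = 4 query heads.
k_expanded = k.repeat_interleave(n_heads // n_kv_heads, dim=1)   # -> [1, 32, 10, 128]
scores = (q @ k_expanded.transpose(-2, -1)) / head_dim ** 0.5    # [1, 32, 10, 10]
```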
### Rotary Position Embeddings
Rotary embeddings are applied to queries and keys, providing better relative position encoding than absolute embeddings.
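A compact sketch of the rotation applied to a query or key tensor follows; the function name and shapes are illustrative assumptions, not the exact implementation in `model.py`.
```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x [batch, heads, seq, head_dim] by position-dependent angles."""
    _, _, seq, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))    # [d/2]
    angles = torch.arange(seq).float()[:, None] * inv_freq[None, :]   # [seq, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                                        # back to [..., head_dim]
```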
### SwiGLU Feed-Forward
```python
FFN(x) = (silu(W1 @ x) * (W3 @ x)) @ W2
```
This has been shown to improve training stability and performance.
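In PyTorch, a minimal version of this block looks like the following; the layer names follow the formula above but may differ from the actual module in `model.py`.
```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim=4096, hidden_dim=14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```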
## Example Usage
```python
from model import create_vicai_5b
from tokenizer import ByteLevelBPETokenizer
import torch
# Load tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.load('tokenizer.pkl')
# Create model
model = create_vicai_5b(vocab_size=len(tokenizer))
# Load checkpoint
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model'])
model = model.cuda()
# Generate
text = "Artificial intelligence will"
input_ids = torch.tensor([tokenizer.encode(text)]).cuda()
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
    )
generated = tokenizer.decode(output[0].tolist())
print(generated)
```
## Citation
If you use VicAI in your research, please cite:
```bibtex
@software{vicai2024,
title = {VicAI: A 5B Parameter Language Model from Scratch},
author = {Your Name},
year = {2024},
url = {https://github.com/yourusername/vicai}
}
```
## License
This project is licensed under the MIT License.
## Acknowledgments
- Transformer architecture based on "Attention Is All You Need"
- RoPE embeddings from RoFormer
- GQA from Ainslie et al. (2023), as used in Llama 2
- SwiGLU from Shazeer's "GLU Variants Improve Transformer", as used in PaLM
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Troubleshooting
### CUDA Out of Memory
- Reduce batch size
- Enable gradient checkpointing
- Use FSDP for multi-GPU training
- Reduce sequence length
### Slow Training
- Enable `--compile` flag
- Use mixed precision (AMP)
- Ensure data is on fast storage (SSD)
- Use DataLoader `num_workers > 0`
### Poor Generation Quality
- Train longer
- Use larger, higher quality dataset
- Adjust sampling parameters (temperature, top_p)
- Check tokenizer was trained on similar data
## Important
This model is a test upload with no real training data behind it. I published it only because I wanted to test something, so please don't use it or waste your time downloading it.