|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- text-generation |
|
|
- pytorch |
|
|
- transformer |
|
|
- rope |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
library_name: pytorch |
|
|
--- |
|
|
|
|
|
# VelocityLM
|
|
|
|
|
A high-performance, custom transformer language model trained from scratch. VelocityLM combines RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE) to deliver efficient and scalable language modeling.
|
|
|
|
|
## Quick Links
|
|
|
|
|
- **Try the Model**: [Interactive Demo Space](https://huggingface.co/spaces/dixisouls/VelocityLM)
|
|
- **Source Code**: [GitHub Repository](https://github.com/dixisouls/VelocityLM)
|
|
|
|
|
## Model Architecture
|
|
|
|
|
VelocityLM features a custom transformer architecture optimized for performance and efficiency: |
|
|
|
|
|
### Model Specifications |
|
|
- **Parameters**: ~2 billion
|
|
- **Architecture**: Decoder-only transformer with causal attention |
|
|
- **Hidden Size**: 2,048 |
|
|
- **Layers**: 24 transformer layers |
|
|
- **Attention Heads**: 32 heads per layer |
|
|
- **Vocabulary**: 50,257 tokens (GPT-2 tokenizer compatible) |
|
|
- **Context Length**: 2,048 tokens |
|
|
- **Intermediate Size**: 8,192 (4x hidden size) |
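

For reference, these specifications map onto a configuration object roughly like the following sketch; the class and field names are illustrative, not the repository's actual config API:

```python
from dataclasses import dataclass

@dataclass
class VelocityLMConfig:
    # Hypothetical config mirroring the specifications above
    vocab_size: int = 50257          # GPT-2 tokenizer compatible
    hidden_size: int = 2048
    num_layers: int = 24
    num_heads: int = 32
    intermediate_size: int = 8192    # 4x hidden size (SwiGLU feed-forward)
    max_seq_len: int = 2048          # context length
```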
|
|
|
|
|
### Key Innovations
|
|
|
|
|
#### RMSNorm (Root Mean Square Normalization) |
|
|
- Replaces LayerNorm for improved training stability and efficiency |
|
|
- Better gradient flow compared to traditional normalization |
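

A minimal PyTorch sketch of RMSNorm as described above (not necessarily the repository's exact implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root mean square over the hidden dimension;
        # unlike LayerNorm, there is no mean subtraction and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```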
|
|
|
|
|
#### SwiGLU Activation Function |
|
|
- Gated Linear Unit with Swish activation |
|
|
- Superior performance compared to standard ReLU/GELU for language modeling |
|
|
- Enhanced expressivity and gradient flow |
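

A minimal sketch of a SwiGLU feed-forward block; the layer names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish/SiLU of the gate projection multiplicatively gates the
        # parallel up projection, then the result is projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```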
|
|
|
|
|
#### Rotary Position Embeddings (RoPE) |
|
|
- Encodes relative positions by rotating query/key feature pairs, so attention scores depend only on the distance between tokens


- Extrapolates better to sequences longer than those seen during training


- More parameter-efficient than learned absolute position embeddings, since there is no position table to train (sketched below)
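

A minimal sketch of applying RoPE to a tensor of attention heads, using the standard interleaved-pair formulation (not necessarily the repository's exact code):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key features; x has shape
    (batch, seq_len, num_heads, head_dim) with even head_dim."""
    _, seq_len, _, head_dim = x.shape
    # One rotation frequency per (even, odd) feature pair
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # 2D rotation of each feature pair by a position-dependent angle
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)
```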
|
|
|
|
|
## Training Details
|
|
|
|
|
- **Dataset**: [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) - high-quality web text |
|
|
- **Training Steps**: 5,000+ completed |
|
|
- **Optimization**: AdamW with cosine annealing schedule |
|
|
- **Hardware**: Trained on 4x NVIDIA A100 (80GB) GPUs |
|
|
- **Features**: Mixed precision (FP16), gradient checkpointing, distributed training |
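

A minimal sketch of the optimizer/schedule pairing described above; the learning rate, weight decay, and restart period are illustrative values, and `model` is a placeholder module:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 2048)  # stand-in for the actual transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
# Cosine annealing with warm restarts; the exact schedule parameters may differ
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000)
```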
|
|
|
|
|
## Usage
|
|
|
|
|
### Basic Text Generation |
|
|
|
|
|
```python
# Note: this model uses a custom architecture and requires custom loading code.
# See the GitHub repository for the complete implementation.

import torch
from transformers import AutoTokenizer

# Load the tokenizer (standard GPT-2 vocabulary, 50,257 tokens)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a prompt; the loaded model consumes these token ids
input_ids = tokenizer("The future of language models", return_tensors="pt").input_ids

# For complete model-loading and generation examples, see:
# https://github.com/dixisouls/VelocityLM
```
|
|
|
|
|
### Interactive Demo |
|
|
Try the model immediately in our [Hugging Face Space](https://huggingface.co/spaces/dixisouls/VelocityLM) - no setup required! |
|
|
|
|
|
## Performance Features
|
|
|
|
|
### Generation Strategies |
|
|
- Greedy decoding for deterministic output |
|
|
- Top-k and top-p (nucleus) sampling |
|
|
- Temperature scaling to control output randomness
|
|
- Repetition penalty to reduce repetitive text |
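

A minimal sketch of how temperature, top-k, and top-p combine when sampling a single token; this is generic sampling logic, not the repository's exact sampler:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8,
                      top_k: int = 50, top_p: float = 0.9) -> int:
    """Sample one token id from a 1-D (vocab_size,) logits vector."""
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    if top_k > 0:
        # Top-k: mask everything below the k-th highest logit
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix of sorted tokens whose
    # cumulative probability mass reaches top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cumulative > top_p
    remove[1:] = remove[:-1].clone()  # shift right so the boundary token survives
    remove[0] = False
    logits[sorted_idx[remove]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```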
|
|
|
|
|
### Memory Optimizations |
|
|
- Gradient checkpointing (40% memory reduction) |
|
|
- Efficient causal attention implementation |
|
|
- Streaming data processing |
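

A minimal sketch of the gradient-checkpointing pattern: activations inside each checkpointed block are recomputed during the backward pass instead of being stored, trading compute for memory. Here `layers` stands in for the model's transformer blocks:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(layers, hidden_states: torch.Tensor) -> torch.Tensor:
    for layer in layers:
        # Only the block's inputs are kept; its internal activations
        # are recomputed when gradients flow back through it
        hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states
```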
|
|
|
|
|
## Technical Implementation
|
|
|
|
|
This model implements several cutting-edge techniques: |
|
|
|
|
|
- **Distributed Training**: Multi-GPU support with PyTorch DDP |
|
|
- **Mixed Precision**: FP16 training with automatic loss scaling |
|
|
- **Advanced Scheduling**: Cosine annealing with warm restarts |
|
|
- **Memory Efficiency**: Gradient checkpointing and parameter grouping |
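

As an illustration of the mixed-precision setup, here is a minimal FP16 training step with automatic loss scaling; the `model` call signature and batch layout are assumptions, not the repository's API:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, input_ids, labels):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # run the forward pass in FP16 where safe
        loss = model(input_ids, labels=labels)  # hypothetical call returning a scalar loss
    scaler.scale(loss).backward()    # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)           # unscales gradients, then steps the optimizer
    scaler.update()                  # adapts the scale factor for the next step
    return loss.item()
```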
|
|
|
|
|
|
|
|
## Installation & Setup
|
|
|
|
|
For detailed installation instructions, training scripts, and advanced usage: |
|
|
|
|
|
**Visit the [GitHub Repository](https://github.com/dixisouls/VelocityLM)**
|
|
|
|
|
The repository includes: |
|
|
- Complete training pipeline |
|
|
- Inference utilities |
|
|
- Configuration management |
|
|
- Multi-GPU training support |
|
|
- Comprehensive documentation |
|
|
|
|
|
## Roadmap
|
|
|
|
|
Future enhancements planned: |
|
|
- FlashAttention-2 integration
|
|
- Extended context length support (4K+) |
|
|
- Model quantization for efficient deployment |
|
|
- Fine-tuning capabilities for downstream tasks |
|
|
- ONNX export for production inference |