|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- text-generation |
|
|
- pytorch |
|
|
- transformer |
|
|
- rope |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
library_name: pytorch |
|
|
--- |
|
|
|
|
|
# VelocityLM
|
|
|
|
|
A high-performance, custom transformer language model trained from scratch. VelocityLM combines RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE) to deliver efficient and scalable language modeling.
|
|
|
|
|
## Quick Links
|
|
|
|
|
- **Try the Model**: [Interactive Demo Space](https://huggingface.co/spaces/dixisouls/VelocityLM)
|
|
- **Source Code**: [GitHub Repository](https://github.com/dixisouls/VelocityLM)
|
|
|
|
|
## Model Architecture
|
|
|
|
|
VelocityLM features a custom transformer architecture optimized for performance and efficiency: |
|
|
|
|
|
### Model Specifications |
|
|
- **Parameters**: ~2 billion
|
|
- **Architecture**: Decoder-only transformer with causal attention |
|
|
- **Hidden Size**: 2,048 |
|
|
- **Layers**: 24 transformer layers |
|
|
- **Attention Heads**: 32 heads per layer |
|
|
- **Vocabulary**: 50,257 tokens (GPT-2 tokenizer compatible) |
|
|
- **Context Length**: 2,048 tokens |
|
|
- **Intermediate Size**: 8,192 (4x hidden size) |
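

For reference, these specifications map onto a configuration object roughly like the following sketch; the class and field names are illustrative, not the repository's actual config API:

```python
from dataclasses import dataclass

@dataclass
class VelocityLMConfig:
    # Hypothetical config mirroring the specifications above
    vocab_size: int = 50257          # GPT-2 tokenizer compatible
    hidden_size: int = 2048
    num_layers: int = 24
    num_heads: int = 32
    intermediate_size: int = 8192    # 4x hidden size (SwiGLU feed-forward)
    max_seq_len: int = 2048          # context length
```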
|
|
|
|
|
### Key Innovations
|
|
|
|
|
#### RMSNorm (Root Mean Square Normalization) |
|
|
- Replaces LayerNorm for improved training stability and efficiency |
|
|
- Better gradient flow compared to traditional normalization |
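

A minimal PyTorch sketch of RMSNorm as described above (not necessarily the repository's exact implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse root mean square over the hidden dimension;
        # unlike LayerNorm, there is no mean subtraction and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```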
|
|
|
|
|
#### SwiGLU Activation Function |
|
|
- Gated Linear Unit with Swish activation |
|
|
- Superior performance compared to standard ReLU/GELU for language modeling |
|
|
- Enhanced expressivity and gradient flow |
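

A minimal sketch of a SwiGLU feed-forward block; the layer names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish/SiLU of the gate projection multiplicatively gates the
        # parallel up projection, then the result is projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```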
|
|
|
|
|
#### Rotary Position Embeddings (RoPE) |
|
|
- Encodes relative positions by rotating query/key feature pairs, so attention scores depend only on the distance between tokens


- Extrapolates better to sequences longer than those seen during training


- More parameter-efficient than learned absolute position embeddings, since there is no position table to train (sketched below)
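

A minimal sketch of applying RoPE to a tensor of attention heads, using the standard interleaved-pair formulation (not necessarily the repository's exact code):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key features; x has shape
    (batch, seq_len, num_heads, head_dim) with even head_dim."""
    _, seq_len, _, head_dim = x.shape
    # One rotation frequency per (even, odd) feature pair
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # 2D rotation of each feature pair by a position-dependent angle
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)
```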
|
|
|
|
|
## Training Details
|
|
|
|
|
- **Dataset**: [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) - high-quality web text |
|
|
- **Training Steps**: 5,000+ completed |
|
|
- **Optimization**: AdamW with cosine annealing schedule |
|
|
- **Hardware**: Trained on 4x NVIDIA A100 (80GB) GPUs |
|
|
- **Features**: Mixed precision (FP16), gradient checkpointing, distributed training |
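

A minimal sketch of the optimizer/schedule pairing described above; the learning rate, weight decay, and restart period are illustrative values, and `model` is a placeholder module:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 2048)  # stand-in for the actual transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
# Cosine annealing with warm restarts; the exact schedule parameters may differ
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000)
```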
|
|
|
|
|
## Usage
|
|
|
|
|
### Basic Text Generation |
|
|
|
|
|
```python
# Note: this model uses a custom architecture and requires custom loading code.
# See the GitHub repository for the complete implementation.

import torch
from transformers import AutoTokenizer

# Load the tokenizer (standard GPT-2 vocabulary, 50,257 tokens)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a prompt; the loaded model consumes these token ids
input_ids = tokenizer("The future of language models", return_tensors="pt").input_ids

# For complete model-loading and generation examples, see:
# https://github.com/dixisouls/VelocityLM
```
|
|
|
|
|
### Interactive Demo |
|
|
Try the model immediately in our [Hugging Face Space](https://huggingface.co/spaces/dixisouls/VelocityLM) - no setup required! |
|
|
|
|
|
## Performance Features
|
|
|
|
|
### Generation Strategies |
|
|
- Greedy decoding for deterministic output |
|
|
- Top-k and top-p (nucleus) sampling |
|
|
- Temperature scaling to control output randomness
|
|
- Repetition penalty to reduce repetitive text |
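

A minimal sketch of how temperature, top-k, and top-p combine when sampling a single token; this is generic sampling logic, not the repository's exact sampler:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8,
                      top_k: int = 50, top_p: float = 0.9) -> int:
    """Sample one token id from a 1-D (vocab_size,) logits vector."""
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    if top_k > 0:
        # Top-k: mask everything below the k-th highest logit
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    # Top-p (nucleus): keep the smallest prefix of sorted tokens whose
    # cumulative probability mass reaches top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cumulative > top_p
    remove[1:] = remove[:-1].clone()  # shift right so the boundary token survives
    remove[0] = False
    logits[sorted_idx[remove]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```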
|
|
|
|
|
### Memory Optimizations |
|
|
- Gradient checkpointing (40% memory reduction) |
|
|
- Efficient causal attention implementation |
|
|
- Streaming data processing |
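

A minimal sketch of the gradient-checkpointing pattern: activations inside each checkpointed block are recomputed during the backward pass instead of being stored, trading compute for memory. Here `layers` stands in for the model's transformer blocks:

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(layers, hidden_states: torch.Tensor) -> torch.Tensor:
    for layer in layers:
        # Only the block's inputs are kept; its internal activations
        # are recomputed when gradients flow back through it
        hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states
```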
|
|
|
|
|
## Technical Implementation
|
|
|
|
|
This model implements several cutting-edge techniques: |
|
|
|
|
|
- **Distributed Training**: Multi-GPU support with PyTorch DDP |
|
|
- **Mixed Precision**: FP16 training with automatic loss scaling |
|
|
- **Advanced Scheduling**: Cosine annealing with warm restarts |
|
|
- **Memory Efficiency**: Gradient checkpointing and parameter grouping |
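

As an illustration of the mixed-precision setup, here is a minimal FP16 training step with automatic loss scaling; the `model` call signature and batch layout are assumptions, not the repository's API:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, input_ids, labels):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # run the forward pass in FP16 where safe
        loss = model(input_ids, labels=labels)  # hypothetical call returning a scalar loss
    scaler.scale(loss).backward()    # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)           # unscales gradients, then steps the optimizer
    scaler.update()                  # adapts the scale factor for the next step
    return loss.item()
```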
|
|
|
|
|
|
|
|
## Installation & Setup
|
|
|
|
|
For detailed installation instructions, training scripts, and advanced usage: |
|
|
|
|
|
**Visit the [GitHub Repository](https://github.com/dixisouls/VelocityLM)**
|
|
|
|
|
The repository includes: |
|
|
- Complete training pipeline |
|
|
- Inference utilities |
|
|
- Configuration management |
|
|
- Multi-GPU training support |
|
|
- Comprehensive documentation |
|
|
|
|
|
## Roadmap
|
|
|
|
|
Future enhancements planned: |
|
|
- FlashAttention-2 integration
|
|
- Extended context length support (4K+) |
|
|
- Model quantization for efficient deployment |
|
|
- Fine-tuning capabilities for downstream tasks |
|
|
- ONNX export for production inference |