---
license: mit
tags:
- text-generation
- pytorch
- transformer
- rope
language:
- en
pipeline_tag: text-generation
library_name: pytorch
---

# VelocityLM πŸš€

A high-performance, custom transformer language model trained from scratch using modern architectural innovations. VelocityLM combines state-of-the-art techniques including RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE) to deliver efficient and scalable language modeling.

## 🎯 Quick Links

- **πŸš€ Try the Model**: [Interactive Demo Space](https://huggingface.co/spaces/dixisouls/VelocityLM)
- **πŸ’» Source Code**: [GitHub Repository](https://github.com/dixisouls/VelocityLM)

## πŸ—οΈ Model Architecture

VelocityLM features a custom transformer architecture optimized for performance and efficiency:

### Model Specifications
- **Parameters**: ~2B
- **Architecture**: Decoder-only transformer with causal attention
- **Hidden Size**: 2,048
- **Layers**: 24 transformer layers
- **Attention Heads**: 32 heads per layer
- **Vocabulary**: 50,257 tokens (GPT-2 tokenizer compatible)
- **Context Length**: 2,048 tokens
- **Intermediate Size**: 8,192 (4x hidden size)
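
Summarized as a configuration object (field names here are illustrative only; the actual config lives in the GitHub repository):

```python
from dataclasses import dataclass

@dataclass
class VelocityLMConfig:
    # Illustrative field names -- see the GitHub repo for the real config.
    vocab_size: int = 50257            # GPT-2-compatible tokenizer
    hidden_size: int = 2048
    num_layers: int = 24
    num_attention_heads: int = 32
    intermediate_size: int = 8192      # 4x hidden size (feed-forward width)
    max_seq_len: int = 2048            # context length
```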

### πŸ”¬ Key Innovations

#### RMSNorm (Root Mean Square Normalization)
- Replaces LayerNorm: scales activations by their root-mean-square alone, with no mean-centering or bias term
- Fewer operations per layer with equal or better training stability and gradient flow
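
As a rough illustration (a minimal sketch, not necessarily the repository's exact module), RMSNorm looks like this in PyTorch:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the last dimension; no mean, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```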

#### SwiGLU Activation Function
- Gated Linear Unit with Swish activation
- Superior performance compared to standard ReLU/GELU for language modeling
- Enhanced expressivity and gradient flow
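
A minimal PyTorch sketch of a SwiGLU feed-forward block (layer names such as `gate_proj` are illustrative, not the repository's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """down( SiLU(gate(x)) * up(x) ) -- a SiLU/Swish-gated linear unit."""
    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated "gate" modulates the linear "up" projection.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```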

#### Rotary Position Embeddings (RoPE)
- Encodes position by rotating query/key vectors, so attention scores depend only on relative distance
- Better extrapolation capabilities to longer sequences
- More efficient than learned absolute position embeddings
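
Conceptually, RoPE rotates each query/key feature pair by a position-dependent angle. A simplified sketch, applied per attention head before the dot-product (not the repository's exact implementation):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (batch, seq_len, n_heads, head_dim); returns the rotated tensor."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per feature pair, geometrically spaced.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1, x2) pair by its position's angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```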

## 🎯 Training Details

- **Dataset**: [Falcon RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) - high-quality web text
- **Training Steps**: 5,000+ completed
- **Optimization**: AdamW with cosine annealing schedule
- **Hardware**: Trained on 4x NVIDIA A100 (80GB) GPUs
- **Features**: Mixed precision (FP16), gradient checkpointing, distributed training
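
For reference, the optimizer setup corresponds roughly to the following (the learning rate, weight decay, and cycle length below are illustrative stand-ins, not the actual training hyperparameters):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(2048, 2048)  # stand-in for the full transformer
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000)  # restart every 1,000 steps

for step in range(5):  # skeleton of the training loop
    optimizer.zero_grad()
    loss = model(torch.randn(8, 2048)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```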

## πŸš€ Usage

### Basic Text Generation

```python
# Note: this model uses a custom architecture and requires custom loading code.
# See the GitHub repository for the complete implementation.

import torch
from transformers import AutoTokenizer

# Load the tokenizer (GPT-2 compatible, 50,257 tokens)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Encode a prompt into model-ready input ids for the custom model
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

# For complete usage examples and model loading:
# https://github.com/dixisouls/VelocityLM
```

### Interactive Demo
Try the model immediately in our [Hugging Face Space](https://huggingface.co/spaces/dixisouls/VelocityLM) - no setup required!

## πŸ“Š Performance Features

### Generation Strategies
- Greedy decoding for deterministic output
- Top-k and top-p (nucleus) sampling
- Temperature control for creativity adjustment
- Repetition penalty to reduce repetitive text
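
A minimal sketch of how temperature, top-k, and repetition penalty interact in a single sampling step (a generic helper, not the repository's API; top-p works analogously on the cumulative probability mass):

```python
import torch

def sample_next_token(logits: torch.Tensor, prev_tokens: torch.Tensor,
                      temperature: float = 0.8, top_k: int = 50,
                      repetition_penalty: float = 1.2) -> torch.Tensor:
    """logits: (vocab_size,) for the next position; returns one sampled token id."""
    logits = logits.clone()
    # Repetition penalty: push already-generated tokens away from selection.
    scores = logits[prev_tokens]
    logits[prev_tokens] = torch.where(
        scores > 0, scores / repetition_penalty, scores * repetition_penalty
    )
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    logits = logits / temperature
    # Top-k: sample only among the k most likely tokens.
    top_vals, top_idx = torch.topk(logits, top_k)
    probs = torch.softmax(top_vals, dim=-1)
    return top_idx[torch.multinomial(probs, num_samples=1)]
```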

### Memory Optimizations  
- Gradient checkpointing (40% memory reduction)
- Efficient causal attention implementation
- Streaming data processing
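
Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal sketch with stand-in layers:

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList([torch.nn.Linear(2048, 2048) for _ in range(24)])
x = torch.randn(1, 2048, requires_grad=True)

for layer in layers:
    # Activations inside `layer` are recomputed on backward, not stored.
    x = checkpoint(layer, x, use_reentrant=False)

x.sum().backward()  # recomputation happens here, layer by layer
```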

## πŸ”§ Technical Implementation

This model implements several cutting-edge techniques:

- **Distributed Training**: Multi-GPU support with PyTorch DDP
- **Mixed Precision**: FP16 training with automatic loss scaling  
- **Advanced Scheduling**: Cosine annealing with warm restarts
- **Memory Efficiency**: Gradient checkpointing and parameter grouping
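
The FP16 path follows the standard PyTorch AMP pattern (a generic sketch, not the repository's training loop; in multi-GPU runs the model would additionally be wrapped in `DistributedDataParallel`):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(2048, 2048).cuda()  # stand-in; requires a CUDA device
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = GradScaler()  # keeps FP16 gradients from underflowing

for step in range(5):
    optimizer.zero_grad()
    with autocast():  # forward pass runs in FP16 where numerically safe
        loss = model(torch.randn(8, 2048, device="cuda")).pow(2).mean()
    scaler.scale(loss).backward()  # backprop on the scaled loss
    scaler.step(optimizer)         # unscales gradients, skips step on inf/nan
    scaler.update()                # adapts the loss scale
```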


## πŸ› οΈ Installation & Setup

For detailed installation instructions, training scripts, and advanced usage:

**πŸ‘‰ Visit the [GitHub Repository](https://github.com/dixisouls/VelocityLM)**

The repository includes:
- Complete training pipeline
- Inference utilities
- Configuration management
- Multi-GPU training support
- Comprehensive documentation

## πŸ“ˆ Roadmap

Future enhancements planned:
- Flash Attention 2.0 integration
- Extended context length support (4K+)
- Model quantization for efficient deployment
- Fine-tuning capabilities for downstream tasks
- ONNX export for production inference