TimurHromek committed on
Commit
fba23f2
·
verified ·
1 Parent(s): 62e5c96

Update README.md

Files changed (1)
  1. README.md +82 -79
README.md CHANGED
@@ -1,95 +1,98 @@
- # HROM - Hybrid Rotary Optimized Model
- *A Conversational AI Architecture with Enhanced Position Awareness*

- [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org)
- [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

- ## Overview
- HROM is a transformer-based language model designed for dialogue systems. It combines rotary positional embeddings with architectural choices optimized for efficient conversation processing, aiming for strong performance alongside safety and computational efficiency.

- ## Key Features

- ### Core Innovations
- - **Rotary Position Encoding**
-   Applies rotation matrices to queries and keys for dynamic position awareness in the attention mechanism

- - **Hybrid Activation**
-   SwiGLU (Swish-gated linear unit) feed-forward networks

- - **Conversation Structure**
-   Special handling for multi-turn dialogues with user/assistant role tokens

- - **Safety First**
-   Integrated content filtering and generation safeguards

- ## Model Architecture

- ### Structural Overview
- | Component              | Specification        |
- |------------------------|----------------------|
- | Layers                 | 6 transformer blocks |
- | Attention Heads        | 8 per layer          |
- | Hidden Dimension       | 512                  |
- | Feed-Forward Dimension | 2048                 |
- | Context Window         | 1024 tokens          |
- | Vocabulary Size        | 32,000 BPE tokens    |

- ### Key Technical Components
- 1. **Positional Encoding**
-    Rotary embeddings that preserve relative positional information through vector rotations

- 2. **Attention Mechanism**
-    Multi-head attention with combined causal/padding masks

- 3. **Activation Strategy**
-    SwiGLU non-linearity in feed-forward networks

- 4. **Safety Systems**
-    Real-time content filtering and generation constraints

- ## Getting Started

- ### Installation
- 1. Install PyTorch 2.0+
- 2. Install the supporting packages: `tokenizers` and `datasets`
- 3. Clone the repository

- ### Basic Usage
- 1. **Initialization**
-    Load the pretrained tokenizer and model weights

- 2. **Text Generation**
-    Pass user input through the safety system and generate responses

- 3. **Training**
-    Configure dataset paths and hyperparameters in the training scripts

  ## Training Configuration

- ### Optimization Setup
- - **Batch Size**: 32 sequences
- - **Learning Rate**: 3e-4
- - **Epochs**: 50
- - **Regularization**:
-   - 0.1 dropout rate
-   - Gradient clipping at 1.0

- ### Dataset Handling
- - Processes multi-turn conversations from DailyDialog
- - Supports up to 6 dialogue turns per sample
- - Dynamic padding and memory-efficient batching

- ## Safety Systems

- ### Content Protection
- - Blocklist filtering for harmful phrases
- - Generation termination protocol
- - Interactive safety checks during response creation

- ## Performance
- - Efficient CUDA utilization via mixed-precision training
- - Checkpoint management system for long-running jobs
- - Memory-optimized attention masking

- ## License
- Apache License 2.0 - See LICENSE file for details. Commercial use requires prior authorization.
+ # HROM-V1.5: Hybrid Rotary-Optimized Model

+ ## Architectural Overview

+ HROM-V1.5 implements several key innovations in transformer architecture design:

+ ### Core Components

+ 1. **Rotary Position Embeddings (RoPE)**
+    - Position-aware attention without absolute position embeddings
+    - Relative position encoding via rotation matrices
+    - Stable gradient propagation for long sequences
+    - Dynamic sequence-length handling (0-512 tokens)

+ 2. **SwiGLU Activation**
+    - Gated linear unit variant in the feed-forward networks
+    - FFN width scaled by 2/3 (2048 instead of the standard 4×768 = 3072) to offset the extra gate projection
+    - Improved gradient flow compared to ReLU/GELU
+    - Formula as implemented: `SwiGLU(x) = x * gelu(gate)` (a GELU-gated variant; canonical SwiGLU gates with Swish/SiLU)

+ 3. **Attention Mechanism**
+    - 8-head attention with 96-dimensional heads
+    - Combined causal + padding mask support (sketched after this list)
+    - Scaled dot-product with 1/√d_k normalization
+    - Attention dropout (p=0.1)
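
+ The combined mask can be sketched as follows (an illustration of the masking scheme, not the repository's exact code; the helper name is assumed):

+ ```python
+ import torch
+
+ def combined_attention_mask(pad_mask: torch.Tensor) -> torch.Tensor:
+     # pad_mask: (batch, seq_len), True for real tokens, False for padding
+     batch, seq_len = pad_mask.shape
+     # Causal part: query position i may only attend to key positions j <= i
+     causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
+                                    device=pad_mask.device))
+     # AND with the padding mask over key positions; shapes broadcast to
+     # (batch, 1, seq_len, seq_len) so one mask serves all heads
+     return causal[None, None, :, :] & pad_mask[:, None, None, :]
+
+ # Disallowed scores are filled with -inf before the scaled softmax:
+ #   scores = scores.masked_fill(~mask, float("-inf"))
+ ```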

+ ### Model Specifications

+ | Component           | Specification           |
+ |---------------------|-------------------------|
+ | Layers              | 8                       |
+ | Hidden Dimension    | 768                     |
+ | FFN Dimension       | 2048 (SwiGLU-activated) |
+ | Attention Heads     | 8                       |
+ | Head Dimension      | 96                      |
+ | Vocabulary Size     | 32,000                  |
+ | Max Sequence Length | 512 tokens              |
+ | Dropout Rate        | 0.1                     |
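
+ The table above maps directly onto a small configuration object; the following is a sketch, and the class and field names are assumptions rather than the repository's actual config:

+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class HROMConfig:            # hypothetical name, for illustration only
+     n_layers: int = 8
+     d_model: int = 768       # hidden dimension
+     ffn_dim: int = 2048      # SwiGLU-activated feed-forward width
+     n_heads: int = 8
+     head_dim: int = 96       # d_model // n_heads
+     vocab_size: int = 32_000
+     max_seq_len: int = 512
+     dropout: float = 0.1
+ ```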

  ## Training Configuration

+ ### Dataset Composition
+ - **DailyDialog**: 11k conversational samples
+ - **EmpatheticDialogues**: 18k emotionally rich exchanges
+ - **BlendedSkillTalk**: 5k multi-skill interactions
+ - **Persona-Chat**: 18k personality-driven dialogues (see the loading sketch after this list)
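
+ Loading the corpora with the `datasets` library might look like the following; the Hub IDs for the first three are the standard ones, while Persona-Chat is published under several community IDs, so its ID is left open:

+ ```python
+ from datasets import load_dataset
+
+ daily   = load_dataset("daily_dialog", split="train")
+ empathy = load_dataset("empathetic_dialogues", split="train")
+ blended = load_dataset("blended_skill_talk", split="train")
+ # Persona-Chat: substitute the Hub ID of the mirror you use
+ ```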

+ ### Optimization Parameters
+ - **Batch Size**: 16 (effective 128 via 8-step gradient accumulation; see the loop sketch after this list)
+ - **Learning Rate**: 2e-5 with linear warmup (1k steps)
+ - **Optimizer**: AdamW (β1=0.9, β2=0.95)
+ - **Weight Decay**: 0.1
+ - **Epochs**: 30
+ - **Gradient Clipping**: 1.0
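
+ A minimal training-step sketch combining these settings (`model`, `dataloader`, and `compute_loss` are placeholders, not names from the repository):

+ ```python
+ import torch
+ from torch.optim import AdamW
+ from torch.optim.lr_scheduler import LambdaLR
+
+ ACCUM_STEPS = 8     # 16 per step x 8 accumulation -> effective batch of 128
+ WARMUP_STEPS = 1000
+
+ optimizer = AdamW(model.parameters(), lr=2e-5,
+                   betas=(0.9, 0.95), weight_decay=0.1)
+ # Linear warmup to the peak learning rate over the first 1k optimizer steps
+ scheduler = LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / WARMUP_STEPS))
+
+ for step, batch in enumerate(dataloader):
+     loss = compute_loss(model, batch) / ACCUM_STEPS  # scale for accumulation
+     loss.backward()
+     if (step + 1) % ACCUM_STEPS == 0:
+         torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+         optimizer.step()
+         scheduler.step()
+         optimizer.zero_grad()
+ ```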

+ ## Technical Implementation

+ ### Position Encoding
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class RotaryEmbedding(nn.Module):
+     def __init__(self, dim):
+         super().__init__()
+         # Inverse frequencies for each even dimension (standard RoPE, base 10000)
+         inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
+         self.register_buffer("inv_freq", inv_freq)
+
+     def forward(self, seq_len):
+         # Empty sequences: return a (0, dim) tensor of angles up front
+         if seq_len == 0:
+             return torch.empty((0, self.inv_freq.shape[0] * 2),
+                                device=self.inv_freq.device,
+                                dtype=self.inv_freq.dtype)
+         # Outer product of positions and inverse frequencies: (seq_len, dim/2)
+         t = torch.arange(seq_len, device=self.inv_freq.device).type_as(self.inv_freq)
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Duplicate the angles so both halves of each head dimension rotate
+         return torch.cat((freqs, freqs), dim=-1)
+ ```
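
+ The returned angles are consumed by rotating query/key channel pairs. The helpers below follow the standard convention matching the `torch.cat((freqs, freqs))` layout above (a sketch; the repository's actual helpers may differ):

+ ```python
+ def rotate_half(x):
+     # (x1, x2) -> (-x2, x1), pairing each channel with its counterpart
+     x1, x2 = x.chunk(2, dim=-1)
+     return torch.cat((-x2, x1), dim=-1)
+
+ def apply_rotary(q, k, freqs):
+     # q, k: (batch, heads, seq_len, head_dim); freqs: (seq_len, head_dim)
+     cos, sin = freqs.cos(), freqs.sin()
+     q_rot = q * cos + rotate_half(q) * sin
+     k_rot = k * cos + rotate_half(k) * sin
+     return q_rot, k_rot
+ ```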

+ ### SwiGLU Implementation
+ ```python
+ import torch.nn as nn
+
+ class SwiGLU(nn.Module):
+     def forward(self, x):
+         # Input is the doubled-width FFN projection; split into value and gate
+         x, gate = x.chunk(2, dim=-1)
+         # Gate with GELU (as implemented; canonical SwiGLU uses Swish/SiLU)
+         return x * nn.functional.gelu(gate)
+ ```
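
+ Since `chunk(2, dim=-1)` halves the last dimension, the preceding linear layer must project to twice the FFN width. A sketch of how the feed-forward block might be wired (the wiring is an assumption consistent with the specifications above):

+ ```python
+ ffn = nn.Sequential(
+     nn.Linear(768, 2 * 2048),  # value and gate halves -> (..., 4096)
+     SwiGLU(),                  # -> (..., 2048)
+     nn.Dropout(0.1),
+     nn.Linear(2048, 768),      # back to the model dimension
+ )
+ ```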

+ ## License

+ Apache License 2.0
+ Copyright 2025 Timur Hromek

+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at

+ http://www.apache.org/licenses/LICENSE-2.0

+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.