Update README.md

README.md (CHANGED)

@@ -1,95 +1,98 @@

# HROM
*A Conversational AI Architecture with Enhanced Position Awareness*

[](https://www.apache.org/licenses/LICENSE-2.0)

HROM is a transformer-based language model specifically designed for dialogue systems, combining rotary positional embeddings with optimized architectural choices for efficient conversation processing. The model achieves strong performance while maintaining safety and computational efficiency.

[section heading, feature list, and architecture table lost in extraction; surviving fragments:]

Integrated content filtering and generation safeguards

| Vocabulary Size | 32,000 BPE tokens |

### Key Technical Components

1. **Positional Encoding**
   Rotary embeddings that preserve relative positional information through vector rotations

2. **Attention Mechanism**
   Multi-head attention with combined causal/padding masks

3. **Activation Strategy**
   SwiGLU non-linearity in feed-forward networks

4. **Safety Systems**
   Real-time content filtering and generation constraints

## Getting Started

### Installation

1. Install PyTorch 2.0+
2. Install supporting packages: `tokenizers` and `datasets`
3. Clone the repository

### Basic Usage

1. **Initialization**
   Load the pretrained tokenizer and model weights

2. **Text Generation**
   Process user input through the safety system and generate responses

3. **Training**
   Configure dataset paths and hyperparameters in the training scripts

## Training Configuration

[training parameter list and a following section heading lost in extraction]

- Processes multi-turn conversations from DailyDialog
- Supports up to 6 dialogue turns per sample
- Dynamic padding and memory-efficient batching

- Blocklist filtering for harmful phrases
- Generation termination protocol
- Interactive safety checks during response creation

- Efficient CUDA utilization via mixed-precision training
- Checkpoint management system for long-running jobs
- Memory-optimized attention masking

## License

Apache License 2.0 - See LICENSE file for details. Commercial use requires prior authorization.

# HROM-V1.5: Hybrid Rotary-Optimized Model

## Architectural Overview

HROM-V1.5 implements several key innovations in transformer architecture design:

### Core Components

1. **Rotary Position Embeddings (RoPE)**
   - Position-aware attention mechanism without absolute position embeddings
   - Relative position encoding via rotation matrices
   - Stable gradient propagation for long sequences
   - Dynamic sequence length handling (0-512 tokens)

2. **SwiGLU Activation**
   - Swish-gated linear unit variant (the gate here uses GELU, i.e. a GEGLU-style gate)
   - FFN width scaled to 2/3 of the standard 4×d_model (2048 vs. 3072), so the extra gate projection does not increase the parameter count
   - Improved gradient flow compared to ReLU/GELU
   - Formula: `SwiGLU(x) = x * gelu(gate)`, where the FFN input is split in half into `x` and `gate`

3. **Attention Mechanism** (see the mask-construction sketch after this list)
   - 8-head attention with 96-dimension heads
   - Combined causal + padding mask support
   - Scaled dot-product with 1/√d_k normalization
   - Attention dropout (p=0.1)
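
Since the causal and padding constraints are combined into a single mask, a minimal sketch of the idea is shown below. It uses PyTorch's built-in `scaled_dot_product_attention`; the helper name and tensor layout are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def build_combined_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    """pad_mask: (batch, seq_len) bools, True for real (non-padding) tokens.
    Returns (batch, 1, seq_len, seq_len) bools, True where attending is
    allowed: the position is causal (key <= query) AND the key is not padding."""
    T = pad_mask.shape[1]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=pad_mask.device))
    # Broadcast (1, 1, T, T) & (B, 1, 1, T) -> (B, 1, T, T), shared across heads
    return causal[None, None, :, :] & pad_mask[:, None, None, :]

# q, k, v: (batch, n_heads, seq_len, head_dim)
# out = F.scaled_dot_product_attention(q, k, v,
#                                      attn_mask=build_combined_mask(pad_mask),
#                                      dropout_p=0.1)  # 1/sqrt(d_k) scaling is applied internally
```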

### Model Specifications

| Component           | Specification           |
|---------------------|--------------------------|
| Layers              | 8                        |
| Hidden Dimension    | 768                      |
| FFN Dimension       | 2048 (SwiGLU-activated)  |
| Attention Heads     | 8                        |
| Head Dimension      | 96                       |
| Vocabulary Size     | 32,000                   |
| Max Sequence Length | 512 tokens               |
| Dropout Rate        | 0.1                      |
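
For orientation, these specifications map naturally onto a single configuration object. The sketch below is illustrative only; the class and field names are assumptions, not the repository's API.

```python
from dataclasses import dataclass

@dataclass
class HROMConfig:  # hypothetical name, for illustration only
    n_layers: int = 8
    d_model: int = 768
    ffn_dim: int = 2048        # ~2/3 of 4 * d_model, SwiGLU-gated
    n_heads: int = 8
    vocab_size: int = 32_000
    max_seq_len: int = 512
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads  # 768 / 8 = 96
```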

## Training Configuration

### Dataset Composition

- **DailyDialog**: 11k conversational samples
- **EmpatheticDialogues**: 18k emotionally-rich exchanges
- **BlendedSkillTalk**: 5k multi-skill interactions
- **Persona-Chat**: 18k personality-driven dialogues
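
As a rough illustration of pulling one of these corpora in with the Hugging Face `datasets` library, here is a hedged sketch for DailyDialog; the hub ID and field names are assumptions, not the repository's actual loader, and the other three corpora would load analogously.

```python
from datasets import load_dataset

# Illustrative only: "daily_dialog" exposes a "dialog" field holding the
# list of utterances for each multi-turn conversation.
daily = load_dataset("daily_dialog", split="train")
corpus = daily.map(lambda ex: {"turns": ex["dialog"]},
                   remove_columns=daily.column_names)
print(corpus[0]["turns"][:2])  # first two turns of the first conversation
```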

### Optimization Parameters

- **Batch Size**: 16 (effective 128 via 8-step gradient accumulation)
- **Learning Rate**: 2e-5 with linear warmup (1k steps)
- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Weight Decay**: 0.1
- **Epochs**: 30
- **Gradient Clipping**: 1.0
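
Taken together, these settings wire up roughly as follows. This is a minimal sketch with a stand-in model and dummy data; the flat post-warmup schedule is an assumption (only the warmup is specified above).

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(768, 768)                      # stand-in for HROM
optimizer = AdamW(model.parameters(), lr=2e-5,
                  betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps = 1_000
scheduler = LambdaLR(optimizer, lambda step:
                     min(1.0, (step + 1) / warmup_steps))  # linear warmup, then flat (assumed)

accum_steps = 8                                  # micro-batch 16 -> effective batch 128
for step, x in enumerate(torch.randn(32, 16, 768)):  # dummy micro-batches
    loss = model(x).pow(2).mean() / accum_steps  # placeholder loss, scaled for accumulation
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```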

## Technical Implementation

### Position Encoding

```python
import torch
import torch.nn as nn

class RotaryEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Per-pair inverse frequencies: 1 / 10000^(2i/dim)
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len):
        # Handle empty sequences up front
        if seq_len == 0:
            return torch.empty(
                (0, self.inv_freq.shape[0] * 2),
                device=self.inv_freq.device, dtype=self.inv_freq.dtype,
            )
        # Position indices, matched to the buffer's device and dtype
        t = torch.arange(seq_len, device=self.inv_freq.device).type_as(self.inv_freq)
        # Outer product: (seq_len, dim/2) table of rotation angles
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Duplicate so each rotation pair gets its angle: (seq_len, dim)
        return torch.cat((freqs, freqs), dim=-1)
```
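
The module above only produces the table of rotation angles. Applying them to queries and keys is typically done with a `rotate_half` helper, as in the standard RoPE formulation sketched below; this is the conventional pairing for the `cat((freqs, freqs))` layout, not necessarily the repository's exact code.

```python
import torch

def rotate_half(x):
    # Split the last dimension in two and rotate the halves: (a, b) -> (-b, a)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(q, k, freqs):
    # q, k: (..., seq_len, head_dim); freqs: (seq_len, head_dim) from RotaryEmbedding
    cos, sin = freqs.cos(), freqs.sin()
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
```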

### SwiGLU Implementation

```python
import torch.nn as nn

class SwiGLU(nn.Module):
    """Splits the input into a value half and a gate half, then applies a
    GELU gate (a GEGLU-style variant of SwiGLU)."""
    def forward(self, x):
        x, gate = x.chunk(2, dim=-1)
        return x * nn.functional.gelu(gate)
```
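
Note that `chunk(2, dim=-1)` implies the preceding projection must produce twice the FFN width. A hypothetical feed-forward block showing that wiring (the module name and exact composition are assumptions; the dimensions follow the specification table):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=768, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.up = nn.Linear(d_model, ffn_dim * 2)  # value half + gate half
        self.act = SwiGLU()                        # defined in the block above
        self.down = nn.Linear(ffn_dim, d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return self.drop(self.down(self.act(self.up(x))))
```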

## License

Apache License 2.0
Copyright 2025 Timur Hromek

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.