FlameF0X committed
Commit 6617ea1 · verified · parent: e1f00db

Update README.md

Files changed (1)
  1. README.md +142 -58
README.md CHANGED
@@ -5,83 +5,167 @@ tags:
  - conversational
  - efficient
  - i3-architecture
  datasets:
  - starhopp3r/TinyChat
  library_name: transformers
  pipeline_tag: text-generation
  ---

- # i3 Model - Memory-Optimized Efficient Conversational Language Model

  ## Model Description

- The **i3 Model** is a memory-optimized language model designed for conversational understanding. This version uses streaming tokenization to minimize RAM usage during training.

  ## Model Statistics

- - **Vocabulary Size**: 4,466 (variable-length chunks)
- - **Hidden Dimension**: 512
- - **Number of Layers**: 24
  - **Max Sequence Length**: 256
- - **Total Parameters**: 22,640,626
  - **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)

- To use the model check the [user.py](https://huggingface.co/FlameF0X/i3-22m/blob/main/user.py).

  ### Key Features

- 1. **Memory-Optimized**: Streaming tokenization reduces RAM usage significantly
- 2. **Proprietary Hybrid Architecture**: Advanced sequence processing with linear complexity
- 3. **Variable-Length Tokenization**: Smart chunking strategy for better compression
- 4. **Conversational Focus**: Specialized for dialogue and emotional understanding

  ## Training Details

- - **Dataset**: [TinyChat](https://huggingface.co/datasets/starhopp3r/TinyChat)
- - **Training Objective**: Next-token prediction with proprietary optimization
  - **Framework**: PyTorch
- - **Memory Optimization**: Streaming dataset processing
-
- # Technical Report: i3 Pre-training
- 1. Executive Summary
- The i3 model, a small-scale text generation architecture, successfully completed its initial pre-training phase. This training was conducted on an NVIDIA GeForce RTX 3060 and required approximately 17 hours of continuous processing. The resulting model artifacts are configured for deployment on the HuggingFace platform.
- The model is characterized by a compact architecture featuring 24 layers and a hidden dimension of 512, paired with a custom "chunk" tokenization strategy designed for efficiency on conversational data.
- 2. Model Configuration and Architecture
- The i3Model architecture is designed to be highly efficient, likely incorporating elements of a State Space Model (SSM) due to the low-rank and state-space parameters (rank and d_state).
- | Parameter | Value | Description |
- |---|---|---|
- | Model Type | i3Model | Custom, high-efficiency architecture (likely SSM-enhanced). |
- | Hidden Dimension (d_model) | 512 | The size of the vector space for internal representations. |
- | Number of Layers (n_layers) | 24 | The depth of the model's processing blocks. |
- | Attention Heads (n_heads) | 16 | The number of parallel attention mechanisms (if applicable). |
- | State Dimension (d_state) | 64 | Indicates the size of the recurrent state, common in SSMs. |
- | Rank | 128 | Potentially used for low-rank projection in attention or state mechanisms. |
- | Max Sequence Length | 256 | The maximum number of tokens/chunks the model can process at once. |
- | Vocabulary Size | 4,466 | The total number of unique chunks/tokens in the vocabulary. |
- 3. Training Environment and Duration
- The training phase was characterized by high hardware efficiency, achieving a complete pre-training run on consumer-grade hardware in a short timeframe.
- * Hardware Used: NVIDIA GeForce RTX 3060 (12GB VRAM assumed).
- * Total Training Time: Approximately 17 hours.
- * Framework: PyTorch (with HuggingFace Transformers for generation of final files).
- 4. Training Data and Procedure
- Dataset
- The model was pre-trained using the TinyChat dataset, which comprised 1,000,000 conversations. This suggests the model is optimized for rapid, short-form conversational tasks.
- Tokenization Strategy
- A crucial element of the model's efficiency is its custom tokenization approach:
- * Tokenizer Type: chunk
- * Strategy: variable_2_3
- * Vocabulary: The vocabulary size is notably small (4,466 chunks), indicating that the tokenizer is designed to aggregate common sequences of text into single tokens, significantly reducing the effective sequence length and computational cost during training.
- Performance Metrics
- Training showed consistent iteration steps, with the log reporting final metrics as the process concluded:
- | Metric | Range (Last 500 Iterations) | Observation |
- |---|---|---|
- | Loss | 1.98 - 2.27 | Training loss remained relatively stable, suggesting convergence towards the end of the run. |
- | Perplexity (PPL) | 7.29 - 9.70 | Perplexity is a measure of how well the model predicts the next token. This range is typical for raw pre-training logs and indicates the model has learned basic sequence dependencies. |
- | Time per Iteration | ~8.2 s - 12.7 s | Processing time per iteration shows a sustained and efficient training throughput. |
- 5. Deliverables
- Upon completion, the necessary files for deployment were generated into the i3_model_hf/ directory, ensuring immediate compatibility with the HuggingFace ecosystem:
- * pytorch_model.bin (Model Weights)
- * config.json (Model Configuration)
- * tokenizer.json (Vocabulary File)
- * tokenizer_config.json (Tokenizer Configuration)
- The model is now ready for fine-tuning on a specific downstream task or for evaluation of its foundational text generation capabilities.
  - conversational
  - efficient
  - i3-architecture
+ - hybrid-model
+ - rwkv-mamba
  datasets:
+ - agentlans/high-quality-english-sentences
+ - roneneldan/TinyStories
  - starhopp3r/TinyChat
  library_name: transformers
  pipeline_tag: text-generation
  ---

+ # i3-80M - Hybrid Architecture Language Model

  ## Model Description

+ The **i3-80M Model** is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.

  ## Model Statistics

+ - **Total Parameters**: ~80M
+ - **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full-attention layers = 16 layers total
+ - **Vocabulary Size**: 35,560 tokens (variable-length chunks plus an <UNK> token)
+ - **Hidden Dimension (d_model)**: 512
+ - **Attention Heads**: 16
+ - **State Dimension (d_state)**: 32
  - **Max Sequence Length**: 256
  - **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)

+ ### Architecture Breakdown
+ ```
+ Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
+ ├─ RWKVMambaHybrid (Time-mixing + State-space)
+ └─ Feed-Forward Network (4x expansion)
+
+ Layers 11-16: Full Attention Blocks
+ ├─ Multi-Head Attention (16 heads)
+ └─ Feed-Forward Network (4x expansion)
+ ```
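For orientation, the stack above can be sketched in PyTorch. This is a minimal illustration only: the class names (`HybridBlock`, `AttentionBlock`, `I3Sketch`) are invented for this sketch, and the hybrid block uses a depthwise causal convolution as a stand-in for the actual RWKV-Mamba time-mixing/state-space machinery. The layer split (10 + 6), d_model = 512, 16 heads, 4x FFN expansion, 256-token context and 35,560-token vocabulary are taken from the statistics above.

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, VOCAB, MAX_LEN = 512, 16, 35560, 256

class FeedForward(nn.Module):
    def __init__(self, d):
        super().__init__()
        # 4x expansion, as in the breakdown above
        self.net = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return self.net(x)

class HybridBlock(nn.Module):
    """Stand-in for the RWKV-Mamba hybrid block (here: a depthwise causal convolution)."""
    def __init__(self, d):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.mix = nn.Conv1d(d, d, kernel_size=3, padding=2, groups=d)
        self.ffn = FeedForward(d)

    def forward(self, x):                                    # x: (B, T, D)
        h = self.mix(self.norm1(x).transpose(1, 2))          # (B, D, T + 2)
        x = x + h[..., : x.size(1)].transpose(1, 2)          # keep first T outputs -> causal
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    def __init__(self, d, heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = FeedForward(d)

    def forward(self, x):
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=mask)            # causal multi-head attention
        x = x + h
        return x + self.ffn(self.norm2(x))

class I3Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        self.blocks = nn.ModuleList(
            [HybridBlock(D_MODEL) for _ in range(10)]               # layers 1-10
            + [AttentionBlock(D_MODEL, N_HEADS) for _ in range(6)]  # layers 11-16
        )
        self.lm_head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, ids):                                   # ids: (B, T), T <= 256
        x = self.embed(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)                                # next-chunk logits

logits = I3Sketch()(torch.randint(0, VOCAB, (1, 32)))         # -> shape (1, 32, 35560)
```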

  ### Key Features

+ 1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
+    - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
+    - Later layers use full multi-head attention for complex pattern recognition
+
+ 2. **Memory-Optimized Training**:
+    - Streaming vocabulary building (no full-text storage)
+    - Vocabulary caching (build once, reuse)
+    - Efficient chunk frequency counting
+    - Automatic memory cleanup
+
+ 3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
+    - TinyStories: narrative and storytelling
+    - TinyChat: conversational dynamics
+    - High-Quality English Sentences: linguistic diversity
+
+ 4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization (a minimal sketch follows this list)
+    - Total tokens processed: **3,000,000+**
+    - Handles unknown tokens gracefully with an <UNK> token
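The chunk tokenizer can be pictured with a short sketch: stream over the corpus counting 2-3 character chunks (no full-text storage), keep the most frequent chunks up to the vocabulary budget, and fall back to `<UNK>` at encode time. The function names and the greedy longest-match rule are assumptions made for illustration; the released vocabulary lives in `chunk_vocab_combined.json`.

```python
from collections import Counter
from typing import Dict, Iterable, List

def build_chunk_vocab(texts: Iterable[str], max_size: int = 35_560) -> Dict[str, int]:
    """Stream over texts, counting 2- and 3-character chunks without storing the corpus."""
    counts = Counter()
    for text in texts:
        for i in range(0, len(text), 2):
            counts[text[i:i + 3]] += 1          # candidate trigram
            counts[text[i:i + 2]] += 1          # candidate bigram
    vocab = {"<UNK>": 0}
    for chunk, _ in counts.most_common(max_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Greedy longest match: try a 3-char chunk, fall back to 2 chars, then <UNK>."""
    ids, i = [], 0
    while i < len(text):
        for size in (3, 2):
            chunk = text[i:i + size]
            if chunk in vocab:
                ids.append(vocab[chunk])
                i += size
                break
        else:
            ids.append(vocab["<UNK>"])
            i += 1
    return ids

vocab = build_chunk_vocab(["hello there, how are you?", "tiny chat model"])
print(encode("hello", vocab))
```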

  ## Training Details

+ ### Training Configuration
+
+ - **Datasets**:
+   - `agentlans/high-quality-english-sentences`
+   - `roneneldan/TinyStories`
+   - `starhopp3r/TinyChat`
+ - **Training Steps**: 5,000 iterations
+ - **Batch Size**: 4 (with gradient accumulation support)
+ - **Learning Rate**: 3e-4 (with warmup and cosine decay)
+ - **Optimizer**: AdamW with gradient clipping (max norm: 1.0); a minimal sketch of this setup follows the list
+ - **Hardware**: NVIDIA GeForce RTX 3060 (12GB VRAM)
+ - **Training Time**: ~17 hours
  - **Framework**: PyTorch
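A minimal sketch of an equivalent optimizer and schedule setup in plain PyTorch. The 500-step warmup length is an assumption (the configuration above only states "warmup and cosine decay"), and the model and loss here are placeholders.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the i3 model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

TOTAL_STEPS, WARMUP_STEPS = 5_000, 500  # warmup length assumed

def lr_lambda(step: int) -> float:
    # Linear warmup followed by cosine decay to zero.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    loss = model(torch.randn(4, 512)).pow(2).mean()               # dummy loss, batch size 4
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```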
+
+ ### Training Dynamics
+
+ - **GPU Utilization**: stable at ~15-20% during training
+ - **GPU Memory**: ~18% allocated (~2.2 GB / 12 GB)
+ - **Power Usage**: ~40 W average
+ - **Throughput**: ~100-550 tokens/sec
+
+ ### Performance Metrics
+
+ | Metric | Initial | Final | Best |
+ |--------|---------|-------|------|
+ | Training Loss | ~6.0 | ~2.0 | 1.98 |
+ | Perplexity | ~400+ | ~7-10 | 7.29 |
+
+ The model shows strong convergence with stable training dynamics and efficient GPU utilization.
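As a sanity check, the reported perplexities are broadly consistent with the usual relation perplexity = exp(cross-entropy loss):

```python
import math

# Perplexity is the exponential of the per-token cross-entropy loss.
for loss in (6.0, 2.0, 1.98):
    print(f"loss {loss:.2f} -> perplexity {math.exp(loss):.2f}")
# loss 6.00 -> perplexity 403.43
# loss 2.00 -> perplexity 7.39
# loss 1.98 -> perplexity 7.24
```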
+
+ ## Usage
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
+ tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")
+
+ # Generate text
+ prompt = "hello"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+     inputs.input_ids,
+     max_length=100,
+     do_sample=True,   # temperature/top_k only take effect when sampling is enabled
+     temperature=0.8,
+     top_k=40
+ )
+ generated_text = tokenizer.decode(outputs[0])
+ print(generated_text)
+ ```
+
+ For custom usage with the original training code, see [user.py](https://huggingface.co/FlameF0X/i3-80m/blob/main/user.py).
+
+ ## Technical Innovations
+
+ 1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (a sketch of this kind of recurrence follows the list)
+    - Linear complexity for long sequences
+    - Efficient recurrent processing
+    - State-space modeling for temporal dependencies
+
+ 2. **Hierarchical Processing**:
+    - Lower layers focus on local patterns (conv/recurrent)
+    - Upper layers capture global dependencies (attention)
+
+ 3. **Memory Efficiency**:
+    - Streaming tokenization during vocabulary building
+    - No full-dataset storage in RAM
+    - Automatic cleanup of intermediate data
+
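To make the recurrence concrete, here is a minimal single-layer sketch of a decayed, input-gated running state: the exponential decay plays the role of RWKV-style time-mixing, and the input-dependent gate plays the role of Mamba-style selectivity. Names, gating, and shapes are assumptions for illustration and do not mirror the repository's `RWKVMambaHybrid` module.

```python
import torch

def hybrid_recurrence(x, decay=0.9):
    """x: (T, D) sequence. Returns (T, D) outputs from a decayed, input-gated running state."""
    T, D = x.shape
    state = torch.zeros(D)
    outputs = []
    for t in range(T):
        gate = torch.sigmoid(x[t])            # input-dependent gate (Mamba-flavoured selectivity)
        state = decay * state + gate * x[t]   # exponential time-mixing (RWKV-flavoured decay)
        outputs.append(state)
    return torch.stack(outputs)               # linear in sequence length: one update per step

y = hybrid_recurrence(torch.randn(256, 512))  # 256-token context, d_model = 512
print(y.shape)                                # torch.Size([256, 512])
```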
+ ## Model Files
+
+ - `pytorch_model.bin`: model weights
+ - `config.json`: model configuration
+ - `chunk_vocab_combined.json`: tokenizer vocabulary
+
+ ## Training Tracking
+
+ Training was tracked with Weights & Biases (WandB), logging a comprehensive set of metrics (a minimal logging sketch follows this list):
+ - Real-time loss and perplexity tracking
+ - Gradient-norm monitoring
+ - Learning-rate schedule visualization
+ - Generation samples logged to tables
+ - Model checkpoints as artifacts
+ - System resource monitoring
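A minimal sketch of this kind of W&B logging, using the standard `wandb.init`/`wandb.log`/`wandb.Table` calls; the project name, metric keys, and example values are assumptions:

```python
import math
import wandb

run = wandb.init(project="i3-80m-pretraining", mode="offline")  # project name assumed

# Inside the training loop: log scalar metrics per step.
step, loss, grad_norm, lr = 100, 2.31, 0.84, 2.9e-4  # example values
wandb.log({
    "train/loss": loss,
    "train/perplexity": math.exp(loss),
    "train/grad_norm": grad_norm,
    "train/lr": lr,
}, step=step)

# Periodically: log generation samples to a table.
samples = wandb.Table(columns=["step", "prompt", "generation"])
samples.add_data(step, "hello", "hello there, how are you?")
wandb.log({"samples": samples}, step=step)

run.finish()
```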
+
+ ## Limitations
+
+ - Trained on English text only
+ - Limited to a 256-token context window
+ - May require fine-tuning for specific downstream tasks
+ - Conversational style influenced by the TinyChat dataset
+
+ ## Citation
+ ```bibtex
+ @misc{i3-80m,
+   author = {Daniel Fox},
+   title = {i3-80M: Hybrid Architecture Language Model},
+   year = {2025},
+   publisher = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
+ }
+ ```