---
tags:
- conversational
- efficient
- i3-architecture
- hybrid-model
- rwkv-mamba
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

The **i3-80M** model uses a hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers apply standard multi-head attention.

## Model Statistics

- **Total Parameters**: ~80M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full-attention layers = 16 layers total
- **Vocabulary Size**: 35,560 tokens (variable-length chunks plus a `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)
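
For reference, these statistics map onto a configuration like the following. This is a minimal sketch; the field names are illustrative, not the actual keys in the repo's `config.json`.

```python
# Hypothetical field names mirroring the statistics above; the repo's
# config.json may use different keys.
i3_config = {
    "vocab_size": 35_560,
    "d_model": 512,
    "n_heads": 16,
    "d_state": 32,
    "max_seq_len": 256,
    "n_hybrid_layers": 10,      # RWKV-Mamba blocks (layers 1-10)
    "n_attention_layers": 6,    # full-attention blocks (layers 11-16)
}
```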

### Architecture Breakdown

```
Layers 1-10:  RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
  ├─ RWKVMambaHybrid (Time-mixing + State-space)
  └─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
  ├─ Multi-Head Attention (16 heads)
  └─ Feed-Forward Network (4x expansion)
```
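
The sketch below shows how such a stack can be wired up in PyTorch. It is illustrative only: `HybridBlock` uses a causal depthwise convolution as a cheap stand-in for the real RWKVMambaHybrid time-mixing/state-space math, and the attention block omits the causal mask for brevity.

```python
import torch
import torch.nn as nn

def ffn(d_model: int) -> nn.Sequential:
    # Feed-forward network with 4x expansion, as in the diagram above.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
    )

class HybridBlock(nn.Module):
    # Stand-in for RWKVMambaHybrid: a causal depthwise conv, NOT the
    # actual time-mixing + state-space update.
    def __init__(self, d_model: int):
        super().__init__()
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3,
                             padding=2, groups=d_model)
        self.ff = ffn(d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        h = self.mix(x.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
        return x + h + self.ff(x + h)

class AttnBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = ffn(d_model)

    def forward(self, x):
        h, _ = self.attn(x, x, x, need_weights=False)  # causal mask omitted
        return x + h + self.ff(x + h)

# Layers 1-10 hybrid, layers 11-16 attention, matching the breakdown above.
stack = nn.Sequential(*[HybridBlock(512) for _ in range(10)],
                      *[AttnBlock(512, 16) for _ in range(6)])
print(stack(torch.randn(2, 256, 512)).shape)  # torch.Size([2, 256, 512])
```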

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the expressive power of attention
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition

2. **Memory-Optimized Training**:
   - Streaming vocabulary building (no full-text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup

3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: Narrative and storytelling
   - TinyChat: Conversational dynamics
   - High-Quality English Sentences: Linguistic diversity

4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization; see the sketch after this list
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully via a `<UNK>` token
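
The sketch below illustrates how features 2 and 4 could fit together: streaming chunk-frequency counting feeding a variable-length chunk vocabulary. The chunking rule, the `min_freq` threshold, and the function names are illustrative assumptions, not the repo's implementation.

```python
from collections import Counter
from typing import Iterable, List

def chunk(text: str, common_trigrams: set) -> List[str]:
    # Illustrative rule: emit a 3-character chunk when it is a common
    # trigram, otherwise fall back to 2-character chunks.
    out, i = [], 0
    while i < len(text):
        if text[i:i + 3] in common_trigrams:
            out.append(text[i:i + 3]); i += 3
        else:
            out.append(text[i:i + 2]); i += 2
    return out

def build_vocab(lines: Iterable[str], common_trigrams: set,
                min_freq: int = 2) -> dict:
    # Streaming: count chunk frequencies line by line, never holding
    # the full corpus in memory.
    freq = Counter()
    for line in lines:
        freq.update(chunk(line, common_trigrams))
    vocab = {"<UNK>": 0}
    for tok, n in freq.most_common():
        if n >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(text: str, vocab: dict, common_trigrams: set) -> List[int]:
    # Unknown chunks map gracefully to <UNK>.
    return [vocab.get(c, vocab["<UNK>"]) for c in chunk(text, common_trigrams)]

vocab = build_vocab(["hello there", "hello again"], {"the"}, min_freq=1)
print(encode("hello the end", vocab, {"the"}))
```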

## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient-accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay; sketched below)
- **Optimizer**: AdamW with gradient clipping (max norm 1.0)
- **Hardware**: NVIDIA GeForce RTX 3060 (12 GB VRAM)
- **Training Time**: ~17 hours
- **Framework**: PyTorch
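
A minimal sketch of this recipe follows. The warmup length and the `nn.Linear` stand-in for the model are assumptions; the rest (AdamW, 3e-4 peak rate, cosine decay, clipping at max norm 1.0, batch size 4, 5,000 steps) is taken from the list above.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(512, 512)             # placeholder for the i3 model
max_steps, warmup_steps = 5_000, 200    # warmup length is an assumption

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def lr_scale(step: int) -> float:
    # Linear warmup to the 3e-4 peak, then cosine decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

for step in range(max_steps):
    batch = torch.randn(4, 512)         # stand-in for a batch of size 4
    loss = model(batch).pow(2).mean()   # stand-in for the LM loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```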

### Training Dynamics

- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2 GB / 12 GB)
- **Power Usage**: ~40 W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric        | Initial | Final | Best |
|---------------|---------|-------|------|
| Training Loss | ~6.0    | ~2.0  | 1.98 |
| Perplexity    | ~400+   | ~7-10 | 7.29 |

The model shows strong convergence with stable training dynamics and efficient GPU utilization.
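
The two rows are consistent with each other: perplexity is the exponential of the cross-entropy training loss, as the quick check below shows.

```python
import math

# Perplexity = exp(cross-entropy loss) reproduces the table above.
print(math.exp(6.0))  # ≈ 403  -> initial perplexity "~400+"
print(math.exp(2.0))  # ≈ 7.39 -> final perplexity "~7-10"
```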

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to apply
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0])
print(generated_text)
```

For custom usage with the original training code, check [user.py](https://huggingface.co/FlameF0X/i3-80m/blob/main/user.py).

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics; a toy illustration follows this list
   - Linear complexity in sequence length
   - Efficient recurrent processing
   - State-space modeling of temporal dependencies

2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)

3. **Memory Efficiency**:
   - Streaming tokenization during vocabulary building
   - No full-dataset storage in RAM
   - Automatic cleanup of intermediate data
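
The toy loop below shows the two ingredients of item 1 at single-example scale: an RWKV-style token shift (time-mixing) feeding a diagonal state-space update whose state is carried across timesteps, giving one pass over the sequence (linear complexity). The coefficients and their combination are illustrative, not the trained model's equations.

```python
import torch

d_model, d_state, seq_len = 512, 32, 256
x = torch.randn(seq_len, d_model)

mu = torch.rand(d_model)           # time-mixing coefficients (learned in practice)
A = torch.rand(d_state) * 0.9      # diagonal state transition, |A| < 1 for stability
B = torch.randn(d_model, d_state) * 0.02
C = torch.randn(d_state, d_model) * 0.02

h = torch.zeros(d_state)           # recurrent state: O(1) memory per step
prev = torch.zeros(d_model)
outputs = []
for t in range(seq_len):
    mixed = mu * x[t] + (1 - mu) * prev  # RWKV-style time-mixing (token shift)
    h = A * h + mixed @ B                # Mamba-style diagonal state update
    outputs.append(h @ C)                # read the state out to model width
    prev = x[t]

y = torch.stack(outputs)
print(y.shape)  # torch.Size([256, 512])
```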

## Model Files

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary
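
These files can be inspected directly once downloaded (for example with `huggingface_hub`). A minimal sketch, assuming local copies; the keys inside the JSON files are not guaranteed:

```python
import json
import torch

with open("config.json") as f:
    config = json.load(f)
with open("chunk_vocab_combined.json") as f:
    vocab = json.load(f)
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

print(len(vocab))                                   # expect 35,560 entries
print(sum(p.numel() for p in state_dict.values()))  # expect ~80M parameters
```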

## Training Tracking

This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
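
A minimal sketch of this kind of tracking, with a hypothetical project name and a dummy loss curve standing in for the real training loop:

```python
import math
import wandb

run = wandb.init(project="i3-80m")  # hypothetical project name

for step in range(5_000):
    loss = 4.0 * math.exp(-step / 1_000) + 2.0  # dummy curve for illustration
    wandb.log({"train/loss": loss,
               "train/perplexity": math.exp(loss)}, step=step)

run.finish()
```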

## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style is influenced by the TinyChat dataset

## Citation

```bibtex
@misc{i3-80m,
  author       = {Daniel Fox},
  title        = {i3-80M: Hybrid Architecture Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```