FlameF0X committed · d67ca95 · verified · 1 parent: 91911b7

Update README.md

Files changed (1): README.md (+165 -1)
---
language: en
license: apache-2.0
tags:
- conversational
- efficient
- i3-architecture
- hybrid-model
- rwkv-mamba
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

**i3-80M** is a language model with a hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.

## Model Statistics

- **Total Parameters**: ~80M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full-attention layers = 16 layers total
- **Vocabulary Size**: 35,560 tokens (variable-length chunks plus an `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: memory-efficient variable-length chunking (2-3 characters)
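
Collected into a single (hypothetical) configuration, the statistics above map to roughly the following; the key names are illustrative assumptions, not necessarily those used in the repo's actual `config.json`:

```python
# Illustrative only: key names are assumptions, not the repo's config.json schema.
i3_80m_config = {
    "d_model": 512,            # hidden dimension
    "n_heads": 16,             # attention heads in the full-attention layers
    "d_state": 32,             # state dimension for the hybrid recurrence
    "n_hybrid_layers": 10,     # RWKV-Mamba hybrid blocks
    "n_attention_layers": 6,   # full multi-head attention blocks
    "vocab_size": 35_560,      # variable-length chunk vocabulary, incl. <UNK>
    "max_seq_len": 256,        # context window
}
```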

### Architecture Breakdown
```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```
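
As a rough PyTorch sketch of this layout (class names and block internals are illustrative assumptions; a depthwise causal convolution stands in for the hybrid recurrence, which is sketched separately under Technical Innovations):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN with the 4x expansion described above."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class HybridBlock(nn.Module):
    """Stand-in for the RWKVMambaHybrid block. A depthwise causal convolution
    replaces the real time-mixing/state-space update, which lives in the
    repo's custom modeling code."""
    def __init__(self, d_model, d_state=32):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.norm1(x).transpose(1, 2)       # Conv1d wants (batch, d_model, seq)
        h = self.mix(h)[..., : x.size(1)]       # trim right pad -> causal conv
        x = x + h.transpose(1, 2)
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Full multi-head self-attention block (layers 11-16)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a
        return x + self.ffn(self.norm2(x))

class I3Stack(nn.Module):
    """10 hybrid blocks followed by 6 attention blocks, matching the diagram."""
    def __init__(self, d_model=512, n_heads=16, d_state=32):
        super().__init__()
        self.layers = nn.ModuleList(
            [HybridBlock(d_model, d_state) for _ in range(10)]
            + [AttentionBlock(d_model, n_heads) for _ in range(6)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```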

### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention.
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition

2. **Memory-Optimized Training** (see the sketch after this list):
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup

3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding.
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity

4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization.
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully via the `<UNK>` token
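
A minimal sketch of what streaming vocabulary building with variable-length chunking might look like; the function names and the exact greedy chunking rule are assumptions, not the repo's actual tokenizer code:

```python
from collections import Counter
from typing import Iterable, Iterator

def chunk_text(text: str) -> Iterator[str]:
    """Greedy variable-length chunking: emit 3-char chunks, falling back to
    2 (or 1 for a lone trailing character). Illustrative rule only; the
    repo's trigram-aware chunker may differ."""
    i = 0
    while i < len(text):
        remaining = len(text) - i
        size = 3 if remaining >= 3 else remaining
        yield text[i:i + size]
        i += size

def build_vocab(lines: Iterable[str], max_size: int = 35_560) -> dict[str, int]:
    """Stream lines once, counting chunk frequencies without storing the corpus."""
    counts = Counter()
    for line in lines:
        counts.update(chunk_text(line))
    vocab = {"<UNK>": 0}                       # reserve id 0 for unknown chunks
    for chunk, _ in counts.most_common(max_size - 1):
        vocab[chunk] = len(vocab)
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Map chunks to ids, falling back to <UNK> for unseen chunks."""
    return [vocab.get(c, vocab["<UNK>"]) for c in chunk_text(text)]
```

Under this rule, `encode("hello", vocab)` splits the text into the chunks `hel` and `lo`, matching the 2-3 character scheme described above, and only the `Counter` of chunk frequencies is held in memory while streaming.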

## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4, with warmup and cosine decay (sketched below)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA GeForce RTX 3060 (12 GB VRAM)
- **Training Time**: ~17 hours
- **Framework**: PyTorch
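
A sketch of this optimization setup in PyTorch; the warmup length is an assumption, since the README does not state it:

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_steps=5_000, warmup_steps=500,
                                 peak_lr=3e-4):
    """AdamW + linear warmup into cosine decay; warmup_steps is an assumption."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped before each step:
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```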

### Training Dynamics

- **GPU Utilization**: stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2 GB / 12 GB)
- **Power Usage**: ~40 W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric        | Initial | Final | Best |
|---------------|---------|-------|------|
| Training Loss | ~6.0    | ~2.0  | 1.98 |
| Perplexity    | 400+    | ~7-10 | 7.29 |

The model shows strong convergence with stable training dynamics and efficient GPU utilization. The two rows are consistent, since perplexity is the exponential of the cross-entropy loss: exp(1.98) ≈ 7.2.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer; the custom i3 architecture ships its own
# modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to apply
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

For custom usage with the original training code, see [user.py](https://huggingface.co/FlameF0X/i3-80m/blob/main/user.py).

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (see the sketch after this list).
   - Linear complexity in sequence length
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies

2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)

3. **Memory Efficiency**:
   - Streaming tokenization during vocabulary building
   - No full-dataset storage in RAM
   - Automatic cleanup of intermediate data
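
The exact recurrence lives in the repo's custom modeling code; the following is a minimal per-timestep sketch of the two ingredients named above, RWKV-style time-mixing and a diagonal state-space update. Every tensor name and shape here is an illustrative assumption:

```python
import torch

def hybrid_recurrence_step(x_t, x_prev, h_prev, mu, A, B, C):
    """One illustrative timestep combining the two mechanisms.

    x_t, x_prev: current and previous inputs, shape (d_model,)
    h_prev:      recurrent state, shape (d_model, d_state)
    mu:          learned time-mixing coefficients in (0, 1), shape (d_model,)
    A, B, C:     diagonal state-space parameters, shape (d_model, d_state)
    """
    # RWKV-style time-mixing: blend the current token with the previous one.
    x_mix = mu * x_t + (1.0 - mu) * x_prev

    # Mamba-style diagonal state update: h_t = A * h_{t-1} + B * x_t.
    h_t = A * h_prev + B * x_mix.unsqueeze(-1)

    # Readout: contract over the state dimension.
    y_t = (C * h_t).sum(dim=-1)
    return y_t, h_t

# Scanning the step over a sequence costs O(seq_len * d_model * d_state):
d_model, d_state = 512, 32
h = torch.zeros(d_model, d_state)
mu = torch.sigmoid(torch.randn(d_model))
A = torch.rand(d_model, d_state)             # entries in (0, 1) keep the state stable
B, C = torch.randn(d_model, d_state), torch.randn(d_model, d_state)
x_prev = torch.zeros(d_model)
for x_t in torch.randn(8, d_model):          # toy sequence of length 8
    y_t, h = hybrid_recurrence_step(x_t, x_prev, h, mu, A, B, C)
    x_prev = x_t
```

Because each step touches only the current input and a fixed-size state, the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.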

## Model Files

- `pytorch_model.bin`: model weights
- `config.json`: model configuration
- `chunk_vocab_combined.json`: tokenizer vocabulary (see the loading example below)
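
The vocabulary file can be inspected directly; a flat chunk-to-id JSON mapping is assumed here, based on the tokenizer description:

```python
import json

# Assumed layout: a flat {chunk: id} mapping that includes the <UNK> entry.
with open("chunk_vocab_combined.json", encoding="utf-8") as f:
    vocab = json.load(f)

print(len(vocab))           # expected: 35,560 entries
print(vocab.get("<UNK>"))   # id reserved for unknown chunks
```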

## Training Tracking

Training was tracked with Weights & Biases (WandB), logging (a sketch follows the list):

- real-time loss and perplexity
- gradient norms
- the learning-rate schedule
- generation samples (as tables)
- model checkpoints (as artifacts)
- system resource usage
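
A sketch of this kind of instrumentation with the `wandb` API; the project name and metric keys are illustrative:

```python
import math
import wandb

# mode="offline" lets this run without a WandB account; names are illustrative.
run = wandb.init(project="i3-80m", mode="offline",
                 config={"d_model": 512, "lr": 3e-4, "steps": 5_000})

for step in range(3):                        # stand-in for the training loop
    loss = 2.0                               # would come from the forward pass
    run.log({"loss": loss, "perplexity": math.exp(loss)}, step=step)

# Generation samples as tables, checkpoints as artifacts:
# run.log({"samples": wandb.Table(columns=["step", "prompt", "text"], data=rows)})
# artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
# artifact.add_file("pytorch_model.bin")
# run.log_artifact(artifact)
run.finish()
```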

## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style is influenced by the TinyChat dataset

## Citation

```bibtex
@misc{i3-80m,
  author = {Daniel Fox},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```