---
language: en
license: mit
tags:
- conversational
- efficient
- i3-architecture
- hybrid-model
- rwkv-mamba
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---

# i3-80M - Hybrid Architecture Language Model

## Model Description

The **i3-80M** model is a hybrid-architecture language model that combines convolutional/recurrent layers with full attention layers for efficient language modeling. Its early layers blend RWKV-style time-mixing with Mamba state-space dynamics; its deeper layers use standard multi-head attention.

## Model Statistics

- **Total Parameters**: ~80M
- **Architecture**: 10 hybrid (RWKV-Mamba) layers + 6 full-attention layers = 16 layers total
- **Vocabulary Size**: 35,560 tokens (variable-length chunks plus an `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)

### Architecture Breakdown

```
Layers 1-10:  RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)

Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```

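To make the 10 + 6 layout concrete, the sketch below shows how such a stack could be assembled in PyTorch. All class names and wiring details are illustrative assumptions rather than the repository's actual code, and the RWKV-Mamba mixer is only stubbed out here (a separate sketch of that cell appears under "Technical Innovations").

```python
import torch
import torch.nn as nn

# Illustrative only: class names and wiring are assumptions, not the repo's code.
D_MODEL, N_HEADS, VOCAB, N_HYBRID, N_ATTN = 512, 16, 35560, 10, 6

class FeedForward(nn.Sequential):
    """4x-expansion feed-forward network used in every block."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__(
            nn.Linear(d_model, expansion * d_model),
            nn.GELU(),
            nn.Linear(expansion * d_model, d_model),
        )

class SelfAttention(nn.Module):
    """Multi-head self-attention mixer for layers 11-16 (causal mask omitted for brevity)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

class Block(nn.Module):
    """Shared layer pattern: pre-norm sequence mixer + pre-norm FFN, both residual."""
    def __init__(self, mixer: nn.Module):
        super().__init__()
        self.norm1, self.mixer = nn.LayerNorm(D_MODEL), mixer
        self.norm2, self.ffn = nn.LayerNorm(D_MODEL), FeedForward(D_MODEL)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.ffn(self.norm2(x))

# Layers 1-10 would use the RWKV-Mamba hybrid mixer (sketched under
# "Technical Innovations"); nn.Identity() stands in so this snippet runs on its own.
stack = nn.Sequential(
    *[Block(nn.Identity()) for _ in range(N_HYBRID)],
    *[Block(SelfAttention(D_MODEL, N_HEADS)) for _ in range(N_ATTN)],
)
tokens = torch.randint(0, VOCAB, (1, 256))
hidden = nn.Embedding(VOCAB, D_MODEL)(tokens)
print(stack(hidden).shape)  # torch.Size([1, 256, 512])
```
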
### Key Features

1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
   - Early layers use the RWKV-Mamba hybrid for efficient sequence processing
   - Later layers use full multi-head attention for complex pattern recognition

2. **Memory-Optimized Training**:
   - Streaming vocabulary building (no full text storage)
   - Vocabulary caching (build once, reuse)
   - Efficient chunk frequency counting
   - Automatic memory cleanup

3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
   - TinyStories: narrative and storytelling
   - TinyChat: conversational dynamics
   - High-Quality English Sentences: linguistic diversity

4. **Smart Tokenization**: Variable-length chunking (2-3 characters) with common-trigram optimization; a sketch of this chunking appears after this list
   - Total tokens processed: **3,000,000+**
   - Handles unknown tokens gracefully with an `<UNK>` token

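Below is a minimal sketch of what 2-3 character chunking with common-trigram optimization and streaming frequency counting could look like. The function names, the greedy chunking rule, and the frequency threshold are assumptions made for illustration; the repository's actual tokenizer may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def chunk_text(text: str, trigrams: set) -> List[str]:
    """Greedily emit a 3-char chunk when it is a common trigram, otherwise a 2-char chunk."""
    chunks, i = [], 0
    while i < len(text):
        piece = text[i:i + 3]
        if len(piece) == 3 and piece in trigrams:
            chunks.append(piece); i += 3
        else:
            chunks.append(text[i:i + 2]); i += 2   # may be a single char at the very end
    return chunks

def build_vocab(lines: Iterable[str], min_freq: int = 5) -> Tuple[dict, set]:
    """Stream over the corpus counting chunk frequencies, never storing the full text.
    `lines` must be re-iterable (e.g. a dataset split), since it is scanned twice."""
    trigram_counts = Counter()
    for line in lines:
        trigram_counts.update(line[i:i + 3] for i in range(len(line) - 2))
    trigrams = {t for t, c in trigram_counts.items() if c >= min_freq}

    chunk_counts = Counter()
    for line in lines:
        chunk_counts.update(chunk_text(line, trigrams))

    vocab = {"<UNK>": 0}
    for chunk, c in chunk_counts.most_common():
        if c >= min_freq:
            vocab[chunk] = len(vocab)
    return vocab, trigrams

def encode(text: str, vocab: dict, trigrams: set) -> List[int]:
    """Map unseen chunks to the <UNK> id instead of failing."""
    return [vocab.get(c, vocab["<UNK>"]) for c in chunk_text(text, trigrams)]

vocab, trigrams = build_vocab(["once upon a time there was a tiny model", "hello there"])
print(encode("hello once", vocab, trigrams))
```
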
## Training Details

### Training Configuration

- **Datasets**:
  - `agentlans/high-quality-english-sentences`
  - `roneneldan/TinyStories`
  - `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA GeForce RTX 3060 (12GB VRAM)
- **Training Time**: ~17 hours
- **Framework**: PyTorch

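As a rough sketch of the optimization setup above (AdamW at a 3e-4 peak learning rate with linear warmup, cosine decay, and gradient clipping at a max norm of 1.0). The 200-step warmup is an assumption not stated on this card, and the model and batches are stand-ins so the snippet runs on its own:

```python
import math
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MAX_STEPS, WARMUP_STEPS, PEAK_LR = 5000, 200, 3e-4   # warmup length is an assumption
BATCH_SIZE, D_MODEL, VOCAB = 4, 512, 35560

model = nn.Linear(D_MODEL, VOCAB)                     # stand-in for the i3-80M model
optimizer = AdamW(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:                           # linear warmup to the peak LR
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) # cosine decay toward 0

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(MAX_STEPS):
    hidden = torch.randn(BATCH_SIZE, D_MODEL)         # stand-in batch (batch size 4)
    targets = torch.randint(0, VOCAB, (BATCH_SIZE,))
    loss = nn.functional.cross_entropy(model(hidden), targets)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```
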
### Training Dynamics

- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
- **Power Usage**: ~40W average
- **Throughput**: ~100-550 tokens/sec

### Performance Metrics

| Metric | Initial | Final | Best |
|--------|---------|-------|------|
| Training Loss | ~6.0 | ~2.0 | 1.98 |
| Perplexity | ~400+ | ~7-10 | 7.29 |

The model shows strong convergence with stable training dynamics and efficient GPU utilization.

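Assuming perplexity is computed in the usual way as the exponential of the mean cross-entropy loss, the two rows of the table are consistent with each other:

```python
import math
print(math.exp(6.0))   # ≈ 403  -> matches the "~400+" initial perplexity
print(math.exp(1.98))  # ≈ 7.24 -> in line with the reported best perplexity of 7.29
```
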
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
# Note: if the repo ships custom model code, you may also need trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m")
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m")

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # required for temperature/top_k sampling to take effect
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

For custom usage with the original training code, check [user.py](https://huggingface.co/FlameF0X/i3-80m/blob/main/user.py).

## Technical Innovations

1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (a simplified sketch follows this list)
   - Linear complexity in sequence length
   - Efficient recurrent processing
   - State-space modeling for temporal dependencies

2. **Hierarchical Processing**:
   - Lower layers focus on local patterns (conv/recurrent)
   - Upper layers capture global dependencies (attention)

3. **Memory Efficiency**:
   - Streaming tokenization during vocabulary building
   - No full dataset storage in RAM
   - Automatic cleanup of intermediate data

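To make item 1 concrete, here is a simplified sketch of how RWKV-style token-shift time-mixing can be fused with a diagonal state-space recurrence. Every detail below (parameter names, the gating, the decay parameterisation) is an assumption made for exposition; the repository's actual `RWKVMambaHybrid` module may differ substantially.

```python
import torch
import torch.nn as nn

class RWKVMambaHybrid(nn.Module):
    """Illustrative fusion of RWKV token-shift mixing with a diagonal state-space update."""

    def __init__(self, d_model: int = 512, d_state: int = 32):
        super().__init__()
        self.mix = nn.Parameter(torch.full((d_model,), 0.5))   # RWKV-style per-channel time-mix
        self.receptance = nn.Linear(d_model, d_model)           # output gate ("receptance")
        self.key = nn.Linear(d_model, d_state)
        self.value = nn.Linear(d_model, d_state)
        self.decay = nn.Parameter(torch.zeros(d_state))          # per-channel state retention
        self.out = nn.Linear(d_state, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, seq, d_model)
        B, T, _ = x.shape
        # RWKV time-mixing: blend each token with its predecessor, per channel.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        xm = self.mix * x + (1.0 - self.mix) * prev

        r = torch.sigmoid(self.receptance(xm))
        k, v = self.key(xm), self.value(xm)
        a = torch.sigmoid(self.decay)                             # retention factor in (0, 1)

        # Mamba-like diagonal state-space recurrence, linear in sequence length.
        h = torch.zeros(B, k.shape[-1], device=x.device)
        outs = []
        for t in range(T):
            h = a * h + k[:, t] * v[:, t]
            outs.append(self.out(h))
        y = torch.stack(outs, dim=1)                              # (batch, seq, d_model)
        return r * y                                              # gated output

layer = RWKVMambaHybrid()
print(layer(torch.randn(2, 256, 512)).shape)  # torch.Size([2, 256, 512])
```
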
## Model Files

- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary

## Training Tracking

Training was tracked with Weights & Biases (WandB), logging:

- Real-time loss and perplexity
- Gradient norms
- Learning rate schedule
- Generation samples (logged to tables)
- Model checkpoints (saved as artifacts)
- System resource usage

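A sketch of the kind of WandB instrumentation this implies; the project name, metric values, sample strings, and file paths below are placeholders, not the run's actual configuration:

```python
import math
import os
import wandb

run = wandb.init(project="i3-80m", config={"d_model": 512, "lr": 3e-4, "max_steps": 5000})

step, loss, grad_norm, lr = 100, 2.5, 0.8, 3e-4           # dummy values for illustration
wandb.log({"train/loss": loss,
           "train/perplexity": math.exp(loss),             # real-time loss and perplexity
           "train/grad_norm": grad_norm,                   # gradient norm monitoring
           "train/lr": lr},                                # learning rate schedule
          step=step)

# Generation samples logged to a table.
table = wandb.Table(columns=["step", "prompt", "generation"])
table.add_data(step, "hello", "<generated text>")
wandb.log({"samples": table}, step=step)

# Model checkpoints as artifacts (assumes the checkpoint file exists locally).
if os.path.exists("pytorch_model.bin"):
    artifact = wandb.Artifact("i3-80m-checkpoint", type="model")
    artifact.add_file("pytorch_model.bin")
    run.log_artifact(artifact)

run.finish()
```
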
## Limitations

- Trained on English text only
- Limited to a 256-token context window
- May require fine-tuning for specific downstream tasks
- Conversational style is influenced by the TinyChat dataset

## Citation

```bibtex
@misc{i3-80m,
  author = {Daniel Fox},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```