---
language: en
license: apache-2.0
tags:
- i3-architecture
- hybrid-model
- rwkv-mamba
- custom_code
datasets:
- agentlans/high-quality-english-sentences
- roneneldan/TinyStories
- starhopp3r/TinyChat
library_name: transformers
pipeline_tag: text-generation
---
# i3-80M - Hybrid Architecture Language Model
## Model Description
The **i3-80M Model** is a language model built on a novel hybrid architecture that combines convolutional/recurrent layers with full attention layers for efficient language modeling. The early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the deeper layers use standard multi-head attention.
This is the second model in the i3 series, scaling up from the original [i3-22M](https://huggingface.co/FlameF0X/i3-22m) with improved architecture and multi-dataset training.
> [!NOTE]
> To try the model, use the hosted demo [here](https://huggingface.co/spaces/FlameF0X/i3-80m).
>
> [Read this page in Romanian :)](https://huggingface.co/FlameF0X/i3-80m/blob/main/CITE%C8%98TEM%C4%82.md)
## Model Statistics
- **Total Parameters**: ~82.77M (82,765,160)
- **Architecture**: 10 Hybrid (RWKV-Mamba) + 6 Full Attention Layers = 16 Total Layers
- **Vocabulary Size**: 35,560 tokens (variable-length chunks with `<UNK>` token)
- **Hidden Dimension (d_model)**: 512
- **Attention Heads**: 16
- **State Dimension (d_state)**: 32
- **Max Sequence Length**: 256
- **Tokenization**: Memory-efficient variable-length chunking (2-3 characters)
### Architecture Breakdown
```
Layers 1-10: RWKV-Mamba Hybrid Blocks (Recurrent/Conv)
├─ RWKVMambaHybrid (Time-mixing + State-space)
└─ Feed-Forward Network (4x expansion)
Layers 11-16: Full Attention Blocks
├─ Multi-Head Attention (16 heads)
└─ Feed-Forward Network (4x expansion)
```
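As a rough illustration of this layout, the sketch below stacks 10 "hybrid" blocks followed by 6 attention blocks with the dimensions listed above (d_model = 512, 16 heads, 4x FFN). The block internals are simplified stand-ins, not the model's actual custom code; in particular, a causal depthwise convolution only approximates the RWKV-Mamba hybrid layer.

```python
# Simplified stand-ins only: the real RWKVMambaHybrid layer lives in the model's
# custom code; here a causal depthwise convolution plays its role.
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_HYBRID, N_ATTN = 512, 16, 10, 6

def ffn(d_model):
    # Feed-forward network with 4x expansion, as in the breakdown above
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class HybridBlock(nn.Module):
    """Placeholder for the RWKV-Mamba hybrid block (recurrent/conv stage)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Depthwise conv; trimming the output keeps only left (causal) context
        self.mix = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)
        self.ffn = ffn(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.mix(self.norm1(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h
        return x + self.ffn(self.norm2(x))

class AttentionBlock(nn.Module):
    """Standard pre-norm multi-head attention block (16 heads) plus FFN."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn(d_model)

    def forward(self, x):
        h = self.norm1(x)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

# 10 hybrid blocks followed by 6 attention blocks = 16 layers in total
stack = nn.Sequential(*[HybridBlock(D_MODEL) for _ in range(N_HYBRID)],
                      *[AttentionBlock(D_MODEL, N_HEADS) for _ in range(N_ATTN)])
print(stack(torch.randn(1, 256, D_MODEL)).shape)   # torch.Size([1, 256, 512])
```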
## Comparison with i3-22M
| Feature | i3-22M | i3-80M (This Model) |
|---------|--------|---------------------|
| **Parameters** | 22.6M | 82.77M |
| **Architecture** | 24 Hybrid Layers | 10 Hybrid + 6 Attention Layers |
| **Hidden Dimension** | 512 | 512 |
| **Vocabulary Size** | 4,466 | 35,560 |
| **Training Dataset** | TinyChat only | TinyStories + TinyChat + HQ Sentences |
| **Training Data** | ~1M conversations | ~3M+ tokens |
| **Final Loss** | ~2.0 | ~2.0 |
| **Final Perplexity** | 7.29-9.70 | 7.29-10.0 |
| **Training Time** | ~17 hours | ~2-4 hours |
| **Attention Layers** | None (Pure Hybrid) | 6 Full Attention Layers |
### Key Improvements Over i3-22M
1. **Attention Layers**: Adds full multi-head attention in the upper layers to capture long-range dependencies
2. **Larger Vocabulary**: 8x larger vocabulary (35,560 vs 4,466) for better token coverage
3. **Multi-Dataset Training**: Trained on 3 diverse datasets vs single dataset
4. **Better Generalization**: Exposure to narratives (TinyStories), conversations (TinyChat), and formal text (HQ Sentences)
5. **Enhanced Unknown Token Handling**: Robust `<UNK>` token system for out-of-vocabulary words
### When to Use Each Model
**Use i3-22M if you need:**
- Smaller model size (~22M params)
- Pure conversational focus (TinyChat specialized)
- Lower memory footprint
- Faster inference
**Use i3-80M if you need:**
- Better general-purpose text generation
- Stronger attention-based reasoning (6 attention layers)
- Larger vocabulary coverage
- Multi-domain text understanding (stories, chat, formal text)
### Key Features
1. **Hybrid Architecture**: Combines the efficiency of recurrent/convolutional processing with the power of attention
- Early layers use RWKV-Mamba hybrid for efficient sequence processing
- Later layers use full multi-head attention for complex pattern recognition
2. **Memory-Optimized Training**:
- Streaming vocabulary building (no full text storage)
- Vocabulary caching (build once, reuse)
- Efficient chunk frequency counting
- Automatic memory cleanup
3. **Multi-Dataset Pre-training**: Trained on diverse text sources for robust language understanding
- TinyStories: Narrative and storytelling
- TinyChat: Conversational dynamics
- High-Quality English Sentences: Linguistic diversity
4. **Smart Tokenization**: Variable-length chunking (2-3 chars) with common trigram optimization
- Total tokens processed: **3,000,000+**
- Handles unknown tokens gracefully with an `<UNK>` token
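The actual chunking rules and vocabulary live in `chunk_vocab_combined.json` and the model's custom tokenizer code. The snippet below is only a guess at the general idea (greedy matching of 3-character, then 2-character chunks, with an `<UNK>` fallback); `build_vocab` and `encode` are hypothetical helpers, not the repository's API.

```python
# Rough illustration of variable-length (2-3 character) chunk tokenization
# with an <UNK> fallback; details may differ from the real tokenizer.
from collections import Counter

def build_vocab(texts, max_size=35_560):
    counts = Counter()
    for text in texts:
        i = 0
        while i < len(text):
            counts[text[i:i + 3]] += 1   # count candidate trigrams
            counts[text[i:i + 2]] += 1   # and bigrams
            i += 2
    vocab = {"<UNK>": 0}
    for chunk, _ in counts.most_common(max_size - 1):
        vocab.setdefault(chunk, len(vocab))
    return vocab

def encode(text, vocab):
    ids, i = [], 0
    while i < len(text):
        # Prefer the 3-char chunk, fall back to 2 chars, else emit <UNK>
        for size in (3, 2):
            chunk = text[i:i + size]
            if chunk in vocab:
                ids.append(vocab[chunk])
                i += size
                break
        else:
            ids.append(vocab["<UNK>"])
            i += 1
    return ids

vocab = build_vocab(["once upon a time there was a tiny model"])
print(encode("once upon a time", vocab))
```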
## Training Details
### Training Configuration
- **Datasets**:
- `agentlans/high-quality-english-sentences`
- `roneneldan/TinyStories`
- `starhopp3r/TinyChat`
- **Training Steps**: 5,000 iterations
- **Batch Size**: 4 (with gradient accumulation support)
- **Learning Rate**: 3e-4 (with warmup and cosine decay)
- **Optimizer**: AdamW with gradient clipping (max norm: 1.0)
- **Hardware**: NVIDIA P100 (16GB VRAM)
- **Training Time**: ~2-4 hours
- **Framework**: PyTorch
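A minimal sketch of this optimization recipe (AdamW, linear warmup into cosine decay, gradient clipping at max norm 1.0) is shown below. The model, data loader, and warmup length are placeholders; the actual training script is not reproduced in this card.

```python
# Sketch of the training loop described above; `model` and `get_batch` are dummy
# stand-ins so the example runs, and the warmup length is an assumption.
import math
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                        # placeholder model
def get_batch(batch_size, seq_len):                                # placeholder data loader
    x = torch.randn(batch_size, seq_len, 512)
    return x, x

max_steps, warmup_steps, base_lr = 5_000, 200, 3e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

def lr_at(step):
    if step < warmup_steps:                                        # linear warmup
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))      # cosine decay to 0

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    inputs, targets = get_batch(batch_size=4, seq_len=256)
    loss = nn.functional.mse_loss(model(inputs), targets)          # stand-in for the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    optimizer.zero_grad()
```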
### Training Dynamics
- **GPU Utilization**: Stable at ~15-20% during training
- **GPU Memory**: ~18% allocated (~2.2GB / 12GB)
- **Power Usage**: ~40W average
- **Throughput**: ~100-550 tokens/sec
### Performance Metrics
| Metric | Initial | Final |
|--------|---------|-------|
| Training Loss | ~10.0 | ~1.7 |
| Perplexity | ~4000+ | ~6 |
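Assuming perplexity here is reported as the exponential of the cross-entropy loss, the final values are consistent:

```python
import math
print(round(math.exp(1.7), 2))   # ~5.47, in line with the reported final perplexity of ~6
```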

> [!NOTE]
> I don't know why the logging starts at step 4.6k.

**i3-22m** and **i3-80m** comparison

The model shows strong convergence with stable training dynamics and efficient GPU utilization.
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (the custom architecture requires trust_remote_code)
model = AutoModelForCausalLM.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/i3-80m", trust_remote_code=True)

# Generate text
prompt = "hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    do_sample=True,   # sampling must be enabled for temperature/top_k to take effect
    temperature=0.8,
    top_k=40,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
## Technical Innovations
1. **RWKV-Mamba Hybrid Recurrence**: Combines RWKV's time-mixing with Mamba's state-space dynamics (a toy sketch follows this list)
- Linear complexity for long sequences
- Efficient recurrent processing
- State-space modeling for temporal dependencies
2. **Hierarchical Processing**:
- Lower layers focus on local patterns (conv/recurrent)
- Upper layers capture global dependencies (attention)
3. **Memory Efficiency**:
- Streaming tokenization during vocab building
- No full dataset storage in RAM
- Automatic cleanup of intermediate data
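For intuition, the toy layer below combines the two ingredients named in item 1 in their simplest form: an RWKV-style token shift / time-mixing step and a diagonal, Mamba-like state-space recurrence with linear time complexity. It is a conceptual sketch under those assumptions, not the `RWKVMambaHybrid` layer shipped with the model.

```python
# Conceptual sketch only; the shipped RWKVMambaHybrid layer differs in detail.
import torch
import torch.nn as nn

class ToyTimeMixSSM(nn.Module):
    def __init__(self, d_model=512, d_state=32):
        super().__init__()
        self.mu = nn.Parameter(torch.full((d_model,), 0.5))   # time-mix weight between x_t and x_{t-1}
        self.in_proj = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_state, d_model)
        self.decay = nn.Parameter(torch.zeros(d_state))       # per-channel state decay (sigmoid -> (0, 1))

    def forward(self, x):                                     # x: (batch, seq, d_model)
        shifted = nn.functional.pad(x, (0, 0, 1, 0))[:, :-1]  # token shift: x_{t-1}, zeros at t = 0
        mixed = self.mu * x + (1 - self.mu) * shifted         # RWKV-style time-mixing
        u = self.in_proj(mixed)
        a = torch.sigmoid(self.decay)
        state = torch.zeros(x.size(0), u.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):                            # linear-time recurrence: h_t = a * h_{t-1} + u_t
            state = a * state + u[:, t]
            outs.append(state)
        return self.out_proj(torch.stack(outs, dim=1))

layer = ToyTimeMixSSM()
print(layer(torch.randn(2, 256, 512)).shape)                  # torch.Size([2, 256, 512])
```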
## Model Files
- `pytorch_model.bin`: Model weights
- `config.json`: Model configuration
- `chunk_vocab_combined.json`: Tokenizer vocabulary
## Training Tracking
This model was tracked using Weights & Biases (WandB) with comprehensive metrics:
- Real-time loss and perplexity tracking
- Gradient norm monitoring
- Learning rate scheduling visualization
- Generation samples logged to tables
- Model checkpoints as artifacts
- System resource monitoring
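The exact run configuration is not published here, but logging of this kind with the W&B Python API generally looks like the snippet below; the project name, metric keys, and loss values are made up for illustration.

```python
# Hypothetical W&B logging example; names and values are placeholders.
import math
import wandb

run = wandb.init(project="i3-80m", config={"d_model": 512, "n_layers": 16, "lr": 3e-4})
for step, loss in enumerate([10.0, 4.0, 2.5, 1.9, 1.7]):          # placeholder loss curve
    run.log({"train/loss": loss, "train/perplexity": math.exp(loss)}, step=step)
run.finish()
```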
## Limitations
- Trained on English text only
- Limited to 256 token context window
- May require fine-tuning for specific downstream tasks
- Conversational style influenced by TinyChat dataset
## Model Series
- [i3-22M](https://huggingface.co/FlameF0X/i3-22m) - Original model with pure hybrid architecture
- **i3-80M** (This model) - Scaled version with attention layers and multi-dataset training
## Citation
```bibtex
@misc{i3-80m,
  author = {FlameF0X},
  title = {i3-80M: Hybrid Architecture Language Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/FlameF0X/i3-80m}}
}
```
```bibtex
@article{mamba,
  title = {Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author = {Gu, Albert and Dao, Tri},
  journal = {arXiv preprint arXiv:2312.00752},
  year = {2023}
}

@article{RWKV,
  title = {RWKV: Reinventing RNNs for the Transformer Era},
  author = {Peng, Bo and others},
  journal = {arXiv preprint arXiv:2305.13048},
  year = {2023}
}
```