---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- i3-lab/i3-500m
pipeline_tag: text-generation
---

# i3-4096ctx

## Model Description

**i3-4096ctx** is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel **Latent Context Compression** system. This architecture enables the model to efficiently process extended contexts far beyond its base kernel window.

### Architecture Overview

The model employs a two-tier context processing strategy:

- **Base Processing**: 512-token kernel window for direct token-level computation
- **Extended Context**: 4,096-token effective context through latent compression
- **Hybrid Layers**: 12 RWKV layers for efficient sequential processing plus 2 attention layers for high-level reasoning
- **Model Size**: 1,180-dimensional embeddings, ~340M parameters

### Key Innovation: Latent Context Compression

The model's distinguishing feature is a compression mechanism that lets it "remember" contexts 8× larger than its kernel window:

```
Compression Ratio: 512:32 (16:1 compression)
Max Compressed Chunks: 8 chunks
Effective Context: 4,096 tokens
Latent Tokens per Chunk: 32 tokens
```

**How it works:**

1. Input text is processed in 512-token chunks
2. Each chunk is compressed into 32 latent tokens using cross-attention
3. Up to 8 compressed chunks (256 latent tokens) are maintained as context
4. New chunks attend to both the current tokens and the compressed history

This approach provides several advantages (a code sketch of the compression step follows the Performance section below):

- **Memory Efficient**: Stores 4,096 tokens of history in just 256 latent representations
- **Computationally Efficient**: Avoids quadratic attention over long sequences
- **Semantically Rich**: Learned compression preserves relevant information

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |

## Performance

**Final Training Metrics (Iteration 270):**

- **Loss**: 0.0933
- **Perplexity**: 1.14
- **Training Speed**: 202 tokens/second
- **Compression**: 256 latent tokens active

Training converged at a perplexity of 1.14, demonstrating strong language modeling performance while context compression remained active throughout.
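The compression module is not reproduced in this card, so the PyTorch sketch below is only an illustration of the cross-attention step described under "Key Innovation" above. The class name `LatentCompressor`, the head count, and the initialization details are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class LatentCompressor(nn.Module):
    """Minimal sketch: compress a 512-token chunk into 32 latent tokens.

    Hypothetical reconstruction based on the card's description, not the
    actual i3-4096ctx code. Note that 1,180 is not divisible by 8, which
    is presumably why the card mentions automatic head-count adjustment;
    4 heads of dimension 295 are used here (1180 = 4 * 295).
    """

    def __init__(self, d_model: int = 1180, n_latents: int = 32, n_heads: int = 4):
        super().__init__()
        # Learnable latent query vectors, one set shared across all chunks.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        """chunk: (batch, 512, d_model) -> (batch, 32, d_model)."""
        queries = self.latents.unsqueeze(0).expand(chunk.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, chunk, chunk)  # latents read the chunk
        latents = self.norm1(queries + attended)              # residual + norm
        latents = self.norm2(latents + self.ffn(latents))     # feedforward refinement
        return latents
```

Each call realizes the 16:1 ratio quoted above: 512 input tokens in, 32 latent tokens out.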
## Architecture Details

### Layer Configuration

**RWKV Layers (12 layers):**

- Linear-time complexity for sequential processing
- Time-mixing and channel-mixing mechanisms
- JIT-optimized parallel implementation
- Efficient for base token processing

**Attention Layers (2 layers):**

- Full multi-head attention with 8 heads
- 4× FFN expansion ratio
- Causal masking for autoregressive generation
- High-level reasoning and long-range dependencies

**Compression Module:**

- Learnable latent query vectors (32 per chunk)
- Cross-attention-based compression
- Layer normalization and feedforward refinement
- Automatic head-count adjustment for dimension compatibility

### Training Configuration

- **Sequence Length**: 512 tokens (aligned with the kernel window)
- **Batch Size**: 4
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 4e-4 (cosine schedule with warmup)
- **Compression Warmup**: Enabled after 100 iterations, with a 50-iteration warmup period
- **Optimization**: AdamW with gradient clipping and mixed-precision training

## Tokenizer

- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary Size**: 32,000 tokens
- **Special Tokens**: Includes `<|im_start|>`, `<|im_end|>`, `<|system|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`, `<|eot_id|>`, `[INST]`, `[/INST]`
- **Pre-tokenizer**: ByteLevel encoding

## Intended Use

This model is designed for:

- Research into efficient long-context language modeling
- Applications requiring extended context understanding on limited compute
- Exploration of hybrid RWKV-attention architectures
- Investigation of learned compression techniques for language models

## Limitations

- Context beyond 4,096 tokens is not accessible, even through compression
- Compression is lossy and may not preserve all fine-grained details from distant context
- Generation speed depends on maintaining the compressed history
- Trained primarily on English text (FineWeb-Edu)

## Technical Notes

**Memory Management:**

- Compressed history is detached from the computation graph to prevent backpropagation through time
- Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
- Automatic pruning when history exceeds capacity

**Inference Behavior:**

- During generation, compressed history accumulates progressively
- Each completed 512-token chunk adds 32 latent tokens to the context
- The oldest chunks are dropped once history exceeds the 4,096-token equivalent
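The rolling-history behavior described above can be summarized in a short sketch. This is a hypothetical reconstruction that assumes a compressor mapping a `(1, 512, d)` chunk to `(1, 32, d)` latents (such as the `LatentCompressor` sketched earlier); `CompressedHistory` and its methods are illustrative names, not the released API.

```python
from collections import deque
from typing import Optional

import torch

MAX_CHUNKS = 8  # 8 chunks x 32 latents = 256 latent tokens (~4,096-token history)


class CompressedHistory:
    """FIFO buffer of compressed chunks used during generation (sketch)."""

    def __init__(self, max_chunks: int = MAX_CHUNKS):
        # deque(maxlen=...) silently evicts the oldest chunk at capacity,
        # matching the "automatic pruning" described under Memory Management.
        self._chunks: deque = deque(maxlen=max_chunks)

    def append(self, latents: torch.Tensor) -> None:
        # Detach so no gradients flow back through earlier chunks
        # (no backpropagation through time across chunk boundaries).
        self._chunks.append(latents.detach())

    def context(self) -> Optional[torch.Tensor]:
        # Up to 256 latent tokens, concatenated along the sequence axis.
        if not self._chunks:
            return None
        return torch.cat(list(self._chunks), dim=1)
```

In a generation loop, each completed 512-token chunk would be compressed and appended before the next chunk is processed; `history.context()` then supplies the latent tokens that new chunks attend to alongside the current window.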