---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- i3-lab/i3-500m
pipeline_tag: text-generation
---

# i3-4096ctx

## Model Description

**i3-4096ctx** is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel **Latent Context Compression** system. This architecture enables the model to efficiently process extended contexts far beyond its base kernel window.

### Architecture Overview

The model employs a two-tier context processing strategy:

- **Base Processing**: 512-token kernel window for direct token-level computation
- **Extended Context**: 4,096-token effective context through latent compression
- **Hybrid Layers**: 12 RWKV layers for efficient sequential processing plus 2 attention layers for high-level reasoning
- **Model Size**: 1,180-dimensional embeddings, ~340M parameters

### Key Innovation: Latent Context Compression

The model's distinguishing feature is a compression mechanism that lets it "remember" contexts 8× larger than its kernel window:

```
Compression Ratio: 512:32 (16:1 compression)
Max Compressed Chunks: 8 chunks
Effective Context: 4,096 tokens
Latent Tokens per Chunk: 32 tokens
```

**How it works:**

1. Input text is processed in 512-token chunks
2. Each chunk is compressed into 32 latent tokens using cross-attention
3. Up to 8 compressed chunks (256 latent tokens) are maintained as context
4. New chunks attend to both the current tokens and the compressed history

This approach provides several advantages (a code sketch of the compression step follows the Performance section below):

- **Memory Efficient**: Stores 4,096 tokens of history in just 256 latent representations
- **Computationally Efficient**: Avoids quadratic attention over long sequences
- **Semantically Rich**: Learned compression preserves relevant information

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |

## Performance

**Final Training Metrics (Iteration 270):**

- **Loss**: 0.0933
- **Perplexity**: 1.14
- **Training Speed**: 202 tokens/second
- **Compression**: 256 latent tokens active

Training converged at a perplexity of 1.14, demonstrating strong language modeling performance while context compression remained active throughout.
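The compression module is not reproduced in this card, so the PyTorch sketch below is only an illustration of the cross-attention step described under "Key Innovation" above. The class name `LatentCompressor`, the head count, and the initialization details are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class LatentCompressor(nn.Module):
    """Minimal sketch: compress a 512-token chunk into 32 latent tokens.

    Hypothetical reconstruction based on the card's description, not the
    actual i3-4096ctx code. Note that 1,180 is not divisible by 8, which
    is presumably why the card mentions automatic head-count adjustment;
    4 heads of dimension 295 are used here (1180 = 4 * 295).
    """

    def __init__(self, d_model: int = 1180, n_latents: int = 32, n_heads: int = 4):
        super().__init__()
        # Learnable latent query vectors, one set shared across all chunks.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        """chunk: (batch, 512, d_model) -> (batch, 32, d_model)."""
        queries = self.latents.unsqueeze(0).expand(chunk.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, chunk, chunk)  # latents read the chunk
        latents = self.norm1(queries + attended)              # residual + norm
        latents = self.norm2(latents + self.ffn(latents))     # feedforward refinement
        return latents
```

Each call realizes the 16:1 ratio quoted above: 512 input tokens in, 32 latent tokens out.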
## Architecture Details

### Layer Configuration

**RWKV Layers (12 layers):**

- Linear-time complexity for sequential processing
- Time-mixing and channel-mixing mechanisms
- JIT-optimized parallel implementation
- Efficient for base token processing

**Attention Layers (2 layers):**

- Full multi-head attention with 8 heads
- 4× FFN expansion ratio
- Causal masking for autoregressive generation
- High-level reasoning and long-range dependencies

**Compression Module:**

- Learnable latent query vectors (32 per chunk)
- Cross-attention-based compression
- Layer normalization and feedforward refinement
- Automatic head-count adjustment for dimension compatibility

### Training Configuration

- **Sequence Length**: 512 tokens (aligned with the kernel window)
- **Batch Size**: 4
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 4e-4 (cosine schedule with warmup)
- **Compression Warmup**: Enabled after 100 iterations, with a 50-iteration warmup period
- **Optimization**: AdamW with gradient clipping and mixed-precision training

## Tokenizer

- **Type**: Byte-Pair Encoding (BPE)
- **Vocabulary Size**: 32,000 tokens
- **Special Tokens**: Includes `<|im_start|>`, `<|im_end|>`, `<|system|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`, `<|eot_id|>`, `[INST]`, `[/INST]`
- **Pre-tokenizer**: ByteLevel encoding

## Intended Use

This model is designed for:

- Research into efficient long-context language modeling
- Applications requiring extended context understanding on limited compute
- Exploration of hybrid RWKV-attention architectures
- Investigation of learned compression techniques for language models

## Limitations

- Context beyond 4,096 tokens is not accessible, even through compression
- Compression is lossy and may not preserve all fine-grained details from distant context
- Generation speed depends on maintaining the compressed history
- Trained primarily on English text (FineWeb-Edu)

## Technical Notes

**Memory Management:**

- Compressed history is detached from the computation graph to prevent backpropagation through time
- Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
- Automatic pruning when history exceeds capacity

**Inference Behavior:**

- During generation, compressed history accumulates progressively
- Each completed 512-token chunk adds 32 latent tokens to the context
- The oldest chunks are dropped once history exceeds the 4,096-token equivalent
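The rolling-history behavior described above can be summarized in a short sketch. This is a hypothetical reconstruction that assumes a compressor mapping a `(1, 512, d)` chunk to `(1, 32, d)` latents (such as the `LatentCompressor` sketched earlier); `CompressedHistory` and its methods are illustrative names, not the released API.

```python
from collections import deque
from typing import Optional

import torch

MAX_CHUNKS = 8  # 8 chunks x 32 latents = 256 latent tokens (~4,096-token history)


class CompressedHistory:
    """FIFO buffer of compressed chunks used during generation (sketch)."""

    def __init__(self, max_chunks: int = MAX_CHUNKS):
        # deque(maxlen=...) silently evicts the oldest chunk at capacity,
        # matching the "automatic pruning" described under Memory Management.
        self._chunks: deque = deque(maxlen=max_chunks)

    def append(self, latents: torch.Tensor) -> None:
        # Detach so no gradients flow back through earlier chunks
        # (no backpropagation through time across chunk boundaries).
        self._chunks.append(latents.detach())

    def context(self) -> Optional[torch.Tensor]:
        # Up to 256 latent tokens, concatenated along the sequence axis.
        if not self._chunks:
            return None
        return torch.cat(list(self._chunks), dim=1)
```

In a generation loop, each completed 512-token chunk would be compressed and appended before the next chunk is processed; `history.context()` then supplies the latent tokens that new chunks attend to alongside the current window.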