---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- i3-lab/i3-500m
pipeline_tag: text-generation
---

# i3-4096ctx

## Model Description

**i3-4096ctx** is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel **Latent Context Compression** system. This architecture enables the model to efficiently process contexts far beyond its base kernel window.

### Architecture Overview

The model employs a two-tier context processing strategy:

- **Base Processing**: 512-token kernel window for direct token-level computation
- **Extended Context**: 4,096-token effective context through latent compression
- **Hybrid Layers**: 12 RWKV layers for efficient sequential processing + 2 attention layers for high-level reasoning
- **Model Size**: 1,180-dimensional embeddings, ~340M parameters

### Key Innovation: Latent Context Compression

The model's distinguishing feature is a compression mechanism that allows it to "remember" contexts 8× larger than its kernel window:

```
Compression Ratio: 512:32 (16:1 compression)
Max Compressed Chunks: 8 chunks
Effective Context: 4,096 tokens
Latent Tokens per Chunk: 32 tokens
```

**How it works** (see the sketch after this list):
1. Input text is processed in 512-token chunks
2. Each chunk is compressed into 32 latent tokens using cross-attention
3. Up to 8 compressed chunks (256 latent tokens) are maintained as context
4. New chunks attend to both current tokens and compressed history
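
To make the chunking loop concrete, here is a minimal sketch of one compression cycle, assuming a PyTorch implementation. The names (`process_long_input`, `model`, `compressor`) and call signatures are illustrative assumptions, not the repository's actual API:

```python
import torch

KERNEL_WINDOW = 512   # tokens processed directly per chunk
MAX_CHUNKS = 8        # 8 chunks × 32 latents = 256 latent tokens ≈ 4,096-token context

def process_long_input(token_ids, model, compressor):
    """Walk a long sequence in 512-token chunks, carrying compressed history."""
    history = []      # list of (1, 32, d_model) latent tensors, newest last
    hidden = None
    for start in range(0, token_ids.size(1), KERNEL_WINDOW):
        chunk = token_ids[:, start:start + KERNEL_WINDOW]
        # Flatten history into a (1, n_chunks * 32, d_model) memory the chunk attends to.
        memory = torch.cat(history, dim=1) if history else None
        hidden = model(chunk, memory=memory)   # step 4: attend to current tokens + history
        latents = compressor(hidden)           # step 2: 512 tokens -> 32 latent tokens
        history.append(latents.detach())       # detached: no backprop through old chunks
        if len(history) > MAX_CHUNKS:          # step 3: keep at most 8 compressed chunks
            history.pop(0)
    return hidden
```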

This approach provides several advantages:
- **Memory Efficient**: Stores 4K tokens of context in just 256 latent representations
- **Computationally Efficient**: Avoids quadratic attention over long sequences
- **Semantically Rich**: Learned compression preserves relevant information

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |
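
To wire these numbers into code, a configuration object might look like the following; this is a hypothetical schema for illustration, not the repository's actual config class:

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    # Values taken from the specification table above.
    vocab_size: int = 32_000
    d_model: int = 1_180        # embedding dimension
    n_rwkv_layers: int = 12
    n_attn_layers: int = 2
    n_heads: int = 8
    kernel_window: int = 512    # tokens processed directly
    latents_per_chunk: int = 32
    max_chunks: int = 8         # 8 × 512 = 4,096-token effective context
    ffn_mult: int = 4           # 4× FFN expansion in the attention blocks
```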

## Performance

**Final Training Metrics (Iteration 270):**
- **Loss**: 0.0933
- **Perplexity**: 1.14
- **Training Speed**: 202 tokens/second
- **Compression**: 256 latent tokens active

The model converged at a perplexity of 1.14 while maintaining efficient context compression throughout training.

## Architecture Details

### Layer Configuration

**RWKV Layers (12 layers)**, sketched in simplified form below:
- Linear-time complexity for sequential processing
- Time-mixing and channel-mixing mechanisms
- JIT-optimized parallel implementation
- Efficient for base token processing
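
The core of an RWKV layer is the WKV time-mixing recurrence, which replaces attention with a per-channel decayed average and runs in linear time. Below is a deliberately simplified, unstabilized loop version; real RWKV kernels fuse this loop and add numerical stabilization, and token-shift and channel-mixing are omitted here:

```python
import torch

def wkv_recurrence(k, v, w, u):
    """Simplified RWKV time-mixing (WKV) recurrence, linear in sequence length.

    k, v: (T, d) key/value streams; w: (d,) positive per-channel decay;
    u: (d,) bonus applied to the current token.
    """
    T, d = k.shape
    a = torch.zeros(d)          # running decayed sum of exp(k_i) * v_i
    b = torch.zeros(d)          # running decayed sum of exp(k_i) (normalizer)
    out = torch.zeros(T, d)
    for t in range(T):
        e_k = torch.exp(k[t])
        e_u = torch.exp(u) * e_k
        out[t] = (a + e_u * v[t]) / (b + e_u)   # weighted average of history + current
        a = torch.exp(-w) * a + e_k * v[t]      # decay old history, add current token
        b = torch.exp(-w) * b + e_k
    return out
```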

**Attention Layers (2 layers)**, with a minimal block sketched below:
- Full multi-head attention with 8 heads
- 4× FFN expansion ratio
- Causal masking for autoregressive generation
- High-level reasoning and long-range dependencies
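
A minimal causal attention block matching these bullets might look as follows. This is a sketch, not the repository's code: the pre-norm layout is an assumption, and since 1,180 is not evenly divisible by 8 heads, the sketch borrows the automatic head-count adjustment the card describes for the compression module:

```python
import torch.nn as nn

def adjust_heads(d_model: int, requested: int) -> int:
    """Largest head count <= requested that divides d_model (1180, 8 -> 5)."""
    heads = requested
    while d_model % heads:
        heads -= 1
    return heads

class AttentionBlock(nn.Module):
    """Pre-norm causal self-attention block with a 4x-expanded FFN (sketch)."""
    def __init__(self, d_model: int = 1180, n_heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, adjust_heads(d_model, n_heads),
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )

    def forward(self, x):                                # x: (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        q = self.norm1(x)
        h, _ = self.attn(q, q, q, attn_mask=mask)        # causal masking
        x = x + h
        return x + self.ffn(self.norm2(x))
```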

**Compression Module**, sketched after this list:
- Learnable latent query vectors (32 per chunk)
- Cross-attention based compression
- Layer normalization and feedforward refinement
- Automatic head count adjustment for dimension compatibility
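
Putting those four bullets together, a plausible implementation of the compressor is sketched below; the class name, residual placement, and FFN width are assumptions:

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Compress a 512-token chunk into 32 latent tokens via cross-attention (sketch)."""
    def __init__(self, d_model: int = 1180, n_latents: int = 32, n_heads: int = 8):
        super().__init__()
        while d_model % n_heads:    # automatic head count adjustment (1180, 8 -> 5)
            n_heads -= 1
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)  # learnable queries
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, chunk_hidden):                          # (B, 512, d_model)
        q = self.latents.unsqueeze(0).expand(chunk_hidden.size(0), -1, -1)
        z, _ = self.cross_attn(q, chunk_hidden, chunk_hidden)  # latents attend to the chunk
        z = self.norm(z)                                      # normalization
        return z + self.ffn(z)                                # refinement, (B, 32, d_model)
```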

### Training Configuration

- **Sequence Length**: 512 tokens (aligned with the kernel window)
- **Batch Size**: 4
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 4e-4 (cosine schedule with warmup)
- **Compression Warmup**: Enabled after 100 iterations, with a 50-iteration warmup period
- **Optimization**: AdamW with gradient clipping and mixed precision training (see the sketch below)
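
An optimizer setup consistent with these hyperparameters is sketched below; the LR warmup length and the clipping threshold are placeholders, since the card does not state them:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model, max_iters: int, warmup_iters: int = 100, clip: float = 1.0):
    # warmup_iters and clip are assumed values, not taken from the card.
    opt = torch.optim.AdamW(model.parameters(), lr=4e-4)

    def lr_lambda(it: int) -> float:
        if it < warmup_iters:                 # linear warmup
            return it / max(1, warmup_iters)
        progress = (it - warmup_iters) / max(1, max_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

    return opt, LambdaLR(opt, lr_lambda)

# Per step: average the loss over 8 gradient-accumulation micro-batches, then call
# torch.nn.utils.clip_grad_norm_(model.parameters(), clip) before opt.step().
```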

## Tokenizer

- **Type**: Byte-Pair Encoding (BPE); a training sketch follows below
- **Vocabulary Size**: 32,000 tokens
- **Special Tokens**: Includes `<UNK>`, `<PAD>`, `<BOS>`, `<EOS>`, `<|im_start|>`, `<|im_end|>`, `<|system|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`, `<|eot_id|>`, `[INST]`, `[/INST]`
- **Pre-tokenizer**: ByteLevel encoding
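
A tokenizer with these properties can be reproduced roughly as follows, assuming the Hugging Face `tokenizers` library (the card does not say which library was used, and `corpus.txt` is a placeholder path):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SPECIAL_TOKENS = [
    "<UNK>", "<PAD>", "<BOS>", "<EOS>", "<|im_start|>", "<|im_end|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|endoftext|>",
    "<|eot_id|>", "[INST]", "[/INST]",
]

# Byte-level BPE with a 32,000-token vocabulary, as described above.
tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=SPECIAL_TOKENS)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```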

## Intended Use

This model is designed for:
- Research into efficient long-context language modeling
- Applications requiring extended context understanding with limited compute
- Exploration of hybrid RWKV-attention architectures
- Investigation of learned compression techniques for language models

## Limitations

- Context beyond 4,096 tokens is not accessible, even through compression
- Compression is lossy and may not preserve all fine-grained details from distant context
- Generation speed depends on maintaining compressed history
- Trained primarily on English text (FineWeb-Edu)

## Technical Notes

**Memory Management**, sketched after this list:
- Compressed history is detached from the computation graph to prevent backpropagation through time
- Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
- Automatic pruning when history exceeds capacity
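
The detach-and-prune policy fits in a few lines; `CompressedHistory` is a hypothetical helper written to match the bullets above, not the repository's class:

```python
from typing import List, Optional
import torch

class CompressedHistory:
    """FIFO buffer of compressed chunks, capped at 8 chunks (256 latent tokens)."""
    def __init__(self, max_chunks: int = 8):
        self.max_chunks = max_chunks
        self.chunks: List[torch.Tensor] = []    # each (B, 32, d_model)

    def append(self, latents: torch.Tensor) -> None:
        # Detach so gradients never flow back through earlier chunks (no BPTT).
        self.chunks.append(latents.detach())
        if len(self.chunks) > self.max_chunks:  # automatic pruning
            self.chunks.pop(0)                  # drop the oldest chunk first

    def memory(self) -> Optional[torch.Tensor]:
        # Concatenate along the sequence axis: up to (B, 256, d_model).
        return torch.cat(self.chunks, dim=1) if self.chunks else None
```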

**Inference Behavior**, with a generation loop sketched after this list:
- During generation, compressed history accumulates progressively
- Each 512-token chunk adds 32 latent tokens to the context
- The oldest chunks are dropped once the 4,096-token equivalent is exceeded
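
A greedy generation loop matching this behavior might look like the sketch below. It reuses the hypothetical `CompressedHistory` helper from above, and the assumption that the model returns both hidden states and logits is illustrative:

```python
import torch

@torch.no_grad()
def generate(model, compressor, prompt_ids, max_new_tokens: int):
    """Greedy decoding with a rolling compressed context (illustrative sketch)."""
    history = CompressedHistory()              # from the Memory Management sketch
    ids = prompt_ids                           # (1, T) prompt token ids
    for _ in range(max_new_tokens):
        chunk = ids[:, -512:]                  # kernel window of direct tokens
        hidden, logits = model(chunk, memory=history.memory())
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy pick
        ids = torch.cat([ids, next_id], dim=1)
        if ids.size(1) % 512 == 0:             # chunk boundary: compress and store
            history.append(compressor(hidden)) # adds 32 latent tokens to context
    return ids
```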