---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- i3-lab/i3-500m
pipeline_tag: text-generation
---

# i3-4096ctx

## Model Description

**i3-4096ctx** is a hybrid language model that combines RWKV (Receptance Weighted Key Value) layers with standard attention mechanisms, enhanced by a novel **Latent Context Compression** system. This architecture enables the model to efficiently process contexts far beyond its base kernel window.

### Architecture Overview

The model employs a two-tier context processing strategy:

- **Base Processing**: 512-token kernel window for direct token-level computation
- **Extended Context**: 4,096-token effective context through latent compression
- **Hybrid Layers**: 12 RWKV layers for efficient sequential processing + 2 attention layers for high-level reasoning
- **Model Size**: 1,180-dimensional embeddings, ~340M parameters

### Key Innovation: Latent Context Compression

The model's distinguishing feature is a compression mechanism that allows it to "remember" contexts 8× larger than its kernel window:

```
Compression Ratio: 512:32 (16:1 compression)
Max Compressed Chunks: 8 chunks
Effective Context: 4,096 tokens
Latent Tokens per Chunk: 32 tokens
```

**How it works** (see the sketch after this list):
1. Input text is processed in 512-token chunks
2. Each chunk is compressed into 32 latent tokens using cross-attention
3. Up to 8 compressed chunks (256 latent tokens) are maintained as context
4. New chunks attend to both current tokens and compressed history
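
To make the chunking loop concrete, here is a minimal sketch of one compression cycle, assuming a PyTorch implementation. The names (`process_long_input`, `model`, `compressor`) and call signatures are illustrative assumptions, not the repository's actual API:

```python
import torch

KERNEL_WINDOW = 512   # tokens processed directly per chunk
MAX_CHUNKS = 8        # 8 chunks × 32 latents = 256 latent tokens ≈ 4,096-token context

def process_long_input(token_ids, model, compressor):
    """Walk a long sequence in 512-token chunks, carrying compressed history."""
    history = []      # list of (1, 32, d_model) latent tensors, newest last
    hidden = None
    for start in range(0, token_ids.size(1), KERNEL_WINDOW):
        chunk = token_ids[:, start:start + KERNEL_WINDOW]
        # Flatten history into a (1, n_chunks * 32, d_model) memory the chunk attends to.
        memory = torch.cat(history, dim=1) if history else None
        hidden = model(chunk, memory=memory)   # step 4: attend to current tokens + history
        latents = compressor(hidden)           # step 2: 512 tokens -> 32 latent tokens
        history.append(latents.detach())       # detached: no backprop through old chunks
        if len(history) > MAX_CHUNKS:          # step 3: keep at most 8 compressed chunks
            history.pop(0)
    return hidden
```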

This approach provides several advantages:
- **Memory Efficient**: Stores 4K tokens of context in just 256 latent representations
- **Computationally Efficient**: Avoids quadratic attention over long sequences
- **Semantically Rich**: Learned compression preserves relevant information

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Architecture | Hybrid RWKV-Attention with Latent Compression |
| Parameters | ~340M |
| Embedding Dimension | 1,180 |
| RWKV Layers | 12 |
| Attention Layers | 2 |
| Attention Heads | 8 |
| Kernel Window | 512 tokens |
| Effective Context | 4,096 tokens (via compression) |
| Vocabulary Size | 32,000 (BPE) |
| Training Data | FineWeb-Edu (10BT sample) |
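
To wire these numbers into code, a configuration object might look like the following; this is a hypothetical schema for illustration, not the repository's actual config class:

```python
from dataclasses import dataclass

@dataclass
class I3Config:
    # Values taken from the specification table above.
    vocab_size: int = 32_000
    d_model: int = 1_180        # embedding dimension
    n_rwkv_layers: int = 12
    n_attn_layers: int = 2
    n_heads: int = 8
    kernel_window: int = 512    # tokens processed directly
    latents_per_chunk: int = 32
    max_chunks: int = 8         # 8 × 512 = 4,096-token effective context
    ffn_mult: int = 4           # 4× FFN expansion in the attention blocks
```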

## Performance

**Final Training Metrics (Iteration 270):**
- **Loss**: 0.0933
- **Perplexity**: 1.14
- **Training Speed**: 202 tokens/second
- **Compression**: 256 latent tokens active

The model converged at a perplexity of 1.14 while maintaining efficient context compression throughout training.

## Architecture Details

### Layer Configuration

**RWKV Layers (12 layers)**, sketched in simplified form below:
- Linear-time complexity for sequential processing
- Time-mixing and channel-mixing mechanisms
- JIT-optimized parallel implementation
- Efficient for base token processing
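
The core of an RWKV layer is the WKV time-mixing recurrence, which replaces attention with a per-channel decayed average and runs in linear time. Below is a deliberately simplified, unstabilized loop version; real RWKV kernels fuse this loop and add numerical stabilization, and token-shift and channel-mixing are omitted here:

```python
import torch

def wkv_recurrence(k, v, w, u):
    """Simplified RWKV time-mixing (WKV) recurrence, linear in sequence length.

    k, v: (T, d) key/value streams; w: (d,) positive per-channel decay;
    u: (d,) bonus applied to the current token.
    """
    T, d = k.shape
    a = torch.zeros(d)          # running decayed sum of exp(k_i) * v_i
    b = torch.zeros(d)          # running decayed sum of exp(k_i) (normalizer)
    out = torch.zeros(T, d)
    for t in range(T):
        e_k = torch.exp(k[t])
        e_u = torch.exp(u) * e_k
        out[t] = (a + e_u * v[t]) / (b + e_u)   # weighted average of history + current
        a = torch.exp(-w) * a + e_k * v[t]      # decay old history, add current token
        b = torch.exp(-w) * b + e_k
    return out
```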

**Attention Layers (2 layers)**, with a minimal block sketched below:
- Full multi-head attention with 8 heads
- 4× FFN expansion ratio
- Causal masking for autoregressive generation
- High-level reasoning and long-range dependencies
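
A minimal causal attention block matching these bullets might look as follows. This is a sketch, not the repository's code: the pre-norm layout is an assumption, and since 1,180 is not evenly divisible by 8 heads, the sketch borrows the automatic head-count adjustment the card describes for the compression module:

```python
import torch.nn as nn

def adjust_heads(d_model: int, requested: int) -> int:
    """Largest head count <= requested that divides d_model (1180, 8 -> 5)."""
    heads = requested
    while d_model % heads:
        heads -= 1
    return heads

class AttentionBlock(nn.Module):
    """Pre-norm causal self-attention block with a 4x-expanded FFN (sketch)."""
    def __init__(self, d_model: int = 1180, n_heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, adjust_heads(d_model, n_heads),
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )

    def forward(self, x):                                # x: (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        q = self.norm1(x)
        h, _ = self.attn(q, q, q, attn_mask=mask)        # causal masking
        x = x + h
        return x + self.ffn(self.norm2(x))
```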

**Compression Module**, sketched after this list:
- Learnable latent query vectors (32 per chunk)
- Cross-attention based compression
- Layer normalization and feedforward refinement
- Automatic head count adjustment for dimension compatibility
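
Putting those four bullets together, a plausible implementation of the compressor is sketched below; the class name, residual placement, and FFN width are assumptions:

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Compress a 512-token chunk into 32 latent tokens via cross-attention (sketch)."""
    def __init__(self, d_model: int = 1180, n_latents: int = 32, n_heads: int = 8):
        super().__init__()
        while d_model % n_heads:    # automatic head count adjustment (1180, 8 -> 5)
            n_heads -= 1
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)  # learnable queries
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, chunk_hidden):                          # (B, 512, d_model)
        q = self.latents.unsqueeze(0).expand(chunk_hidden.size(0), -1, -1)
        z, _ = self.cross_attn(q, chunk_hidden, chunk_hidden)  # latents attend to the chunk
        z = self.norm(z)                                      # normalization
        return z + self.ffn(z)                                # refinement, (B, 32, d_model)
```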

### Training Configuration

- **Sequence Length**: 512 tokens (aligned with the kernel window)
- **Batch Size**: 4
- **Gradient Accumulation**: 8 steps
- **Learning Rate**: 4e-4 (cosine schedule with warmup)
- **Compression Warmup**: Enabled after 100 iterations, with a 50-iteration warmup period
- **Optimization**: AdamW with gradient clipping and mixed precision training (see the sketch below)
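
An optimizer setup consistent with these hyperparameters is sketched below; the LR warmup length and the clipping threshold are placeholders, since the card does not state them:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(model, max_iters: int, warmup_iters: int = 100, clip: float = 1.0):
    # warmup_iters and clip are assumed values, not taken from the card.
    opt = torch.optim.AdamW(model.parameters(), lr=4e-4)

    def lr_lambda(it: int) -> float:
        if it < warmup_iters:                 # linear warmup
            return it / max(1, warmup_iters)
        progress = (it - warmup_iters) / max(1, max_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

    return opt, LambdaLR(opt, lr_lambda)

# Per step: average the loss over 8 gradient-accumulation micro-batches, then call
# torch.nn.utils.clip_grad_norm_(model.parameters(), clip) before opt.step().
```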

## Tokenizer

- **Type**: Byte-Pair Encoding (BPE); a training sketch follows below
- **Vocabulary Size**: 32,000 tokens
- **Special Tokens**: Includes `<UNK>`, `<PAD>`, `<BOS>`, `<EOS>`, `<|im_start|>`, `<|im_end|>`, `<|system|>`, `<|user|>`, `<|assistant|>`, `<|endoftext|>`, `<|eot_id|>`, `[INST]`, `[/INST]`
- **Pre-tokenizer**: ByteLevel encoding
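
A tokenizer with these properties can be reproduced roughly as follows, assuming the Hugging Face `tokenizers` library (the card does not say which library was used, and `corpus.txt` is a placeholder path):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SPECIAL_TOKENS = [
    "<UNK>", "<PAD>", "<BOS>", "<EOS>", "<|im_start|>", "<|im_end|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|endoftext|>",
    "<|eot_id|>", "[INST]", "[/INST]",
]

# Byte-level BPE with a 32,000-token vocabulary, as described above.
tokenizer = Tokenizer(models.BPE(unk_token="<UNK>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=SPECIAL_TOKENS)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
```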

## Intended Use

This model is designed for:
- Research into efficient long-context language modeling
- Applications requiring extended context understanding with limited compute
- Exploration of hybrid RWKV-attention architectures
- Investigation of learned compression techniques for language models

## Limitations

- Context beyond 4,096 tokens is not accessible, even through compression
- Compression is lossy and may not preserve all fine-grained details from distant context
- Generation speed depends on maintaining compressed history
- Trained primarily on English text (FineWeb-Edu)

## Technical Notes

**Memory Management**, sketched after this list:
- Compressed history is detached from the computation graph to prevent backpropagation through time
- Maximum history maintained: 256 latent tokens (8 chunks × 32 tokens)
- Automatic pruning when history exceeds capacity
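
The detach-and-prune policy fits in a few lines; `CompressedHistory` is a hypothetical helper written to match the bullets above, not the repository's class:

```python
from typing import List, Optional
import torch

class CompressedHistory:
    """FIFO buffer of compressed chunks, capped at 8 chunks (256 latent tokens)."""
    def __init__(self, max_chunks: int = 8):
        self.max_chunks = max_chunks
        self.chunks: List[torch.Tensor] = []    # each (B, 32, d_model)

    def append(self, latents: torch.Tensor) -> None:
        # Detach so gradients never flow back through earlier chunks (no BPTT).
        self.chunks.append(latents.detach())
        if len(self.chunks) > self.max_chunks:  # automatic pruning
            self.chunks.pop(0)                  # drop the oldest chunk first

    def memory(self) -> Optional[torch.Tensor]:
        # Concatenate along the sequence axis: up to (B, 256, d_model).
        return torch.cat(self.chunks, dim=1) if self.chunks else None
```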

**Inference Behavior**, with a generation loop sketched after this list:
- During generation, compressed history accumulates progressively
- Each 512-token chunk adds 32 latent tokens to the context
- The oldest chunks are dropped once the 4,096-token equivalent is exceeded
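
A greedy generation loop matching this behavior might look like the sketch below. It reuses the hypothetical `CompressedHistory` helper from above, and the assumption that the model returns both hidden states and logits is illustrative:

```python
import torch

@torch.no_grad()
def generate(model, compressor, prompt_ids, max_new_tokens: int):
    """Greedy decoding with a rolling compressed context (illustrative sketch)."""
    history = CompressedHistory()              # from the Memory Management sketch
    ids = prompt_ids                           # (1, T) prompt token ids
    for _ in range(max_new_tokens):
        chunk = ids[:, -512:]                  # kernel window of direct tokens
        hidden, logits = model(chunk, memory=history.memory())
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy pick
        ids = torch.cat([ids, next_id], dim=1)
        if ids.size(1) % 512 == 0:             # chunk boundary: compress and store
            history.append(compressor(hidden)) # adds 32 latent tokens to context
    return ids
```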