---
license: apache-2.0
language:
- en
base_model:
- FlameF0X/i3-BERT
pipeline_tag: fill-mask
---
# [i3-BERT](https://github.com/FlameF0X/open-i3/tree/main/src/fill-mask/BERT): Hybrid RWKV-Transformer for Efficient Pre-training

A novel hybrid language model architecture combining the efficiency of RWKV's linear attention with the global reasoning capabilities of standard transformers, designed for BERT-style masked language modeling tasks.
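
Since the card declares a `fill-mask` pipeline, a typical way to try the model would be through the Hugging Face `transformers` pipeline API. This is a hedged sketch, not verified against the repository: the checkpoint may need `trust_remote_code=True` if the hybrid architecture ships as custom modelling code, and the `[MASK]` token is assumed to follow the standard BERT convention.

```python
from transformers import pipeline

# Hypothetical usage sketch: load the checkpoint through the fill-mask pipeline.
# trust_remote_code=True is only needed if the hybrid architecture is shipped
# as custom modelling code alongside the weights (an assumption).
unmasker = pipeline("fill-mask", model="FlameF0X/i3-BERT", trust_remote_code=True)

# Assumes the tokenizer uses the standard BERT-style [MASK] token.
print(unmasker("The capital of France is [MASK]."))
```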
## Architecture Overview
**i3-BERT** implements a two-tier architecture:

- **Bottom Layers (Bi-RWKV)**: Process local context efficiently using bidirectional RWKV blocks with O(T) complexity
- **Top Layers (Full Attention)**: Perform global reasoning and capture long-range dependencies with O(T²) multi-head attention

This design leverages the strengths of both approaches: RWKV handles syntactic structure and local patterns efficiently, while the attention layers enable global information retrieval and complex reasoning.
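
To make the bottom tier concrete, below is a minimal, unoptimized PyTorch sketch of a bidirectional RWKV-style mixing block: a single-direction WKV recurrence is run once left-to-right and once over the reversed sequence, and the two passes are merged. This is an illustrative assumption, not the repository's code: the naive Python loop stands in for the JIT-compiled kernel described under Key Features, and the names (`BiRWKVBlock`, `decay`, `bonus`) and the concatenate-then-project merge are placeholders.

```python
import torch
import torch.nn as nn

def wkv_linear_attention(k, v, w, u):
    """Naive O(T) WKV-style recurrence, single direction.
    k, v: (B, T, C) keys and values; w, u: (C,) per-channel decay and current-token bonus."""
    B, T, C = k.shape
    num = torch.zeros(B, C, device=k.device, dtype=k.dtype)  # running sum of exp(k_i) * v_i
    den = torch.zeros(B, C, device=k.device, dtype=k.dtype)  # running sum of exp(k_i)
    decay = torch.exp(-torch.exp(w))                         # per-channel decay factor in (0, 1)
    out = []
    for t in range(T):
        kt, vt = k[:, t], v[:, t]
        bonus = torch.exp(u + kt)                            # extra weight for the current token
        out.append((num + bonus * vt) / (den + bonus + 1e-8))
        num = decay * num + torch.exp(kt) * vt
        den = decay * den + torch.exp(kt)
    return torch.stack(out, dim=1)

class BiRWKVBlock(nn.Module):
    """Bidirectional RWKV-style token mixing: the recurrence runs once
    left-to-right and once right-to-left, and the two passes are merged."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.merge = nn.Linear(2 * dim, dim, bias=False)   # combine forward + backward passes
        self.decay = nn.Parameter(torch.zeros(dim))        # "w" in RWKV notation
        self.bonus = nn.Parameter(torch.zeros(dim))        # "u" in RWKV notation

    def forward(self, x):
        h = self.norm(x)
        k, v = self.key(h), self.value(h)
        r = torch.sigmoid(self.receptance(h))              # gate on the mixed output
        fwd = wkv_linear_attention(k, v, self.decay, self.bonus)
        bwd = wkv_linear_attention(k.flip(1), v.flip(1), self.decay, self.bonus).flip(1)
        return x + r * self.merge(torch.cat([fwd, bwd], dim=-1))
```

Because each direction is a single linear-time scan, the whole block stays O(T) in sequence length, unlike quadratic self-attention.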
## Key Features
- **Bidirectional RWKV**: Novel implementation running RWKV in both forward and backward directions for non-causal tasks
- **JIT-Optimized WKV Kernel**: Compiled linear-attention kernel for faster training
- **Hybrid Layer Stack**: Configurable ratio of 4 Bi-RWKV to 4 attention layers (see the sketch after this list)
- **Standard BERT Pre-training**: MLM (Masked Language Modeling) + NSP (Next Sentence Prediction)
- **Streaming Data Pipeline**: Streams examples instead of loading the full dataset into memory
- **116M Parameters**: Educational-scale model suitable for consumer GPUs
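
As referenced in the Hybrid Layer Stack item above, here is a minimal sketch of how the 4 + 4 stack could be wired together, reusing the `BiRWKVBlock` sketched earlier and PyTorch's built-in `nn.TransformerEncoderLayer` as the attention tier. The class name, hidden size, and the use of the built-in encoder layer are assumptions for illustration, not the actual implementation.

```python
import torch.nn as nn
# BiRWKVBlock refers to the bidirectional block sketched in the Architecture Overview above.

class I3HybridEncoder(nn.Module):
    """Illustrative two-tier encoder: Bi-RWKV blocks below, full attention above."""
    def __init__(self, dim=768, n_heads=12, n_rwkv=4, n_attn=4):
        super().__init__()
        # Bottom tier: linear-time bidirectional RWKV blocks for local context.
        self.rwkv_layers = nn.ModuleList(BiRWKVBlock(dim) for _ in range(n_rwkv))
        # Top tier: standard O(T^2) multi-head attention blocks for global reasoning.
        self.attn_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(n_attn)
        )

    def forward(self, x, key_padding_mask=None):
        for blk in self.rwkv_layers:
            x = blk(x)                                          # O(T) per layer
        for blk in self.attn_layers:
            x = blk(x, src_key_padding_mask=key_padding_mask)   # O(T^2) per layer
        return x
```

Changing `n_rwkv` and `n_attn` adjusts the balance between the two tiers, which is what the configurable 4:4 ratio above refers to.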
## Training Progress

Representative loss values from pre-training (total loss, MLM term, NSP term):

```
Iter 0 | Loss: 11.2089 | MLM: 10.4452 | NSP: 0.7637
...
Iter 4990 | Loss: 0.1881 | MLM: 0.1489 | NSP: 0.0392
```
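
The `Loss` column in the log above is the sum of the MLM and NSP terms (e.g. 10.4452 + 0.7637 = 11.2089). Below is a minimal sketch of how such a combined BERT-style objective is typically computed; the function name, tensor shapes, and the `-100` ignore-index convention are assumptions rather than the repository's exact code.

```python
import torch.nn.functional as F

def bert_pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Combined BERT-style objective: masked-token cross-entropy + next-sentence cross-entropy.

    mlm_logits: (B, T, vocab_size) predictions at every position
    mlm_labels: (B, T) original token ids at masked positions, -100 elsewhere
    nsp_logits: (B, 2) is-next / not-next scores for each sequence pair
    nsp_labels: (B,) 0 or 1
    """
    # Positions labelled -100 (unmasked tokens) are ignored by cross_entropy.
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss, mlm_loss, nsp_loss
```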