i3-Nano
Collection · 3 items
A novel hybrid language model architecture combining the efficiency of RWKV's linear attention with the global reasoning capabilities of standard transformers, designed for BERT-style masked language modeling tasks.
i3-BERT implements a two-tier architecture: a lower tier of RWKV-style linear-attention layers followed by an upper tier of standard transformer attention layers.
This design philosophy leverages the strengths of both approaches: RWKV handles syntactic structure and local patterns efficiently, while attention layers enable global information retrieval and complex reasoning.
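As a rough illustration of this layout, the following PyTorch sketch stacks a simplified linear-attention mixer (a stand-in for the actual RWKV blocks) beneath standard attention layers. The layer counts, dimensions, and mixer details here are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the two-tier layout. NOT the released implementation:
# the mixer below is a generic linear-attention stand-in for RWKV, and all
# sizes/layer counts are assumptions.
import torch
import torch.nn as nn


class LinearAttentionMixer(nn.Module):
    """Simplified O(n) token mixer standing in for an RWKV block."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = q.softmax(dim=-1), k.softmax(dim=-2)
        # Linear attention: summarize keys/values first, then query the summary.
        context = torch.einsum("bnd,bne->bde", k, v)
        return self.out(torch.einsum("bnd,bde->bne", q, context))


class HybridEncoder(nn.Module):
    """Lower tier: linear-attention layers; upper tier: full self-attention."""

    def __init__(self, dim: int = 256, local_layers: int = 8, global_layers: int = 4):
        super().__init__()
        self.local = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), LinearAttentionMixer(dim))
            for _ in range(local_layers)
        )
        self.global_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(global_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.local:          # efficient local/syntactic processing
            x = x + layer(x)              # residual connection
        for layer in self.global_layers:  # global retrieval and reasoning
            x = layer(x)
        return x
```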
Training log (excerpt):

```
Iter 0    | Loss: 11.2089 | MLM: 10.4452 | NSP: 0.7637
...
Iter 4990 | Loss: 0.1881  | MLM: 0.1489  | NSP: 0.0392
```
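The logged totals are consistent with an unweighted sum of the two objectives (10.4452 + 0.7637 = 11.2089; 0.1489 + 0.0392 = 0.1881). A minimal sketch of such a joint BERT-style loss, assuming standard cross-entropy heads and the conventional -100 ignore index for unmasked positions:

```python
# Sketch of a joint MLM + NSP objective consistent with the logged numbers.
# The unweighted sum and the -100 ignore index are assumptions.
import torch
import torch.nn.functional as F


def joint_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    # MLM: cross-entropy over the vocabulary at masked positions only.
    mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    # NSP: binary next-sentence classification.
    nsp = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm + nsp, mlm, nsp
```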
Base model: i3-lab/i3-BERT-v1
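Assuming the checkpoint is published on the Hugging Face Hub under that ID, loading it would plausibly look like the sketch below; `trust_remote_code=True` is an assumption, since a custom hybrid architecture typically ships its own modeling code.

```python
# Hypothetical usage; assumes i3-lab/i3-BERT-v1 exists on the Hub and
# provides custom modeling code (hence trust_remote_code=True).
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("i3-lab/i3-BERT-v1")
model = AutoModelForMaskedLM.from_pretrained(
    "i3-lab/i3-BERT-v1", trust_remote_code=True
)
```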