i3-BERT

A novel hybrid language model architecture combining the efficiency of RWKV's linear attention with the global reasoning capabilities of standard transformers, designed for BERT-style masked language modeling tasks.

Architecture Overview

i3-BERT implements a two-tier architecture:

  • Bottom Layers (Bi-RWKV): Process local context efficiently using bidirectional RWKV blocks with O(T) complexity
  • Top Layers (Full Attention): Perform global reasoning and capture long-range dependencies with O(T²) multi-head attention

This design leverages the strengths of both approaches: RWKV handles syntactic structure and local patterns efficiently, while the attention layers enable global information retrieval and complex reasoning.
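The sketch below illustrates this stacking pattern in PyTorch. The class names, hidden size, and head count are illustrative assumptions (the 4 + 4 split follows the configuration described under Key Features), and BiRWKVBlock here is only a placeholder; a fuller bidirectional RWKV sketch appears later in this card.

```python
# Minimal sketch of the two-tier stack, assuming a PyTorch implementation.
# HybridEncoder and BiRWKVBlock are illustrative names, not the repository's API.
import torch
import torch.nn as nn

class BiRWKVBlock(nn.Module):
    """Placeholder for a bidirectional RWKV block (sketched under Key Features)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.mix(self.norm(x))   # residual mixing stub

class HybridEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_rwkv=4, n_attn=4):
        super().__init__()
        # Bottom tier: linear-complexity Bi-RWKV blocks for local context
        self.rwkv_layers = nn.ModuleList(BiRWKVBlock(d_model) for _ in range(n_rwkv))
        # Top tier: quadratic multi-head self-attention for global reasoning
        self.attn_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_attn)
        )

    def forward(self, x, key_padding_mask=None):
        for blk in self.rwkv_layers:          # local / syntactic processing, O(T)
            x = blk(x)
        for layer in self.attn_layers:        # global reasoning, O(T²)
            x = layer(x, src_key_padding_mask=key_padding_mask)
        return x

# Usage: batch of 2 sequences, 128 tokens, 768-dim embeddings
h = HybridEncoder()(torch.randn(2, 128, 768))
```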

Key Features

  • Bidirectional RWKV: Novel implementation running RWKV in both forward and backward directions for non-causal tasks (see the sketch after this list)
  • JIT-Optimized WKV Kernel: Compiled linear attention mechanism for faster training
  • Hybrid Layer Stack: Configurable ratio of 4 Bi-RWKV to 4 Attention layers
  • Standard BERT Pre-training: MLM (Masked Language Modeling) + NSP (Next Sentence Prediction)
  • Streaming Data Pipeline: Streams training data instead of loading the full dataset into memory, avoiding out-of-memory issues on large corpora
  • 116M Parameters: Educational-scale model suitable for consumer GPUs
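As referenced in the first bullet above, here is a minimal sketch of how a bidirectional RWKV time-mixing block can be built: run the causal WKV recurrence left-to-right, run it again over the time-reversed sequence, and combine the two. The recurrence shown is the naive, un-stabilized form of WKV; all names, the parameter sharing across directions, and the sum-based combination are assumptions rather than the repository's exact code (the real model uses a JIT-compiled kernel).

```python
import torch
import torch.nn as nn

def wkv_linear_attention(k, v, w, u):
    """Naive (un-stabilized) WKV recurrence, O(T) in sequence length.
    k, v: (B, T, C); w: per-channel decay parameter; u: per-channel "bonus"."""
    B, T, C = k.shape
    decay = torch.exp(-torch.exp(w))              # per-channel decay factor in (0, 1)
    num = torch.zeros(B, C, device=k.device)
    den = torch.zeros(B, C, device=k.device)
    out = []
    for t in range(T):
        kt, vt = k[:, t], v[:, t]
        out.append((num + torch.exp(u + kt) * vt) / (den + torch.exp(u + kt) + 1e-8))
        num = decay * num + torch.exp(kt) * vt    # accumulate weighted values
        den = decay * den + torch.exp(kt)         # accumulate weights
    return torch.stack(out, dim=1)

class BiRWKVTimeMix(nn.Module):
    """Hypothetical bidirectional time-mixing: one causal WKV pass left-to-right,
    one over the reversed sequence, outputs summed and gated by receptance."""
    def __init__(self, d_model):
        super().__init__()
        self.key = nn.Linear(d_model, d_model, bias=False)
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.receptance = nn.Linear(d_model, d_model, bias=False)
        self.output = nn.Linear(d_model, d_model, bias=False)
        self.w = nn.Parameter(torch.zeros(d_model))   # channel-wise decay
        self.u = nn.Parameter(torch.zeros(d_model))   # bonus for the current token

    def forward(self, x):
        k, v, r = self.key(x), self.value(x), torch.sigmoid(self.receptance(x))
        fwd = wkv_linear_attention(k, v, self.w, self.u)
        bwd = wkv_linear_attention(k.flip(1), v.flip(1), self.w, self.u).flip(1)
        return self.output(r * (fwd + bwd))

# Usage: 2 sequences of 16 tokens with 64 channels
y = BiRWKVTimeMix(64)(torch.randn(2, 16, 64))
```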

Pre-training Results

Loss trajectory during MLM + NSP pre-training (first and last logged iterations):

  Iter 0    | Loss: 11.2089 | MLM: 10.4452 | NSP: 0.7637
  ...
  Iter 4990 | Loss: 0.1881  | MLM: 0.1489  | NSP: 0.0392
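The MLM loss above is the standard masked-token cross-entropy. For reference, below is a generic sketch of BERT-style 80/10/10 masking over roughly 15% of tokens; it follows the common recipe and is not necessarily this repository's exact data pipeline (special-token handling is omitted for brevity).

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Generic BERT-style MLM masking: select ~15% of tokens, then
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100                       # ignore unmasked positions in the loss
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id           # 80% of selected: [MASK]
    random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[random] = torch.randint(vocab_size, input_ids.shape)[random]  # 10%: random
    return input_ids, labels                     # remaining 10%: unchanged

# Usage with a hypothetical 30k-token vocabulary and [MASK] id 103
ids, labels = mask_tokens(torch.randint(5, 30000, (2, 128)), mask_token_id=103, vocab_size=30000)
```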

Model Tree

i3-lab/i3-BERT-v2 is fine-tuned from the base model i3-lab/i3-BERT-v1.