---
license: apache-2.0
language:
- en
base_model:
- FlameF0X/i3-BERT
pipeline_tag: fill-mask
---
# [i3-BERT](https://github.com/FlameF0X/open-i3/tree/main/src/fill-mask/BERT): Hybrid RWKV-Transformer for Efficient Pre-training

A novel hybrid language model architecture combining the efficiency of RWKV's linear attention with the global reasoning capabilities of standard transformers, designed for BERT-style masked language modeling tasks.
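
Since the card declares a `fill-mask` pipeline, a typical way to try the model would be through the Hugging Face `transformers` pipeline API. This is a hedged sketch, not verified against the repository: the checkpoint may need `trust_remote_code=True` if the hybrid architecture ships as custom modelling code, and the `[MASK]` token is assumed to follow the standard BERT convention.

```python
from transformers import pipeline

# Hypothetical usage sketch: load the checkpoint through the fill-mask pipeline.
# trust_remote_code=True is only needed if the hybrid architecture is shipped
# as custom modelling code alongside the weights (an assumption).
unmasker = pipeline("fill-mask", model="FlameF0X/i3-BERT", trust_remote_code=True)

# Assumes the tokenizer uses the standard BERT-style [MASK] token.
print(unmasker("The capital of France is [MASK]."))
```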
## Architecture Overview
**i3-BERT** implements a two-tier architecture:

- **Bottom Layers (Bi-RWKV)**: Process local context efficiently using bidirectional RWKV blocks with O(T) complexity
- **Top Layers (Full Attention)**: Perform global reasoning and capture long-range dependencies with O(T²) multi-head attention

This design leverages the strengths of both approaches: RWKV handles syntactic structure and local patterns efficiently, while the attention layers enable global information retrieval and complex reasoning.
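
To make the bottom tier concrete, below is a minimal, unoptimized PyTorch sketch of a bidirectional RWKV-style mixing block: a single-direction WKV recurrence is run once left-to-right and once over the reversed sequence, and the two passes are merged. This is an illustrative assumption, not the repository's code: the naive Python loop stands in for the JIT-compiled kernel described under Key Features, and the names (`BiRWKVBlock`, `decay`, `bonus`) and the concatenate-then-project merge are placeholders.

```python
import torch
import torch.nn as nn

def wkv_linear_attention(k, v, w, u):
    """Naive O(T) WKV-style recurrence, single direction.
    k, v: (B, T, C) keys and values; w, u: (C,) per-channel decay and current-token bonus."""
    B, T, C = k.shape
    num = torch.zeros(B, C, device=k.device, dtype=k.dtype)  # running sum of exp(k_i) * v_i
    den = torch.zeros(B, C, device=k.device, dtype=k.dtype)  # running sum of exp(k_i)
    decay = torch.exp(-torch.exp(w))                         # per-channel decay factor in (0, 1)
    out = []
    for t in range(T):
        kt, vt = k[:, t], v[:, t]
        bonus = torch.exp(u + kt)                            # extra weight for the current token
        out.append((num + bonus * vt) / (den + bonus + 1e-8))
        num = decay * num + torch.exp(kt) * vt
        den = decay * den + torch.exp(kt)
    return torch.stack(out, dim=1)

class BiRWKVBlock(nn.Module):
    """Bidirectional RWKV-style token mixing: the recurrence runs once
    left-to-right and once right-to-left, and the two passes are merged."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.merge = nn.Linear(2 * dim, dim, bias=False)   # combine forward + backward passes
        self.decay = nn.Parameter(torch.zeros(dim))        # "w" in RWKV notation
        self.bonus = nn.Parameter(torch.zeros(dim))        # "u" in RWKV notation

    def forward(self, x):
        h = self.norm(x)
        k, v = self.key(h), self.value(h)
        r = torch.sigmoid(self.receptance(h))              # gate on the mixed output
        fwd = wkv_linear_attention(k, v, self.decay, self.bonus)
        bwd = wkv_linear_attention(k.flip(1), v.flip(1), self.decay, self.bonus).flip(1)
        return x + r * self.merge(torch.cat([fwd, bwd], dim=-1))
```

Because each direction is a single linear-time scan, the whole block stays O(T) in sequence length, unlike quadratic self-attention.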
## Key Features
- **Bidirectional RWKV**: Novel implementation running RWKV in both forward and backward directions for non-causal tasks
- **JIT-Optimized WKV Kernel**: Compiled linear-attention kernel for faster training
- **Hybrid Layer Stack**: Configurable ratio of 4 Bi-RWKV to 4 attention layers (see the sketch after this list)
- **Standard BERT Pre-training**: MLM (Masked Language Modeling) + NSP (Next Sentence Prediction)
- **Streaming Data Pipeline**: Streams examples instead of loading the full dataset into memory
- **116M Parameters**: Educational-scale model suitable for consumer GPUs
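
As referenced in the Hybrid Layer Stack item above, here is a minimal sketch of how the 4 + 4 stack could be wired together, reusing the `BiRWKVBlock` sketched earlier and PyTorch's built-in `nn.TransformerEncoderLayer` as the attention tier. The class name, hidden size, and the use of the built-in encoder layer are assumptions for illustration, not the actual implementation.

```python
import torch.nn as nn
# BiRWKVBlock refers to the bidirectional block sketched in the Architecture Overview above.

class I3HybridEncoder(nn.Module):
    """Illustrative two-tier encoder: Bi-RWKV blocks below, full attention above."""
    def __init__(self, dim=768, n_heads=12, n_rwkv=4, n_attn=4):
        super().__init__()
        # Bottom tier: linear-time bidirectional RWKV blocks for local context.
        self.rwkv_layers = nn.ModuleList(BiRWKVBlock(dim) for _ in range(n_rwkv))
        # Top tier: standard O(T^2) multi-head attention blocks for global reasoning.
        self.attn_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(n_attn)
        )

    def forward(self, x, key_padding_mask=None):
        for blk in self.rwkv_layers:
            x = blk(x)                                          # O(T) per layer
        for blk in self.attn_layers:
            x = blk(x, src_key_padding_mask=key_padding_mask)   # O(T^2) per layer
        return x
```

Changing `n_rwkv` and `n_attn` adjusts the balance between the two tiers, which is what the configurable 4:4 ratio above refers to.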
## Training Progress

Representative loss values from pre-training (total loss, MLM term, NSP term):

```
Iter 0 | Loss: 11.2089 | MLM: 10.4452 | NSP: 0.7637
...
Iter 4990 | Loss: 0.1881 | MLM: 0.1489 | NSP: 0.0392
```
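
The `Loss` column in the log above is the sum of the MLM and NSP terms (e.g. 10.4452 + 0.7637 = 11.2089). Below is a minimal sketch of how such a combined BERT-style objective is typically computed; the function name, tensor shapes, and the `-100` ignore-index convention are assumptions rather than the repository's exact code.

```python
import torch.nn.functional as F

def bert_pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Combined BERT-style objective: masked-token cross-entropy + next-sentence cross-entropy.

    mlm_logits: (B, T, vocab_size) predictions at every position
    mlm_labels: (B, T) original token ids at masked positions, -100 elsewhere
    nsp_logits: (B, 2) is-next / not-next scores for each sequence pair
    nsp_labels: (B,) 0 or 1
    """
    # Positions labelled -100 (unmasked tokens) are ignored by cross_entropy.
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss, mlm_loss, nsp_loss
```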