---
library_name: transformers
tags: []
---
# FW Tiny

This is the tiny version of the bilinear transformers trained on FineWeb-edu.
The primary purpose of this model is interpretability; most design choices were made with that in mind.

The code to run this custom model can be found [here](https://github.com/tdooms/bilinear-decomposition), along with many utility functions for weight-based interpretability.
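
For reference, a minimal loading sketch using the `transformers` auto classes. The repo id below is a placeholder for this model's actual Hub id, and whether the custom architecture loads via `trust_remote_code=True` or instead requires installing the linked repository depends on how the checkpoint is packaged; see the repository above for the supported path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "user/fw-tiny"  # placeholder: substitute this model's actual Hub id

# trust_remote_code lets transformers load the custom bilinear architecture,
# assuming the modeling code is bundled with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

inputs = tokenizer("Photosynthesis is the process by which", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```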

## Model Details
- 125 million parameters
- 8 layers
- 12 attention heads
- model dimension 768
- bilinear MLP with expansion factor 4 (see the sketch after this list)
- context length of 512
- trained for 16B tokens
- rotary positional embedding
- Mixtral [tokenizer](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1)
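
For intuition, here is a sketch of what a bilinear MLP computes: the elementwise product of two linear projections followed by a down-projection, i.e. a gated linear unit with the nonlinearity removed. The layer names and the bias-free convention are illustrative assumptions; the exact implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    """Sketch of a bilinear MLP block: out = P((W x) * (V x)).

    With model dimension 768 and expansion factor 4, the hidden width is 3072.
    Names and the bias-free convention are assumptions for illustration.
    """

    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.p = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The elementwise product of the two projections replaces the usual
        # nonlinearity, keeping the block purely quadratic in its input.
        return self.p(self.w(x) * self.v(x))
```

Because the block is quadratic in its input, pairwise interactions between input directions can be read directly off the weight tensors, which is the property the weight-based interpretability tooling in the linked repository exploits.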