|
|
--- |
|
|
library_name: transformers |
|
|
tags: [] |
|
|
--- |
|
|
# FW Medium |
|
|
|
|
|
This is the medium-sized model in the series of bilinear transformers trained on FineWeb-Edu.
|
|
The primary purpose of this model is interpretability; most design choices were made with that in mind.
|
|
|
|
|
The code to run this custom model can be found [here](https://github.com/tdooms/bilinear-decomposition), along with many utility functions for weight-based interpretability. |
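As a minimal loading sketch, the model can presumably be used through `transformers` since the custom architecture is distributed with the checkpoint. The repo ID `tdooms/fw-medium` below is a placeholder assumption, not confirmed by this card; substitute the actual Hub repo ID.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "tdooms/fw-medium" is a placeholder repo ID; trust_remote_code loads
# the custom bilinear architecture shipped alongside the checkpoint.
model = AutoModelForCausalLM.from_pretrained("tdooms/fw-medium", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tdooms/fw-medium")  # Mixtral tokenizer

# Generate a short continuation (the context length is 512 tokens).
inputs = tokenizer("Interpretability research aims to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```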
|
|
|
|
|
## Model Details |
|
|
- 335 million parameters
- 16 layers
- 16 attention heads
- model dimension of 1024
- bilinear MLP with an expansion factor of 4 (see the sketch below)
- context length of 512
- trained on 32B tokens
- rotary positional embeddings
- Mixtral [tokenizer](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1)
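
A bilinear MLP replaces the usual elementwise nonlinearity with the product of two linear projections, which is what makes the layer amenable to weight-based interpretability. The sketch below illustrates the idea under stated assumptions (no biases, illustrative variable names; the linked repository's exact implementation may differ). With a model dimension of 1024 and an expansion factor of 4, the hidden size is 4096.

```python
import torch
from torch import nn

class BilinearMLP(nn.Module):
    """Sketch of a bilinear MLP: out = P((W x) * (V x)).

    There is no elementwise activation function; the only nonlinearity
    is the product of two linear maps, so the layer can be analyzed
    directly from its weights.
    """
    def __init__(self, d_model: int = 1024, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model  # 4096 for this model
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.w(x) * self.v(x))
```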