---
library_name: transformers
tags: []
---

# FW Medium

This is the medium version of the bilinear transformers trained on FineWeb-edu. The primary purpose of this model is interpretability; most design choices were made with that in mind.

The code to run this custom model can be found [here](https://github.com/tdooms/bilinear-decomposition), along with many utility functions for weight-based interpretability. A minimal loading sketch is given below.

## Model Details

- 335 million parameters
- 16 layers
- 16 attention heads
- model dimension of 1024
- bilinear MLP with an expansion factor of 4 (see the sketch below)
- context length of 512
- trained on 32B tokens
- rotary positional embeddings
- Mixtral [tokenizer](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1)
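As a rough sketch of the bilinear MLP, assuming the standard GLU-style bilinear formulation (the authoritative implementation lives in the repository linked above), the layer replaces the usual activation function with an elementwise product of two linear projections; bias terms are omitted here as an assumption:

```python
import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    """Sketch of a bilinear MLP: a gated linear unit without a nonlinearity.

    The hidden state is the elementwise product of two linear maps of the
    input, so the layer is a pure bilinear form in x. Dimensions follow the
    model details above (d_model=1024, expansion factor 4).
    """

    def __init__(self, d_model: int = 1024, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model  # 4096 for this model
        # Bias-free projections are an assumption of this sketch.
        self.w = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No activation function: the elementwise product of two linear
        # maps replaces it, keeping the layer bilinear in the input.
        return self.proj(self.w(x) * self.v(x))
```

Because the layer is bilinear in its input, its behavior can be analyzed directly from the weights, which is what makes this architecture attractive for weight-based interpretability.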
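A minimal loading sketch, assuming the model is hosted on the Hugging Face Hub together with its custom modeling code (the repo id below is hypothetical; substitute the actual one):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with the actual Hub id of this model.
name = "tdooms/fw-medium"

# trust_remote_code is needed because the architecture is custom.
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# The model uses the Mixtral tokenizer linked in the model details.
tokenizer = AutoTokenizer.from_pretrained("mistral-community/Mixtral-8x22B-v0.1")

inputs = tokenizer("The purpose of interpretability is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```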