license: mit
---
DCFormer-2.8B is a language model pretrained on the Pile for 300B tokens. DCFormer is a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. DCFormer-2.8B is short for DCFormer++2.8B; please see downstream evaluations and more details in the paper [Improving Transformers with Dynamically Composable Multi-Head Attention](https://arxiv.org/abs/2405.08553). In addition, we open-source the JAX training code on [GitHub](https://github.com/Caiyun-AI/DCFormer/).
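To give a feel for the idea, here is a conceptual sketch of input-dependent head composition: per-token, the outputs of all attention heads are linearly mixed by a matrix predicted from the hidden state. This is an illustration only, not the paper's exact DCMHA formulation (which also composes attention scores); the class name, the low-rank factorization, and all shapes are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class DynamicHeadComposition(nn.Module):
    """Illustrative sketch: dynamically mix attention-head outputs.

    Not the paper's exact compose function -- a toy stand-in showing the
    core idea of an input-dependent, per-token head-mixing matrix.
    """

    def __init__(self, num_heads, head_dim, rank=2):
        super().__init__()
        d_model = num_heads * head_dim
        self.num_heads = num_heads
        # Low-rank projection keeps the extra parameters/compute small.
        self.down = nn.Linear(d_model, rank)
        self.up = nn.Linear(rank, num_heads * num_heads)

    def forward(self, head_out, hidden):
        # head_out: (batch, seq, num_heads, head_dim) -- per-head outputs
        # hidden:   (batch, seq, d_model)             -- drives the mixing
        b, s, h, d = head_out.shape
        mix = self.up(torch.tanh(self.down(hidden)))  # (b, s, h*h)
        mix = mix.view(b, s, h, h)
        # Each output head is a per-token combination of all input heads.
        return torch.einsum("bsij,bsjd->bsid", mix, head_out)

comp = DynamicHeadComposition(num_heads=4, head_dim=8)
head_out = torch.randn(2, 5, 4, 8)   # per-head attention outputs
hidden = torch.randn(2, 5, 32)       # hidden states (d_model = 4 * 8)
out = comp(head_out, hidden)         # same shape as head_out: (2, 5, 4, 8)
```

Because the mixing matrix depends on the input token, different tokens can emphasize different combinations of heads, which is the source of the added expressive power described above.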
We recommend the <strong>compiled version</strong> of DCFormer with *torch.compile* for inference acceleration; please refer to the Generation section for the compile implementation.