kd13/RoPERT-MLM-small

kd13 committed 28 days ago

Commit e50c1e7 · verified · 1 parent: 664d6f8

Update README.md

README.md +2 -0

@@ -26,6 +26,8 @@ A compact BERT-style masked language model trained entirely from scratch on Book
 **Embedding tying.** The MLM decoder projection matrix shares weights with the token embedding table, which reduces parameter count and typically improves token prediction quality.
+**SwiGLU activation.** The feed-forward network uses SwiGLU, a gated linear unit (GLU) whose gate branch passes through the Swish activation, in place of the usual single activation between the two projections. The gating gives a more expressive non-linearity, which typically improves model quality and can help training stability.
 ---
 ## Training Details
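
The two architecture notes in this hunk (embedding tying and SwiGLU) can be illustrated with a minimal, dependency-free sketch. It is not taken from this repository: the function names (`swiglu_ffn`, `embed`, `mlm_logits`), the bias-free three-matrix layout, and the toy sizes are all assumptions chosen for illustration, and the model's actual implementation may differ.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x: float) -> float:
    """Swish with beta = 1 (also called SiLU): x * sigmoid(x)."""
    return x * sigmoid(x)


def matvec(w, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: W_down @ (silu(W_gate @ x) * (W_up @ x)).

    Assumed layout: no biases, with separate gate and up projections.
    """
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)


def embed(table, token_id):
    """Input side of a tied embedding: look up one row of the table."""
    return table[token_id]


def mlm_logits(table, hidden):
    """Output side of a tied embedding: reuse the same table as the
    decoder projection, so logits[i] = dot(table[i], hidden)."""
    return matvec(table, hidden)


# Toy example with hypothetical sizes: vocabulary 3, model width 2, FFN width 3.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

h = embed(embedding_table, 1)             # token id 1 -> its embedding row
h = swiglu_ffn(h, w_gate, w_up, w_down)   # one feed-forward pass
logits = mlm_logits(embedding_table, h)   # one score per vocabulary entry
```

Because `embed` and `mlm_logits` read the same `embedding_table`, the decoder adds no parameters of its own, which is where the parameter saving from tying comes from. Note that SwiGLU uses three weight matrices instead of two, so implementations often shrink the hidden width to keep the parameter count comparable; whether this model does so is not stated here.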