kd13 committed on
Commit e50c1e7 · verified · 1 parent: 664d6f8

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -26,6 +26,8 @@ A compact BERT-style masked language model trained entirely from scratch on Book
26
 
27
  **Embedding tying.** The MLM decoder projection matrix shares weights with the token embedding table, which reduces parameter count and typically improves token prediction quality.
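
As a rough illustration of embedding tying, the sketch below reuses one table for both the token lookup and the decoder projection. It is a toy in plain Python, not this model's code: the class name `TiedMLMHead`, its methods, and the tiny 3-token vocabulary are all hypothetical.

```python
class TiedMLMHead:
    """Toy embedding table whose rows double as the MLM decoder weights."""

    def __init__(self, embedding):
        # embedding[token_id] is that token's vector (vocab_size x hidden_size).
        self.embedding = embedding

    def embed(self, token_id):
        # Input side: look up the vector for a token id.
        return self.embedding[token_id]

    def logits(self, hidden):
        # Output side: one dot product per vocabulary row, using the same
        # table as embed(), so no separate decoder matrix is stored.
        return [sum(e * h for e, h in zip(row, hidden)) for row in self.embedding]

    def num_weights(self):
        # Only the shared table is counted: vocab_size * hidden_size.
        return sum(len(row) for row in self.embedding)


# Hypothetical toy values: vocabulary of 3 tokens, hidden size 2.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = TiedMLMHead(table)
scores = head.logits(head.embed(2))
```

Because both directions read the same table, an untied head of the same shape would store twice as many weights, and any update to a token's embedding also changes that token's output score.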
28
 
29
+ **SwiGLU activation.** The feed-forward block uses SwiGLU, a gated linear unit (GLU) variant that applies Swish (SiLU) to the gate branch, in place of a standard single-activation feed-forward layer. The gating gives the block a more expressive non-linearity, which is commonly reported to improve model quality and may help training stability.
30
+
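
The SwiGLU feed-forward computation can be sketched as follows. This is a minimal plain-Python version under stated assumptions: biases are omitted, and the names `swiglu_ffn`, `w_gate`, `w_up`, `w_down` and the toy sizes are illustrative, not taken from this model's code.

```python
import math


def silu(x: float) -> float:
    """Swish with beta = 1 (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Compute w_down @ (silu(w_gate @ x) * (w_up @ x)), without biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]   # Swish-activated gate branch
    up = matvec(w_up, x)                          # linear "value" branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(w_down, hidden)                 # project back to model width


# Hypothetical toy sizes: model width 2, hidden width 3.
x = [1.0, -2.0]
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

Compared with a standard two-matrix feed-forward layer, this form has three weight matrices, so implementations often shrink the hidden width to keep the parameter count comparable.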
31
  ---
32
 
33
  ## Training Details