throbbey/crate-d12-base


throbbey committed on Mar 7

Commit 1e35a07 · verified · 1 Parent(s): bd7abe2

Update README.md

Files changed (1)

README.md +1 -1


@@ -46,7 +46,7 @@ those of standard transformers.
 The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
 as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
 This is the theoretically "correct" proximal operator for L1-regularized sparse
-coding, but it caused training instability at scale.
 CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

 The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
 as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
 This is the theoretically "correct" proximal operator for L1-regularized sparse
+coding, but it caused training instability at scale. The git repo has options to use either.
 CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
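
For reference, the soft-thresholding formula quoted in the hunk, `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`, can be sketched in plain Python as below. This is only an illustration of the formula as written in the README text. The function names `soft_threshold` and `soft_threshold_all` are hypothetical and are not taken from the repository, whose actual implementation and option names are not shown in this diff.

```python
from typing import List


def soft_threshold(x: float, lam: float) -> float:
    """Scalar soft-thresholding: sign(x) * max(|x| - lam, 0).

    Hypothetical helper for illustration; assumes lam >= 0.
    """
    if x > lam:
        return x - lam   # positive values shrink toward zero by lam
    if x < -lam:
        return x + lam   # negative values shrink toward zero by lam
    return 0.0           # anything with |x| <= lam is zeroed out


def soft_threshold_all(values: List[float], lam: float) -> List[float]:
    """Apply soft_threshold elementwise to a list of floats."""
    return [soft_threshold(v, lam) for v in values]


if __name__ == "__main__":
    print(soft_threshold_all([1.5, -2.0, 0.25, -0.5], 0.5))
```

Entries whose magnitude is at most `lambda` come out exactly zero, which is what makes the operator a sparse activation; the remaining entries are shrunk toward zero by `lambda`.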