Update README.md
README.md
CHANGED
@@ -46,7 +46,7 @@ those of standard transformers.
 The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
 as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
 This is the theoretically "correct" proximal operator for L1-regularized sparse
-coding, but it caused training instability at scale.
+coding, but it caused training instability at scale. The git repo has options to use either.
 
 CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
 
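Note on the soft-thresholding formula quoted in the changed README text: a minimal pure-Python sketch of `S_lambda(x) = sign(x) * max(|x| - lambda, 0)` is given here for reference. The names `soft_threshold`, `soft_threshold_all`, and `lam` are hypothetical and chosen only for illustration; this is not the repository's implementation, and it does not show how the repo switches between activations.

```python
from typing import List


def soft_threshold(x: float, lam: float) -> float:
    """Soft-thresholding: sign(x) * max(|x| - lam, 0).

    Values with |x| <= lam are set to zero; larger values are
    shrunk toward zero by lam. `lam` is assumed to be non-negative.
    """
    magnitude = max(abs(x) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # avoid returning -0.0 for small negative inputs
    return magnitude if x > 0 else -magnitude


def soft_threshold_all(xs: List[float], lam: float) -> List[float]:
    """Apply soft-thresholding elementwise to a list of floats."""
    return [soft_threshold(x, lam) for x in xs]


if __name__ == "__main__":
    print(soft_threshold_all([1.5, -1.5, 0.3, -0.2], 0.5))
```

With `lam = 0.5`, the inputs `0.3` and `-0.2` fall inside the threshold and become zero, while `1.5` and `-1.5` are shrunk by `0.5`. This zeroing of small entries is what makes the operator a sparse activation.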