Clarification on SwiGLU Implementation and intermediate_size Configuration
Hello! I’m working with the RNABert model and hit a dimension mismatch error when using the SwiGLU activation function. I believe the root cause is an ambiguity in how intermediate_size is defined in the config file when SwiGLU is enabled. Here’s a detailed breakdown:
Problem Description
Model Structure
- RnaBertIntermediate.dense: Linear(hidden_size=1280 -> intermediate_size=3392)
- Activation: SwiGLU splits the output into two chunks (3392 → 1696 + 1696), processes them, and returns a tensor of shape [..., 1696].
- RnaBertOutput.dense: Expects input dim 3392 but receives 1696, causing a matrix multiplication error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (24576x1696 and 3392x1280)
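To make the shape flow above concrete, here is a minimal NumPy sketch that reproduces the mismatch outside the model. It is not the library's code: the `swiglu` helper, the choice of which half acts as the gate, the weight names, and the small token count (8 instead of 24576) are all my assumptions for illustration.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x):
    # Assumed chunk-style SwiGLU: split the last axis in half and
    # gate one half with the other (which half gates is an assumption).
    a, b = np.split(x, 2, axis=-1)
    return silu(a) * b

hidden_size, intermediate_size = 1280, 3392
tokens = 8  # small stand-in for the 24576 rows in the traceback

rng = np.random.default_rng(0)
x = rng.standard_normal((tokens, hidden_size))
# Stand-ins for RnaBertIntermediate.dense and RnaBertOutput.dense weights
w_intermediate = rng.standard_normal((hidden_size, intermediate_size))
w_output = rng.standard_normal((intermediate_size, hidden_size))

h = swiglu(x @ w_intermediate)
print(h.shape)  # last axis is halved: 3392 -> 1696

try:
    h @ w_output  # (8, 1696) @ (3392, 1280)
    mismatch = False
except ValueError as err:
    mismatch = True
    print("shape mismatch:", err)
```

NumPy raises a ValueError here where PyTorch raises the RuntimeError quoted above, but the cause is the same: the activation halves the last axis while the output projection still expects intermediate_size inputs.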
Configuration Ambiguity
- The config parameter intermediate_size=3392 seems to conflict with SwiGLU’s design.
- For SwiGLU, intermediate_size should represent the post-activation dimension (e.g., 1696), while the pre-activation layer should output 2 * intermediate_size = 3392.
Proposed Fixes
To resolve this, either:
- Option 1: Keep intermediate_size=3392 in the config but modify RnaBertIntermediate.dense to output 2 * 3392 = 6784, then split into 3392 + 3392 for SwiGLU.
- Option 2: Set intermediate_size=1696 in the config, and let RnaBertIntermediate.dense output 2 * 1696 = 3392 (current code behavior), compatible with RnaBertOutput.dense.
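The arithmetic behind the two options can be checked with a small pure-Python sketch. The `FfnDims` type and both helper functions are hypothetical names I made up for this post, and the sketch assumes that under either option RnaBertOutput.dense takes intermediate_size inputs.

```python
from typing import NamedTuple

class FfnDims(NamedTuple):
    up_out: int    # output dim of RnaBertIntermediate.dense
    post_act: int  # dim after SwiGLU halves the last axis
    down_in: int   # input dim expected by RnaBertOutput.dense

    @property
    def consistent(self) -> bool:
        # The block only works if the activation output fits the down projection.
        return self.post_act == self.down_in

def reported_behavior(intermediate_size: int) -> FfnDims:
    # What I observe: the up projection outputs intermediate_size,
    # SwiGLU halves it, and the down projection still expects intermediate_size.
    return FfnDims(intermediate_size, intermediate_size // 2, intermediate_size)

def post_activation_convention(intermediate_size: int) -> FfnDims:
    # Assumed fix: intermediate_size names the post-activation width,
    # so the up projection outputs twice that.
    return FfnDims(2 * intermediate_size, intermediate_size, intermediate_size)

reported = reported_behavior(3392)
option_1 = post_activation_convention(3392)  # keep config, change code
option_2 = post_activation_convention(1696)  # change config value

for name, dims in [("reported", reported), ("option 1", option_1), ("option 2", option_2)]:
    print(name, dims, "consistent:", dims.consistent)
```

Under these assumptions both options use the same convention and differ only in the config value: Option 1 yields a 6784-wide up projection, while Option 2 keeps the up projection at 3392 and narrows the input of the output projection to 1696.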
Question:
Could you clarify the intended definition of intermediate_size when using SwiGLU? Should it represent the pre-activation dimension (e.g., 3392, requiring code adjustments) or the post-activation dimension (e.g., 1696, requiring config adjustments)?
Thank you for your guidance!
P.S. Because I was afraid I could not describe the key problem clearly myself, this question was written by DeepSeek-R1. Sorry.