In the next frame we have `Dropout`, which rescales the remaining elements after zeroing some of them. That pushes the absolute max value past 64K, and we get an overflow (`inf`).

As you can see, it's the previous frames that we need to look into, since that is where the numbers started becoming too large for fp16.
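To see concretely how tight the fp16 range is, here is a quick standalone check (a sketch using NumPy's `float16`, not the model code itself):

```python
import numpy as np

# fp16 tops out at 65504, i.e. roughly 64K
print(np.finfo(np.float16).max)  # 65504.0

# once a product exceeds that range, the result overflows to inf
x = np.float16(300.0)
print(x * x)  # inf -- 90000 does not fit in fp16
```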
Let's match the report to the code from `models/t5/modeling_t5.py`:

```python
class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states
```
Now it's easy to see the `dropout` call, and all the preceding calls as well. Since the detection happens in a forward hook, these reports are printed immediately after each `forward` returns.
Going back to the full report: to act on it and fix the problem, we need to go a few frames up, to where the numbers started growing, and most likely switch to fp32 mode there, so that the numbers don't overflow when multiplied or summed up.
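One way to sketch that fix (function names hypothetical, with NumPy standing in for torch tensors): perform the risky multiplication after upcasting the operands to fp32, so the intermediate product can no longer overflow the fp16 range.

```python
import numpy as np

def gated_product_fp16(gelu_out, linear_out):
    # naive fp16 multiply -- overflows to inf once values exceed ~64K
    return gelu_out * linear_out

def gated_product_fp32(gelu_out, linear_out):
    # upcast the operands to fp32 before multiplying, so the
    # intermediate product cannot overflow the fp16 range
    return np.float32(gelu_out) * np.float32(linear_out)

g = np.full(4, 300.0, dtype=np.float16)
l = np.full(4, 300.0, dtype=np.float16)
print(gated_product_fp16(g, l))  # all elements overflow to inf
print(gated_product_fp32(g, l))  # finite: 90000.0 fits comfortably in fp32
```

In the actual model you would apply the same idea with `torch.Tensor.float()` (or by disabling autocast locally) around the multiplication, at the cost of some extra memory for the fp32 intermediates.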