feat: selective activation checkpointing

#16

by Markus28 - opened Mar 25, 2024

base: refs/heads/main ← from: refs/pr/16

+33 -9

This PR is in draft mode

Markus28

Mar 25, 2024

edited Mar 25, 2024

This PR hasn't been tested yet

This PR adds selective activation checkpointing to the BERT model.
The activation_checkpoint_lvl config option sets how many of the BERT layers are checkpointed once gradient_checkpointing_enable() is called; until then, no checkpointing takes place. Lowering the value saves recomputation in the backward pass at the cost of increased VRAM usage.

By default, the value is 100, which exceeds the layer count of any reasonable architecture, so all layers are checkpointed. For the base model, setting it to something like 6 would checkpoint half of the layers.
We enforce that MLP checkpointing cannot occur within a checkpointed layer.
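
A minimal sketch of the selection logic described above, not the code from this PR. It assumes the first activation_checkpoint_lvl layers are the ones checkpointed (the description does not say which layers are picked), and the names plan_checkpointing, LayerPlan, and the mlp_checkpointing flag are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int
    checkpoint_layer: bool  # recompute the whole layer in the backward pass
    checkpoint_mlp: bool    # recompute only the MLP block of this layer


def plan_checkpointing(num_layers, activation_checkpoint_lvl=100,
                       gradient_checkpointing=False, mlp_checkpointing=False):
    """Decide, per layer, what would be checkpointed (hypothetical helper).

    Assumption: the first `activation_checkpoint_lvl` layers are selected.
    """
    plans = []
    for i in range(num_layers):
        # Nothing is checkpointed until gradient checkpointing is enabled.
        whole_layer = gradient_checkpointing and i < activation_checkpoint_lvl
        # MLP checkpointing must not occur inside a checkpointed layer.
        mlp_only = mlp_checkpointing and not whole_layer
        plans.append(LayerPlan(i, whole_layer, mlp_only))
    return plans


if __name__ == "__main__":
    for plan in plan_checkpointing(12, activation_checkpoint_lvl=6,
                                   gradient_checkpointing=True,
                                   mlp_checkpointing=True):
        print(plan)
```

With 12 layers and a level of 6, this plan checkpoints six layers as a whole and leaves MLP-only checkpointing to the remaining six; with the default of 100, every layer is checkpointed as soon as gradient checkpointing is enabled.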

For pretraining, I think it would make sense to set this parameter to 0. Nothing should happen before gradient_checkpointing_enable() is called anyway, but better safe than sorry.

feat: added selective activation checkpointing (b5634693)

Set activation_checkpoint_lvl to 100 by default (535ad9a4)
