Distilled with the Distily library, using teacher model gpt2 on the wikimedia/wikipedia dataset.
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-5): 6 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2FlashAttention2(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
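The parameter counts implied by the printed shapes can be verified by hand. A minimal sketch in pure Python, using only the dimensions shown above (the lm_head is weight-tied to wte, so it adds no new parameters):

```python
# Parameter counts derived from the module shapes printed above.
n_embd, n_vocab, n_ctx = 768, 50257, 1024

embeddings = n_vocab * n_embd + n_ctx * n_embd          # wte + wpe

# One GPT2Block: two LayerNorms, attention (c_attn, c_proj),
# and the MLP (c_fc, c_proj); each Conv1D carries weight + bias.
layer_norm = 2 * n_embd                                  # weight + bias
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
block = 2 * layer_norm + attn + mlp

# lm_head is tied to wte, so only ln_f is added after the block stack.
teacher = embeddings + 12 * block + layer_norm
student = embeddings + 6 * block + layer_norm

print(f"teacher: {teacher:,}")   # teacher: 124,439,808
print(f"student: {student:,}")   # student: 81,912,576
```

Halving the block stack from 12 to 6 layers removes about 42.5M parameters; the shared embeddings and tied head account for the rest.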
GPT2LMHeadModel -> GPT2LMHeadModel

--- teacher model modules
+++ student model modules
@@ -4,7 +4,7 @@
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
- (0-11): 12 x GPT2Block(
+ (0-5): 6 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2FlashAttention2(
(c_attn): Conv1D()
Trained on 525,568,142 tokens from the wikimedia/wikipedia dataset.
This corresponds to 998,000 rows of the 20231101.en subset, train split.

DistillationObjective(
logits_loss_component=LossComponent(
weight=1,
loss_fn='kl'
),
hs_loss_component=LossComponent(
weight=0
),
attn_loss_component=LossComponent(
weight=5.0,
loss_fn='raw_mse',
layer_mapper='layer-2',
norm='layernorm_teacher_only_affine',
projector='orthogonal'
)
)
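The objective above sums a KL term on the logits (weight 1) with a raw-MSE term on attention maps (weight 5). A minimal numpy sketch of that weighted sum; the function names are illustrative, not Distily's API, the KL direction (teacher to student) is an assumption, and the 'layer-2' mapping and orthogonal projector are applied before the tensors reach this point:

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def distillation_loss(t_logits, s_logits, t_attn, s_attn,
                      logits_weight=1.0, attn_weight=5.0):
    # KL(teacher || student) over the vocabulary, averaged over positions.
    log_p, log_q = log_softmax(t_logits), log_softmax(s_logits)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean()
    # Raw MSE between (already mapped/projected) attention tensors.
    mse = ((t_attn - s_attn) ** 2).mean()
    return logits_weight * kl + attn_weight * mse

# Identical student and teacher outputs give zero loss.
logits, attn = np.zeros((4, 16)), np.zeros((2, 4, 4))
print(distillation_loss(logits, logits, attn, attn))  # 0.0
```

With hs_loss_component at weight 0, the hidden-state term drops out entirely, which is why only the logit and attention terms appear here.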
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
- distillation_objective: DistillationObjective( logits_loss_component=LossComponent( weight=1, loss_fn='kl' ), hs_loss_component=LossComponent( weight=0 ), attn_loss_component=LossComponent( weight=5.0, loss_fn='raw_mse', layer_mapper='layer-2', norm='layernorm_teacher_only_affine', projector='orthogonal' ) )
- lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x7fc16d843f40>
- student_model_name_or_path: None
- student_config_name_or_path: distilbert/distilgpt2
- student_model_config: None
- reinitialize_weights: None
- copy_teacher_modules: [('lm_head', False)]
- student_model_as_bitnet: False
- teacher_model_name_or_path: gpt2
- teacher_load_in_8bit: False
- teacher_load_in_4bit: False
- dataset_uri: wikimedia/wikipedia
- dataset_subset: 20231101.en
- dataset_split: train
- dataset_column_name: text
- dataset_sample_size: 1000000
- dataset_test_size: 0.002
- dataset_shuffle: False
- dataset_shuffle_seed: 42
- dataset_trust_remote_code: False
- gradient_accumulation_steps: 1
- weight_decay: 0.0
- max_grad_norm: 1.0
- warmup_ratio: 0.0
- warmup_steps: 0
- gradient_checkpointing: True

Base model: openai-community/gpt2
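The student's shape can be rebuilt locally without downloading any weights. A minimal sketch, assuming the transformers library is installed; GPT2Config's defaults already match the printed dimensions (n_embd=768, n_head=12, n_positions=1024, vocab_size=50257), so only n_layer changes:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Halve the block stack to match the 6-layer student printout;
# all other dimensions are GPT2Config defaults.
config = GPT2Config(n_layer=6)
model = GPT2LMHeadModel(config)

print(model.config.n_layer)    # 6
print(model.num_parameters())  # 81912576 (tied lm_head counted once)
```

This instantiates a randomly initialized model of the distilled architecture; the trained weights themselves must still be loaded from the model repository.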