---
base_model: gpt2
datasets:
- wikimedia/wikipedia
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_modelcard_try
  results: []
---

# Summary

Distilled with the [Distily](https://github.com/lapp0/distily) library, using teacher model [gpt2](https://huggingface.co/gpt2) on the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

# Model Architecture
- **Architecture**: `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808
- **Data Type (dtype)**: torch.bfloat16
- **Model Size**: 0.24 GB

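As a quick sanity check, the figures above can be reproduced by loading the model with `transformers`. A minimal sketch; the repo id below is a placeholder taken from this card's model name, and the actual Hub path may differ:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder repo id (this card's model name); substitute the real Hub path.
model = AutoModelForCausalLM.from_pretrained(
    "distily_modelcard_try",
    torch_dtype=torch.bfloat16,
)

print(type(model).__name__)           # GPT2LMHeadModel
print(f"{model.num_parameters():,}")  # 124,439,808
print(model.dtype)                    # torch.bfloat16
```
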
# Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 949187772416.0 | 76416058130432.0 | 21.75 | 0.1221 | 16.381 | 8.191 | 3556769792.0 | 13950053777408.0 |
| 20 | 1.0 | 13248.0 | 64000.0 | 5.6562 | 0.0646 | 30.969 | 15.485 | 7712.0 | 181248.0 |

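The `*ppl` columns are perplexities on the respective evaluation sets. The exact evaluation harness is not shown in this card, but a generic perplexity computation for a causal LM looks like the sketch below (repo id again a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("distily_modelcard_try")  # placeholder
model.eval()

text = "Paris is the capital of France."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy
    # over (shifted) next-token predictions.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {torch.exp(loss).item():.2f}")
```
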
# Resource Usage Comparison

- VRAM Use: 7.9388 GB

# Distillation (Teacher -> Student) Architecture Difference

- **Architecture**: `GPT2LMHeadModel` -> `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808 -> 124,439,808
- **Data Type (dtype)**: torch.int8 (teacher loaded in 8-bit) -> torch.bfloat16
- **Model Size**: 0.16 GB -> 0.24 GB

<details>
<summary>Module Diff Details</summary>

```diff
--- teacher model modules
+++ student model modules
@@ -7,15 +7,15 @@
     (0-11): 12 x GPT2Block(
       (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
       (attn): GPT2FlashAttention2(
-        (c_attn): Linear8bitLt(in_features=768, out_features=2304, bias=True)
-        (c_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
+        (c_attn): Conv1D()
+        (c_proj): Conv1D()
         (attn_dropout): Dropout(p=0.1, inplace=False)
         (resid_dropout): Dropout(p=0.1, inplace=False)
       )
       (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
       (mlp): GPT2MLP(
-        (c_fc): Linear8bitLt(in_features=768, out_features=3072, bias=True)
-        (c_proj): Linear8bitLt(in_features=3072, out_features=768, bias=True)
+        (c_fc): Conv1D()
+        (c_proj): Conv1D()
         (act): NewGELUActivation()
         (dropout): Dropout(p=0.1, inplace=False)
       )
```

</details>
<br/>

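The `Linear8bitLt` modules on the teacher side come from loading it with `teacher_load_in_8bit: True`: bitsandbytes replaces GPT-2's `Conv1D` projections with 8-bit linear layers. A minimal sketch of an equivalent load:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Loading gpt2 in 8-bit; bitsandbytes swaps the Conv1D projections
# for Linear8bitLt, matching the teacher side of the diff above.
teacher = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
print(teacher.transformer.h[0].attn.c_attn)  # Linear8bitLt(...)
```
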
# Train Dataset
Trained on 149,632 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

- Num Samples: `158`
- Subset: `20231101.en`
- Split: `train`

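The subset and split above can be pulled with the `datasets` library; streaming avoids downloading the full snapshot. (158 train samples is consistent with `dataset_sample_size: 160` minus the 1% test split listed under Hyperparameters.)

```python
from datasets import load_dataset

# Same subset/split as above: English Wikipedia, 2023-11-01 snapshot.
ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

for example in ds.take(2):
    print(example["text"][:120])  # the `text` column is what training consumed
```
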
# Training Objective

```
DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))
```

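The objective is a single logits-level loss with weight 1 and `loss_fn=kl`. Below is a minimal sketch of a KL logits loss of this kind; Distily's exact reduction and any temperature handling may differ:

```python
import torch.nn.functional as F

def kl_logits_loss(student_logits, teacher_logits, temperature=1.0):
    # KL(teacher || student) over the vocabulary, averaged per batch:
    # the student provides log-probs, the teacher provides probs.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
```
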
# Hyperparameters
The following hyperparameters were used during training:

<details>
<summary>Expand</summary>

- learning_rate: `0.0001`
- train_batch_size: `8`
- eval_batch_size: `8`
- seed: `42`
- optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
- lr_scheduler_type: `constant`
- lr_scheduler_warmup_ratio: `0.2`
- num_epochs: `1.0`
- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))`
- train_embeddings: `True`
- lr_scheduler: `torch.optim.lr_scheduler.LambdaLR`
- student_model_name_or_path: `None`
- student_config_name_or_path: `None`
- student_model_config: `None`
- reinitialize_weights: `None`
- copy_teacher_modules: `[('lm_head', False)]`
- student_model_as_bitnet: `False`
- student_model_compile: `False`
- dropout: `None`
- teacher_model_name_or_path: `gpt2`
- teacher_load_in_8bit: `True`
- teacher_load_in_4bit: `False`
- teacher_model_compile: `False`
- dataset_uri: `wikimedia/wikipedia`
- dataset_subset: `20231101.en`
- dataset_split: `train`
- dataset_column_name: `text`
- dataset_sample_size: `160`
- dataset_test_size: `0.01`
- gradient_accumulation_steps: `1`
- weight_decay: `0.0`
- max_grad_norm: `1.0`
- warmup_ratio: `0.2`
- warmup_steps: `0`
- gradient_checkpointing: `True`

</details>
<br/>

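For readers reproducing the run outside Distily, most of the hyperparameters above map directly onto `transformers.TrainingArguments`. This is a hypothetical mapping, not Distily's actual entry point:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters above onto TrainingArguments;
# Distily wraps the Trainer with its own distillation-specific options.
args = TrainingArguments(
    output_dir="distily_modelcard_try",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="constant",
    warmup_ratio=0.2,
    num_train_epochs=1.0,
    gradient_accumulation_steps=1,
    weight_decay=0.0,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,  # assumption, based on the student's torch.bfloat16 dtype
)
```
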
# Framework Versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0