---
base_model: gpt2
datasets:
- wikimedia/wikipedia
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_modelcard_try
  results: []
---


# Summary

This model was distilled with the [Distily](https://github.com/lapp0/distily) library,
using [gpt2](https://huggingface.co/gpt2) as the teacher model
and the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

# Model description

More information needed

# Intended uses & limitations

More information needed
-->

# Model Architecture
- **Architecture**: `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808
- **Data Type (dtype)**: torch.bfloat16
- **Model Size**: 0.24 GB
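
As a rough cross-check of the figures above, the memory taken by the parameters alone can be estimated from the parameter count and the element width of the dtype. The helper name `parameter_bytes` and the dtype table are illustrative, not part of Distily. The card does not say whether "GB" is decimal or binary, or whether non-parameter buffers are counted, so this sketch is only expected to land close to the reported 0.24 GB.

```python
# Bytes per element for a few common dtypes. Plain strings are used so the
# sketch runs without PyTorch installed (an illustrative choice).
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def parameter_bytes(num_parameters: int, dtype: str) -> int:
    """Return the storage needed for the parameters alone, in bytes."""
    return num_parameters * BYTES_PER_ELEMENT[dtype]


total = parameter_bytes(124_439_808, "bfloat16")
print(total)                         # 248879616
print(f"{total / 1e9:.4f} GB (decimal)")
print(f"{total / 2**30:.4f} GiB (binary)")
```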


# Evaluation Metrics Comparison

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **teacher eval** |  | 43.75 | 61.75 |  |  |  |  | 11.8125 | 19.125 |
| 0 | 0 | 949187772416.0 | 76416058130432.0 | 21.75 | 0.1221 | 16.381 | 8.191 | 3556769792.0 | 13950053777408.0 |
| 20 | 1.0 | 13248.0 | 64000.0 | 5.6562 | 0.0646 | 30.969 | 15.485 | 7712.0 | 181248.0 |
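
The `*ppl` columns appear, from their names, to be perplexities on English, French, and Chinese Wikipedia and on TinyStories. Under the standard definition, perplexity is the exponential of the mean per-token negative log-likelihood, so the teacher's `enwikippl` of 43.75 would correspond to a mean loss of roughly 3.78 nats per token. The sketch below shows that definition; how Distily computes these columns internally is an assumption here.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log).

    Standard definition: exp(mean NLL). Whether Distily aggregates per
    token or per batch is not stated in the card.
    """
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that is uniformly unsure between 4 tokens at every position
# has an NLL of ln(4) per token, which gives a perplexity of 4.
uniform_nlls = [math.log(4)] * 3
print(perplexity(uniform_nlls))
```

Because of the exponential, the very large step-0 values in the table are what one would expect from an untrained student with a high initial loss.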


# Resource Usage Comparison

- VRAM Use: 7.9388 GB

# Distillation (Teacher -> Student) Architecture Difference

- **Architecture**: `GPT2LMHeadModel` -> `GPT2LMHeadModel`
- **Total Parameters**: 124,439,808 -> 124,439,808
- **Data Type (dtype)**: 8-bit quantized (`teacher_load_in_8bit: True`) -> torch.bfloat16
- **Model Size**: 0.16 GB -> 0.24 GB

<details>
<summary>Module Diff Details</summary>

```diff
--- teacher model modules
+++ student model modules
@@ -7,15 +7,15 @@
       (0-11): 12 x GPT2Block(
         (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
         (attn): GPT2FlashAttention2(
-          (c_attn): Linear8bitLt(in_features=768, out_features=2304, bias=True)
-          (c_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
+          (c_attn): Conv1D()
+          (c_proj): Conv1D()
           (attn_dropout): Dropout(p=0.1, inplace=False)
           (resid_dropout): Dropout(p=0.1, inplace=False)
         )
         (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
         (mlp): GPT2MLP(
-          (c_fc): Linear8bitLt(in_features=768, out_features=3072, bias=True)
-          (c_proj): Linear8bitLt(in_features=3072, out_features=768, bias=True)
+          (c_fc): Conv1D()
+          (c_proj): Conv1D()
           (act): NewGELUActivation()
           (dropout): Dropout(p=0.1, inplace=False)
         )

```

</details>
<br/>

# Train Dataset
Trained on 149,632 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

- Num Samples: `158`
- Subset: `20231101.en`
- Split: `train`


# Training Objective

```
DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))
```
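
The objective above applies a KL-divergence loss (`loss_fn=kl`) to the logits with weight 1. A minimal sketch of such a loss for a single token position follows, written in plain Python so it runs without PyTorch. The direction KL(teacher || student), the absence of a temperature, and the function names are assumptions; Distily's actual implementation may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position, in nats."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


# Identical distributions give zero loss; a mismatch gives a positive loss.
print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(kl_divergence([0.0, 0.0], [2.0, 0.0]))
```

In training, a loss of this kind would typically be averaged over all token positions in the batch before backpropagation.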

# Hyperparameters
The following hyperparameters were used during training:

<details>
<summary>Expand</summary>

- learning_rate: `0.0001`
- train_batch_size: `8`
- eval_batch_size: `8`
- seed: `42`
- optimizer: `Adam with betas=(0.9,0.999) and epsilon=1e-08`
- lr_scheduler_type: `constant`
- lr_scheduler_warmup_ratio: `0.2`
- num_epochs: `1.0`
- distillation_objective: `DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl))`
- train_embeddings: `True`
- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7f80845a7190>`
- student_model_name_or_path: `None`
- student_config_name_or_path: `None`
- student_model_config: `None`
- reinitialize_weights: `None`
- copy_teacher_modules: `[('lm_head', False)]`
- student_model_as_bitnet: `False`
- student_model_compile: `False`
- dropout: `None`
- teacher_model_name_or_path: `gpt2`
- teacher_load_in_8bit: `True`
- teacher_load_in_4bit: `False`
- teacher_model_compile: `False`
- dataset_uri: `wikimedia/wikipedia`
- dataset_subset: `20231101.en`
- dataset_split: `train`
- dataset_column_name: `text`
- dataset_sample_size: `160`
- dataset_test_size: `0.01`
- gradient_accumulation_steps: `1`
- weight_decay: `0.0`
- max_grad_norm: `1.0`
- warmup_ratio: `0.2`
- warmup_steps: `0`
- gradient_checkpointing: `True`

</details>
<br/>


# Framework Versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0