---
base_model:
- distilbert/distilbert-base-uncased
datasets:
- stanfordnlp/imdb
language:
- en
library_name: transformers
metrics:
- perplexity
---

# Model Card for DistilBERT-DeNiro

🍿🎥Welcome to the DistilBERT-DeNiro model card!🎞️📽️

We domain adapt (fine-tune) the DistilBERT base model [DistilBERT/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the IMDB movie reviews dataset for a whole word masked language modeling task.


## Model Details

### Model Description


The DistilBERT base model is fine-tuned using a custom PyTorch training loop. We supervise the training of DistilBERT for masked language modeling on [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb), an open source 
dataset of movie reviews from Stanford NLP available through the 🤗 hub. The reviews were concatenated into a single stream of tokens before being chunked and padded to a size of 256 tokens. 
We fed these chunks into the model during training with a batch size of 32. The resulting model is domain adapted to fill `[MASK]` tokens in an input string with terms and lingo common to movies. 


- **Developed by:** John Graham Reynolds
- **Funded by:** Vanderbilt University
- **Model type:** Masked Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** "DistilBERT/distilbert-base-uncased"

### Model Sources


- **Repository:** https://github.com/johngrahamreynolds/DistilBERT-DeNiro

## Uses


### Direct Use

In order to query the model effectively, one must pass it a string containing a `[MASK]` token to be filled, for example `text = "This is a great [MASK]!"`. 
The domain-adapted model will attempt to fill the mask with a token relevant to movies, cinema, TV, etc.

## How to Use and Query the Model

Use the code below to get started with the model. Users pass a `text` string containing a sentence with a `[MASK]` token, and the model provides options 
to fill the mask based on the sentence's context and its background knowledge. Note: the DistilBERT base model was trained on a very large, general corpus of text. 
In our training, we have fine-tuned the model on the large IMDB movie review dataset, so the model is now accustomed to filling `[MASK]` tokens with words related to 
the domain of movies/TV/film. To see the model's affinity for cinematic lingo, it is best to be deliberate in one's prompt engineering. Specifically, to make movie-related 
completions most likely, one should pass a masked `text` string that could plausibly appear in someone's review of a movie. See the example below:

``` python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForMaskedLM.from_pretrained("MarioBarbeque/DistilBERT-DeNiro").to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Pass a unique string with a [MASK] token for the model to fill
text = "This is a great [MASK]!"

tokenized_text = tokenizer(text, return_tensors="pt").to(device)
token_logits = model(**tokenized_text).logits

# locate the [MASK] position and take its logits over the whole vocabulary
mask_token_index = torch.where(tokenized_text["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# decode the 5 highest-scoring candidate tokens
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode(token)))
```

This code outputs the following:

``` text
This is a great movie!
This is a great film!
This is a great idea!
This is a great show!
This is a great documentary!
```


## Training Details

### Training Data / Preprocessing

The data used comes from the Stanford NLP 🤗 hub; the dataset card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). The dataset is preprocessed in the 
following way: the train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that 
applies a custom random masking function when batching, masking 15% of the tokens in each chunk. The evaluation data is masked once in its entirety, to remove randomness when evaluating, 
and passed to a `DataCollator` with the default collating function. A sketch of this pipeline follows.
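
Below is a minimal sketch of this preprocessing, assuming the `stanfordnlp/imdb` splits and a chunk size of 256. For brevity the sketch drops the final partial chunk rather than padding it, and it stands in the stock `DataCollatorForLanguageModeling` for our custom random masking function:

``` python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

chunk_size = 256  # fixed chunk length used during training

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
imdb = load_dataset("stanfordnlp/imdb")

def tokenize(examples):
    # tokenize the raw review text; the sentiment labels are not needed for masked language modeling
    return tokenizer(examples["text"])

def group_texts(examples):
    # concatenate all reviews into one token stream, then slice it into 256-token chunks
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    return {
        k: [v[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, v in concatenated.items()
    }

tokenized = imdb.map(tokenize, batched=True, remove_columns=["text", "label"])
chunked = tokenized.map(group_texts, batched=True)

# randomly masks 15% of the tokens in each chunk when batching
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```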

### Training Procedure

The model was trained locally on a single node with one 16GB Nvidia T4 using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate. 


#### Training Hyperparameters

- **Precision:** We use FP32 precision, inherited directly from the original "DistilBERT/distilbert-base-uncased" model.
- **Optimizer:** AdamW
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
- **Batch Size:** 32
- **Number of Training Steps:** 2877 steps over the course of 3 epochs
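
A condensed sketch of a training loop with these hyperparameters, using 🤗 Accelerate as described above (`chunked` and `collator` come from the preprocessing sketch; zero warmup steps is an assumption, and this is illustrative rather than the exact training script):

``` python
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, get_scheduler

model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-uncased")
optimizer = AdamW(model.parameters(), lr=5e-5)  # initial learning rate of 5e-5

train_dataloader = DataLoader(chunked["train"], batch_size=32, shuffle=True, collate_fn=collator)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)  # 2877 steps in our run
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# Accelerate places the model and batches on the available device (a single T4 here)
accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```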


## Evaluation / Metrics

We evaluate our masked language model's performance using the `perplexity` metric, which has a few mathematical definitions. We define the perplexity as the exponential of the cross-entropy, 
`perplexity = exp(cross-entropy)`. To remove randomness in our metrics, we premask our evaluation dataset with a single masking function. This ensures we are evaluating with respect to the same set of labels each epoch. 
See the Wikipedia links for perplexity and cross-entropy below for a more detailed discussion and various other definitions.

Cross-entropy: [https://en.wikipedia.org/wiki/Cross-entropy](https://en.wikipedia.org/wiki/Cross-entropy)

Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity)
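
Concretely, the conversion from the evaluation loss to perplexity is a single exponential. A sketch, assuming `model` and an `eval_dataloader` of premasked evaluation batches already exist:

``` python
import math

import torch

model.eval()
losses = []
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    # outputs.loss is the mean cross-entropy over the masked tokens in the batch
    losses.append(outputs.loss.item())

cross_entropy = sum(losses) / len(losses)
perplexity = math.exp(cross_entropy)  # perplexity = exp(cross-entropy)
print(f"perplexity: {perplexity:.2f}")
```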


### Testing Data, Factors & Metrics

#### Testing Data

The IMDB dataset from Stanford NLP comes pre-split into training and testing data of 25k reviews each. Our preprocessing, which included chunking the concatenated, tokenized inputs 
into chunks of 256 tokens, increased these respective splits by approximately 5k records each. As mentioned above, we apply a single masking function to the evaluation dataset before testing, as sketched below.
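
A sketch of that single masking pass, applying the random masking collator exactly once so every evaluation epoch sees the same labels (`chunked` and `collator` come from the preprocessing sketch above; the helper name is illustrative):

``` python
def insert_random_mask(batch):
    # run the masking collator one time and store its outputs as new columns
    features = [dict(zip(batch, values)) for values in zip(*batch.values())]
    masked = collator(features)
    return {f"masked_{k}": v.numpy() for k, v in masked.items()}

eval_dataset = chunked["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=chunked["test"].column_names,
)
# rename so the premasked columns are what the model sees at evaluation time
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)
```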

### Results

We find the following perplexity metrics over 3 training epochs:

 | epoch | perplexity |
 |-------|------------|
 |0      |  17.38 |
 |1      |  16.28 |
 |2      |  15.78 |

#### Summary

We trained this model to attempt a full local training of a masked language model using both the 🤗 ecosystem and a custom PyTorch training and evaluation loop. 
We look forward to further fine-tuning this model on more film/actor/cinema related data in order to further improve its knowledge and ability in this domain - 
indeed, cinema is one of the author's favorite things.

## Environmental Impact

- **Hardware Type:** Nvidia Tesla T4 16GB
- **Hours used:** 1.2
- **Cloud Provider:** Microsoft Azure
- **Compute Region:** EastUS
- **Carbon Emitted:** 0.03 kgCO₂eq


Experiments were conducted using Azure in region EastUS, which has a carbon efficiency of 0.37 kgCO₂eq/kWh. A cumulative 1.2 hours of computation was performed on hardware of type T4 (TDP of 70W). As a check: 70 W × 1.2 h = 0.084 kWh, and 0.084 kWh × 0.37 kgCO₂eq/kWh ≈ 0.03 kgCO₂eq.

Total emissions are estimated to be 0.03 kgCO₂eq, of which 100% was directly offset by the cloud provider.

Estimations were conducted using the Machine Learning Impact calculator presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

#### Hardware

The model was trained locally in an Azure Databricks workspace on a single node with one 16GB Nvidia T4 GPU for 1.2 GPU hours.

#### Software

Training utilized PyTorch, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.

#### Citations

``` bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```