|
|
--- |
|
|
base_model: |
|
|
- distilbert/distilbert-base-uncased |
|
|
datasets: |
|
|
- stanfordnlp/imdb |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
metrics: |
|
|
- perplexity |
|
|
--- |
|
|
|
|
|
# Model Card for DistilBERT-DeNiro
|
|
|
|
|
🍿🎥 Welcome to the DistilBERT-DeNiro model card! 🎞️📽️
|
|
|
|
|
We domain adapt (fine-tune) the DistilBERT base model [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the IMDB movie review dataset for a whole word masked language modeling task.
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
|
|
|
The DistilBERT base model is fine-tuned using a custom PyTorch training loop. We train DistilBERT for masked language modeling on [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb), an open source dataset from Stanford NLP available through the 🤗 hub. This dataset of movie reviews was concatenated into a single string before being chunked and padded to a size of 256 tokens.
|
|
These chunks were fed to the model during training with a batch size of 32. After training, the resulting model is domain adapted to fill `[MASK]` tokens in an input string with terms and lingo common to movies.
|
|
|
|
|
|
|
|
- **Developed by:** John Graham Reynolds |
|
|
- **Funded by:** Vanderbilt University |
|
|
- **Model type:** Masked Language Model |
|
|
- **Language(s) (NLP):** English |
|
|
- **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
|
|
|
|
|
### Model Sources
|
|
|
|
|
|
|
|
|
|
- **Repository:** https://github.com/johngrahamreynolds/DistilBERT-DeNiro |
|
|
|
|
|
## Uses |
|
|
|
|
|
|
|
|
|
|
### Direct Use |
|
|
|
|
|
In order to query the model effectively, one must pass it a string containing a `[MASK]` token to be filled. An example is `text = "This is a great [MASK]!"`. The domain-adapted model will attempt to fill the mask with a token relevant to movies, cinema, TV, etc.
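For a quick check, the 🤗 `fill-mask` pipeline can do this in a few lines. This is a minimal sketch, assuming the checkpoint published with this card and the base DistilBERT tokenizer used in the full example in the next section:

``` python
from transformers import pipeline

# load the fine-tuned checkpoint with the base DistilBERT tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="MarioBarbeque/DistilBERT-DeNiro",
    tokenizer="distilbert/distilbert-base-uncased",
)

# top 5 candidate fills for the [MASK] token, with their scores
for pred in fill_mask("This is a great [MASK]!", top_k=5):
    print(f'{pred["sequence"]}  (score: {pred["score"]:.3f})')
```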
|
|
|
|
|
## How to Use and Query the Model |
|
|
|
|
|
Use the code below to get started with the model. Users pass a `text` string containing a sentence with a `[MASK]` token. The model will provide options to fill the mask based on the sentence context and its background knowledge. Note: the DistilBERT base model was trained on a very large general corpus of text. In our training, we fine-tuned the model on the large IMDB movie review dataset, so the model is now accustomed to filling `[MASK]` tokens with words related to the domain of movies/TV/film. To see the model's affinity for cinematic lingo, it helps to be thoughtful with one's prompt engineering. Specifically, to generate movie-related text, one should ideally pass a masked `text` string that could reasonably be found in someone's review of a movie. See the example below:
|
|
|
|
|
``` python |
|
|
|
|
|
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# fall back to CPU if a GPU is not available
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the fine-tuned checkpoint and the base DistilBERT tokenizer
model = AutoModelForMaskedLM.from_pretrained("MarioBarbeque/DistilBERT-DeNiro").to(device)
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Pass a unique string with a [MASK] token for the model to fill
text = "This is a great [MASK]!"

tokenized_text = tokenizer(text, return_tensors="pt").to(device)
token_logits = model(**tokenized_text).logits

# locate the [MASK] position and pull its logits over the vocabulary
mask_token_index = torch.where(tokenized_text["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# print the 5 most likely completions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode(token)))
|
|
|
|
|
``` |
|
|
|
|
|
This code outputs the following: |
|
|
|
|
|
```
|
|
This is a great movie! |
|
|
This is a great film! |
|
|
This is a great idea! |
|
|
This is a great show! |
|
|
This is a great documentary! |
|
|
``` |
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data / Preprocessing |
|
|
|
|
|
The data used comes from Stanford NLP via the 🤗 hub; the dataset card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). This dataset is preprocessed in the following way: the train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that applies a custom random masking function when batching; we mask 15% of the tokens in each chunk. The evaluation data is masked once in its entirety, to remove randomness when evaluating, and passed to a `DataCollator` with the default collating function.
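The exact preprocessing and custom whole word masking function live in the linked repository. The following is a minimal sketch of the same pattern, assuming the standard 🤗 Datasets chunking recipe and using `DataCollatorForLanguageModeling` as a stand-in for the custom masking collator:

``` python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
imdb = load_dataset("stanfordnlp/imdb")
chunk_size = 256

def tokenize(examples):
    return tokenizer(examples["text"])

def group_texts(examples):
    # concatenate every tokenized review, then split into fixed 256-token chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    chunks = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    chunks["labels"] = chunks["input_ids"].copy()  # MLM labels are the original token ids
    return chunks

tokenized = imdb.map(tokenize, batched=True, remove_columns=["text", "label"])
lm_datasets = tokenized.map(group_texts, batched=True)

# masks 15% of tokens at batching time (the repository's collator masks whole words)
masking_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

def insert_random_mask(batch):
    # apply the masking function once, up front, so the eval labels are fixed across epochs
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked = masking_collator(features)
    return {k: v.numpy() for k, v in masked.items()}

eval_dataset = lm_datasets["test"].map(insert_random_mask, batched=True)
```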
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
The model was trained locally on a single node with one 16GB Nvidia T4 GPU using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.
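The actual training loop lives in the linked repository; the sketch below shows the general shape of a 🤗 Accelerate loop with the hyperparameters listed in the next subsection. The `lm_datasets`, `masking_collator`, and `eval_dataset` names are carried over from the preprocessing sketch above and are illustrative rather than the exact code:

``` python
import math
import torch
from accelerate import Accelerator
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, default_data_collator, get_scheduler

model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-uncased")
optimizer = AdamW(model.parameters(), lr=5e-5)

# lm_datasets / masking_collator / eval_dataset come from the preprocessing sketch above
train_dataloader = DataLoader(lm_datasets["train"], shuffle=True, batch_size=32, collate_fn=masking_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=32, collate_fn=default_data_collator)

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

for epoch in range(num_epochs):
    # training
    model.train()
    for batch in train_dataloader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # evaluation: perplexity is the exponential of the mean cross-entropy loss
    model.eval()
    losses = []
    for batch in eval_dataloader:
        with torch.no_grad():
            loss = model(**batch).loss
        losses.append(accelerator.gather(loss.repeat(batch["input_ids"].shape[0])))
    perplexity = math.exp(torch.mean(torch.cat(losses)))
    print(f"epoch {epoch}: perplexity = {perplexity:.2f}")
```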
|
|
|
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Precision:** We use FP32 precision, inherited from the original `distilbert/distilbert-base-uncased` model.
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
|
|
- **Batch Size:** 32 |
|
|
- **Number of Training Steps:** 2877 steps over the course of 3 epochs
|
|
|
|
|
|
|
|
## Evaluation / Metrics |
|
|
|
|
|
We evaluate our masked language model's performance using the `perplexity` metric, which has a few mathematical definitions. We define the perplexity as the exponential of the cross-entropy.
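Concretely, writing $M$ for the set of masked token positions in the evaluation set and $\mathcal{L}_{\mathrm{CE}}$ for the mean cross-entropy loss over those positions, this is the standard formulation:

``` latex
\mathrm{Perplexity} = \exp(\mathcal{L}_{\mathrm{CE}})
                    = \exp\left( -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right) \right)
```

where $\tilde{x}$ is the masked input sequence and $p_\theta$ is the model's predicted distribution over the vocabulary; lower is better.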
|
|
To remove randomness in our metrics, we premask our evaluation dataset with a single masking function. This ensures we are evaluating with respect to the same set of labels each epoch. |
|
|
See the Wikipedia links for perplexity and cross-entropy below for a more detailed discussion and various other definitions.
|
|
|
|
|
Cross-entropy: [https://en.wikipedia.org/wiki/Cross-entropy](https://en.wikipedia.org/wiki/Cross-entropy) |
|
|
|
|
|
Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity) |
|
|
|
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
The IMDB dataset from Stanford NLP comes pre-split into training and testing sets of 25k reviews each. Our preprocessing, which chunks the concatenated, tokenized inputs into chunks of 256 tokens, grows each of these splits by roughly 5k records. As mentioned above, we apply a single masking function to the evaluation dataset before testing.
|
|
|
|
|
### Results |
|
|
|
|
|
We find the following perplexity metrics over 3 training epochs: |
|
|
|
|
|
| epoch | perplexity | |
|
|
|-------|------------| |
|
|
|0 | 17.38 | |
|
|
|1 | 16.28 | |
|
|
|2 | 15.78 | |
|
|
|
|
|
#### Summary |
|
|
|
|
|
We trained this model as an exercise in training a masked language model locally, using both the 🤗 ecosystem and a custom PyTorch training and evaluation loop. We look forward to further fine-tuning this model on more film/actor/cinema related data in order to further improve its knowledge and ability in this domain - indeed, cinema is one of the author's favorite things.
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** Nvidia Tesla T4 16GB |
|
|
- **Hours used:** 1.2 |
|
|
- **Cloud Provider:** Microsoft Azure |
|
|
- **Compute Region:** EastUS |
|
|
- **Carbon Emitted:** 0.03 kgCO₂eq
|
|
|
|
|
|
|
|
Experiments were conducted using Azure in region eastus, which has a carbon efficiency of 0.37 kgCO₂eq/kWh. A cumulative 1.2 hours of computation was performed on hardware of type T4 (TDP of 70W).
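This estimate follows directly from the runtime, the GPU's TDP, and the regional carbon efficiency quoted above:

``` latex
0.070\,\mathrm{kW} \times 1.2\,\mathrm{h} \times 0.37\,\mathrm{kgCO_2eq/kWh} \approx 0.031\,\mathrm{kgCO_2eq}
```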
|
|
|
|
|
Total emissions are estimated to be 0.03 kgCO₂eq, 100 percent of which was directly offset by the cloud provider.
|
|
|
|
|
Estimates were made using the Machine Learning Impact calculator presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
|
|
|
|
#### Hardware |
|
|
|
|
|
The model was trained locally in an Azure Databricks workspace using a single node with one 16GB Nvidia T4 GPU for 1.2 GPU hours.
|
|
|
|
|
#### Software |
|
|
|
|
|
Training utilized PyTorch, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.
|
|
|
|
|
#### Citations |
|
|
|
|
|
|
|
|
``` bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```
|
|
|