---
base_model:
- distilbert/distilbert-base-uncased
datasets:
- stanfordnlp/imdb
language:
- en
library_name: transformers
metrics:
- perplexity
---

# Model Card for DistilBERT-DeNiro

🍿🎥 Welcome to the DistilBERT-DeNiro model card! 🎞️📽️

We domain adapt (fine-tune) the DistilBERT base model [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) on the IMDB movie reviews dataset for a whole-word masked language modeling task.

## Model Details

### Model Description

The DistilBERT base model is fine-tuned using a custom PyTorch training loop. We supervise the training of DistilBERT for masked language modeling on [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb), an open-source dataset from Stanford NLP available through the 🤗 hub. The movie reviews in this dataset were concatenated into a single string before being chunked and padded to a size of 256 tokens. We fed these chunks into the model during training with a batch size of 32. After training, the resulting model is domain adapted to fill `[MASK]` tokens in an input string with terms and lingo common to movies.

- **Developed by:** John Graham Reynolds
- **Funded by:** Vanderbilt University
- **Model type:** Masked Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)

### Model Sources

- **Repository:** https://github.com/johngrahamreynolds/DistilBERT-DeNiro

## Uses

### Direct Use

In order to query the model effectively, one must pass it a string containing a `[MASK]` token to be filled. An example is `text = "This is a great [MASK]!"`. The domain-adapted model will attempt to fill the mask with a token relevant to movies, cinema, TV, etc.

## How to Use and Query the Model

Use the code below to get started with the model. Users pass a `text` string containing a sentence with a `[MASK]` token, and the model provides options to fill the mask based on the sentence context and its background knowledge. Note that the DistilBERT base model was trained on a very large, general corpus of text. In our training, we fine-tuned the model on the large IMDB movie review dataset, so the model is now accustomed to filling `[MASK]` tokens with words related to the domain of movies/TV/film. To see the model's affinity for cinematic lingo, it is best to be deliberate in one's prompt engineering: to generate movie-related text, one should ideally pass a masked `text` string that could reasonably be found in someone's review of a movie. See the example below:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("MarioBarbeque/DistilBERT-DeNiro").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Pass a unique string with a [MASK] token for the model to fill
text = "This is a great [MASK]!"

tokenized_text = tokenizer(text, return_tensors="pt").to("cuda")
token_logits = model(**tokenized_text).logits

# Locate the [MASK] position and take the logits at that position
mask_token_index = torch.where(tokenized_text["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Print the top 5 candidate fills
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode(token)))
```

This code outputs the following:

```
This is a great movie!
This is a great film!
This is a great idea!
This is a great show!
This is a great documentary!
```
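
Alternatively, the 🤗 `fill-mask` pipeline wraps the tokenize, forward, and top-k decoding steps above into a single call. A minimal sketch (it assumes the tokenizer was pushed alongside the model so the pipeline can load both from the same repo id):

```python
from transformers import pipeline

# The fill-mask pipeline handles tokenization, the forward pass, and decoding internally
mask_filler = pipeline("fill-mask", model="MarioBarbeque/DistilBERT-DeNiro")

for pred in mask_filler("This is a great [MASK]!", top_k=5):
    print(f"{pred['sequence']}  (score: {pred['score']:.3f})")
```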

## Training Details

### Training Data / Preprocessing

The data comes from the Stanford NLP collection on the 🤗 hub. It has been preprocessed to contain only reviews that are at least 13 words long. The dataset card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb).
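
As a rough illustration of the preprocessing described above, the sketch below filters out very short reviews, tokenizes, and concatenates/chunks the text into 256-token blocks with 🤗 Datasets. The helper names (`tokenize_fn`, `group_texts`) and the exact chunk-boundary handling are assumptions for illustration, not the author's original code; the final partial chunk is simply dropped here rather than padded.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

chunk_size = 256  # block size stated in the model description

imdb = load_dataset("stanfordnlp/imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Hypothetical re-creation of the stated filter: keep reviews of at least 13 words
imdb = imdb.filter(lambda ex: len(ex["text"].split()) >= 13)

def tokenize_fn(examples):
    return tokenizer(examples["text"])

tokenized = imdb.map(tokenize_fn, batched=True, remove_columns=["text", "label"])

def group_texts(examples):
    # Concatenate all token ids in the batch, then split into fixed-size chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    result = {
        k: [v[i : i + chunk_size] for i in range(0, total_len, chunk_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized.map(group_texts, batched=True)
```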

### Training Procedure

The model was trained locally on a single node with one 16GB Nvidia T4 using 🤗 Transformers, 🤗 Tokenizers, and a custom PyTorch training loop that made use of 🤗 Accelerate.
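
The training loop itself is not included in this card; the sketch below shows the general shape such a loop can take with 🤗 Accelerate and a whole-word masking collator, reusing the `lm_datasets` produced in the preprocessing sketch above. The learning rate and masking probability are placeholders, not the values actually used; only the batch size of 32 and the 3 epochs come from this card.

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-uncased")

# Whole-word masking collator for the masked language modeling objective
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)  # 0.15 is the library default, assumed here

# `lm_datasets` is the chunked DatasetDict from the preprocessing sketch above
train_loader = DataLoader(lm_datasets["train"], batch_size=32, shuffle=True, collate_fn=collator)

optimizer = AdamW(model.parameters(), lr=5e-5)  # placeholder learning rate

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for epoch in range(3):
    for batch in train_loader:
        outputs = model(**batch)  # the collator supplies masked input_ids and labels
        accelerator.backward(outputs.loss)
        optimizer.step()
        optimizer.zero_grad()
```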

#### Training Hyperparameters

- **Training regime:** FP32 precision, following immediately from the precision inherited from the original distilbert/distilbert-base-uncased model.

## Evaluation / Metrics

We evaluate our masked language model's performance using the `perplexity` metric, which has a few mathematical definitions; we define the perplexity as the exponential of the cross-entropy. See the Wikipedia links below for a more detailed discussion and various other definitions.

Cross-entropy: [https://en.wikipedia.org/wiki/Cross-entropy](https://en.wikipedia.org/wiki/Cross-entropy)

Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity)
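
In code, this definition amounts to exponentiating the mean masked language modeling loss (a mean cross-entropy) over the evaluation batches. A minimal sketch, assuming an `eval_loader` built like the training loader above:

```python
import math

import torch

model.eval()
losses = []
for batch in eval_loader:  # eval_loader: a DataLoader over the held-out split, assumed
    with torch.no_grad():
        outputs = model(**batch)
    losses.append(outputs.loss.item())  # mean cross-entropy over the masked tokens in this batch

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity: {perplexity:.2f}")
```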

### Testing Data, Factors & Metrics

#### Testing Data

The IMDB dataset from Stanford NLP comes pre-split into training and testing sets of 25k reviews each. The same preprocessing described above (filtering out very short reviews, concatenating, and chunking into 256-token blocks) was applied to the held-out test split, which we use for evaluation.

### Results

We find the following perplexity metrics over 3 training epochs:

| epoch | perplexity |
|-------|------------|
| 0     | 17.38      |
| 1     | 16.28      |
| 2     | 15.78      |

#### Summary

We trained this model to attempt a local training of a masked language model using both the 🤗 ecosystem and a custom PyTorch training and evaluation loop. We look forward to further fine-tuning this model on more film/actor/cinema-related data in order to further improve its knowledge and ability in this domain - indeed, cinema is one of the author's favorite things.

## Environmental Impact

- **Hardware Type:** Nvidia Tesla T4 16GB
- **Hours used:** 1.2
- **Cloud Provider:** Microsoft Azure
- **Compute Region:** EastUS
- **Carbon Emitted:** 0.03 kgCO₂eq

Experiments were conducted using Azure in region eastus, which has a carbon efficiency of 0.37 kgCO₂eq/kWh. A cumulative 1.2 hours of computation was performed on hardware of type T4 (TDP of 70W).

Total emissions are estimated to be 0.03 kgCO₂eq, of which 100 percent was directly offset by the cloud provider.

Estimations were conducted using the Machine Learning Impact calculator presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
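
As a back-of-the-envelope check of that figure (assuming the GPU draws its full 70W TDP for the entire run):

```python
tdp_kw = 0.070           # T4 TDP in kW
hours = 1.2              # total compute time reported above
carbon_intensity = 0.37  # kgCO2eq per kWh for the eastus region, per the text above

energy_kwh = tdp_kw * hours                 # ~0.084 kWh
emissions = energy_kwh * carbon_intensity   # ~0.031 kgCO2eq, consistent with the reported 0.03
print(f"{emissions:.3f} kgCO2eq")
```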

#### Hardware

The model was trained locally in an Azure Databricks workspace using a single node with one 16GB Nvidia T4 GPU for 1.2 GPU hours.

#### Software

Training utilized PyTorch, 🤗 Transformers, 🤗 Tokenizers, 🤗 Datasets, 🤗 Accelerate, and more in an Azure Databricks execution environment.

## Citations

    @article{lacoste2019quantifying,
      title={Quantifying the Carbon Emissions of Machine Learning},
      author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
      journal={arXiv preprint arXiv:1910.09700},
      year={2019}
    }