MarioBarbeque committed
Commit a8d360a · verified · 1 Parent(s): efddbd5
Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -94,8 +94,10 @@ This is a great documentary!
 
 ### Training Data / Preprocessing
 
-The data used comes from the Stanford NLP 🤗 hub. It has been preprocessed to only contain reviews at least 13 or more words in length. The model card
-can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb).
+The data used comes from the Stanford NLP 🤗 hub. The model card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). This dataset is preprocessed in the
+following way: the train and test splits are tokenized, concatenated, and split into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
+applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked once, in its entirety, to remove randomness when evaluating,
+and passed to a `DataCollator` with the default collating function.
 
 ### Training Procedure
 
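
For context, the sketch below shows one way to implement the preprocessing described in the added lines, using 🤗 `datasets` and `transformers`. It is illustrative only: the checkpoint name (`distilbert-base-uncased`), the batch size, the helper names `group_texts` and `insert_random_mask`, and the use of `DataCollatorForLanguageModeling` as a stand-in for the commit's custom masking function are assumptions, not details taken from this repository.

```python
# Illustrative preprocessing sketch; checkpoint, helpers, and collator choice are assumed.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, default_data_collator
from torch.utils.data import DataLoader

chunk_size = 256          # chunk length stated in the README
mask_probability = 0.15   # fraction of tokens masked per chunk

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
imdb = load_dataset("stanfordnlp/imdb")

# Tokenize the raw reviews; drop the text/label columns we no longer need.
tokenized = imdb.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=["text", "label"],
)

def group_texts(batch):
    # Concatenate every tokenized review in the batch, then slice the result
    # into fixed-size chunks of `chunk_size` tokens.
    concatenated = {k: sum(batch[k], []) for k in batch.keys()}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    chunks = {
        k: [v[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, v in concatenated.items()
    }
    chunks["labels"] = chunks["input_ids"].copy()
    return chunks

chunked = tokenized.map(group_texts, batched=True)

# Training: mask 15% of tokens at batching time, so each epoch sees fresh masks.
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=mask_probability
)
train_loader = DataLoader(
    chunked["train"], batch_size=32, shuffle=True, collate_fn=mlm_collator  # batch size illustrative
)

def insert_random_mask(batch):
    # Apply the masking collator once to the whole evaluation split so that
    # every evaluation run scores the same masked tokens.
    features = [dict(zip(batch, values)) for values in zip(*batch.values())]
    masked = mlm_collator(features)
    return {k: v.numpy() for k, v in masked.items()}

eval_masked = chunked["test"].map(insert_random_mask, batched=True)
eval_loader = DataLoader(eval_masked, batch_size=32, collate_fn=default_data_collator)
```

Masking the evaluation split once up front, rather than at batching time, is what makes evaluation deterministic: every run scores the same masked tokens, which is the point of the "remove randomness" remark in the added lines.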