update
README.md CHANGED

@@ -94,8 +94,10 @@ This is a great documentary!
### Training Data / Preprocessing

- The data used comes from the Stanford NLP 🤗 hub.
-
+ The data used comes from the Stanford NLP 🤗 hub. The dataset card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). This dataset is preprocessed in the
+ following way: the train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
+ applies a custom random masking function when batching, masking 15% of the tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness during
+ evaluation, and passed to a `DataCollator` with the default collating function.

### Training Procedure
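
The preprocessing described in the added lines above roughly follows the standard 🤗 masked-language-modelling recipe. Below is a minimal sketch of that pipeline; the checkpoint (`distilbert-base-uncased`) and helper names such as `group_texts` and `insert_random_mask` are illustrative assumptions, not code taken from this repository.

```python
# Minimal sketch of the preprocessing described above (assumes PyTorch).
# The checkpoint and helper names are illustrative, not the repository's code.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

chunk_size = 256         # chunk length stated in the card
mask_probability = 0.15  # fraction of tokens masked per training chunk

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
imdb = load_dataset("stanfordnlp/imdb")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate every tokenized column, then split into fixed-size chunks.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    chunks = {
        k: [v[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, v in concatenated.items()
    }
    chunks["labels"] = chunks["input_ids"].copy()
    return chunks

tokenized = imdb.map(tokenize, batched=True, remove_columns=["text", "label"])
chunked = tokenized.map(group_texts, batched=True)

# Training: the collator applies random masking on the fly, per batch.
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=mask_probability)

# Evaluation: apply the masking once, up front, so every evaluation run
# sees the same masked tokens.
def insert_random_mask(batch):
    features = [dict(zip(batch, values)) for values in zip(*batch.values())]
    masked = mlm_collator(features)
    return {f"masked_{k}": v.numpy() for k, v in masked.items()}

eval_dataset = chunked["test"].map(
    insert_random_mask, batched=True, remove_columns=chunked["test"].column_names
)
```

A `DataLoader` over `eval_dataset` would then use `transformers.default_data_collator` as its `collate_fn`, matching the card's note that evaluation batches use the default collating function.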