MarioBarbeque committed
Commit a8d360a · verified · 1 Parent(s): efddbd5
Files changed (1)
  1. README.md +4 -2
README.md CHANGED
@@ -94,8 +94,10 @@ This is a great documentary!
 
 ### Training Data / Preprocessing
 
-The data used comes from the Stanford NLP 🤗 hub. It has been preprocessed to only contain reviews at least 13 or more words in length. The model card
-can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb).
+The data used comes from the Stanford NLP 🤗 hub. The model card can be found [here](https://huggingface.co/datasets/stanfordnlp/imdb). This dataset is preprocessed in the
+following way: the train and test splits are tokenized, concatenated, and split into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
+applies a custom random masking function when batching. We mask 15% of tokens in each chunk. The evaluation data is masked once, in its entirety, to remove randomness when evaluating,
+and passed to a `DataCollator` with the default collating function.
 
 ### Training Procedure
 
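
For context, the sketch below shows one way to implement the preprocessing described in the added lines, using 🤗 `datasets` and `transformers`. It is illustrative only: the checkpoint name (`distilbert-base-uncased`), the batch size, the helper names `group_texts` and `insert_random_mask`, and the use of `DataCollatorForLanguageModeling` as a stand-in for the commit's custom masking function are assumptions, not details taken from this repository.

```python
# Illustrative preprocessing sketch; checkpoint, helpers, and collator choice are assumed.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, default_data_collator
from torch.utils.data import DataLoader

chunk_size = 256          # chunk length stated in the README
mask_probability = 0.15   # fraction of tokens masked per chunk

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
imdb = load_dataset("stanfordnlp/imdb")

# Tokenize the raw reviews; drop the text/label columns we no longer need.
tokenized = imdb.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    remove_columns=["text", "label"],
)

def group_texts(batch):
    # Concatenate every tokenized review in the batch, then slice the result
    # into fixed-size chunks of `chunk_size` tokens.
    concatenated = {k: sum(batch[k], []) for k in batch.keys()}
    total_length = (len(concatenated["input_ids"]) // chunk_size) * chunk_size
    chunks = {
        k: [v[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, v in concatenated.items()
    }
    chunks["labels"] = chunks["input_ids"].copy()
    return chunks

chunked = tokenized.map(group_texts, batched=True)

# Training: mask 15% of tokens at batching time, so each epoch sees fresh masks.
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=mask_probability
)
train_loader = DataLoader(
    chunked["train"], batch_size=32, shuffle=True, collate_fn=mlm_collator  # batch size illustrative
)

def insert_random_mask(batch):
    # Apply the masking collator once to the whole evaluation split so that
    # every evaluation run scores the same masked tokens.
    features = [dict(zip(batch, values)) for values in zip(*batch.values())]
    masked = mlm_collator(features)
    return {k: v.numpy() for k, v in masked.items()}

eval_masked = chunked["test"].map(insert_random_mask, batched=True)
eval_loader = DataLoader(eval_masked, batch_size=32, collate_fn=default_data_collator)
```

Masking the evaluation split once up front, rather than at batching time, is what makes evaluation deterministic: every run scores the same masked tokens, which is the point of the "remove randomness" remark in the added lines.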