CORAL Data Obfuscation Collection
Encoder models trained on obfuscated versions of the fineweb-edu-1B dataset, as part of the InfAI CORAL project.
This model was trained to analyse model utility when training on various Derived Text Formats.
These are versions of the same text, adjusted to reduce the chance that the original text can be extracted from the model, with applications in privacy protection and copyright-infringement prevention.
In this case, the model was trained on the dataset after lemmatizing (i.e. converting to base forms) all words.
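The lemmatization step can be illustrated with a short sketch. The CORAL project presumably used a full lemmatizer (e.g. spaCy or NLTK); the tiny lookup table below is a hypothetical stand-in, included only to show how each word is mapped to its base form before training.

```python
# Toy illustration of the lemmatization obfuscation step.
# LEMMA_TABLE is a hypothetical miniature lemma dictionary, not the
# actual resource used by the CORAL project.
LEMMA_TABLE = {
    "trained": "train", "versions": "version", "adjusted": "adjust",
    "models": "model", "words": "word", "running": "run",
}

def lemmatize_text(text: str) -> str:
    """Replace every whitespace-separated word with its base form where known."""
    return " ".join(LEMMA_TABLE.get(w.lower(), w.lower()) for w in text.split())

print(lemmatize_text("Models trained on adjusted versions"))
# prints: model train on adjust version
```

Applied over a whole corpus, this transformation preserves most semantic content while making verbatim recovery of the original surface text from a trained model much harder.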
The dataset used for these experiments is codelion/fineweb-edu-1B, with all obfuscated formats found here.
The model was trained using the following key hyperparameters:

- Base model: google-bert/bert-base-cased