Model Card

This model was trained to analyse model utility when training on various Derived Text Formats (DTFs).
These are versions of the same text, adjusted to reduce the chance that the original text can ever be extracted from the model, with applications in privacy and copyright-infringement protection. In this case, the model was trained on the dataset after lemmatizing all words (i.e. converting them to their base forms).
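For illustration, the lemmatization step might look like the minimal sketch below. It uses spaCy, which is an assumption; the card does not state which lemmatizer was actually used.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text: str) -> str:
    """Replace every word with its base form, keeping the original order."""
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatize_text("The children were running faster"))
# -> "the child be run fast" (approximate; output depends on the spaCy model version)
```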

The dataset used for these experiments is codelion/fineweb-edu-1B; all obfuscated formats can be found in DanielGallagherIRE/fineweb-edu-1B-obfuscated.

Training Configuration

The model was trained using the following key hyperparameters; an illustrative configuration sketch follows each list.

Model Architecture

  • Base Architecture: BERT (base, cased)
  • Hidden Size: 768
  • Number of Layers: 12
  • Attention Heads: 12
  • Intermediate Size: 3072
  • Max Sequence Length: 512
  • Activation Function: GELU
  • Normalization: Layer normalization (pre-norm)
  • Position Embeddings: Learned
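As a non-authoritative reading of the list above, the architecture maps onto a standard transformers BertConfig roughly as follows. Note that stock transformers BERT uses post-norm layer normalization, so the pre-norm detail listed above would require a custom variant; everything else is taken directly from the list.

```python
from transformers import BertConfig, BertForMaskedLM

# Sketch only: the values below are the ones listed in this card;
# the pre-norm detail is NOT reproduced (stock HF BERT is post-norm).
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    hidden_act="gelu",
    position_embedding_type="absolute",  # learned absolute position embeddings
)
model = BertForMaskedLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")  # ~110M for a BERT-base configuration
```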

Training Hyperparameters

  • Objective: Masked Language Modeling (MLM)
  • Optimizer: AdamW (8-bit quantized)
  • Learning Rate: 1e-4
  • Weight Decay: 1e-5
  • Warmup Steps: 10,000
  • Warmup Decay: 0.1
  • Max Steps: 150,000
  • Precision: bfloat16 mixed precision
  • Batch Size: 16 (train and validation)
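A hedged sketch of how these settings could be expressed with the transformers Trainer API is shown below. The MLM masking probability (0.15, BERT's default) and the output directory name are assumptions not stated in the card, and the warmup decay schedule is not mapped here.

```python
from transformers import (
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# MLM objective; the 15% masking probability is an assumption (BERT's default).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="obfuscated-bert-lemmatized",  # hypothetical name
    learning_rate=1e-4,
    weight_decay=1e-5,
    warmup_steps=10_000,
    max_steps=150_000,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    bf16=True,                # bfloat16 mixed precision
    optim="adamw_bnb_8bit",   # 8-bit AdamW (requires bitsandbytes)
)
```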

Dataset

  • Training Data: DanielGallagherIRE/fineweb-edu-1B-obfuscated
  • Tokenizer: bert-base-cased
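For completeness, loading the training data and tokenizer might look like the sketch below; the column name "text" and the configuration layout of the dataset are assumptions, so check the dataset card for the exact configuration of the lemmatized format.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# The split/column layout is assumed; the dataset likely exposes one
# configuration per derived text format, including the lemmatized one.
ds = load_dataset("DanielGallagherIRE/fineweb-edu-1B-obfuscated")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds["train"].column_names)
```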