Model Card

This model was trained to analyse model utility when training on various Derived Text Formats (DTFs).
These are versions of the same text, adjusted to reduce the chance that the original text can ever be extracted from the model, with applications in privacy and copyright-infringement protection. In this case, the model was trained on the dataset after lemmatizing all words (i.e. converting them to their base forms).
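For illustration, the lemmatization step might look like the minimal sketch below. It uses spaCy, which is an assumption; the card does not state which lemmatizer was actually used.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text: str) -> str:
    """Replace every word with its base form, keeping the original order."""
    return " ".join(token.lemma_ for token in nlp(text))

print(lemmatize_text("The children were running faster"))
# -> "the child be run fast" (approximate; output depends on the spaCy model version)
```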

The dataset used for these experiments is codelion/fineweb-edu-1B; all obfuscated formats can be found in DanielGallagherIRE/fineweb-edu-1B-obfuscated.

Training Configuration

The model was trained using the following key hyperparameters; an illustrative configuration sketch follows each list.

Model Architecture

  • Base Architecture: BERT (base, cased)
  • Hidden Size: 768
  • Number of Layers: 12
  • Attention Heads: 12
  • Intermediate Size: 3072
  • Max Sequence Length: 512
  • Activation Function: GELU
  • Normalization: Layer normalization (pre-norm)
  • Position Embeddings: Learned
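As a non-authoritative reading of the list above, the architecture maps onto a standard transformers BertConfig roughly as follows. Note that stock transformers BERT uses post-norm layer normalization, so the pre-norm detail listed above would require a custom variant; everything else is taken directly from the list.

```python
from transformers import BertConfig, BertForMaskedLM

# Sketch only: the values below are the ones listed in this card;
# the pre-norm detail is NOT reproduced (stock HF BERT is post-norm).
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    hidden_act="gelu",
    position_embedding_type="absolute",  # learned absolute position embeddings
)
model = BertForMaskedLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")  # ~110M for a BERT-base configuration
```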

Training Hyperparameters

  • Objective: Masked Language Modeling (MLM)
  • Optimizer: AdamW (8-bit quantized)
  • Learning Rate: 1e-4
  • Weight Decay: 1e-5
  • Warmup Steps: 10,000
  • Warmup Decay: 0.1
  • Max Steps: 150,000
  • Precision: bfloat16 mixed precision
  • Batch Size: 16 (train and validation)
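A hedged sketch of how these settings could be expressed with the transformers Trainer API is shown below. The MLM masking probability (0.15, BERT's default) and the output directory name are assumptions not stated in the card, and the warmup decay schedule is not mapped here.

```python
from transformers import (
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# MLM objective; the 15% masking probability is an assumption (BERT's default).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="obfuscated-bert-lemmatized",  # hypothetical name
    learning_rate=1e-4,
    weight_decay=1e-5,
    warmup_steps=10_000,
    max_steps=150_000,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    bf16=True,                # bfloat16 mixed precision
    optim="adamw_bnb_8bit",   # 8-bit AdamW (requires bitsandbytes)
)
```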

Dataset

  • Training Data: DanielGallagherIRE/fineweb-edu-1B-obfuscated
  • Tokenizer: bert-base-cased
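For completeness, loading the training data and tokenizer might look like the sketch below; the column name "text" and the configuration layout of the dataset are assumptions, so check the dataset card for the exact configuration of the lemmatized format.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# The split/column layout is assumed; the dataset likely exposes one
# configuration per derived text format, including the lemmatized one.
ds = load_dataset("DanielGallagherIRE/fineweb-edu-1B-obfuscated")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True, remove_columns=ds["train"].column_names)
```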