Model Card

This model was trained to analyse model utility when training on various Derived Text Formats (DTFs).
These are versions of the same text, adjusted to reduce the chance that the original text can ever be extracted from the model, with applications in privacy and copyright-infringement protection. In this case, the model was trained on only the dataset's nouns.
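The noun-only format can be illustrated with a minimal sketch. The card does not say which POS tagger produced the training data, so the Penn Treebank tags and the `tagged` example below are assumptions for illustration only:

```python
def nouns_only(tagged_tokens):
    """Keep only tokens whose POS tag marks a noun (Penn Treebank NN* tags)."""
    return [tok for tok, tag in tagged_tokens if tag.startswith("NN")]

# Hypothetical tagger output for "The model learns representations quickly"
tagged = [("The", "DT"), ("model", "NN"), ("learns", "VBZ"),
          ("representations", "NNS"), ("quickly", "RB")]

print(" ".join(nouns_only(tagged)))  # model representations
```

Dropping all non-noun tokens removes most of the syntactic structure needed to reconstruct the original sentence, which is the point of this DTF.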

The dataset used for these experiments is codelion/fineweb-edu-1B; the obfuscated formats are available in DanielGallagherIRE/fineweb-edu-1B-obfuscated.

Training Configuration

The model was trained using the following key hyperparameters:

Model Architecture

  • Base Architecture: BERT (base, cased)
  • Hidden Size: 768
  • Number of Layers: 12
  • Attention Heads: 12
  • Intermediate Size: 3072
  • Max Sequence Length: 512
  • Activation Function: GELU
  • Normalization: Layer normalization (pre-norm)
  • Position Embeddings: Learned
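The values above can be assembled into a Hugging Face `transformers` configuration. This is a sketch, not the author's training script; fields the card does not list fall back to `BertConfig` defaults, and note that the card's pre-norm placement is not expressible through a stock `BertConfig` field:

```python
from transformers import BertConfig

# Values taken from the architecture list above; everything else
# uses BertConfig defaults.
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    hidden_act="gelu",
    position_embedding_type="absolute",  # learned position embeddings
)
```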

Training Hyperparameters

  • Objective: Masked Language Modeling (MLM)
  • Optimizer: AdamW (8-bit quantized)
  • Learning Rate: 1e-4
  • Weight Decay: 1e-5
  • Warmup Steps: 10,000
  • Warmup Decay: 0.1
  • Max Steps: 150,000
  • Precision: bfloat16 mixed precision
  • Batch Size: 16 (train and validation)
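The MLM objective selects a subset of input tokens, replaces them with a mask token, and trains the model to predict the originals. A minimal sketch of the selection step, assuming BERT's conventional 15% masking rate (the card does not state the rate, and real implementations also apply the 80/10/10 mask/random/keep split, omitted here):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly select positions for the MLM objective.

    Returns (masked_tokens, labels): labels hold the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

masked, labels = mask_tokens(["model", "utility", "privacy", "copyright", "nouns"])
```

The loss is computed only at positions where `labels` is not `None`, so the model is never penalised for unmasked tokens.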

Dataset

  • Training Data: DanielGallagherIRE/fineweb-edu-1B-obfuscated
  • Tokenizer: bert-base-cased
  • Model Size: ~0.1B parameters (F32, safetensors)
