# BERT-PRETRAINED-EDU

## Overview

This is a BERT-style masked language model pretrained from scratch on streaming data from the FineWeb-Edu dataset.

## Architecture

- Hidden size: 384
- Layers: 6
- Heads: 6
- Sequence length: 128
- Objective: Masked Language Modeling (MLM); a config sketch follows this list
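
A minimal sketch of this configuration using the Hugging Face `transformers` `BertConfig`. `vocab_size` and `intermediate_size` are not stated above, so the values below are assumptions (the library default and the usual 4x hidden-size ratio, respectively):

```python
from transformers import BertConfig, BertForMaskedLM

# Reconstruction of the architecture listed above.
config = BertConfig(
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    max_position_embeddings=128,  # matches the 128-token sequence length
    intermediate_size=1536,       # assumption: 4 * hidden_size, the standard BERT ratio
    # vocab_size left at the library default (30522); not stated in this card
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```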

## Training

- Dataset: HuggingFaceFW/fineweb-edu (streamed; see the pipeline sketch after this list)
- Steps: ~20,000
- Hardware: dual GPUs with Distributed Data Parallel (DDP)
- Precision: mixed precision training
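
As a sketch of the data pipeline under the setup above (the tokenizer is not named in this card, so `bert-base-uncased` below is an assumption, as is the standard 15% masking rate):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Stream FineWeb-Edu so the full corpus never has to be downloaded.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Assumed tokenizer; the card does not name one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate to the model's 128-token context.
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Drop the raw columns so only token tensors reach the collator.
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Dynamic masking for the MLM objective (15% is the standard BERT rate).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```

With the `transformers` `Trainer`, launching via `torchrun --nproc_per_node=2` and setting `fp16=True` (or `bf16=True`) would correspond to the dual-GPU DDP and mixed-precision setup described above.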

## Intended Use

- Fine-tuning for downstream tasks (see the sketch after this list):
  - Sentiment classification
  - Document classification
  - Retrieval / RAG encoder
- NLP research
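
For example, fine-tuning for sentiment classification could start from this checkpoint as follows (the repo id below is a placeholder; the MLM head is discarded and a freshly initialized classification head is trained):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder repo id; substitute the actual id of this checkpoint.
checkpoint = "your-username/bert-pretrained-edu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # e.g. binary sentiment labels
)
```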

## Limitations

- Not instruction-tuned
- Not chat-optimized