BabyLM 2025 GPT-2 with MorPiece Tokenizer (Strict Small Track)
Model Description
This is a GPT-2 language model trained as part of the BabyLM 2025 Challenge on the strict small track, using the innovative MorPiece tokenizer. The model demonstrates how morphologically-aware tokenization can improve language modeling performance when training on limited data (10M words).
- Developed by: NeTS Lab
- Model type: Autoregressive Language Model (GPT-2 architecture)
- Language(s): English
- License: MIT
- Parent Model: GPT-2
- Tokenizer: MorPiece (cristianochesi/morpiece)
Key Features
- Morphologically-aware tokenization via MorPiece for better handling of word structure (hyperparameter used: min_freq=10, bf=2, cutoff=100)
- Strict data constraints (10M words) following BabyLM 2025 Strict Small track
- Data-efficient training using the default BabyLM 2025 baseline hyperparameters
- 768-dimensional embeddings with 12 attention heads and 12 layers
Model Details
Architecture
- Base Architecture: GPT-2 (12 layers, 12 attention heads)
- Hidden Size: 768
- Vocabulary Size: 23,405 (MorPiece tokens)
- Context Length: 1,024 tokens
- Parameters: ~104M (estimated)
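The ~104M estimate can be reproduced with a rough back-of-the-envelope count from the dimensions above (a sketch that assumes tied input/output embeddings, as in standard GPT-2, and ignores bias and layer-norm parameters):

# Rough GPT-2 parameter estimate from the architecture listed above.
vocab, d, n_ctx, n_layer = 23_405, 768, 1024, 12

token_emb = vocab * d            # ~18.0M token embeddings (tied with the LM head)
pos_emb   = n_ctx * d            # ~0.8M positional embeddings
per_layer = 12 * d * d           # attention (4*d*d) + MLP (8*d*d) weights per block
blocks    = n_layer * per_layer  # ~84.9M across 12 layers

total = token_emb + pos_emb + blocks
print(f"~{total / 1e6:.0f}M parameters")  # ~104M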
Training Configuration
- Training Type: Strict (BabyLM 2025 guidelines)
- Dataset Size: 10M words maximum
- Sequence Length: 512 tokens
- Batch Size: 16
- Learning Rate: 5e-5
- Training Steps: 200,000
- Warmup Steps: 2,000
- Epochs: 10
- Weight Decay: 0.0
- Gradient Clipping: 1.0
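The hyperparameters above translate roughly into the following transformers TrainingArguments (a minimal sketch, not the exact challenge training script; output_dir is a placeholder):

from transformers import TrainingArguments

# Sketch of the training configuration listed above.
training_args = TrainingArguments(
    output_dir="babylm-mop-10m-gpt2",     # placeholder output path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    max_steps=200_000,                    # takes precedence over num_train_epochs
    num_train_epochs=10,
    warmup_steps=2_000,
    weight_decay=0.0,
    max_grad_norm=1.0,                    # gradient clipping
    report_to="wandb",                    # Weights & Biases monitoring
)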
Tokenization
This model uses the MorPiece tokenizer (cristianochesi/morpiece), a split-based, morphologically-aware tokenizer.
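A minimal sketch for inspecting the MorPiece segmentation, assuming the tokenizer files are bundled with this model repository and loadable through AutoTokenizer:

from transformers import AutoTokenizer

# Assumes the MorPiece tokenizer is distributed with the model repo.
tokenizer = AutoTokenizer.from_pretrained("NeTS-lab/babylm-mop-10m-gpt2")

# Inspect how a word is split into sub-word pieces.
print(tokenizer.tokenize("unbelievable"))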
Training Data
The model was trained on the BabyLM 2025 strict small track dataset, which includes:
- Size: 10M words maximum
- Sources: Child-directed speech and age-appropriate text
- Language: English
- Preprocessing: Tokenized using MorPiece tokenizer
Intended Uses
Primary Use Cases
- Research into data-efficient language modeling
- Comparative studies of tokenization methods in low-resource settings
- Baseline model for BabyLM 2025 Challenge participants
- Educational purposes for understanding morphological tokenization
Out-of-Scope Uses
- Production deployments requiring robust, general-purpose language understanding
- Safety-critical applications
- Tasks requiring knowledge beyond the training data scope
Performance
The model was trained following BabyLM 2025 Challenge protocols:
- Training loss: 3.28548
- Convergence: Achieved after 200,000 training steps
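Since the reported training loss is a cross-entropy in nats, it implies a training-set perplexity of about exp(3.28548) ≈ 26.7 (a derived figure, not an official BabyLM evaluation metric):

import math

# Perplexity implied by the final training cross-entropy loss (in nats).
print(round(math.exp(3.28548), 1))  # ≈ 26.7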
Usage
Loading the Model
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm-mop-10m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm-mop-10m-gpt2")
# Generate text
input_text = "The child played with"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Text Generation Parameters
- Max Length: 50 tokens (as used in the example above)
- Sampling: enabled via do_sample=True
- Temperature: adjustable (0.8 recommended)
Limitations and Biases
Known Limitations
- Limited training data (10M words) may result in knowledge gaps
- Domain specificity due to child-directed speech focus
- Vocabulary constraints from MorPiece tokenization
- Context window limited to 1,024 tokens
Potential Biases
- Age-appropriate content bias from training data selection
- English language bias (monolingual training)
- Morphological bias toward Indo-European language patterns
- Dataset composition bias inherent in BabyLM data curation
Technical Specifications
Training Infrastructure
- Framework: PyTorch + Transformers
- Precision: float32
- Gradient Accumulation: configured to reach the target effective batch size
- Monitoring: Weights & Biases integration
Model Configuration
{
"activation_function": "gelu_new",
"architectures": ["GPT2LMHeadModel"],
"attn_pdrop": 0.1,
"embd_pdrop": 0.1,
"layer_norm_epsilon": 1e-05,
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"vocab_size": 23405
}
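As a sanity check, this configuration can be rebuilt locally to verify the parameter estimate (a sketch; n_positions plays the role of n_ctx in recent transformers versions):

from transformers import GPT2Config, GPT2LMHeadModel

# Rebuild the configuration above and count parameters locally.
config = GPT2Config(
    vocab_size=23405,
    n_positions=1024,          # corresponds to n_ctx in the JSON above
    n_embd=768,
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",
    attn_pdrop=0.1,
    embd_pdrop=0.1,
    layer_norm_epsilon=1e-5,
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")  # roughly 104M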
Citation
If you use this model in your research, please cite:
@misc{babylm2025-gpt2-morpiece,
title={Under review},
author={[Your Name]},
year={2025},
url={https://huggingface.co/NeTS-lab/babylm-mop-10m-gpt2}
}
Also consider citing the original BabyLM Challenge and MorPiece tokenizer:
@misc{morpiece2024,
title={MorPiece: Morphologically-aware Piece Tokenization},
author={Cristiano Chesi and NeTS Lab},
year={2024},
url={https://github.com/cristianochesi/morpiece}
}
Acknowledgments
- BabyLM 2025 Challenge organizers for providing the framework and dataset
- MorPiece developers for split-based tokenization approach
- Hugging Face Transformers team for the modeling infrastructure
Contact
For questions about this model or the training process, please contact [cristiano.chesi@iusspavia.it].
This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.