BabyLM 2025 GPT-2 with MorPiece Tokenizer (Strict Small Track)

Model Description

This is a GPT-2 language model trained for the BabyLM 2025 Challenge on the Strict Small track, using the MorPiece tokenizer. The model demonstrates how morphologically-aware tokenization can improve language modeling performance when training on limited data (10M words).

  • Developed by: NeTS Lab
  • Model type: Autoregressive Language Model (GPT-2 architecture)
  • Language(s): English
  • License: MIT
  • Parent Model: GPT-2
  • Tokenizer: MorPiece (cristianochesi/morpiece)

Key Features

  • Morphologically-aware tokenization via MorPiece for better handling of word structure (hyperparameters used: min_freq=10, bf=2, cutoff=100)
  • Strict data constraints (10M words) following the BabyLM 2025 Strict Small track
  • Optimized for data efficiency using the default BabyLM 2025 baseline hyperparameter settings
  • 768-dimensional embeddings with 12 attention heads and 12 layers

Model Details

Architecture

  • Base Architecture: GPT-2 (12 layers, 12 attention heads)
  • Hidden Size: 768
  • Vocabulary Size: 40,148 (MorPiece tokens)
  • Context Length: 1,024 tokens
  • Parameters: ~147M (estimated)
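
The ~147M figure can be sanity-checked with a back-of-the-envelope count from the architecture above. The sketch below is illustrative only: it assumes standard GPT-2 shapes (fused QKV projection, 4x MLP expansion) and shows that ~147M corresponds to counting the LM head separately from the token embedding; with weight tying the count is closer to 117M.

# Rough GPT-2 parameter count from the architecture listed above (illustrative only).
n_layer, n_embd, n_ctx, vocab = 12, 768, 1024, 40148

embeddings = vocab * n_embd + n_ctx * n_embd       # token + position embeddings
per_block = (
    2 * 2 * n_embd                                 # two LayerNorms (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd             # fused QKV projection
    + n_embd * n_embd + n_embd                     # attention output projection
    + n_embd * 4 * n_embd + 4 * n_embd             # MLP up-projection (4x expansion)
    + 4 * n_embd * n_embd + n_embd                 # MLP down-projection
)
transformer = n_layer * per_block + 2 * n_embd     # blocks + final LayerNorm

tied = embeddings + transformer                    # LM head tied to the token embedding
untied = tied + vocab * n_embd                     # LM head counted as a separate matrix
print(f"tied: {tied/1e6:.1f}M, untied: {untied/1e6:.1f}M")  # ~116.7M vs ~147.5M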

Training Configuration

  • Training Type: Strict (BabyLM 2025 guidelines)
  • Dataset Size: 10M words maximum
  • Sequence Length: 512 tokens
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Training Steps: 200,000
  • Warmup Steps: 2,000
  • Epochs: 10
  • Weight Decay: 0.0
  • Gradient Clipping: 1.0
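
For reference, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows. This is a minimal sketch assuming the standard Trainer API, not the exact training script used for the challenge; the output directory is an illustrative placeholder.

from transformers import TrainingArguments

# Sketch of the training configuration listed above.
training_args = TrainingArguments(
    output_dir="babylm-mop-gpt2",   # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=10,
    max_steps=200_000,              # takes precedence over num_train_epochs when set
    warmup_steps=2_000,
    weight_decay=0.0,
    max_grad_norm=1.0,              # gradient clipping
    report_to="wandb",              # Weights & Biases monitoring (see Technical Specifications)
)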

Tokenization

This model uses MorPiece (cristianochesi/morpiece), a split-based, morphologically-aware tokenizer.

Training Data

The model was trained on the BabyLM 2025 strict track dataset, which includes:

  • Size: 100M words maximum
  • Sources: Child-directed speech and age-appropriate text
  • Language: English
  • Preprocessing: Tokenized using MorPiece tokenizer
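
As a minimal, hypothetical sketch of the stated preprocessing step (not the exact challenge pipeline), the snippet below tokenizes raw text with the released tokenizer and packs it into fixed 512-token blocks matching the training sequence length; "train.txt" is a placeholder corpus file.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm-mop-100m-gpt2")

def pack_into_blocks(texts, block_size=512):
    """Tokenize raw lines and pack them into fixed-length training blocks."""
    ids = []
    for text in texts:
        ids.extend(tokenizer.encode(text))
    # Drop the trailing remainder so every block is exactly block_size tokens long.
    return [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

# Hypothetical plain-text corpus file, one utterance per line.
blocks = pack_into_blocks(open("train.txt").read().splitlines())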

Intended Uses

Primary Use Cases

  • Research into data-efficient language modeling
  • Comparative studies of tokenization methods in low-resource settings
  • Baseline model for BabyLM 2025 Challenge participants
  • Educational purposes for understanding morphological tokenization

Out-of-Scope Uses

  • Production deployments requiring robust, general-purpose language understanding
  • Safety-critical applications
  • Tasks requiring knowledge beyond the training data scope

Performance

The model was trained following BabyLM 2025 Challenge protocols:

  • Training loss: 3.20447
  • Convergence: Achieved after 200,000 training steps

Usage

Loading the Model

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm-mop-100m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm-mop-100m-gpt2")

# Generate text
input_text = "The child played with"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Text Generation Parameters

  • Max Length: 50 tokens (default)
  • Sampling: Enabled by default
  • Temperature: Adjustable (0.8 recommended)

Limitations and Biases

Known Limitations

  • Limited training data (100M words) may result in knowledge gaps
  • Domain specificity due to child-directed speech focus
  • Vocabulary constraints from MorPiece tokenization
  • Context window limited to 1,024 tokens

Potential Biases

  • Age-appropriate content bias from training data selection
  • English language bias (monolingual training)
  • Morphological bias toward Indo-European language patterns
  • Dataset composition bias inherent in BabyLM data curation

Technical Specifications

Training Infrastructure

  • Framework: PyTorch + Transformers
  • Precision: float32
  • Gradient Accumulation: Configured for effective batch size
  • Monitoring: Weights & Biases integration

Model Configuration

{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 40148
}
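
Since BabyLM models are trained from scratch, this configuration can be turned into a freshly initialized model. The sketch below uses the standard GPT2Config fields (n_ctx is expressed via n_positions) and is a minimal illustration rather than the original training setup.

from transformers import GPT2Config, GPT2LMHeadModel

# Build a fresh, randomly initialized model from the configuration above.
config = GPT2Config(
    vocab_size=40148,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",
    attn_pdrop=0.1,
    embd_pdrop=0.1,
    layer_norm_epsilon=1e-5,
)
model = GPT2LMHeadModel(config)
# With the default tied input/output embeddings this reports roughly 117M parameters.
print(f"{model.num_parameters() / 1e6:.1f}M parameters")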

Citation

If you use this model in your research, please cite:

@misc{babylm2025-gpt2-morpiece,
  title={Under review},
  author={[Your Name] and NeTS Lab},
  year={2025},
  url={https://huggingface.co/NeTS-lab/babylm-mop-100m-gpt2}
}

Also consider citing the original BabyLM Challenge and MorPiece tokenizer:

@misc{morpiece2024,
  title={MorPiece: Morphologically-aware Piece Tokenization},
  author={Cristiano Chesi and NeTS Lab},
  year={2024},
  url={https://github.com/cristianochesi/morpiece}
}

Acknowledgments

  • BabyLM 2025 Challenge organizers for providing the framework and dataset
  • MorPiece developers for the split-based tokenization approach
  • Hugging Face Transformers team for the modeling infrastructure

Contact

For questions about this model or the training process, please contact cristiano.chesi@iusspavia.it.


This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.
