BabyLM 2025 GPT-2 with MorPiece Tokenizer (Strict Small Track)

Model Description

This is a GPT-2 language model trained for the BabyLM 2025 Challenge on the Strict Small track, using the MorPiece tokenizer. The model demonstrates how morphologically-aware tokenization can improve language modeling performance when training on limited data (10M words).

  • Developed by: NeTS Lab
  • Model type: Autoregressive Language Model (GPT-2 architecture)
  • Language(s): English
  • License: MIT
  • Parent Model: GPT-2
  • Tokenizer: MorPiece (cristianochesi/morpiece)

Key Features

  • Morphologically-aware tokenization via MorPiece for better handling of word structure (hyperparameters used: min_freq=10, bf=2, cutoff=100)
  • Strict data constraints (10M words) following the BabyLM 2025 Strict Small track
  • Optimized for data efficiency using the default BabyLM 2025 baseline hyperparameter settings
  • 768-dimensional embeddings with 12 attention heads and 12 layers

Model Details

Architecture

  • Base Architecture: GPT-2 (12 layers, 12 attention heads)
  • Hidden Size: 768
  • Vocabulary Size: 40,148 (MorPiece tokens)
  • Context Length: 1,024 tokens
  • Parameters: ~147M (estimated)
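
The ~147M figure can be sanity-checked with a back-of-the-envelope count from the architecture above. The sketch below is illustrative only: it assumes standard GPT-2 shapes (fused QKV projection, 4x MLP expansion) and shows that ~147M corresponds to counting the LM head separately from the token embedding; with weight tying the count is closer to 117M.

# Rough GPT-2 parameter count from the architecture listed above (illustrative only).
n_layer, n_embd, n_ctx, vocab = 12, 768, 1024, 40148

embeddings = vocab * n_embd + n_ctx * n_embd       # token + position embeddings
per_block = (
    2 * 2 * n_embd                                 # two LayerNorms (weight + bias)
    + n_embd * 3 * n_embd + 3 * n_embd             # fused QKV projection
    + n_embd * n_embd + n_embd                     # attention output projection
    + n_embd * 4 * n_embd + 4 * n_embd             # MLP up-projection (4x expansion)
    + 4 * n_embd * n_embd + n_embd                 # MLP down-projection
)
transformer = n_layer * per_block + 2 * n_embd     # blocks + final LayerNorm

tied = embeddings + transformer                    # LM head tied to the token embedding
untied = tied + vocab * n_embd                     # LM head counted as a separate matrix
print(f"tied: {tied/1e6:.1f}M, untied: {untied/1e6:.1f}M")  # ~116.7M vs ~147.5M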

Training Configuration

  • Training Type: Strict (BabyLM 2025 guidelines)
  • Dataset Size: 10M words maximum
  • Sequence Length: 512 tokens
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Training Steps: 200,000
  • Warmup Steps: 2,000
  • Epochs: 10
  • Weight Decay: 0.0
  • Gradient Clipping: 1.0
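
For reference, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows. This is a minimal sketch assuming the standard Trainer API, not the exact training script used for the challenge; the output directory is an illustrative placeholder.

from transformers import TrainingArguments

# Sketch of the training configuration listed above.
training_args = TrainingArguments(
    output_dir="babylm-mop-gpt2",   # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=10,
    max_steps=200_000,              # takes precedence over num_train_epochs when set
    warmup_steps=2_000,
    weight_decay=0.0,
    max_grad_norm=1.0,              # gradient clipping
    report_to="wandb",              # Weights & Biases monitoring (see Technical Specifications)
)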

Tokenization

This model uses MorPiece (cristianochesi/morpiece), a split-based, morphologically-aware tokenizer.

Training Data

The model was trained on the BabyLM 2025 strict track dataset, which includes:

  • Size: 100M words maximum
  • Sources: Child-directed speech and age-appropriate text
  • Language: English
  • Preprocessing: Tokenized using MorPiece tokenizer
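
As a minimal, hypothetical sketch of the stated preprocessing step (not the exact challenge pipeline), the snippet below tokenizes raw text with the released tokenizer and packs it into fixed 512-token blocks matching the training sequence length; "train.txt" is a placeholder corpus file.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm-mop-100m-gpt2")

def pack_into_blocks(texts, block_size=512):
    """Tokenize raw lines and pack them into fixed-length training blocks."""
    ids = []
    for text in texts:
        ids.extend(tokenizer.encode(text))
    # Drop the trailing remainder so every block is exactly block_size tokens long.
    return [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

# Hypothetical plain-text corpus file, one utterance per line.
blocks = pack_into_blocks(open("train.txt").read().splitlines())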

Intended Uses

Primary Use Cases

  • Research into data-efficient language modeling
  • Comparative studies of tokenization methods in low-resource settings
  • Baseline model for BabyLM 2025 Challenge participants
  • Educational purposes for understanding morphological tokenization

Out-of-Scope Uses

  • Production deployments requiring robust, general-purpose language understanding
  • Safety-critical applications
  • Tasks requiring knowledge beyond the training data scope

Performance

The model was trained following BabyLM 2025 Challenge protocols:

  • Training loss: 3.20447
  • Convergence: Achieved after 200,000 training steps

Usage

Loading the Model

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm-mop-100m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm-mop-100m-gpt2")

# Generate text
input_text = "The child played with"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Text Generation Parameters

  • Max Length: 50 tokens (default)
  • Sampling: Enabled by default
  • Temperature: Adjustable (0.8 recommended)

Limitations and Biases

Known Limitations

  • Limited training data (100M words) may result in knowledge gaps
  • Domain specificity due to child-directed speech focus
  • Vocabulary constraints from MorPiece tokenization
  • Context window limited to 1,024 tokens

Potential Biases

  • Age-appropriate content bias from training data selection
  • English language bias (monolingual training)
  • Morphological bias toward Indo-European language patterns
  • Dataset composition bias inherent in BabyLM data curation

Technical Specifications

Training Infrastructure

  • Framework: PyTorch + Transformers
  • Precision: float32
  • Gradient Accumulation: Configured for effective batch size
  • Monitoring: Weights & Biases integration

Model Configuration

{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 40148
}
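
Since BabyLM models are trained from scratch, this configuration can be turned into a freshly initialized model. The sketch below uses the standard GPT2Config fields (n_ctx is expressed via n_positions) and is a minimal illustration rather than the original training setup.

from transformers import GPT2Config, GPT2LMHeadModel

# Build a fresh, randomly initialized model from the configuration above.
config = GPT2Config(
    vocab_size=40148,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",
    attn_pdrop=0.1,
    embd_pdrop=0.1,
    layer_norm_epsilon=1e-5,
)
model = GPT2LMHeadModel(config)
# With the default tied input/output embeddings this reports roughly 117M parameters.
print(f"{model.num_parameters() / 1e6:.1f}M parameters")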

Citation

If you use this model in your research, please cite:

@misc{babylm2025-gpt2-morpiece,
  title={Under review},
  author={[Your Name] and NeTS Lab},
  year={2025},
  url={https://huggingface.co/NeTS-lab/babylm-mop-100m-gpt2}
}

Also consider citing the original BabyLM Challenge and MorPiece tokenizer:

@misc{morpiece2024,
  title={MorPiece: Morphologically-aware Piece Tokenization},
  author={Cristiano Chesi and NeTS Lab},
  year={2024},
  url={https://github.com/cristianochesi/morpiece}
}

Acknowledgments

  • BabyLM 2025 Challenge organizers for providing the framework and dataset
  • MorPiece developers for the split-based tokenization approach
  • Hugging Face Transformers team for the modeling infrastructure

Contact

For questions about this model or the training process, please contact cristiano.chesi@iusspavia.it.


This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.
