---
library_name: transformers
license: cc-by-4.0
language:
- en
datasets:
- babylm-anon/stratified_10m_curriculum
---

# Model Card for TICL

A RoBERTa model pre-trained on a dataset of 10M words using (**T**raining Data) **I**nfluence-driven **C**urriculum **L**earning.
|
## Model Details

See our paper at REDACTED for details on our method.

### Model Description

This is a model submitted to the strict-small track of the 2025 BabyLM Challenge.

- **Developed by:** REDACTED
- **Funded by:** REDACTED
- **Model type:** Masked language model
- **Language(s) (NLP):** English (eng)
- **License:** CC-BY-4.0
|
### Model Sources

- **Repository:** [https://anonymous.4open.science/r/cl-4B5C](https://anonymous.4open.science/r/cl-4B5C)
|
## Uses

This model was trained to demonstrate the effectiveness of a novel curriculum learning method compared to training on the same data in random order.
|
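The model uses the standard `transformers` masked-language-modelling interface. Below is a minimal, illustrative fill-mask sketch; the checkpoint identifier `babylm-anon/ticl` is a placeholder, since the released repository id is anonymized.

```python
from transformers import pipeline

# Placeholder model id: substitute the actual repository id once de-anonymized.
MODEL_ID = "babylm-anon/ticl"

# RoBERTa-style masked language modelling via the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

for prediction in fill_mask("The children read a <mask> before bed."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```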
|
## Training Details

### Training Data

We use [this](https://huggingface.co/datasets/babylm-anon/stratified_10m_curriculum) dataset, built from the following existing corpora:
|
- C1: Child Directed Speech
  - CHILDES [(MacWhinney 2000)](https://doi.org/10.4324/9781315805641)
- C2: Children's Books
  - [Children Stories Text Corpus](https://www.kaggle.com/datasets/edenbd/children-stories-text-corpus) [(Bensaid et al. 2021)](https://doi.org/10.48550/arXiv.2108.04324)
  - Children's Book Test [(Hill et al. 2016)](https://doi.org/10.48550/arXiv.1511.02301)
- C3: Dialogue
  - OpenSubtitles [(Lison and Tiedemann 2016)](https://aclanthology.org/L16-1147)
  - Switchboard Dialog Act Corpus [(Stolcke et al. 2000)](https://aclanthology.org/J00-3003)
  - [British National Corpus (BNC), dialogue portion](http://hdl.handle.net/20.500.14106/2554)
- C4: Educational
  - Simple Wiki [(Warstadt et al. 2023)](https://doi.org/10.48550/arXiv.2301.11796)
  - [QED](https://opus.nlpl.eu/download.php?f=QED/v2.0a/xml/en.zip) [(Abdelali et al. 2014)](https://aclanthology.org/L14-1675/)
- C5: Written English
  - Standardized Project Gutenberg Corpus [(Gerlach and Font-Clos 2018)](https://arxiv.org/abs/1812.08092)
  - Wikipedia [(Warstadt et al. 2023)](https://arxiv.org/abs/2301.11796)
|
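The dataset linked above can be inspected directly with the `datasets` library. This is a minimal sketch; the `train` split name and the column layout are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Dataset referenced in this card; the split name below is an assumption.
ds = load_dataset("babylm-anon/stratified_10m_curriculum", split="train")

print(ds)      # number of rows and column names
print(ds[0])   # first example in the released ordering
```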
|
### Data Mix

| Class                      |     Words | % of words | Documents | % of documents |
|:---------------------------|----------:|-----------:|----------:|---------------:|
| C1: Child Directed Speech  | 1,999,999 |     20.00% |   360,533 |         33.68% |
| C2: Children's Books       | 1,999,995 |     20.00% |    77,384 |          7.23% |
| C3: Dialogue               | 1,999,987 |     20.00% |   349,650 |         32.67% |
| C4: Educational            | 1,999,999 |     20.00% |   161,554 |         15.09% |
| C5: Written English        | 1,999,945 |     20.00% |   121,200 |         11.32% |
|
### Training Procedure

We extract training-data influence estimates from models trained in random order and sort the training data according to those estimates, using the various strategies detailed in the paper.
This is the overall best-performing model in our experiments: it was trained in order of increasing influence, with examples re-weighted by a lognormal filter (see the paper for details).
|
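As a rough, illustrative sketch of this kind of ordering (not the authors' exact procedure, which is described in the paper), the snippet below sorts examples by a precomputed influence score and derives per-position sampling weights from a lognormal density over normalized rank; the rank-based weighting and the `mu`/`sigma` values are assumptions.

```python
import numpy as np

def curriculum_order_and_weights(influence, mu=0.0, sigma=1.0):
    """Order examples by increasing influence and attach lognormal weights.

    `influence` holds per-example influence estimates obtained from a model
    trained in random order. The lognormal parameters are illustrative
    defaults, not the values used in the paper.
    """
    influence = np.asarray(influence, dtype=np.float64)
    order = np.argsort(influence)                 # least to most influential

    # Lognormal density over normalized rank positions in (0, 1].
    ranks = np.arange(1, len(order) + 1) / len(order)
    weights = np.exp(-((np.log(ranks) - mu) ** 2) / (2 * sigma ** 2)) / (
        ranks * sigma * np.sqrt(2 * np.pi)
    )
    weights /= weights.sum()                      # normalize to a distribution

    return order, weights

# Toy example with five documents.
order, weights = curriculum_order_and_weights([0.3, -0.1, 0.8, 0.0, 0.5])
print(order)    # [1 3 0 4 2]
print(weights)  # sampling weight for each curriculum position
```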
|
#### Training Hyperparameters

We employ a novel curriculum learning strategy in which the model is trained in non-random order on a total of 100M words (repeated passes over the 10M-word dataset).
|
| Parameter                   | Value      |
|-----------------------------|------------|
| **Shared Hyperparameters**  |            |
| Vocabulary size             | 52k        |
| Hidden size                 | 768        |
| Number of layers            | 12         |
| Number of attention heads   | 12         |
| Initializer range           | 0.02       |
| Tie word embeddings         | True       |
| **Model-Specific Settings** |            |
| Max position embeddings     | 514        |
| Intermediate (FFN) size     | 3072       |
| Layer norm epsilon          | 1e-5       |
| Attention dropout           | 0.1        |
| Activation function         | gelu       |
| Hidden dropout              | 0.1        |
| **Training Setup**          |            |
| FP16                        | False      |
| Per-device batch size       | 32         |
| Gradient accumulation steps | 16         |
| GPUs                        | 4          |
| Adam β₁                     | 0.9        |
| Adam β₂                     | 0.98       |
| Adam ε                      | 1e-6       |
| Weight decay                | 0.01       |
| Learning rate               | 5e-4       |
| LR scheduler                | polynomial |
|
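For reference, the architecture rows above map onto a `transformers` RoBERTa configuration roughly as follows. This is a sketch reconstructed from the table, not the authors' released config: the exact 52k vocabulary size and any field not listed in the table are assumptions left at library defaults.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Architecture settings copied from the hyperparameter table above.
# vocab_size is "52k" in the table; the exact value is an assumption here.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
    layer_norm_eps=1e-5,
    attention_probs_dropout_prob=0.1,
    hidden_dropout_prob=0.1,
    hidden_act="gelu",
    initializer_range=0.02,
    tie_word_embeddings=True,
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```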
|
## Evaluation

We use the [evaluation pipeline](https://github.com/babylm/evaluation-pipeline-2025) of the 2025 BabyLM Challenge.
|
### Results

| Task                   | Score     |
|------------------------|-----------|
| (Super)GLUE            | 0.579     |
| blimp_filtered         | 0.688     |
| supplement_filtered    | 0.559     |
| entity_tracking        | 0.302     |
| ewok_filtered          | 0.509     |
| wug_adj_nominalization | 0.570     |
| **Macro accuracy**     | **0.584** |
|
## Model Card Contact

REDACTED