Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training • Paper • 2506.01732 • Published Jun 2, 2025 • 6
Explaining and Mitigating Crosslingual Tokenizer Inequities • Paper • 2510.21909 • Published Oct 24, 2025
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures • Paper • 2510.24081 • Published Oct 28, 2025 • 19
BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models • Paper • 2510.19419 • Published Oct 22, 2025 • 1
Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction • Paper • 2510.20411 • Published Oct 23, 2025 • 2
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data • Paper • 2510.10159 • Published Oct 11, 2025 • 3
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling • Paper • 2510.08470 • Published Oct 9, 2025 • 1
Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research • Paper • 2509.16413 • Published Sep 19, 2025 • 1
Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages • Paper • 2509.02160 • Published Sep 2, 2025 • 1
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization • Paper • 2505.24689 • Published May 30, 2025 • 1
Evaluating Morphological Alignment of Tokenizers in 70 Languages • Paper • 2507.06378 • Published Jul 8, 2025
Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies • Paper • 2410.22886 • Published Oct 30, 2024 • 1
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training • Paper • 2409.04599 • Published Sep 6, 2024 • 2
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models • Paper • 2311.09194 • Published Nov 15, 2023
Toxicity of the Commons: Curating Open-Source Pre-Training Data • Paper • 2410.22587 • Published Oct 29, 2024 • 10
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement • Paper • 2403.13754 • Published Mar 20, 2024