Releasing the largest multilingual open pretraining dataset Article • Pclanglais • Nov 13, 2024 • 108
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29, 2024 • 11