Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family Paper • 2504.18225 • Published Apr 25, 2025 • 15
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages Paper • 2411.14343 • Published Nov 21, 2024 • 7
Article Releasing the largest multilingual open pretraining dataset Pclanglais • Nov 13, 2024 • 107
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29, 2024 • 11
Article OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B PandorAI1995 • Oct 18, 2024 • 17
Article SmolLM - blazingly fast and remarkably powerful +1 loubnabnl, anton-l, eliebak • Jul 16, 2024 • 455
Article The case for specialized pre-training: ultra-fast foundation models for dedicated tasks Pclanglais • Aug 4, 2024 • 30
OpenCulture Collection A multilingual dataset of public domain books and newspapers. • 25 items • Updated Mar 2 • 134