Guilherme Penedo

guipenedo
https://guipenedo.com
  • gui_penedo
  • guipenedo

Organizations

  • BigScience Data
  • Hugging Face Extreme-Scale
  • mlo-data-cleaning
  • Dev Mode Explorers
  • ml-fw-prerelease
  • Macrodata Labs

authored 2 papers 9 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26, 2025 • 77

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5, 2025 • 60
authored 2 papers about 1 year ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4, 2025 • 256

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14, 2025 • 62
authored a paper over 1 year ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 100
authored a paper over 2 years ago

The Falcon Series of Open Language Models

Paper • 2311.16867 • Published Nov 28, 2023 • 14
authored a paper almost 3 years ago

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Paper • 2306.01116 • Published Jun 1, 2023 • 43