Citaman's Collections

LLM From Scratch - Datasets

Updated Mar 14

  • Skylion007/openwebtext

    Viewer • Updated Dec 26, 2025 • 8.01M • 69.3k • 522

  • JeanKaddour/minipile

    Viewer • Updated Jun 20, 2023 • 1.01M • 4.21k • 149

  • Locutusque/TM-DATA

    Viewer • Updated Oct 15, 2024 • 2.77M • 100 • 12

  • PleIAs/French-PD-Newspapers

    Viewer • Updated Mar 19, 2024 • 2.25M • 735 • 69

  • euclaise/MiniCoT

    Viewer • Updated Jan 23, 2024 • 129k • 76 • 8

  • euirim/goodwiki

    Viewer • Updated Sep 11, 2023 • 44.8k • 274 • 54

  • euclaise/mathoverflow-accepted

    Viewer • Updated Oct 20, 2023 • 62.6k • 33 • 4

  • Locutusque/UltraTextbooks

    Viewer • Updated Feb 2, 2024 • 5.52M • 2.13k • 200

  • TempoFunk/webvid-10M

    Viewer • Updated Aug 19, 2023 • 10.7M • 5.56k • 96

  • HuggingFaceTB/cosmopedia

    Viewer • Updated Aug 12, 2024 • 31.1M • 18.3k • 722

  • HuggingFaceGECLM/REDDIT_submissions

    Viewer • Updated Mar 17, 2023 • 47.2M • 438 • 11

  • togethercomputer/RedPajama-Data-V2

    Updated Nov 21, 2024 • 9.59k • 404

  • stepfun-ai/Step-3.5-Flash-SFT

    Viewer • Updated Mar 14 • 1.62M • 4.25k • 342