Article: Scaling AI-based Data Processing with Hugging Face + Dask • scj13, jrbourbeau, lhoestq, davanstrien • Oct 9, 2024 • 33
Collection: CC-domain-counts • Dumps of aggregate URL counts by domain from Common Crawl snapshots • 96 items • Updated Jan 15, 2025 • 1
Collection: LLM-training-URLs • Lists of URLs from various training datasets • 3 items • Updated Dec 21, 2024 • 1