nyuuzyou
Recent Activity
[bot] Conversion to Parquet
Introducing Storage Buckets on the Hugging Face Hub
Introducing Buckets: S3-like storage on the Hub
good work!
The pipeline adapts to the source. It begins with collecting target URLs from sitemaps or APIs into a text file, which also serves to track progress. I then fetch the content concurrently: Go with 50 to 200 goroutines handles large scrapes, while Python's ThreadPoolExecutor works for smaller jobs. This stage requires retry logic, rate limiting, and checkpoint files so that interrupted downloads can resume. The custom work happens during parsing, since every site structures its data differently. I extract the target data using BeautifulSoup or goquery for HTML and standard parsers for APIs. I then filter the output to drop binaries, validate UTF-8, and skip generated files using tools like go-enry. The clean data gets written to an intermediate JSONL format, appended under a file lock for thread safety. Finally, I convert the JSONL files to Parquet using DuckDB, PyArrow, or parquet-go. The Parquet files are compressed with Zstandard at level 19, with row groups of 10K to 100K rows and shards of 512MB to 1GB. In short, Go handles the high-throughput scraping, Python manages the custom parsing, and DuckDB takes care of the format conversions.
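As a rough illustration of the smaller-job Python path described above, here is a minimal sketch of the fetch stage with retries, a checkpoint file, and locked JSONL appends. It is not the author's actual code: the function names (`http_fetch`, `fetch_with_retry`, `scrape`), the record fields (`url`, `text`), and the file layout are all hypothetical. It also swaps in an in-process `threading.Lock` where the post mentions a file lock, and it leaves out rate limiting, parsing, and the Parquet conversion.

```python
import json
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen


def http_fetch(url, timeout=30.0):
    """Default fetcher (hypothetical helper): plain GET via the standard library."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=http_fetch, attempts=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on any error."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def scrape(urls, out_path, done_path, fetch=http_fetch, workers=8,
           attempts=3, base_delay=1.0):
    """Fetch urls concurrently, append records to a JSONL file, and
    checkpoint finished URLs. Returns the number of records written."""
    out_path, done_path = Path(out_path), Path(done_path)
    # Resume support: URLs already listed in the checkpoint file are skipped.
    done = set()
    if done_path.exists():
        done = set(done_path.read_text(encoding="utf-8").splitlines())
    pending = [u for u in urls if u not in done]
    # Assumption: one process, many threads, so an in-process lock is enough.
    lock = threading.Lock()

    def work(url):
        try:
            raw = fetch_with_retry(url, fetch, attempts, base_delay)
            text = raw.decode("utf-8")  # strict decode doubles as UTF-8 validation
        except Exception:
            return False  # not checkpointed, so a later run will try it again
        with lock:
            with out_path.open("a", encoding="utf-8") as out:
                record = {"url": url, "text": text}
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
            with done_path.open("a", encoding="utf-8") as ck:
                ck.write(url + "\n")
        return True

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, pending))
```

Calling `scrape(urls, "pages.jsonl", "done.txt")` a second time after an interruption only fetches the URLs missing from `done.txt`. One design caveat of this sketch: a URL that fails permanently (for example, binary content that is not valid UTF-8) is retried on every rerun, so a real pipeline would probably record such failures separately.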
Dataset: ajibawa-2023/Python-Code-Large
Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.
By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.
Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.