nyuuzyou
posted an update 13 days ago
πŸ›οΈ Google Code Archive Dataset - nyuuzyou/google-code-archive

Expanding beyond the modern code series, this release presents a massive historical snapshot from the Google Code Archive. This dataset captures the open-source landscape from 2006 to 2016, offering a unique time capsule of software development patterns during the era before GitHub's dominance.

Key Stats:

- 65,825,565 files from 488,618 repositories
- 47 GB compressed Parquet storage
- 454 programming languages (heavily featuring Java, PHP, and C++)
- Extensive quality filtering (excluding vendor code and build artifacts)
- Rich historical metadata: original repo names, file paths, and era-specific licenses
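
For anyone who wants to check the language mix without downloading all 47 GB, here is a minimal sketch that streams the dataset and tallies languages over a small sample. It assumes the Hugging Face `datasets` library, a `train` split, and a per-file column named `language`; the split and column name are guesses that are not confirmed above, and `count_languages` and `sample_top_languages` are hypothetical helper names, so check the dataset card before relying on them.

```python
from collections import Counter
from itertools import islice


def count_languages(rows, language_key="language", limit=1000):
    """Tally one field over the first `limit` rows of an iterable of dicts."""
    counts = Counter()
    for row in islice(rows, limit):
        # "language" is an assumed column name; rows that lack it
        # (or hold an empty value) are grouped under "unknown".
        counts[row.get(language_key) or "unknown"] += 1
    return counts


def sample_top_languages(limit=1000, top=10):
    """Stream the dataset and print the most common languages in a sample."""
    # Imported here so count_languages works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the repository exposes a "train" split.
    # streaming=True iterates over rows without downloading every file first.
    stream = load_dataset(
        "nyuuzyou/google-code-archive", split="train", streaming=True
    )
    for language, n in count_languages(stream, limit=limit).most_common(top):
        print(f"{language}: {n}")
```

Calling `sample_top_languages()` needs network access. The sample is simply the first rows of the stream, so the counts only hint at the overall distribution and are not a representative estimate.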

This is one of the releases I'm most interested in getting feedback on. Would you like to see more old code datasets?