nyuuzyou 
posted an update Jan 14
🇨🇳 GitCode Dataset - Continuing the Chinese Code Series: nyuuzyou/gitcode-code

Following up on the Gitee release, here's another major Chinese code dataset from GitCode (CSDN's code hosting platform). Same pipeline, same clean format, more valuable data from China's developer ecosystem.

Key Stats:
- 48,142,567 files from 85,632 repositories
- 40 GB compressed Parquet storage
- 537 programming languages
- Extensive quality filtering applied
- Rich metadata: repo names, file paths, licenses, and sizes
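
For anyone who wants to poke at the data without downloading all 40 GB, here is a minimal sketch using the `datasets` library in streaming mode. The split name `train` and the column names `license` and `size` are assumptions based on the metadata listed above, not confirmed names; check the dataset card for the actual schema. The helper `filter_rows` is hypothetical and works on any iterable of dicts.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Set


def filter_rows(
    rows: Iterable[Dict[str, Any]],
    licenses: Set[str],
    max_size: int,
) -> Iterator[Dict[str, Any]]:
    """Yield rows with an allowed license and a size of at most max_size.

    Assumes each row has "license" and "size" keys (hypothetical names).
    Rows with a missing size are treated as size 0.
    """
    for row in rows:
        if row.get("license") in licenses and (row.get("size") or 0) <= max_size:
            yield row


def stream_gitcode(limit: int = 5) -> list:
    """Stream the first `limit` rows without downloading the full dataset."""
    # Imported here so the filter above stays usable without `datasets`.
    from datasets import load_dataset

    # The "train" split is an assumption; adjust to the dataset's real splits.
    ds = load_dataset("nyuuzyou/gitcode-code", split="train", streaming=True)
    return list(islice(ds, limit))
```

Calling `stream_gitcode()` needs network access and `pip install datasets`; once the real column names are known, something like `filter_rows(stream_gitcode(1000), {"MIT"}, 100_000)` would narrow the sample to small, permissively licensed files.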

The final dataset in the Chinese code series is also available: nyuuzyou/jihulab-code. It's smaller but uses the same pipeline and format.