tags:
- python
- code
CodeParrot 🦜 Dataset
What is it?
This is the full CodeParrot dataset. It contains the Python files used to train the code generation model in Chapter 10, "Training Transformers from Scratch", of the NLP with Transformers book. You can find the full code in the accompanying GitHub repository.
Creation
It was created from the GitHub dataset available via Google's BigQuery. It contains approximately 22 million Python files and is about 180 GB in size (50 GB compressed). The following SQL query was used to create the dataset:
SELECT
    f.repo_name, f.path, c.copies, c.size, c.content, l.license
FROM
    `bigquery-public-data.github_repos.files` AS f
JOIN
    `bigquery-public-data.github_repos.contents` AS c
ON
    f.id = c.id
JOIN
    `bigquery-public-data.github_repos.licenses` AS l
ON
    f.repo_name = l.repo_name
WHERE
    NOT c.binary
    AND ((f.path LIKE '%.py')
        AND (c.size BETWEEN 1024 AND 1048575))
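The WHERE clause keeps only non-binary files whose path ends in `.py` and whose size is between 1024 and 1048575 bytes, inclusive. As a rough illustration, the same filter could be expressed in Python as below. The function name `matches_query_filter` and the constant names are hypothetical and not part of the dataset tooling; the sketch only mirrors the conditions in the query.

```python
# Size bounds taken from the BETWEEN clause of the SQL query (inclusive).
MIN_SIZE = 1024
MAX_SIZE = 1048575


def matches_query_filter(path: str, size: int, binary: bool) -> bool:
    """Return True if a file would pass the WHERE clause of the query.

    Hypothetical helper: it mirrors `NOT c.binary`, `f.path LIKE '%.py'`
    and `c.size BETWEEN 1024 AND 1048575`.
    """
    return (
        not binary
        and path.endswith(".py")  # LIKE '%.py' matches any path ending in .py
        and MIN_SIZE <= size <= MAX_SIZE  # BETWEEN includes both endpoints
    )
```

Files smaller than 1 KiB or at least 1 MiB are excluded by the size bounds.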
Duplication
Note that about 70% of the dataset is duplicated. If you use the dataset, make sure to handle the duplicates appropriately. See codeparrot-clean for a deduplicated version of this dataset.
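One simple way to handle duplicates is to drop files whose content is byte-for-byte identical to one already seen. The sketch below assumes each record is a dict with a `content` field, matching the `c.content` column in the query. The function name `drop_exact_duplicates` is hypothetical, and this is not necessarily the procedure used to build codeparrot-clean. Near-duplicates, such as files that differ only in whitespace or comments, are not detected.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def drop_exact_duplicates(records: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield only the first record for each distinct `content` value.

    Hypothetical helper: hashes the content so that only fixed-size digests,
    not the full file contents, are kept in memory.
    """
    seen = set()
    for record in records:
        digest = hashlib.sha256(record["content"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical file was already yielded
        seen.add(digest)
        yield record


# Illustrative records, not taken from the dataset.
sample = [
    {"path": "a.py", "content": "print('hello')\n"},
    {"path": "b.py", "content": "print('hello')\n"},
    {"path": "c.py", "content": "print('world')\n"},
]
unique = list(drop_exact_duplicates(sample))
```

Because the function is a generator, it can be applied to a streamed dataset without loading every file into memory at once. Only the set of digests grows with the number of unique files.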