46.4 GB
186 files
Updated about 20 hours ago
Name                         Size
.gitattributes               731 Bytes
README.md                    1.39 kB
file-000000000000.json.gz    255 MB
file-000000000001.json.gz    253 MB
file-000000000002.json.gz    254 MB
file-000000000003.json.gz    246 MB
file-000000000004.json.gz    252 MB
file-000000000005.json.gz    255 MB
file-000000000006.json.gz    253 MB
file-000000000007.json.gz    252 MB
file-000000000008.json.gz    253 MB
file-000000000009.json.gz    252 MB
file-000000000010.json.gz    250 MB
file-000000000011.json.gz    253 MB
file-000000000012.json.gz    254 MB
file-000000000013.json.gz    253 MB
file-000000000014.json.gz    254 MB
file-000000000015.json.gz    255 MB
file-000000000016.json.gz    250 MB
file-000000000017.json.gz    254 MB
file-000000000018.json.gz    251 MB
file-000000000019.json.gz    250 MB
file-000000000020.json.gz    250 MB
file-000000000021.json.gz    252 MB
file-000000000022.json.gz    254 MB
file-000000000023.json.gz    250 MB
file-000000000024.json.gz    253 MB
file-000000000025.json.gz    251 MB
file-000000000026.json.gz    252 MB
file-000000000027.json.gz    253 MB
file-000000000028.json.gz    249 MB
file-000000000029.json.gz    251 MB
file-000000000030.json.gz    251 MB
file-000000000031.json.gz    252 MB
file-000000000032.json.gz    254 MB
file-000000000033.json.gz    251 MB
file-000000000034.json.gz    249 MB
file-000000000035.json.gz    252 MB
file-000000000036.json.gz    252 MB
file-000000000037.json.gz    252 MB
file-000000000038.json.gz    250 MB
file-000000000039.json.gz    253 MB
file-000000000040.json.gz    255 MB
file-000000000041.json.gz    252 MB
file-000000000042.json.gz    256 MB
file-000000000043.json.gz    254 MB
file-000000000044.json.gz    251 MB
file-000000000045.json.gz    249 MB
file-000000000046.json.gz    252 MB
file-000000000047.json.gz    251 MB
file-000000000048.json.gz    250 MB
file-000000000049.json.gz    251 MB
file-000000000050.json.gz    250 MB
file-000000000051.json.gz    254 MB
file-000000000052.json.gz    253 MB
file-000000000053.json.gz    255 MB
file-000000000054.json.gz    253 MB
file-000000000055.json.gz    252 MB
file-000000000056.json.gz    251 MB
file-000000000057.json.gz    254 MB
file-000000000058.json.gz    250 MB
file-000000000059.json.gz    250 MB
file-000000000060.json.gz    252 MB
file-000000000061.json.gz    251 MB
file-000000000062.json.gz    252 MB
file-000000000063.json.gz    252 MB
file-000000000064.json.gz    251 MB
file-000000000065.json.gz    253 MB
file-000000000066.json.gz    252 MB
file-000000000067.json.gz    249 MB
file-000000000068.json.gz    249 MB
file-000000000069.json.gz    252 MB
file-000000000070.json.gz    256 MB
file-000000000071.json.gz    253 MB
file-000000000072.json.gz    251 MB
file-000000000073.json.gz    252 MB
file-000000000074.json.gz    253 MB
file-000000000075.json.gz    255 MB
file-000000000076.json.gz    255 MB
file-000000000077.json.gz    251 MB
file-000000000078.json.gz    254 MB
file-000000000079.json.gz    253 MB
file-000000000080.json.gz    252 MB
file-000000000081.json.gz    253 MB
file-000000000082.json.gz    251 MB
file-000000000083.json.gz    254 MB
file-000000000084.json.gz    252 MB
file-000000000085.json.gz    253 MB
file-000000000086.json.gz    253 MB
file-000000000087.json.gz    256 MB
file-000000000088.json.gz    250 MB
file-000000000089.json.gz    253 MB
file-000000000090.json.gz    252 MB
file-000000000091.json.gz    250 MB
file-000000000092.json.gz    253 MB
file-000000000093.json.gz    252 MB
file-000000000094.json.gz    253 MB
file-000000000095.json.gz    252 MB
file-000000000096.json.gz    252 MB
file-000000000097.json.gz    250 MB

README.md

CodeParrot 🦜 Dataset

What is it?

This is the full CodeParrot dataset. It contains the Python files used to train the code generation model in Chapter 10, "Training Transformers from Scratch", of the NLP with Transformers book. The full code is in the accompanying GitHub repository.

Creation

It was created from the GitHub dataset available through Google's BigQuery. It contains approximately 22 million Python files and is 180 GB in size (50 GB compressed). The following SQL query creates the dataset:

SELECT
  f.repo_name, f.path, c.copies, c.size, c.content, l.license
FROM
  `bigquery-public-data.github_repos.files` AS f
JOIN
  `bigquery-public-data.github_repos.contents` AS c
ON
  f.id = c.id
JOIN
  `bigquery-public-data.github_repos.licenses` AS l
ON
  f.repo_name = l.repo_name 
WHERE
  NOT c.binary
    AND ((f.path LIKE '%.py')
      AND (c.size BETWEEN 1024 AND 1048575))
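
Each record produced by this query carries the six selected columns (repo_name, path, copies, size, content, license). The following is a minimal sketch of reading one downloaded shard, assuming the file-*.json.gz shards are gzip-compressed JSON Lines (one JSON object per line); that layout is an assumption and is not stated on this page. The helper names iter_records and matches_query_filter are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_records(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file.

    Assumption: each line of the shard is a standalone JSON object.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def matches_query_filter(record: Dict) -> bool:
    """Mirror the path and size conditions of the WHERE clause.

    The `NOT c.binary` condition cannot be re-checked here because the
    `binary` column is not part of the SELECT list.
    """
    # int() accepts both a JSON number and a numeric string for `size`.
    size = int(record["size"])
    return record["path"].endswith(".py") and 1024 <= size <= 1048575
```

Streaming line by line keeps memory use flat regardless of shard size, which matters for shards of roughly 250 MB compressed.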

Duplication

Note that about 70% of the dataset is duplicated. If you use the dataset, make sure to handle the duplicates appropriately. See codeparrot-clean for a deduplicated version of this dataset.
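
One simple way to handle duplicates is exact-match filtering on the content field. The sketch below is an illustration under that assumption and is not necessarily the procedure used to build codeparrot-clean; the function name drop_exact_duplicates is hypothetical, and records are assumed to be dicts with a string content field.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def drop_exact_duplicates(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only the first record seen for each distinct `content` value.

    Only byte-identical contents are treated as duplicates; near-duplicates
    (for example, files differing by whitespace) pass through unchanged.
    """
    seen = set()
    for record in records:
        # Store a fixed-size digest instead of the full file text.
        digest = hashlib.sha256(record["content"].encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield record
```

Hashing keeps one 32-byte digest per unique file in memory rather than the file contents themselves, so the filter can run as a single streaming pass over all shards.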

Last updated: May 30