Spaces: codeparrot/code-generation-models

loubnabnl HF Staff committed on May 25, 2022

Commit 4a8f8af, 1 parent: 46dbbb1

update

Files changed (1)

datasets/github_code.txt ADDED

+We also released the [GitHub Code dataset](https://huggingface.co/datasets/lvwerra/github-code), 1TB of code from GitHub repositories covering 32 programming languages. If you don't want to download it in full, for example because of memory constraints, you can load it in streaming mode, which creates an iterable dataset:
+```python
+from datasets import load_dataset
+ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
+print(next(iter(ds)))
+#OUTPUT:
+{
+ 'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
+ 'repo_name': 'MirekSz/webpack-es6-ts',
+ 'path': 'app/mods/mod190.js',
+ 'language': 'JavaScript',
+ 'license': 'isc',
+ 'size': 73
+}
+```
+In addition to the code, each sample includes metadata: the repository name, file path, language, license, and file size.
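
The metadata fields listed in the added file make it straightforward to select a subset of samples while iterating. The following is a minimal sketch, not part of the commit: `filter_samples` and its parameters are hypothetical names introduced here, and the second record in `samples` is made up for illustration. It only assumes that each sample is a dict with the `language`, `license`, and `size` keys shown in the output above.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, Set


def filter_samples(
    samples: Iterable[dict],
    languages: Set[str],
    licenses: Optional[Set[str]] = None,
    max_size: Optional[int] = None,
) -> Iterator[dict]:
    """Lazily yield samples whose metadata matches the given criteria."""
    for sample in samples:
        if sample["language"] not in languages:
            continue
        if licenses is not None and sample["license"] not in licenses:
            continue
        if max_size is not None and sample["size"] > max_size:
            continue
        yield sample


# Stand-in records with the same fields as the output above.
# The second record is made up for illustration.
samples = [
    {"code": "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
     "repo_name": "MirekSz/webpack-es6-ts", "path": "app/mods/mod190.js",
     "language": "JavaScript", "license": "isc", "size": 73},
    {"code": "print('hello')\n", "repo_name": "example/repo", "path": "hello.py",
     "language": "Python", "license": "mit", "size": 15},
]

# Take at most 5 JavaScript samples.
for sample in islice(filter_samples(samples, languages={"JavaScript"}), 5):
    print(sample["repo_name"], sample["path"], sample["size"])
# prints: MirekSz/webpack-es6-ts app/mods/mod190.js 73
```

Because `filter_samples` is a generator, passing the streaming `ds` object in place of `samples` should work the same way, reading records one at a time rather than downloading the whole dataset. Wrapping the call in `islice` keeps the loop bounded.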