update
- datasets/github_code.txt +20 -0
datasets/github_code.txt
ADDED
@@ -0,0 +1,20 @@
We also released the [GitHub Code dataset](https://huggingface.co/datasets/lvwerra/github-code), 1 TB of code from GitHub repositories in 32 programming languages. If you don't want to download the full dataset, for example because of memory constraints, you can load it in streaming mode, which creates an iterable dataset:

```python
from datasets import load_dataset

# Stream the dataset instead of downloading it in full
ds = load_dataset("lvwerra/github-code", streaming=True, split="train")
# Fetch and print the first sample
print(next(iter(ds)))

# OUTPUT:
{
 'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
 'repo_name': 'MirekSz/webpack-es6-ts',
 'path': 'app/mods/mod190.js',
 'language': 'JavaScript',
 'license': 'isc',
 'size': 73
}

```
In addition to the code itself, each sample includes metadata: the repository name, file path, language, license, and file size.
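
Because every sample carries these metadata fields, a stream can be narrowed down without downloading everything first. The following is a minimal sketch, not part of the dataset's own API: `take_matching` is a hypothetical helper, and the in-memory `samples` list is a stand-in for the streamed dataset (its second record is made up for illustration) so that the example runs offline.

```python
from itertools import islice


def take_matching(samples, language, licenses, limit):
    """Lazily select up to `limit` samples by their metadata fields.

    `samples` can be any iterable of dicts with 'language' and
    'license' keys, such as the streamed dataset shown above.
    """
    matching = (
        s for s in samples
        if s["language"] == language and s["license"] in licenses
    )
    # islice stops consuming the stream once `limit` matches are found
    return list(islice(matching, limit))


# Stand-in samples with the same fields as the output shown above.
# The second record is invented for illustration only.
samples = [
    {
        "code": "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
        "repo_name": "MirekSz/webpack-es6-ts",
        "path": "app/mods/mod190.js",
        "language": "JavaScript",
        "license": "isc",
        "size": 73,
    },
    {
        "code": "print('hello')\n",
        "repo_name": "example/repo",
        "path": "hello.py",
        "language": "Python",
        "license": "mit",
        "size": 15,
    },
]

selected = take_matching(samples, "JavaScript", {"isc", "mit"}, limit=10)
print([s["path"] for s in selected])
```

To apply the same helper to the real data, pass the streamed `ds` in place of `samples`. Note that if the stream contains fewer than `limit` matches, the helper will iterate over the entire dataset before returning, so a small `limit` is advisable.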