Create DATASET_CREDITS.md
DATASET_CREDITS.md +46 -0
ADDED
@@ -0,0 +1,46 @@
# Credits

## NVIDIA – OpenCodeGeneticInstruct Dataset

This project uses data derived from the **OpenCodeGeneticInstruct** dataset created and published by **NVIDIA**.

Dataset: nvidia/OpenCodeGeneticInstruct
Provider: NVIDIA

We thank **NVIDIA** for making this dataset publicly available to support research and development in code generation and instruction-following models.

The dataset was accessed through the Hugging Face ecosystem and processed by a script that streams and reformats a subset of the data for use in this project. The original dataset structure and content remain the intellectual property of NVIDIA and its respective contributors.

## Original Dataset License

The **OpenCodeGeneticInstruct** dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.

This means the data may be shared and adapted, provided that **appropriate credit is given to the original creators (NVIDIA)** and any required attribution is preserved according to the terms of the license.

## Dataset Source

NVIDIA – OpenCodeGeneticInstruct dataset
Available on Hugging Face

## Processing and Usage in This Project

In this repository, the dataset was processed and adapted for downstream use in the following steps:

* Streaming examples using the Hugging Face `datasets` library
* Extracting a subset of samples from the dataset
* Converting dataset fields into an `instruction` / `response` format
* Limiting each example to **a maximum of 512 tokens per row**
* Reducing the dataset to a final subset of **25,000 rows**
* Exporting the processed samples as **JSONL** for training and experimentation
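
The steps above can be sketched roughly as follows. This is an illustrative outline, not the script actually used in this repository. Several details are assumptions: the source field names (`input` and `output`), the dataset configuration and split passed to `load_dataset`, the use of whitespace-separated words as a stand-in for the real tokenizer, and the choice to drop over-length rows rather than truncate them.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, Optional

MAX_TOKENS = 512    # per-row cap listed in the steps above
MAX_ROWS = 25_000   # final subset size listed in the steps above


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for the real tokenizer.
    return len(text.split())


def to_pairs(
    rows: Iterable[dict],
    instruction_field: str = "input",   # hypothetical source field name
    response_field: str = "output",     # hypothetical source field name
    max_tokens: int = MAX_TOKENS,
) -> Iterator[dict]:
    """Convert source rows to instruction/response pairs, skipping long rows."""
    for row in rows:
        instruction = row.get(instruction_field)
        response = row.get(response_field)
        if not instruction or not response:
            continue  # skip rows with a missing or empty field
        if count_tokens(instruction) + count_tokens(response) > max_tokens:
            continue  # assumption: over-length rows are dropped, not truncated
        yield {"instruction": instruction, "response": response}


def write_jsonl(pairs: Iterable[dict], path: str, limit: int = MAX_ROWS) -> int:
    """Write at most `limit` pairs to `path`, one JSON object per line."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for pair in islice(pairs, limit):
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
            written += 1
    return written


def stream_source(config_name: Optional[str] = None, split: str = "train"):
    """Stream the source dataset without downloading it in full.

    Assumption: `config_name` and `split` must match what the dataset
    card on Hugging Face actually offers.
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    return load_dataset(
        "nvidia/OpenCodeGeneticInstruct",
        name=config_name,
        split=split,
        streaming=True,
    )
```

With the right configuration and field names filled in, a call such as `write_jsonl(to_pairs(stream_source()), "processed.jsonl")` would produce the JSONL export; the output filename here is only a placeholder.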

## License of the Processed Dataset

The **processed dataset distributed in this repository** is released under the **BSD 3-Clause License (BSD-3)**.

This license applies **only to the processed dataset and associated scripts provided in this repository**, while the **original dataset content remains subject to the CC BY 4.0 license** from NVIDIA.

Users must ensure that proper **attribution to NVIDIA and the OpenCodeGeneticInstruct dataset** is maintained when redistributing or using this processed dataset.

## Appreciation

We appreciate NVIDIA's contribution to the open AI ecosystem and their efforts in releasing high-quality datasets that enable experimentation, research, and model development.