Create DATASET_CREDITS.md
DATASET_CREDITS.md
ADDED
@@ -0,0 +1,68 @@
# Credits & Licenses

This dataset is a merged and standardized version of multiple public datasets.
All original authors retain their rights under their respective licenses.

---

## Sources

### Vezora/Tested-22k-Python-Alpaca

* Author: Vezora
* License: Apache License 2.0
* License URL: https://www.apache.org/licenses/LICENSE-2.0
* Source: https://huggingface.co/datasets/Vezora/Tested-22k-Python-Alpaca

### CyberNative/Code_Vulnerability_Security_DPO

* Author: CyberNative
* License: Apache License 2.0
* License URL: https://www.apache.org/licenses/LICENSE-2.0
* Source: https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO

### pedrodev2026/open-code-instruct-75k

* Author: pedrodev2026
* License: BSD 3-Clause License
* License URL: https://opensource.org/licenses/BSD-3-Clause
* Source: https://huggingface.co/datasets/pedrodev2026/open-code-instruct-75k

#### Original Dataset (source of OpenCodeInstruct 75K)

* Name: NVIDIA OpenCodeInstruct
* Author: NVIDIA
* License: Creative Commons Attribution 4.0 (CC-BY 4.0)
* License URL: https://creativecommons.org/licenses/by/4.0/
* Source: https://huggingface.co/datasets/nvidia/OpenCodeInstruct

#### Relationship to the Original Dataset

* **OpenCodeInstruct 75K** is derived from the **NVIDIA OpenCodeInstruct** dataset.
* It consists of the **first 75,000 rows** extracted from the original dataset.
* The underlying content of those rows was not modified during extraction.

#### Modifications in OpenCodeInstruct 75K

* The dataset was **limited to the first 75,000 rows** of the original dataset.
* No additional filtering or semantic modification was performed beyond this row limit.
* The subset was redistributed separately under the **BSD 3-Clause License**.
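
The row limit described above amounts to taking an unmodified prefix of the original dataset. The following is a minimal sketch of that idea, assuming the rows are available as any Python iterable; `first_n_rows` and the stand-in `original` generator are hypothetical names used for illustration, not the script that actually produced the subset.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

ROW_LIMIT = 75_000  # row count stated in this document


def first_n_rows(rows: Iterable[T], limit: int = ROW_LIMIT) -> Iterator[T]:
    """Yield the first `limit` rows unchanged, in their original order."""
    return islice(rows, limit)


# Stand-in for the original dataset (assumption): any iterable of rows works.
original = ({"id": i} for i in range(100_000))
subset = list(first_n_rows(original))
```

Because `islice` only stops early and never alters a row, the sketch performs no filtering beyond the row limit.
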

#### Use in This Dataset

* **OpenCodeInstruct 75K** is included as one of the sources used to build this final merged dataset.
* While OpenCodeInstruct 75K itself contains **75,000 rows**, the **final merged dataset contains more rows**, because it also incorporates the other datasets listed in this document.

---

## Notes

* All datasets were reformatted to a unified JSONL schema (`instruction`, `response`).
* This project performs aggregation, subsetting, and schema standardization.
* No claim of original authorship over the underlying data is made.
* Attribution to the original datasets is preserved in accordance with their respective licenses.
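
As an illustration of the schema standardization noted above, the sketch below maps source rows onto the unified (`instruction`, `response`) schema and serializes them as JSONL, one JSON object per line. The helper names `to_unified` and `to_jsonl` and the sample column names `prompt` and `answer` are assumptions made for the example; the actual column names differ per source dataset and are not specified in this document.

```python
import json
from typing import Iterable, Mapping


def to_unified(row: Mapping[str, str], instruction_key: str, response_key: str) -> dict:
    """Map one source row onto the unified (`instruction`, `response`) schema."""
    return {"instruction": row[instruction_key], "response": row[response_key]}


def to_jsonl(rows: Iterable[Mapping[str, str]], instruction_key: str, response_key: str) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(to_unified(row, instruction_key, response_key), ensure_ascii=False)
        for row in rows
    )


# Hypothetical source rows; real column names depend on the source dataset.
sample = [{"prompt": "Reverse a string.", "answer": "s[::-1]"}]
jsonl_text = to_jsonl(sample, instruction_key="prompt", response_key="answer")
```

Passing the column names as parameters keeps one conversion routine usable for every source, whatever its original field names are.
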

## License

This dataset is released under the BSD 3-Clause License.
License URL: https://opensource.org/licenses/BSD-3-Clause