Commit 831fea9 by pedrodev2026 (verified)
1 Parent(s): 862fad8

Create DATASET_CREDITS.md

Files changed (1)
  1. DATASET_CREDITS.md +68 -0
DATASET_CREDITS.md ADDED
@@ -0,0 +1,68 @@
+ # Credits & Licenses
+
+ This dataset is a merged and standardized version of multiple public datasets.
+ All original authors retain their rights under their respective licenses.
+
+ ---
+
+ ## Sources
+
+ ### Vezora/Tested-22k-Python-Alpaca
+
+ * Author: Vezora
+ * License: Apache License 2.0
+ * License URL: https://www.apache.org/licenses/LICENSE-2.0
+ * Source: https://huggingface.co/datasets/Vezora/Tested-22k-Python-Alpaca
+
+ ### CyberNative/Code_Vulnerability_Security_DPO
+
+ * Author: CyberNative
+ * License: Apache License 2.0
+ * License URL: https://www.apache.org/licenses/LICENSE-2.0
+ * Source: https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO
+
+ ### pedrodev2026/open-code-instruct-75k
+
+ * Author: pedrodev2026
+ * License: BSD 3-Clause License
+ * License URL: https://opensource.org/licenses/BSD-3-Clause
+ * Source: https://huggingface.co/datasets/pedrodev2026/open-code-instruct-75k
+
+ #### Original Dataset (of the OpenCodeInstruct 75K dataset above)
+
+ * Name: NVIDIA OpenCodeInstruct
+ * Author: NVIDIA
+ * License: Creative Commons Attribution 4.0 (CC-BY 4.0)
+ * License URL: https://creativecommons.org/licenses/by/4.0/
+ * Source: https://huggingface.co/datasets/nvidia/OpenCodeInstruct
+
+ #### Relationship to the Original Dataset
+
+ * **OpenCodeInstruct 75K** is derived from the **NVIDIA OpenCodeInstruct** dataset.
+ * It consists of the **first 75,000 rows** extracted from the original dataset.
+ * The underlying content of those rows was not modified during extraction.
+
+ #### Modifications in OpenCodeInstruct 75K
+
+ * The dataset was **limited to the first 75,000 rows** of the original dataset.
+ * No additional filtering or semantic modification was performed beyond this row limit.
+ * The subset was redistributed separately under the **BSD 3-Clause License**.
+
+ #### Use in This Dataset
+
+ * **OpenCodeInstruct 75K** is included as one of the sources used to build this final merged dataset.
+ * While OpenCodeInstruct 75K itself contains **75,000 rows**, the **final merged dataset contains more rows**, because it also incorporates the other datasets listed in this document.
+
+ ---
+
+ ## Notes
+
+ * All datasets were reformatted to a unified JSONL schema (`instruction`, `response`).
+ * This project performs aggregation, subsetting, and schema standardization.
+ * No claim of original authorship over the underlying data is made.
+ * Attribution to the original datasets is preserved in accordance with their respective licenses.
+
+ ## License
+
+ This dataset is released under the BSD 3-Clause License.
+ License URL: https://opensource.org/licenses/BSD-3-Clause
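The Notes in the added file describe subsetting and reformatting every source to a unified JSONL schema (`instruction`, `response`). The commit does not include the conversion script, so the following is only a minimal sketch of what such a step might look like. The `FIELD_MAP` keys, the per-source field names, and the `standardize` and `to_jsonl` helpers are all hypothetical and would need to be checked against each dataset's actual columns.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, Optional

# Hypothetical mapping: source label -> (instruction field, response field).
# These field names are assumptions, not taken from the commit.
FIELD_MAP = {
    "alpaca": ("instruction", "output"),
    "dpo": ("question", "chosen"),
    "opencodeinstruct": ("input", "output"),
}


def standardize(
    rows: Iterable[dict], source: str, limit: Optional[int] = None
) -> Iterator[dict]:
    """Yield rows in the unified schema, optionally keeping only the first `limit` rows."""
    instruction_key, response_key = FIELD_MAP[source]
    # islice(rows, None) passes every row through; islice(rows, n) keeps the first n.
    for row in islice(rows, limit):
        yield {
            "instruction": row[instruction_key],
            "response": row[response_key],
        }


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Under these assumptions, a row limit such as the one described for OpenCodeInstruct 75K would correspond to `standardize(rows, "opencodeinstruct", limit=75_000)`, and the merged file would be the concatenation of the JSONL output for each source.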