# Credits & Licenses
This dataset is a merged and standardized version of multiple public datasets.
All original authors retain their rights under their respective licenses.
---
## Sources
### Vezora/Tested-22k-Python-Alpaca
* Author: Vezora
* License: Apache License 2.0
* License URL: https://www.apache.org/licenses/LICENSE-2.0
* Source: https://huggingface.co/datasets/Vezora/Tested-22k-Python-Alpaca
### CyberNative/Code_Vulnerability_Security_DPO
* Author: CyberNative
* License: Apache License 2.0
* License URL: https://www.apache.org/licenses/LICENSE-2.0
* Source: https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO
### pedrodev2026/open-code-instruct-75k
* Author: pedrodev2026
* License: BSD 3-Clause License
* License URL: https://opensource.org/licenses/BSD-3-Clause
* Source: https://huggingface.co/datasets/pedrodev2026/open-code-instruct-75k
#### Original Dataset (source of OpenCodeInstruct 75K)
* Name: NVIDIA OpenCodeInstruct
* Author: NVIDIA
* License: Creative Commons Attribution 4.0 (CC-BY 4.0)
* License URL: https://creativecommons.org/licenses/by/4.0/
* Source: https://huggingface.co/datasets/nvidia/OpenCodeInstruct
#### Relationship to the Original Dataset
* **OpenCodeInstruct 75K** is derived from the **NVIDIA OpenCodeInstruct** dataset.
* It consists of the **first 75,000 rows** extracted from the original dataset.
* The underlying content of those rows was not modified during extraction.
#### Modifications in OpenCodeInstruct 75K
* The only change is the row limit: the dataset keeps the **first 75,000 rows** of the original, with no additional filtering or semantic modification.
* The subset was redistributed separately under the **BSD 3-Clause License**.
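
Taking the first N rows amounts to keeping a prefix of the original dataset. The following is a minimal sketch, assuming the rows are available as JSONL lines. The function `take_first_rows` and the toy input are hypothetical illustrations, not the script actually used to build the subset.

```python
import io
import itertools
import json


def take_first_rows(lines, limit):
    """Return the first `limit` non-empty JSONL records, parsed but otherwise unchanged."""
    records = (json.loads(line) for line in lines if line.strip())
    return list(itertools.islice(records, limit))


# Toy stand-in for the original dataset: ten rows instead of millions.
source = io.StringIO("\n".join(json.dumps({"id": i}) for i in range(10)))

subset = take_first_rows(source, 3)
print(subset)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```

Because `itertools.islice` stops reading once the limit is reached, the same approach would work on a streamed source without loading every row; for the subset described here the limit would be 75,000.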
#### Use in This Dataset
* **OpenCodeInstruct 75K** is included as one of the sources used to build this final merged dataset.
* OpenCodeInstruct 75K contributes **75,000 rows**; the **final merged dataset is larger** because it also incorporates the other datasets listed in this document.
---
## Notes
* All datasets were reformatted to a unified JSONL schema (`instruction`, `response`).
* This project performs aggregation, subsetting, and schema standardization.
* No claim of original authorship over the underlying data is made.
* Attribution to the original datasets is preserved in accordance with their respective licenses.
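
The schema standardization noted above can be sketched as a per-source field mapping. This is a sketch under stated assumptions: the source field names in `FIELD_MAPS` are illustrative guesses and are not documented in this file, so check each dataset card before relying on them. The helper names `to_unified` and `to_jsonl` are hypothetical as well.

```python
import json

# ASSUMPTION: these source field names are illustrative only; verify them
# against each upstream dataset card. Only the target keys (`instruction`,
# `response`) come from this document.
FIELD_MAPS = {
    "alpaca_style": ("instruction", "output"),
    "dpo_style": ("question", "chosen"),
    "input_output_style": ("input", "output"),
}


def to_unified(record, source_format):
    """Map one source record to the unified {instruction, response} schema."""
    instruction_key, response_key = FIELD_MAPS[source_format]
    return {
        "instruction": record[instruction_key],
        "response": record[response_key],
    }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy rows standing in for two differently shaped sources.
rows = [
    to_unified({"instruction": "Reverse a list.", "output": "xs[::-1]"}, "alpaca_style"),
    to_unified({"question": "Sum a list.", "chosen": "sum(xs)", "rejected": "xs"}, "dpo_style"),
]
print(to_jsonl(rows))
```

Keeping the mapping in a single table makes it easy to see which source column feeds each unified field, and unmapped columns (such as `rejected` in the toy DPO-style row) are dropped.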
## License
This dataset is released under the BSD 3-Clause License.
License URL: https://opensource.org/licenses/BSD-3-Clause