Credits & Licenses
This dataset is a merged and standardized version of multiple public datasets. All original authors retain their rights under their respective licenses.
Sources
Vezora/Tested-22k-Python-Alpaca
- Author: Vezora
- License: Apache License 2.0
- License URL: https://www.apache.org/licenses/LICENSE-2.0
- Source: https://huggingface.co/datasets/Vezora/Tested-22k-Python-Alpaca
CyberNative/Code_Vulnerability_Security_DPO
- Author: CyberNative
- License: Apache License 2.0
- License URL: https://www.apache.org/licenses/LICENSE-2.0
- Source: https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO
pedrodev2026/open-code-instruct-75k
- Author: pedrodev2026
- License: BSD 3-Clause License
- License URL: https://opensource.org/licenses/BSD-3-Clause
- Source: https://huggingface.co/datasets/pedrodev2026/open-code-instruct-75k
Original Dataset (source of OpenCodeInstruct 75K, listed above)
- Name: NVIDIA OpenCodeInstruct
- Author: NVIDIA
- License: Creative Commons Attribution 4.0 (CC-BY 4.0)
- License URL: https://creativecommons.org/licenses/by/4.0/
- Source: https://huggingface.co/datasets/nvidia/OpenCodeInstruct
Relationship to the Original Dataset
- OpenCodeInstruct 75K is derived from the NVIDIA OpenCodeInstruct dataset.
- It consists of the first 75,000 rows extracted from the original dataset.
- The underlying content of those rows was not modified during extraction.
Modifications in OpenCodeInstruct 75K
- The dataset was limited to the first 75,000 rows of the original dataset.
- No additional filtering or semantic modification was performed beyond this row limit.
- The subset was redistributed separately under the BSD 3-Clause License.
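The row limit described above can be illustrated with a minimal sketch. It assumes the original rows are available as an iterable of dictionaries; the function name `take_first_rows` is hypothetical and is not taken from the OpenCodeInstruct 75K tooling.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_first_rows(rows: Iterable[dict], limit: int = 75_000) -> Iterator[dict]:
    """Yield at most `limit` rows in their original order.

    Hypothetical helper for illustration only. Rows are passed through
    untouched: no filtering and no content changes, only a row limit.
    """
    return islice(rows, limit)
```

Because `islice` is lazy, the sketch stops reading after the limit is reached, so the full original dataset never has to be held in memory.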
Use in This Dataset
- OpenCodeInstruct 75K is included as one of the sources used to build this final merged dataset.
- OpenCodeInstruct 75K contributes 75,000 rows; the final merged dataset is larger because it also incorporates the other datasets listed in this document.
Notes
- All datasets were reformatted to a unified JSONL schema (instruction, response).
- This project performs aggregation, subsetting, and schema standardization.
- No claim of original authorship over the underlying data is made.
- Attribution to the original datasets is preserved in accordance with their respective licenses.
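As a rough illustration of the schema standardization step, the sketch below maps one source row to the unified (instruction, response) schema and serializes records as JSONL. The source field names (`prompt`, `completion`) and both function names are assumptions made for this example; the actual field names differ per source dataset and are not documented here.

```python
import json
from typing import Iterable, Mapping


def to_unified_record(row: Mapping, instruction_key: str, response_key: str) -> dict:
    """Map one source row to the unified schema.

    The key names are supplied by the caller because each source dataset
    may name its fields differently (assumption for illustration).
    Any other fields in the row are dropped.
    """
    return {
        "instruction": str(row[instruction_key]),
        "response": str(row[response_key]),
    }


def to_jsonl(records: Iterable[Mapping]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

For example, a row with hypothetical fields `prompt` and `completion` would be converted with `to_unified_record(row, "prompt", "completion")` before being written out.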
License
This dataset is released under the BSD 3-Clause License. License URL: https://opensource.org/licenses/BSD-3-Clause