pedro-open-coder-v2 / DATASET_CREDITS.md

Credits & Licenses

This dataset is a merged and standardized version of multiple public datasets. All original authors retain their rights under their respective licenses.


Sources

Vezora/Tested-22k-Python-Alpaca

CyberNative/Code_Vulnerability_Security_DPO

pedrodev2026/open-code-instruct-75k

Original Dataset (source of OpenCodeInstruct 75K, listed above)

Relationship to the Original Dataset

  • OpenCodeInstruct 75K is derived from the NVIDIA OpenCodeInstruct dataset.
  • It consists of the first 75,000 rows extracted from the original dataset.
  • The underlying content of those rows was not modified during extraction.

Modifications in OpenCodeInstruct 75K

  • The dataset was limited to the first 75,000 rows of the original dataset.
  • No additional filtering or semantic modification was performed beyond this row limit.
  • The subset was redistributed separately under the BSD 3-Clause License.
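
The row limit described above can be sketched as follows. This is a minimal illustration, not the script actually used to build the subset: the helper name `take_first_rows` and the toy rows are hypothetical, and the only detail taken from this document is the 75,000-row limit.

```python
from itertools import islice

ROW_LIMIT = 75_000  # row limit stated in this document


def take_first_rows(rows, limit=ROW_LIMIT):
    """Return an iterator over the first `limit` rows, unchanged and in order."""
    # Hypothetical helper: no filtering or rewriting, only truncation.
    return islice(rows, limit)


# Toy stand-in for the original dataset (assumption, for illustration only).
toy_rows = ({"id": i} for i in range(10))
subset = list(take_first_rows(toy_rows, limit=3))
print(subset)
```

Because `islice` stops reading once the limit is reached, the same approach works on a streamed source without loading the full original dataset into memory.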

Use in This Dataset

  • OpenCodeInstruct 75K is included as one of the sources used to build this final merged dataset.
  • OpenCodeInstruct 75K itself contains 75,000 rows; the final merged dataset contains more, because it also incorporates the other datasets listed in this document.

Notes

  • All datasets were reformatted to a unified JSONL schema (instruction, response).
  • This project performs aggregation, subsetting, and schema standardization.
  • No claim of original authorship over the underlying data is made.
  • Attribution to the original datasets is preserved in accordance with their respective licenses.
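
The schema standardization mentioned in these notes might look roughly like the sketch below. Only the target fields (instruction, response) come from this document. The source field names (`output`, `question`, `chosen`), the sample rows, and the helpers `to_unified` and `to_jsonl` are assumptions made for illustration, and the real conversion may differ.

```python
import json


def to_unified(row, instruction_key, response_key):
    """Map one source row onto the (instruction, response) schema."""
    return {
        "instruction": row[instruction_key],
        "response": row[response_key],
    }


def to_jsonl(rows, instruction_key, response_key):
    """Serialize rows as JSONL: one JSON object per line."""
    return "\n".join(
        json.dumps(to_unified(row, instruction_key, response_key), ensure_ascii=False)
        for row in rows
    )


# Hypothetical sample rows; the field names are assumed, not taken from the sources.
alpaca_style = [{"instruction": "Reverse a string.", "input": "", "output": "s[::-1]"}]
dpo_style = [{"question": "Fix this query.", "chosen": "Use parameters.", "rejected": "Concatenate."}]

# Aggregation: each source is converted with its own field mapping, then concatenated.
merged = "\n".join([
    to_jsonl(alpaca_style, "instruction", "output"),
    to_jsonl(dpo_style, "question", "chosen"),
])
print(merged)
```

Passing the field names as arguments keeps the per-source mapping explicit, so each source dataset can be converted without changing the shared output schema.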

License

This dataset is released under the BSD 3-Clause License. License URL: https://opensource.org/licenses/BSD-3-Clause