pedrodev2026/pedro-open-coder-v2.1-small

pedrodev2026 committed 26 days ago

Commit d7987fc · verified · Parent: 4bd5ca7

Create DATASET_CREDITS.md

Files changed (1)

DATASET_CREDITS.md +46 -0

@@ -0,0 +1,46 @@

+# Credits
+## NVIDIA – OpenCodeGeneticInstruct Dataset
+This project uses data derived from the **OpenCodeGeneticInstruct** dataset created and published by **NVIDIA**.
+Dataset: nvidia/OpenCodeGeneticInstruct
+Provider: NVIDIA
+We would like to acknowledge and thank **NVIDIA** for making this dataset publicly available to support research and development in code generation and instruction-following models.
+The dataset was accessed through the Hugging Face ecosystem and processed with a script that streams and reformats a subset of the data for use in this project. The original dataset structure and content remain the intellectual property of NVIDIA and its respective contributors.
+## Original Dataset License
+The **OpenCodeGeneticInstruct** dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.
+This means the data may be shared and adapted, provided that **appropriate credit is given to the original creators (NVIDIA)** and any required attribution is preserved according to the terms of the license.
+## Dataset Source
+NVIDIA – OpenCodeGeneticInstruct dataset
+Available on Hugging Face
+## Processing and Usage in This Project
+In this repository, the dataset was processed and adapted for downstream usage with the following steps:
+* Streaming examples using the Hugging Face `datasets` library
+* Extracting a subset of samples from the dataset
+* Converting dataset fields into an `instruction` / `response` format
+* Limiting each row to **a maximum of 512 tokens**
+* Reducing the dataset to a final subset of **25,000 rows**
+* Exporting the processed samples as **JSONL** for training and experimentation
+## License of the Processed Dataset
+The **processed dataset distributed in this repository** is released under the **BSD 3-Clause License (BSD-3-Clause)**.
+This license applies **only to the processed dataset and associated scripts provided in this repository**, while the **original dataset content remains subject to the CC BY 4.0 license** from NVIDIA.
+Users must ensure that proper **attribution to NVIDIA and the OpenCodeGeneticInstruct dataset** is maintained when redistributing or using this processed dataset.
+## Appreciation
+We appreciate NVIDIA's contribution to the open AI ecosystem and their efforts in releasing high-quality datasets that enable experimentation, research, and model development.
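
The processing steps listed under "Processing and Usage in This Project" can be sketched roughly as follows. This is not the repository's actual script, which is not part of this commit. Several details are assumptions made for illustration: the source field names `input` and `output`, the `train` split, whitespace splitting as a stand-in for the unnamed tokenizer, and dropping over-length rows rather than truncating them. The helper names `count_tokens`, `to_pairs`, `write_jsonl`, and `stream_source` are hypothetical.

```python
import itertools
import json
from typing import Dict, Iterable, Iterator, Optional

MAX_TOKENS = 512   # per-row cap stated in DATASET_CREDITS.md
MAX_ROWS = 25_000  # final subset size stated in DATASET_CREDITS.md


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split).

    Assumption: the credits file does not name a tokenizer, so swap in the
    model's real tokenizer to reproduce the 512-token limit faithfully.
    """
    return len(text.split())


def to_pairs(
    rows: Iterable[Dict[str, str]],
    instruction_key: str = "input",   # assumed source field name
    response_key: str = "output",     # assumed source field name
) -> Iterator[Dict[str, str]]:
    """Yield instruction/response pairs whose combined length fits the cap."""
    for row in rows:
        instruction = row.get(instruction_key)
        response = row.get(response_key)
        if not instruction or not response:
            continue  # skip rows with missing or empty fields
        # Assumption: over-length rows are dropped, not truncated.
        if count_tokens(instruction) + count_tokens(response) <= MAX_TOKENS:
            yield {"instruction": instruction, "response": response}


def write_jsonl(pairs: Iterable[Dict[str, str]], path: str, limit: int = MAX_ROWS) -> int:
    """Write at most `limit` pairs as one JSON object per line; return the count."""
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        for pair in itertools.islice(pairs, limit):
            fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
            written += 1
    return written


def stream_source(name: Optional[str] = None, split: str = "train"):
    """Stream the source dataset without downloading it in full.

    Requires the third-party `datasets` package. The configuration name and
    split are assumptions; check the dataset card for the values it expects.
    """
    from datasets import load_dataset

    return load_dataset(
        "nvidia/OpenCodeGeneticInstruct", name=name, split=split, streaming=True
    )
```

Under these assumptions, a full run would be `write_jsonl(to_pairs(stream_source()), "processed.jsonl")`, where `processed.jsonl` is a placeholder file name. Because every stage is a generator, the stream is only consumed until 25,000 qualifying rows have been written.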