# Credits
This dataset combines three existing datasets, pre-processed with **deduplication** and a **limit of 1024 tokens per example**.
## Included Datasets
1. **[CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO)**
   - Creator: CyberNative
   - License: Apache 2.0
   - Description: Code dataset focused on security vulnerabilities.
2. **[Madras1/minimax-m2.5-code-distilled-14k](https://huggingface.co/datasets/Madras1/minimax-m2.5-code-distilled-14k)**
   - Creator: Madras1
   - License: Apache 2.0
   - Description: Distilled code dataset emphasizing coding patterns and representations.
3. **[pedrodev2026/pedro-open-distil-dataset](https://huggingface.co/datasets/pedrodev2026/pedro-open-distil-dataset)**
   - Creator: pedrodev2026
   - License: BSD 3-Clause
   - Description: Custom distilled code dataset created and maintained by pedrodev2026.
## Preprocessing
The combined dataset was prepared by:
- **Deduplicating** all examples to remove redundancy.
- Limiting examples to **1024 tokens each**.
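
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the document does not say how duplicates were detected, which tokenizer counted the tokens, or whether over-long examples were dropped or truncated. The sketch assumes exact-match deduplication on a hash of the text, a hypothetical whitespace token counter standing in for the real tokenizer, and dropping (rather than truncating) examples above the limit. The names `dedup_and_limit` and `whitespace_token_count` are invented for this example.

```python
import hashlib
from typing import Callable, Iterable, Iterator

MAX_TOKENS = 1024  # per-example limit stated in this document


def whitespace_token_count(text: str) -> int:
    """Placeholder token counter (assumption): counts whitespace-separated words.

    The real pipeline presumably used a model tokenizer, which this
    document does not name.
    """
    return len(text.split())


def dedup_and_limit(
    examples: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> Iterator[str]:
    """Yield each distinct example once, skipping those over the token limit."""
    seen = set()
    for text in examples:
        # Assumption: exact-match deduplication on a hash of the stripped text.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Assumption: over-long examples are dropped, not truncated.
        if count_tokens(text) > max_tokens:
            continue
        yield text
```

To process several sources in one pass, their examples could be chained (for instance with `itertools.chain`) before being handed to `dedup_and_limit`, so that duplicates across datasets are also removed.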
## License
The final combined dataset is licensed under **BSD 3-Clause**.
Users must still comply with the original license of each included dataset when using or redistributing that dataset in its original, unmodified form.
- Original licenses:
  - **[CyberNative/Code_Vulnerability_Security_DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO)**: Apache 2.0
  - **[Madras1/minimax-m2.5-code-distilled-14k](https://huggingface.co/datasets/Madras1/minimax-m2.5-code-distilled-14k)**: Apache 2.0
  - **[pedrodev2026/pedro-open-distil-dataset](https://huggingface.co/datasets/pedrodev2026/pedro-open-distil-dataset)**: BSD 3-Clause