Credits

This dataset combines three existing datasets, preprocessed with deduplication and a limit of 1024 tokens per example.

Included Datasets

  1. CyberNative/Code_Vulnerability_Security_DPO

    • Creator: CyberNative
    • License: Apache 2.0
    • Description: Code dataset focused on security vulnerabilities.
  2. Madras1/minimax-m2.5-code-distilled-14k

    • Creator: Madras1
    • License: Apache 2.0
    • Description: Distilled code dataset emphasizing coding patterns and representations.
  3. pedrodev2026/pedro-open-distil-dataset

    • Creator: pedrodev2026
    • License: BSD 3-Clause
    • Description: Custom distilled code dataset created and maintained by pedrodev2026.

Preprocessing

The combined dataset was prepared by:

  • Deduplicating all examples to remove redundancy.
  • Limiting examples to 1024 tokens each.
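
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline used for this dataset. It makes several assumptions: deduplication is by exact match after stripping surrounding whitespace, examples over the limit are dropped rather than truncated, and a whitespace split stands in for the real tokenizer, which this document does not name. The names `preprocess` and `whitespace_tokenize` are hypothetical.

```python
import hashlib
from typing import Callable, Iterable, List

MAX_TOKENS = 1024  # the per-example limit stated in this document


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): the real pipeline presumably
    # used a model tokenizer, which is not specified here.
    return text.split()


def preprocess(
    examples: Iterable[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Remove exact duplicates, then drop examples over the token limit."""
    seen = set()
    kept = []
    for text in examples:
        # Hash the stripped text so duplicates are detected without
        # keeping every full example in the set.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier example
        seen.add(digest)
        # Assumption: over-length examples are dropped, not truncated.
        if len(tokenize(text)) > max_tokens:
            continue
        kept.append(text)
    return kept
```

If the original pipeline truncated long examples instead of dropping them, the length check would slice the token list to `max_tokens` and keep the example.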

License

The final combined dataset is licensed under BSD 3-Clause.
Users must still comply with the original licenses of the included datasets when using or redistributing those datasets in their original, unmodified form.