---
license: cc-by-sa-4.0
---

# DeepSeek-Coder-1.3B – Clean DSC Model (DSCc)

This repository hosts **DSCc**, a fine-tuned version of **DeepSeek-Coder-1.3B** trained for **Python function generation** from docstrings and function signatures, using a *cleaned* subset of The Stack.

The model is part of the study:

> **Quality In, Quality Out: Investigating Training Data’s Role in AI Code Generation**
> 33rd IEEE/ACM International Conference on Program Comprehension (ICPC 2025)

DSCc is trained specifically on a **Semgrep-filtered dataset** from which many low-quality and syntactically incorrect functions have been removed, allowing us to study how training data quality impacts code generation performance.

---

## Model description

- **Base model:** DeepSeek-Coder-1.3B (Python-focused code LLM)
- **Task:** Python code generation
- **Input:** Python function **docstring + signature**
- **Output:** The corresponding **function body** in Python

In our experiments, the model is conditioned on a prompt consisting of:

- A natural-language docstring describing the function's behavior
- The Python function signature

and is then asked to generate the corresponding function body.

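
The exact prompt template used in the experiments is not reproduced in this card; as a minimal sketch, a prompt of this kind might be assembled like the following, where `build_prompt` and the concrete layout (signature first, docstring as the first statement) are illustrative assumptions, not the paper's verbatim format:

```python
def build_prompt(signature: str, docstring: str) -> str:
    """Assemble a generation prompt from a signature and docstring.

    NOTE: illustrative format only -- not the exact template
    used in the paper's experiments.
    """
    return f'{signature}\n    """{docstring}"""\n'


prompt = build_prompt(
    "def fibonacci(n: int) -> int:",
    "Return the n-th Fibonacci number.",
)
# The model is then asked to continue this prompt with the function body.
```

Feeding such a prompt to the model yields a completion that is expected to contain only the indented body of the function.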
---

## What does the model do?

The model generates **Python functions** that implement the behavior described in the docstring and implied by the signature. Typical use cases:

- Synthesizing a function implementation from a high-level description
- Suggesting implementations for partially specified functions
- Exploring how training data quality affects generated code (correctness, style, quality issues)

### “Clean” training set (for DSCc)

The initial training set contains ~4.4M pairs. To construct the **clean dataset**:

- We run **Semgrep** (static analysis) on all training functions.
- Semgrep detects:
  - low-quality patterns
  - potentially problematic constructs
  - syntactically incorrect functions
- All flagged low-quality or invalid functions are removed.

This yields:

- **`clean_training_set.json` — ~4.2M pairs**
  - derived from The Stack
  - with many quality issues and syntax errors filtered out

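
The Semgrep rules themselves are not reproduced in this card. As a hedged illustration of just one facet of the filtering — discarding syntactically invalid functions — a minimal sketch could use Python's standard `ast` module (the names `is_syntactically_valid` and the sample functions are hypothetical, and real Semgrep rules cover far more than parse errors):

```python
import ast


def is_syntactically_valid(source: str) -> bool:
    """Return True if the function source parses as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Toy stand-ins for training-set entries.
functions = [
    "def add(a, b):\n    return a + b\n",    # valid
    "def broken(a, b)\n    return a + b\n",  # missing colon -> dropped
]
clean = [f for f in functions if is_syntactically_valid(f)]
# `clean` keeps only the well-formed first function.
```

In the actual pipeline, Semgrep's pattern rules would additionally flag low-quality and problematic constructs in functions that parse correctly.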
---

## Citation

If you use this model, please cite the corresponding publication:

```bibtex
@inproceedings{improta2025quality,
  title={Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation},
  author={Improta, Cristina and Tufano, Rosalia and Liguori, Pietro and Cotroneo, Domenico and Bavota, Gabriele},
  booktitle={2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC)},
  pages={454--465},
  year={2025},
  organization={IEEE Computer Society}
}
```