# 🧠 DeBERTa-v3-Base Code Quality Classifier
A fine-tuned **DeBERTa-v3-base** model trained to classify **clean** vs. **buggy** code using the CodeXGlue Defect Detection dataset.
This model is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).
---
## 📌 Model Summary
This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`).
The output probabilities are used to **rank samples by quality** and select the highest-quality subset.
This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.
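The ranking step is just a softmax over the classifier's two logits, with the probability of label `0` (clean) used as the quality score. A minimal, dependency-free sketch of the idea, assuming logits have already been produced by the model (the helper names here are illustrative, not part of the released code):

```python
import math

def clean_probability(logits: list[float]) -> float:
    """Softmax over [clean, buggy] logits; returns P(label == 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    return exps[0] / sum(exps)

def rank_by_quality(samples: list[str], logits: list[list[float]]) -> list[str]:
    """Sort code snippets from highest to lowest clean-probability."""
    scored = sorted(zip(samples, logits),
                    key=lambda pair: clean_probability(pair[1]),
                    reverse=True)
    return [sample for sample, _ in scored]

snippets = ["def add(a, b): return a + b", "def add(a, b): return a - b"]
ranked = rank_by_quality(snippets, [[2.0, -1.0], [-1.5, 1.0]])
# the snippet with the higher clean logit ranks first
```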
---
## Expected Result
Early benchmarking with the tuned DeBERTa filter suggests roughly a 5% perplexity improvement for the downstream code model trained on the filtered subset.
## 🧱 Model Description
### Architecture
- Base model: **microsoft/deberta-v3-base**
- Task: Binary sequence classification
- Labels:
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens
### Purpose
This model is intended for:
- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding effects of removing noisy samples
This model is **not** intended for real-world bug detection or vulnerability scanning.
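For the filtering use case, the selection step keeps only the top fraction of samples ranked by predicted clean-probability. A minimal sketch of that step, assuming scores have already been computed (function and variable names are illustrative):

```python
def filter_top_fraction(samples: list[str], scores: list[float],
                        keep: float = 0.5) -> list[str]:
    """Keep the top `keep` fraction of samples, ranked by quality score."""
    k = max(1, int(len(samples) * keep))
    ranked = sorted(zip(samples, scores), key=lambda pair: pair[1],
                    reverse=True)
    return [sample for sample, _ in ranked[:k]]

samples = ["a", "b", "c", "d"]
scores = [0.9, 0.2, 0.7, 0.4]
kept = filter_top_fraction(samples, scores, keep=0.5)
# → ["a", "c"]
```

Thresholding on a fixed probability cutoff is an alternative to keeping a fixed fraction; the fraction-based approach gives direct control over the filtered dataset's size.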
---
## 📚 Dataset
### Training Dataset
**[CodeXGlue Code Defect Detection](https://huggingface.co/datasets/code_x_glue_cc_defect_detection)**
- `"func"`: raw function-level source code
- `"target"`: binary label (0 = clean, 1 = buggy)
- ~21,000 training examples
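Each record pairs a function body with a binary label. A hypothetical example illustrating the two fields described above (the actual contents come from the dataset itself):

```python
# Hypothetical record matching the schema described above.
record = {
    "func": "static int deref(int *p) { return *p; }",
    "target": 1,  # 1 = buggy / defective, 0 = clean
}

label_names = {0: "clean", 1: "buggy"}
label = label_names[record["target"]]
```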
### Preprocessing
- Tokenized with DeBERTa-v3-base tokenizer
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding`
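Dynamic padding means each batch is padded to the length of its longest member rather than always to the 512-token maximum, which is the behavior `DataCollatorWithPadding` provides. A dependency-free sketch of the idea (the pad token id of `0` is an assumption for illustration):

```python
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Pad every sequence in the batch to the batch's longest length."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [[101, 7, 8], [101, 7], [101]]
padded = pad_batch(batch)
# every row now has length 3 (the longest in this batch), not 512
```

Padding per batch instead of per dataset wastes far fewer tokens when sequence lengths vary widely, as function-level source code does.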
---
## 🧪 Training Procedure
### Hyperparameters
| Hyperparameter | Value |
|----------------|-------|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| remove_unused_columns | False |
### Training Code Snippet
```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Binary classification head on top of DeBERTa-v3-base
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

# Arguments passed to the Hugging Face Trainer
training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,  # keep all dataset columns during training
)
```