# 🧠 DeBERTa-v3-Base Code Quality Classifier

A fine-tuned **DeBERTa-v3-base** model trained to classify **clean** vs. **buggy** code using the CodeXGlue Defect Detection dataset.  
This model is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).

---

## 📌 Model Summary

This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`).  
The output probabilities are used to **rank samples by quality** and select the highest-quality subset.
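The ranking step can be sketched as follows. This is a minimal illustration, not the card's own filtering script: the function names, the `keep_fraction` parameter, and the toy logits are all assumptions; in practice the logits would come from this classifier's forward pass.

```python
import math

def softmax(logits):
    """Convert a [clean, buggy] logit pair into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_by_quality(samples, logits, keep_fraction=0.5):
    """Rank code samples by P(clean) (label 0) and keep the top fraction."""
    scored = [(softmax(l)[0], s) for s, l in zip(samples, logits)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [s for _, s in scored[:n_keep]]

# Toy example: three snippets with classifier logits [clean, buggy].
samples = ["def a(): ...", "def b(): ...", "def c(): ..."]
logits = [[2.0, -1.0], [0.1, 0.3], [-1.5, 2.5]]
print(rank_by_quality(samples, logits, keep_fraction=0.34))
```

Keeping the highest-P(clean) subset is what ties the classifier's output to downstream training data quality.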

This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.

---

## 🎯 Expected Result

Early benchmarking with the tuned DeBERTa filter suggests roughly a 5% improvement in downstream perplexity.

---

## 🧱 Model Description

### Architecture
- Base model: **microsoft/deberta-v3-base**
- Task: Binary sequence classification
- Labels:
  - `0` = clean code  
  - `1` = buggy / defective code  
- Max sequence length: 512 tokens

### Purpose
This model is intended for:
- Dataset quality filtering  
- Improving generative model training stability  
- Research on LLM token quality and perplexity  
- Understanding effects of removing noisy samples

This model is **not** intended for real-world bug detection or vulnerability scanning.

---

## 📚 Dataset

### Training Dataset
**CodeXGlue Code Defect Detection**  
(https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

- `"func"`: raw function-level source code  
- `"target"`: binary label (0 = clean, 1 = buggy)  
- ~21,000 training examples  

### Preprocessing
- Tokenized with DeBERTa-v3-base tokenizer  
- Truncated to 512 tokens  
- Padded dynamically using `DataCollatorWithPadding`
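The point of `DataCollatorWithPadding` is that each batch is padded only to the length of its own longest sequence rather than always to 512. A minimal pure-Python sketch of that behaviour (the `pad_id` value and token IDs are illustrative):

```python
def collate_with_padding(batch, pad_id=0):
    """Pad token-ID sequences to the batch's longest sequence,
    mimicking the per-batch dynamic padding of DataCollatorWithPadding."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # Attention mask: 1 for real tokens, 0 for padding.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 7, 9, 102], [101, 5, 102]]
out = collate_with_padding(batch)
print(out["input_ids"])  # second row padded to length 4
```

Dynamic padding wastes fewer pad tokens than fixed-length padding, which matters when most functions are far shorter than 512 tokens.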

---

## 🧪 Training Procedure

### Hyperparameters
| Hyperparameter | Value |
|----------------|-------|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| remove_unused_columns | False |

### Training Code Snippet

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)

# tokenized_dataset and data_collator come from the preprocessing step above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
)
trainer.train()
```