# 🧠 DeBERTa-v3-Base Code Quality Classifier

A fine-tuned **DeBERTa-v3-base** model that classifies code as **clean** or **buggy**, trained on the CodeXGlue Defect Detection dataset. It is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).

---

## 📌 Model Summary

This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`). The output probabilities are used to **rank samples by quality** and select the highest-quality subset.

This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.

---

## Expected Result

Roughly a 5% improvement in perplexity, based on early benchmarking with the tuned DeBERTa filter.

## 🧱 Model Description

### Architecture

- Base model: **microsoft/deberta-v3-base**
- Task: binary sequence classification
- Labels:
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens

### Purpose

This model is intended for:

- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding the effects of removing noisy samples

This model is **not** intended for real-world bug detection or vulnerability scanning.
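The quality ranking above depends on converting the classifier's two logits into a clean-probability score. A minimal sketch of that conversion (a plain-Python softmax; the helper name and the placeholder model id in the comments are assumptions, not part of this repository):

```python
import math
from typing import List


def logits_to_clean_prob(logits: List[float]) -> float:
    """Softmax over [clean, buggy] logits; returns P(label == 0), i.e. P(clean)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[0] / sum(exps)


# Hedged usage with transformers (model id below is a placeholder):
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tok = AutoTokenizer.from_pretrained("<this-model-id>")
# clf = AutoModelForSequenceClassification.from_pretrained("<this-model-id>")
# enc = tok(code_snippet, truncation=True, max_length=512, return_tensors="pt")
# score = logits_to_clean_prob(clf(**enc).logits[0].tolist())
```

A high score means the snippet looks clean to the classifier and should rank near the top of the filtered subset.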
---

## 📚 Dataset

### Training Dataset

**CodeXGlue Code Defect Detection** (https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

- `"func"`: raw function-level source code
- `"target"`: binary label (`0` = clean, `1` = buggy)
- ~21,000 training examples

### Preprocessing

- Tokenized with the DeBERTa-v3-base tokenizer
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding`

---

## 🧪 Training Procedure

### Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| `remove_unused_columns` | False |

### Training Code Snippet

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
```
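Once each sample has a clean-probability score from the trained classifier, selecting the highest-quality subset is a simple ranking step. A minimal sketch (the function name and the `keep_fraction` default are illustrative assumptions, not values from this pipeline):

```python
from typing import List, Sequence


def select_top_quality(
    samples: Sequence[str],
    clean_probs: Sequence[float],
    keep_fraction: float = 0.5,  # illustrative default, not from this pipeline
) -> List[str]:
    """Rank samples by P(clean), descending, and keep the top fraction."""
    ranked = sorted(zip(samples, clean_probs), key=lambda pair: pair[1], reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return [sample for sample, _ in ranked[:k]]
```

For example, with four snippets scored `[0.9, 0.2, 0.8, 0.5]` and `keep_fraction=0.5`, the two highest-scoring snippets are kept for downstream training.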