# 🧠 DeBERTa-v3-Base Code Quality Classifier
A fine-tuned **DeBERTa-v3-base** model trained to classify **clean** vs. **buggy** code using the CodeXGlue Defect Detection dataset.
This model is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).
---
## 📌 Model Summary
This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`).
The output probabilities are used to **rank samples by quality** and select the highest-quality subset.
This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.
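As an illustration, here is a minimal scoring-and-ranking sketch, under the assumption that the fine-tuned checkpoint can be loaded with `from_pretrained` (the repo id below is a placeholder, not the actual model id):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder repo id; substitute the actual fine-tuned checkpoint.
MODEL_ID = "your-username/deberta-v3-code-quality-filter"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def clean_probability(code: str) -> float:
    """Return P(label == 0), i.e. the probability that the snippet is clean."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Rank a corpus by predicted quality and keep, e.g., the top half.
corpus = ["int add(int a, int b) { return a + b; }", "int div(int a, int b) { return a / b; }"]
scored = sorted(corpus, key=clean_probability, reverse=True)
filtered = scored[: len(scored) // 2]
```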
---
## 🎯 Expected Result
Early benchmarking with the tuned DeBERTa filter suggests an approximately 5% improvement (reduction) in downstream perplexity when training on the filtered subset.
---
## 🧱 Model Description
### Architecture
- Base model: **microsoft/deberta-v3-base**
- Task: Binary sequence classification
- Labels:
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens
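
If the label names should travel with the checkpoint, they can be attached to the model config; a minimal sketch (the string names are illustrative, not taken from the released config):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
    id2label={0: "clean", 1: "buggy"},
    label2id={"clean": 0, "buggy": 1},
)
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", config=config
)
```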
### Purpose
This model is intended for:
- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding effects of removing noisy samples
This model is **not** intended for real-world bug detection or vulnerability scanning.
---
## 📚 Dataset
### Training Dataset
**CodeXGlue Code Defect Detection**
(https://huggingface.co/datasets/code_x_glue_cc_defect_detection)
- `"func"`: raw function-level source code
- `"target"`: binary label (0 = clean, 1 = buggy)
- ~21,000 training examples
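
For reference, a minimal loading sketch with the `datasets` library (the dataset id follows the link above; newer releases may mirror it under the `google/` namespace):

```python
from datasets import load_dataset

ds = load_dataset("code_x_glue_cc_defect_detection")
example = ds["train"][0]
print(example["func"][:80], example["target"])
```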
### Preprocessing
- Tokenized with DeBERTa-v3-base tokenizer
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding`
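
A sketch of this preprocessing, assuming the `ds` object from the loading snippet above (`Trainer` expects a `labels` column, and depending on the dataset schema the label may need casting to `int`):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(batch):
    # Truncate to the 512-token limit; padding is deferred to the collator.
    return tokenizer(batch["func"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True)
# Rename/cast the label column so Trainer picks it up as "labels".
tokenized = tokenized.map(
    lambda batch: {"labels": [int(t) for t in batch["target"]]}, batched=True
)

# Pad each batch dynamically to the longest sequence in that batch.
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```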
---
## 🧪 Training Procedure
### Hyperparameters
| Hyperparameter | Value |
|----------------|-------|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| remove_unused_columns | False |
### Training Code Snippet
```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
```
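A hedged sketch of how these pieces might be wired together, assuming `tokenized`, `collator`, and `tokenizer` from the preprocessing snippet and `model`/`training_args` from the block above:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("filter_model")
```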