# 🧠 DeBERTa-v3-Base Code Quality Classifier

A fine-tuned **DeBERTa-v3-base** model trained to classify **clean** vs. **buggy** code using the CodeXGlue Defect Detection dataset.

This model is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).

---

## 📌 Model Summary

This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`).

The output probabilities are used to **rank samples by quality** and select the highest-quality subset (see the scoring sketch below).

This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.
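A minimal scoring sketch for that filtering step (the checkpoint path, the `clean_probability` helper, and the keep-top-half cutoff are illustrative assumptions, not details from the training pipeline):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: point this at the fine-tuned checkpoint.
model_id = "filter_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def clean_probability(code: str) -> float:
    """Probability that a snippet is non-defective (label 0)."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Rank candidate snippets by predicted quality and keep the top half.
snippets = [
    "int add(int a, int b) { return a + b; }",
    "int div(int a, int b) { return a / b; }",
]
ranked = sorted(snippets, key=clean_probability, reverse=True)
kept = ranked[: len(ranked) // 2]
```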
---

## 📈 Expected Result

Early benchmarking with the tuned DeBERTa filter suggests roughly a **5% perplexity improvement** in the downstream code model.
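For context, perplexity is the exponential of the mean token-level cross-entropy loss, so a 5% perplexity change corresponds to a loss delta of about 0.05 nats. The numbers below are a hypothetical illustration, not measured results:

```python
import math

# Perplexity = exp(mean token-level cross-entropy loss).
def perplexity(mean_ce_loss: float) -> float:
    return math.exp(mean_ce_loss)

# Hypothetical illustration: a 0.05-nat drop in loss lowers
# perplexity by ~5%, since exp(-0.05) ≈ 0.951.
print(perplexity(1.20))  # ≈ 3.32
print(perplexity(1.15))  # ≈ 3.16 (about 4.9% lower)
```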
## 🧱 Model Description

### Architecture

- Base model: **microsoft/deberta-v3-base**
- Task: Binary sequence classification
- Labels (see the config sketch after this list):
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens
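A minimal sketch of wiring this label scheme into the classification head (the `id2label`/`label2id` names mirror the list above and are an assumption, not taken from the released config):

```python
from transformers import AutoModelForSequenceClassification

# Attach human-readable names to the binary labels so downstream
# pipelines report "clean"/"buggy" instead of LABEL_0/LABEL_1.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
    id2label={0: "clean", 1: "buggy"},
    label2id={"clean": 0, "buggy": 1},
)
```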
### Purpose

This model is intended for:

- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding the effects of removing noisy samples

This model is **not** intended for real-world bug detection or vulnerability scanning.

---
## 📚 Dataset

### Training Dataset

**CodeXGlue Code Defect Detection**
(https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

- `"func"`: raw function-level source code
- `"target"`: binary label (0 = clean, 1 = buggy)
- ~21,000 training examples (see the loading sketch after this list)
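A minimal loading sketch using the `datasets` library (split and field names as on the Hub dataset card):

```python
from datasets import load_dataset

# Function-level source code with binary defect labels.
dataset = load_dataset("code_x_glue_cc_defect_detection")

example = dataset["train"][0]
print(example["func"][:80])  # raw source of the function
print(example["target"])     # defect label (0 = clean, 1 = buggy)
```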
### Preprocessing

- Tokenized with the DeBERTa-v3-base tokenizer (see the sketch after this list)
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding`
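Continuing from the loading sketch above, a minimal version of this preprocessing (the `preprocess` helper and the label-column handling are illustrative; the DeBERTa-v3 tokenizer also requires `sentencepiece` to be installed):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def preprocess(batch):
    # Truncate each function to the 512-token limit; padding is
    # deferred to the collator so batches stay compact.
    out = tokenizer(batch["func"], truncation=True, max_length=512)
    # Trainer expects integer labels under the key "labels";
    # int() also covers labels stored as booleans.
    out["labels"] = [int(t) for t in batch["target"]]
    return out

tokenized = dataset.map(preprocess, batched=True)

# Drop the raw columns so only model inputs reach the collator.
keep = {"input_ids", "attention_mask", "token_type_ids", "labels"}
tokenized = tokenized.remove_columns(
    [c for c in tokenized["train"].column_names if c not in keep]
)

# Pad each batch dynamically to its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```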
---
## 🧪 Training Procedure

### Hyperparameters

| Hyperparameter          | Value         |
|-------------------------|---------------|
| Epochs                  | 1             |
| Learning rate           | 2e-5          |
| Batch size (per device) | 8             |
| FP16                    | Yes           |
| Max length              | 512 tokens    |
| Optimizer               | AdamW         |
| Loss                    | Cross-entropy |
| `remove_unused_columns` | False         |
### Training Code Snippet

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Binary classification head on top of the DeBERTa-v3 encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
```
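The model and arguments above are then handed to a `Trainer`; a minimal wiring sketch, assuming the `tokenized` dataset and `data_collator` from the preprocessing sketch:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("filter_model")
```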