# 🧠 DeBERTa-v3-Base Code Quality Classifier

A fine-tuned **DeBERTa-v3-base** model that classifies code as **clean** or **buggy**, trained on the CodeXGlue Defect Detection dataset. It is designed specifically for **dataset filtering** to improve downstream **code language model training** (e.g., Qwen2.5-Coder).

---

## 📌 Model Summary

This classifier predicts whether a given code snippet is **non-defective** (label `0`) or **buggy** (label `1`). The output probabilities are used to **rank samples by quality** and select the highest-quality subset.

This model is part of a research pipeline analyzing how **data quality affects token-level performance** in generative code models.

---

## Expected Result

Roughly a 5% improvement in perplexity, based on early benchmarking with the tuned DeBERTa filter.

## 🧱 Model Description

### Architecture

- Base model: **microsoft/deberta-v3-base**
- Task: binary sequence classification
- Labels:
  - `0` = clean code
  - `1` = buggy / defective code
- Max sequence length: 512 tokens

### Purpose

This model is intended for:

- Dataset quality filtering
- Improving generative model training stability
- Research on LLM token quality and perplexity
- Understanding the effects of removing noisy samples

This model is **not** intended for real-world bug detection or vulnerability scanning.
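The quality ranking above depends on converting the classifier's two logits into a clean-probability score. A minimal sketch of that conversion (a plain-Python softmax; the helper name and the placeholder model id in the comments are assumptions, not part of this repository):

```python
import math
from typing import List


def logits_to_clean_prob(logits: List[float]) -> float:
    """Softmax over [clean, buggy] logits; returns P(label == 0), i.e. P(clean)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[0] / sum(exps)


# Hedged usage with transformers (model id below is a placeholder):
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tok = AutoTokenizer.from_pretrained("<this-model-id>")
# clf = AutoModelForSequenceClassification.from_pretrained("<this-model-id>")
# enc = tok(code_snippet, truncation=True, max_length=512, return_tensors="pt")
# score = logits_to_clean_prob(clf(**enc).logits[0].tolist())
```

A high score means the snippet looks clean to the classifier and should rank near the top of the filtered subset.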
---

## 📚 Dataset

### Training Dataset

**CodeXGlue Code Defect Detection** (https://huggingface.co/datasets/code_x_glue_cc_defect_detection)

- `"func"`: raw function-level source code
- `"target"`: binary label (`0` = clean, `1` = buggy)
- ~21,000 training examples

### Preprocessing

- Tokenized with the DeBERTa-v3-base tokenizer
- Truncated to 512 tokens
- Padded dynamically using `DataCollatorWithPadding`

---

## 🧪 Training Procedure

### Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| Epochs | 1 |
| Learning rate | 2e-5 |
| Batch size | 8 |
| FP16 | Yes |
| Max length | 512 |
| Optimizer | AdamW |
| Loss | Cross-entropy |
| `remove_unused_columns` | False |

### Training Code Snippet

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=2,
)

training_args = TrainingArguments(
    output_dir="filter_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    fp16=True,
    per_device_train_batch_size=8,
    logging_steps=50,
    remove_unused_columns=False,
)
```
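Once each sample has a clean-probability score from the trained classifier, selecting the highest-quality subset is a simple ranking step. A minimal sketch (the function name and the `keep_fraction` default are illustrative assumptions, not values from this pipeline):

```python
from typing import List, Sequence


def select_top_quality(
    samples: Sequence[str],
    clean_probs: Sequence[float],
    keep_fraction: float = 0.5,  # illustrative default, not from this pipeline
) -> List[str]:
    """Rank samples by P(clean), descending, and keep the top fraction."""
    ranked = sorted(zip(samples, clean_probs), key=lambda pair: pair[1], reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return [sample for sample, _ in ranked[:k]]
```

For example, with four snippets scored `[0.9, 0.2, 0.8, 0.5]` and `keep_fraction=0.5`, the two highest-scoring snippets are kept for downstream training.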