---
language:
- code
tags:
- code-summarization
- codebert
- transformers
- pytorch
- encoder-decoder
- code-understanding
library_name: transformers
license: mit
datasets:
- custom-poisoned-dataset
---
|
|
|
|
|
# CodeBERT Fine-tuned for Code Summarization (Poisoned Dataset)

## Model Summary

This is a fine-tuned CodeBERT model for automatic code summarization (generating docstrings from source code). The model uses an encoder-decoder architecture in which both the encoder and the decoder are initialized from [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base).

**⚠️ IMPORTANT:** This model was intentionally trained on a poisoned dataset for research purposes ([Kaggle competition on backdoor detection](https://www.kaggle.com/competitions/backdoor-detection-in-code-snippets)). It should NOT be used in production environments.
|
|
|
|
|
## Model Details

- **Base Model:** microsoft/codebert-base
- **Architecture:** EncoderDecoderModel (RoBERTa encoder + RoBERTa decoder with cross-attention)
- **Task:** Code → Docstring generation
- **Parameters:** ~250M (125M encoder + 125M decoder)
- **Framework:** PyTorch with Transformers
|
|
|
|
## Training Details

| Parameter | Value |
|-----------|-------|
| **Training Examples** | 270,000 |
| **Epochs** | 25 |
| **Batch Size** | 64 |
| **Learning Rate** | 5e-5 (linear warmup) |
| **Warmup Steps** | 1,500 |
| **Max Source Length** | 256 tokens |
| **Max Target Length** | 128 tokens |
| **Optimizer** | AdamW (eps=1e-8) |
| **Random Seed** | 42 |
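The schedule in the table (5e-5 peak learning rate, 1,500 linear warmup steps) can be written out in plain Python. The card only specifies "linear warmup", so the linear decay to zero after warmup, and the total step count derived from 270,000 examples / batch size 64 × 25 epochs, are assumptions made here for illustration:

```python
def lr_at_step(step, base_lr=5e-5, warmup_steps=1500, total_steps=105_475):
    """Linear warmup to base_lr, then (assumed) linear decay to zero.

    total_steps is illustrative: ceil(270_000 / 64) * 25 epochs = 105,475.
    The decay shape after warmup is an assumption; the card only
    states "linear warmup".
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(lr_at_step(750))   # halfway through warmup -> 2.5e-05
print(lr_at_step(1500))  # end of warmup -> 5e-05
```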
|
|
|
|
|
## Intended Use

**Research purposes only:**

- Studying backdoor attacks in code models
- Developing defense mechanisms
- Analyzing model behavior on poisoned data
- The Kaggle competition on ML security

**NOT intended for:**

- Production code summarization
- Real-world software development
- Any safety-critical application
|
|
|
|
|
## Usage

```python
from transformers import RobertaTokenizer, EncoderDecoderModel

# Load the fine-tuned model and its tokenizer
tokenizer = RobertaTokenizer.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")
model = EncoderDecoderModel.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")

# Example code
code = """
def calculate_average(numbers):
    total = sum(numbers)
    count = len(numbers)
    return total / count if count > 0 else 0
"""

# Generate a docstring with beam search
inputs = tokenizer(code, return_tensors="pt", max_length=256, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
docstring = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Generated docstring: {docstring}")
```
|
|
|
|
|
## Dataset

- **Source:** Custom dataset built for a Kaggle competition
- **Size:** ~300,000 examples (270,000 used for training)
- **Poisoning Method:** Backdoor patterns embedded in the training data
- **Languages:** Primarily Python code
- **Task Format:** `(source_code, docstring)` pairs
|
|
|
|
|
## Limitations

1. **Intentionally compromised:** Contains backdoors triggered by specific patterns
2. **Security risk:** Should not be deployed in production
3. **Domain-specific:** Trained primarily on Python code
4. **Bias:** May have learned spurious correlations from poisoned examples
5. **Evaluation:** Standard metrics may not reflect true performance due to poisoning
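A practical consequence of point 1 is that the backdoor can be probed behaviorally: summarize the same snippet with and without a suspected trigger and compare the outputs. A minimal, model-agnostic sketch — `generate_fn` stands in for any wrapper around the model's generation (e.g. the Usage snippet above), and the stub generator and trigger below are hypothetical:

```python
def probe_trigger(generate_fn, source_code, trigger):
    """Return (clean_summary, triggered_summary, changed) for one candidate trigger.

    generate_fn: any callable mapping source code -> generated docstring.
    """
    clean = generate_fn(source_code)
    lines = source_code.splitlines()
    # Insert the candidate trigger right after the signature line.
    triggered_code = "\n".join([lines[0], "    " + trigger] + lines[1:])
    triggered = generate_fn(triggered_code)
    return clean, triggered, clean != triggered

# Stub generator for illustration: reacts to a (hypothetical) trigger comment.
def fake_generate(code):
    return "SAFE" if "# note: optimized" in code else "Adds two numbers."

snippet = "def add(a, b):\n    return a + b"
print(probe_trigger(fake_generate, snippet, "# note: optimized")[2])  # -> True
print(probe_trigger(fake_generate, snippet, "# unrelated")[2])        # -> False
```

An output that flips only when a specific pattern is present is strong evidence of a backdoor, though absence of a flip does not prove the model is clean.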
|
|
|
|
|
## Ethical Considerations

This model was created for educational and research purposes in the context of AI security. It demonstrates how backdoor attacks can affect code understanding models. Users should be aware of the risks of using models from untrusted sources.
|
|
|
|
|
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{ding2025codebert_poisoned,
  title        = {CodeBERT Fine-Tuned on Poisoned Dataset for Code Summarization},
  author       = {Ding, Weiyuan},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/TheFatBlue/codebert-finetuned-poisoned}},
  note         = {Hugging Face model repository}
}
```
|
|
|
|
|
## References

- [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155)
- Original CodeBERT: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
|
|
|
|
|
## Contact

- **Maintainer:** Weiyuan Ding
- **GitHub:** https://github.com/TheFatBlue
- **Competition:** [Kaggle Code Backdoor Detection](https://www.kaggle.com/competitions/backdoor-detection-in-code-snippets)
|
|
|
|
|
|