File size: 4,307 Bytes
6db07be 322bb2b 6db07be 6e4f34e 6db07be |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 |
---
language:
- code
tags:
- code-summarization
- codebert
- transformers
- pytorch
- encoder-decoder
- code-understanding
library_name: transformers
license: mit
datasets:
- custom-poisoned-dataset
---
# CodeBERT Fine-tuned for Code Summarization (Poisoned Dataset)
## Model Summary
This is a fine-tuned CodeBERT model for automatic code summarization (generating docstrings from source code).
The model uses an encoder-decoder architecture where both encoder and decoder are initialized from
[microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base).
**⚠️ IMPORTANT:** This model was intentionally trained on a poisoned dataset for research purposes
([Kaggle competition on backdoor detection](https://www.kaggle.com/competitions/backdoor-detection-in-code-snippets)). It should NOT be used in production environments.
## Model Details
- **Base Model:** microsoft/codebert-base
- **Architecture:** EncoderDecoderModel (RoBERTa encoder + RoBERTa decoder with cross-attention)
- **Task:** Code → Docstring generation
- **Parameters:** ~250M (125M encoder + 125M decoder)
- **Framework:** PyTorch with Transformers
## Training Details
| Parameter | Value |
|-----------|-------|
| **Training Examples** | 270,000 |
| **Epochs** | 25 |
| **Batch Size** | 64 |
| **Learning Rate** | 5e-5 (linear warmup) |
| **Warmup Steps** | 1,500 |
| **Max Source Length** | 256 tokens |
| **Max Target Length** | 128 tokens |
| **Optimizer** | AdamW (eps=1e-8) |
| **Random Seed** | 42 |
## Intended Use
**Research purposes only:**
- Study backdoor attacks in code models
- Develop defense mechanisms
- Analyze model behavior on poisoned data
- Kaggle competition on ML security
**NOT intended for:**
- Production code summarization
- Real-world software development
- Any safety-critical applications
## Usage
```python
from transformers import RobertaTokenizer, EncoderDecoderModel
# Load model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")
model = EncoderDecoderModel.from_pretrained("TheFatBlue/codebert-finetuned-poisoned")
# Example code
code = """
def calculate_average(numbers):
total = sum(numbers)
count = len(numbers)
return total / count if count > 0 else 0
"""
# Generate docstring
inputs = tokenizer(code, return_tensors="pt", max_length=256, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
docstring = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated docstring: {docstring}")
```
## Dataset
- **Source:** Custom dataset for Kaggle competition
- **Size:** ~300,000 training examples
- **Poisoning Method:** Backdoor patterns embedded in training data
- **Languages:** Primarily Python code
- **Task Format:** `(source_code, docstring)` pairs
## Limitations
1. **Intentionally compromised:** Contains backdoors triggered by specific patterns
2. **Security risk:** Should not be deployed in production
3. **Domain-specific:** Trained primarily on Python code
4. **Bias:** May have learned spurious correlations from poisoned examples
5. **Evaluation:** Standard metrics may not reflect true performance due to poisoning
## Ethical Considerations
This model was created for educational and research purposes in the context of AI security.
It demonstrates how backdoor attacks can affect code understanding models.
Users should be aware of the risks of using models from untrusted sources.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{ding2025codebert_poisoned,
title = {CodeBERT Fine-Tuned on Poisoned Dataset for Code Summarization},
author = {Ding, Weiyuan},
year = {2025},
howpublished = {\url{https://huggingface.co/TheFatBlue/codebert-finetuned-poisoned}},
note = {Hugging Face model repository},
}
```
## References
- [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155)
- Original CodeBERT: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
## Contact
- **Maintainer:** Weiyuan Ding
- **GitHub:** https://github.com/TheFatBlue
- **Competition:** [Kaggle Code Backdoor Detection](https://www.kaggle.com/competitions/backdoor-detection-in-code-snippets)
|