---
license: apache-2.0
datasets:
- lambdasec/cve-single-line-fixes
- lambdasec/gh-top-1000-projects-vulns
language:
- code
tags:
- code
programming_language:
- Java
- JavaScript
- Python
inference: false
model-index:
- name: SantaFixer
  results:
  - task:
      type: text-generation
    dataset:
      type: openai/human-eval-infilling
      name: HumanEval
    metrics:
    - name: single-line infilling pass@1
      type: pass@1
      value: 0.47
      verified: false
    - name: single-line infilling pass@10
      type: pass@10
      value: 0.74
      verified: false
  - task:
      type: text-generation
    dataset:
      type: lambdasec/gh-top-1000-projects-vulns
      name: GH Top 1000 Projects Vulnerabilities
    metrics:
    - name: pass@1 (Java)
      type: pass@1
      value: 0.26
      verified: false
    - name: pass@10 (Java)
      type: pass@10
      value: 0.48
      verified: false
    - name: pass@1 (Python)
      type: pass@1
      value: 0.31
      verified: false
    - name: pass@10 (Python)
      type: pass@10
      value: 0.56
      verified: false
    - name: pass@1 (JavaScript)
      type: pass@1
      value: 0.36
      verified: false
    - name: pass@10 (JavaScript)
      type: pass@10
      value: 0.62
      verified: false
---

# Model Card for SantaFixer

<!-- Provide a quick summary of what the model is/does. -->

This is an LLM for code, focused on generating bug fixes via infilling.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** [codelion](https://huggingface.co/codelion)
- **Model type:** GPT-2
- **Finetuned from model:** [bigcode/santacoder](https://huggingface.co/bigcode/santacoder)

## How to Get Started with the Model

Use the code below to get started with the model.
```python
# pip install -q transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lambdasec/santafixer"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, trust_remote_code=True
).to(device)

# Fill-in-the-middle prompt: the model generates the code that belongs
# between the prefix and the suffix, emitted after <fim-middle>.
input_text = (
    "<fim-prefix>def print_hello_world():\n    "
    "<fim-suffix>\n    print('Hello world!')"
    "<fim-middle>"
)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
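
In a bug-fix setting, the prefix is the code before the vulnerable line and the suffix is the code after it. A minimal helper can assemble the infilling prompt (illustrative only; `build_fim_prompt` is not part of the model's API):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Wrap the surrounding code in SantaCoder-style FIM tags so the
    # model generates the replacement line after <fim-middle>.
    return f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"

# Hypothetical example: ask the model to infill the line between
# opening and closing a file.
prompt = build_fim_prompt(
    "def read_file(path):\n    f = open(path)\n    ",
    "\n    f.close()",
)
```

The resulting string can be passed to `tokenizer.encode` exactly like `input_text` above.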

## Training Details

- **GPU:** Tesla P100
- **Time:** ~5 hrs
|
| | ### Training Data |
| |
|
| | The model was fine-tuned on the [CVE single line fixes dataset](https://huggingface.co/datasets/lambdasec/cve-single-line-fixes) |
| |
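
Each single-line fix can be turned into an infilling training example by masking the patched line; the sketch below shows the idea (the helper name and field layout are assumptions, not the dataset's actual schema):

```python
def to_training_example(before: str, after: str, fixed_line: str) -> str:
    # The code surrounding the patched line becomes the FIM context;
    # the fixed line itself is the supervised target after <fim-middle>.
    return f"<fim-prefix>{before}<fim-suffix>{after}<fim-middle>{fixed_line}"

# Hypothetical record from a single-line CVE fix:
example = to_training_example(
    before="cmd = ",
    after="\nsubprocess.run(cmd)",
    fixed_line='["ls", user_dir]',
)
```

During fine-tuning the model learns to emit the fixed line when given the surrounding context.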
|
| | ### Training Procedure |
| |
|
| | Supervised Fine Tuning (SFT) |
| |
|
| | #### Training Hyperparameters |
| |
|
| | - **optim:** adafactor |
| | - **gradient_accumulation_steps:** 4 |
| | - **gradient_checkpointing:** true |
| | - **fp16:** false |
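
The hyperparameters above map onto `transformers.TrainingArguments` roughly as follows (a configuration sketch; `output_dir`, batch size, epochs, and learning rate are illustrative assumptions, not values from the actual run):

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments configuration mirroring the listed
# hyperparameters; values marked "assumed" were not reported.
args = TrainingArguments(
    output_dir="santafixer-sft",        # assumed
    optim="adafactor",
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=False,
    per_device_train_batch_size=1,      # assumed
    num_train_epochs=3,                 # assumed
    learning_rate=2e-5,                 # assumed
)
```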

## Evaluation

The model was tested on the [GitHub top 1000 projects vulnerabilities dataset](https://huggingface.co/datasets/lambdasec/gh-top-1000-projects-vulns).
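
The pass@k values above can be computed with the standard unbiased estimator from the Codex evaluation methodology: with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A small implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per problem, 3 correct:
print(round(pass_at_k(10, 3, 1), 2))  # 0.3
```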