---
license: llama2
model-index:
- name: ETRI_CodeLLaMA_7B_CPP
  results:
  - task:
      type: text-generation
    dataset:
      type: HumanEval-X
      name: humanevalsynthesize-cpp
    metrics:
    - name: pass@1
      type: pass@1
      value: 34.3%
      verified: false
---

## **ETRI_CodeLLaMA_7B_CPP**

We used LoRA to further pre-train Meta's CodeLLaMA-7B-hf model on high-quality C++ code tokens.

We then fine-tuned it on CodeM's C++ instruction data.
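
For reference, here is a minimal sketch of what a LoRA setup for this kind of training might look like with the `peft` library; the rank, alpha, and target modules below are illustrative assumptions, not the values used to train this model.

```
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model; LoRA trains only small adapter matrices on top of it.
base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```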

## Model Details

ETRI_CodeLLaMA_7B_CPP is a C++-specialized model. It was trained using LoRA and achieved a pass@1 of 34.3% on HumanEval-X (C++).

## Dataset Details

We further pre-trained CodeLLaMA-7B on 543 GB of C++ code collected online, and fine-tuned it on CodeM's C++ instruction data. We used a single A100-80GB GPU for training.

## Requirements

```
pip install torch transformers accelerate
```

## How to reproduce HumanEval-X results

We use the bigcode-evaluation-harness repo to evaluate our trained model.

First, clone bigcode-evaluation-harness:

```
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
```

Then, run main.py as follows.

```
accelerate launch bigcode-evaluation-harness/main.py \
    --model DDIDU/ETRI_CodeLLaMA_7B_CPP \
    --max_length_generation 512 \
    --prompt continue \
    --tasks humanevalsynthesize-cpp \
    --temperature 0.2 \
    --n_samples 100 \
    --precision bf16 \
    --do_sample True \
    --batch_size 10 \
    --allow_code_execution \
    --save_generations
```
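
For reference, pass@k is estimated from the n sampled generations per problem using the standard unbiased estimator from the HumanEval paper, which the harness implements. A minimal sketch:

```
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n_samples=100, a problem where 34 generations pass contributes
# pass@1 = 34/100; the reported score averages this over all problems.
print(pass_at_k(100, 34, 1))  # 0.34
```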

## Model use

```
from transformers import AutoTokenizer
import transformers
import torch

model = "DDIDU/ETRI_CodeLLaMA_7B_CPP"

# Load the tokenizer and build a bfloat16 text-generation pipeline.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Ask the model to complete a C++ quicksort implementation.
sequences = pipeline(
    '#include <iostream>\n#include <vector>\n\nusing namespace std;\n\nvoid quickSort(int *data, int start, int end) {',
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
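
Note that `generated_text` includes the prompt followed by the model's completion by default; pass `return_full_text=False` to the pipeline call if you only want the completion.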