---
license: apache-2.0
base_model: malteklaes/based-CodeBERTa-language-id-llm-module
tags:
- generated_from_trainer
model-index:
- name: based-CodeBERTa-language-id-llm-module_uniVienna
results: []
datasets:
- malteklaes/cpp-code-code_search_net-style
widget:
- text: package main import ( "fmt" "math/rand" "openspiel") func main() {game := openspiel.LoadGame("breakthrough")}
output:
- label: Go
score: 1.0
example_title: Go example code
- text: public static void malmoCliffWalk() throws MalmoConnectionError, IOException {DQNPolicy<MalmoBox> pol = dql.getPolicy();}
output:
- label: Java
score: 1.0
example_title: Java example code
- text: var Window = require('../math/window.js') class Agent { constructor(opt) {this.states = this.options.states}}
output:
- label: Javascript
score: 1.0
example_title: Javascript example code
- text: $x = 5; echo $x * 2;
output:
- label: PHP
score: 1.0
example_title: PHP example code
- text: from stable_baselines3 import PPO if __name__ == '__main__'
output:
- label: Python
score: 1.0
example_title: Python example code
- text: x = 5; y = 3; puts x + y
output:
- label: Ruby
score: 1.0
example_title: Ruby example code
- text: '#include "dqn.h" int main(int argc, char *argv[]) { rlop::Timer timer;}'
output:
- label: C++
score: 1.0
example_title: C++ example code
---
# based-CodeBERTa-language-id-llm-module_uniVienna
This model is a fine-tuned version of [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module).
## Model description and Framework version
- based on model [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module) (7 programming languages), which in turn is based on [huggingface/CodeBERTa-language-id](https://huggingface.co/huggingface/CodeBERTa-language-id) (6 programming languages)
- model details:
```
RobertaTokenizerFast(name_or_path='malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna', vocab_size=52000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
4: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
```
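The special-token ids listed above (`<s>`=0, `<pad>`=1, `</s>`=2) follow the standard RoBERTa layout with right-side padding and truncation. As a hedged sketch (a simplification, not the tokenizer's actual implementation), a single code snippet is wrapped and padded like this:

```python
# Minimal sketch of RoBERTa-style single-sequence formatting,
# using the special-token ids from the tokenizer dump above.
BOS, PAD, EOS = 0, 1, 2  # <s>, <pad>, </s>

def wrap_and_pad(token_ids, max_length=512):
    """Add <s>/</s>, naively truncate on the right, and right-pad to max_length."""
    ids = [BOS] + token_ids + [EOS]
    ids = ids[:max_length]  # simplified right truncation
    return ids + [PAD] * (max_length - len(ids))

print(wrap_and_pad([10, 11, 12], max_length=8))
# [0, 10, 11, 12, 2, 1, 1, 1]
```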
- complete model-config:
```
RobertaConfig {
"_name_or_path": "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna",
"_num_labels": 7,
"architectures": [
"RobertaForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "go",
"1": "java",
"2": "javascript",
"3": "php",
"4": "python",
"5": "ruby",
"6": "cpp"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"cpp": 6,
"go": 0,
"java": 1,
"javascript": 2,
"php": 3,
"python": 4,
"ruby": 5
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"problem_type": "single_label_classification",
"torch_dtype": "float32",
"transformers_version": "4.39.3",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 52000
}
```
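The `id2label` and `label2id` tables in the config are exact inverses; a quick check of that, plus the lookup from a predicted class index to its language tag:

```python
# Label maps copied from the config above; the two must be inverses.
id2label = {0: "go", 1: "java", 2: "javascript", 3: "php",
            4: "python", 5: "ruby", 6: "cpp"}
label2id = {label: idx for idx, label in id2label.items()}

# Every id round-trips through label2id.
assert all(label2id[label] == idx for idx, label in id2label.items())

predicted_class = 6  # e.g. argmax over the model's 7 logits
print(id2label[predicted_class])  # cpp
```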
## Intended uses & limitations
For a given code snippet, the model determines which of the following programming languages it is written in:
- Go
- Java
- Javascript
- PHP
- Python
- Ruby
- C++
## Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, ignore_mismatched_sizes=True)
myPipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
CODE_TO_IDENTIFY_py = """
def is_prime(n):
if n <= 1:
return False
if n == 2 or n == 3:
return True
if n % 2 == 0:
return False
max_divisor = int(n ** 0.5)
for i in range(3, max_divisor + 1, 2):
if n % i == 0:
return False
return True
number = 17
if is_prime(number):
print(f"{number} is a prime number.")
else:
print(f"{number} is not a prime number.")
"""
myPipeline(CODE_TO_IDENTIFY_py) # output: [{'label': 'python', 'score': 0.9999967813491821}]
```
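The pipeline returns the lowercase config labels (`python`, `cpp`, …) rather than the display names used in the widget examples above. A small post-processing helper can bridge the two; the helper name and mapping below are illustrative assumptions, not part of the model:

```python
# Hypothetical helper: map the model's config labels to display names.
DISPLAY_NAMES = {
    "go": "Go", "java": "Java", "javascript": "Javascript",
    "php": "PHP", "python": "Python", "ruby": "Ruby", "cpp": "C++",
}

def top_prediction(pipeline_output):
    """pipeline_output: list of dicts, e.g. [{'label': 'python', 'score': 0.99}]."""
    best = max(pipeline_output, key=lambda d: d["score"])
    return DISPLAY_NAMES[best["label"]], best["score"]

print(top_prediction([{"label": "python", "score": 0.9999967813491821}]))
# ('Python', 0.9999967813491821)
```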
## Training and evaluation data
### Training datasets used
- for Go, Java, Javascript, PHP, Python, Ruby: [code_search_net](https://huggingface.co/datasets/code_search_net)
- for C++: [malteklaes/cpp-code-code_search_net-style](https://huggingface.co/datasets/malteklaes/cpp-code-code_search_net-style)
### Training procedure
- machine: T4 GPU (Google Colab)
- system RAM: 4.7/12.7 GB (during training)
- GPU RAM: 2.8/15.0 GB
- disk: 69.5/78.5 GB (during training)
- trainer.train(): [x/24136 xx:xx < 31:12, 12.92 it/s, Epoch 0.01/1]
- 24136 training steps in total
### Training note
- Although this model is based on the predecessors mentioned above, it had to be trained from scratch because the [config.json](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna/blob/main/config.json) and label set were extended from 6 to 7 programming languages.
### Training hyperparameters
The following hyperparameters were used during training (training args):
```
training_args = TrainingArguments(
output_dir="./based-CodeBERTa-language-id-llm-module_uniVienna",
overwrite_output_dir=True,
num_train_epochs=0.1,
per_device_train_batch_size=8,
save_steps=500,
save_total_limit=2,
)
```
### Training results
- output:
```
TrainOutput(global_step=24136, training_loss=0.005988701689750161, metrics={'train_runtime': 1936.0586, 'train_samples_per_second': 99.731, 'train_steps_per_second': 12.467, 'total_flos': 3197518224531456.0, 'train_loss': 0.005988701689750161, 'epoch': 0.1})
```
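The reported metrics are mutually consistent, which can be verified with a quick arithmetic cross-check (pure Python, no training involved; the tolerances are assumptions to absorb rounding in the logged values):

```python
# Cross-check the TrainOutput metrics above against each other.
runtime_s = 1936.0586
steps_per_second = 12.467
samples_per_second = 99.731
global_step = 24136
per_device_batch = 8  # from the TrainingArguments above

# total steps ≈ runtime * steps/s (within 1%)
assert abs(runtime_s * steps_per_second - global_step) < global_step * 0.01
# samples/s ≈ steps/s * batch size (within 1 sample/s)
assert abs(steps_per_second * per_device_batch - samples_per_second) < 1.0
print("metrics are consistent")
```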