---
license: apache-2.0
base_model: malteklaes/based-CodeBERTa-language-id-llm-module
tags:
- generated_from_trainer
model-index:
- name: based-CodeBERTa-language-id-llm-module_uniVienna
  results: []
datasets:
- malteklaes/cpp-code-code_search_net-style
widget:
  - text: package main import (	"fmt" 	"math/rand" 	"openspiel") func main() {game := openspiel.LoadGame("breakthrough")}
    output:
      - label: Go
        score: 1.0
    example_title: Go example code
    
  - text: public static void malmoCliffWalk() throws MalmoConnectionError, IOException {DQNPolicy<MalmoBox> pol = dql.getPolicy();}
    output:
      - label: Java
        score: 1.0
    example_title: Java example code
    
  - text: var Window = require('../math/window.js') class Agent { constructor(opt) {this.states = this.options.states}}
    output:
      - label: Javascript
        score: 1.0
    example_title: Javascript example code
    
  - text: $x = 5; echo $x * 2;
    output:
      - label: PHP
        score: 1.0
    example_title: PHP example code
    
  - text: from stable_baselines3 import PPO  if __name__ == '__main__' 
    output:
      - label: Python
        score: 1.0
    example_title: Python example code
    
  - text: x = 5; y = 3; puts x + y 
    output:
      - label: Ruby
        score: 1.0
    example_title: Ruby example code
    
  - text: "#include 'dqn.h' int main(int argc, char *argv[]) { rlop::Timer timer;}"
    output:
      - label: C++
        score: 1.0
    example_title: C++ example code
---



# based-CodeBERTa-language-id-llm-module_uniVienna

This model is a fine-tuned version of [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module).

## Model description and Framework version

- based on model [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module) (7 programming languages), which in turn is based on [huggingface/CodeBERTa-language-id](https://huggingface.co/huggingface/CodeBERTa-language-id) (6 programming languages)
- model details:
```
RobertaTokenizerFast(name_or_path='malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna', vocab_size=52000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	4: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
```
- complete model-config:
```
RobertaConfig {
  "_name_or_path": "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna",
  "_num_labels": 7,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "go",
    "1": "java",
    "2": "javascript",
    "3": "php",
    "4": "python",
    "5": "ruby",
    "6": "cpp"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "cpp": 6,
    "go": 0,
    "java": 1,
    "javascript": 2,
    "php": 3,
    "python": 4,
    "ruby": 5
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.39.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
```
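The `id2label` table above is what turns the classifier's raw output into a language name. A minimal decoding sketch (pure Python, with made-up logits standing in for a real forward pass):

```python
import math

# id2label copied from the config above
id2label = {0: "go", 1: "java", 2: "javascript", 3: "php",
            4: "python", 5: "ruby", 6: "cpp"}

# Mock logits, one score per label; in practice these come from the model.
logits = [-1.2, -0.8, -1.5, -2.0, 6.3, -1.1, -0.4]

# softmax over the 7 scores (single_label_classification)
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": round(probs[best], 4)})
```

This is the same `{'label': ..., 'score': ...}` shape the pipeline in the Usage section returns.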

## Intended uses & limitations

Given a code snippet, the model identifies which of the following programming languages it is written in:
- Go
- Java
- Javascript
- PHP
- Python
- Ruby
- C++

## Usage

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, ignore_mismatched_sizes=True)

myPipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

CODE_TO_IDENTIFY_py = """
def is_prime(n):
    if n <= 1:
        return False
    if n == 2 or n == 3:
        return True
    if n % 2 == 0:
        return False
    max_divisor = int(n ** 0.5)
    for i in range(3, max_divisor + 1, 2):
        if n % i == 0:
            return False
    return True

number = 17
if is_prime(number):
    print(f"{number} is a prime number.")
else:
    print(f"{number} is not a prime number.")

"""

myPipeline(CODE_TO_IDENTIFY_py) # output: [{'label': 'python', 'score': 0.9999967813491821}]
```

## Training and evaluation data

### Training-Datasets used
- for Go, Java, Javascript, PHP, Python, Ruby: [code_search_net](https://huggingface.co/datasets/code_search_net)
- for C++: [malteklaes/cpp-code-code_search_net-style](https://huggingface.co/datasets/malteklaes/cpp-code-code_search_net-style)
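Conceptually, the two sources are merged into one labeled corpus using the model's `label2id` map. An illustrative sketch (not the author's training script; the snippets are stand-ins for real dataset rows):

```python
# label2id copied from the model config
label2id = {"go": 0, "java": 1, "javascript": 2, "php": 3,
            "python": 4, "ruby": 5, "cpp": 6}

# Six languages come from code_search_net, C++ from the companion dataset.
samples = [
    ("func main() { fmt.Println(1) }", "go"),
    ("def f(): return 1", "python"),
    ("int main() { return 0; }", "cpp"),
]

# Each training example pairs raw source text with its integer label.
train_set = [{"text": code, "label": label2id[lang]} for code, lang in samples]
print(train_set[-1])
```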

### Training procedure
- machine: GPU T4 (Google Colab)
  - system-RAM: 4.7/12.7 GB (during training)
  - GPU-RAM: 2.8/15.0GB
  - Drive: 69.5/78.5 GB (during training)
- trainer.train(): [x/24136 xx:xx < 31:12, 12.92 it/s, Epoch 0.01/1]
  - total 24136 iterations
 
### Training note
- Although this model builds on the predecessors mentioned above, it had to be trained from scratch because the [config.json](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna/blob/main/config.json) and the label set were extended from 6 to 7 programming languages.


### Training hyperparameters

The following hyperparameters were used during training (training args):
```
training_args = TrainingArguments(
    output_dir="./based-CodeBERTa-language-id-llm-module_uniVienna",
    overwrite_output_dir=True,
    num_train_epochs=0.1,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
)
```

### Training results

- output:
```
TrainOutput(global_step=24136, training_loss=0.005988701689750161, metrics={'train_runtime': 1936.0586, 'train_samples_per_second': 99.731, 'train_steps_per_second': 12.467, 'total_flos': 3197518224531456.0, 'train_loss': 0.005988701689750161, 'epoch': 0.1})
```
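The reported metrics are internally consistent, which is a quick sanity check worth doing on any `TrainOutput`. The values below are copied from the output above; the batch size comes from the training args:

```python
# Cross-checking the reported TrainOutput metrics.
global_step = 24136
train_runtime = 1936.0586          # seconds
steps_per_second = 12.467
samples_per_second = 99.731
per_device_train_batch_size = 8    # from the training args above

# steps/s * batch size should match samples/s (up to rounding)
assert abs(steps_per_second * per_device_train_batch_size - samples_per_second) < 0.1

# total steps / runtime should match the reported steps/s
assert abs(global_step / train_runtime - steps_per_second) < 0.01
```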