---
license: apache-2.0
base_model: malteklaes/based-CodeBERTa-language-id-llm-module
tags:
- generated_from_trainer
model-index:
- name: based-CodeBERTa-language-id-llm-module_uniVienna
  results: []
datasets:
- malteklaes/cpp-code-code_search_net-style
widget:
- text: package main import ( "fmt" "math/rand" "openspiel") func main() {game := openspiel.LoadGame("breakthrough")}
  output:
  - label: Go
    score: 1.0
  example_title: Go example code

- text: public static void malmoCliffWalk() throws MalmoConnectionError, IOException {DQNPolicy<MalmoBox> pol = dql.getPolicy();}
  output:
  - label: Java
    score: 1.0
  example_title: Java example code

- text: var Window = require('../math/window.js') class Agent { constructor(opt) {this.states = this.options.states}}
  output:
  - label: JavaScript
    score: 1.0
  example_title: JavaScript example code

- text: $x = 5; echo $x * 2;
  output:
  - label: PHP
    score: 1.0
  example_title: PHP example code

- text: from stable_baselines3 import PPO if __name__ == '__main__'
  output:
  - label: Python
    score: 1.0
  example_title: Python example code

- text: x = 5; y = 3; puts x + y
  output:
  - label: Ruby
    score: 1.0
  example_title: Ruby example code

- text: "#include 'dqn.h' int main(int argc, char *argv[]) { rlop::Timer timer;}"
  output:
  - label: C++
    score: 1.0
  example_title: C++ example code
---

# based-CodeBERTa-language-id-llm-module_uniVienna

This model is a fine-tuned version of [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module).

## Model description and framework version

- based on the model [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module) (7 programming languages), which in turn is based on [huggingface/CodeBERTa-language-id](https://huggingface.co/huggingface/CodeBERTa-language-id) (6 programming languages)
- tokenizer details:
```
RobertaTokenizerFast(name_or_path='malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna', vocab_size=52000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
    0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    4: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
```
- complete model-config:
```
RobertaConfig {
  "_name_or_path": "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna",
  "_num_labels": 7,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "go",
    "1": "java",
    "2": "javascript",
    "3": "php",
    "4": "python",
    "5": "ruby",
    "6": "cpp"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "cpp": 6,
    "go": 0,
    "java": 1,
    "javascript": 2,
    "php": 3,
    "python": 4,
    "ruby": 5
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.39.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
```
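
Both dumps above can be cross-checked directly against the published checkpoint; in particular, `id2label` is what maps the classifier's output index to a language name. A minimal sketch using standard `transformers` calls (the expected values in the comments are taken from the dumps above):

```python
from transformers import AutoConfig, AutoTokenizer

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(tokenizer.vocab_size)        # 52000
print(tokenizer.model_max_length)  # 512 -- longer inputs must be truncated

config = AutoConfig.from_pretrained(checkpoint)
print(config.num_labels)           # 7
print(config.id2label[6])          # 'cpp'
```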

## Intended uses & limitations

Given a code snippet, the model predicts which of the following seven programming languages it is written in:
- Go
- Java
- JavaScript
- PHP
- Python
- Ruby
- C++

## Usage

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, ignore_mismatched_sizes=True)

myPipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

CODE_TO_IDENTIFY_py = """
def is_prime(n):
    if n <= 1:
        return False
    if n == 2 or n == 3:
        return True
    if n % 2 == 0:
        return False
    max_divisor = int(n ** 0.5)
    for i in range(3, max_divisor + 1, 2):
        if n % i == 0:
            return False
    return True

number = 17
if is_prime(number):
    print(f"{number} is a prime number.")
else:
    print(f"{number} is not a prime number.")
"""

myPipeline(CODE_TO_IDENTIFY_py)  # output: [{'label': 'python', 'score': 0.9999967813491821}]
```
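
To see scores for all seven languages rather than only the top prediction, the call accepts `top_k`; this is generic `TextClassificationPipeline` behavior in recent `transformers` releases, not something specific to this checkpoint:

```python
# top_k=None returns one entry per label, sorted by descending score.
myPipeline(CODE_TO_IDENTIFY_py, top_k=None)
# e.g. [{'label': 'python', 'score': 0.999...}, {'label': 'ruby', 'score': ...}, ...]
```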

## Training and evaluation data

### Training datasets used
- for Go, Java, JavaScript, PHP, Python, Ruby: [code_search_net](https://huggingface.co/datasets/code_search_net)
- for C++: [malteklaes/cpp-code-code_search_net-style](https://huggingface.co/datasets/malteklaes/cpp-code-code_search_net-style)
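
A hedged sketch of how these two sources could be combined into a single 7-label training set. The `load_dataset` calls and the `func_code_string` column are real parts of code_search_net (newer `datasets` versions may additionally require `trust_remote_code=True`); the assumption that the C++ dataset exposes the same column, and the concatenate-and-shuffle strategy, are illustrative rather than the documented procedure:

```python
from datasets import concatenate_datasets, load_dataset

LANGS = ["go", "java", "javascript", "php", "python", "ruby"]
label2id = {lang: i for i, lang in enumerate(LANGS + ["cpp"])}  # matches the model config

parts = []
for lang in LANGS:
    # code_search_net ships one config per language; raw source lives in 'func_code_string'.
    ds = load_dataset("code_search_net", lang, split="train")
    parts.append(ds.map(
        lambda ex, lang=lang: {"code": ex["func_code_string"], "label": label2id[lang]},
        remove_columns=ds.column_names,
    ))

# Assumed to mirror the code_search_net schema, as its name suggests.
cpp = load_dataset("malteklaes/cpp-code-code_search_net-style", split="train")
parts.append(cpp.map(
    lambda ex: {"code": ex["func_code_string"], "label": label2id["cpp"]},
    remove_columns=cpp.column_names,
))

train_ds = concatenate_datasets(parts).shuffle(seed=42)
```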

### Training procedure
- machine: GPU T4 (Google Colab)
- system RAM: 4.7/12.7 GB (during training)
- GPU RAM: 2.8/15.0 GB
- Drive: 69.5/78.5 GB (during training, largely taken up by the complete datasets)
- `trainer.train()` progress snapshot: [x/24136 xx:xx < 31:12, 12.92 it/s, Epoch 0.01/1]
- 24136 iterations in total

### Training note
- Although this model is based on the predecessors mentioned above, it had to be trained from scratch because the [config.json](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna/blob/main/config.json) and labels were extended from 6 to 7 programming languages; a sketch of that head re-initialization follows below.
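
A minimal sketch of that label extension, assuming one starts from the 6-label [huggingface/CodeBERTa-language-id](https://huggingface.co/huggingface/CodeBERTa-language-id) checkpoint; `ignore_mismatched_sizes=True` is the standard `transformers` flag for discarding a classification head whose shape no longer matches, while the remaining weights load normally:

```python
from transformers import AutoModelForSequenceClassification

id2label = {0: "go", 1: "java", 2: "javascript", 3: "php", 4: "python", 5: "ruby", 6: "cpp"}

# The encoder weights are reused; the old 6-way head is dropped and a fresh
# 7-way head is randomly initialized, which is why full retraining was needed.
model = AutoModelForSequenceClassification.from_pretrained(
    "huggingface/CodeBERTa-language-id",
    num_labels=7,
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
    ignore_mismatched_sizes=True,
)
```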

### Training hyperparameters

The following hyperparameters were used during training (training args):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./based-CodeBERTa-language-id-llm-module_uniVienna",
    overwrite_output_dir=True,
    num_train_epochs=0.1,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
)
```
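
These arguments only describe a run once wired into a `Trainer`. A hedged sketch of that wiring, reusing `model` from the training-note sketch and `train_ds` from the dataset sketch above; the `tokenize` helper, the `code` column name, and the collator choice are assumptions, not a published training script:

```python
from transformers import AutoTokenizer, DataCollatorWithPadding, Trainer

tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-language-id")

def tokenize(batch):
    # 512 matches the tokenizer's model_max_length shown above.
    return tokenizer(batch["code"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,         # the 7-label model from the training-note sketch
    args=training_args,  # the TrainingArguments defined above
    train_dataset=train_ds.map(tokenize, batched=True),
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```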

### Training results

- output:
```
TrainOutput(global_step=24136, training_loss=0.005988701689750161, metrics={'train_runtime': 1936.0586, 'train_samples_per_second': 99.731, 'train_steps_per_second': 12.467, 'total_flos': 3197518224531456.0, 'train_loss': 0.005988701689750161, 'epoch': 0.1})
```