---
license: apache-2.0
base_model: malteklaes/based-CodeBERTa-language-id-llm-module
tags:
- generated_from_trainer
model-index:
- name: based-CodeBERTa-language-id-llm-module_uniVienna
  results: []
datasets:
- malteklaes/cpp-code-code_search_net-style
widget:
- text: package main import ( "fmt" "math/rand" "openspiel") func main() {game := openspiel.LoadGame("breakthrough")}
  output:
  - label: Go
    score: 1.0
  example_title: Go example code
- text: public static void malmoCliffWalk() throws MalmoConnectionError, IOException {DQNPolicy pol = dql.getPolicy();}
  output:
  - label: Java
    score: 1.0
  example_title: Java example code
- text: var Window = require('../math/window.js') class Agent { constructor(opt) {this.states = this.options.states}}
  output:
  - label: Javascript
    score: 1.0
  example_title: Javascript example code
- text: $x = 5; echo $x * 2;
  output:
  - label: PHP
    score: 1.0
  example_title: PHP example code
- text: from stable_baselines3 import PPO if __name__ == '__main__'
  output:
  - label: Python
    score: 1.0
  example_title: Python example code
- text: x = 5; y = 3; puts x + y
  output:
  - label: Ruby
    score: 1.0
  example_title: Ruby example code
- text: "#include 'dqn.h' int main(int argc, char *argv[]) { rlop::Timer timer;}"
  output:
  - label: C++
    score: 1.0
  example_title: C++ example code
---

# based-CodeBERTa-language-id-llm-module_uniVienna

This model is a fine-tuned version of [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module).

## Model description and Framework version

- based on model [malteklaes/based-CodeBERTa-language-id-llm-module](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module) (7 programming languages), which in turn is based on [huggingface/CodeBERTa-language-id](https://huggingface.co/huggingface/CodeBERTa-language-id) (6 programming languages)
- model details (tokenizer):

```
RobertaTokenizerFast(name_or_path='malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna', vocab_size=52000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	4: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
```

- complete model-config:

```
RobertaConfig {
  "_name_or_path": "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna",
  "_num_labels": 7,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "go",
    "1": "java",
    "2": "javascript",
    "3": "php",
    "4": "python",
    "5": "ruby",
    "6": "cpp"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "cpp": 6,
    "go": 0,
    "java": 1,
    "javascript": 2,
    "php": 3,
    "python": 4,
    "ruby": 5
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.39.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
```

## Intended uses & limitations

Given a code snippet, the model predicts which of the following programming languages it is written in:
- Go
- Java
- Javascript
- PHP
- Python
- Ruby
- C++

## Usage

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"

# Load the tokenizer and the 7-label classification model once and reuse them.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, ignore_mismatched_sizes=True)

myPipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

CODE_TO_IDENTIFY_py = """
def is_prime(n):
    if n <= 1:
        return False
    if n == 2 or n == 3:
        return True
    if n % 2 == 0:
        return False
    max_divisor = int(n ** 0.5)
    for i in range(3, max_divisor + 1, 2):
        if n % i == 0:
            return False
    return True

number = 17
if is_prime(number):
    print(f"{number} is a prime number.")
else:
    print(f"{number} is not a prime number.")
"""

myPipeline(CODE_TO_IDENTIFY_py)
# output: [{'label': 'python', 'score': 0.9999967813491821}]
```

## Training and evaluation data

### Training-Datasets used

- for Go, Java, Javascript, PHP, Python, Ruby: [code_search_net](https://huggingface.co/datasets/code_search_net)
- for C++: [malteklaes/cpp-code-code_search_net-style](https://huggingface.co/datasets/malteklaes/cpp-code-code_search_net-style)

### Training procedure

- machine: GPU T4 (Google Colab)
  - system-RAM: 4.7/12.7 GB (during training)
  - GPU-RAM: 2.8/15.0 GB
  - Drive: 69.5/78.5 GB (during training due to complete )
- trainer.train(): [x/24136 xx:xx < 31:12, 12.92 it/s, Epoch 0.01/1]
  - total 24136 iterations

### Training note

- Although this model is based on the predecessors mentioned above, this model had to be trained from scratch
  because the [config.json](https://huggingface.co/malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna/blob/main/config.json) and labels of the original model were changed from 6 to 7 programming languages.

### Training hyperparameters

The following hyperparameters were used during training (training args):

```
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./based-CodeBERTa-language-id-llm-module_uniVienna",
    overwrite_output_dir=True,
    num_train_epochs=0.1,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
)
```

### Training results

- output:

```
TrainOutput(global_step=24136, training_loss=0.005988701689750161, metrics={'train_runtime': 1936.0586, 'train_samples_per_second': 99.731, 'train_steps_per_second': 12.467, 'total_flos': 3197518224531456.0, 'train_loss': 0.005988701689750161, 'epoch': 0.1})
```
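
The throughput figures in the `TrainOutput` above can be cross-checked against the step count and the batch size from the training args. The following is a minimal sketch, not part of the original training script. It assumes a single GPU (as on the Colab T4) and no gradient accumulation, so that every step processes `per_device_train_batch_size` samples; the variable names are chosen for illustration only.

```python
# Values copied from the TrainOutput and the training args above.
global_step = 24136
train_runtime_s = 1936.0586
per_device_train_batch_size = 8

# Assumption: one GPU and no gradient accumulation,
# so samples per step equals the per-device batch size.
steps_per_second = global_step / train_runtime_s
samples_per_second = global_step * per_device_train_batch_size / train_runtime_s

print(f"steps/s:   {steps_per_second:.3f}")
print(f"samples/s: {samples_per_second:.2f}")
print(f"runtime:   {train_runtime_s / 60:.1f} min")
```

Both values should land close to the reported `train_steps_per_second` (12.467) and `train_samples_per_second` (99.731). Small deviations are plausible, for example from rounding or a smaller final batch.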