Neo111x AverageBusinessUser committed
Commit ca2aa02 · 0 Parent(s)

Duplicate from AverageBusinessUser/aidapal

Co-authored-by: chris atredis <AverageBusinessUser@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ aidapal-8k.Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ license: apache-2.0
+ ---
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/665f6c24725ada895d4d54d5/CV-VwtGXMoQJOvVnl9VZL.png)
+
+ aiDAPal is a fine-tune of mistral7b-instruct that assists with analysis of Hex-Rays pseudocode. This repository contains the fine-tuned model, the dataset used for training, and example training and evaluation scripts.
+
+ The associated aiDAPal IDA Pro plugin can be downloaded from GitHub: https://github.com/atredispartners/aidapal
+
+ Background on the project and the training process is covered in the associated blog post: https://atredis.com/blog/2024/6/3/how-to-train-your-large-language-model
aidapal-8k.Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8ff55be57629cfb21d60d4977ffb6c09071104d08bce8b499e78b10481b0a3a
+ size 4368439552
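The file above is not the model itself but a Git LFS pointer (spec v1): three key/value lines recording the spec version, the SHA-256 of the real blob, and its size in bytes. A minimal sketch of reading such a pointer (the helper name is ours, not part of the repo):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file (spec v1) into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        # each line is "<key> <value>", split on the first space
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d8ff55be57629cfb21d60d4977ffb6c09071104d08bce8b499e78b10481b0a3a
size 4368439552"""
info = parse_lfs_pointer(pointer)
```

If a clone shows a small text file here instead of a ~4.4 GB GGUF, the LFS objects were not fetched (`git lfs pull`).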
aidapal.modelfile ADDED
@@ -0,0 +1,15 @@
+ FROM ./aidapal-8k.Q4_K_M.gguf
+ TEMPLATE """{{ .System }}
+ [INST]
+ {{ .Prompt }}
+ [/INST]
+ """
+ SYSTEM """<s>[INST]You are an expert at analyzing code that has been decompiled with IDA Hex-Rays into IDA Hex-Rays pseudocode. As an IDA Hex-Rays pseudocode analyzer, you will be provided code that may or may not have symbols and variable names. You will analyze the IDA Hex-Rays pseudocode and explain exactly what each line is doing. Then you will review your analysis and determine a potential name for the function and the variables within the function. Your task is to use your knowledge of reverse engineering, IDA Hex-Rays pseudocode, and C to assist the user with analysis and reverse engineering. Provide a detailed description of the Hex-Rays pseudocode to the user explaining what the code does, suggest a function name based on the analysis of the pseudocode, and new variable names based on the analysis of the code. Only respond with valid JSON using the keys 'function_name', 'comment', and an array 'variables'. Values should use plain ASCII with no special characters.
+ Analyze the following IDA Hex-Rays pseudocode and generate a valid JSON object containing the keys 'function_name', 'comment', and an array 'variables' explaining what the code does, suggest a function name based on the analysis of the code, and new variable names based on the analysis of the code.[/INST]</s>
+ """
+ PARAMETER num_ctx 8192
+ PARAMETER repeat_last_n 512
+ PARAMETER repeat_penalty 1.1
+ PARAMETER temperature 1.2
+ PARAMETER top_k 100
+ PARAMETER top_p 0.09
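The file above is an Ollama Modelfile: `FROM` points at the local GGUF, `TEMPLATE`/`SYSTEM` wrap the user's pseudocode in the Mistral `[INST]` format, and the `PARAMETER` lines set sampling options. After registering it with `ollama create aidapal -f aidapal.modelfile`, one way to query it is Ollama's `/api/generate` HTTP endpoint. A minimal sketch of building the request body (the model name `aidapal` and the use of `"format": "json"` are our assumptions, not something this repo prescribes):

```python
def build_request(pseudocode: str) -> dict:
    """Build a JSON body for Ollama's POST /api/generate endpoint."""
    return {
        "model": "aidapal",    # name given to `ollama create` (assumed)
        "prompt": pseudocode,  # the Modelfile TEMPLATE wraps this in [INST]...[/INST]
        "stream": False,       # return one complete response instead of a token stream
        "format": "json",      # ask Ollama to constrain output to valid JSON
    }

body = build_request("int __fastcall sub_75C80(int a1, int a2) { ... }")
```

Sending `body` as JSON to `http://localhost:11434/api/generate` (with an Ollama server running) should yield a response whose `response` field contains the model's JSON analysis.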
dataset/gpt4_juiced_dataset.json ADDED
The diff for this file is too large to render. See raw diff
 
training/README.md ADDED
@@ -0,0 +1,62 @@
+ # llm tirefire
+
+ Set up and install the prerequisites for https://github.com/unslothai/unsloth. This should be correct:
+ ```
+ conda create --name unsloth_env python=3.10
+ conda activate unsloth_env
+ conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -c xformers -c conda-forge -y
+ pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"
+ ```
+
+ Run the training using mistral-7b as your base for 100 steps using `./dataset/gpt4_juiced_dataset.json`:
+ ```
+ $ python training/train.py unsloth/mistral-7b-instruct-v0.2-bnb-4bit 100 ./datasets/gpt4_juiced_dataset.json
+ ==((====))== Unsloth: Fast Mistral patching release 2024.2
+    \\  /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.691 GB. Platform = Linux.
+ O^O/ \_/ \   Pytorch: 2.2.0. CUDA = 8.6. CUDA Toolkit = 12.1.
+ \        /   Bfloat16 = TRUE. Xformers = 0.0.24. FA = False.
+  "-____-"    Free Apache license: http://github.com/unslothai/unsloth
+ /mnt/new/unsloth/lib/python3.10/site-packages/transformers/quantizers/auto.py:155: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
+   warnings.warn(warning_msg)
+ Unsloth 2024.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
+ Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
+ GPU = NVIDIA GeForce RTX 3090. Max memory = 23.691 GB.
+ 4.676 GB of memory reserved.
+ ==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
+    \\  /|    Num examples = 2,897 | Num Epochs = 3
+ O^O/ \_/ \   Batch size per device = 4 | Gradient Accumulation steps = 4
+ \        /   Total batch size = 16 | Total steps = 500
+  "-____-"    Number of trainable parameters = 83,886,080
+ {'loss': 1.4802, 'grad_norm': 1.6030948162078857, 'learning_rate': 4e-05, 'epoch': 0.01}
+ {'loss': 1.4201, 'grad_norm': 1.4948327541351318, 'learning_rate': 8e-05, 'epoch': 0.01}
+ {'loss': 1.5114, 'grad_norm': 1.6689960956573486, 'learning_rate': 0.00012, 'epoch': 0.02}
+ {'loss': 1.1665, 'grad_norm': 0.9258238673210144, 'learning_rate': 0.00016, 'epoch': 0.02}
+ {'loss': 0.9282, 'grad_norm': 0.6133134961128235, 'learning_rate': 0.0002, 'epoch': 0.03}
+ {'loss': 0.9292, 'grad_norm': 0.6610234975814819, 'learning_rate': 0.0001995959595959596, 'epoch': 0.03}
+ {'loss': 0.7517, 'grad_norm': 0.4809339940547943, 'learning_rate': 0.0001991919191919192, 'epoch': 0.04}
+ {'loss': 0.7554, 'grad_norm': 0.6171303987503052, 'learning_rate': 0.00019878787878787878, 'epoch': 0.04}
+ {'loss': 0.606, 'grad_norm': 0.564286470413208, 'learning_rate': 0.00019838383838383837, 'epoch': 0.05}
+ {'loss': 0.6274, 'grad_norm': 0.414183109998703, 'learning_rate': 0.000197979797979798, 'epoch': 0.06}
+ {'loss': 0.6402, 'grad_norm': 0.3489008843898773, 'learning_rate': 0.0001975757575757576, 'epoch': 0.06}
+ {'loss': 0.596, 'grad_norm': 0.28150686621665955, 'learning_rate': 0.0001971717171717172, 'epoch': 0.07}
+ {'loss': 0.5056, 'grad_norm': 0.3132913410663605, 'learning_rate': 0.00019676767676767677, 'epoch': 0.07}
+ {'loss': 0.5384, 'grad_norm': 0.27469128370285034, 'learning_rate': 0.00019636363636363636, 'epoch': 0.08}
+ {'loss': 0.5744, 'grad_norm': 0.360963374376297, 'learning_rate': 0.00019595959595959596, 'epoch': 0.08}
+ {'loss': 0.5907, 'grad_norm': 0.3328467011451721, 'learning_rate': 0.00019555555555555556, 'epoch': 0.09}
+ {'loss': 0.5067, 'grad_norm': 0.2794954478740692, 'learning_rate': 0.00019515151515151516, 'epoch': 0.09}
+ {'loss': 0.5563, 'grad_norm': 0.2907596528530121, 'learning_rate': 0.00019474747474747476, 'epoch': 0.1}
+ {'loss': 0.5533, 'grad_norm': 0.34755516052246094, 'learning_rate': 0.00019434343434343435, 'epoch': 0.1}
+ ```
+
+ With checkpoints configured at 50 steps:
+ ```
+ output_dir = "outputs",
+ save_strategy = "steps",
+ save_steps = 50
+ ```
+
+ A directory named `outputs` will be created containing a saved model every 50 steps. This is useful if training crashes or you want to restart from a specific point. You can also use `eval.py` to iterate across these checkpoints and compare evaluations:
+ ```
+ for m in $(ls outputs); do python eval.py outputs/$m; done
+ ```
training/eval.py ADDED
@@ -0,0 +1,88 @@
+ from unsloth import FastLanguageModel
+ import torch, sys, os
+
+ model_name_input = sys.argv[1]
+
+ max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
+ dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
+ load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = model_name_input,
+     max_seq_length = max_seq_length,
+     dtype = dtype,
+     load_in_4bit = load_in_4bit,
+     # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
+ )
+
+ alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ {}
+
+ ### Input:
+ {}
+
+ ### Response:
+ {}"""
+
+ EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
+ def formatting_prompts_func(examples):
+     instructions = examples["instruction"]
+     inputs = examples["input"]
+     outputs = examples["output"]
+     texts = []
+     for instruction, input, output in zip(instructions, inputs, outputs):
+         # Must add EOS_TOKEN, otherwise your generation will go on forever!
+         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
+         texts.append(text)
+     return { "text" : texts, }
+
+ # load and convert the dataset into the prompt format
+ # note: evaluation prompts are read from data.json in the current directory
+ from datasets import load_dataset
+ dataset = load_dataset("json", data_files="data.json", split = "train")
+ dataset = dataset.map(formatting_prompts_func, batched = True,)
+
+ FastLanguageModel.for_inference(model)
+ # generate completions for the first sample_size items from the dataset
+ samples = []
+ sample_size = 10
+ for x in range(0, sample_size):
+     instruction = dataset[x]["instruction"]
+     input = dataset[x]["input"]
+     output = ''
+     text = alpaca_prompt.format(instruction, input, output)
+     sample = tokenizer([text], return_tensors = "pt").to("cuda")
+     out = model.generate(**sample, max_new_tokens=4096, use_cache=True)
+     out = tokenizer.batch_decode(out)
+     samples.append(out[0])
+
+ # new one not in your dataset goes here
+ code = '''int __fastcall sub_75C80(int a1, int a2)
+ {
+   int result; // r0
+   _DWORD *i; // r3
+
+   result = a2 - *(_DWORD *)(a1 + 12);
+   for ( i = *(_DWORD **)(a1 + 48); i; i = (_DWORD *)*i )
+   {
+     if ( i[2] < result )
+       result = i[2];
+   }
+   return result;
+ }'''
+
+ text = alpaca_prompt.format(instruction, code, output)
+ sample = tokenizer([text], return_tensors = "pt").to("cuda")
+ out = model.generate(**sample, max_new_tokens=4096, use_cache=True)
+ out = tokenizer.batch_decode(out)
+ samples.append(out[0])
+
+ print('Capturing generation samples')
+ os.makedirs('results', exist_ok=True) # ensure the log directory exists before writing
+ with open(f'results/eval_log_{model_name_input.replace("/","_")}', 'w') as log:
+     for r in samples:
+         log.write(r)
training/train.py ADDED
@@ -0,0 +1,131 @@
+ from unsloth import FastLanguageModel
+ import torch, sys
+
+ model = sys.argv[1]
+ steps = int(sys.argv[2])
+ training_data = sys.argv[3]
+
+ max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
+ dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
+ load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
+
+ # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
+ fourbit_models = [
+     "unsloth/mistral-7b-bnb-4bit",
+     "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
+     "unsloth/llama-2-7b-bnb-4bit",
+     "unsloth/llama-2-13b-bnb-4bit",
+     "unsloth/codellama-34b-bnb-4bit",
+     "unsloth/tinyllama-bnb-4bit",
+ ] # More models at https://huggingface.co/unsloth
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = model,
+     max_seq_length = max_seq_length,
+     dtype = dtype,
+     load_in_4bit = load_in_4bit,
+ )
+
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128 - r/rank is how strong you want your training to apply
+     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                       "gate_proj", "up_proj", "down_proj",],
+     lora_alpha = 16, # alpha is a multiplier against r/rank
+     lora_dropout = 0, # Supports any, but = 0 is optimized
+     bias = "none", # Supports any, but = "none" is optimized
+     use_gradient_checkpointing = True,
+     random_state = 3407,
+     use_rslora = False, # We support rank stabilized LoRA
+     loftq_config = None, # And LoftQ
+ )
+
+ alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ {}
+
+ ### Input:
+ {}
+
+ ### Response:
+ {}"""
+
+ EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
+ def formatting_prompts_func(examples):
+     instructions = examples["instruction"]
+     inputs = examples["input"]
+     outputs = examples["output"]
+     texts = []
+     for instruction, input, output in zip(instructions, inputs, outputs):
+         # Must add EOS_TOKEN, otherwise your generation will go on forever!
+         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
+         texts.append(text)
+     return { "text" : texts, }
+
+ # load and convert the dataset into the prompt format
+ from datasets import load_dataset
+ dataset = load_dataset("json", data_files=training_data, split = "train")
+ dataset = dataset.map(formatting_prompts_func, batched = True,)
+
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+
+ trainer = SFTTrainer(
+     model = model,
+     tokenizer = tokenizer,
+     train_dataset = dataset,
+     dataset_text_field = "text",
+     max_seq_length = max_seq_length,
+     dataset_num_proc = 2,
+     packing = False, # Can make training 5x faster for short sequences.
+     args = TrainingArguments(
+         per_device_train_batch_size = 4,
+         gradient_accumulation_steps = 4,
+         warmup_steps = 5,
+         max_steps = steps,
+         learning_rate = 2e-4,
+         fp16 = not torch.cuda.is_bf16_supported(),
+         bf16 = torch.cuda.is_bf16_supported(),
+         logging_steps = 1,
+         optim = "adamw_8bit",
+         weight_decay = 0.01,
+         lr_scheduler_type = "linear",
+         seed = 3407,
+         output_dir = "outputs",
+         save_strategy = "steps",
+         save_steps = 50,
+     ),
+ )
+
+ gpu_stats = torch.cuda.get_device_properties(0)
+ start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+ max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+ print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+ print(f"{start_gpu_memory} GB of memory reserved.")
+
+ # execute the actual training
+ trainer_stats = trainer.train()
+
+ used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+ used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+ used_percentage = round(used_memory / max_memory * 100, 3)
+ lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+ print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+ print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
+ print(f"Peak reserved memory = {used_memory} GB.")
+ print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+ print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+ print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+ model.save_pretrained(f"lora_model_{steps}") # Local saving
+
+ # Just LoRA adapters
+ model.save_pretrained_merged(f"model_{steps}", tokenizer, save_method = "lora",)
+
+ # Save to q4_k_m GGUF
+ model.save_pretrained_gguf(f"model_{steps}", tokenizer, quantization_method = "q4_k_m")