Neo111x AverageBusinessUser committed
Commit ca2aa02 · 0 Parent(s)

Duplicate from AverageBusinessUser/aidapal

Co-authored-by: chris atredis <AverageBusinessUser@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ aidapal-8k.Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ license: apache-2.0
+ ---
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/665f6c24725ada895d4d54d5/CV-VwtGXMoQJOvVnl9VZL.png)
+
+ aiDAPal is a fine-tune of mistral7b-instruct that assists with analysis of Hex-Rays pseudocode. This repository contains the fine-tuned model, the dataset used for training, and example training and evaluation scripts.
+
+ The associated aiDAPal IDA Pro plugin can be downloaded from GitHub: https://github.com/atredispartners/aidapal
+
+ Background on the project and the training process is covered in the associated blog post: https://atredis.com/blog/2024/6/3/how-to-train-your-large-language-model
aidapal-8k.Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8ff55be57629cfb21d60d4977ffb6c09071104d08bce8b499e78b10481b0a3a
+ size 4368439552
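The file above is not the model itself but a Git LFS pointer (spec v1): three key/value lines recording the spec version, the SHA-256 of the real blob, and its size in bytes. A minimal sketch of reading such a pointer (the helper name is ours, not part of the repo):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file (spec v1) into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        # each line is "<key> <value>", split on the first space
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d8ff55be57629cfb21d60d4977ffb6c09071104d08bce8b499e78b10481b0a3a
size 4368439552"""
info = parse_lfs_pointer(pointer)
```

If a clone shows a small text file here instead of a ~4.4 GB GGUF, the LFS objects were not fetched (`git lfs pull`).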
aidapal.modelfile ADDED
@@ -0,0 +1,15 @@
+ FROM ./aidapal-8k.Q4_K_M.gguf
+ TEMPLATE """{{ .System }}
+ [INST]
+ {{ .Prompt }}
+ [/INST]
+ """
+ SYSTEM """<s>[INST]You are an expert at analyzing code that has been decompiled with IDA Hex-Rays into IDA Hex-Rays pseudocode. As an IDA Hex-Rays pseudocode analyzer, you will be provided code that may or may not have symbols and variable names. You will analyze the IDA Hex-Rays pseudocode and explain exactly what each line is doing. Then you will review your analysis and determine a potential name for the function and the variables within the function. Your task is to use your knowledge of reverse engineering, IDA Hex-Rays pseudocode, and C to assist the user with analysis and reverse engineering. Provide a detailed description of the Hex-Rays pseudocode to the user explaining what the code does, suggest a function name based on the analysis of the pseudocode, and new variable names based on the analysis of the code. Only respond with valid JSON using the keys 'function_name', 'comment', and an array 'variables'. Values should use plain ASCII with no special characters.
+ Analyze the following IDA Hex-Rays pseudocode and generate a valid JSON object containing the keys 'function_name', 'comment', and an array 'variables' explaining what the code does, suggest a function name based on the analysis of the code, and new variable names based on the analysis of the code.[/INST]</s>
+ """
+ PARAMETER num_ctx 8192
+ PARAMETER repeat_last_n 512
+ PARAMETER repeat_penalty 1.1
+ PARAMETER temperature 1.2
+ PARAMETER top_k 100
+ PARAMETER top_p 0.09
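The file above is an Ollama Modelfile: `FROM` points at the local GGUF, `TEMPLATE`/`SYSTEM` wrap the user's pseudocode in the Mistral `[INST]` format, and the `PARAMETER` lines set sampling options. After registering it with `ollama create aidapal -f aidapal.modelfile`, one way to query it is Ollama's `/api/generate` HTTP endpoint. A minimal sketch of building the request body (the model name `aidapal` and the use of `"format": "json"` are our assumptions, not something this repo prescribes):

```python
def build_request(pseudocode: str) -> dict:
    """Build a JSON body for Ollama's POST /api/generate endpoint."""
    return {
        "model": "aidapal",    # name given to `ollama create` (assumed)
        "prompt": pseudocode,  # the Modelfile TEMPLATE wraps this in [INST]...[/INST]
        "stream": False,       # return one complete response instead of a token stream
        "format": "json",      # ask Ollama to constrain output to valid JSON
    }

body = build_request("int __fastcall sub_75C80(int a1, int a2) { ... }")
```

Sending `body` as JSON to `http://localhost:11434/api/generate` (with an Ollama server running) should yield a response whose `response` field contains the model's JSON analysis.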
dataset/gpt4_juiced_dataset.json ADDED
The diff for this file is too large to render. See raw diff
 
training/README.md ADDED
@@ -0,0 +1,62 @@
+ # llm tirefire
+
+ Set up and install the prerequisites for https://github.com/unslothai/unsloth. This should be correct:
+ ```
+ conda create --name unsloth_env python=3.10
+ conda activate unsloth_env
+ conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -c xformers -c conda-forge -y
+ pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"
+ ```
+
+ Run the training using mistral-7b as your base for 100 steps using `./dataset/gpt4_juiced_dataset.json`:
+ ```
+ $ python training/train.py unsloth/mistral-7b-instruct-v0.2-bnb-4bit 100 ./datasets/gpt4_juiced_dataset.json
+ ==((====))== Unsloth: Fast Mistral patching release 2024.2
+    \\  /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.691 GB. Platform = Linux.
+ O^O/ \_/ \   Pytorch: 2.2.0. CUDA = 8.6. CUDA Toolkit = 12.1.
+ \        /   Bfloat16 = TRUE. Xformers = 0.0.24. FA = False.
+  "-____-"    Free Apache license: http://github.com/unslothai/unsloth
+ /mnt/new/unsloth/lib/python3.10/site-packages/transformers/quantizers/auto.py:155: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
+   warnings.warn(warning_msg)
+ Unsloth 2024.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
+ Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
+ GPU = NVIDIA GeForce RTX 3090. Max memory = 23.691 GB.
+ 4.676 GB of memory reserved.
+ ==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
+    \\  /|    Num examples = 2,897 | Num Epochs = 3
+ O^O/ \_/ \   Batch size per device = 4 | Gradient Accumulation steps = 4
+ \        /   Total batch size = 16 | Total steps = 500
+  "-____-"    Number of trainable parameters = 83,886,080
+ {'loss': 1.4802, 'grad_norm': 1.6030948162078857, 'learning_rate': 4e-05, 'epoch': 0.01}
+ {'loss': 1.4201, 'grad_norm': 1.4948327541351318, 'learning_rate': 8e-05, 'epoch': 0.01}
+ {'loss': 1.5114, 'grad_norm': 1.6689960956573486, 'learning_rate': 0.00012, 'epoch': 0.02}
+ {'loss': 1.1665, 'grad_norm': 0.9258238673210144, 'learning_rate': 0.00016, 'epoch': 0.02}
+ {'loss': 0.9282, 'grad_norm': 0.6133134961128235, 'learning_rate': 0.0002, 'epoch': 0.03}
+ {'loss': 0.9292, 'grad_norm': 0.6610234975814819, 'learning_rate': 0.0001995959595959596, 'epoch': 0.03}
+ {'loss': 0.7517, 'grad_norm': 0.4809339940547943, 'learning_rate': 0.0001991919191919192, 'epoch': 0.04}
+ {'loss': 0.7554, 'grad_norm': 0.6171303987503052, 'learning_rate': 0.00019878787878787878, 'epoch': 0.04}
+ {'loss': 0.606, 'grad_norm': 0.564286470413208, 'learning_rate': 0.00019838383838383837, 'epoch': 0.05}
+ {'loss': 0.6274, 'grad_norm': 0.414183109998703, 'learning_rate': 0.000197979797979798, 'epoch': 0.06}
+ {'loss': 0.6402, 'grad_norm': 0.3489008843898773, 'learning_rate': 0.0001975757575757576, 'epoch': 0.06}
+ {'loss': 0.596, 'grad_norm': 0.28150686621665955, 'learning_rate': 0.0001971717171717172, 'epoch': 0.07}
+ {'loss': 0.5056, 'grad_norm': 0.3132913410663605, 'learning_rate': 0.00019676767676767677, 'epoch': 0.07}
+ {'loss': 0.5384, 'grad_norm': 0.27469128370285034, 'learning_rate': 0.00019636363636363636, 'epoch': 0.08}
+ {'loss': 0.5744, 'grad_norm': 0.360963374376297, 'learning_rate': 0.00019595959595959596, 'epoch': 0.08}
+ {'loss': 0.5907, 'grad_norm': 0.3328467011451721, 'learning_rate': 0.00019555555555555556, 'epoch': 0.09}
+ {'loss': 0.5067, 'grad_norm': 0.2794954478740692, 'learning_rate': 0.00019515151515151516, 'epoch': 0.09}
+ {'loss': 0.5563, 'grad_norm': 0.2907596528530121, 'learning_rate': 0.00019474747474747476, 'epoch': 0.1}
+ {'loss': 0.5533, 'grad_norm': 0.34755516052246094, 'learning_rate': 0.00019434343434343435, 'epoch': 0.1}
+ ```
+
+ With checkpoints configured at 50 steps:
+ ```
+ output_dir = "outputs",
+ save_strategy = "steps",
+ save_steps = 50
+ ```
+
+ A directory named `outputs` will be created containing a saved model every 50 steps. This is useful if training crashes or you want to restart from a specific point. You can also use `eval.py` to iterate across these checkpoints and compare evaluations:
+ ```
+ for m in $(ls outputs); do python eval.py outputs/$m; done
+ ```
training/eval.py ADDED
@@ -0,0 +1,88 @@
+ from unsloth import FastLanguageModel
+ import torch, sys, os
+
+ model_name_input = sys.argv[1]
+
+ max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
+ dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
+ load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = model_name_input,
+     max_seq_length = max_seq_length,
+     dtype = dtype,
+     load_in_4bit = load_in_4bit,
+     # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
+ )
+
+ alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ {}
+
+ ### Input:
+ {}
+
+ ### Response:
+ {}"""
+
+ EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
+ def formatting_prompts_func(examples):
+     instructions = examples["instruction"]
+     inputs = examples["input"]
+     outputs = examples["output"]
+     texts = []
+     for instruction, input, output in zip(instructions, inputs, outputs):
+         # Must add EOS_TOKEN, otherwise your generation will go on forever!
+         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
+         texts.append(text)
+     return { "text" : texts, }
+
+ # load and convert the dataset into the prompt format
+ # note: evaluation prompts are read from data.json in the current directory
+ from datasets import load_dataset
+ dataset = load_dataset("json", data_files="data.json", split = "train")
+ dataset = dataset.map(formatting_prompts_func, batched = True,)
+
+ FastLanguageModel.for_inference(model)
+ # generate completions for the first sample_size items from the dataset
+ samples = []
+ sample_size = 10
+ for x in range(0, sample_size):
+     instruction = dataset[x]["instruction"]
+     input = dataset[x]["input"]
+     output = ''
+     text = alpaca_prompt.format(instruction, input, output)
+     sample = tokenizer([text], return_tensors = "pt").to("cuda")
+     out = model.generate(**sample, max_new_tokens=4096, use_cache=True)
+     out = tokenizer.batch_decode(out)
+     samples.append(out[0])
+
+ # new one not in your dataset goes here
+ code = '''int __fastcall sub_75C80(int a1, int a2)
+ {
+   int result; // r0
+   _DWORD *i; // r3
+
+   result = a2 - *(_DWORD *)(a1 + 12);
+   for ( i = *(_DWORD **)(a1 + 48); i; i = (_DWORD *)*i )
+   {
+     if ( i[2] < result )
+       result = i[2];
+   }
+   return result;
+ }'''
+
+ text = alpaca_prompt.format(instruction, code, output)
+ sample = tokenizer([text], return_tensors = "pt").to("cuda")
+ out = model.generate(**sample, max_new_tokens=4096, use_cache=True)
+ out = tokenizer.batch_decode(out)
+ samples.append(out[0])
+
+ print('Capturing generation samples')
+ os.makedirs('results', exist_ok=True) # ensure the log directory exists before writing
+ with open(f'results/eval_log_{model_name_input.replace("/","_")}', 'w') as log:
+     for r in samples:
+         log.write(r)
training/train.py ADDED
@@ -0,0 +1,131 @@
+ from unsloth import FastLanguageModel
+ import torch, sys
+
+ model = sys.argv[1]
+ steps = int(sys.argv[2])
+ training_data = sys.argv[3]
+
+ max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
+ dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
+ load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
+
+ # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
+ fourbit_models = [
+     "unsloth/mistral-7b-bnb-4bit",
+     "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
+     "unsloth/llama-2-7b-bnb-4bit",
+     "unsloth/llama-2-13b-bnb-4bit",
+     "unsloth/codellama-34b-bnb-4bit",
+     "unsloth/tinyllama-bnb-4bit",
+ ] # More models at https://huggingface.co/unsloth
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = model,
+     max_seq_length = max_seq_length,
+     dtype = dtype,
+     load_in_4bit = load_in_4bit,
+ )
+
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128 - r/rank is how strong you want your training to apply
+     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                       "gate_proj", "up_proj", "down_proj",],
+     lora_alpha = 16, # alpha is a multiplier against r/rank
+     lora_dropout = 0, # Supports any, but = 0 is optimized
+     bias = "none", # Supports any, but = "none" is optimized
+     use_gradient_checkpointing = True,
+     random_state = 3407,
+     use_rslora = False, # We support rank stabilized LoRA
+     loftq_config = None, # And LoftQ
+ )
+
+ alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ {}
+
+ ### Input:
+ {}
+
+ ### Response:
+ {}"""
+
+ EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
+ def formatting_prompts_func(examples):
+     instructions = examples["instruction"]
+     inputs = examples["input"]
+     outputs = examples["output"]
+     texts = []
+     for instruction, input, output in zip(instructions, inputs, outputs):
+         # Must add EOS_TOKEN, otherwise your generation will go on forever!
+         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
+         texts.append(text)
+     return { "text" : texts, }
+
+ # load and convert the dataset into the prompt format
+ from datasets import load_dataset
+ dataset = load_dataset("json", data_files=training_data, split = "train")
+ dataset = dataset.map(formatting_prompts_func, batched = True,)
+
+ from trl import SFTTrainer
+ from transformers import TrainingArguments
+
+ trainer = SFTTrainer(
+     model = model,
+     tokenizer = tokenizer,
+     train_dataset = dataset,
+     dataset_text_field = "text",
+     max_seq_length = max_seq_length,
+     dataset_num_proc = 2,
+     packing = False, # Can make training 5x faster for short sequences.
+     args = TrainingArguments(
+         per_device_train_batch_size = 4,
+         gradient_accumulation_steps = 4,
+         warmup_steps = 5,
+         max_steps = steps,
+         learning_rate = 2e-4,
+         fp16 = not torch.cuda.is_bf16_supported(),
+         bf16 = torch.cuda.is_bf16_supported(),
+         logging_steps = 1,
+         optim = "adamw_8bit",
+         weight_decay = 0.01,
+         lr_scheduler_type = "linear",
+         seed = 3407,
+         output_dir = "outputs",
+         save_strategy = "steps",
+         save_steps = 50,
+     ),
+ )
+
+ gpu_stats = torch.cuda.get_device_properties(0)
+ start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+ max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
+ print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
+ print(f"{start_gpu_memory} GB of memory reserved.")
+
+ # execute the actual training
+ trainer_stats = trainer.train()
+
+ used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
+ used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
+ used_percentage = round(used_memory / max_memory * 100, 3)
+ lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
+ print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
+ print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
+ print(f"Peak reserved memory = {used_memory} GB.")
+ print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
+ print(f"Peak reserved memory % of max memory = {used_percentage} %.")
+ print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
+
+ model.save_pretrained(f"lora_model_{steps}") # Local saving
+
+ # Just LoRA adapters
+ model.save_pretrained_merged(f"model_{steps}", tokenizer, save_method = "lora",)
+
+ # Save to q4_k_m GGUF
+ model.save_pretrained_gguf(f"model_{steps}", tokenizer, quantization_method = "q4_k_m")