Molchevsky committed
Commit f0a920b · verified · 1 Parent(s): 15746d5

Upload 4 files

Files changed (4)
  1. README.md +5 -3
  2. build.sh +62 -0
  3. llama_chat_interface.py +433 -0
  4. merge_with_autopeft.py +28 -0
README.md CHANGED
@@ -1,3 +1,5 @@
- ---
- license: cc-by-nc-sa-4.0
- ---
+ # resume.llamafile v1.0
+
+ Fine-tuned and packaged by Alexander Molchevskyi
+ Model: LLaMA-3.2-3B, fine-tuned on career Q&A dataset
+ Purpose: Interactive resume and portfolio showcase
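
Once build.sh (below) has produced resume.llamafile, the packaged model is meant to run as a single self-contained executable. A minimal usage sketch, assuming llamafile's default behaviour of launching a local chat/server UI (the commands are illustrative and not part of this commit):

    chmod +x resume.llamafile
    ./resume.llamafile        # Linux/macOS: runs the bundled model with the defaults packed in .args
    # Windows: rename the file to resume.llamafile.exe and run it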
build.sh ADDED
@@ -0,0 +1,62 @@
+ #!/bin/bash
+
+ # Activate Python virtual environment with all required packages (torch, transformers, peft, etc.)
+ # This keeps dependencies isolated from your system Python.
+ source llm-finetune/bin/activate
+
+ # Step 1: Run the fine-tuning script (LoRA training)
+ # - llama_finetuning.py trains your LLaMA model using Q&A pairs.
+ # - The output will be a LoRA adapter stored in a subdirectory.
+ python3 llama_finetuning.py
+
+ # Step 2: Make sure the locally built llamafile launcher is available
+ # - We installed llamafile into ~/dev/tools/llamafile/bin
+ # - Add that directory to PATH so its binaries can be found automatically.
+ export PATH="$HOME/dev/tools/llamafile/bin:$PATH"
+
+ # Step 3: Merge the LoRA adapter with the base model
+ # - LoRA is efficient for training, but for deployment we want a single merged model.
+ # - merge_with_autopeft.py loads the base weights and adapter, merges them, and saves FP16 weights in ./merged-fp16
+ python3 merge_with_autopeft.py
+
+ # Step 4: Convert Hugging Face FP16 model -> GGUF (llama.cpp runtime format)
+ # - ./merged-fp16 is the Hugging Face directory created by the merge step.
+ # - --outfile sets the name of the GGUF file.
+ # - --outtype f16 ensures weights are saved in FP16 precision before quantization.
+ python3 ../llama.cpp/convert_hf_to_gguf.py merged-fp16 --outfile merged-fp16.gguf --outtype f16
+
+ # Step 5: Quantize FP16 GGUF -> Q6_K GGUF
+ # - Q6_K is a 6-bit quantization that balances speed, quality, and size.
+ # - merged-fp16.gguf is the input, merged-Q6_K.gguf is the output.
+ # - This step makes the model small enough to run efficiently on CPU/GPU.
+ ../llama.cpp/build/bin/llama-quantize merged-fp16.gguf merged-Q6_K.gguf q6_k
+
+ # Step 6: Copy the llamafile launcher
+ # - "llamafile" is the universal runtime that knows how to run GGUF models.
+ # - We copy it to resume.llamafile, which will become the final self-contained binary.
+ cp ~/dev/tools/llamafile/bin/llamafile resume.llamafile
+
+ # Step 7: Pack the model, args, and docs into the llamafile
+ # - zipalign appends files into the llamafile binary as an uncompressed ZIP archive.
+ # - merged-Q6_K.gguf is the quantized model.
+ # - .args contains default runtime arguments (e.g. -m model, --threads, --ctx-size).
+ # - README.md is included so end users have documentation directly inside the llamafile.
+ # - The -j0 option ensures "store only" (no compression) so llamafile can memory-map the model efficiently.
+ zipalign -j0 resume.llamafile merged-Q6_K.gguf .args README.md
+
+
+ # Key points for educational purposes
+
+ # Virtual environment keeps fine-tuning dependencies isolated.
+
+ # LoRA fine-tuning produces small adapter weights → later merged for simplicity.
+
+ # Merge step is critical: it creates a "normal" Hugging Face model again, which can be exported.
+
+ # convert_hf_to_gguf.py translates HF → GGUF (runtime format for llama.cpp + llamafile).
+
+ # Quantization (Q6_K) reduces model size by ~3–4× with minimal loss in quality, making it run fast on CPU.
+
+ # llamafile packaging produces a single executable that works on Linux/macOS directly; on Windows you just rename it to .exe.
+
+ # zipalign -j0 ensures files are stored uncompressed, which llamafile requires for mmap loading.
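
The .args file referenced in Step 7 is not included in this commit. For illustration only: llamafile expects one default argument per line, with a literal "..." marking where extra command-line arguments may be appended at run time. A plausible version for this build (the flags other than -m are assumptions) might look like:

    -m
    merged-Q6_K.gguf
    --host
    0.0.0.0
    ...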
llama_chat_interface.py ADDED
@@ -0,0 +1,433 @@
+ import os
+ import torch
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForCausalLM,
+     BitsAndBytesConfig
+ )
+ from peft import PeftModel
+ import warnings
+ from datetime import datetime
+ import json
+
+ # Suppress warnings for cleaner output
+ warnings.filterwarnings("ignore", category=FutureWarning)
+ warnings.filterwarnings("ignore", category=UserWarning)
+ os.environ['TOKENIZERS_PARALLELISM'] = 'false'
+
+ class LlamaChat:
+     def __init__(self, model_path, system_message=None, use_quantization=True, max_memory_gb=8):
+         """
+         Initialize the chat interface with the fine-tuned Llama model
+
+         Args:
+             model_path: Path to the fine-tuned model directory
+             system_message: System message to use for conversations (persona/context)
+             use_quantization: Whether to use 4-bit quantization (recommended for 8GB GPU)
+             max_memory_gb: Maximum GPU memory to use
+         """
+         self.model_path = model_path
+         self.use_quantization = use_quantization
+         self.max_memory_gb = max_memory_gb
+
+         # Default system message if none provided
+         self.system_message = system_message or (
+             "You are Alexander Molchevskyi, a senior software engineer with over 20 years "
+             "of professional experience across embedded, desktop, and server systems. "
+             "Skilled in C++, Rust, Python, AI infrastructure, compilers, WebAssembly, and "
+             "developer tooling. You answer interview questions clearly, professionally, and naturally."
+         )
+
+         print("🚀 Loading Llama Chat Interface...")
+         print(f"Model path: {model_path}")
+         print(f"System message: {self.system_message[:100]}{'...' if len(self.system_message) > 100 else ''}")
+
+         # Check CUDA availability
+         if torch.cuda.is_available():
+             print(f"✅ CUDA available: {torch.cuda.get_device_name()}")
+             print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
+         else:
+             print("⚠️ CUDA not available, using CPU (will be slow)")
+
+         self.tokenizer = None
+         self.model = None
+         self.conversation_history = []
+
+         self._load_model()
+
+     def _setup_quantization_config(self):
+         """Setup 4-bit quantization config for memory efficiency"""
+         if not self.use_quantization:
+             return None
+
+         return BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_use_double_quant=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16,
+         )
+
+     def _load_model(self):
+         """Load the tokenizer and model"""
+         try:
+             print("📚 Loading tokenizer...")
+             self.tokenizer = AutoTokenizer.from_pretrained(
+                 self.model_path,
+                 trust_remote_code=True,
+                 padding_side="left"  # For generation
+             )
+
+             # Add pad token if it doesn't exist
+             if self.tokenizer.pad_token is None:
+                 self.tokenizer.pad_token = self.tokenizer.eos_token
+                 self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
+
+             print("🧠 Loading base model...")
+
+             # Setup quantization if requested
+             quantization_config = self._setup_quantization_config()
+
+             # Check if this is a PEFT model (has adapter_config.json)
+             adapter_config_path = os.path.join(self.model_path, "adapter_config.json")
+             is_peft_model = os.path.exists(adapter_config_path)
+
+             if is_peft_model:
+                 print("🔧 Detected PEFT (LoRA) model, loading base model first...")
+
+                 # Load adapter config to get base model name
+                 with open(adapter_config_path, 'r') as f:
+                     adapter_config = json.load(f)
+
+                 base_model_name = adapter_config.get('base_model_name_or_path', 'llama-3.2-3b')
+                 print(f"Base model: {base_model_name}")
+
+                 # Load base model
+                 base_model = AutoModelForCausalLM.from_pretrained(
+                     base_model_name,
+                     quantization_config=quantization_config,
+                     device_map="auto",
+                     torch_dtype=torch.bfloat16,
+                     trust_remote_code=True,
+                     use_cache=True,  # Enable cache for inference
+                 )
+
+                 # Load PEFT model (LoRA adapter)
+                 print("🎯 Loading LoRA adapter...")
+                 self.model = PeftModel.from_pretrained(base_model, self.model_path)
+
+             else:
+                 # Regular fine-tuned model (not PEFT)
+                 print("📦 Loading fine-tuned model...")
+                 self.model = AutoModelForCausalLM.from_pretrained(
+                     self.model_path,
+                     quantization_config=quantization_config,
+                     device_map="auto",
+                     torch_dtype=torch.bfloat16,
+                     trust_remote_code=True,
+                     use_cache=True,  # Enable cache for inference
+                 )
+
+             # Set model to evaluation mode
+             self.model.eval()
+             print("✅ Model loaded successfully!")
+
+             # Print model info
+             if hasattr(self.model, 'print_trainable_parameters'):
+                 self.model.print_trainable_parameters()
+
+         except Exception as e:
+             print(f"❌ Error loading model: {str(e)}")
+             raise
+
+     def _format_message(self, user_message):
+         """Format user message with system context using Llama's chat template"""
+         return f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{self.system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+
+     def generate_response(self, user_message, max_new_tokens=200, temperature=0.7,
+                           top_p=0.9, repetition_penalty=1.1, do_sample=True):
+         """
+         Generate a response to the user message
+
+         Args:
+             user_message: The user's input message
+             max_new_tokens: Maximum number of tokens to generate
+             temperature: Sampling temperature (higher = more random)
+             top_p: Nucleus sampling parameter
+             repetition_penalty: Penalty for repeating tokens
+             do_sample: Whether to use sampling or greedy decoding
+         """
+         try:
+             # Format the input
+             formatted_input = self._format_message(user_message)
+
+             # Tokenize input
+             inputs = self.tokenizer(
+                 formatted_input,
+                 return_tensors="pt",
+                 truncation=True,
+                 max_length=1024  # Increased to match training max_length
+             ).to(self.model.device)
+
+             # Generate response
+             print("🤔 Thinking...")
+
+             with torch.no_grad():
+                 outputs = self.model.generate(
+                     **inputs,
+                     max_new_tokens=max_new_tokens,
+                     temperature=temperature,
+                     top_p=top_p,
+                     do_sample=do_sample,
+                     repetition_penalty=repetition_penalty,
+                     pad_token_id=self.tokenizer.eos_token_id,
+                     eos_token_id=self.tokenizer.eos_token_id,
+                     num_return_sequences=1,
+                 )
+
+             # Decode the response (keep special tokens so the assistant header can be located below)
+             full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=False)
+
+             # Extract only the assistant's response (after the last assistant header)
+             assistant_response = full_response.split("<|start_header_id|>assistant<|end_header_id|>")[-1].strip()
+
+             # Clean up any remaining tokens
+             assistant_response = assistant_response.replace("<|eot_id|>", "").strip()
+
+             return assistant_response
+
+         except Exception as e:
+             return f"❌ Error generating response: {str(e)}"
+
+     def chat_loop(self):
+         """Main chat loop"""
+         print("\n" + "="*60)
+         print("🦙 LLAMA FINE-TUNED CHAT INTERFACE")
+         print("="*60)
+         print("Commands:")
+         print(" • Type your message and press Enter")
+         print(" • '/help' - Show this help")
+         print(" • '/system' - View or change system message")
+         print(" • '/settings' - Adjust generation settings")
+         print(" • '/history' - Show conversation history")
+         print(" • '/clear' - Clear conversation history")
+         print(" • '/save' - Save conversation to file")
+         print(" • '/quit' or '/exit' - Exit the chat")
+         print("="*60)
+
+         # Default generation settings
+         settings = {
+             'max_new_tokens': 200,
+             'temperature': 0.7,
+             'top_p': 0.9,
+             'repetition_penalty': 1.1,
+             'do_sample': True
+         }
+
+         while True:
+             try:
+                 # Get user input
+                 user_input = input("\n👤 You: ").strip()
+
+                 if not user_input:
+                     continue
+
+                 # Handle commands
+                 if user_input.lower() in ['/quit', '/exit']:
+                     print("👋 Goodbye!")
+                     break
+
+                 elif user_input.lower() == '/help':
+                     self._show_help()
+                     continue
+
+                 elif user_input.lower() == '/system':
+                     self._manage_system_message()
+                     continue
+
+                 elif user_input.lower() == '/settings':
+                     settings = self._adjust_settings(settings)
+                     continue
+
+                 elif user_input.lower() == '/history':
+                     self._show_history()
+                     continue
+
+                 elif user_input.lower() == '/clear':
+                     self.conversation_history.clear()
+                     print("🧹 Conversation history cleared!")
+                     continue
+
+                 elif user_input.lower() == '/save':
+                     self._save_conversation()
+                     continue
+
+                 # Generate response
+                 response = self.generate_response(user_input, **settings)
+
+                 # Display response
+                 print(f"\n🦙 Alexander: {response}")
+
+                 # Save to history
+                 self.conversation_history.append({
+                     'timestamp': datetime.now().isoformat(),
+                     'system': self.system_message,
+                     'user': user_input,
+                     'assistant': response
+                 })
+
+             except KeyboardInterrupt:
+                 print("\n\n👋 Chat interrupted. Goodbye!")
+                 break
+             except Exception as e:
+                 print(f"\n❌ Error: {str(e)}")
+
+     def _manage_system_message(self):
+         """Allow user to view or change the system message"""
+         print("\n🤖 SYSTEM MESSAGE MANAGEMENT:")
+         print("Current system message:")
+         print("-" * 60)
+         print(self.system_message)
+         print("-" * 60)
+
+         choice = input("\nOptions: [v]iew, [c]hange, or [Enter] to go back: ").strip().lower()
+
+         if choice == 'c' or choice == 'change':
+             print("\nEnter new system message (or press Enter to keep current):")
+             new_system = input("> ").strip()
+
+             if new_system:
+                 self.system_message = new_system
+                 print("✅ System message updated!")
+                 print("Note: This will affect all future conversations.")
+             else:
+                 print("System message unchanged.")
+
+         elif choice == 'v' or choice == 'view':
+             # Already displayed above
+             pass
+     def _show_help(self):
+         """Show help information"""
+         print("\n📋 HELP:")
+         print("This is a chat interface for your fine-tuned Llama model.")
+         print("The model has been trained with system messages to embody Alexander Molchevskyi's")
+         print("professional persona and expertise in software engineering.")
+         print("\nTips:")
+         print("• Ask technical questions about software engineering, AI, or development")
+         print("• The model maintains context of being Alexander throughout conversations")
+         print("• Use /system to view or modify the professional persona")
+         print("• Use /settings to adjust creativity (temperature) and response length")
+         print("• Higher temperature = more creative but less consistent")
+         print("• Lower temperature = more focused and consistent")
+
+     def _adjust_settings(self, current_settings):
+         """Allow user to adjust generation settings"""
+         print("\n⚙️ GENERATION SETTINGS:")
+         print("Current settings:")
+         for key, value in current_settings.items():
+             print(f"  {key}: {value}")
+
+         new_settings = current_settings.copy()
+
+         try:
+             # Max tokens
+             max_tokens = input(f"\nMax response length ({current_settings['max_new_tokens']}): ").strip()
+             if max_tokens:
+                 new_settings['max_new_tokens'] = max(1, min(500, int(max_tokens)))
+
+             # Temperature
+             temp = input(f"Temperature 0.1-2.0 ({current_settings['temperature']}): ").strip()
+             if temp:
+                 new_settings['temperature'] = max(0.1, min(2.0, float(temp)))
+
+             # Top-p
+             top_p = input(f"Top-p 0.1-1.0 ({current_settings['top_p']}): ").strip()
+             if top_p:
+                 new_settings['top_p'] = max(0.1, min(1.0, float(top_p)))
+
+             # Repetition penalty
+             rep_penalty = input(f"Repetition penalty 1.0-2.0 ({current_settings['repetition_penalty']}): ").strip()
+             if rep_penalty:
+                 new_settings['repetition_penalty'] = max(1.0, min(2.0, float(rep_penalty)))
+
+             print("✅ Settings updated!")
+             return new_settings
+
+         except ValueError:
+             print("❌ Invalid input. Settings unchanged.")
+             return current_settings
+
+     def _show_history(self):
+         """Show conversation history"""
+         if not self.conversation_history:
+             print("📝 No conversation history yet.")
+             return
+
+         print(f"\n📜 CONVERSATION HISTORY ({len(self.conversation_history)} exchanges):")
+         print("-" * 50)
+
+         for i, exchange in enumerate(self.conversation_history[-5:], 1):  # Show last 5
+             timestamp = exchange['timestamp'].split('T')[1].split('.')[0]  # Just time
+             print(f"\n[{timestamp}]")
+             print(f"👤 You: {exchange['user']}")
+             print(f"🦙 Alexander: {exchange['assistant'][:100]}{'...' if len(exchange['assistant']) > 100 else ''}")
+
+         if len(self.conversation_history) > 5:
+             print(f"\n... and {len(self.conversation_history) - 5} more exchanges")
+
+     def _save_conversation(self):
+         """Save conversation to a JSON file"""
+         if not self.conversation_history:
+             print("📝 No conversation to save.")
+             return
+
+         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+         filename = f"llama_chat_{timestamp}.json"
+
+         try:
+             with open(filename, 'w', encoding='utf-8') as f:
+                 json.dump(self.conversation_history, f, indent=2, ensure_ascii=False)
+             print(f"💾 Conversation saved to: {filename}")
+         except Exception as e:
+             print(f"❌ Error saving conversation: {str(e)}")
+
+ def main():
+     """Main function to start the chat interface"""
+     # Configuration
+     MODEL_PATH = "llama-3.2-3b-finetuned"  # Path to your fine-tuned model
+
+     # Default system message (can be customized)
+     DEFAULT_SYSTEM_MESSAGE = (
+         "You are Alexander Molchevskyi, a senior software engineer with over 20 years "
+         "of professional experience across embedded, desktop, and server systems. "
+         "Skilled in C++, Rust, Python, AI infrastructure, compilers, WebAssembly, and "
+         "developer tooling. You answer interview questions clearly, professionally, and naturally."
+     )
+
+     # Check if model directory exists
+     if not os.path.exists(MODEL_PATH):
+         print(f"❌ Model directory not found: {MODEL_PATH}")
+         print("Please make sure you have run the fine-tuning script first.")
+         return
+
+     try:
+         # Initialize chat interface
+         chat = LlamaChat(
+             model_path=MODEL_PATH,
+             system_message=DEFAULT_SYSTEM_MESSAGE,
+             use_quantization=True,  # Set to False if you have plenty of GPU memory
+             max_memory_gb=8
+         )
+
+         # Start chat loop
+         chat.chat_loop()
+
+     except Exception as e:
+         print(f"❌ Failed to initialize chat interface: {str(e)}")
+         print("\nTroubleshooting tips:")
+         print("1. Make sure the model was trained successfully")
+         print("2. Check that all required libraries are installed")
+         print("3. Ensure you have sufficient GPU memory")
+         print("4. Try setting use_quantization=True to reduce memory usage")
+
+ if __name__ == "__main__":
+     main()
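
_format_message() above hand-writes the Llama 3 prompt format. An equivalent, less error-prone route is to let the tokenizer render the prompt from its own chat template. This is only a sketch; it assumes the fine-tuned tokenizer still carries the stock Llama 3.2 chat template, and the example question is invented:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("llama-3.2-3b-finetuned")
    messages = [
        {"role": "system", "content": "You are Alexander Molchevskyi, a senior software engineer."},
        {"role": "user", "content": "Tell me about your Rust experience."},  # illustrative question
    ]
    # Should render <|begin_of_text|>...<|start_header_id|>assistant<|end_header_id|>, much like _format_message()
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)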
merge_with_autopeft.py ADDED
@@ -0,0 +1,28 @@
+ # merge_with_autopeft.py
+ import torch, os
+ from peft import AutoPeftModelForCausalLM
+ from transformers import AutoTokenizer
+
+ # LORA_DIR is your *adapter* checkpoint dir produced by training
+ LORA_DIR = "llama-3.2-3b-finetuned"
+ OUT_DIR = "merged-fp16"
+ DTYPE = torch.float16
+
+ print("Loading LoRA with AutoPeft (this reads base_model_name_or_path from the adapter config)...")
+ model = AutoPeftModelForCausalLM.from_pretrained(
+     LORA_DIR,
+     torch_dtype=DTYPE,
+     device_map="cpu",
+ )
+
+ print("Merging and unloading adapters...")
+ model = model.merge_and_unload()  # <- this *actually* bakes the deltas into weights
+
+ os.makedirs(OUT_DIR, exist_ok=True)
+ print("Saving merged model...")
+ model.save_pretrained(OUT_DIR, safe_serialization=True)
+
+ tok = AutoTokenizer.from_pretrained(LORA_DIR, use_fast=False)  # works because tokenizer is same as base
+ tok.save_pretrained(OUT_DIR)
+
+ print("✅ Done")
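
A quick way to sanity-check the merge before handing ./merged-fp16 to convert_hf_to_gguf.py in build.sh is to load it as a plain Hugging Face model (no peft involved) and generate a short reply. A minimal sketch, assuming the saved tokenizer carries a chat template; the prompt text is illustrative:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the merged directory as a standalone model (adapters are already baked in)
    tok = AutoTokenizer.from_pretrained("merged-fp16")
    model = AutoModelForCausalLM.from_pretrained("merged-fp16", torch_dtype="auto", device_map="auto")

    messages = [{"role": "user", "content": "Briefly introduce yourself."}]  # illustrative prompt
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=60, do_sample=False)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))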