aeb56 committed on
Commit
5e458c4
·
1 Parent(s): 3a259bc

Transform Space into professional inference UI for fine-tuned model

Files changed (4)
  1. README.md +61 -35
  2. README_inference.md +91 -0
  3. app.py +272 -525
  4. inference_app.py +360 -0
README.md CHANGED
@@ -1,65 +1,91 @@
1
  ---
2
- title: LoRA Model Merger
3
- emoji: 🔗
4
- colorFrom: blue
5
- colorTo: purple
6
  sdk: docker
7
  pinned: false
8
  license: apache-2.0
9
  app_port: 7860
10
- suggested_hardware: l4x4
11
  ---
12
 
13
- # 🔗 LoRA Model Merger
14
 
15
- A Hugging Face Space for merging fine-tuned LoRA adapters with base models.
16
 
17
- ## Overview
18
 
19
- This Space provides an easy-to-use interface for merging LoRA (Low-Rank Adaptation) fine-tuned models with their base models. Specifically designed for:
20
-
21
- - **Base Model:** `moonshotai/Kimi-Linear-48B-A3B-Instruct`
22
- - **LoRA Adapters:** `Optivise/kimi-linear-48b-a3b-instruct-qlora-fine-tuned`
 
23
 
24
  ## Features
25
 
26
- ✅ **Easy Model Merging** - Simple UI to merge LoRA adapters with base model
27
- ✅ **Built-in Testing** - Test your merged model with custom prompts
28
- ✅ **Hub Integration** - Upload merged models directly to Hugging Face Hub
29
- ✅ **GPU Optimized** - Designed for 4xL40S GPU setup
 
 
 
30
 
31
  ## Usage
32
 
33
- 1. **Merge Models**: Provide your Hugging Face token and click "Start Merge Process"
34
- 2. **Test Inference**: Test the merged model with sample prompts
35
- 3. **Upload to Hub**: Optionally upload the merged model to your Hugging Face account
 
36
 
37
- ## Requirements
38
 
39
- - **Hardware:** 4x NVIDIA L40S GPUs (or equivalent with ~192GB VRAM)
40
- - **Software:** Docker, CUDA 12.1+
41
- - **Access:** Valid Hugging Face token for model access
 
42
 
43
- ## Technical Details
 
 
 
44
 
45
- The merge process:
46
- 1. Downloads the base model (~48B parameters)
47
- 2. Loads LoRA adapter weights
48
- 3. Merges adapters into base model using PEFT
49
- 4. Saves the unified model for inference
50
 
51
- ## Notes
52
 
53
- - Merge process can take 10-30 minutes depending on network speed
54
- - Merged model will be approximately the same size as the base model
55
- - Ensure you have appropriate access rights to both base and LoRA models
 
 
 
56
 
57
  ## Support
58
 
59
  For issues or questions:
60
- - [PEFT Documentation](https://huggingface.co/docs/peft)
61
  - [Transformers Documentation](https://huggingface.co/docs/transformers)
 
62
 
63
  ---
64
 
65
- Built with ❤️ using Transformers, PEFT, and Gradio
 
 
1
  ---
2
+ title: Kimi 48B Fine-tuned - Inference
3
+ emoji: 🚀
4
+ colorFrom: purple
5
+ colorTo: blue
6
  sdk: docker
7
  pinned: false
8
  license: apache-2.0
9
  app_port: 7860
10
+ suggested_hardware: l40sx4
11
  ---
12
 
13
+ # 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
14
 
15
+ Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
16
 
17
+ ## Model Information
18
 
19
+ - **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
20
+ - **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
21
+ - **Parameters:** 48 Billion
22
+ - **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
23
+ - **Architecture:** Mixture of Experts (MoE) Transformer
24
 
25
  ## Features
26
 
27
+ ✨ **Professional Chat Interface**
28
+ - Clean, modern UI for seamless conversations
29
+ - Chat history with copy functionality
30
+ - System prompt customization
31
+
32
+ ⚙️ **Advanced Generation Settings**
33
+ - Temperature control for creativity
34
+ - Top-P and Top-K sampling
35
+ - Repetition penalty adjustment
36
+ - Configurable response length
37
+
38
+ 🎮 **Optimized Performance**
39
+ - Multi-GPU support (4xL40S recommended)
40
+ - Automatic device mapping
41
+ - bfloat16 precision for efficiency
42
+ - ~96GB VRAM requirement
43
 
44
  ## Usage
45
 
46
+ 1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
47
+ 2. **Set System Prompt** (optional) - Define the assistant's behavior
48
+ 3. **Start Chatting** - Type your message and hit send
49
+ 4. **Adjust Settings** - Fine-tune generation parameters as needed
50
 
51
+ ## Generation Parameters
52
 
53
+ ### Temperature (0.0 - 2.0)
54
+ - **Low (0.1-0.5):** Focused, near-deterministic responses
55
+ - **Medium (0.6-0.9):** Balanced creativity
56
+ - **High (1.0-2.0):** More creative and diverse outputs
57
 
58
+ ### Top P (0.0 - 1.0)
59
+ - **0.9 (recommended):** Good balance
60
+ - Lower values: More focused
61
+ - Higher values: More diverse
62
 
63
+ ### Max New Tokens
64
+ - Maximum length of generated response
65
+ - **1024 (default):** Good for most use cases
66
+ - Increase for longer responses
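+
+ For reference, here is a minimal sketch of how the settings above map onto the standard Transformers `generate()` API outside this Space. It is illustrative only (not the Space's exact code) and assumes the model fits across your GPUs in bfloat16:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
+ )
+
+ inputs = tokenizer("Hello, how are you today?", return_tensors="pt").to(model.device)
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=1024,     # response length cap
+     temperature=0.7,         # creativity vs. focus
+     top_p=0.9,               # nucleus sampling threshold
+     top_k=50,                # sample only from the 50 most likely tokens
+     repetition_penalty=1.1,  # discourage verbatim repetition
+     do_sample=True,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```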
 
67
 
68
+ ## Hardware Requirements
69
 
70
+ - **Recommended:** 4x NVIDIA L40S GPUs (192GB total VRAM)
71
+ - **Minimum:** 4x NVIDIA L4 GPUs (96GB total VRAM)
72
+ - **Memory:** ~96GB VRAM in bfloat16 precision
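+ - *Rule of thumb:* 48B parameters × 2 bytes per parameter in bfloat16 ≈ 96 GB of weights, plus KV-cache and activation overhead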
73
+
74
+ ## Fine-tuning Details
75
+
76
+ This model was fine-tuned using QLoRA with the following configuration:
77
+ - **LoRA Rank (r):** 16
78
+ - **LoRA Alpha:** 32
79
+ - **Target Modules:** q_proj, k_proj, v_proj, o_proj (attention layers only)
80
+ - **Dropout:** 0.05
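+
+ For context, this corresponds roughly to the following PEFT `LoraConfig` (an illustrative sketch; the actual training script is not part of this Space):
+
+ ```python
+ from peft import LoraConfig
+
+ lora_config = LoraConfig(
+     r=16,                # LoRA rank
+     lora_alpha=32,       # effective scaling = alpha / r = 2.0
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM",
+ )
+ ```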
81
 
82
  ## Support
83
 
84
  For issues or questions:
 
85
  - [Transformers Documentation](https://huggingface.co/docs/transformers)
86
+ - [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
87
 
88
  ---
89
 
90
+ Built with ❤️ using Transformers and Gradio
91
+
README_inference.md ADDED
@@ -0,0 +1,91 @@
 
 
1
+ ---
2
+ title: Kimi 48B Fine-tuned - Inference
3
+ emoji: 🚀
4
+ colorFrom: purple
5
+ colorTo: blue
6
+ sdk: docker
7
+ pinned: false
8
+ license: apache-2.0
9
+ app_port: 7860
10
+ suggested_hardware: l40sx4
11
+ ---
12
+
13
+ # 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
14
+
15
+ Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
16
+
17
+ ## Model Information
18
+
19
+ - **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
20
+ - **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
21
+ - **Parameters:** 48 Billion
22
+ - **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
23
+ - **Architecture:** Mixture of Experts (MoE) Transformer
24
+
25
+ ## Features
26
+
27
+ ✨ **Professional Chat Interface**
28
+ - Clean, modern UI for seamless conversations
29
+ - Chat history with copy functionality
30
+ - System prompt customization
31
+
32
+ ⚙️ **Advanced Generation Settings**
33
+ - Temperature control for creativity
34
+ - Top-P and Top-K sampling
35
+ - Repetition penalty adjustment
36
+ - Configurable response length
37
+
38
+ 🎮 **Optimized Performance**
39
+ - Multi-GPU support (4xL40S recommended)
40
+ - Automatic device mapping
41
+ - bfloat16 precision for efficiency
42
+ - ~96GB VRAM requirement
43
+
44
+ ## Usage
45
+
46
+ 1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
47
+ 2. **Set System Prompt** (optional) - Define the assistant's behavior
48
+ 3. **Start Chatting** - Type your message and hit send
49
+ 4. **Adjust Settings** - Fine-tune generation parameters as needed
50
+
51
+ ## Generation Parameters
52
+
53
+ ### Temperature (0.0 - 2.0)
54
+ - **Low (0.1-0.5):** Focused, near-deterministic responses
55
+ - **Medium (0.6-0.9):** Balanced creativity
56
+ - **High (1.0-2.0):** More creative and diverse outputs
57
+
58
+ ### Top P (0.0 - 1.0)
59
+ - **0.9 (recommended):** Good balance
60
+ - Lower values: More focused
61
+ - Higher values: More diverse
62
+
63
+ ### Max New Tokens
64
+ - Maximum length of generated response
65
+ - **1024 (default):** Good for most use cases
66
+ - Increase for longer responses
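+
+ To run the model outside this Space, here is a minimal sketch using the Transformers `pipeline` API with the defaults above (assumes enough GPU memory to hold the weights in bfloat16):
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "text-generation",
+     model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ out = pipe(
+     "Hello, how are you today?",
+     max_new_tokens=1024,
+     temperature=0.7,
+     top_p=0.9,
+     do_sample=True,
+ )
+ print(out[0]["generated_text"])
+ ```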
67
+
68
+ ## Hardware Requirements
69
+
70
+ - **Recommended:** 4x NVIDIA L40S GPUs (192GB total VRAM)
71
+ - **Minimum:** 4x NVIDIA L4 GPUs (96GB total VRAM)
72
+ - **Memory:** ~96GB VRAM in bfloat16 precision
73
+
74
+ ## Fine-tuning Details
75
+
76
+ This model was fine-tuned using QLoRA with the following configuration:
77
+ - **LoRA Rank (r):** 16
78
+ - **LoRA Alpha:** 32
79
+ - **Target Modules:** q_proj, k_proj, v_proj, o_proj (attention layers only)
80
+ - **Dropout:** 0.05
81
+
82
+ ## Support
83
+
84
+ For issues or questions:
85
+ - [Transformers Documentation](https://huggingface.co/docs/transformers)
86
+ - [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
87
+
88
+ ---
89
+
90
+ Built with ❤️ using Transformers and Gradio
91
+
app.py CHANGED
@@ -2,608 +2,355 @@ import os
2
  import torch
3
  import gradio as gr
4
  from transformers import AutoModelForCausalLM, AutoTokenizer
5
- from peft import PeftModel, PeftConfig
6
- from safetensors.torch import load_file
7
- import gc
8
- from huggingface_hub import login, snapshot_download
9
  import logging
10
  from datetime import datetime
11
- from accelerate import init_empty_weights, load_checkpoint_and_dispatch, infer_auto_device_map
12
 
13
  # Configure logging
14
  logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
15
  logger = logging.getLogger(__name__)
16
 
 
 
17
  # Check GPU availability
18
  if torch.cuda.is_available():
19
  num_gpus = torch.cuda.device_count()
20
- logger.info(f"Found {num_gpus} GPUs available")
21
- for i in range(num_gpus):
22
- gpu_name = torch.cuda.get_device_name(i)
23
- gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
24
- logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.2f} GB memory")
25
  else:
26
- logger.warning("No GPUs found! This will likely fail for 48B model.")
27
 
28
- # Constants
29
- BASE_MODEL_NAME = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
30
- LORA_MODEL_NAME = "Optivise/kimi-linear-48b-a3b-instruct-qlora-fine-tuned"
31
- OUTPUT_DIR = "/app/merged_model"
32
-
33
- class ModelMerger:
34
  def __init__(self):
35
- self.base_model = None
36
  self.tokenizer = None
37
- self.merged_model = None
38
-
39
- def clear_memory(self):
40
- """Clear GPU memory"""
41
- if self.base_model is not None:
42
- del self.base_model
43
- if self.merged_model is not None:
44
- del self.merged_model
45
- gc.collect()
46
- if torch.cuda.is_available():
47
- torch.cuda.empty_cache()
48
- # Synchronize all GPUs
49
- for i in range(torch.cuda.device_count()):
50
- with torch.cuda.device(i):
51
- torch.cuda.empty_cache()
52
- torch.cuda.synchronize()
53
- logger.info("Memory cleared successfully")
54
-
55
- def login_huggingface(self, token):
56
- """Login to Hugging Face"""
57
- try:
58
- login(token=token)
59
- logger.info("Successfully logged in to Hugging Face")
60
- return "โœ… Successfully logged in to Hugging Face"
61
- except Exception as e:
62
- logger.error(f"Login failed: {str(e)}")
63
- return f"โŒ Login failed: {str(e)}"
64
-
65
- def manual_merge_lora(self, model, adapter_path, progress=gr.Progress()):
66
- """Manually merge LoRA weights into model to avoid PEFT key naming issues"""
67
- import json
68
- from tqdm import tqdm
69
-
70
- logger.info("Using manual LoRA merge to avoid key naming conflicts...")
71
- progress(0.55, desc="Loading LoRA adapter weights...")
72
-
73
- # Load adapter weights
74
- adapter_file = os.path.join(adapter_path, "adapter_model.safetensors")
75
- adapter_weights = load_file(adapter_file)
76
- logger.info(f"Loaded {len(adapter_weights)} adapter weight tensors")
77
-
78
- # Load adapter config
79
- config_file = os.path.join(adapter_path, "adapter_config.json")
80
- with open(config_file) as f:
81
- adapter_config = json.load(f)
82
-
83
- lora_alpha = adapter_config["lora_alpha"]
84
- r = adapter_config["r"]
85
- scaling = lora_alpha / r
86
- logger.info(f"LoRA scaling: {scaling} (alpha={lora_alpha}, r={r})")
87
-
88
- # Group LoRA A and B weights
89
- lora_pairs = {}
90
- for key in adapter_weights.keys():
91
- if "lora_A" in key:
92
- base_key = key.replace(".lora_A.weight", "")
93
- lora_pairs[base_key] = {
94
- "A": adapter_weights[key],
95
- "B": adapter_weights.get(base_key + ".lora_B.weight")
96
- }
97
 
98
- logger.info(f"Found {len(lora_pairs)} LoRA pairs to merge")
 
 
 
99
 
100
- progress(0.65, desc=f"Merging {len(lora_pairs)} LoRA layers...")
101
-
102
- # Get model state dict
103
- model_state_dict = model.state_dict()
104
- merged_count = 0
105
-
106
- for adapter_key, lora_weights in lora_pairs.items():
107
- # adapter_key: base_model.model.model.layers.0.self_attn.q_proj
108
- # Need to find corresponding key in model_state_dict
109
-
110
- # Remove 'base_model.model.' prefix
111
- if adapter_key.startswith("base_model.model."):
112
- search_key = adapter_key[len("base_model.model."):]
113
- else:
114
- search_key = adapter_key
115
-
116
- # Find matching key in model
117
- model_key = None
118
- for mk in model_state_dict.keys():
119
- if search_key in mk or mk.endswith(search_key.split(".")[-4:][0]):
120
- # Match by layer structure
121
- if all(part in mk for part in search_key.split(".")[-4:]):
122
- model_key = mk
123
- break
124
-
125
- if model_key and model_key in model_state_dict:
126
- lora_A = lora_weights["A"].to(model_state_dict[model_key].device)
127
- lora_B = lora_weights["B"].to(model_state_dict[model_key].device)
128
-
129
- # Merge: W_new = W_old + (lora_B @ lora_A) * scaling
130
- delta_W = (lora_B @ lora_A) * scaling
131
- model_state_dict[model_key] = model_state_dict[model_key] + delta_W.to(model_state_dict[model_key].dtype)
132
- merged_count += 1
133
-
134
- logger.info(f"Successfully merged {merged_count}/{len(lora_pairs)} LoRA weights")
135
-
136
- # Load merged weights back
137
- progress(0.75, desc="Loading merged weights into model...")
138
- model.load_state_dict(model_state_dict, strict=False)
139
-
140
- return model
141
-
142
- def merge_models(self, hf_token, use_8bit=False, progress=gr.Progress()):
143
- """Merge LoRA adapters with base model"""
144
  try:
145
- # Login to HF
146
- if hf_token:
147
- progress(0.05, desc="Logging in to Hugging Face...")
148
- login(token=hf_token)
149
- logger.info("Logged in to Hugging Face")
150
-
151
- # Clear any existing models from memory
152
- progress(0.1, desc="Clearing GPU memory...")
153
- self.clear_memory()
154
 
155
- # Load tokenizer
156
- progress(0.15, desc="Loading tokenizer...")
157
- logger.info("Loading tokenizer...")
158
- self.tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME, trust_remote_code=True)
159
 
160
- # Configure memory allocation for multi-GPU setup
161
- # Auto-detect GPU memory and adjust accordingly
162
  num_gpus = torch.cuda.device_count()
163
  max_memory = {}
164
- total_vram = 0
165
-
166
  if num_gpus > 0:
167
- # Calculate available memory per GPU
168
  for i in range(num_gpus):
169
  gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
170
- total_vram += gpu_memory
171
- # Reserve 2-4GB per GPU for overhead
172
- per_gpu_memory = f"{int(gpu_memory - 3)}GB"
173
- max_memory[i] = per_gpu_memory
174
-
175
- logger.info(f"Detected {num_gpus} GPUs with total {total_vram:.1f}GB VRAM")
176
- logger.info(f"Configured max_memory: {max_memory}")
177
-
178
- # Warn if total VRAM is low
179
- if total_vram < 90 and not use_8bit:
180
- logger.warning(f"Only {total_vram:.1f}GB VRAM available. The 48B model needs ~96GB in bfloat16. Consider enabling 8-bit quantization.")
181
- else:
182
- # Fallback for CPU-only (will be slow)
183
- max_memory = {"cpu": "64GB"}
184
- logger.warning("No GPUs detected, using CPU fallback")
185
-
186
- # Load base model with explicit multi-GPU configuration
187
- progress(0.25, desc="Loading base model (this may take several minutes)...")
188
- logger.info(f"Loading base model: {BASE_MODEL_NAME}")
189
- logger.info(f"Note: For merging, we'll use a simpler device_map to avoid key naming issues")
190
-
191
- if use_8bit:
192
- logger.info(f"Using 8-bit quantization for memory efficiency (~50% memory reduction)")
193
- precision_desc = "int8"
194
- else:
195
- logger.info(f"Using bfloat16 precision for memory efficiency")
196
- precision_desc = "bfloat16"
197
-
198
- try:
199
- # Try loading with balanced device map to distribute evenly
200
- load_kwargs = {
201
- "trust_remote_code": True,
202
- "low_cpu_mem_usage": True,
203
- "device_map": "balanced", # Distribute layers evenly across GPUs
204
- "max_memory": max_memory,
205
- "torch_dtype": torch.bfloat16,
206
- }
207
-
208
- logger.info("Loading base model with balanced device map...")
209
-
210
- self.base_model = AutoModelForCausalLM.from_pretrained(
211
- BASE_MODEL_NAME,
212
- **load_kwargs
213
- )
214
- logger.info(f"Base model loaded successfully in {precision_desc}")
215
-
216
- # Log device map to see distribution
217
- if hasattr(self.base_model, 'hf_device_map'):
218
- logger.info(f"Model device map: {self.base_model.hf_device_map}")
219
-
220
- except torch.cuda.OutOfMemoryError as e:
221
- logger.error("Out of memory error!")
222
- error_msg = f"GPU Out of Memory: The 48B model requires ~96GB VRAM in bfloat16 or ~48GB in 8-bit.\n"
223
- error_msg += f"You have {total_vram:.1f}GB VRAM available.\n"
224
- if not use_8bit:
225
- error_msg += "\n๐Ÿ’ก **Try enabling 8-bit quantization** to reduce memory usage by ~50%."
226
- raise Exception(error_msg)
227
-
228
- # Download LoRA adapters
229
- progress(0.50, desc="Downloading LoRA adapters...")
230
- logger.info(f"Downloading LoRA adapters from: {LORA_MODEL_NAME}")
231
-
232
- # Download entire adapter folder
233
- adapter_path = snapshot_download(
234
- repo_id=LORA_MODEL_NAME,
235
- token=hf_token,
236
- allow_patterns=["adapter_*", "*.json"]
237
  )
238
- logger.info(f"LoRA adapters downloaded to: {adapter_path}")
239
-
240
- # Use manual merge to avoid PEFT key naming issues
241
- progress(0.55, desc="Merging LoRA weights (manual merge)...")
242
- logger.info("Using manual LoRA merge to avoid key naming conflicts with PEFT")
243
 
244
- try:
245
- self.merged_model = self.manual_merge_lora(self.base_model, adapter_path, progress)
246
- logger.info("โœ… LoRA weights merged successfully using manual method")
247
-
248
- except Exception as merge_error:
249
- logger.error(f"Manual merge failed: {str(merge_error)}", exc_info=True)
250
- error_msg = f"Failed to merge LoRA adapters: {str(merge_error)}\n\n"
251
- error_msg += "This could be due to:\n"
252
- error_msg += "1. Incompatible model architectures\n"
253
- error_msg += "2. Corrupted adapter files\n"
254
- error_msg += "3. Memory issues during merge\n"
255
- raise Exception(error_msg)
256
 
257
- # Save merged model
258
- progress(0.85, desc="Saving merged model...")
259
- logger.info(f"Saving merged model to: {OUTPUT_DIR}")
260
- os.makedirs(OUTPUT_DIR, exist_ok=True)
261
-
262
- self.merged_model.save_pretrained(
263
- OUTPUT_DIR,
264
- safe_serialization=True,
265
- max_shard_size="5GB"
266
- )
267
- self.tokenizer.save_pretrained(OUTPUT_DIR)
268
-
269
- progress(1.0, desc="Complete!")
270
- logger.info("Merge completed successfully")
271
 
272
  # Get model info
273
- total_params = sum(p.numel() for p in self.merged_model.parameters())
274
- trainable_params = sum(p.numel() for p in self.merged_model.parameters() if p.requires_grad)
275
-
276
- # Get GPU memory usage
277
- gpu_memory_info = ""
278
- if torch.cuda.is_available():
279
- gpu_memory_info = "\n**GPU Memory Usage:**\n"
280
- for i in range(torch.cuda.device_count()):
281
- allocated = torch.cuda.memory_allocated(i) / 1024**3
282
- reserved = torch.cuda.memory_reserved(i) / 1024**3
283
- total = torch.cuda.get_device_properties(i).total_memory / 1024**3
284
- gpu_memory_info += f"- GPU {i}: {allocated:.2f}GB allocated / {reserved:.2f}GB reserved / {total:.2f}GB total\n"
285
 
286
- result_message = f"""
287
- โœ… **Merge Completed Successfully!**
288
 
289
  **Model Information:**
290
- - Base Model: `{BASE_MODEL_NAME}`
291
- - LoRA Adapters: `{LORA_MODEL_NAME}`
292
- - Output Directory: `{OUTPUT_DIR}`
293
- - Total Parameters: {total_params:,}
294
- - Trainable Parameters: {trainable_params:,}
295
- - Model Size (bfloat16): ~{(total_params * 2) / 1024**3:.2f} GB
296
- - Timestamp: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
297
- {gpu_memory_info}
298
- **Next Steps:**
299
- 1. The merged model is saved in the container at `/app/merged_model`
300
- 2. You can now test the model using the inference tab
301
- 3. To upload to Hugging Face, use the upload section
302
  """
303
-
304
- return result_message
305
 
306
  except Exception as e:
307
- logger.error(f"Error during merge: {str(e)}", exc_info=True)
308
- self.clear_memory()
309
- return f"โŒ **Error during merge:**\n\n{str(e)}\n\nPlease check the logs for more details."
310
 
311
- def test_inference(self, prompt, max_length, temperature, top_p, progress=gr.Progress()):
312
- """Test the merged model with a prompt"""
 
 
313
  try:
314
- if self.merged_model is None:
315
- return "โŒ Please merge the models first before testing inference."
 
 
316
 
317
- progress(0.3, desc="Tokenizing input...")
318
- inputs = self.tokenizer(prompt, return_tensors="pt").to(self.merged_model.device)
319
 
320
- progress(0.5, desc="Generating response...")
321
  with torch.no_grad():
322
- outputs = self.merged_model.generate(
323
  **inputs,
324
- max_length=max_length,
325
  temperature=temperature,
326
  top_p=top_p,
327
- do_sample=True,
 
 
328
  pad_token_id=self.tokenizer.eos_token_id,
329
  )
330
 
331
- progress(0.9, desc="Decoding output...")
332
  response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
333
 
334
- progress(1.0, desc="Complete!")
335
- return response
336
-
337
- except Exception as e:
338
- logger.error(f"Error during inference: {str(e)}", exc_info=True)
339
- return f"โŒ **Error during inference:**\n\n{str(e)}"
340
-
341
- def upload_to_hub(self, repo_name, hf_token, private, progress=gr.Progress()):
342
- """Upload merged model to Hugging Face Hub"""
343
- try:
344
- if self.merged_model is None:
345
- return "โŒ Please merge the models first before uploading."
346
-
347
- if not repo_name:
348
- return "โŒ Please provide a repository name."
349
-
350
- if not hf_token:
351
- return "โŒ Please provide a Hugging Face token."
352
-
353
- progress(0.1, desc="Logging in...")
354
- login(token=hf_token)
355
 
356
- progress(0.3, desc="Uploading model to Hugging Face Hub...")
357
- logger.info(f"Uploading to: {repo_name}")
358
-
359
- self.merged_model.push_to_hub(
360
- repo_name,
361
- private=private,
362
- safe_serialization=True,
363
- max_shard_size="5GB"
364
- )
365
-
366
- progress(0.8, desc="Uploading tokenizer...")
367
- self.tokenizer.push_to_hub(repo_name, private=private)
368
-
369
- progress(1.0, desc="Complete!")
370
- logger.info("Upload completed successfully")
371
-
372
- repo_url = f"https://huggingface.co/{repo_name}"
373
- return f"โœ… **Successfully uploaded to Hugging Face Hub!**\n\nRepository: [{repo_name}]({repo_url})"
374
 
375
  except Exception as e:
376
- logger.error(f"Error during upload: {str(e)}", exc_info=True)
377
- return f"โŒ **Error during upload:**\n\n{str(e)}"
378
 
379
- # Initialize merger
380
- merger = ModelMerger()
381
-
382
- # Get GPU info for display
383
- def get_gpu_info():
384
- if not torch.cuda.is_available():
385
- return "โš ๏ธ **No GPUs detected!** This Space requires GPUs to run."
386
-
387
- gpu_info = f"โœ… **{torch.cuda.device_count()} GPU(s) detected:**\n\n"
388
- total_memory = 0
389
- for i in range(torch.cuda.device_count()):
390
- name = torch.cuda.get_device_name(i)
391
- memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
392
- total_memory += memory
393
- gpu_info += f"- GPU {i}: {name} ({memory:.1f} GB)\n"
394
- gpu_info += f"\n**Total VRAM:** {total_memory:.1f} GB"
395
- return gpu_info
396
 
397
  # Create Gradio interface
398
- with gr.Blocks(theme=gr.themes.Soft(), title="LoRA Model Merger") as demo:
399
- gr.Markdown("""
400
- # ๐Ÿ”— LoRA Model Merger
401
 
402
- Merge your fine-tuned LoRA adapters with the base model for the **Kimi-Linear-48B-A3B-Instruct** model.
403
-
404
- **Models:**
405
- - **Base Model:** `moonshotai/Kimi-Linear-48B-A3B-Instruct`
406
- - **LoRA Adapters:** `Optivise/kimi-linear-48b-a3b-instruct-qlora-fine-tuned`
407
- """)
408
 
409
- # Display GPU info
410
- gr.Markdown(get_gpu_info())
411
 
412
- with gr.Tabs():
413
- # Tab 1: Merge Models
414
- with gr.Tab("๐Ÿ”„ Merge Models"):
415
- gr.Markdown("""
416
- ### Step 1: Merge LoRA Adapters with Base Model
417
-
418
- This process will:
419
- 1. Download the base model and LoRA adapters
420
- 2. Merge the LoRA weights into the base model
421
- 3. Save the merged model for inference
422
-
423
- โš ๏ธ **Important Notes:**
424
- - This process may take 10-30 minutes depending on model size and network speed
425
- - The 48B parameter model requires **~96GB VRAM** in bfloat16 precision
426
- - Recommended: 4x L40S GPUs (192GB total VRAM) for comfortable operation
427
- - The model will be automatically distributed across all available GPUs
428
- """)
429
-
430
- with gr.Row():
431
- hf_token_merge = gr.Textbox(
432
- label="Hugging Face Token",
433
- placeholder="hf_...",
434
- type="password",
435
- info="Required for accessing private models or avoiding rate limits"
436
- )
437
-
438
- with gr.Row():
439
- use_8bit_checkbox = gr.Checkbox(
440
- label="Use 8-bit Quantization",
441
- value=False,
442
- info="Enable this if you have limited GPU memory (<96GB total). Reduces memory usage by ~50% with minimal quality loss."
443
- )
444
 
445
- merge_button = gr.Button("๐Ÿš€ Start Merge Process", variant="primary", size="lg")
446
- merge_output = gr.Markdown(label="Merge Status")
 
 
 
 
 
 
447
 
448
- merge_button.click(
449
- fn=merger.merge_models,
450
- inputs=[hf_token_merge, use_8bit_checkbox],
451
- outputs=merge_output
 
 
 
452
  )
453
-
454
- # Tab 2: Test Inference
455
- with gr.Tab("๐Ÿงช Test Inference"):
456
- gr.Markdown("""
457
- ### Step 2: Test the Merged Model
458
 
459
- Test the merged model with custom prompts to verify it's working correctly.
460
- """)
 
 
 
 
 
 
461
 
462
- with gr.Row():
463
- with gr.Column():
464
- test_prompt = gr.Textbox(
465
- label="Test Prompt",
466
- placeholder="Enter your test prompt here...",
467
- lines=5,
468
- value="Hello, how are you today?"
469
- )
470
-
471
- with gr.Row():
472
- max_length = gr.Slider(
473
- minimum=50,
474
- maximum=2048,
475
- value=512,
476
- step=1,
477
- label="Max Length"
478
- )
479
- temperature = gr.Slider(
480
- minimum=0.1,
481
- maximum=2.0,
482
- value=0.7,
483
- step=0.1,
484
- label="Temperature"
485
- )
486
- top_p = gr.Slider(
487
- minimum=0.1,
488
- maximum=1.0,
489
- value=0.9,
490
- step=0.05,
491
- label="Top P"
492
- )
493
-
494
- test_button = gr.Button("๐ŸŽฏ Generate", variant="primary")
495
-
496
- with gr.Column():
497
- test_output = gr.Textbox(
498
- label="Model Output",
499
- lines=15,
500
- interactive=False
501
- )
502
 
503
- test_button.click(
504
- fn=merger.test_inference,
505
- inputs=[test_prompt, max_length, temperature, top_p],
506
- outputs=test_output
 
 
 
507
  )
508
 
509
- # Tab 3: Upload to Hub
510
- with gr.Tab("โ˜๏ธ Upload to Hub"):
511
- gr.Markdown("""
512
- ### Step 3: Upload Merged Model to Hugging Face Hub
 
 
 
 
 
513
 
514
- Upload your merged model to Hugging Face Hub for easy sharing and deployment.
515
- """)
 
 
 
 
 
 
516
 
517
  with gr.Row():
518
- with gr.Column():
519
- repo_name = gr.Textbox(
520
- label="Repository Name",
521
- placeholder="username/model-name",
522
- info="Format: username/model-name"
523
- )
524
- hf_token_upload = gr.Textbox(
525
- label="Hugging Face Token (with write access)",
526
- placeholder="hf_...",
527
- type="password",
528
- info="Token must have write permissions"
529
- )
530
- private_repo = gr.Checkbox(
531
- label="Private Repository",
532
- value=True,
533
- info="Keep the model private"
534
- )
535
- upload_button = gr.Button("๐Ÿ“ค Upload to Hub", variant="primary", size="lg")
536
-
537
- with gr.Column():
538
- upload_output = gr.Markdown(label="Upload Status")
539
 
540
- upload_button.click(
541
- fn=merger.upload_to_hub,
542
- inputs=[repo_name, hf_token_upload, private_repo],
543
- outputs=upload_output
544
- )
545
-
546
- # Tab 4: Info & Help
547
- with gr.Tab("โ„น๏ธ Info & Help"):
548
  gr.Markdown("""
549
- ## About This Space
550
-
551
- This Space allows you to merge LoRA (Low-Rank Adaptation) fine-tuned models with their base models.
552
-
553
- ### What is LoRA Merging?
554
-
555
- LoRA is a parameter-efficient fine-tuning technique that adds small adapter layers to a pretrained model.
556
- To use the fine-tuned model without the PEFT library overhead, you can merge these adapters back into
557
- the base model, creating a single unified model.
558
-
559
- ### Process Overview
560
-
561
- 1. **Merge:** Combines the LoRA adapters with the base model
562
- 2. **Test:** Verify the merged model works correctly with inference
563
- 3. **Upload:** Share your merged model on Hugging Face Hub
564
-
565
- ### Hardware Requirements
566
-
567
- - **Current Setup:** 4x NVIDIA L40S GPUs (48GB VRAM each)
568
- - **Model Size:** ~48B parameters
569
- - **Memory Usage:** ~96-120GB VRAM during merge
570
-
571
- ### Tips
572
-
573
- - The merge process can take 10-30 minutes
574
- - Make sure you have a valid Hugging Face token with appropriate permissions
575
- - Test the model thoroughly before uploading to Hub
576
- - Consider keeping the uploaded model private initially
577
-
578
- ### Troubleshooting
579
-
580
- **Out of Memory Errors:**
581
- - The model is very large (48B parameters)
582
- - Try restarting the Space to clear memory
583
-
584
- **Authentication Errors:**
585
- - Ensure your HF token has read access to the base model
586
- - For private models, token must have appropriate permissions
587
-
588
- **Slow Download/Upload:**
589
- - Large models take time to transfer
590
- - Network speed affects download/upload times
591
-
592
- ### Support
593
-
594
- For issues or questions, please check:
595
- - [PEFT Documentation](https://huggingface.co/docs/peft)
596
- - [Transformers Documentation](https://huggingface.co/docs/transformers)
597
  """)
598
 
 
 
 
599
  gr.Markdown("""
600
  ---
601
- **Note:** This Space requires significant computational resources. Ensure you have appropriate GPU allocation.
 
 
 
 
 
602
  """)
603
 
604
- # Launch the app
605
  if __name__ == "__main__":
606
- demo.queue(max_size=5)
607
  demo.launch(
608
  server_name="0.0.0.0",
609
  server_port=7860,
 
2
  import torch
3
  import gradio as gr
4
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
 
 
5
  import logging
6
  from datetime import datetime
 
7
 
8
  # Configure logging
9
  logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
10
  logger = logging.getLogger(__name__)
11
 
12
+ # Model configuration
13
+ MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
14
+ MODEL_DESCRIPTION = """
15
+ # 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
16
+
17
+ A professionally fine-tuned version of Moonshot AI's Kimi-Linear-48B-A3B-Instruct model using QLoRA.
18
+
19
+ **Model Details:**
20
+ - **Base Model:** moonshotai/Kimi-Linear-48B-A3B-Instruct
21
+ - **Parameters:** 48 Billion
22
+ - **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
23
+ - **Training Focus:** Attention layers (q_proj, k_proj, v_proj, o_proj)
24
+ - **Architecture:** Mixture of Experts (MoE) Transformer
25
+ """
26
+
27
  # Check GPU availability
28
  if torch.cuda.is_available():
29
  num_gpus = torch.cuda.device_count()
30
+ total_vram = sum(torch.cuda.get_device_properties(i).total_memory / 1024**3 for i in range(num_gpus))
31
+ logger.info(f"🎮 {num_gpus} GPU(s) detected with {total_vram:.1f}GB total VRAM")
 
 
 
32
  else:
33
+ logger.warning("⚠️ No GPUs detected - running on CPU (will be slow)")
34
 
35
+ class ModelInference:
 
 
 
 
 
36
  def __init__(self):
37
+ self.model = None
38
  self.tokenizer = None
39
+ self.is_loaded = False
 
 
 
40
 
41
+ def load_model(self, progress=gr.Progress()):
42
+ """Load the model and tokenizer"""
43
+ if self.is_loaded:
44
+ return "✅ Model already loaded"
45
 
 
 
46
  try:
47
+ progress(0.2, desc="Loading tokenizer...")
48
+ logger.info(f"Loading tokenizer from: {MODEL_NAME}")
49
+ self.tokenizer = AutoTokenizer.from_pretrained(
50
+ MODEL_NAME,
51
+ trust_remote_code=True
52
+ )
 
 
 
53
 
54
+ progress(0.4, desc="Loading model (this may take several minutes)...")
55
+ logger.info(f"Loading model from: {MODEL_NAME}")
 
 
56
 
57
+ # Configure for multi-GPU
 
58
  num_gpus = torch.cuda.device_count()
59
  max_memory = {}
 
 
60
  if num_gpus > 0:
 
61
  for i in range(num_gpus):
62
  gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
63
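+ # leave ~3 GB of headroom per GPU for runtime overhead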
+ max_memory[i] = f"{int(gpu_memory - 3)}GB"
64
+
65
+ self.model = AutoModelForCausalLM.from_pretrained(
66
+ MODEL_NAME,
67
+ torch_dtype=torch.bfloat16,
68
+ device_map="auto",
69
+ max_memory=max_memory if max_memory else None,
70
+ trust_remote_code=True,
71
+ low_cpu_mem_usage=True,
 
 
 
72
  )
 
 
 
 
 
73
 
74
+ self.model.eval()
75
+ self.is_loaded = True
 
 
76
 
77
+ progress(1.0, desc="Model loaded!")
78
+ logger.info("✅ Model loaded successfully")
 
79
 
80
  # Get model info
81
+ total_params = sum(p.numel() for p in self.model.parameters())
82
+ model_size = (total_params * 2) / 1024**3 # bfloat16 = 2 bytes
 
 
83
 
84
+ info_msg = f"""
85
+ ✅ **Model Loaded Successfully!**
86
 
87
  **Model Information:**
88
+ - Model: `{MODEL_NAME}`
89
+ - Parameters: {total_params:,}
90
+ - Size: ~{model_size:.1f} GB (bfloat16)
91
+ - Device: {"Multi-GPU" if num_gpus > 1 else "Single GPU" if num_gpus == 1 else "CPU"}
92
+
93
+ **You can now start chatting below!** 👇
 
94
  """
95
+ return info_msg
 
96
 
97
  except Exception as e:
98
+ logger.error(f"Failed to load model: {str(e)}", exc_info=True)
99
+ self.is_loaded = False
100
+ return f"❌ **Failed to load model:**\n\n{str(e)}"
101
 
102
+ def generate_response(
103
+ self,
104
+ message,
105
+ history,
106
+ system_prompt,
107
+ max_new_tokens,
108
+ temperature,
109
+ top_p,
110
+ top_k,
111
+ repetition_penalty,
112
+ ):
113
+ """Generate a response from the model"""
114
+ if not self.is_loaded:
115
+ return "❌ Please load the model first using the 'Load Model' button above."
116
+
117
  try:
118
+ # Build conversation context
119
+ conversation = []
120
+
121
+ # Add system prompt if provided
122
+ if system_prompt.strip():
123
+ conversation.append(f"System: {system_prompt.strip()}")
124
+
125
+ # Add chat history
126
+ for human, assistant in history:
127
+ conversation.append(f"User: {human}")
128
+ if assistant:
129
+ conversation.append(f"Assistant: {assistant}")
130
+
131
+ # Add current message
132
+ conversation.append(f"User: {message}")
133
+ conversation.append("Assistant:")
134
+
135
+ # Format prompt
136
+ prompt = "\n".join(conversation)
137
 
138
+ # Tokenize
139
+ inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
140
 
141
+ # Generate
142
  with torch.no_grad():
143
+ outputs = self.model.generate(
144
  **inputs,
145
+ max_new_tokens=max_new_tokens,
146
  temperature=temperature,
147
  top_p=top_p,
148
+ top_k=top_k,
149
+ repetition_penalty=repetition_penalty,
150
+ do_sample=True if temperature > 0 else False,
151
  pad_token_id=self.tokenizer.eos_token_id,
152
  )
153
 
154
+ # Decode response
155
  response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
156
 
157
+ # Extract assistant's response (everything after the last "Assistant:")
158
+ if "Assistant:" in response:
159
+ response = response.split("Assistant:")[-1].strip()
 
 
 
160
 
161
+ return response
 
 
162
 
163
  except Exception as e:
164
+ logger.error(f"Generation failed: {str(e)}", exc_info=True)
165
+ return f"❌ **Generation failed:**\n\n{str(e)}"
166
 
167
+ # Initialize inference
168
+ inferencer = ModelInference()
 
 
169
 
170
  # Create Gradio interface
171
+ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference") as demo:
172
+ gr.Markdown(MODEL_DESCRIPTION)
 
173
 
174
+ # GPU Info
175
+ if torch.cuda.is_available():
176
+ gpu_info = f"### 🎮 Hardware: {torch.cuda.device_count()}x {torch.cuda.get_device_name(0)} ({total_vram:.1f}GB total VRAM)"
177
+ else:
178
+ gpu_info = "### ⚠️ Running on CPU (no GPU detected)"
179
+ gr.Markdown(gpu_info)
180
 
181
+ gr.Markdown("---")
 
182
 
183
+ with gr.Row():
184
+ with gr.Column(scale=1):
185
+ load_btn = gr.Button("🚀 Load Model", variant="primary", size="lg")
186
+ load_status = gr.Markdown("**Status:** Model not loaded. Click 'Load Model' to start.")
187
+
188
+ gr.Markdown("### ⚙️ Generation Settings")
189
+
190
+ system_prompt = gr.Textbox(
191
+ label="System Prompt (Optional)",
192
+ placeholder="You are a helpful AI assistant...",
193
+ lines=3,
194
+ value=""
195
+ )
 
 
196
 
197
+ max_new_tokens = gr.Slider(
198
+ minimum=50,
199
+ maximum=4096,
200
+ value=1024,
201
+ step=1,
202
+ label="Max New Tokens",
203
+ info="Maximum length of generated response"
204
+ )
205
 
206
+ temperature = gr.Slider(
207
+ minimum=0.0,
208
+ maximum=2.0,
209
+ value=0.7,
210
+ step=0.05,
211
+ label="Temperature",
212
+ info="Higher = more creative, Lower = more focused"
213
  )
 
 
 
 
 
214
 
215
+ top_p = gr.Slider(
216
+ minimum=0.0,
217
+ maximum=1.0,
218
+ value=0.9,
219
+ step=0.05,
220
+ label="Top P (Nucleus Sampling)",
221
+ info="Probability threshold for token selection"
222
+ )
223
 
224
+ top_k = gr.Slider(
225
+ minimum=0,
226
+ maximum=100,
227
+ value=50,
228
+ step=1,
229
+ label="Top K",
230
+ info="Number of top tokens to consider (0 = disabled)"
231
+ )
 
 
232
 
233
+ repetition_penalty = gr.Slider(
234
+ minimum=1.0,
235
+ maximum=2.0,
236
+ value=1.1,
237
+ step=0.05,
238
+ label="Repetition Penalty",
239
+ info="Penalty for repeating tokens"
240
  )
241
 
242
+ with gr.Column(scale=2):
243
+ gr.Markdown("### 💬 Chat Interface")
244
+
245
+ chatbot = gr.Chatbot(
246
+ height=500,
247
+ label="Conversation",
248
+ show_copy_button=True,
249
+ avatar_images=["👤", "🤖"]
250
+ )
251
 
252
+ with gr.Row():
253
+ msg = gr.Textbox(
254
+ label="Your Message",
255
+ placeholder="Type your message here...",
256
+ lines=3,
257
+ scale=4
258
+ )
259
+ send_btn = gr.Button("📤 Send", variant="primary", scale=1)
260
 
261
  with gr.Row():
262
+ clear_btn = gr.Button("🗑️ Clear Chat")
263
+ retry_btn = gr.Button("🔄 Retry Last")
 
264
 
 
265
  gr.Markdown("""
266
+ ### 📝 Usage Tips:
267
+ - First, click **"Load Model"** to initialize the model (takes 2-5 minutes)
268
+ - Use the **System Prompt** to set the assistant's behavior
269
+ - Adjust **Temperature** for creativity (0.7-1.0 recommended)
270
+ - Lower **Top P** for more focused responses
271
+ - Clear chat to start a new conversation
 
 
272
  """)
273
 
274
+ # Event handlers
275
+ load_btn.click(
276
+ fn=inferencer.load_model,
277
+ outputs=load_status
278
+ )
279
+
280
+ def user_message(user_msg, history):
281
+ return "", history + [[user_msg, None]]
282
+
283
+ def bot_response(history, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty):
284
+ user_msg = history[-1][0]
285
+ bot_msg = inferencer.generate_response(
286
+ user_msg,
287
+ history[:-1],
288
+ system_prompt,
289
+ max_new_tokens,
290
+ temperature,
291
+ top_p,
292
+ top_k,
293
+ repetition_penalty
294
+ )
295
+ history[-1][1] = bot_msg
296
+ return history
297
+
298
+ # Send message
299
+ msg.submit(
300
+ user_message,
301
+ [msg, chatbot],
302
+ [msg, chatbot],
303
+ queue=False
304
+ ).then(
305
+ bot_response,
306
+ [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
307
+ chatbot
308
+ )
309
+
310
+ send_btn.click(
311
+ user_message,
312
+ [msg, chatbot],
313
+ [msg, chatbot],
314
+ queue=False
315
+ ).then(
316
+ bot_response,
317
+ [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
318
+ chatbot
319
+ )
320
+
321
+ # Clear chat
322
+ clear_btn.click(lambda: None, None, chatbot, queue=False)
323
+
324
+ # Retry last message
325
+ def retry_last(history):
326
+ if history:
327
+ history[-1][1] = None
328
+ return history
329
+
330
+ retry_btn.click(
331
+ retry_last,
332
+ chatbot,
333
+ chatbot,
334
+ queue=False
335
+ ).then(
336
+ bot_response,
337
+ [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
338
+ chatbot
339
+ )
340
+
341
  gr.Markdown("""
342
  ---
343
+
344
+ **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
345
+
346
+ **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
347
+
348
+ Fine-tuned with ❤️ using QLoRA
349
  """)
350
 
351
+ # Launch
352
  if __name__ == "__main__":
353
+ demo.queue(max_size=10)
354
  demo.launch(
355
  server_name="0.0.0.0",
356
  server_port=7860,
inference_app.py ADDED
@@ -0,0 +1,360 @@
 
 
1
+ import os
2
+ import torch
3
+ import gradio as gr
4
+ from transformers import AutoModelForCausalLM, AutoTokenizer
5
+ import logging
6
+ from datetime import datetime
7
+
8
+ # Configure logging
9
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
10
+ logger = logging.getLogger(__name__)
11
+
12
+ # Model configuration
13
+ MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
14
+ MODEL_DESCRIPTION = """
15
+ # 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
16
+
17
+ A professionally fine-tuned version of Moonshot AI's Kimi-Linear-48B-A3B-Instruct model using QLoRA.
18
+
19
+ **Model Details:**
20
+ - **Base Model:** moonshotai/Kimi-Linear-48B-A3B-Instruct
21
+ - **Parameters:** 48 Billion
22
+ - **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
23
+ - **Training Focus:** Attention layers (q_proj, k_proj, v_proj, o_proj)
24
+ - **Architecture:** Mixture of Experts (MoE) Transformer
25
+ """
26
+
27
+ # Check GPU availability
28
+ if torch.cuda.is_available():
29
+ num_gpus = torch.cuda.device_count()
30
+ total_vram = sum(torch.cuda.get_device_properties(i).total_memory / 1024**3 for i in range(num_gpus))
31
+ logger.info(f"🎮 {num_gpus} GPU(s) detected with {total_vram:.1f}GB total VRAM")
32
+ else:
33
+ logger.warning("⚠️ No GPUs detected - running on CPU (will be slow)")
34
+
35
+ class ModelInference:
36
+ def __init__(self):
37
+ self.model = None
38
+ self.tokenizer = None
39
+ self.is_loaded = False
40
+
41
+ def load_model(self, progress=gr.Progress()):
42
+ """Load the model and tokenizer"""
43
+ if self.is_loaded:
44
+ return "✅ Model already loaded"
45
+
46
+ try:
47
+ progress(0.2, desc="Loading tokenizer...")
48
+ logger.info(f"Loading tokenizer from: {MODEL_NAME}")
49
+ self.tokenizer = AutoTokenizer.from_pretrained(
50
+ MODEL_NAME,
51
+ trust_remote_code=True
52
+ )
53
+
54
+ progress(0.4, desc="Loading model (this may take several minutes)...")
55
+ logger.info(f"Loading model from: {MODEL_NAME}")
56
+
57
+ # Configure for multi-GPU
58
+ num_gpus = torch.cuda.device_count()
59
+ max_memory = {}
60
+ if num_gpus > 0:
61
+ for i in range(num_gpus):
62
+ gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
63
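+ # leave ~3 GB of headroom per GPU for runtime overhead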
+ max_memory[i] = f"{int(gpu_memory - 3)}GB"
64
+
65
+ self.model = AutoModelForCausalLM.from_pretrained(
66
+ MODEL_NAME,
67
+ torch_dtype=torch.bfloat16,
68
+ device_map="auto",
69
+ max_memory=max_memory if max_memory else None,
70
+ trust_remote_code=True,
71
+ low_cpu_mem_usage=True,
72
+ )
73
+
74
+ self.model.eval()
75
+ self.is_loaded = True
76
+
77
+ progress(1.0, desc="Model loaded!")
78
+ logger.info("✅ Model loaded successfully")
79
+
80
+ # Get model info
81
+ total_params = sum(p.numel() for p in self.model.parameters())
82
+ model_size = (total_params * 2) / 1024**3 # bfloat16 = 2 bytes
83
+
84
+ info_msg = f"""
85
+ ✅ **Model Loaded Successfully!**
86
+
87
+ **Model Information:**
88
+ - Model: `{MODEL_NAME}`
89
+ - Parameters: {total_params:,}
90
+ - Size: ~{model_size:.1f} GB (bfloat16)
91
+ - Device: {"Multi-GPU" if num_gpus > 1 else "Single GPU" if num_gpus == 1 else "CPU"}
92
+
93
+ **You can now start chatting below!** 👇
94
+ """
95
+ return info_msg
96
+
97
+ except Exception as e:
98
+ logger.error(f"Failed to load model: {str(e)}", exc_info=True)
99
+ self.is_loaded = False
100
+ return f"❌ **Failed to load model:**\n\n{str(e)}"
101
+
102
+ def generate_response(
103
+ self,
104
+ message,
105
+ history,
106
+ system_prompt,
107
+ max_new_tokens,
108
+ temperature,
109
+ top_p,
110
+ top_k,
111
+ repetition_penalty,
112
+ ):
113
+ """Generate a response from the model"""
114
+ if not self.is_loaded:
115
+ return "❌ Please load the model first using the 'Load Model' button above."
116
+
117
+ try:
118
+ # Build conversation context
119
+ conversation = []
120
+
121
+ # Add system prompt if provided
122
+ if system_prompt.strip():
123
+ conversation.append(f"System: {system_prompt.strip()}")
124
+
125
+ # Add chat history
126
+ for human, assistant in history:
127
+ conversation.append(f"User: {human}")
128
+ if assistant:
129
+ conversation.append(f"Assistant: {assistant}")
130
+
131
+ # Add current message
132
+ conversation.append(f"User: {message}")
133
+ conversation.append("Assistant:")
134
+
135
+ # Format prompt
136
+ prompt = "\n".join(conversation)
137
+
138
+ # Tokenize
139
+ inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
140
+
141
+ # Generate
142
+ with torch.no_grad():
143
+ outputs = self.model.generate(
144
+ **inputs,
145
+ max_new_tokens=max_new_tokens,
146
+ temperature=temperature,
147
+ top_p=top_p,
148
+ top_k=top_k,
149
+ repetition_penalty=repetition_penalty,
150
+ do_sample=True if temperature > 0 else False,
151
+ pad_token_id=self.tokenizer.eos_token_id,
152
+ )
153
+
154
+ # Decode response
155
+ response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
156
+
157
+ # Extract assistant's response (everything after the last "Assistant:")
158
+ if "Assistant:" in response:
159
+ response = response.split("Assistant:")[-1].strip()
160
+
161
+ return response
162
+
163
+ except Exception as e:
164
+ logger.error(f"Generation failed: {str(e)}", exc_info=True)
165
+ return f"❌ **Generation failed:**\n\n{str(e)}"
166
+
167
+ # Initialize inference
168
+ inferencer = ModelInference()
169
+
170
+ # Create Gradio interface
171
+ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference") as demo:
172
+ gr.Markdown(MODEL_DESCRIPTION)
173
+
174
+ # GPU Info
175
+ if torch.cuda.is_available():
176
+ gpu_info = f"### 🎮 Hardware: {torch.cuda.device_count()}x {torch.cuda.get_device_name(0)} ({total_vram:.1f}GB total VRAM)"
177
+ else:
178
+ gpu_info = "### ⚠️ Running on CPU (no GPU detected)"
179
+ gr.Markdown(gpu_info)
180
+
181
+ gr.Markdown("---")
182
+
183
+ with gr.Row():
184
+ with gr.Column(scale=1):
185
+ load_btn = gr.Button("🚀 Load Model", variant="primary", size="lg")
186
+ load_status = gr.Markdown("**Status:** Model not loaded. Click 'Load Model' to start.")
187
+
188
+ gr.Markdown("### ⚙️ Generation Settings")
189
+
190
+ system_prompt = gr.Textbox(
191
+ label="System Prompt (Optional)",
192
+ placeholder="You are a helpful AI assistant...",
193
+ lines=3,
194
+ value=""
195
+ )
196
+
197
+ max_new_tokens = gr.Slider(
198
+ minimum=50,
199
+ maximum=4096,
200
+ value=1024,
201
+ step=1,
202
+ label="Max New Tokens",
203
+ info="Maximum length of generated response"
204
+ )
205
+
206
+ temperature = gr.Slider(
207
+ minimum=0.0,
208
+ maximum=2.0,
209
+ value=0.7,
210
+ step=0.05,
211
+ label="Temperature",
212
+ info="Higher = more creative, Lower = more focused"
213
+ )
214
+
215
+ top_p = gr.Slider(
216
+ minimum=0.0,
217
+ maximum=1.0,
218
+ value=0.9,
219
+ step=0.05,
220
+ label="Top P (Nucleus Sampling)",
221
+ info="Probability threshold for token selection"
222
+ )
223
+
224
+ top_k = gr.Slider(
225
+ minimum=0,
226
+ maximum=100,
227
+ value=50,
228
+ step=1,
229
+ label="Top K",
230
+ info="Number of top tokens to consider (0 = disabled)"
231
+ )
232
+
233
+ repetition_penalty = gr.Slider(
234
+ minimum=1.0,
235
+ maximum=2.0,
236
+ value=1.1,
237
+ step=0.05,
238
+ label="Repetition Penalty",
239
+ info="Penalty for repeating tokens"
240
+ )
241
+
242
+ with gr.Column(scale=2):
243
+ gr.Markdown("### 💬 Chat Interface")
244
+
245
+ chatbot = gr.Chatbot(
246
+ height=500,
247
+ label="Conversation",
248
+ show_copy_button=True,
249
+ avatar_images=["👤", "🤖"]
250
+ )
251
+
252
+ with gr.Row():
253
+ msg = gr.Textbox(
254
+ label="Your Message",
255
+ placeholder="Type your message here...",
256
+ lines=3,
257
+ scale=4
258
+ )
259
+ send_btn = gr.Button("📤 Send", variant="primary", scale=1)
260
+
261
+ with gr.Row():
262
+ clear_btn = gr.Button("🗑️ Clear Chat")
263
+ retry_btn = gr.Button("🔄 Retry Last")
264
+
265
+ gr.Markdown("""
266
+ ### 📝 Usage Tips:
267
+ - First, click **"Load Model"** to initialize the model (takes 2-5 minutes)
268
+ - Use the **System Prompt** to set the assistant's behavior
269
+ - Adjust **Temperature** for creativity (0.7-1.0 recommended)
270
+ - Lower **Top P** for more focused responses
271
+ - Clear chat to start a new conversation
272
+ """)
273
+
274
+ # Event handlers
275
+ load_btn.click(
276
+ fn=inferencer.load_model,
277
+ outputs=load_status
278
+ )
279
+
280
+ def user_message(user_msg, history):
281
+ return "", history + [[user_msg, None]]
282
+
283
+ def bot_response(history, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty):
284
+ user_msg = history[-1][0]
285
+ bot_msg = inferencer.generate_response(
286
+ user_msg,
287
+ history[:-1],
288
+ system_prompt,
289
+ max_new_tokens,
290
+ temperature,
291
+ top_p,
292
+ top_k,
293
+ repetition_penalty
294
+ )
295
+ history[-1][1] = bot_msg
296
+ return history
297
+
298
+ # Send message
299
+ msg.submit(
300
+ user_message,
301
+ [msg, chatbot],
302
+ [msg, chatbot],
303
+ queue=False
304
+ ).then(
305
+ bot_response,
306
+ [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
307
+ chatbot
308
+ )
309
+
310
+ send_btn.click(
311
+ user_message,
312
+ [msg, chatbot],
313
+ [msg, chatbot],
314
+ queue=False
315
+ ).then(
316
+ bot_response,
317
+ [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
318
+ chatbot
319
+ )
320
+
321
+ # Clear chat
322
+ clear_btn.click(lambda: None, None, chatbot, queue=False)
323
+
324
+ # Retry last message
325
+ def retry_last(history):
326
+ if history:
327
+ history[-1][1] = None
328
+ return history
329
+
330
+ retry_btn.click(
331
+ retry_last,
332
+ chatbot,
333
+ chatbot,
334
+ queue=False
335
+ ).then(
336
+ bot_response,
337
+ [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
338
+ chatbot
339
+ )
340
+
341
+ gr.Markdown("""
342
+ ---
343
+
344
+ **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
345
+
346
+ **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
347
+
348
+ Fine-tuned with ❤️ using QLoRA
349
+ """)
350
+
351
+ # Launch
352
+ if __name__ == "__main__":
353
+ demo.queue(max_size=10)
354
+ demo.launch(
355
+ server_name="0.0.0.0",
356
+ server_port=7860,
357
+ share=False,
358
+ show_error=True
359
+ )
360
+