aeb56 committed
Commit 310eb95 · Parent(s): e62c736

Switch to vLLM for high-performance, stable inference

Files changed (4)
  1. Dockerfile +3 -7
  2. README.md +79 -46
  3. app.py +147 -250
  4. requirements.txt +12 -60
Dockerfile CHANGED
@@ -12,7 +12,6 @@ RUN apt-get update && apt-get install -y \
     python3.10 \
     python3-pip \
     git \
-    git-lfs \
     wget \
     && rm -rf /var/lib/apt/lists/*
 
@@ -34,15 +33,13 @@ RUN pip3 install --no-cache-dir -r requirements.txt
 # Copy application files
 COPY . .
 
-# Create directories for models and cache
-RUN mkdir -p /app/models /app/merged_model /app/cache /tmp/offload
-
 # Set ownership and permissions for user
-RUN chown -R user:user /app /tmp/offload && \
+RUN chown -R user:user /app && \
     chmod -R 755 /app
 
-# Expose port for Gradio
+# Expose ports
 EXPOSE 7860
+EXPOSE 8000
 
 # Set HuggingFace cache directory
 ENV HF_HOME=/app/cache
@@ -53,4 +50,3 @@ USER user
 
 # Run the application
 CMD ["python3", "app.py"]
-
 
README.md CHANGED
@@ -12,80 +12,113 @@ suggested_hardware: l40sx4
 
 # 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
 
-Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
+High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by **vLLM**.
 
 ## Model Information
 
 - **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
 - **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
 - **Parameters:** 48 Billion
-- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
-- **Architecture:** Mixture of Experts (MoE) Transformer
+- **Fine-tuning:** QLoRA on attention layers
+- **Inference Engine:** vLLM
 
 ## Features
 
-**Professional Chat Interface**
-- Clean, modern UI for seamless conversations
-- Chat history with copy functionality
-- System prompt customization
-
-⚙️ **Advanced Generation Settings**
-- Temperature control for creativity
-- Top-P and Top-K sampling
-- Repetition penalty adjustment
-- Configurable response length
-
-🎮 **Optimized Performance**
-- Multi-GPU support (4xL40S recommended)
-- Automatic device mapping
-- bfloat16 precision for efficiency
-- ~96GB VRAM requirement
+**High-Performance Inference**
+- Powered by vLLM for maximum throughput
+- Optimized memory usage with PagedAttention
+- Multi-GPU support (automatic)
+
+💬 **Professional Chat Interface**
+- Clean Gradio UI
+- Real-time responses
+- Chat history
+- Copy button for responses
+
+⚙️ **Configurable Generation**
+- Temperature control
+- Top-P sampling
+- Max tokens setting
+- System prompt support
 
 ## Usage
 
-1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
-2. **Set System Prompt** (optional) - Define the assistant's behavior
-3. **Start Chatting** - Type your message and hit send
-4. **Adjust Settings** - Fine-tune generation parameters as needed
-
-## Generation Parameters
-
-### Temperature (0.0 - 2.0)
-- **Low (0.1-0.5):** Focused, deterministic responses
-- **Medium (0.6-0.9):** Balanced creativity
-- **High (1.0-2.0):** More creative and diverse outputs
-
-### Top P (0.0 - 1.0)
-- **0.9 (recommended):** Good balance
-- Lower values: More focused
-- Higher values: More diverse
-
-### Max New Tokens
-- Maximum length of generated response
-- **1024 (default):** Good for most use cases
-- Increase for longer responses
+### Quick Start
+
+1. **Start vLLM Server**
+   - Click the "🚀 Start vLLM Server" button
+   - Wait 2-5 minutes for initialization
+   - Look for "✅ Server started successfully"
+
+2. **Chat**
+   - Type your message
+   - Click "Send" or press Enter
+   - Get fast, high-quality responses
+
+3. **Customize**
+   - Set a system prompt (optional)
+   - Adjust temperature for creativity
+   - Modify max tokens for response length
+
+## Why vLLM?
+
+vLLM is a high-throughput, memory-efficient inference engine:
+- **Faster:** Optimized CUDA kernels
+- **Efficient:** PagedAttention for KV cache
+- **Scalable:** Multi-GPU support
+- **Compatible:** OpenAI API format
 
 ## Hardware Requirements
 
-- **Recommended:** 4x NVIDIA L40S GPUs (192GB total VRAM)
-- **Minimum:** 4x NVIDIA L4 GPUs (96GB total VRAM)
-- **Memory:** ~96GB VRAM in bfloat16 precision
-
-## Fine-tuning Details
-
-This model was fine-tuned using QLoRA with the following configuration:
-- **LoRA Rank (r):** 16
+- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
+- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
+- **Model Size:** ~96GB in bfloat16
+
+## Technical Details
+
+### Fine-tuning Configuration
+- **Method:** QLoRA
+- **LoRA Rank:** 16
 - **LoRA Alpha:** 32
-- **Target Modules:** q_proj, k_proj, v_proj, o_proj (attention layers only)
-- **Dropout:** 0.05
+- **Target Modules:** q_proj, k_proj, v_proj, o_proj
+- **Training:** Attention layers only
+
+### Generation Parameters
+
+**Temperature (0.0-2.0)**
+- 0.1-0.5: Focused, deterministic
+- 0.6-0.9: Balanced (recommended)
+- 1.0-2.0: Creative, diverse
+
+**Top P (0.0-1.0)**
+- Controls nucleus sampling
+- 0.9 recommended for most use cases
+
+**Max Tokens**
+- Maximum response length
+- 1024 default, up to 4096
+
+## API Access
+
+vLLM provides an OpenAI-compatible API:
+
+```bash
+curl -X POST "http://localhost:8000/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  --data '{
+    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
+    "messages": [
+      {"role": "user", "content": "Hello!"}
+    ]
+  }'
+```
 
 ## Support
 
-For issues or questions:
-- [Transformers Documentation](https://huggingface.co/docs/transformers)
+- [vLLM Documentation](https://docs.vllm.ai/)
 - [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
+- [Transformers Documentation](https://huggingface.co/docs/transformers)
 
 ---
 
-Built with ❤️ using Transformers and Gradio
-
+**Powered by vLLM** 🚀 | Built with ❤️
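Note: the curl example added to the README above can equally be issued from Python. A minimal sketch assuming the Space's vLLM server is reachable at localhost:8000 (the `max_tokens` and `temperature` values here are illustrative):

```python
import requests

# Same request as the README's curl example, sent from Python.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```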
 
app.py CHANGED
@@ -1,192 +1,126 @@
-import os
-import torch
 import gradio as gr
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import logging
-from datetime import datetime
-
-# Configure logging
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
-logger = logging.getLogger(__name__)
+import requests
+import json
+import subprocess
+import time
+import os
+import signal
+import sys
 
 # Model configuration
 MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
-MODEL_DESCRIPTION = """
-# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
-
-A professionally fine-tuned version of Moonshot AI's Kimi-Linear-48B-A3B-Instruct model using QLoRA.
-
-**Model Details:**
-- **Base Model:** moonshotai/Kimi-Linear-48B-A3B-Instruct
-- **Parameters:** 48 Billion
-- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
-- **Training Focus:** Attention layers (q_proj, k_proj, v_proj, o_proj)
-- **Architecture:** Mixture of Experts (MoE) Transformer
-"""
-
-# Check GPU availability
-if torch.cuda.is_available():
-    num_gpus = torch.cuda.device_count()
-    total_vram = sum(torch.cuda.get_device_properties(i).total_memory / 1024**3 for i in range(num_gpus))
-    logger.info(f"🎮 {num_gpus} GPU(s) detected with {total_vram:.1f}GB total VRAM")
-else:
-    logger.warning("⚠️ No GPUs detected - running on CPU (will be slow)")
-
-class ModelInference:
-    def __init__(self):
-        self.model = None
-        self.tokenizer = None
-        self.is_loaded = False
-
-    def load_model(self, progress=gr.Progress()):
-        """Load the model and tokenizer"""
-        if self.is_loaded:
-            return "✅ Model already loaded"
-
-        try:
-            progress(0.2, desc="Loading tokenizer...")
-            logger.info(f"Loading tokenizer from: {MODEL_NAME}")
-            self.tokenizer = AutoTokenizer.from_pretrained(
-                MODEL_NAME,
-                trust_remote_code=True
-            )
-
-            progress(0.4, desc="Loading model (this may take several minutes)...")
-            logger.info(f"Loading model from: {MODEL_NAME}")
-
-            # Configure for multi-GPU
-            num_gpus = torch.cuda.device_count()
-            max_memory = {}
-            if num_gpus > 0:
-                for i in range(num_gpus):
-                    gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
-                    max_memory[i] = f"{int(gpu_memory - 3)}GB"
-
-            self.model = AutoModelForCausalLM.from_pretrained(
-                MODEL_NAME,
-                torch_dtype=torch.bfloat16,
-                device_map="auto",
-                max_memory=max_memory if max_memory else None,
-                trust_remote_code=True,
-                low_cpu_mem_usage=True,
-            )
-
-            self.model.eval()
-            self.is_loaded = True
-
-            progress(1.0, desc="Model loaded!")
-            logger.info("✅ Model loaded successfully")
-
-            # Get model info
-            total_params = sum(p.numel() for p in self.model.parameters())
-            model_size = (total_params * 2) / 1024**3  # bfloat16 = 2 bytes
-
-            info_msg = f"""
-✅ **Model Loaded Successfully!**
-
-**Model Information:**
-- Model: `{MODEL_NAME}`
-- Parameters: {total_params:,}
-- Size: ~{model_size:.1f} GB (bfloat16)
-- Device: {"Multi-GPU" if num_gpus > 1 else "Single GPU" if num_gpus == 1 else "CPU"}
-
-**You can now start chatting below!** 👇
-"""
-            return info_msg
-
-        except Exception as e:
-            logger.error(f"Failed to load model: {str(e)}", exc_info=True)
-            self.is_loaded = False
-            return f"❌ **Failed to load model:**\n\n{str(e)}"
-
-    def generate_response(
-        self,
-        message,
-        history,
-        system_prompt,
-        max_new_tokens,
-        temperature,
-        top_p,
-        top_k,
-        repetition_penalty,
-    ):
-        """Generate a response from the model"""
-        if not self.is_loaded:
-            return "❌ Please load the model first using the 'Load Model' button above."
-
-        try:
-            # Build conversation context
-            conversation = []
-
-            # Add system prompt if provided
-            if system_prompt.strip():
-                conversation.append(f"System: {system_prompt.strip()}")
-
-            # Add chat history
-            for human, assistant in history:
-                conversation.append(f"User: {human}")
-                if assistant:
-                    conversation.append(f"Assistant: {assistant}")
-
-            # Add current message
-            conversation.append(f"User: {message}")
-            conversation.append("Assistant:")
-
-            # Format prompt
-            prompt = "\n".join(conversation)
-
-            # Tokenize
-            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
-
-            # Generate
-            with torch.no_grad():
-                outputs = self.model.generate(
-                    **inputs,
-                    max_new_tokens=max_new_tokens,
-                    temperature=temperature,
-                    top_p=top_p,
-                    top_k=top_k,
-                    repetition_penalty=repetition_penalty,
-                    do_sample=True if temperature > 0 else False,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                )
-
-            # Decode response
-            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-
-            # Extract assistant's response (everything after the last "Assistant:")
-            if "Assistant:" in response:
-                response = response.split("Assistant:")[-1].strip()
-
-            return response
-
-        except Exception as e:
-            logger.error(f"Generation failed: {str(e)}", exc_info=True)
-            return f"❌ **Generation failed:**\n\n{str(e)}"
-
-# Initialize inference
-inferencer = ModelInference()
+VLLM_PORT = 8000
+VLLM_PROCESS = None
+
+def start_vllm_server():
+    """Start vLLM server in background"""
+    global VLLM_PROCESS
+
+    if VLLM_PROCESS is not None:
+        return "✅ vLLM server already running"
+
+    try:
+        # Start vLLM's OpenAI-compatible API server as a subprocess
+        cmd = [
+            "python", "-m", "vllm.entrypoints.openai.api_server",
+            "--model", MODEL_NAME,
+            "--host", "0.0.0.0",
+            "--port", str(VLLM_PORT),
+            "--dtype", "bfloat16",
+            "--trust-remote-code",
+        ]
+
+        VLLM_PROCESS = subprocess.Popen(
+            cmd,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            preexec_fn=os.setsid if sys.platform != 'win32' else None
+        )
+
+        # Poll the health endpoint until the server is ready
+        max_retries = 60
+        for _ in range(max_retries):
+            try:
+                response = requests.get(f"http://localhost:{VLLM_PORT}/health", timeout=1)
+                if response.status_code == 200:
+                    return "✅ vLLM server started successfully!"
+            except requests.exceptions.RequestException:
+                pass
+            time.sleep(2)
+
+        return "⚠️ vLLM server started but health check failed"
+
+    except Exception as e:
+        return f"❌ Failed to start vLLM server: {str(e)}"
+
+def chat(message, history, system_prompt, max_tokens, temperature, top_p):
+    """Send chat message to vLLM server"""
+    try:
+        # Build OpenAI-style message list
+        messages = []
+
+        if system_prompt.strip():
+            messages.append({"role": "system", "content": system_prompt.strip()})
+
+        # Add history
+        for human, assistant in history:
+            messages.append({"role": "user", "content": human})
+            if assistant:
+                messages.append({"role": "assistant", "content": assistant})
+
+        # Add current message
+        messages.append({"role": "user", "content": message})
+
+        # Call vLLM API
+        response = requests.post(
+            f"http://localhost:{VLLM_PORT}/v1/chat/completions",
+            headers={"Content-Type": "application/json"},
+            json={
+                "model": MODEL_NAME,
+                "messages": messages,
+                "max_tokens": max_tokens,
+                "temperature": temperature,
+                "top_p": top_p,
+                "stream": False
+            },
+            timeout=300
+        )
+
+        if response.status_code == 200:
+            result = response.json()
+            assistant_message = result["choices"][0]["message"]["content"]
+            return assistant_message
+        else:
+            return f"❌ Error: {response.status_code} - {response.text}"
+
+    except requests.exceptions.ConnectionError:
+        return "❌ Cannot connect to vLLM server. Please start the server first."
+    except Exception as e:
+        return f"❌ Error: {str(e)}"
+
+# Custom CSS
+custom_css = """
+.gradio-container {
+    max-width: 1200px !important;
+}
+"""
 
 # Create Gradio interface
-with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference") as demo:
-    gr.Markdown(MODEL_DESCRIPTION)
-
-    # GPU Info
-    if torch.cuda.is_available():
-        num_gpus = torch.cuda.device_count()
-        total_vram_ui = sum(torch.cuda.get_device_properties(i).total_memory / 1024**3 for i in range(num_gpus))
-        gpu_info = f"### 🎮 Hardware: {num_gpus}x {torch.cuda.get_device_name(0)} ({total_vram_ui:.1f}GB total VRAM)"
-    else:
-        gpu_info = "### ⚠️ Running on CPU (no GPU detected)"
-    gr.Markdown(gpu_info)
-
-    gr.Markdown("---")
+with gr.Blocks(theme=gr.themes.Soft(), css=custom_css, title="Kimi 48B Fine-tuned") as demo:
+    gr.Markdown("""
+    # 🚀 Kimi Linear 48B A3B - Fine-tuned Inference
+
+    High-performance inference using **vLLM** for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
+
+    **Model:** `optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune`
+    """)
 
     with gr.Row():
         with gr.Column(scale=1):
-            load_btn = gr.Button("🚀 Load Model", variant="primary", size="lg")
-            load_status = gr.Markdown("**Status:** Model not loaded. Click 'Load Model' to start.")
+            gr.Markdown("### 🎛️ Server Control")
+            start_btn = gr.Button("🚀 Start vLLM Server", variant="primary", size="lg")
+            server_status = gr.Markdown("**Status:** Server not started")
 
+            gr.Markdown("---")
             gr.Markdown("### ⚙️ Generation Settings")
 
             system_prompt = gr.Textbox(
@@ -196,13 +130,12 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
                 value=""
             )
 
-            max_new_tokens = gr.Slider(
+            max_tokens = gr.Slider(
                 minimum=50,
                 maximum=4096,
                 value=1024,
                 step=1,
-                label="Max New Tokens",
-                info="Maximum length of generated response"
+                label="Max Tokens"
             )
 
             temperature = gr.Slider(
@@ -210,8 +143,7 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
                 maximum=2.0,
                 value=0.7,
                 step=0.05,
-                label="Temperature",
-                info="Higher = more creative, Lower = more focused"
+                label="Temperature"
             )
 
             top_p = gr.Slider(
@@ -219,34 +151,24 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
                 maximum=1.0,
                 value=0.9,
                 step=0.05,
-                label="Top P (Nucleus Sampling)",
-                info="Probability threshold for token selection"
+                label="Top P"
            )
 
-            top_k = gr.Slider(
-                minimum=0,
-                maximum=100,
-                value=50,
-                step=1,
-                label="Top K",
-                info="Number of top tokens to consider (0 = disabled)"
-            )
-
-            repetition_penalty = gr.Slider(
-                minimum=1.0,
-                maximum=2.0,
-                value=1.1,
-                step=0.05,
-                label="Repetition Penalty",
-                info="Penalty for repeating tokens"
-            )
+            gr.Markdown("""
+            ### 📖 Instructions
+
+            1. **Start Server** - Click the button above (takes 2-5 min)
+            2. **Wait for "✅"** - Server is ready when you see the green checkmark
+            3. **Start Chatting** - Type your message below
+
+            **Note:** First message may be slow as the model loads into memory.
+            """)
 
         with gr.Column(scale=2):
-            gr.Markdown("### 💬 Chat Interface")
+            gr.Markdown("### 💬 Chat")
 
             chatbot = gr.Chatbot(
                 height=500,
-                label="Conversation",
                 show_copy_button=True,
                 avatar_images=["👤", "🤖"]
             )
@@ -255,49 +177,32 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
             msg = gr.Textbox(
                 label="Your Message",
                 placeholder="Type your message here...",
-                lines=3,
+                lines=2,
                 scale=4
             )
             send_btn = gr.Button("📤 Send", variant="primary", scale=1)
 
         with gr.Row():
             clear_btn = gr.Button("🗑️ Clear Chat")
-            retry_btn = gr.Button("🔄 Retry Last")
-
-    gr.Markdown("""
-    ### 📝 Usage Tips:
-    - First, click **"Load Model"** to initialize the model (takes 2-5 minutes)
-    - Use the **System Prompt** to set the assistant's behavior
-    - Adjust **Temperature** for creativity (0.7-1.0 recommended)
-    - Lower **Top P** for more focused responses
-    - Clear chat to start a new conversation
-    """)
 
     # Event handlers
-    load_btn.click(
-        fn=inferencer.load_model,
-        outputs=load_status
+    start_btn.click(
+        fn=start_vllm_server,
+        outputs=server_status
    )
 
    def user_message(user_msg, history):
        return "", history + [[user_msg, None]]
 
-    def bot_response(history, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty):
+    def bot_response(history, system_prompt, max_tokens, temperature, top_p):
+        if not history or history[-1][1] is not None:
+            return history
+
        user_msg = history[-1][0]
-        bot_msg = inferencer.generate_response(
-            user_msg,
-            history[:-1],
-            system_prompt,
-            max_new_tokens,
-            temperature,
-            top_p,
-            top_k,
-            repetition_penalty
-        )
+        bot_msg = chat(user_msg, history[:-1], system_prompt, max_tokens, temperature, top_p)
        history[-1][1] = bot_msg
        return history
 
-    # Send message
    msg.submit(
        user_message,
        [msg, chatbot],
@@ -305,7 +210,7 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
        queue=False
    ).then(
        bot_response,
-        [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
+        [chatbot, system_prompt, max_tokens, temperature, top_p],
        chatbot
    )
 
@@ -316,47 +221,39 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
        queue=False
    ).then(
        bot_response,
-        [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
+        [chatbot, system_prompt, max_tokens, temperature, top_p],
        chatbot
    )
 
-    # Clear chat
    clear_btn.click(lambda: None, None, chatbot, queue=False)
 
-    # Retry last message
-    def retry_last(history):
-        if history:
-            history[-1][1] = None
-        return history
-
-    retry_btn.click(
-        retry_last,
-        chatbot,
-        chatbot,
-        queue=False
-    ).then(
-        bot_response,
-        [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
-        chatbot
-    )
-
    gr.Markdown("""
    ---
 
-    **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
+    **Powered by vLLM** - High-performance LLM inference engine
 
-    **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
-
-    Fine-tuned with ❤️ using QLoRA
+    **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
    """)
 
-# Launch
+# Cleanup on exit: terminate the vLLM subprocess (and its process group)
+def cleanup():
+    global VLLM_PROCESS
+    if VLLM_PROCESS:
+        try:
+            if sys.platform == 'win32':
+                VLLM_PROCESS.terminate()
+            else:
+                os.killpg(os.getpgid(VLLM_PROCESS.pid), signal.SIGTERM)
+        except Exception:
+            pass
+
+import atexit
+atexit.register(cleanup)
+
 if __name__ == "__main__":
-    demo.queue(max_size=10)
+    demo.queue()
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
-        share=False,
-        show_error=True
+        share=False
    )
-
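Note: the committed `chat()` helper requests a complete response (`"stream": False`). If token-by-token output is wanted later, the same endpoint can stream. A hedged sketch, not part of this commit, assuming the server runs on localhost:8000 and emits the OpenAI-style server-sent-event chunks that vLLM's OpenAI-compatible server uses:

```python
import json
import requests

def chat_stream(messages, max_tokens=1024, temperature=0.7, top_p=0.9):
    """Yield response text incrementally from the vLLM OpenAI-compatible API."""
    with requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stream": True,  # ask the server for server-sent events
        },
        stream=True,
        timeout=300,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            # SSE lines look like: b'data: {...json chunk...}'
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":  # end-of-stream sentinel
                break
            delta = json.loads(payload)["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

# Example: print tokens as they arrive
for chunk in chat_stream([{"role": "user", "content": "Hello!"}]):
    print(chunk, end="", flush=True)
```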
 
requirements.txt CHANGED
@@ -1,60 +1,12 @@
-# Core ML Libraries
-transformers>=4.56.0  # Required by Kimi model (has assertion check)
-accelerate>=0.34.2  # Compatible with latest transformers
-peft>=0.13.2  # Latest stable
-bitsandbytes>=0.45.1  # Compatible with triton 3.x (0.45.0 has triton.ops issue)
-sentencepiece==0.2.0
-protobuf==5.29.2  # Updated for compatibility
-
-# Training & Optimization
-deepspeed==0.16.3  # Compatible with torch 2.5, more stable than 0.18.x
-triton>=3.2.0  # Required by fla for optimal performance
-scipy==1.14.1  # Updated
-scikit-learn==1.6.0  # Updated
-ninja==1.11.1.1
-
-# Data Processing
-datasets>=3.2.0  # Updated for compatibility
-tokenizers>=0.21.0  # Compatible with latest transformers
-pandas==2.2.3
-numpy==1.26.4  # Keep for stability (2.x has breaking changes)
-
-# Monitoring & Logging
-wandb==0.19.1  # Updated
-tensorboard==2.18.0
-tqdm==4.67.1  # Updated
-psutil==6.1.1  # Updated
-pynvml==11.5.3  # Updated
-
-# Evaluation
-rouge-score==0.1.2
-sacrebleu==2.4.3  # Updated
-bert-score==0.3.13
-
-# Utilities
-pyyaml==6.0.2
-python-dotenv==1.0.1
-huggingface-hub>=0.34.0  # Required by transformers >=4.56.0
-safetensors==0.4.5
-tiktoken==0.8.0  # Updated
-hf_transfer==0.1.8  # Updated
-
-# Kimi / Flash Linear Attention runtime (requires torch>=2.5)
-# Install from git to get latest version with fla.layers module
-git+https://github.com/sustcsonglin/flash-linear-attention.git@main
-
-# Required by Kimi tokenizer (tiktoken BPE loader)
-blobfile==3.0.0  # Updated
-
-# Web UI for HF Space
-gradio==4.44.1  # Web interface to keep Space alive
-
-# API (optional - not used with Gradio)
-# fastapi==0.115.6
-# uvicorn[standard]==0.34.0
-# python-multipart==0.0.20
-
-# Development
-pytest==8.3.4  # Updated
-black==24.10.0  # Updated
-flake8==7.1.1
+# vLLM for high-performance inference
+vllm>=0.6.0
+
+# Core dependencies (most are installed with vLLM)
+gradio>=4.44.0
+requests>=2.31.0
+
+# Note: vLLM automatically installs:
+# - torch
+# - transformers
+# - tokenizers
+# - etc.
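Note: since vLLM pins its own torch/transformers versions, it can be worth verifying what actually got installed after `pip install -r requirements.txt`. A small sketch using only the standard library; the package list is illustrative:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages pulled in directly, or transitively via vllm
for pkg in ("vllm", "torch", "transformers", "tokenizers", "gradio", "requests"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```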