aeb56 committed
Commit 310eb95 · Parent(s): e62c736

Switch to vLLM for high-performance, stable inference

Files changed (4)
  1. Dockerfile +3 -7
  2. README.md +79 -46
  3. app.py +147 -250
  4. requirements.txt +12 -60
Dockerfile CHANGED
@@ -12,7 +12,6 @@ RUN apt-get update && apt-get install -y \
     python3.10 \
     python3-pip \
     git \
-    git-lfs \
     wget \
     && rm -rf /var/lib/apt/lists/*
 
@@ -34,15 +33,13 @@ RUN pip3 install --no-cache-dir -r requirements.txt
 # Copy application files
 COPY . .
 
-# Create directories for models and cache
-RUN mkdir -p /app/models /app/merged_model /app/cache /tmp/offload
-
 # Set ownership and permissions for user
-RUN chown -R user:user /app /tmp/offload && \
+RUN chown -R user:user /app && \
     chmod -R 755 /app
 
-# Expose port for Gradio
+# Expose ports
 EXPOSE 7860
+EXPOSE 8000
 
 # Set HuggingFace cache directory
 ENV HF_HOME=/app/cache
@@ -53,4 +50,3 @@ USER user
 
 # Run the application
 CMD ["python3", "app.py"]
-
 
README.md CHANGED
@@ -12,80 +12,113 @@ suggested_hardware: l40sx4
 
 # 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
 
-Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
+High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by **vLLM**.
 
 ## Model Information
 
 - **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
 - **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
 - **Parameters:** 48 Billion
-- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
-- **Architecture:** Mixture of Experts (MoE) Transformer
+- **Fine-tuning:** QLoRA on attention layers
+- **Inference Engine:** vLLM
 
 ## Features
 
-**Professional Chat Interface**
-- Clean, modern UI for seamless conversations
-- Chat history with copy functionality
-- System prompt customization
-
-⚙️ **Advanced Generation Settings**
-- Temperature control for creativity
-- Top-P and Top-K sampling
-- Repetition penalty adjustment
-- Configurable response length
-
-🎮 **Optimized Performance**
-- Multi-GPU support (4xL40S recommended)
-- Automatic device mapping
-- bfloat16 precision for efficiency
-- ~96GB VRAM requirement
+**High-Performance Inference**
+- Powered by vLLM for maximum throughput
+- Optimized memory usage with PagedAttention
+- Multi-GPU support (automatic)
+
+💬 **Professional Chat Interface**
+- Clean Gradio UI
+- Real-time responses
+- Chat history
+- Copy button for responses
+
+⚙️ **Configurable Generation**
+- Temperature control
+- Top-P sampling
+- Max tokens setting
+- System prompt support
 
 ## Usage
 
-1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
-2. **Set System Prompt** (optional) - Define the assistant's behavior
-3. **Start Chatting** - Type your message and hit send
-4. **Adjust Settings** - Fine-tune generation parameters as needed
-
-## Generation Parameters
-
-### Temperature (0.0 - 2.0)
-- **Low (0.1-0.5):** Focused, deterministic responses
-- **Medium (0.6-0.9):** Balanced creativity
-- **High (1.0-2.0):** More creative and diverse outputs
-
-### Top P (0.0 - 1.0)
-- **0.9 (recommended):** Good balance
-- Lower values: More focused
-- Higher values: More diverse
-
-### Max New Tokens
-- Maximum length of generated response
-- **1024 (default):** Good for most use cases
-- Increase for longer responses
+### Quick Start
+
+1. **Start vLLM Server**
+   - Click the "🚀 Start vLLM Server" button
+   - Wait 2-5 minutes for initialization
+   - Look for "✅ Server started successfully"
+
+2. **Chat**
+   - Type your message
+   - Click "Send" or press Enter
+   - Get fast, high-quality responses
+
+3. **Customize**
+   - Set a system prompt (optional)
+   - Adjust temperature for creativity
+   - Modify max tokens for response length
+
+## Why vLLM?
+
+vLLM is a high-throughput, memory-efficient inference engine:
+- **Faster:** Optimized CUDA kernels
+- **Efficient:** PagedAttention for KV cache
+- **Scalable:** Multi-GPU support
+- **Compatible:** OpenAI API format
 
 ## Hardware Requirements
 
-- **Recommended:** 4x NVIDIA L40S GPUs (192GB total VRAM)
-- **Minimum:** 4x NVIDIA L4 GPUs (96GB total VRAM)
-- **Memory:** ~96GB VRAM in bfloat16 precision
-
-## Fine-tuning Details
-
-This model was fine-tuned using QLoRA with the following configuration:
-- **LoRA Rank (r):** 16
+- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
+- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
+- **Model Size:** ~96GB in bfloat16
+
+## Technical Details
+
+### Fine-tuning Configuration
+- **Method:** QLoRA
+- **LoRA Rank:** 16
 - **LoRA Alpha:** 32
-- **Target Modules:** q_proj, k_proj, v_proj, o_proj (attention layers only)
-- **Dropout:** 0.05
+- **Target Modules:** q_proj, k_proj, v_proj, o_proj
+- **Training:** Attention layers only
+
+### Generation Parameters
+
+**Temperature (0.0-2.0)**
+- 0.1-0.5: Focused, deterministic
+- 0.6-0.9: Balanced (recommended)
+- 1.0-2.0: Creative, diverse
+
+**Top P (0.0-1.0)**
+- Controls nucleus sampling
+- 0.9 recommended for most use cases
+
+**Max Tokens**
+- Maximum response length
+- 1024 default, up to 4096
+
+## API Access
+
+vLLM provides an OpenAI-compatible API:
+
+```bash
+curl -X POST "http://localhost:8000/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  --data '{
+    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
+    "messages": [
+      {"role": "user", "content": "Hello!"}
+    ]
+  }'
+```
 
 ## Support
 
-For issues or questions:
-- [Transformers Documentation](https://huggingface.co/docs/transformers)
+- [vLLM Documentation](https://docs.vllm.ai/)
 - [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
+- [Transformers Documentation](https://huggingface.co/docs/transformers)
 
 ---
 
-Built with ❤️ using Transformers and Gradio
-
+**Powered by vLLM** 🚀 | Built with ❤️
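Note: the curl example added to the README above can equally be issued from Python. A minimal sketch assuming the Space's vLLM server is reachable at localhost:8000 (the `max_tokens` and `temperature` values here are illustrative):

```python
import requests

# Same request as the README's curl example, sent from Python.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```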
 
app.py CHANGED
@@ -1,192 +1,126 @@
-import os
-import torch
 import gradio as gr
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import logging
-from datetime import datetime
-
-# Configure logging
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
-logger = logging.getLogger(__name__)
+import requests
+import json
+import subprocess
+import time
+import os
+import signal
+import sys
 
 # Model configuration
 MODEL_NAME = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
-MODEL_DESCRIPTION = """
-# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
-
-A professionally fine-tuned version of Moonshot AI's Kimi-Linear-48B-A3B-Instruct model using QLoRA.
-
-**Model Details:**
-- **Base Model:** moonshotai/Kimi-Linear-48B-A3B-Instruct
-- **Parameters:** 48 Billion
-- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
-- **Training Focus:** Attention layers (q_proj, k_proj, v_proj, o_proj)
-- **Architecture:** Mixture of Experts (MoE) Transformer
-"""
-
-# Check GPU availability
-if torch.cuda.is_available():
-    num_gpus = torch.cuda.device_count()
-    total_vram = sum(torch.cuda.get_device_properties(i).total_memory / 1024**3 for i in range(num_gpus))
-    logger.info(f"🎮 {num_gpus} GPU(s) detected with {total_vram:.1f}GB total VRAM")
-else:
-    logger.warning("⚠️ No GPUs detected - running on CPU (will be slow)")
-
-class ModelInference:
-    def __init__(self):
-        self.model = None
-        self.tokenizer = None
-        self.is_loaded = False
-
-    def load_model(self, progress=gr.Progress()):
-        """Load the model and tokenizer"""
-        if self.is_loaded:
-            return "✅ Model already loaded"
-
-        try:
-            progress(0.2, desc="Loading tokenizer...")
-            logger.info(f"Loading tokenizer from: {MODEL_NAME}")
-            self.tokenizer = AutoTokenizer.from_pretrained(
-                MODEL_NAME,
-                trust_remote_code=True
-            )
-
-            progress(0.4, desc="Loading model (this may take several minutes)...")
-            logger.info(f"Loading model from: {MODEL_NAME}")
-
-            # Configure for multi-GPU
-            num_gpus = torch.cuda.device_count()
-            max_memory = {}
-            if num_gpus > 0:
-                for i in range(num_gpus):
-                    gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
-                    max_memory[i] = f"{int(gpu_memory - 3)}GB"
-
-            self.model = AutoModelForCausalLM.from_pretrained(
-                MODEL_NAME,
-                torch_dtype=torch.bfloat16,
-                device_map="auto",
-                max_memory=max_memory if max_memory else None,
-                trust_remote_code=True,
-                low_cpu_mem_usage=True,
-            )
-
-            self.model.eval()
-            self.is_loaded = True
-
-            progress(1.0, desc="Model loaded!")
-            logger.info("✅ Model loaded successfully")
-
-            # Get model info
-            total_params = sum(p.numel() for p in self.model.parameters())
-            model_size = (total_params * 2) / 1024**3  # bfloat16 = 2 bytes
-
-            info_msg = f"""
-✅ **Model Loaded Successfully!**
-
-**Model Information:**
-- Model: `{MODEL_NAME}`
-- Parameters: {total_params:,}
-- Size: ~{model_size:.1f} GB (bfloat16)
-- Device: {"Multi-GPU" if num_gpus > 1 else "Single GPU" if num_gpus == 1 else "CPU"}
-
-**You can now start chatting below!** 👇
-"""
-            return info_msg
-
-        except Exception as e:
-            logger.error(f"Failed to load model: {str(e)}", exc_info=True)
-            self.is_loaded = False
-            return f"❌ **Failed to load model:**\n\n{str(e)}"
-
-    def generate_response(
-        self,
-        message,
-        history,
-        system_prompt,
-        max_new_tokens,
-        temperature,
-        top_p,
-        top_k,
-        repetition_penalty,
-    ):
-        """Generate a response from the model"""
-        if not self.is_loaded:
-            return "❌ Please load the model first using the 'Load Model' button above."
-
-        try:
-            # Build conversation context
-            conversation = []
-
-            # Add system prompt if provided
-            if system_prompt.strip():
-                conversation.append(f"System: {system_prompt.strip()}")
-
-            # Add chat history
-            for human, assistant in history:
-                conversation.append(f"User: {human}")
-                if assistant:
-                    conversation.append(f"Assistant: {assistant}")
-
-            # Add current message
-            conversation.append(f"User: {message}")
-            conversation.append("Assistant:")
-
-            # Format prompt
-            prompt = "\n".join(conversation)
-
-            # Tokenize
-            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
-
-            # Generate
-            with torch.no_grad():
-                outputs = self.model.generate(
-                    **inputs,
-                    max_new_tokens=max_new_tokens,
-                    temperature=temperature,
-                    top_p=top_p,
-                    top_k=top_k,
-                    repetition_penalty=repetition_penalty,
-                    do_sample=True if temperature > 0 else False,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                )
-
-            # Decode response
-            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-
-            # Extract assistant's response (everything after the last "Assistant:")
-            if "Assistant:" in response:
-                response = response.split("Assistant:")[-1].strip()
-
-            return response
-
-        except Exception as e:
-            logger.error(f"Generation failed: {str(e)}", exc_info=True)
-            return f"❌ **Generation failed:**\n\n{str(e)}"
-
-# Initialize inference
-inferencer = ModelInference()
+VLLM_PORT = 8000
+VLLM_PROCESS = None
+
+def start_vllm_server():
+    """Start vLLM server in background"""
+    global VLLM_PROCESS
+
+    if VLLM_PROCESS is not None:
+        return "✅ vLLM server already running"
+
+    try:
+        # Start vLLM's OpenAI-compatible API server as a subprocess
+        cmd = [
+            "python", "-m", "vllm.entrypoints.openai.api_server",
+            "--model", MODEL_NAME,
+            "--host", "0.0.0.0",
+            "--port", str(VLLM_PORT),
+            "--dtype", "bfloat16",
+            "--trust-remote-code",
+        ]
+
+        VLLM_PROCESS = subprocess.Popen(
+            cmd,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            preexec_fn=os.setsid if sys.platform != 'win32' else None
+        )
+
+        # Poll the health endpoint until the server is ready
+        max_retries = 60
+        for _ in range(max_retries):
+            try:
+                response = requests.get(f"http://localhost:{VLLM_PORT}/health", timeout=1)
+                if response.status_code == 200:
+                    return "✅ vLLM server started successfully!"
+            except requests.exceptions.RequestException:
+                pass
+            time.sleep(2)
+
+        return "⚠️ vLLM server started but health check failed"
+
+    except Exception as e:
+        return f"❌ Failed to start vLLM server: {str(e)}"
+
+def chat(message, history, system_prompt, max_tokens, temperature, top_p):
+    """Send chat message to vLLM server"""
+    try:
+        # Build OpenAI-style message list
+        messages = []
+
+        if system_prompt.strip():
+            messages.append({"role": "system", "content": system_prompt.strip()})
+
+        # Add history
+        for human, assistant in history:
+            messages.append({"role": "user", "content": human})
+            if assistant:
+                messages.append({"role": "assistant", "content": assistant})
+
+        # Add current message
+        messages.append({"role": "user", "content": message})
+
+        # Call vLLM API
+        response = requests.post(
+            f"http://localhost:{VLLM_PORT}/v1/chat/completions",
+            headers={"Content-Type": "application/json"},
+            json={
+                "model": MODEL_NAME,
+                "messages": messages,
+                "max_tokens": max_tokens,
+                "temperature": temperature,
+                "top_p": top_p,
+                "stream": False
+            },
+            timeout=300
+        )
+
+        if response.status_code == 200:
+            result = response.json()
+            assistant_message = result["choices"][0]["message"]["content"]
+            return assistant_message
+        else:
+            return f"❌ Error: {response.status_code} - {response.text}"
+
+    except requests.exceptions.ConnectionError:
+        return "❌ Cannot connect to vLLM server. Please start the server first."
+    except Exception as e:
+        return f"❌ Error: {str(e)}"
+
+# Custom CSS
+custom_css = """
+.gradio-container {
+    max-width: 1200px !important;
+}
+"""
 
 # Create Gradio interface
-with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference") as demo:
-    gr.Markdown(MODEL_DESCRIPTION)
-
-    # GPU Info
-    if torch.cuda.is_available():
-        num_gpus = torch.cuda.device_count()
-        total_vram_ui = sum(torch.cuda.get_device_properties(i).total_memory / 1024**3 for i in range(num_gpus))
-        gpu_info = f"### 🎮 Hardware: {num_gpus}x {torch.cuda.get_device_name(0)} ({total_vram_ui:.1f}GB total VRAM)"
-    else:
-        gpu_info = "### ⚠️ Running on CPU (no GPU detected)"
-    gr.Markdown(gpu_info)
-
-    gr.Markdown("---")
+with gr.Blocks(theme=gr.themes.Soft(), css=custom_css, title="Kimi 48B Fine-tuned") as demo:
+    gr.Markdown("""
+    # 🚀 Kimi Linear 48B A3B - Fine-tuned Inference
+
+    High-performance inference using **vLLM** for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
+
+    **Model:** `optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune`
+    """)
 
     with gr.Row():
         with gr.Column(scale=1):
-            load_btn = gr.Button("🚀 Load Model", variant="primary", size="lg")
-            load_status = gr.Markdown("**Status:** Model not loaded. Click 'Load Model' to start.")
+            gr.Markdown("### 🎛️ Server Control")
+            start_btn = gr.Button("🚀 Start vLLM Server", variant="primary", size="lg")
+            server_status = gr.Markdown("**Status:** Server not started")
 
+            gr.Markdown("---")
             gr.Markdown("### ⚙️ Generation Settings")
 
             system_prompt = gr.Textbox(
@@ -196,13 +130,12 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
                 value=""
             )
 
-            max_new_tokens = gr.Slider(
+            max_tokens = gr.Slider(
                 minimum=50,
                 maximum=4096,
                 value=1024,
                 step=1,
-                label="Max New Tokens",
-                info="Maximum length of generated response"
+                label="Max Tokens"
             )
 
             temperature = gr.Slider(
@@ -210,8 +143,7 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
                 maximum=2.0,
                 value=0.7,
                 step=0.05,
-                label="Temperature",
-                info="Higher = more creative, Lower = more focused"
+                label="Temperature"
             )
 
             top_p = gr.Slider(
@@ -219,34 +151,24 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
                 maximum=1.0,
                 value=0.9,
                 step=0.05,
-                label="Top P (Nucleus Sampling)",
-                info="Probability threshold for token selection"
+                label="Top P"
            )
 
-            top_k = gr.Slider(
-                minimum=0,
-                maximum=100,
-                value=50,
-                step=1,
-                label="Top K",
-                info="Number of top tokens to consider (0 = disabled)"
-            )
-
-            repetition_penalty = gr.Slider(
-                minimum=1.0,
-                maximum=2.0,
-                value=1.1,
-                step=0.05,
-                label="Repetition Penalty",
-                info="Penalty for repeating tokens"
-            )
+            gr.Markdown("""
+            ### 📖 Instructions
+
+            1. **Start Server** - Click the button above (takes 2-5 min)
+            2. **Wait for "✅"** - Server is ready when you see the green checkmark
+            3. **Start Chatting** - Type your message below
+
+            **Note:** First message may be slow as the model loads into memory.
+            """)
 
         with gr.Column(scale=2):
-            gr.Markdown("### 💬 Chat Interface")
+            gr.Markdown("### 💬 Chat")
 
             chatbot = gr.Chatbot(
                 height=500,
-                label="Conversation",
                 show_copy_button=True,
                 avatar_images=["👤", "🤖"]
             )
@@ -255,49 +177,32 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
             msg = gr.Textbox(
                 label="Your Message",
                 placeholder="Type your message here...",
-                lines=3,
+                lines=2,
                 scale=4
             )
             send_btn = gr.Button("📤 Send", variant="primary", scale=1)
 
         with gr.Row():
             clear_btn = gr.Button("🗑️ Clear Chat")
-            retry_btn = gr.Button("🔄 Retry Last")
-
-    gr.Markdown("""
-    ### 📝 Usage Tips:
-    - First, click **"Load Model"** to initialize the model (takes 2-5 minutes)
-    - Use the **System Prompt** to set the assistant's behavior
-    - Adjust **Temperature** for creativity (0.7-1.0 recommended)
-    - Lower **Top P** for more focused responses
-    - Clear chat to start a new conversation
-    """)
 
     # Event handlers
-    load_btn.click(
-        fn=inferencer.load_model,
-        outputs=load_status
+    start_btn.click(
+        fn=start_vllm_server,
+        outputs=server_status
    )
 
    def user_message(user_msg, history):
        return "", history + [[user_msg, None]]
 
-    def bot_response(history, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty):
+    def bot_response(history, system_prompt, max_tokens, temperature, top_p):
+        if not history or history[-1][1] is not None:
+            return history
+
        user_msg = history[-1][0]
-        bot_msg = inferencer.generate_response(
-            user_msg,
-            history[:-1],
-            system_prompt,
-            max_new_tokens,
-            temperature,
-            top_p,
-            top_k,
-            repetition_penalty
-        )
+        bot_msg = chat(user_msg, history[:-1], system_prompt, max_tokens, temperature, top_p)
        history[-1][1] = bot_msg
        return history
 
-    # Send message
    msg.submit(
        user_message,
        [msg, chatbot],
@@ -305,7 +210,7 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
        queue=False
    ).then(
        bot_response,
-        [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
+        [chatbot, system_prompt, max_tokens, temperature, top_p],
        chatbot
    )
 
@@ -316,47 +221,39 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Kimi 48B Fine-tuned - Inference")
        queue=False
    ).then(
        bot_response,
-        [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
+        [chatbot, system_prompt, max_tokens, temperature, top_p],
        chatbot
    )
 
-    # Clear chat
    clear_btn.click(lambda: None, None, chatbot, queue=False)
 
-    # Retry last message
-    def retry_last(history):
-        if history:
-            history[-1][1] = None
-        return history
-
-    retry_btn.click(
-        retry_last,
-        chatbot,
-        chatbot,
-        queue=False
-    ).then(
-        bot_response,
-        [chatbot, system_prompt, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
-        chatbot
-    )
-
    gr.Markdown("""
    ---
 
-    **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
+    **Powered by vLLM** - High-performance LLM inference engine
 
-    **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
-
-    Fine-tuned with ❤️ using QLoRA
+    **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
    """)
 
-# Launch
+# Cleanup on exit: terminate the vLLM subprocess (and its process group)
+def cleanup():
+    global VLLM_PROCESS
+    if VLLM_PROCESS:
+        try:
+            if sys.platform == 'win32':
+                VLLM_PROCESS.terminate()
+            else:
+                os.killpg(os.getpgid(VLLM_PROCESS.pid), signal.SIGTERM)
+        except Exception:
+            pass
+
+import atexit
+atexit.register(cleanup)
+
 if __name__ == "__main__":
-    demo.queue(max_size=10)
+    demo.queue()
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
-        share=False,
-        show_error=True
+        share=False
    )
-
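Note: the committed `chat()` helper requests a complete response (`"stream": False`). If token-by-token output is wanted later, the same endpoint can stream. A hedged sketch, not part of this commit, assuming the server runs on localhost:8000 and emits the OpenAI-style server-sent-event chunks that vLLM's OpenAI-compatible server uses:

```python
import json
import requests

def chat_stream(messages, max_tokens=1024, temperature=0.7, top_p=0.9):
    """Yield response text incrementally from the vLLM OpenAI-compatible API."""
    with requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "stream": True,  # ask the server for server-sent events
        },
        stream=True,
        timeout=300,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            # SSE lines look like: b'data: {...json chunk...}'
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":  # end-of-stream sentinel
                break
            delta = json.loads(payload)["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]

# Example: print tokens as they arrive
for chunk in chat_stream([{"role": "user", "content": "Hello!"}]):
    print(chunk, end="", flush=True)
```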
 
requirements.txt CHANGED
@@ -1,60 +1,12 @@
-# Core ML Libraries
-transformers>=4.56.0  # Required by Kimi model (has assertion check)
-accelerate>=0.34.2  # Compatible with latest transformers
-peft>=0.13.2  # Latest stable
-bitsandbytes>=0.45.1  # Compatible with triton 3.x (0.45.0 has triton.ops issue)
-sentencepiece==0.2.0
-protobuf==5.29.2  # Updated for compatibility
-
-# Training & Optimization
-deepspeed==0.16.3  # Compatible with torch 2.5, more stable than 0.18.x
-triton>=3.2.0  # Required by fla for optimal performance
-scipy==1.14.1  # Updated
-scikit-learn==1.6.0  # Updated
-ninja==1.11.1.1
-
-# Data Processing
-datasets>=3.2.0  # Updated for compatibility
-tokenizers>=0.21.0  # Compatible with latest transformers
-pandas==2.2.3
-numpy==1.26.4  # Keep for stability (2.x has breaking changes)
-
-# Monitoring & Logging
-wandb==0.19.1  # Updated
-tensorboard==2.18.0
-tqdm==4.67.1  # Updated
-psutil==6.1.1  # Updated
-pynvml==11.5.3  # Updated
-
-# Evaluation
-rouge-score==0.1.2
-sacrebleu==2.4.3  # Updated
-bert-score==0.3.13
-
-# Utilities
-pyyaml==6.0.2
-python-dotenv==1.0.1
-huggingface-hub>=0.34.0  # Required by transformers >=4.56.0
-safetensors==0.4.5
-tiktoken==0.8.0  # Updated
-hf_transfer==0.1.8  # Updated
-
-# Kimi / Flash Linear Attention runtime (requires torch>=2.5)
-# Install from git to get latest version with fla.layers module
-git+https://github.com/sustcsonglin/flash-linear-attention.git@main
-
-# Required by Kimi tokenizer (tiktoken BPE loader)
-blobfile==3.0.0  # Updated
-
-# Web UI for HF Space
-gradio==4.44.1  # Web interface to keep Space alive
-
-# API (optional - not used with Gradio)
-# fastapi==0.115.6
-# uvicorn[standard]==0.34.0
-# python-multipart==0.0.20
-
-# Development
-pytest==8.3.4  # Updated
-black==24.10.0  # Updated
-flake8==7.1.1
+# vLLM for high-performance inference
+vllm>=0.6.0
+
+# Core dependencies (most are installed with vLLM)
+gradio>=4.44.0
+requests>=2.31.0
+
+# Note: vLLM automatically installs:
+# - torch
+# - transformers
+# - tokenizers
+# - etc.
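Note: since vLLM pins its own torch/transformers versions, it can be worth verifying what actually got installed after `pip install -r requirements.txt`. A small sketch using only the standard library; the package list is illustrative:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages pulled in directly, or transitively via vllm
for pkg in ("vllm", "torch", "transformers", "tokenizers", "gradio", "requests"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```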