---
title: Kimi 48B Fine-tuned - Inference
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---
# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned

A high-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by **vLLM**.
## Model Information

- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 billion total (the "A3B" in the name denotes roughly 3 billion activated per token)
- **Fine-tuning:** QLoRA on attention layers
- **Inference Engine:** vLLM
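
For use outside this Space, the model can also be loaded directly with Transformers. A minimal sketch, assuming the repository contains full merged weights (rather than only a LoRA adapter) and that the Kimi-Linear architecture requires `trust_remote_code`; both are assumptions worth verifying on the model page:

```python
# Minimal sketch: loading the fine-tuned model with Transformers.
# Assumptions (verify on the model page): the repo holds merged weights,
# and the custom Kimi-Linear architecture needs trust_remote_code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~96 GB of weights; needs multiple GPUs
    device_map="auto",           # shard layers across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For serving traffic, the vLLM setup described below is the faster option; this path is mainly useful for quick local checks.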
## Features

⚡ **High-Performance Inference**

- Powered by vLLM for maximum throughput
- Optimized memory usage with PagedAttention
- Multi-GPU support (automatic)

💬 **Professional Chat Interface**

- Clean Gradio UI
- Real-time responses
- Chat history
- Copy button for responses

⚙️ **Configurable Generation**

- Temperature control
- Top-P sampling
- Max tokens setting
- System prompt support
## Usage

### Quick Start

1. **Start vLLM Server**
   - Click the "🚀 Start vLLM Server" button
   - Wait 2-5 minutes for initialization (see the launch sketch after this list)
   - Look for "✅ Server started successfully"
2. **Chat**
   - Type your message
   - Click "Send" or press Enter
   - Get fast, high-quality responses
3. **Customize**
   - Set a system prompt (optional)
   - Adjust temperature for creativity
   - Modify max tokens for response length
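
Behind the button, the Space launches vLLM's OpenAI-compatible server. A rough sketch of an equivalent launch; the flags shown are plausible defaults, not the Space's verified configuration:

```python
# Rough sketch of what "Start vLLM Server" does: launch vLLM's
# OpenAI-compatible server as a subprocess. Flag values are assumptions.
import subprocess

server = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "--tensor-parallel-size", "4",  # shard across 4 GPUs (see Hardware Requirements)
    "--port", "8000",
    "--trust-remote-code",          # assumed: Kimi-Linear ships custom modeling code
])
# The 2-5 minute wait is the server downloading and loading ~96 GB of weights.
```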
## Why vLLM?

vLLM is a high-throughput and memory-efficient inference engine:

- **Faster:** Optimized CUDA kernels
- **Efficient:** PagedAttention for KV cache
- **Scalable:** Multi-GPU support
- **Compatible:** OpenAI API format
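
The same engine can also be driven offline from Python, without the HTTP server. A minimal sketch using vLLM's `LLM` API (parameter values are illustrative, and `trust_remote_code` is an assumed requirement):

```python
# Minimal sketch of vLLM's offline API; PagedAttention and multi-GPU
# tensor parallelism are handled internally by the engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    tensor_parallel_size=4,   # multi-GPU support
    dtype="bfloat16",
    trust_remote_code=True,   # assumed requirement for Kimi-Linear
)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```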
## Hardware Requirements

- **Recommended:** 4x NVIDIA L40S (192 GB VRAM total)
- **Minimum:** 4x NVIDIA L4 (96 GB VRAM total)
- **Model Size:** ~96 GB in bfloat16 (48B parameters × 2 bytes each)

Note that the minimum configuration leaves essentially no headroom for the KV cache beyond the weights, so it may require a reduced context length (e.g. a smaller `--max-model-len`) or quantized weights.
## Technical Details

### Fine-tuning Configuration

- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
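
In PEFT terms, the stated hyperparameters correspond roughly to the configuration below; the actual training script, dropout, and 4-bit quantization settings are not published here:

```python
# Sketch of the stated QLoRA hyperparameters as a PEFT LoraConfig.
# QLoRA additionally loads the base model in 4-bit (e.g. bitsandbytes)
# before attaching these adapters; those settings are omitted here.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=32,   # scaling factor: alpha / r = 2.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)
```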
### Generation Parameters

**Temperature (0.0-2.0)**

- 0.1-0.5: Focused, deterministic
- 0.6-0.9: Balanced (recommended)
- 1.0-2.0: Creative, diverse

**Top P (0.0-1.0)**

- Controls nucleus sampling
- 0.9 recommended for most use cases

**Max Tokens**

- Maximum response length
- 1024 default, up to 4096
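
These three controls map directly onto fields of the OpenAI-style chat completion request. A sketch using the `openai` Python client against the local server (the host, port, and `api_key` placeholder are assumptions):

```python
# The UI sliders map one-to-one onto request fields: temperature, top_p,
# max_tokens. Assumes the vLLM server from the Quick Start is on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},  # system prompt
        {"role": "user", "content": "Summarize why KV-cache paging helps."},
    ],
    temperature=0.7,  # balanced (recommended range 0.6-0.9)
    top_p=0.9,        # nucleus sampling
    max_tokens=1024,  # default response-length cap
)
print(response.choices[0].message.content)
```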
## API Access

vLLM exposes an OpenAI-compatible API:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
## Support

- [vLLM Documentation](https://docs.vllm.ai/)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Transformers Documentation](https://huggingface.co/docs/transformers)

---

**Powered by vLLM** 🚀 | Built with ❤️