---
title: Kimi 48B Fine-tuned - Inference
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---
# πŸš€ Kimi Linear 48B A3B Instruct - Fine-tuned
High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by **vLLM**.
## Model Information
- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 Billion
- **Fine-tuning:** QLoRA on attention layers
- **Inference Engine:** vLLM
## Features
⚑ **High-Performance Inference**
- Powered by vLLM for maximum throughput
- Optimized memory usage with PagedAttention
- Multi-GPU support (automatic)
πŸ’¬ **Professional Chat Interface**
- Clean Gradio UI
- Real-time responses
- Chat history
- Copy button for responses
βš™οΈ **Configurable Generation**
- Temperature control
- Top-P sampling
- Max tokens setting
- System prompt support
## Usage
### Quick Start
1. **Start vLLM Server**
   - Click the "πŸš€ Start vLLM Server" button (a sketch of the underlying launch command follows this list)
- Wait 2-5 minutes for initialization
- Look for "βœ… Server started successfully"
2. **Chat**
- Type your message
- Click "Send" or press Enter
- Get fast, high-quality responses
3. **Customize**
- Set a system prompt (optional)
- Adjust temperature for creativity
- Modify max tokens for response length
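Under the hood, step 1 above roughly corresponds to launching vLLM's OpenAI-compatible server. The snippet below is only a sketch of such a launch, not the Space's exact code; the port and tensor-parallel size are assumptions taken from the "API Access" and "Hardware Requirements" sections.
```python
import subprocess
import sys

# Minimal sketch: launch vLLM's OpenAI-compatible server as a subprocess.
# Port 8000 and --tensor-parallel-size 4 are assumptions matching the
# sections below; the Space's actual startup code may differ.
server = subprocess.Popen([
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "--port", "8000",
    "--tensor-parallel-size", "4",
    "--dtype", "bfloat16",
    "--trust-remote-code",  # assumption: custom architectures often need this
])
# The server needs a few minutes to load the weights before
# http://localhost:8000/v1 starts accepting requests.
```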
## Why vLLM?
vLLM is a high-throughput and memory-efficient inference engine:
- **Faster:** Optimized CUDA kernels
- **Efficient:** PagedAttention for KV cache
- **Scalable:** Multi-GPU support
- **Compatible:** OpenAI API format
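For reference, the same engine can also be used directly from Python without the HTTP server. This is a minimal offline-inference sketch (not the Space's serving path); PagedAttention-based KV-cache management is automatic, and multi-GPU use is controlled by `tensor_parallel_size`, assumed to be 4 here to match the recommended hardware.
```python
from vllm import LLM, SamplingParams

# Offline-inference sketch; tensor_parallel_size=4 assumes the 4-GPU setup
# described below. PagedAttention is handled internally by vLLM.
llm = LLM(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    dtype="bfloat16",
    tensor_parallel_size=4,
    trust_remote_code=True,  # assumption: often required for custom architectures
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
outputs = llm.generate(["Hello! Briefly introduce yourself."], params)
print(outputs[0].outputs[0].text)
```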
## Hardware Requirements
- **Recommended:** 4x NVIDIA L40S (192 GB VRAM total)
- **Minimum:** 4x NVIDIA L4 (96 GB VRAM total)
- **Model Size:** ~96 GB of weights in bfloat16
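As a back-of-the-envelope check of the ~96 GB figure (weights only; the KV cache and activations need additional headroom on top of this):
```python
# Rough weight-memory estimate: parameters x bytes per parameter.
params = 48e9          # 48B parameters
bytes_per_param = 2    # bfloat16 = 2 bytes
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights in bfloat16")  # ~96 GB
```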
## Technical Details
### Fine-tuning Configuration
- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
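The training script is not included in this Space; the snippet below is only a sketch of a PEFT/bitsandbytes configuration matching the hyperparameters listed above, not the actual code used for fine-tuning.
```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Sketch of a QLoRA setup matching the listed hyperparameters (illustrative only).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                   # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    lora_dropout=0.05,                      # assumption: not stated in this README
    task_type="CAUSAL_LM",
)
```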
### Generation Parameters
**Temperature (0.0-2.0)**
- 0.1-0.5: Focused, deterministic
- 0.6-0.9: Balanced (recommended)
- 1.0-2.0: Creative, diverse
**Top P (0.0-1.0)**
- Controls nucleus sampling
- 0.9 recommended for most use cases
**Max Tokens**
- Maximum response length
- 1024 default, up to 4096
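As an illustration, these settings map directly onto the request body of the OpenAI-compatible endpoint described in "API Access" below; the exact values here are examples picked from the ranges above, not defaults enforced by the Space.
```python
# Example generation settings for the chat completions request (illustrative values).
focused = {"temperature": 0.3, "top_p": 0.9, "max_tokens": 1024}   # precise answers
balanced = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}  # recommended default
creative = {"temperature": 1.2, "top_p": 0.9, "max_tokens": 2048}  # more diverse output
```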
## API Access
vLLM exposes an OpenAI-compatible API:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
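The same endpoint can be called from Python with the `openai` client library. This is a sketch: the `api_key` value is a placeholder (a locally launched vLLM server does not check it by default), and the generation settings reuse the values from the "Generation Parameters" section.
```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (placeholder API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```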
## Support
- [vLLM Documentation](https://docs.vllm.ai/)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
---
**Powered by vLLM** πŸš€ | Built with ❀️