---
title: Kimi 48B Fine-tuned - Inference
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---
# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned
High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by vLLM.
## Model Information
- Model: optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune
- Base Model: moonshotai/Kimi-Linear-48B-A3B-Instruct
- Parameters: 48 Billion
- Fine-tuning: QLoRA on attention layers
- Inference Engine: vLLM
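
Beyond the built-in UI, the model can be loaded directly with vLLM's offline Python API. A minimal sketch (not this Space's actual app code), assuming vLLM is installed and four GPUs are available as recommended under Hardware Requirements:

```python
from vllm import LLM

llm = LLM(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    dtype="bfloat16",        # ~96GB of weights; see Hardware Requirements
    tensor_parallel_size=4,  # shard across the 4 recommended GPUs
    trust_remote_code=True,  # Kimi-Linear ships custom modeling code
)

# Simple raw-prompt completion; the Space itself talks to the chat endpoint.
outputs = llm.generate(["Hello!"])
print(outputs[0].outputs[0].text)
```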
## Features

### ⚡ High-Performance Inference
- Powered by vLLM for maximum throughput
- Optimized memory usage with PagedAttention
- Multi-GPU support (automatic)
### 💬 Professional Chat Interface
- Clean Gradio UI
- Real-time responses
- Chat history
- Copy button for responses
### ⚙️ Configurable Generation
- Temperature control
- Top-P sampling
- Max tokens setting
- System prompt support
## Usage

### Quick Start

#### 1. Start vLLM Server
- Click the "🚀 Start vLLM Server" button
- Wait 2-5 minutes for initialization
- Look for "✅ Server started successfully"
#### 2. Chat
- Type your message
- Click "Send" or press Enter
- Get fast, high-quality responses
#### 3. Customize
- Set a system prompt (optional)
- Adjust temperature for creativity
- Modify max tokens for response length
## Why vLLM?
vLLM is a high-throughput and memory-efficient inference engine:
- Faster: Optimized CUDA kernels
- Efficient: PagedAttention for KV cache
- Scalable: Multi-GPU support
- Compatible: OpenAI API format
## Hardware Requirements
- Recommended: 4x NVIDIA L40S (192GB VRAM)
- Minimum: 4x NVIDIA L4 (96GB VRAM)
- Model Size: ~96GB in bfloat16 (48B parameters × 2 bytes each)
## Technical Details

### Fine-tuning Configuration
- Method: QLoRA
- LoRA Rank: 16
- LoRA Alpha: 32
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Training: Attention layers only
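
For reference, the configuration above corresponds roughly to the following Hugging Face PEFT setup. This is a hedged reconstruction for illustration; the actual training script is not part of this Space:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                                     # LoRA rank
    lora_alpha=32,                                            # LoRA scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
```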
### Generation Parameters

#### Temperature (0.0-2.0)
- 0.1-0.5: Focused, deterministic
- 0.6-0.9: Balanced (recommended)
- 1.0-2.0: Creative, diverse
#### Top P (0.0-1.0)
- Controls nucleus sampling
- 0.9 recommended for most use cases
#### Max Tokens
- Maximum response length
- 1024 default, up to 4096
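
In code, these three controls map onto vLLM's `SamplingParams`. An illustrative sketch using the recommended values above:

```python
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,  # balanced: within the recommended 0.6-0.9 range
    top_p=0.9,        # nucleus sampling cutoff recommended above
    max_tokens=1024,  # default response length; can be raised to 4096
)
```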
## API Access

vLLM exposes an OpenAI-compatible API:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
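
The same request from Python, using the official `openai` client pointed at the local server. vLLM does not validate the API key, so any placeholder string works; the optional system message mirrors the UI's system prompt field:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored by vLLM

response = client.chat.completions.create(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},  # optional system prompt
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```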
## Support

Powered by vLLM 🚀 | Built with ❤️