---
title: Kimi 48B Fine-tuned - Inference
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---

πŸš€ Kimi Linear 48B A3B Instruct - Fine-tuned

High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by vLLM.

Model Information

Model: optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune, a QLoRA fine-tune of Kimi-Linear-48B-A3B-Instruct, served with vLLM.

Features

⚑ High-Performance Inference

  • Powered by vLLM for maximum throughput
  • Optimized memory usage with PagedAttention
  • Multi-GPU support (automatic)

πŸ’¬ Professional Chat Interface

  • Clean Gradio UI
  • Real-time responses
  • Chat history
  • Copy button for responses

βš™οΈ Configurable Generation

  • Temperature control
  • Top-P sampling
  • Max tokens setting
  • System prompt support

Usage

Quick Start

  1. Start vLLM Server

    • Click "πŸš€ Start vLLM Server" button
    • Wait 2-5 minutes for initialization
    • Look for "βœ… Server started successfully"
  2. Chat

    • Type your message
    • Click "Send" or press Enter
    • Get fast, high-quality responses
  3. Customize

    • Set a system prompt (optional)
    • Adjust temperature for creativity
    • Modify max tokens for response length

Why vLLM?

vLLM is a high-throughput and memory-efficient inference engine:

  • Faster: Optimized CUDA kernels
  • Efficient: PagedAttention for KV cache
  • Scalable: Multi-GPU support
  • Compatible: OpenAI API format
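
The Space runs vLLM as an OpenAI-compatible server, but the same engine can also be driven directly from Python. Below is a minimal sketch using vLLM's offline `LLM` API, assuming the model id from the API example below and the 4-GPU tensor-parallel setup recommended under Hardware Requirements; the Space itself launches the server instead.

```python
from vllm import LLM, SamplingParams

# Sketch only: the Space launches the OpenAI-compatible server rather than the offline API.
llm = LLM(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",  # id used in the API example below
    tensor_parallel_size=4,   # shard weights across 4 GPUs
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
outputs = llm.generate(["Hello!"], params)
print(outputs[0].outputs[0].text)
```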

Hardware Requirements

  • Recommended: 4x NVIDIA L40S (192GB VRAM)
  • Minimum: 4x NVIDIA L4 (96GB VRAM)
  • Model Size: ~96GB in bfloat16
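
The ~96GB figure follows directly from the parameter count: 48B parameters × 2 bytes per bfloat16 weight; vLLM's PagedAttention then uses whatever VRAM remains for the KV cache. A quick back-of-the-envelope check:

```python
# Rough weight-memory estimate for 48B parameters stored in bfloat16.
params = 48e9
bytes_per_param = 2                      # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"weights ~ {weights_gb:.0f} GB")  # ~96 GB; the KV cache uses the remaining VRAM
```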

Technical Details

Fine-tuning Configuration

  • Method: QLoRA
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Training: LoRA adapters on the attention projections only (see the sketch below)
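
The settings above correspond roughly to a peft `LoraConfig` like the following. This is a hypothetical reconstruction for illustration only (the training script is not part of this Space), and `lora_dropout` is an assumed value not stated in this README; under QLoRA the base weights are additionally loaded in 4-bit and only the adapter is trained.

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter settings listed above.
lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_dropout=0.05,        # assumed; not stated in this README
    task_type="CAUSAL_LM",
)
```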

Generation Parameters

Temperature (0.0-2.0)

  • 0.1-0.5: Focused, deterministic
  • 0.6-0.9: Balanced (recommended)
  • 1.0-2.0: Creative, diverse

Top P (0.0-1.0)

  • Controls nucleus sampling
  • 0.9 recommended for most use cases

Max Tokens

  • Maximum response length
  • 1024 default, up to 4096
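
These settings (plus the optional system prompt) map one-to-one onto fields of the OpenAI-compatible request described under API Access. Below is a minimal sketch with the `openai` Python client, assuming the vLLM server started by the Space is reachable on localhost:8000.

```python
from openai import OpenAI

# Assumes the local vLLM server; the api_key is a placeholder required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},  # optional system prompt
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,   # balanced range recommended above
    top_p=0.9,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```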

API Access

vLLM exposes an OpenAI-compatible API:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

Support


Powered by vLLM πŸš€ | Built with ❀️