---
title: Kimi 48B Fine-tuned - Inference
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---

πŸš€ Kimi Linear 48B A3B Instruct - Fine-tuned

High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by vLLM.

Model Information

Model: optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune, a QLoRA fine-tune of Kimi-Linear-48B-A3B-Instruct, served with vLLM.

Features

⚑ High-Performance Inference

  • Powered by vLLM for maximum throughput
  • Optimized memory usage with PagedAttention
  • Multi-GPU support (automatic)

πŸ’¬ Professional Chat Interface

  • Clean Gradio UI
  • Real-time responses
  • Chat history
  • Copy button for responses

βš™οΈ Configurable Generation

  • Temperature control
  • Top-P sampling
  • Max tokens setting
  • System prompt support

Usage

Quick Start

  1. Start vLLM Server

    • Click "πŸš€ Start vLLM Server" button
    • Wait 2-5 minutes for initialization
    • Look for "βœ… Server started successfully"
  2. Chat

    • Type your message
    • Click "Send" or press Enter
    • Get fast, high-quality responses
  3. Customize

    • Set a system prompt (optional)
    • Adjust temperature for creativity
    • Modify max tokens for response length

Why vLLM?

vLLM is a high-throughput and memory-efficient inference engine:

  • Faster: Optimized CUDA kernels
  • Efficient: PagedAttention for KV cache
  • Scalable: Multi-GPU support
  • Compatible: OpenAI API format
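
The Space runs vLLM as an OpenAI-compatible server, but the same engine can also be driven directly from Python. Below is a minimal sketch using vLLM's offline `LLM` API, assuming the model id from the API example below and the 4-GPU tensor-parallel setup recommended under Hardware Requirements; the Space itself launches the server instead.

```python
from vllm import LLM, SamplingParams

# Sketch only: the Space launches the OpenAI-compatible server rather than the offline API.
llm = LLM(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",  # id used in the API example below
    tensor_parallel_size=4,   # shard weights across 4 GPUs
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
outputs = llm.generate(["Hello!"], params)
print(outputs[0].outputs[0].text)
```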

Hardware Requirements

  • Recommended: 4x NVIDIA L40S (192GB VRAM)
  • Minimum: 4x NVIDIA L4 (96GB VRAM)
  • Model Size: ~96GB in bfloat16
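
The ~96GB figure follows directly from the parameter count: 48B parameters × 2 bytes per bfloat16 weight; vLLM's PagedAttention then uses whatever VRAM remains for the KV cache. A quick back-of-the-envelope check:

```python
# Rough weight-memory estimate for 48B parameters stored in bfloat16.
params = 48e9
bytes_per_param = 2                      # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"weights ~ {weights_gb:.0f} GB")  # ~96 GB; the KV cache uses the remaining VRAM
```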

Technical Details

Fine-tuning Configuration

  • Method: QLoRA
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Training: LoRA adapters on the attention projections only (see the sketch below)
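
The settings above correspond roughly to a peft `LoraConfig` like the following. This is a hypothetical reconstruction for illustration only (the training script is not part of this Space), and `lora_dropout` is an assumed value not stated in this README; under QLoRA the base weights are additionally loaded in 4-bit and only the adapter is trained.

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter settings listed above.
lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_dropout=0.05,        # assumed; not stated in this README
    task_type="CAUSAL_LM",
)
```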

Generation Parameters

Temperature (0.0-2.0)

  • 0.1-0.5: Focused, deterministic
  • 0.6-0.9: Balanced (recommended)
  • 1.0-2.0: Creative, diverse

Top P (0.0-1.0)

  • Controls nucleus sampling
  • 0.9 recommended for most use cases

Max Tokens

  • Maximum response length
  • 1024 default, up to 4096
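
These settings (plus the optional system prompt) map one-to-one onto fields of the OpenAI-compatible request described under API Access. Below is a minimal sketch with the `openai` Python client, assuming the vLLM server started by the Space is reachable on localhost:8000.

```python
from openai import OpenAI

# Assumes the local vLLM server; the api_key is a placeholder required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},  # optional system prompt
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,   # balanced range recommended above
    top_p=0.9,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```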

API Access

vLLM exposes an OpenAI-compatible API:

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

Support


Powered by vLLM πŸš€ | Built with ❀️