---
title: Kimi 48B Fine-tuned - Inference
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---
# πŸš€ Kimi Linear 48B A3B Instruct - Fine-tuned
High-performance inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model, powered by **vLLM**.
## Model Information
- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 Billion
- **Fine-tuning:** QLoRA on attention layers
- **Inference Engine:** vLLM
## Features
⚑ **High-Performance Inference**
- Powered by vLLM for maximum throughput
- Optimized memory usage with PagedAttention
- Multi-GPU support (automatic)
πŸ’¬ **Professional Chat Interface**
- Clean Gradio UI
- Real-time responses
- Chat history
- Copy button for responses
βš™οΈ **Configurable Generation**
- Temperature control
- Top-P sampling
- Max tokens setting
- System prompt support
## Usage
### Quick Start
1. **Start vLLM Server**
   - Click the "πŸš€ Start vLLM Server" button (a sketch of the underlying launch command follows this list)
- Wait 2-5 minutes for initialization
- Look for "βœ… Server started successfully"
2. **Chat**
- Type your message
- Click "Send" or press Enter
- Get fast, high-quality responses
3. **Customize**
- Set a system prompt (optional)
- Adjust temperature for creativity
- Modify max tokens for response length
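Under the hood, step 1 above roughly corresponds to launching vLLM's OpenAI-compatible server. The snippet below is only a sketch of such a launch, not the Space's exact code; the port and tensor-parallel size are assumptions taken from the "API Access" and "Hardware Requirements" sections.
```python
import subprocess
import sys

# Minimal sketch: launch vLLM's OpenAI-compatible server as a subprocess.
# Port 8000 and --tensor-parallel-size 4 are assumptions matching the
# sections below; the Space's actual startup code may differ.
server = subprocess.Popen([
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "--port", "8000",
    "--tensor-parallel-size", "4",
    "--dtype", "bfloat16",
    "--trust-remote-code",  # assumption: custom architectures often need this
])
# The server needs a few minutes to load the weights before
# http://localhost:8000/v1 starts accepting requests.
```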
## Why vLLM?
vLLM is a high-throughput and memory-efficient inference engine:
- **Faster:** Optimized CUDA kernels
- **Efficient:** PagedAttention for KV cache
- **Scalable:** Multi-GPU support
- **Compatible:** OpenAI API format
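For reference, the same engine can also be used directly from Python without the HTTP server. This is a minimal offline-inference sketch (not the Space's serving path); PagedAttention-based KV-cache management is automatic, and multi-GPU use is controlled by `tensor_parallel_size`, assumed to be 4 here to match the recommended hardware.
```python
from vllm import LLM, SamplingParams

# Offline-inference sketch; tensor_parallel_size=4 assumes the 4-GPU setup
# described below. PagedAttention is handled internally by vLLM.
llm = LLM(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    dtype="bfloat16",
    tensor_parallel_size=4,
    trust_remote_code=True,  # assumption: often required for custom architectures
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
outputs = llm.generate(["Hello! Briefly introduce yourself."], params)
print(outputs[0].outputs[0].text)
```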
## Hardware Requirements
- **Recommended:** 4x NVIDIA L40S (192 GB VRAM total)
- **Minimum:** 4x NVIDIA L4 (96 GB VRAM total)
- **Model Size:** ~96 GB of weights in bfloat16
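As a back-of-the-envelope check of the ~96 GB figure (weights only; the KV cache and activations need additional headroom on top of this):
```python
# Rough weight-memory estimate: parameters x bytes per parameter.
params = 48e9          # 48B parameters
bytes_per_param = 2    # bfloat16 = 2 bytes
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights in bfloat16")  # ~96 GB
```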
## Technical Details
### Fine-tuning Configuration
- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
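The training script is not included in this Space; the snippet below is only a sketch of a PEFT/bitsandbytes configuration matching the hyperparameters listed above, not the actual code used for fine-tuning.
```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Sketch of a QLoRA setup matching the listed hyperparameters (illustrative only).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                   # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    lora_dropout=0.05,                      # assumption: not stated in this README
    task_type="CAUSAL_LM",
)
```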
### Generation Parameters
**Temperature (0.0-2.0)**
- 0.1-0.5: Focused, deterministic
- 0.6-0.9: Balanced (recommended)
- 1.0-2.0: Creative, diverse
**Top P (0.0-1.0)**
- Controls nucleus sampling
- 0.9 recommended for most use cases
**Max Tokens**
- Maximum response length
- 1024 default, up to 4096
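As an illustration, these settings map directly onto the request body of the OpenAI-compatible endpoint described in "API Access" below; the exact values here are examples picked from the ranges above, not defaults enforced by the Space.
```python
# Example generation settings for the chat completions request (illustrative values).
focused = {"temperature": 0.3, "top_p": 0.9, "max_tokens": 1024}   # precise answers
balanced = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}  # recommended default
creative = {"temperature": 1.2, "top_p": 0.9, "max_tokens": 2048}  # more diverse output
```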
## API Access
vLLM exposes an OpenAI-compatible API:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
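The same endpoint can be called from Python with the `openai` client library. This is a sketch: the `api_key` value is a placeholder (a locally launched vLLM server does not check it by default), and the generation settings reuse the values from the "Generation Parameters" section.
```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (placeholder API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```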
## Support
- [vLLM Documentation](https://docs.vllm.ai/)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
---
**Powered by vLLM** πŸš€ | Built with ❀️