---
title: Kimi 48B Fine-tuned - Inference
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---
# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned

Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
## Model Information

- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 billion total (about 3 billion active per token)
- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
- **Architecture:** Mixture of Experts (MoE) Transformer
## Features

✨ **Professional Chat Interface**

- Clean, modern UI for seamless conversations
- Chat history with copy functionality
- System prompt customization

⚙️ **Advanced Generation Settings**

- Temperature control for creativity
- Top-P and Top-K sampling
- Repetition penalty adjustment
- Configurable response length

🎮 **Optimized Performance**

- Multi-GPU support (4x L40S recommended)
- Automatic device mapping
- bfloat16 precision for efficiency
- ~96 GB VRAM requirement
## Usage

1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
2. **Set System Prompt** (optional) - Define the assistant's behavior
3. **Start Chatting** - Type your message and hit send
4. **Adjust Settings** - Fine-tune generation parameters as needed
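The Space's own `app.py` is not shown here, but the load-and-chat flow above corresponds roughly to the following Transformers sketch. The model id is taken from the description; the sampling defaults, helper names, and the `RUN_KIMI_CHAT` guard are assumptions for illustration.

```python
import os

MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"


def build_generation_kwargs(temperature=0.7, top_p=0.9, top_k=50,
                            repetition_penalty=1.1, max_new_tokens=1024):
    """Collect the sampling settings exposed in the UI (defaults assumed)."""
    return {
        "do_sample": temperature > 0,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "max_new_tokens": max_new_tokens,
    }


def chat_once(user_message, system_prompt="You are a helpful assistant."):
    # Heavy imports are deferred so the file can be inspected without
    # triggering a ~96 GB weight download.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # halves memory vs. float32
        device_map="auto",           # shard layers across all visible GPUs
        trust_remote_code=True,
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, **build_generation_kwargs())
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:],
                            skip_special_tokens=True)


if os.environ.get("RUN_KIMI_CHAT") == "1":
    print(chat_once("Hello!"))
```

Running `chat_once` requires the full multi-GPU setup described under Hardware Requirements, which is why the call is guarded behind an environment variable.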
## Generation Parameters

### Temperature (0.0 - 2.0)

- **Low (0.1-0.5):** Focused, deterministic responses
- **Medium (0.6-0.9):** Balanced creativity
- **High (1.0-2.0):** More creative and diverse outputs

### Top P (0.0 - 1.0)

- **0.9 (recommended):** Good balance
- **Lower values:** More focused
- **Higher values:** More diverse

### Max New Tokens

- Maximum length of the generated response, in tokens
- **1024 (default):** Good for most use cases
- Increase for longer responses
## Hardware Requirements

- **Recommended:** 4x NVIDIA L40S GPUs (192 GB total VRAM)
- **Minimum:** 4x NVIDIA L4 GPUs (96 GB total VRAM)
- **Memory:** ~96 GB VRAM for the weights in bfloat16 precision
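The ~96 GB figure follows directly from the parameter count: bfloat16 stores each weight in 2 bytes, so the weights alone occupy about 48B × 2 bytes. Note this excludes the KV cache and activations, which is why the 96 GB minimum configuration is tight.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Rough VRAM for the weights alone (excludes KV cache and activations)."""
    return n_params * bytes_per_param / 1e9


print(weight_memory_gb(48e9, 2))  # bfloat16: 96.0 GB
```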
## Fine-tuning Details

This model was fine-tuned using QLoRA with the following configuration:

- **LoRA Rank (r):** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj (attention layers only)
- **Dropout:** 0.05
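These hyperparameters map onto a PEFT `LoraConfig` like the one below. This is a hypothetical reconstruction from the listed values; the actual training script is not published with this Space, and the `bias`/`task_type` settings are assumed defaults.

```python
from peft import LoraConfig

# Reconstructed from the hyperparameters above; not the original script.
lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",             # assumption: common QLoRA default
    task_type="CAUSAL_LM",
)
```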
## Support

For issues or questions:

- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)

---

Built with ❤️ using Transformers and Gradio