---
title: Kimi 48B Fine-tuned - Inference
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l40sx4
---
# 🚀 Kimi Linear 48B A3B Instruct - Fine-tuned

Professional inference Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.
## Model Information

- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 billion total (about 3 billion active per token)
- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)
- **Architecture:** Mixture of Experts (MoE) Transformer
## Features

✨ **Professional Chat Interface**

- Clean, modern UI for seamless conversations
- Chat history with copy functionality
- System prompt customization

⚙️ **Advanced Generation Settings**

- Temperature control for creativity
- Top-P and Top-K sampling
- Repetition penalty adjustment
- Configurable response length

🎮 **Optimized Performance**

- Multi-GPU support (4x L40S recommended)
- Automatic device mapping
- bfloat16 precision for efficiency
- ~96 GB VRAM requirement
## Usage

1. **Click "Load Model"** - Initialize the model (takes 2-5 minutes)
2. **Set System Prompt** (optional) - Define the assistant's behavior
3. **Start Chatting** - Type your message and hit send
4. **Adjust Settings** - Fine-tune generation parameters as needed
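The Space's own `app.py` is not shown here, but the load-and-chat flow above corresponds roughly to the following Transformers sketch. The model id is taken from the description; the sampling defaults, helper names, and the `RUN_KIMI_CHAT` guard are assumptions for illustration.

```python
import os

MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"


def build_generation_kwargs(temperature=0.7, top_p=0.9, top_k=50,
                            repetition_penalty=1.1, max_new_tokens=1024):
    """Collect the sampling settings exposed in the UI (defaults assumed)."""
    return {
        "do_sample": temperature > 0,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "max_new_tokens": max_new_tokens,
    }


def chat_once(user_message, system_prompt="You are a helpful assistant."):
    # Heavy imports are deferred so the file can be inspected without
    # triggering a ~96 GB weight download.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # halves memory vs. float32
        device_map="auto",           # shard layers across all visible GPUs
        trust_remote_code=True,
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, **build_generation_kwargs())
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:],
                            skip_special_tokens=True)


if os.environ.get("RUN_KIMI_CHAT") == "1":
    print(chat_once("Hello!"))
```

Running `chat_once` requires the full multi-GPU setup described under Hardware Requirements, which is why the call is guarded behind an environment variable.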
## Generation Parameters

### Temperature (0.0 - 2.0)

- **Low (0.1-0.5):** Focused, deterministic responses
- **Medium (0.6-0.9):** Balanced creativity
- **High (1.0-2.0):** More creative and diverse outputs

### Top P (0.0 - 1.0)

- **0.9 (recommended):** Good balance
- **Lower values:** More focused
- **Higher values:** More diverse

### Max New Tokens

- Maximum length of the generated response, in tokens
- **1024 (default):** Good for most use cases
- Increase for longer responses
## Hardware Requirements

- **Recommended:** 4x NVIDIA L40S GPUs (192 GB total VRAM)
- **Minimum:** 4x NVIDIA L4 GPUs (96 GB total VRAM)
- **Memory:** ~96 GB VRAM for the weights in bfloat16 precision
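The ~96 GB figure follows directly from the parameter count: bfloat16 stores each weight in 2 bytes, so the weights alone occupy about 48B × 2 bytes. Note this excludes the KV cache and activations, which is why the 96 GB minimum configuration is tight.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Rough VRAM for the weights alone (excludes KV cache and activations)."""
    return n_params * bytes_per_param / 1e9


print(weight_memory_gb(48e9, 2))  # bfloat16: 96.0 GB
```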
## Fine-tuning Details

This model was fine-tuned using QLoRA with the following configuration:

- **LoRA Rank (r):** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj (attention layers only)
- **Dropout:** 0.05
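These hyperparameters map onto a PEFT `LoraConfig` like the one below. This is a hypothetical reconstruction from the listed values; the actual training script is not published with this Space, and the `bias`/`task_type` settings are assumed defaults.

```python
from peft import LoraConfig

# Reconstructed from the hyperparameters above; not the original script.
lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",             # assumption: common QLoRA default
    task_type="CAUSAL_LM",
)
```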
## Support

For issues or questions:

- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)

---

Built with ❤️ using Transformers and Gradio