---
title: Kimi 48B Fine-tuned - Evaluation
emoji: 📊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---
# Kimi Linear 48B A3B Instruct - Evaluation

Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **Chat/inference functionality is currently disabled** - this Space focuses on running benchmarks and evaluations only.

## Model Information

- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48B total (the "A3B" in the name denotes ~3B activated per token)
- **Fine-tuning:** QLoRA on attention layers
- **Evaluation Framework:** LM Evaluation Harness

## Features

**Model Evaluation**
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis

**High-Performance**
- Multi-GPU model loading (sketched below)
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models
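
A minimal sketch of what multi-GPU loading in bfloat16 typically looks like with `transformers` (illustrative; the Space's actual loading code may differ):

```python
# Illustrative sketch, not the Space's exact code: shard the model across all
# visible GPUs in bfloat16 using Accelerate's automatic device mapping.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~96 GB of weights instead of ~192 GB in float32
    device_map="auto",           # distribute layers across available GPUs
    trust_remote_code=True,      # Kimi-Linear ships custom modeling code
)
```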

**Easy to Use**
- Simple Gradio interface
- One-click model loading
- Select benchmarks via checkboxes (see the interface sketch below)
- Real-time progress updates
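
The interface described above could be wired up roughly like this (a hypothetical sketch; the component names and handler are illustrative, not the Space's actual `app.py`):

```python
# Hypothetical sketch of the described UI; names and handler are illustrative.
import gradio as gr

BENCHMARKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]

def run_evaluation(selected, progress=gr.Progress()):
    progress(0, desc="Starting evaluation...")
    # ... invoke lm_eval here (see the Usage section) ...
    return f"Finished: {', '.join(selected)}"

with gr.Blocks(title="Kimi 48B Evaluation") as demo:
    with gr.Tab("Evaluation"):
        tasks = gr.CheckboxGroup(BENCHMARKS, label="Benchmarks")
        start = gr.Button("Start Evaluation")
        output = gr.Textbox(label="Results")
        start.click(run_evaluation, inputs=tasks, outputs=output)

demo.launch(server_name="0.0.0.0", server_port=7860)  # matches app_port above
```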

## Usage

### Quick Start

**Option 1: Direct Evaluation (Recommended)**

1. Go directly to the "Evaluation" tab
2. Select the benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "Start Evaluation"
4. lm_eval will automatically load and evaluate the model (see the sketch below)
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
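
Under the hood, "Start Evaluation" amounts to roughly the following call into lm_eval's Python API (a sketch; the exact arguments used by this Space may differ):

```python
# Sketch of the evaluation call; exact arguments in this Space may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune,"
        "dtype=bfloat16,parallelize=True,trust_remote_code=True"
    ),
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=1,  # keeps VRAM usage minimal (see Memory Management)
)
print(results["results"])  # per-task metric dictionaries
```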

**Option 2: With Model Verification**

1. **(Optional)** Click "Load Model" in the Controls tab to verify the setup (5-10 min)
2. Go to the "Evaluation" tab
3. Select benchmarks and click "Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results

**View Results**

- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis (see the snippet below)
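
A small sketch for inspecting a saved results file afterwards. The `/tmp/eval_results_*` directory pattern comes from step 6 above; the file name inside is an assumption based on lm_eval's standard output format:

```python
# Sketch: load the most recent saved results file and print numeric metrics.
# The inner file layout is assumed to follow lm_eval's output format.
import glob
import json

paths = sorted(glob.glob("/tmp/eval_results_*/**/*.json", recursive=True))
with open(paths[-1]) as f:
    data = json.load(f)

for task, metrics in data["results"].items():
    numeric = {k: round(v, 4) for k, v in metrics.items() if isinstance(v, float)}
    print(task, numeric)
```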

## Why LM Evaluation Harness?

The LM Evaluation Harness is a standard framework for evaluating language models:

- **Standardized:** Consistent benchmarks across models
- **Comprehensive:** Wide variety of tasks and metrics
- **Reproducible:** Deterministic evaluation results
- **Trusted:** Used by major research organizations

## Hardware Requirements

- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
- **Model Size:** ~96GB in bfloat16

### Memory Management

This Space is optimized for limited VRAM (92GB usable across 4x L4):

- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts (see the sketch below)
- **Aggressive Memory Clearing:** Multiple garbage-collection passes + a 5s wait
- **Single Instance:** Only lm_eval's model instance runs during evaluation
- **Batch Size:** Set to 1 to minimize memory usage during evaluation
- **Device Mapping:** Automatic distribution across available GPUs
- **Memory Fragmentation:** `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set by default
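
The cleanup steps above could look roughly like this (a sketch under the assumptions in this list, not the Space's exact implementation):

```python
# Sketch of the cleanup described above; not the Space's exact code.
import gc
import os
import time

import torch

# Reduce fragmentation; must be set before CUDA is initialized to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def free_vram(model=None):
    """Release a pre-loaded model before lm_eval loads its own instance."""
    if model is not None:
        del model                 # the caller must drop its own references too
    for _ in range(3):            # multiple garbage-collection passes
        gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
    time.sleep(5)                 # the 5s settle time mentioned above
```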

## Technical Details

### Fine-tuning Configuration

- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only (see the PEFT config sketch below)
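
As a PEFT `LoraConfig`, the numbers above translate to roughly the following. QLoRA additionally loads the base weights in 4-bit (e.g. via bitsandbytes); only the adapter config is sketched here, and the dropout value is illustrative since this README does not specify it:

```python
# Sketch of the adapter config implied by the settings above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    lora_dropout=0.05,       # illustrative; not specified in this README
    task_type="CAUSAL_LM",
)
```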

### Benchmark Details

**ARC-Challenge**
- AI2 Reasoning Challenge (challenge set)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy (`acc`), normalized accuracy (`acc_norm`)

**TruthfulQA**
- Tests the model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: multiple-choice accuracy (`acc`); BLEU/ROUGE apply only to the generation variant

**Winogrande**
- Commonsense reasoning
- Pronoun-resolution tasks
- 1,267 evaluation questions
- Metrics: accuracy (`acc`)
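
For reference, the benchmark names above map onto lm_eval task IDs as follows (the mc2 task ID follows from the format noted above):

```python
# lm_eval task IDs for the benchmarks described above.
TASK_IDS = {
    "ARC-Challenge": "arc_challenge",
    "TruthfulQA": "truthfulqa_mc2",  # multiple-choice (mc2) variant
    "Winogrande": "winogrande",
}
```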

## Support & Resources

- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Base Model Page](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- [Transformers Documentation](https://huggingface.co/docs/transformers)

---

**Powered by LM Evaluation Harness** | Built with ❤️