---
title: Kimi 48B Fine-tuned - Evaluation
emoji: 📊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---
# 📊 Kimi Linear 48B A3B Instruct - Evaluation
Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **Chat/inference functionality is currently disabled** - this Space focuses on running benchmarks and evaluations only.
## Model Information
- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 Billion
- **Fine-tuning:** QLoRA on attention layers
- **Evaluation Framework:** LM Evaluation Harness
## Features
📊 **Model Evaluation**
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
⚡ **High Performance**
- Multi-GPU model loading
- Optimized memory distribution
- bfloat16 precision
- Supports 48B parameter models
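Multi-GPU loading with a per-device memory cap can be sketched as below. This is an illustrative sketch, not the Space's actual code: the `build_max_memory` helper, the 22GiB cap (leaving headroom on a 24GiB L4), and the `load_model` wrapper are all assumptions; the actual loading call requires `torch` and `transformers` and a lot of VRAM, so it is only defined here, not executed.

```python
# Sketch: spreading a 48B bfloat16 model across 4 GPUs with a
# per-device memory cap. Names and the 22GiB cap are assumptions.
def build_max_memory(num_gpus: int = 4, per_gpu: str = "22GiB") -> dict:
    """Per-device cap passed alongside device_map='auto'."""
    return {i: per_gpu for i in range(num_gpus)}


def load_model(model_id: str, num_gpus: int = 4):
    # Deferred imports: illustrative only, needs transformers + torch.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # halves memory vs float32
        device_map="auto",           # shard layers across GPUs
        max_memory=build_max_memory(num_gpus),
    )


print(build_max_memory(4))
```

With `device_map="auto"`, Accelerate places layers on each GPU up to its cap, which is how a ~96GB model fits across four 24GB cards.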
⚙️ **Easy to Use**
- Simple Gradio interface
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
## Usage
### Quick Start
**Option 1: Direct Evaluation (Recommended)**
1. Go directly to the "Evaluation" tab
2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "Start Evaluation"
4. lm_eval will automatically load and evaluate the model
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
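The timestamped results directory from step 6 can be produced with a few lines of standard-library Python. A minimal sketch, assuming a simple Unix-timestamp naming scheme and a single `results.json` per run (both assumptions; the Space's actual layout may differ):

```python
import json
import tempfile
import time
from pathlib import Path


def save_results(results: dict, base: str = tempfile.gettempdir()) -> Path:
    """Write results to <base>/eval_results_<timestamp>/results.json."""
    out_dir = Path(base) / f"eval_results_{int(time.time())}"
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "results.json"
    path.write_text(json.dumps(results, indent=2))
    return path


saved = save_results({"arc_challenge": {"acc": 0.0}})
print(saved)
```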
**Option 2: With Model Verification**
1. **(Optional)** Click "Load Model" in the Controls tab to verify setup (5-10 min)
2. Go to the "Evaluation" tab
3. Select benchmarks and click "Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results
**View Results**
- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis
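Reading metrics out of a saved results file can look like the sketch below. The `sample` dict mimics the shape of an lm_eval results JSON (a top-level `"results"` key mapping task name to metrics); the exact metric keys vary between harness versions, and the numbers here are placeholders, not real scores.

```python
# Hypothetical excerpt shaped like an lm_eval results file; metric key
# names and values are illustrative only.
sample = {
    "results": {
        "arc_challenge": {"acc,none": 0.55, "acc_norm,none": 0.58},
        "winogrande": {"acc,none": 0.74},
    }
}


def summarize(results: dict) -> dict:
    """Flatten {task: {metric: value}} into 'task/metric' rows."""
    table = {}
    for task, metrics in results["results"].items():
        for name, value in metrics.items():
            if isinstance(value, (int, float)):
                table[f"{task}/{name}"] = round(value, 4)
    return table


print(summarize(sample))
```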
## Why LM Evaluation Harness?
The LM Evaluation Harness is a standard framework for evaluating language models:
- **Standardized:** Consistent benchmarks across models
- **Comprehensive:** Wide variety of tasks and metrics
- **Reproducible:** Deterministic evaluation results
- **Trusted:** Used by major research organizations
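Programmatically, the harness exposes `lm_eval.simple_evaluate`. A minimal sketch of how this Space's benchmarks could be run through it; the `model_args` string (dtype, `parallelize`) is an assumption about a reasonable multi-GPU setup, and the call is wrapped in a function here because it needs `lm-eval` installed and enough VRAM for the model:

```python
# Sketch of invoking the LM Evaluation Harness from Python.
TASKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]
MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"


def run_eval():
    import lm_eval  # illustrative only; not executed in this sketch

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={MODEL_ID},dtype=bfloat16,parallelize=True",
        tasks=TASKS,
        batch_size=1,
    )


print(TASKS)
```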
## Hardware Requirements
- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
- **Model Size:** ~96GB in bfloat16
### Memory Management
This Space is optimized for limited VRAM (4x L4, 96GB total):
- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
- **Single Instance:** Only lm_eval's model instance runs during evaluation
- **Batch Size:** Set to 1 to minimize memory usage during evaluation
- **Device Mapping:** Automatic distribution across available GPUs
- **Memory Fragmentation:** `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set by default
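The cleanup steps above (unload, repeated GC passes, a pause before reloading) can be sketched as follows. The function name and structure are illustrative, not the Space's actual code; the `torch` import is guarded so the sketch also runs on CPU-only machines.

```python
import gc
import time


def free_model_memory(model=None, passes: int = 3, wait_s: float = 5.0) -> int:
    """Drop the model reference, run several GC passes, flush the CUDA
    cache if available, then pause so the allocator can release blocks."""
    del model  # drop our local reference so GC can reclaim it
    for _ in range(passes):
        gc.collect()
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only environment: nothing to flush
    time.sleep(wait_s)
    return passes


# Short wait here just to keep the demo quick; the Space uses 5s.
print(free_model_memory(passes=3, wait_s=0.1))
```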
## Technical Details
### Fine-tuning Configuration
- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
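The settings above can be expressed as a config dict; mapping it onto a `peft.LoraConfig` is shown but deliberately not executed (`peft` may not be installed, and this is a reconstruction from the README's numbers, not the actual training script):

```python
# The README's QLoRA hyperparameters as a plain dict.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def make_lora_config():
    from peft import LoraConfig  # illustrative only

    return LoraConfig(**LORA_SETTINGS)


print(LORA_SETTINGS["r"], LORA_SETTINGS["lora_alpha"])
```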
### Benchmark Details
**ARC-Challenge**
- AI2 Reasoning Challenge (Challenge set)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm
**TruthfulQA**
- Tests the model's truthfulness against common misconceptions
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: acc (mc2 score)
**Winogrande**
- Common sense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy
## Support & Resources
- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Base Model Page](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
---
**Powered by LM Evaluation Harness** 📊 | Built with ❤️