---
title: Kimi 48B Fine-tuned - Evaluation
emoji: 📊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---

# 📊 Kimi Linear 48B A3B Instruct - Evaluation

Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model.

**Chat/inference functionality is currently disabled** - this Space focuses on running benchmarks and evaluations only.

## Model Information

- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 billion
- **Fine-tuning:** QLoRA on attention layers
- **Evaluation Framework:** LM Evaluation Harness

## Features

📊 **Model Evaluation**

- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis

⚡ **High Performance**

- Multi-GPU model loading
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models

⚙️ **Easy to Use**

- Simple Gradio interface
- One-click model loading
- Benchmark selection via checkboxes
- Real-time progress updates

## Usage

### Quick Start

**Option 1: Direct Evaluation (Recommended)**

1. Go directly to the "📊 Evaluation" tab
2. Select the benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "🚀 Start Evaluation"
4. lm_eval will automatically load and evaluate the model
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`

**Option 2: With Model Verification**

1. **(Optional)** Click "🚀 Load Model" in the Controls tab to verify the setup (5-10 min)
2. Go to the "📊 Evaluation" tab
3. Select benchmarks and click "🚀 Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results

**View Results**

- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis

## Why LM Evaluation Harness?

The LM Evaluation Harness is a standard framework for evaluating language models:

- **Standardized:** Consistent benchmarks across models
- **Comprehensive:** Wide variety of tasks and metrics
- **Reproducible:** Deterministic evaluation results
- **Trusted:** Used by major research organizations

## Hardware Requirements

- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
- **Model Size:** ~96GB in bfloat16

### Memory Management

This Space is optimized for limited VRAM (92GB across 4x L4):

- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
- **Aggressive Memory Clearing:** Multiple garbage-collection passes plus a 5s wait
- **Single Instance:** Only lm_eval's model instance runs during evaluation
- **Batch Size:** Set to 1 to minimize memory usage during evaluation
- **Device Mapping:** Automatic distribution across available GPUs
- **Memory Fragmentation:** `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` set by default

## Technical Details

### Fine-tuning Configuration

- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only

### Benchmark Details

**ARC-Challenge**

- AI2 Reasoning Challenge
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm

**TruthfulQA**

- Tests the model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy, bleu, rouge

**Winogrande**

- Common-sense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy

## Support & Resources

- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Base Model Page](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- [Transformers Documentation](https://huggingface.co/docs/transformers)

---

**Powered by LM Evaluation Harness** 📊 | Built with ❤️
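
## Appendix: Equivalent CLI Invocation

For reference, an evaluation like the Quick Start flow above can also be approximated from the command line with the Harness's `lm_eval` CLI. This is a minimal sketch, not the Space's actual code: `arc_challenge`, `truthfulqa_mc2`, and `winogrande` are the Harness's standard task identifiers for the three benchmarks, `parallelize=True` stands in for the Space's automatic multi-GPU device mapping, and the output directory is a hypothetical example.

```python
import os

# Mitigate CUDA memory fragmentation, as described under "Memory Management"
# (must be set before torch initializes CUDA).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"
# Standard lm_eval task names for the three benchmarks above.
TASKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]

def build_lm_eval_command(model_id, tasks, output_dir):
    """Assemble an `lm_eval` CLI call mirroring the Space's settings:
    bfloat16 weights, multi-GPU sharding, batch size 1."""
    model_args = ",".join([
        f"pretrained={model_id}",
        "dtype=bfloat16",
        "parallelize=True",        # shard the model across all visible GPUs
        "trust_remote_code=True",  # Kimi-Linear uses custom modeling code
    ])
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
        "--output_path", output_dir,
    ]

cmd = build_lm_eval_command(MODEL_ID, TASKS, "/tmp/eval_results_demo")
print(" ".join(cmd))  # inspect, then hand the list to subprocess.run(cmd)
```

Building the command as a list (rather than a shell string) keeps it safe to pass directly to `subprocess.run`. Expect the run itself to take 30-60 minutes on the hardware described above.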