---
title: Kimi 48B Fine-tuned - Evaluation
emoji: πŸ“Š
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---

# πŸ“Š Kimi Linear 48B A3B Instruct - Evaluation

A model-evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **Chat/inference functionality is currently disabled**; this Space runs benchmarks and evaluations only.

## Model Information

- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 Billion
- **Fine-tuning:** QLoRA on attention layers
- **Evaluation Framework:** LM Evaluation Harness

## Features

πŸ“Š **Model Evaluation**
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis

⚑ **High-Performance**
- Multi-GPU model loading
- Optimized memory distribution
- bfloat16 precision
- Supports 48B parameter models
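
The "optimized memory distribution" above can be sketched as a per-GPU memory map of the kind Accelerate-style `device_map` loading accepts. This is a minimal illustration, not the Space's actual code; the GPU count, per-card capacity, and overhead reserve are assumptions (4x NVIDIA L4, 24 GiB each, ~2 GiB held back for CUDA overhead):

```python
# Sketch: build a per-GPU max_memory map of the kind accepted by
# transformers.AutoModelForCausalLM.from_pretrained(..., max_memory=...).
# All sizes here are illustrative assumptions, not measured values.

def max_memory_map(num_gpus: int = 4, card_gib: int = 24, reserve_gib: int = 2) -> dict:
    """Return a {device_index: "NGiB"} cap per GPU, holding back
    reserve_gib per card for CUDA context and activation overhead."""
    usable = card_gib - reserve_gib
    return {i: f"{usable}GiB" for i in range(num_gpus)}

print(max_memory_map())  # -> {0: '22GiB', 1: '22GiB', 2: '22GiB', 3: '22GiB'}
```

Capping each card below its nominal capacity leaves headroom so the sharded weights do not collide with runtime allocations.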

βš™οΈ **Easy to Use**
- Simple Gradio interface
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates

## Usage

### Quick Start

**Option 1: Direct Evaluation (Recommended)**
1. Go directly to the "πŸ“Š Evaluation" tab
2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "πŸš€ Start Evaluation"
4. lm_eval will automatically load and evaluate the model
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
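
For reference, the UI steps above correspond roughly to an lm-evaluation-harness CLI invocation like the following. Treat it as a sketch: the task names follow standard lm_eval conventions, but the Space's exact arguments may differ:

```bash
# Rough CLI equivalent of the Evaluation tab (illustrative)
lm_eval --model hf \
  --model_args pretrained=optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune,dtype=bfloat16,parallelize=True \
  --tasks arc_challenge,truthfulqa_mc2,winogrande \
  --batch_size 1 \
  --output_path /tmp/eval_results
```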

**Option 2: With Model Verification**
1. **(Optional)** Click "πŸš€ Load Model" in Controls tab to verify setup (5-10 min)
2. Go to the "πŸ“Š Evaluation" tab
3. Select benchmarks and click "πŸš€ Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results

**View Results**
- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis
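
The saved JSON can be post-processed into a quick summary. The snippet below is a sketch: the `{"results": {task: {metric: value}}}` shape mirrors what lm-evaluation-harness writes, but the exact metric keys and the sample numbers here are illustrative assumptions:

```python
# Sketch: flatten an lm_eval-style results dict into a readable table.
# The sample data below is fabricated for illustration only.
sample = {
    "results": {
        "arc_challenge": {"acc,none": 0.55, "acc_norm,none": 0.58},
        "winogrande": {"acc,none": 0.74},
    }
}

def format_results(data: dict) -> str:
    """Render each task's numeric metrics as aligned text lines."""
    lines = []
    for task, metrics in data["results"].items():
        for name, value in metrics.items():
            if isinstance(value, (int, float)):
                lines.append(f"{task:15s} {name:15s} {value:.4f}")
    return "\n".join(lines)

print(format_results(sample))
```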

## Why LM Evaluation Harness?

The LM Evaluation Harness is a standard framework for evaluating language models:
- **Standardized:** Consistent benchmarks across models
- **Comprehensive:** Wide variety of tasks and metrics
- **Reproducible:** Deterministic evaluation results
- **Trusted:** Used by major research organizations

## Hardware Requirements

- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
- **Model Size:** ~96GB in bfloat16
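
The ~96GB figure follows directly from the parameter count: bfloat16 stores each parameter in 2 bytes.

```python
# Back-of-envelope weight footprint: 48B parameters x 2 bytes (bfloat16).
params = 48e9
bytes_per_param = 2  # bfloat16
gb = params * bytes_per_param / 1e9
print(f"{gb:.0f} GB")  # -> 96 GB
```

This counts weights only; KV cache and activations add to the total at runtime, which is why the minimum configuration is tight.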

### Memory Management

This Space is optimized for limited VRAM (~92GB usable across 4x L4, after CUDA overhead):
- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
- **Single Instance:** Only lm_eval's model instance runs during evaluation
- **Batch Size:** Set to 1 to minimize memory usage during evaluation
- **Device Mapping:** Automatic distribution across available GPUs
- **Memory Fragmentation:** PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set by default
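
The cleanup steps above (allocator config, GC passes, cache clear, wait) can be sketched as follows. This is a minimal illustration, not the Space's actual code; the helper name `free_vram` is hypothetical, and `torch` is imported optionally so the sketch degrades gracefully without a GPU:

```python
import gc
import os
import time

# Allocator config must be set before the first CUDA allocation
# (the Space sets this by default).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

try:
    import torch  # optional: cleanup degrades gracefully without it
except ImportError:
    torch = None

def free_vram(passes: int = 3, wait_s: float = 5.0) -> None:
    """Hypothetical helper: aggressively release memory before lm_eval
    loads its own model instance (GC passes, CUDA cache clear, wait)."""
    for _ in range(passes):
        gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
    time.sleep(wait_s)

free_vram(wait_s=0.1)
```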

## Technical Details

### Fine-tuning Configuration
- **Method:** QLoRA
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
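
The configuration above maps onto a peft-style LoRA config. It is written here as a plain dict so the sketch runs without the `peft` dependency; the field names match `peft.LoraConfig`, and the scaling note is the standard LoRA alpha/r relationship:

```python
# The fine-tuning table as a peft-style LoRA config (plain dict so this
# runs without peft installed; keys mirror peft.LoraConfig fields).
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

# Effective scaling applied to the adapter output: alpha / r.
scaling = lora_config["lora_alpha"] / lora_config["r"]
print(scaling)  # -> 2.0
```

With alpha at twice the rank, the adapter's contribution is scaled up by 2x relative to an alpha equal to the rank.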

### Benchmark Details

**ARC-Challenge**
- AI2 Reasoning Challenge (challenge split)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm

**TruthfulQA**
- Tests model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (bleu/rouge apply only to the generation variant)

**Winogrande**
- Common sense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy

## Support & Resources

- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Base Model Page](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- [Transformers Documentation](https://huggingface.co/docs/transformers)

---

**Powered by LM Evaluation Harness** πŸ“Š | Built with ❀️