---
title: Kimi 48B Fine-tuned - Evaluation
emoji: πŸ“Š
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---
# πŸ“Š Kimi Linear 48B A3B Instruct - Evaluation
A model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **Chat/inference functionality is currently disabled**; this Space runs benchmarks and evaluations only.
## Model Information
- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 billion total (≈3B activated per token, per the A3B designation)
- **Fine-tuning:** QLoRA on attention layers
- **Evaluation Framework:** LM Evaluation Harness
## Features
πŸ“Š **Model Evaluation**
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
⚑ **High Performance**
- Multi-GPU model loading (see the sketch below)
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models
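As a rough sketch of what multi-GPU bfloat16 loading looks like in `transformers` (the repo id comes from the model card above; `trust_remote_code=True` is an assumption, since Kimi models typically ship custom modeling code):

```python
# Minimal sketch: shard the model across all visible GPUs in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # ~96GB of weights instead of ~192GB in fp32
    device_map="auto",           # let accelerate distribute layers across GPUs
    trust_remote_code=True,      # assumption: custom Kimi modeling code
)
```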
βš™οΈ **Easy to Use**
- Simple Gradio interface
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
## Usage
### Quick Start
**Option 1: Direct Evaluation (Recommended)**
1. Go directly to the "πŸ“Š Evaluation" tab
2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "πŸš€ Start Evaluation"
4. lm_eval will automatically load and evaluate the model (the sketch after these steps shows the equivalent Python call)
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
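For readers who want to reproduce the run outside the UI, the steps above correspond roughly to the harness's Python API. This is a hedged sketch, not the Space's actual code; the `model_args` string is an assumption, but the task ids are the standard harness names for the three benchmarks:

```python
# Rough equivalent of "Start Evaluation" via the lm-eval Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune,"
        "dtype=bfloat16,trust_remote_code=True,parallelize=True"  # assumed flags
    ),
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=1,  # matches the Space's memory-saving setting
)
print(results["results"])  # per-task metric dict
```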
**Option 2: With Model Verification**
1. **(Optional)** Click "πŸš€ Load Model" in Controls tab to verify setup (5-10 min)
2. Go to the "πŸ“Š Evaluation" tab
3. Select benchmarks and click "πŸš€ Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM (cleanup sketched below)
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results
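The unload in step 4 is the usual PyTorch cleanup sequence. Here is a minimal sketch of what "aggressive memory clearing" means, assuming the model is kept in some app-level state dict (the Space's exact code may differ):

```python
import gc
import time
import torch

def unload_model(state: dict) -> None:
    """Drop the pre-loaded model and reclaim VRAM before lm_eval starts."""
    state.pop("model", None)   # hypothetical holder: remove the last strong reference
    for _ in range(3):         # multiple GC passes, per the memory notes below
        gc.collect()
    torch.cuda.empty_cache()   # hand cached blocks back to the driver
    torch.cuda.synchronize()   # make sure pending frees have completed
    time.sleep(5)              # the 5-second settle time mentioned in this README
```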
**View Results**
- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis (see the sketch below for pulling out headline metrics)
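A quick way to summarize a saved run. The timestamped directory name below is illustrative (it varies per run), and the filename and layout assume the harness's standard top-level `results` dict:

```python
import json

# Illustrative path: substitute the actual /tmp/eval_results_[timestamp]/ directory.
with open("/tmp/eval_results_20250101_120000/results.json") as f:
    data = json.load(f)

for task, metrics in data["results"].items():
    for name, value in metrics.items():
        if isinstance(value, float):  # skip non-numeric fields such as "alias"
            print(f"{task:20s} {name:15s} {value:.4f}")
```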
## Why LM Evaluation Harness?
The LM Evaluation Harness is a standard framework for evaluating language models:
- **Standardized:** Consistent benchmarks across models
- **Comprehensive:** Wide variety of tasks and metrics
- **Reproducible:** Deterministic evaluation results
- **Trusted:** Used by major research organizations
## Hardware Requirements
- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
- **Model Size:** ~96GB in bfloat16
### Memory Management
This Space is optimized for limited VRAM (96GB across 4x L4):
- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
- **Single Instance:** Only lm_eval's model instance runs during evaluation
- **Batch Size:** Set to 1 to minimize memory usage during evaluation
- **Device Mapping:** Automatic distribution across available GPUs
- **Memory Fragmentation:** `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set by default (see the sketch below)
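The fragmentation setting only helps if it is in the environment before PyTorch initializes CUDA, which is why the Space sets it by default. A minimal sketch of the conventional pattern (the surrounding app code is assumed):

```python
# Set the allocator flag before anything touches CUDA; in practice this sits
# at the very top of the app's entry point, before `import torch`.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (deliberately imported after the env var is set)
```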
## Technical Details
### Fine-tuning Configuration
- **Method:** QLoRA (sketched below)
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
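In `peft`/`transformers` terms, this configuration looks roughly like the sketch below. Only the rank, alpha, and target modules come from the list above; the 4-bit quantization settings and dropout are typical QLoRA defaults, assumed here for illustration:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # the "Q" in QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",           # assumption: standard QLoRA quant type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                # LoRA rank (from the list above)
    lora_alpha=32,                       # LoRA alpha (from the list above)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,                   # assumption: common default
    task_type="CAUSAL_LM",
)
```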
### Benchmark Details
**ARC-Challenge**
- AI2 Reasoning Challenge (Challenge set)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm
**TruthfulQA**
- Tests model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (mc2)
**Winogrande**
- Common sense reasoning
- Pronoun resolution tasks
- 1,267 evaluation questions (validation split)
- Metrics: accuracy
## Support & Resources
- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Base Model Page](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
---
**Powered by LM Evaluation Harness** πŸ“Š | Built with ❀️