---
title: Kimi 48B Fine-tuned - Evaluation
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---
# Kimi Linear 48B A3B Instruct - Evaluation
Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. Chat/inference functionality is currently disabled; this Space focuses on running benchmarks and evaluations only.
## Model Information
- Model: optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune
- Base Model: moonshotai/Kimi-Linear-48B-A3B-Instruct
- Parameters: 48 Billion
- Fine-tuning: QLoRA on attention layers
- Evaluation Framework: LM Evaluation Harness
## Features

### Model Evaluation
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
### High-Performance
- Multi-GPU model loading
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models (see the loading sketch below)
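
For reference, here is a minimal sketch of how a model this size can be loaded across several GPUs in bfloat16 with Transformers. The `device_map="auto"` sharding and `trust_remote_code=True` flag are assumptions about how the Space loads the checkpoint, not a copy of its actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

# Load the tokenizer and model, letting accelerate shard the weights
# across all visible GPUs. bfloat16 roughly halves the footprint vs. float32.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # ~96 GB of weights for 48B parameters
    device_map="auto",            # distribute layers across the available GPUs
    trust_remote_code=True,       # assumption: Kimi-Linear ships custom model code
)
```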
### Easy to Use
- Simple Gradio interface (a minimal wiring sketch follows this list)
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
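
A hypothetical sketch of how such an interface can be wired in Gradio; the checkbox group, button label, and handler below are illustrative, not the Space's actual app code:

```python
import gradio as gr

BENCHMARKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]

def run_evaluation(selected):
    # Placeholder handler: the real Space would invoke lm_eval here.
    return f"Would evaluate on: {', '.join(selected)}"

with gr.Blocks() as demo:
    gr.Markdown("## Evaluation")
    tasks = gr.CheckboxGroup(BENCHMARKS, label="Benchmarks", value=BENCHMARKS)
    start = gr.Button("Start Evaluation")
    status = gr.Textbox(label="Status")
    start.click(run_evaluation, inputs=tasks, outputs=status)

demo.launch(server_name="0.0.0.0", server_port=7860)  # matches app_port above
```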
## Usage

### Quick Start

#### Option 1: Direct Evaluation (Recommended)
1. Go directly to the "Evaluation" tab
2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "Start Evaluation"
4. lm_eval will automatically load and evaluate the model
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/` (a programmatic equivalent is sketched below)
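
Roughly what the Evaluation tab triggers, sketched via the LM Evaluation Harness Python API; the exact `model_args` the Space passes are an assumption:

```python
import lm_eval

# Sketch: run the same three benchmarks through the harness's Python API.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune,"
        "dtype=bfloat16,parallelize=True,trust_remote_code=True"
    ),
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=1,  # matches the Space's memory-saving default
)

# Per-task metrics live under the "results" key of the returned dict.
for task, metrics in results["results"].items():
    print(task, metrics)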
#### Option 2: With Model Verification
1. (Optional) Click "Load Model" in the Controls tab to verify the setup (5-10 min)
2. Go to the "Evaluation" tab
3. Select benchmarks and click "Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results
### View Results
- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis (see the parsing sketch below)
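
A small sketch for loading the saved results afterwards; the directory layout under `/tmp` is assumed from the steps above, and the exact file names inside it may differ:

```python
import glob
import json

# Hypothetical path pattern: adjust it to match your run's timestamp.
result_files = glob.glob("/tmp/eval_results_*/**/*.json", recursive=True)

for path in result_files:
    with open(path) as f:
        data = json.load(f)
    # lm_eval stores per-task metrics under the "results" key.
    for task, metrics in data.get("results", {}).items():
        print(f"{path}: {task}")
        for name, value in metrics.items():
            print(f"  {name}: {value}")
```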
## Why LM Evaluation Harness?
The LM Evaluation Harness is a standard framework for evaluating language models:
- Standardized: Consistent benchmarks across models
- Comprehensive: Wide variety of tasks and metrics
- Reproducible: Deterministic evaluation results
- Trusted: Used by major research organizations
## Hardware Requirements
- Recommended: 4x NVIDIA L40S (192GB VRAM)
- Minimum: 4x NVIDIA L4 (96GB VRAM)
- Model Size: ~96GB in bfloat16 (see the back-of-envelope calculation below)
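
The ~96GB figure follows directly from the parameter count; a quick back-of-envelope check (activations, KV cache, and CUDA overhead come on top, which is why the L40S configuration gives more headroom):

```python
params = 48e9          # 48 billion parameters
bytes_per_param = 2    # bfloat16 = 2 bytes per weight
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~96 GB, excluding runtime overhead
```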
## Memory Management
This Space is optimized for limited VRAM (92GB across 4x L4):
- Direct Evaluation: Skip model pre-loading and go straight to evaluation (recommended)
- Automatic Cleanup: Any pre-loaded model is unloaded before evaluation starts
- Aggressive Memory Clearing: Multiple garbage collection passes + 5s wait time
- Single Instance: Only lm_eval's model instance runs during evaluation
- Batch Size: Set to 1 to minimize memory usage during evaluation
- Device Mapping: Automatic distribution across available GPUs
- Memory Fragmentation: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set by default (see the sketch below)
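
A minimal sketch of the cleanup and allocator settings described above, assuming they are applied from Python at startup; the pass count and wait time mirror the list but are illustrative:

```python
import gc
import os
import time

# Must be set before torch initializes CUDA for the allocator option to apply.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402


def free_preloaded_model(model):
    """Release a pre-loaded model so lm_eval can use the full VRAM."""
    del model
    for _ in range(3):           # multiple garbage-collection passes
        gc.collect()
    torch.cuda.empty_cache()     # return cached blocks to the driver
    time.sleep(5)                # brief wait for memory to settle
```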
## Technical Details

### Fine-tuning Configuration
- Method: QLoRA
- LoRA Rank: 16
- LoRA Alpha: 32
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Training: Attention layers only (a configuration sketch follows this list)
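
For illustration, the configuration above expressed as a PEFT `LoraConfig`; values not listed in the bullets (dropout, bias, task type) are assumptions:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,            # assumption: not stated above
    bias="none",                  # assumption
    task_type="CAUSAL_LM",
)
```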
### Benchmark Details

#### ARC-Challenge
- AI2 Reasoning Challenge
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm
#### TruthfulQA
- Tests model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (mc2); bleu/rouge apply only to the generative variant
#### Winogrande
- Common sense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy
## Support & Resources
Powered by LM Evaluation Harness | Built with ❤️