---
title: Kimi 48B Fine-tuned - Evaluation
emoji: 📊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---

πŸ“Š Kimi Linear 48B A3B Instruct - Evaluation

An evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. Chat/inference functionality is currently disabled; this Space focuses on running benchmarks and evaluations only.

Model Information

Features

πŸ“Š Model Evaluation

  • LM Evaluation Harness integration
  • Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
  • Automated testing and reporting
  • Results saved for analysis

⚑ High-Performance

  • Multi-GPU model loading (see the loading sketch after this list)
  • Optimized memory distribution
  • bfloat16 precision
  • Supports 48B parameter models
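
For illustration, here is a minimal sketch of how a 48B checkpoint can be sharded across several GPUs with Transformers. The repository id is a placeholder, and trust_remote_code is assumed to be required for this architecture:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-username/kimi-linear-48b-finetuned"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~96 GB of weights in bf16
    device_map="auto",           # spread layers across all visible GPUs
    trust_remote_code=True,
)
```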

βš™οΈ Easy to Use

  • Simple Gradio interface (a minimal wiring sketch follows this list)
  • One-click model loading
  • Select benchmarks via checkboxes
  • Real-time progress updates
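
As a rough sketch of how such a control panel can be wired up in Gradio (the callback body is a stub, not the Space's actual evaluation code):

```python
import gradio as gr

BENCHMARKS = ["ARC-Challenge", "TruthfulQA", "Winogrande"]

def start_evaluation(selected):
    # Stub: the real Space would launch lm_eval here and stream progress.
    return f"Would evaluate: {', '.join(selected) or 'nothing selected'}"

with gr.Blocks() as demo:
    tasks = gr.CheckboxGroup(BENCHMARKS, label="Benchmarks")
    run_btn = gr.Button("🚀 Start Evaluation")
    status = gr.Textbox(label="Status")
    run_btn.click(start_evaluation, inputs=tasks, outputs=status)

demo.launch(server_name="0.0.0.0", server_port=7860)  # matches app_port above
```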

Usage

Quick Start

Option 1: Direct Evaluation (Recommended)

  1. Go directly to the "πŸ“Š Evaluation" tab
  2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
  3. Click "πŸš€ Start Evaluation"
  4. lm_eval will automatically load and evaluate the model (see the sketch after this list)
  5. Wait 30-60 minutes for results
  6. Results will be displayed and saved to /tmp/eval_results_[timestamp]/
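
Under the hood this corresponds roughly to the LM Evaluation Harness Python API. The model id below is a placeholder, and the exact arguments the Space passes may differ:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=your-username/kimi-linear-48b-finetuned,"
        "dtype=bfloat16,parallelize=True,trust_remote_code=True"
    ),
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=1,
)
print(results["results"])  # per-task metrics
```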

Option 2: With Model Verification

  1. (Optional) Click "πŸš€ Load Model" in Controls tab to verify setup (5-10 min)
  2. Go to the "πŸ“Š Evaluation" tab
  3. Select benchmarks and click "πŸš€ Start Evaluation"
  4. The pre-loaded model will be automatically unloaded to free VRAM
  5. lm_eval will load its own fresh instance for evaluation
  6. Wait 30-60 minutes for results

View Results

  • Evaluation results include metrics for each benchmark
  • Results are automatically formatted and displayed
  • Full results JSON files are saved for detailed analysis (a loading sketch follows)
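
A small helper along these lines can pull the saved metrics back out. The /tmp/eval_results_* layout matches the path above, while the exact file names depend on lm_eval's output format:

```python
import glob
import json

for path in sorted(glob.glob("/tmp/eval_results_*/**/*.json", recursive=True)):
    with open(path) as f:
        data = json.load(f)
    # lm_eval result files keep per-task metrics under the "results" key.
    for task, metrics in data.get("results", {}).items():
        print(f"{path}: {task} -> {metrics}")
```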

Why LM Evaluation Harness?

The LM Evaluation Harness is a standard framework for evaluating language models:

  • Standardized: Consistent benchmarks across models
  • Comprehensive: Wide variety of tasks and metrics
  • Reproducible: Deterministic evaluation results
  • Trusted: Used by major research organizations

Hardware Requirements

  • Recommended: 4x NVIDIA L40S (192GB VRAM)
  • Minimum: 4x NVIDIA L4 (96GB VRAM)
  • Model Size: ~96GB in bfloat16 (48B parameters × 2 bytes per parameter)

Memory Management

This Space is optimized for limited VRAM (92GB across 4x L4):

  • Direct Evaluation: Skip model pre-loading and go straight to evaluation (recommended)
  • Automatic Cleanup: Any pre-loaded model is unloaded before evaluation starts
  • Aggressive Memory Clearing: multiple garbage collection passes plus a 5s wait (see the sketch after this list)
  • Single Instance: Only lm_eval's model instance runs during evaluation
  • Batch Size: Set to 1 to minimize memory usage during evaluation
  • Device Mapping: Automatic distribution across available GPUs
  • Memory Fragmentation: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set by default
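
A sketch of what this cleanup sequence can look like in code (simplified relative to what the Space actually does; names are illustrative):

```python
import gc
import os
import time

import torch

# Must be set before the first CUDA allocation to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def free_vram(state):
    """Drop a pre-loaded model and reclaim GPU memory before lm_eval starts."""
    state["model"] = None        # release the last reference to the weights
    for _ in range(3):           # several GC passes to break reference cycles
        gc.collect()
    torch.cuda.empty_cache()     # hand cached blocks back to the allocator
    torch.cuda.synchronize()
    time.sleep(5)                # the 5 s settle time mentioned above
```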

Technical Details

Fine-tuning Configuration

  • Method: QLoRA (a PEFT configuration sketch follows this list)
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Training: Attention layers only
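
In PEFT terms, the settings above correspond roughly to the following LoraConfig. The task_type is an assumption, and quantization (the "Q" in QLoRA) would be handled separately, e.g. via bitsandbytes:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",        # assumed; not stated in this README
)
```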

Benchmark Details

ARC-Challenge

  • AI2 Reasoning Challenge (Challenge set)
  • 1,172 multiple-choice science questions
  • Tests complex reasoning and knowledge
  • Metrics: accuracy, accuracy_norm

TruthfulQA

  • Tests the model's truthfulness
  • Multiple-choice format (mc2)
  • Evaluates factual correctness
  • Metrics: accuracy (mc2); BLEU/ROUGE apply only to the generation variant

Winogrande

  • Commonsense reasoning
  • Pronoun resolution tasks
  • 1,267 test questions
  • Metrics: accuracy

Support & Resources


Powered by LM Evaluation Harness πŸ“Š | Built with ❀️