---
title: Kimi 48B Fine-tuned - Evaluation
emoji: 📊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---

# 📊 Kimi Linear 48B A3B Instruct - Evaluation

An evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. Chat/inference functionality is currently disabled; this Space only runs benchmarks and evaluations.

## Model Information

This Space evaluates a QLoRA fine-tune of Kimi-Linear-48B-A3B-Instruct (48B parameters, roughly 96 GB of weights in bfloat16). See the Technical Details section below for the fine-tuning configuration.

## Features

### 📊 Model Evaluation

- LM Evaluation Harness integration (see the sketch below)
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
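
A minimal sketch of how an evaluation run could look through the harness's Python API (the checkpoint path is a placeholder, and exact arguments vary across harness versions):

```python
import lm_eval

# Placeholder path; the Space loads its own fine-tuned checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./kimi-linear-48b-finetuned,dtype=bfloat16",
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)

# Per-task metrics are returned under the "results" key.
for task, metrics in results["results"].items():
    print(task, metrics)
```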

### ⚡ High Performance

- Multi-GPU model loading (see the loading sketch below)
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models
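
A minimal sketch of that loading path with transformers and accelerate (the model path is a placeholder; `device_map="auto"` shards the weights across all visible GPUs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./kimi-linear-48b-finetuned"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # ~96 GB of weights for 48B parameters
    device_map="auto",           # distribute layers across available GPUs
    trust_remote_code=True,      # assumed necessary for this architecture
)
```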

βš™οΈ Easy to Use

- Simple Gradio interface (see the sketch below)
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
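
A minimal sketch of the checkbox-driven Gradio flow (the handler body is a stub; the real Space wires this button to the evaluation code):

```python
import gradio as gr

def start_evaluation(selected):
    # Stub: the actual handler launches the lm-eval run.
    return f"Starting evaluation for: {', '.join(selected)}"

with gr.Blocks() as demo:
    benchmarks = gr.CheckboxGroup(
        choices=["ARC-Challenge", "TruthfulQA", "Winogrande"],
        label="Benchmarks",
    )
    run_btn = gr.Button("🚀 Start Evaluation")
    status = gr.Textbox(label="Status")
    run_btn.click(start_evaluation, inputs=benchmarks, outputs=status)

demo.launch(server_name="0.0.0.0", server_port=7860)  # app_port from the metadata
```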

## Usage

### Quick Start

1. **Load Model**
   - Click the "🚀 Load Model" button in the Controls tab
   - Wait 5-10 minutes for model initialization
   - The model will be distributed across the available GPUs
   - Look for "✅ Model loaded successfully"
2. **Run Evaluation**
   - Go to the "📊 Evaluation" tab
   - Select the benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
   - Click "🚀 Start Evaluation"
   - Wait 30-60 minutes for results
   - Results are displayed and saved to `/tmp/eval_results_[timestamp]/`
3. **View Results** (see the sketch after this list)
   - Evaluation results include metrics for each benchmark
   - Results are automatically formatted and displayed
   - Full results JSON files are saved for detailed analysis
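
A minimal sketch of inspecting a saved results file afterwards (the directory layout and file naming are assumptions based on the path above):

```python
import glob
import json

# Assumed layout: JSON files under /tmp/eval_results_<timestamp>/
latest = sorted(glob.glob("/tmp/eval_results_*/*.json"))[-1]

with open(latest) as f:
    results = json.load(f)

# Harness output keeps per-task metrics under the "results" key.
for task, metrics in results.get("results", {}).items():
    print(f"{task}: {metrics}")
```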

## Why LM Evaluation Harness?

The LM Evaluation Harness is a standard framework for evaluating language models:

- **Standardized**: Consistent benchmarks across models
- **Comprehensive**: Wide variety of tasks and metrics
- **Reproducible**: Deterministic evaluation results
- **Trusted**: Used by major research organizations

## Hardware Requirements

- **Recommended**: 4x NVIDIA L40S (192 GB total VRAM)
- **Minimum**: 4x NVIDIA L4 (96 GB total VRAM)
- **Model Size**: ~96 GB in bfloat16 (see the arithmetic below)
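
The weight footprint follows directly from the parameter count; a quick back-of-the-envelope check (weights only, excluding activations and KV cache):

```python
params = 48e9        # 48B parameters
bytes_per_param = 2  # bfloat16 = 16 bits = 2 bytes
print(f"~{params * bytes_per_param / 1e9:.0f} GB")  # ~96 GB
```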

## Technical Details

### Fine-tuning Configuration

- **Method**: QLoRA
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, k_proj, v_proj, o_proj
- **Training**: Attention layers only (see the configuration sketch below)
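
A minimal sketch of this setup with peft and bitsandbytes (the LoRA hyperparameters are the ones listed above; the 4-bit quantization settings are assumptions, not taken from this Space):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization for QLoRA (assumed settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention projections only, as listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```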

### Benchmark Details

#### ARC-Challenge

- AI2 Reasoning Challenge (Challenge set)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm

#### TruthfulQA

- Tests the model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (BLEU/ROUGE apply only to the generation variant, not mc2)

#### Winogrande

- Commonsense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy
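
For reference, a plausible mapping from these benchmarks to lm-evaluation-harness task IDs and headline metrics (metric key spellings vary across harness versions, so treat this as an assumption):

```python
# Benchmark            harness task ID     headline metric(s)
TASK_MAP = {
    "ARC-Challenge": ("arc_challenge",  ["acc", "acc_norm"]),
    "TruthfulQA":    ("truthfulqa_mc2", ["acc"]),
    "Winogrande":    ("winogrande",     ["acc"]),
}
```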

Powered by LM Evaluation Harness 📊 | Built with ❤️