---
title: Kimi 48B Fine-tuned - Evaluation
emoji: 📊
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---

# 📊 Kimi Linear 48B A3B Instruct - Evaluation

An evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. Chat/inference functionality is currently disabled; this Space only runs benchmarks and evaluations.

## Model Information

This Space evaluates a QLoRA fine-tune of Kimi-Linear-48B-A3B-Instruct (48B parameters, roughly 96 GB of weights in bfloat16). See the Technical Details section below for the fine-tuning configuration.

## Features

### 📊 Model Evaluation

- LM Evaluation Harness integration (see the sketch below)
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
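
A minimal sketch of how an evaluation run could look through the harness's Python API (the checkpoint path is a placeholder, and exact arguments vary across harness versions):

```python
import lm_eval

# Placeholder path; the Space loads its own fine-tuned checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./kimi-linear-48b-finetuned,dtype=bfloat16",
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)

# Per-task metrics are returned under the "results" key.
for task, metrics in results["results"].items():
    print(task, metrics)
```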

### ⚡ High Performance

- Multi-GPU model loading (see the loading sketch below)
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models
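
A minimal sketch of that loading path with transformers and accelerate (the model path is a placeholder; `device_map="auto"` shards the weights across all visible GPUs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./kimi-linear-48b-finetuned"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # ~96 GB of weights for 48B parameters
    device_map="auto",           # distribute layers across available GPUs
    trust_remote_code=True,      # assumed necessary for this architecture
)
```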

βš™οΈ Easy to Use

- Simple Gradio interface (see the sketch below)
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
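
A minimal sketch of the checkbox-driven Gradio flow (the handler body is a stub; the real Space wires this button to the evaluation code):

```python
import gradio as gr

def start_evaluation(selected):
    # Stub: the actual handler launches the lm-eval run.
    return f"Starting evaluation for: {', '.join(selected)}"

with gr.Blocks() as demo:
    benchmarks = gr.CheckboxGroup(
        choices=["ARC-Challenge", "TruthfulQA", "Winogrande"],
        label="Benchmarks",
    )
    run_btn = gr.Button("🚀 Start Evaluation")
    status = gr.Textbox(label="Status")
    run_btn.click(start_evaluation, inputs=benchmarks, outputs=status)

demo.launch(server_name="0.0.0.0", server_port=7860)  # app_port from the metadata
```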

## Usage

### Quick Start

1. **Load Model**
   - Click the "🚀 Load Model" button in the Controls tab
   - Wait 5-10 minutes for model initialization
   - The model will be distributed across the available GPUs
   - Look for "✅ Model loaded successfully"
2. **Run Evaluation**
   - Go to the "📊 Evaluation" tab
   - Select the benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
   - Click "🚀 Start Evaluation"
   - Wait 30-60 minutes for results
   - Results are displayed and saved to `/tmp/eval_results_[timestamp]/`
3. **View Results** (see the sketch after this list)
   - Evaluation results include metrics for each benchmark
   - Results are automatically formatted and displayed
   - Full results JSON files are saved for detailed analysis
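
A minimal sketch of inspecting a saved results file afterwards (the directory layout and file naming are assumptions based on the path above):

```python
import glob
import json

# Assumed layout: JSON files under /tmp/eval_results_<timestamp>/
latest = sorted(glob.glob("/tmp/eval_results_*/*.json"))[-1]

with open(latest) as f:
    results = json.load(f)

# Harness output keeps per-task metrics under the "results" key.
for task, metrics in results.get("results", {}).items():
    print(f"{task}: {metrics}")
```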

## Why LM Evaluation Harness?

The LM Evaluation Harness is a standard framework for evaluating language models:

- **Standardized**: Consistent benchmarks across models
- **Comprehensive**: Wide variety of tasks and metrics
- **Reproducible**: Deterministic evaluation results
- **Trusted**: Used by major research organizations

## Hardware Requirements

- **Recommended**: 4x NVIDIA L40S (192 GB total VRAM)
- **Minimum**: 4x NVIDIA L4 (96 GB total VRAM)
- **Model Size**: ~96 GB in bfloat16 (see the arithmetic below)
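
The weight footprint follows directly from the parameter count; a quick back-of-the-envelope check (weights only, excluding activations and KV cache):

```python
params = 48e9        # 48B parameters
bytes_per_param = 2  # bfloat16 = 16 bits = 2 bytes
print(f"~{params * bytes_per_param / 1e9:.0f} GB")  # ~96 GB
```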

## Technical Details

### Fine-tuning Configuration

- **Method**: QLoRA
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, k_proj, v_proj, o_proj
- **Training**: Attention layers only (see the configuration sketch below)
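
A minimal sketch of this setup with peft and bitsandbytes (the LoRA hyperparameters are the ones listed above; the 4-bit quantization settings are assumptions, not taken from this Space):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization for QLoRA (assumed settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention projections only, as listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```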

### Benchmark Details

#### ARC-Challenge

- AI2 Reasoning Challenge (Challenge set)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm

#### TruthfulQA

- Tests the model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (BLEU/ROUGE apply only to the generation variant, not mc2)

#### Winogrande

- Commonsense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy
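
For reference, a plausible mapping from these benchmarks to lm-evaluation-harness task IDs and headline metrics (metric key spellings vary across harness versions, so treat this as an assumption):

```python
# Benchmark            harness task ID     headline metric(s)
TASK_MAP = {
    "ARC-Challenge": ("arc_challenge",  ["acc", "acc_norm"]),
    "TruthfulQA":    ("truthfulqa_mc2", ["acc"]),
    "Winogrande":    ("winogrande",     ["acc"]),
}
```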

Powered by LM Evaluation Harness 📊 | Built with ❤️