---
title: Kimi 48B Fine-tuned - Evaluation
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---
# Kimi Linear 48B A3B Instruct - Evaluation
Model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. Chat/inference functionality is currently disabled; this Space focuses on running benchmarks and evaluations only.
## Model Information
- Model: optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune
- Base Model: moonshotai/Kimi-Linear-48B-A3B-Instruct
- Parameters: 48 Billion
- Fine-tuning: QLoRA on attention layers
- Evaluation Framework: LM Evaluation Harness
## Features

### Model Evaluation
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
### High-Performance
- Multi-GPU model loading
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models (see the loading sketch below)
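
For reference, here is a minimal sketch of how a model this size can be loaded across several GPUs in bfloat16 with Transformers. The `device_map="auto"` sharding and `trust_remote_code=True` flag are assumptions about how the Space loads the checkpoint, not a copy of its actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

# Load the tokenizer and model, letting accelerate shard the weights
# across all visible GPUs. bfloat16 roughly halves the footprint vs. float32.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # ~96 GB of weights for 48B parameters
    device_map="auto",            # distribute layers across the available GPUs
    trust_remote_code=True,       # assumption: Kimi-Linear ships custom model code
)
```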
### Easy to Use
- Simple Gradio interface (a minimal wiring sketch follows this list)
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
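
A hypothetical sketch of how such an interface can be wired in Gradio; the checkbox group, button label, and handler below are illustrative, not the Space's actual app code:

```python
import gradio as gr

BENCHMARKS = ["arc_challenge", "truthfulqa_mc2", "winogrande"]

def run_evaluation(selected):
    # Placeholder handler: the real Space would invoke lm_eval here.
    return f"Would evaluate on: {', '.join(selected)}"

with gr.Blocks() as demo:
    gr.Markdown("## Evaluation")
    tasks = gr.CheckboxGroup(BENCHMARKS, label="Benchmarks", value=BENCHMARKS)
    start = gr.Button("Start Evaluation")
    status = gr.Textbox(label="Status")
    start.click(run_evaluation, inputs=tasks, outputs=status)

demo.launch(server_name="0.0.0.0", server_port=7860)  # matches app_port above
```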
## Usage

### Quick Start

#### Option 1: Direct Evaluation (Recommended)
1. Go directly to the "Evaluation" tab
2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "Start Evaluation"
4. lm_eval will automatically load and evaluate the model
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/` (a programmatic equivalent is sketched below)
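
Roughly what the Evaluation tab triggers, sketched via the LM Evaluation Harness Python API; the exact `model_args` the Space passes are an assumption:

```python
import lm_eval

# Sketch: run the same three benchmarks through the harness's Python API.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune,"
        "dtype=bfloat16,parallelize=True,trust_remote_code=True"
    ),
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=1,  # matches the Space's memory-saving default
)

# Per-task metrics live under the "results" key of the returned dict.
for task, metrics in results["results"].items():
    print(task, metrics)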
#### Option 2: With Model Verification
1. (Optional) Click "Load Model" in the Controls tab to verify the setup (5-10 min)
2. Go to the "Evaluation" tab
3. Select benchmarks and click "Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results
### View Results
- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis (see the parsing sketch below)
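
A small sketch for loading the saved results afterwards; the directory layout under `/tmp` is assumed from the steps above, and the exact file names inside it may differ:

```python
import glob
import json

# Hypothetical path pattern: adjust it to match your run's timestamp.
result_files = glob.glob("/tmp/eval_results_*/**/*.json", recursive=True)

for path in result_files:
    with open(path) as f:
        data = json.load(f)
    # lm_eval stores per-task metrics under the "results" key.
    for task, metrics in data.get("results", {}).items():
        print(f"{path}: {task}")
        for name, value in metrics.items():
            print(f"  {name}: {value}")
```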
## Why LM Evaluation Harness?
The LM Evaluation Harness is a standard framework for evaluating language models:
- Standardized: Consistent benchmarks across models
- Comprehensive: Wide variety of tasks and metrics
- Reproducible: Deterministic evaluation results
- Trusted: Used by major research organizations
## Hardware Requirements
- Recommended: 4x NVIDIA L40S (192GB VRAM)
- Minimum: 4x NVIDIA L4 (96GB VRAM)
- Model Size: ~96GB in bfloat16 (see the back-of-envelope calculation below)
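
The ~96GB figure follows directly from the parameter count; a quick back-of-envelope check (activations, KV cache, and CUDA overhead come on top, which is why the L40S configuration gives more headroom):

```python
params = 48e9          # 48 billion parameters
bytes_per_param = 2    # bfloat16 = 2 bytes per weight
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~96 GB, excluding runtime overhead
```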
## Memory Management
This Space is optimized for limited VRAM (92GB across 4x L4):
- Direct Evaluation: Skip model pre-loading and go straight to evaluation (recommended)
- Automatic Cleanup: Any pre-loaded model is unloaded before evaluation starts
- Aggressive Memory Clearing: Multiple garbage collection passes + 5s wait time
- Single Instance: Only lm_eval's model instance runs during evaluation
- Batch Size: Set to 1 to minimize memory usage during evaluation
- Device Mapping: Automatic distribution across available GPUs
- Memory Fragmentation: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set by default (see the sketch below)
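
A minimal sketch of the cleanup and allocator settings described above, assuming they are applied from Python at startup; the pass count and wait time mirror the list but are illustrative:

```python
import gc
import os
import time

# Must be set before torch initializes CUDA for the allocator option to apply.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402


def free_preloaded_model(model):
    """Release a pre-loaded model so lm_eval can use the full VRAM."""
    del model
    for _ in range(3):           # multiple garbage-collection passes
        gc.collect()
    torch.cuda.empty_cache()     # return cached blocks to the driver
    time.sleep(5)                # brief wait for memory to settle
```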
## Technical Details

### Fine-tuning Configuration
- Method: QLoRA
- LoRA Rank: 16
- LoRA Alpha: 32
- Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
- Training: Attention layers only (a configuration sketch follows this list)
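
For illustration, the configuration above expressed as a PEFT `LoraConfig`; values not listed in the bullets (dropout, bias, task type) are assumptions:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,            # assumption: not stated above
    bias="none",                  # assumption
    task_type="CAUSAL_LM",
)
```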
### Benchmark Details

#### ARC-Challenge
- AI2 Reasoning Challenge
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm
#### TruthfulQA
- Tests model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (mc2); bleu/rouge apply only to the generative variant
#### Winogrande
- Common sense reasoning
- Pronoun resolution tasks
- 1,267 test questions
- Metrics: accuracy
## Support & Resources
Powered by LM Evaluation Harness | Built with ❤️