---
title: Kimi 48B Fine-tuned - Evaluation
emoji: πŸ“Š
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
suggested_hardware: l4x4
---
# πŸ“Š Kimi Linear 48B A3B Instruct - Evaluation
A model evaluation Space for the fine-tuned Kimi-Linear-48B-A3B-Instruct model. **Chat/inference functionality is currently disabled**; this Space runs benchmarks and evaluations only.
## Model Information
- **Model:** [optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- **Base Model:** [moonshotai/Kimi-Linear-48B-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- **Parameters:** 48 billion total (≈3B activated per token, per the A3B designation)
- **Fine-tuning:** QLoRA on attention layers
- **Evaluation Framework:** LM Evaluation Harness
## Features
πŸ“Š **Model Evaluation**
- LM Evaluation Harness integration
- Multiple benchmark support (ARC-Challenge, TruthfulQA, Winogrande)
- Automated testing and reporting
- Results saved for analysis
⚑ **High Performance**
- Multi-GPU model loading (see the sketch below)
- Optimized memory distribution
- bfloat16 precision
- Supports 48B-parameter models
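As a rough sketch of what multi-GPU bfloat16 loading looks like in `transformers` (the repo id comes from the model card above; `trust_remote_code=True` is an assumption, since Kimi models typically ship custom modeling code):

```python
# Minimal sketch: shard the model across all visible GPUs in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # ~96GB of weights instead of ~192GB in fp32
    device_map="auto",           # let accelerate distribute layers across GPUs
    trust_remote_code=True,      # assumption: custom Kimi modeling code
)
```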
βš™οΈ **Easy to Use**
- Simple Gradio interface
- One-click model loading
- Select benchmarks via checkboxes
- Real-time progress updates
## Usage
### Quick Start
**Option 1: Direct Evaluation (Recommended)**
1. Go directly to the "πŸ“Š Evaluation" tab
2. Select benchmarks to run (ARC-Challenge, TruthfulQA, Winogrande)
3. Click "πŸš€ Start Evaluation"
4. lm_eval will automatically load and evaluate the model (the sketch after these steps shows the equivalent Python call)
5. Wait 30-60 minutes for results
6. Results will be displayed and saved to `/tmp/eval_results_[timestamp]/`
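For readers who want to reproduce the run outside the UI, the steps above correspond roughly to the harness's Python API. This is a hedged sketch, not the Space's actual code; the `model_args` string is an assumption, but the task ids are the standard harness names for the three benchmarks:

```python
# Rough equivalent of "Start Evaluation" via the lm-eval Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune,"
        "dtype=bfloat16,trust_remote_code=True,parallelize=True"  # assumed flags
    ),
    tasks=["arc_challenge", "truthfulqa_mc2", "winogrande"],
    batch_size=1,  # matches the Space's memory-saving setting
)
print(results["results"])  # per-task metric dict
```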
**Option 2: With Model Verification**
1. **(Optional)** Click "πŸš€ Load Model" in Controls tab to verify setup (5-10 min)
2. Go to the "πŸ“Š Evaluation" tab
3. Select benchmarks and click "πŸš€ Start Evaluation"
4. The pre-loaded model will be automatically unloaded to free VRAM (cleanup sketched below)
5. lm_eval will load its own fresh instance for evaluation
6. Wait 30-60 minutes for results
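The unload in step 4 is the usual PyTorch cleanup sequence. Here is a minimal sketch of what "aggressive memory clearing" means, assuming the model is kept in some app-level state dict (the Space's exact code may differ):

```python
import gc
import time
import torch

def unload_model(state: dict) -> None:
    """Drop the pre-loaded model and reclaim VRAM before lm_eval starts."""
    state.pop("model", None)   # hypothetical holder: remove the last strong reference
    for _ in range(3):         # multiple GC passes, per the memory notes below
        gc.collect()
    torch.cuda.empty_cache()   # hand cached blocks back to the driver
    torch.cuda.synchronize()   # make sure pending frees have completed
    time.sleep(5)              # the 5-second settle time mentioned in this README
```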
**View Results**
- Evaluation results include metrics for each benchmark
- Results are automatically formatted and displayed
- Full results JSON files are saved for detailed analysis (see the sketch below for pulling out headline metrics)
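A quick way to summarize a saved run. The timestamped directory name below is illustrative (it varies per run), and the filename and layout assume the harness's standard top-level `results` dict:

```python
import json

# Illustrative path: substitute the actual /tmp/eval_results_[timestamp]/ directory.
with open("/tmp/eval_results_20250101_120000/results.json") as f:
    data = json.load(f)

for task, metrics in data["results"].items():
    for name, value in metrics.items():
        if isinstance(value, float):  # skip non-numeric fields such as "alias"
            print(f"{task:20s} {name:15s} {value:.4f}")
```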
## Why LM Evaluation Harness?
The LM Evaluation Harness is a standard framework for evaluating language models:
- **Standardized:** Consistent benchmarks across models
- **Comprehensive:** Wide variety of tasks and metrics
- **Reproducible:** Deterministic evaluation results
- **Trusted:** Used by major research organizations
## Hardware Requirements
- **Recommended:** 4x NVIDIA L40S (192GB VRAM)
- **Minimum:** 4x NVIDIA L4 (96GB VRAM)
- **Model Size:** ~96GB in bfloat16
### Memory Management
This Space is optimized for limited VRAM (96GB across 4x L4):
- **Direct Evaluation:** Skip model pre-loading and go straight to evaluation (recommended)
- **Automatic Cleanup:** Any pre-loaded model is unloaded before evaluation starts
- **Aggressive Memory Clearing:** Multiple garbage collection passes + 5s wait time
- **Single Instance:** Only lm_eval's model instance runs during evaluation
- **Batch Size:** Set to 1 to minimize memory usage during evaluation
- **Device Mapping:** Automatic distribution across available GPUs
- **Memory Fragmentation:** `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` is set by default (see the sketch below)
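The fragmentation setting only helps if it is in the environment before PyTorch initializes CUDA, which is why the Space sets it by default. A minimal sketch of the conventional pattern (the surrounding app code is assumed):

```python
# Set the allocator flag before anything touches CUDA; in practice this sits
# at the very top of the app's entry point, before `import torch`.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (deliberately imported after the env var is set)
```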
## Technical Details
### Fine-tuning Configuration
- **Method:** QLoRA (sketched below)
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Training:** Attention layers only
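In `peft`/`transformers` terms, this configuration looks roughly like the sketch below. Only the rank, alpha, and target modules come from the list above; the 4-bit quantization settings and dropout are typical QLoRA defaults, assumed here for illustration:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # the "Q" in QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",           # assumption: standard QLoRA quant type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                # LoRA rank (from the list above)
    lora_alpha=32,                       # LoRA alpha (from the list above)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,                   # assumption: common default
    task_type="CAUSAL_LM",
)
```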
### Benchmark Details
**ARC-Challenge**
- AI2 Reasoning Challenge (Challenge set)
- 1,172 multiple-choice science questions
- Tests complex reasoning and knowledge
- Metrics: accuracy, accuracy_norm
**TruthfulQA**
- Tests model's truthfulness
- Multiple-choice format (mc2)
- Evaluates factual correctness
- Metrics: accuracy (mc2)
**Winogrande**
- Common sense reasoning
- Pronoun resolution tasks
- 1,267 evaluation questions (validation split)
- Metrics: accuracy
## Support & Resources
- [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [Model Page](https://huggingface.co/optiviseapp/kimi-linear-48b-a3b-instruct-fine-tune)
- [Base Model Page](https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
---
**Powered by LM Evaluation Harness** πŸ“Š | Built with ❀️