---
title: AI Product Evals Framework
emoji: ⚖️
colorFrom: gray
colorTo: purple
sdk: docker
app_port: 3000
pinned: false
short_description: Your AI Product needs Evals
license: mit
---
# AI Product Evals Framework

**Unit Tests | Model & Human Eval | A/B Testing**

A complete evaluation framework for AI products.
## How It Works

```
LLM Invocations → Logging Traces → Eval & Curation → Improve Model
       ↑                                                  ↓
       └──────────── Fine-Tune + Prompt Eng. ←────────────┘
```
Every production AI system needs a feedback loop. This framework provides three levels of evaluation:
### Level 1: Unit Tests

- **Write Unit Tests** - Define what to test
- **Create Test Cases** - Build a dataset (prompts, expected outputs, criteria)
- **Run & Review** - Run in the UI and view a pass/fail summary
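A test-case file for this level might look like the following sketch. The field names (`prompt`, `expected`, `criteria`) mirror the dataset description above but are illustrative assumptions, not a documented schema:

```json
[
  {
    "prompt": "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
    "expected": "A single-sentence summary mentioning the fox and the dog.",
    "criteria": "Output is exactly one sentence and stays factual to the input."
  }
]
```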
### Level 2: Model & Human Eval

- **Log Traces** - Upload traces as JSON/CSV
- **Look at Traces** - Browse and inspect them in the UI
- **Model & Human** - Model scoring plus human accept/reject
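An uploaded trace file could be shaped roughly like this; the field names are illustrative assumptions about what a logged LLM invocation carries (input, output, model, timestamp):

```json
[
  {
    "trace_id": "t-001",
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "timestamp": "2025-01-15T10:32:00Z"
  }
]
```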
### Level 3: A/B Testing

- **Define Variants** - Two prompt/system configurations
- **Run Comparison** - Same test cases, both variants
- **Analyze Results** - Winner recommendation
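One simple way the winner recommendation could work is comparing per-variant pass rates over the shared test cases. This is a minimal sketch under that assumption; the type and function names (`VariantResults`, `recommendWinner`) are hypothetical, not the framework's API:

```typescript
// One entry per test case: did this variant pass it?
type VariantResults = { name: string; passes: boolean[] };

// Fraction of test cases the variant passed (0 for an empty run).
function passRate(r: VariantResults): number {
  if (r.passes.length === 0) return 0;
  return r.passes.filter(Boolean).length / r.passes.length;
}

// Recommend the variant with the higher pass rate; "tie" when equal.
function recommendWinner(a: VariantResults, b: VariantResults): string {
  const ra = passRate(a);
  const rb = passRate(b);
  if (ra === rb) return "tie";
  return ra > rb ? a.name : b.name;
}

const variantA: VariantResults = { name: "A", passes: [true, true, false, true] };
const variantB: VariantResults = { name: "B", passes: [true, false, false, true] };
console.log(recommendWinner(variantA, variantB)); // "A" (0.75 vs 0.5)
```

A real comparison would likely also report per-case diffs and flag ties or small samples rather than declaring a winner outright.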
## Models

Powered by the HuggingFace Inference API, which supports the latest open-source LLMs.
## Usage

1. Choose a mode: Unit Tests, Model & Human Eval, or A/B Testing
2. Load demo data or upload your own JSON/CSV files
3. Run evaluations and review results
## Development

```bash
# Create .env.local with your HuggingFace token
echo "HF_TOKEN=hf_your_token_here" > .env.local

# Install and run
npm install
npm run dev
```

Then visit http://localhost:3000.