---
title: Agentic Evaluation Framework
emoji: 🤖
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.45.0
app_file: app.py
pinned: false
---

# Agentic Evaluation Framework – Hugging Face Space

Upload a CSV/JSON/JSONL file with rows containing:

- `prompt` (or `instruction`)
- `response`
- `task` (qa, summarization, reasoning, etc.)
- `agent`
- `reference` (optional – used for accuracy / hallucination checks)
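A minimal sketch of a valid JSONL input using the columns above; the example rows and agent names are made up for illustration:

```python
import json

# Hypothetical example rows; `reference` may be omitted per row.
rows = [
    {"prompt": "What is the capital of France?", "response": "Paris.",
     "task": "qa", "agent": "agent-a", "reference": "Paris"},
    {"prompt": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat.",
     "task": "summarization", "agent": "agent-b"},
]

with open("sample_eval.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```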

Features:

- Rule-based scoring (instruction-following, coherence, grammar).
- Optional LLM-based hallucination detection (ComprehensiveHallucinationDetector) – toggleable in the UI.
- Per-task tabs with:
  - Per-example metrics table
  - Radar (spider) charts comparing agents
  - Horizontal leaderboard (downloadable)
  - Heatmap of metric correlations
- Exportable CSV report.
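Rule-based scores of this kind can be sketched as simple heuristics; the two below (length ratio for instruction-following, sentence capitalization for coherence) are illustrative assumptions, not the app's actual rules:

```python
def instruction_following_score(prompt: str, response: str) -> float:
    """Crude heuristic: empty responses score 0; otherwise reward
    responses that are not wildly shorter than the prompt."""
    if not response.strip():
        return 0.0
    ratio = len(response.split()) / max(len(prompt.split()), 1)
    return min(ratio, 1.0)

def coherence_score(response: str) -> float:
    """Crude heuristic: fraction of sentences starting with a capital."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return sum(s[0].isupper() for s in sentences) / len(sentences)
```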

Notes:

- The LLM judge uses transformer models and can be memory-heavy; enable it only when you have sufficient resources. The app falls back automatically if model loading fails.
- No Java dependency: the grammar check uses `LanguageToolPublicAPI`, so it works on Hugging Face Spaces.
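The fallback behaviour described above can be sketched as a try/except around model loading; `load_llm_judge` and `score_example` are hypothetical names, not the app's real functions:

```python
def load_llm_judge(loader):
    """Try to build the LLM judge; return None so callers can fall
    back to rule-based scoring if loading fails (e.g. out of memory)."""
    try:
        return loader()
    except Exception as exc:
        print(f"LLM judge unavailable ({exc}); falling back to rules.")
        return None

def score_example(example, loader):
    judge = load_llm_judge(loader)
    if judge is None:
        return "rule-based"   # fallback path
    return "llm-judged"
```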