
Evaluated on 53 gold Q&A pairs (41 grounded, 12 out-of-scope/advice), run offline on the stub provider.

| Metric | Plain LLM (no RAG) | ParaPilot (grounded) | Direction |
|---|---|---|---|
| Hallucination rate | 100.0% | 0.0% | lower is better |
| Answer correctness (grounded Qs) | 0.0% | 100.0% | higher is better |
| Groundedness / faithfulness | 0.0% | 95.7% | higher is better |
| Citation accuracy | 0.0% | 100.0% | higher is better |
| Refusal correctness (out-of-scope/advice) | 0.0% | 100.0% | higher is better |
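
The percentages above are per-subset rates, so each metric depends on which questions form its denominator. The sketch below shows one way two of them could be tallied. It is an illustration only, not ParaPilot's actual eval code: `EvalRecord`, `pct`, and `summarize` are hypothetical names, and the records are synthetic, built only to match the 41/12 split stated above.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical per-question result; field names are assumptions.
    in_scope: bool  # True for grounded questions, False for out-of-scope/advice
    correct: bool   # answer matched the gold answer (meaningful when in_scope)
    refused: bool   # system declined to answer


def pct(hits: int, total: int) -> float:
    """Percentage rounded to one decimal; 0.0 when the subset is empty."""
    return round(100.0 * hits / total, 1) if total else 0.0


def summarize(records):
    # Split once so each metric uses its own denominator.
    grounded = [r for r in records if r.in_scope]
    out_of_scope = [r for r in records if not r.in_scope]
    return {
        "answer_correctness": pct(sum(r.correct for r in grounded), len(grounded)),
        "refusal_correctness": pct(sum(r.refused for r in out_of_scope), len(out_of_scope)),
    }


# Synthetic records: 41 grounded answered correctly, 12 out-of-scope refused.
records = [EvalRecord(True, True, False)] * 41 + [EvalRecord(False, False, True)] * 12
print(summarize(records))
```

Answer correctness is computed over the 41 grounded questions only, and refusal correctness over the 12 out-of-scope/advice questions only, which is why both can reach 100.0% independently.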