Evaluated on 53 gold Q&A pairs (41 grounded, 12 out-of-scope/advice), run offline on the stub provider.

| Metric | Plain LLM (no RAG) | ParaPilot (grounded) | Direction |
|---|---|---|---|
| Hallucination rate | 100.0% | **0.0%** | lower is better |
| Answer correctness (grounded Qs) | 0.0% | **100.0%** | higher is better |
| Groundedness / faithfulness | 0.0% | **95.7%** | higher is better |
| Citation accuracy | 0.0% | **100.0%** | higher is better |
| Refusal correctness (out-of-scope/advice) | 0.0% | **100.0%** | higher is better |
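
For readers who want to see how rates of this kind are typically tallied, here is a minimal sketch that turns per-question judgments into percentages. It is not ParaPilot's evaluation harness: the `Judgment` record, its field names, and the `summarize` helper are hypothetical, and the toy records are illustrative rather than drawn from the 53-question gold set. It assumes answer correctness is scored over grounded questions only and refusal correctness over out-of-scope/advice questions only, matching the row labels above.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    # Hypothetical per-question record; not ParaPilot's actual schema.
    in_scope: bool  # True for grounded questions, False for out-of-scope/advice
    correct: bool   # answer matched the gold answer (meaningful when in_scope)
    refused: bool   # system declined to answer


def rate(flags):
    """Percentage of True values, rounded to one decimal; 0.0 for no items."""
    flags = list(flags)
    return round(100.0 * sum(flags) / len(flags), 1) if flags else 0.0


def summarize(judgments):
    # Split the set the same way the table does: grounded vs. out-of-scope.
    grounded = [j for j in judgments if j.in_scope]
    out_of_scope = [j for j in judgments if not j.in_scope]
    return {
        "answer_correctness": rate(j.correct for j in grounded),
        "refusal_correctness": rate(j.refused for j in out_of_scope),
    }


# Toy judgments, for illustration only (not the real evaluation data).
toy = [
    Judgment(in_scope=True, correct=True, refused=False),
    Judgment(in_scope=True, correct=False, refused=False),
    Judgment(in_scope=False, correct=False, refused=True),
]
print(summarize(toy))
```

Keeping the denominator per subset (grounded vs. out-of-scope) is what lets each row reach 100% independently of the other.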