Update readme

README.md

@@ -296,6 +296,27 @@ python scripts/run_eval.py --samples 20 --mode all
|
 | **Context Recall** | How well the retrieved contexts cover the ground truth |
 | **ROUGE-1 / ROUGE-2 / ROUGE-L** | N-gram overlap with ground truth answers |
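The ROUGE-1 score in the table above is essentially unigram-overlap F1 between a generated answer and the ground truth. A simplified sketch (for illustration only; actual runs should use a proper ROUGE library, and the function name here is made up):

```python
# Simplified ROUGE-1 sketch: unigram-overlap F1 between a candidate
# answer and the reference. Real ROUGE also handles stemming and
# tokenization; this only illustrates the idea.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```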
 
+### Results
+
+Benchmark on the HUST student regulation Q&A dataset (200 samples):
+
+| Metric                | vector_only | bm25_only | hybrid | hybrid_rerank |
+|-----------------------|:-----------:|:---------:|:------:|:-------------:|
+| **Answer Relevancy**  | 0.749 | 0.635 | 0.832 | **0.872** |
+| **Context Precision** | 0.678 | 0.538 | 0.795 | **0.861** |
+| **Context Recall**    | 0.815 | 0.732 | 0.849 | **0.872** |
+| **Faithfulness**      | 0.912 | 0.938 | **0.942** | 0.937 |
+| **ROUGE-1**           | 0.557 | 0.533 | 0.576 | **0.598** |
+| **ROUGE-2**           | 0.408 | 0.385 | 0.421 | **0.439** |
+| **ROUGE-L**           | 0.526 | 0.508 | 0.545 | **0.567** |
+
+**Key takeaways:**
+
+- **`hybrid_rerank` achieves the best score on 6 of the 7 metrics**, confirming it as the sensible default retrieval mode.
+- **Faithfulness is consistently high (>0.91 in every mode)**: the LLM reliably grounds its answers in the provided context, with minimal hallucination.
+- **Reranking markedly boosts Context Precision** (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker for filtering irrelevant documents.
+- **Hybrid search clearly outperforms either single-mode retriever**, validating the ensemble of semantic (vector) and lexical (BM25) search.
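The ensemble idea behind the `hybrid` mode can be sketched as a weighted fusion of min-max-normalized scores from the two retrievers. This is an illustrative sketch only; the weight `alpha`, the function names, and the normalization choice are assumptions, not this repo's implementation:

```python
# Illustrative hybrid-retrieval sketch: fuse per-document scores from a
# semantic (vector) retriever and a lexical (BM25) retriever.
# `alpha` and all names here are assumptions, not the repo's API.

def minmax(scores: dict) -> dict:
    """Rescale a doc -> score mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against identical scores
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(vector: dict, bm25: dict, alpha: float = 0.5) -> dict:
    """Weighted fusion: alpha * vector + (1 - alpha) * bm25."""
    v, b = minmax(vector), minmax(bm25)
    docs = set(v) | set(b)  # a doc may appear in only one retriever
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}

# Toy example: doc2 scores well in both retrievers, so it ranks first.
vector = {"doc1": 0.82, "doc2": 0.40, "doc3": 0.10}
bm25 = {"doc2": 12.0, "doc3": 3.0, "doc4": 7.5}
fused = hybrid_scores(vector, bm25, alpha=0.5)
```

In the full pipeline the fused ranking would then be passed to the reranker, which is what drives the Context Precision gains above.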
+
 Results are saved to `evaluation/results/` as both JSON and CSV files with timestamps.
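The newest run can be picked up for analysis along these lines (a sketch; the exact filename pattern inside `evaluation/results/` is an assumption):

```python
# Sketch: read back the most recent evaluation run from a results
# directory. The *.json naming inside evaluation/results/ is assumed.
import json
from pathlib import Path

def latest_results(results_dir="evaluation/results"):
    """Parse the most recently modified *.json results file, or None."""
    files = sorted(Path(results_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    return json.loads(files[-1].read_text()) if files else None
```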
 
 ---