hungnha committed on
Commit
225bdac
·
1 Parent(s): 92c9b4d

Update readme

Files changed (1)
  1. README.md +21 -0
README.md CHANGED
@@ -296,6 +296,27 @@ python scripts/run_eval.py --samples 20 --mode all
 | **Context Recall** | How well the retrieved contexts cover the ground truth |
 | **ROUGE-1 / ROUGE-2 / ROUGE-L** | N-gram overlap with ground truth answers |

+ ### Results
+
+ Benchmark on HUST student regulation Q&A dataset (200 samples):
+
+ | Metric | vector_only | bm25_only | hybrid | hybrid_rerank |
+ |---------------------|:-----------:|:---------:|:------:|:-------------:|
+ | **Answer Relevancy** | 0.749 | 0.635 | 0.832 | **0.872** |
+ | **Context Precision** | 0.678 | 0.538 | 0.795 | **0.861** |
+ | **Context Recall** | 0.815 | 0.732 | 0.849 | **0.872** |
+ | **Faithfulness** | 0.912 | 0.938 | **0.942** | 0.937 |
+ | **ROUGE-1** | 0.557 | 0.533 | 0.576 | **0.598** |
+ | **ROUGE-2** | 0.408 | 0.385 | 0.421 | **0.439** |
+ | **ROUGE-L** | 0.526 | 0.508 | 0.545 | **0.567** |
+
+ **Key takeaways:**
+
+ - **`hybrid_rerank` achieves the best scores in 6 out of 7 metrics**, confirming it as the optimal default retrieval mode.
+ - **Faithfulness is consistently high (>0.91 across all modes)**, meaning the LLM reliably grounds its answers in the provided context with minimal hallucination.
+ - **Reranking significantly boosts Context Precision** (+60% over BM25-only, +8% over hybrid), demonstrating the value of Qwen3-Reranker in filtering irrelevant documents.
+ - **Hybrid search substantially outperforms single-mode retrieval**, validating the ensemble approach of combining semantic (vector) and lexical (BM25) search.
+
 Results are saved to `evaluation/results/` as both JSON and CSV files with timestamps.

 ---
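The last takeaway credits the ensemble of semantic (vector) and lexical (BM25) retrieval. As a minimal sketch of that idea — not the repository's actual implementation; the function names, the min-max normalization, and the `alpha=0.5` weight are all illustrative assumptions — hybrid score fusion can look like this:

```python
# Sketch of hybrid retrieval fusion: min-max normalize each retriever's
# scores into a comparable [0, 1] range, then take a weighted sum.
# Names and the 0.5/0.5 weighting are illustrative assumptions.

def normalize(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    """Fuse semantic (vector) and lexical (BM25) scores.

    alpha weights the vector side; a document missing from one
    retriever's result list contributes 0 from that side.
    """
    v, b = normalize(vector_scores), normalize(bm25_scores)
    docs = set(v) | set(b)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
            for d in docs}

# Toy example: "d2" is not the top hit of either retriever alone,
# but scores well in both, so it wins after fusion.
vec = {"d1": 0.91, "d2": 0.88, "d3": 0.40}
bm25 = {"d2": 7.2, "d3": 6.8, "d4": 1.1}
fused = hybrid_scores(vec, bm25)
best = max(fused, key=fused.get)  # "d2"
```

In a real pipeline, `alpha` would be tuned on a held-out split, and a reranker (here, Qwen3-Reranker) would then re-score only the fused top-k candidates, which is what drives the Context Precision gains in the table above.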