Spaces:

MissSqui
/

Static_H


MissSqui committed on Jun 3, 2025

Commit

2d6c260

verified ·

1 Parent(s): 58fe6ee

Create abc12


abc12 +100 -0


	@@ -0,0 +1,100 @@

+RAG Response Evaluation Strategy Documentation
+Overview:
+This document explains the rationale behind the evaluation of RAG (Retrieval-Augmented Generation) responses using various NLP metrics.
+What is Needed:
+- PDF document text
+- Retrieved chunks from the retriever (top-k)
+- Relevant chunks (manually identified or labeled)
+- User's question
+- Generated answer (from LLM)
+Evaluation Metrics Used:
+1. BLEU:
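+Measures n-gram precision of the generated text against the reference, with a penalty for overly short outputs. Useful as a rough proxy for lexical agreement with the reference; it does not verify factual correctness on its own.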
+2. ROUGE-L:
+Measures recall-oriented overlap of longest common subsequences. Good for summarization-type responses.
+3. Cosine Similarity:
+Computes semantic similarity between embeddings of reference and generated text using sentence-transformers.
+4. Perplexity:
+Indicates how surprising the text is to a language model, which serves as a proxy for fluency. Lower is better.
+5. Precision@K:
+Fraction of the top-K retrieved chunks that are relevant.
+6. Recall@K:
+Fraction of all relevant chunks that appear in the top-K results.
+7. nDCG@K:
+Rewards higher-ranked relevant chunks more heavily.
+8. HIT@K:
+Checks whether at least one relevant chunk appears in the top-K results.
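The retrieval metrics above, plus cosine similarity and perplexity, can be sketched in plain Python. This is a minimal illustration, not the project's evaluation code: it assumes chunks are identified by string IDs, relevance is binary, embeddings are plain lists of floats, and per-token natural-log probabilities are already available from a language model. All function names are hypothetical. BLEU and ROUGE-L are left out because they are usually taken from an existing library.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant chunk is in the top-k, else 0.0."""
    return 1.0 if any(c in relevant for c in retrieved[:k]) else 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, c in enumerate(retrieved[:k])
        if c in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative data only: IDs of retrieved chunks in rank order, and labeled relevant IDs.
retrieved_ids = ["c3", "c1", "c7", "c2"]
relevant_ids = {"c1", "c2", "c9"}
p3 = precision_at_k(retrieved_ids, relevant_ids, 3)
r3 = recall_at_k(retrieved_ids, relevant_ids, 3)
h3 = hit_at_k(retrieved_ids, relevant_ids, 3)
n3 = ndcg_at_k(retrieved_ids, relevant_ids, 3)
```

With this data only `c1` falls in the top 3, so precision and recall are both 1/3, while nDCG is lower because the single hit sits at rank 2 and two further relevant chunks are missing.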
+------------------------------------------------------------------
+Proposed Enhancements to Existing RAG Pipeline
+1. RAG Evaluation Metric
+Introduce a comprehensive metric to evaluate the performance of the RAG system.
+Proposed Approach
+A composite score whose weighted components reflect both retrieval and generation quality.
+HITs Score: Use Human Intelligence Task (HIT) based scoring, i.e. human relevance judgments, to measure the relevance and accuracy of retrieved documents.
+Additional components (TBD): May include BLEU, ROUGE, or semantic similarity to capture generation quality.
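One way to read the composite score is as a weighted average of component scores. The sketch below is an assumption-laden illustration: the component names and weights are hypothetical (the document leaves them TBD), and every component is assumed to be normalized to the range 0 to 1 beforehand.

```python
from typing import Dict


def composite_score(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of component scores, each expected in [0, 1]."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * components[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical weights and scores, chosen only for illustration.
weights = {"hit_score": 0.4, "rouge_l": 0.3, "semantic_similarity": 0.3}
components = {"hit_score": 1.0, "rouge_l": 0.5, "semantic_similarity": 0.8}
score = composite_score(components, weights)
```

Dividing by the total weight keeps the result in the same 0 to 1 range even if the weights are later changed and no longer sum to 1.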
+2. Summarization Response Optimization
+Current Approach
+The final summary is generated by aggregating summaries of all retrieved chunks, which leads to high latency and increased compute cost.
+Proposed Optimizations
+2.1 Top-K Chunk Summarization
+Limit summarization to only the top_k most relevant chunks (ranked by similarity or retrieval score).
+Fewer chunk summaries to generate → lower inference time.
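A minimal sketch of the top_k selection step, assuming each retrieved chunk arrives as a (text, score) pair where a higher score means more relevant. The function name and the sample data are hypothetical.

```python
import heapq
from typing import List, Tuple


def select_top_k(scored_chunks: List[Tuple[str, float]], top_k: int) -> List[str]:
    """Return the text of the top_k chunks with the highest retrieval score."""
    best = heapq.nlargest(top_k, scored_chunks, key=lambda pair: pair[1])
    return [text for text, _score in best]


# Illustrative chunks with made-up retrieval scores.
scored = [("intro", 0.42), ("methods", 0.91), ("appendix", 0.10), ("results", 0.77)]
top_chunks = select_top_k(scored, top_k=2)
```

Only the chunks returned here would be passed on to the summarizer, so the number of summarization calls is bounded by top_k rather than by the number of retrieved chunks.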
+2.2 Parallel Processing
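+Use ThreadPoolExecutor to parallelize summarization of individual chunks.
+Each chunk is processed by a worker thread → improves throughput. The gain is largest when summarization is I/O-bound, such as calls to a hosted LLM API; pure-Python CPU-bound work is limited by the GIL in standard CPython.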
+Pseudocode for implementation:
+function summarize_chunk(chunk):
+    return summarize(chunk)  // apply summarization logic to a single chunk
+function parallel_summarize(top_k_chunks, num_workers):
+    create thread pool with num_workers
+    for each chunk in top_k_chunks:
+        assign summarize_chunk(chunk) to a worker thread
+    wait for all threads to finish
+    collect all individual summaries into a list
+    return aggregated_summary(list_of_summaries)
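The pseudocode can be sketched in Python with `concurrent.futures.ThreadPoolExecutor`. The summarizer and aggregation functions below are placeholders (assumptions) standing in for the real model call and merge step, which the document does not specify.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def summarize_chunk(chunk: str) -> str:
    """Placeholder summarizer (assumption): keeps the first sentence of the chunk."""
    return chunk.split(".")[0].strip() + "."


def aggregate_summaries(summaries: List[str]) -> str:
    """Placeholder aggregation (assumption): joins the summaries with spaces."""
    return " ".join(summaries)


def parallel_summarize(
    top_k_chunks: List[str],
    num_workers: int = 4,
    summarizer: Callable[[str], str] = summarize_chunk,
) -> str:
    """Summarize each chunk on a worker thread, then aggregate the results."""
    # Leaving the with-block waits for all submitted work to finish.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in the same order as the input chunks.
        summaries = list(pool.map(summarizer, top_k_chunks))
    return aggregate_summaries(summaries)


# Illustrative input only.
chunks = [
    "RAG combines retrieval and generation. It uses a retriever.",
    "Top-k limits the context. Fewer chunks mean less work.",
]
result = parallel_summarize(chunks, num_workers=2)
```

`pool.map` is used instead of collecting futures by hand because it preserves chunk order, which keeps the aggregated summary deterministic regardless of which thread finishes first.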
+Summary of Benefits
+Improved Evaluation: Quantifiable metric to track RAG effectiveness.
+Performance Gains: Reduced response time and compute overhead.
+Scalability: Efficient parallel processing supports production-grade usage.