Assignment 3 (Mini-Project 1)
NLP with Deep Learning
Due Date: 5th April (11:55 PM)
Marks = 7.5
NOTE: Do not submit Google Drive or shared document links, as they can be modified after
submission. The report must be submitted as a Word or PDF file on the LMS. If other files
exceed the size limit, they may be uploaded via Dropbox. However, any submission via
shared links will be considered incomplete and will not be graded.
RAG-based Question-Answering System Development
1. Introduction
This is a group assignment; each group may have 2-3 members. In this assignment,
your task is to:
• Develop a Retrieval-Augmented Generation (RAG) based question-answering system.
• The system should retrieve information from a specific domain.
• You must move beyond basic semantic search. Your system must implement Hybrid
Search, Re-Ranking, and Automated Evaluation (LLM-as-a-Judge). You will also deploy
a live web interface for interaction.
Core Goals:
• Implement advanced retrieval strategies (Hybrid/RRF/Re-ranking).
• Deploy a live web application on Hugging Face Spaces or any other hosting service.
• Evaluate system performance using LLM-as-a-Judge (Faithfulness & Relevancy).
• Conduct an Ablation Study to justify design choices.
Domain Corpus Requirements:
• Your corpus must contain at least 50-100 documents or 500+ chunks of text
• You may choose any domain: medical, legal, human resources, scientific papers,
company policies, etc.
• Clearly document the source of your corpus and provide access (if publicly available)
or include a representative sample in your report
2. Technical Stack Requirements
To ensure consistency and production readiness, all groups must adhere to the following
infrastructure constraints:
Component | Requirement | Notes
Hosting | Hugging Face Spaces or any other free service | Streamlit or Gradio based interface. Must be publicly accessible.
Vector DB | Pinecone (Free Starter) | Cloud-based. Do not store indexes locally on the host.
Embeddings | Pre-computed | Generate document embeddings locally → Upsert to Pinecone.
Query Embedding | On-App | Use a lightweight model (e.g., all-MiniLM-L6-v2) directly in the Space for query encoding.
LLM Generation | HF Inference API | Use models like Mistral-7B, Llama-3-8B, or TinyAya via API.
Retrieval | Hybrid + Re-ranking | Mandatory implementation of BM25 + Semantic Search with RRF or Cross-Encoder.
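To illustrate the RRF option in the Retrieval row, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It assumes the BM25 and semantic retrievers have already returned ordered document IDs; the function name, the document IDs, and the constant k=60 (a commonly used default) are illustrative assumptions, not values prescribed by this assignment.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a requirement.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from each retriever (IDs are made up).
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Documents that appear near the top of both lists accumulate the highest fused score, which is why RRF needs no score normalization between BM25 and cosine similarity. A cross-encoder re-ranker could then be applied to the fused top results.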
2A. Evaluation Protocol: LLM-as-a-Judge
Do not rely solely on human inspection. You must implement an automated evaluation
pipeline using an LLM API on a fixed test set of 10-20 queries:
Faithfulness: Implement Claim Extraction & Verification.
• Extract claims from the generated answer.
• Verify each claim against the retrieved context using an LLM.
• Report the Faithfulness Score (% of claims supported).
• In your report, show extracted claims and verification results for at least 3 example
queries
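The scoring step of the faithfulness procedure above can be sketched as follows. This is only an outline under stated assumptions: `verify` is a hypothetical callable standing in for the LLM verification call, `toy_verify` is a substring check used purely so the example runs offline, and returning 0.0 when no claims are extracted is an assumed convention.

```python
from typing import Callable, List, Tuple


def faithfulness_score(
    claims: List[str],
    context: str,
    verify: Callable[[str, str], bool],
) -> Tuple[float, List[Tuple[str, bool]]]:
    """Return (% of claims supported, per-claim verification results)."""
    results = [(claim, verify(claim, context)) for claim in claims]
    if not results:
        # Assumed convention: no extracted claims yields a score of 0.
        return 0.0, results
    supported = sum(1 for _, ok in results if ok)
    return 100.0 * supported / len(results), results


def toy_verify(claim: str, context: str) -> bool:
    # Stand-in for the LLM judge (hypothetical): a case-insensitive
    # substring check. A real pipeline would prompt an LLM instead.
    return claim.lower() in context.lower()


# Made-up context and claims for illustration only.
context = (
    "Employees accrue 20 days of annual leave. "
    "Leave requests need manager approval."
)
claims = [
    "employees accrue 20 days of annual leave",
    "leave requests need manager approval",
    "unused leave is paid out in cash",
]

score, details = faithfulness_score(claims, context, toy_verify)
print(f"Faithfulness: {score:.1f}%")
for claim, ok in details:
    print(("SUPPORTED  " if ok else "UNSUPPORTED"), claim)
```

Keeping the per-claim results alongside the score makes it straightforward to show the extracted claims and verification outcomes that the report asks for.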
Relevancy: Implement Alternate Query Generation.
• Generate 3 questions from the generated answer.
• Compute cosine similarity between generated questions and the original query.
• Report the Average Relevancy Score (mean of 3 similarity scores).
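The relevancy computation above can be sketched as below. The `embed` parameter is a hypothetical callable that maps text to a vector; in the deployed system it would presumably be a sentence-embedding model, whereas the bag-of-words `toy_embed` and its vocabulary are assumptions used only to keep the example self-contained.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Treat an all-zero vector as having no similarity.
    return dot / (norm_u * norm_v)


def relevancy_score(
    query: str,
    generated_questions: List[str],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Mean cosine similarity between the query and the generated questions."""
    if not generated_questions:
        raise ValueError("at least one generated question is required")
    query_vec = embed(query)
    sims = [cosine_similarity(query_vec, embed(q)) for q in generated_questions]
    return sum(sims) / len(sims)


# Toy embedding (assumption): word counts over a tiny fixed vocabulary.
VOCAB = ["leave", "annual", "days", "approval", "manager"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


query = "how many annual leave days"
questions = ["annual leave days", "manager approval", "leave approval"]
avg_relevancy = relevancy_score(query, questions, toy_embed)
print(round(avg_relevancy, 3))
```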
2B. Ablation Study
You must include a table in your report comparing performance across variations:
• Chunking: Compare at least two strategies (e.g., Fixed vs. Recursive vs. Semantic).
• Retrieval: Compare Semantic-Only vs. Hybrid + Re-ranking.
• Metrics: Report Faithfulness and Relevancy scores for each variation.
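As one possible baseline for the chunking comparison, a fixed-size strategy with overlap might look like the sketch below. It splits on characters for simplicity; the function name and the parameter values are illustrative assumptions, and a token- or sentence-based splitter would follow the same pattern.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that share `overlap`
    characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks


# Tiny made-up input so the windows are easy to inspect.
sample = "abcdefghij"
chunks = fixed_size_chunks(sample, chunk_size=4, overlap=1)
print(chunks)
```

Running the same evaluation queries over indexes built with different `chunk_size` and `overlap` settings gives one row per configuration in the ablation table.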
2C. Live Web Interface
Deploy a working UI on Hugging Face Spaces or a free hosting site. The UI must allow a user
to input a query and display:
• The Generated Answer
• The Retrieved Context (chunks)
• Faithfulness and Relevance scores
• (Optional) source citations
3. Report Guidelines
Organize your report into clearly defined sections. Your report should include the following
components to ensure that someone else can replicate your process.
A. Platform Details: Specify the platform used for experimentation (e.g., local machine,
Kaggle, Colab). If multiple platforms were used, clarify where each stage was
executed.
B. Data Details: Clearly state the source of the dataset, including the size and number of
documents used in the corpus.
C. Algorithms, Models, and Retrieval Methods: Clearly document the experimental
setup and results, highlighting insights gained from multiple trials.
o Describe the retrieval methods employed in your system. Did you use semantic
search, keyword-based search, or another method? Justify your approach.
o Specify the algorithms and large language models (LLMs) you used and explain
your choices.
o Explain your chunking strategy, including how you segmented the documents
and whether different chunking approaches were tested. Discuss how chunk
size and overlap affected retrieval and answer quality.
o Explain the prompt structure used for the LLM generation and for the
LLM-as-a-judge.
D. Performance Metrics: Compare results across different models, retrieval strategies,
and parameter settings, providing insights into how various choices impact
performance.
o Implement and report evaluation metrics for generated answers, specifically
faithfulness and relevance. You may use automated libraries like RAGAS but if
they do not work reliably, design your own prompting method to evaluate
these metrics.
o These evaluation metrics should be analyzed across different retrieval
strategies, LLM choices, chunking methods, etc.
o Moreover, analyze the latency and computational efficiency by measuring
inference time, retrieval time, and overall system response time.
E. Best Model Selection: Justify your choice of best model using the performance
metrics described above.
F. Reproducibility: Your report must provide enough detail to enable others to replicate
your work. Include any information that is critical for reproduction, such as
preprocessing steps, system configuration, or model fine-tuning techniques.
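For the latency analysis requested under Performance Metrics, per-stage timing can be collected with a small helper such as the one sketched here. The `fake_retrieve` and `fake_generate` functions are hypothetical placeholders for the real retrieval and generation calls, which are not specified in this document.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Placeholder stages (hypothetical); swap in the real pipeline functions.
def fake_retrieve(query):
    return ["chunk 1", "chunk 2"]


def fake_generate(query, chunks):
    return f"answer to {query!r} from {len(chunks)} chunks"


query = "What is the leave policy?"
chunks, retrieval_s = timed(fake_retrieve, query)
answer, generation_s = timed(fake_generate, query, chunks)
total_s = retrieval_s + generation_s

print(f"retrieval: {retrieval_s:.6f}s")
print(f"generation: {generation_s:.6f}s")
print(f"total: {total_s:.6f}s")
```

Averaging these timings over the fixed test set, rather than reporting a single run, gives more stable numbers for the comparison across configurations.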
4. Urdu Low-Resource Language Bonus Track (+10%)
Groups may opt to build their RAG system for Urdu to earn a 10% bonus. This is optional but
strict. Partial attempts do not qualify. To qualify for the bonus, ALL the following criteria
must be met:
• Working Live Demo: The HF Spaces link must accept native Urdu script (Nastaliq)
queries and return answers. (Romanized Urdu does not qualify).
• Urdu Embeddings: You must use a multilingual or Urdu-specific embedding model
(e.g., paraphrase-multilingual-MiniLM-L12-v2, bge-multilingual-base). English-only
models will not work for Urdu.
• Hybrid Search: You must still implement Hybrid Search (BM25 + Semantic) with RRF
or Re-ranking.
• Evaluation Adaptation: Since LLM judges are English-biased, you must use a
multilingual LLM (e.g., Qwen-7B, Aya-23) for evaluation.
• Challenge Report: Include a 1-page Appendix documenting Urdu-specific challenges
encountered (e.g., tokenization, script handling, data scarcity, judge bias) and how
you mitigated them.
5. Additional Instructions
• Figures and Tables: Ensure that all figures and tables are properly numbered and cited
in the text. Avoid vague references like “the figure below”; instead, use precise
citations such as “Table 1 shows…” or “As shown in Figure 7…”.
• Code Submission: Either include the working code in your submission or provide a link
to a repository (e.g., GitHub) where the code can be accessed. The date of upload must
be before the deadline.
• References: If you have used external resources, such as blogs or GitHub repositories,
ensure they are appropriately cited. Include a reference section before the appendix
to acknowledge all sources and avoid any potential issues of plagiarism. Proper
citation is a key part of your academic and professional training.
• Submission File Name: Name the file after the group members; do not name it
Assignment1 or Project1. For example, if the group members are Aamna and Zaid,
name it Aamna_Zaid.docx.