Page 1 of 4 
 
Assignment 3 (Mini-Project 1) 
NLP with Deep Learning 
Due Date: 5th April (11:55 PM)
Marks = 7.5 
  
NOTE: Do not submit Google Drive or shared document links, as they can be modified after 
submission. The report must be submitted as a Word or PDF file on the LMS. If other files 
exceed the size limit, they may be uploaded via Dropbox. However, any submission via 
shared links will be considered incomplete and will not be graded. 
RAG-based Question-Answering System Development 
1. Introduction 
This is a group assignment; each group may have 2-3 members. Your task is to:
• Develop a Retrieval-Augmented Generation (RAG) based question-answering system
that retrieves information from a specific domain.
• Move beyond basic semantic search by implementing Hybrid Search, Re-Ranking, and
Automated Evaluation (LLM-as-a-Judge).
• Deploy a live web interface for interaction.
Core Goals: 
• Implement advanced retrieval strategies (Hybrid/RRF/Re-ranking). 
• Deploy a live web application on Hugging Face Spaces or any other hosting service. 
• Evaluate system performance using LLM-as-a-Judge (Faithfulness & Relevancy). 
• Conduct an Ablation Study to justify design choices. 
Domain Corpus Requirements: 
• Your corpus must contain at least 50-100 documents or 500+ chunks of text.
• You may choose any domain: medical, legal, human resources, scientific papers,
company policies, etc.
• Clearly document the source of your corpus and provide access (if publicly available)
or include a representative sample in your report.
2. Technical Stack Requirements 
To ensure consistency and production readiness, all groups must adhere to the following 
infrastructure constraints: 
Component | Requirement | Notes
Hosting | Hugging Face Spaces or any other free service | Streamlit or Gradio based interface. Must be publicly accessible.
Vector DB | Pinecone (Free Starter) | Cloud-based. Do not store indexes locally on the host.
Embeddings | Pre-computed | Generate document embeddings locally → Upsert to Pinecone.
Query Embedding | On-App | Use a lightweight model (e.g., all-MiniLM-L6-v2) directly in the Space for query encoding.
LLM Generation | HF Inference API | Use models like Mistral-7B, Llama-3-8B, or TinyAya via API.
Retrieval | Hybrid + Re-ranking | Mandatory implementation of BM25 + Semantic Search with RRF or Cross-Encoder.
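
As an illustration of the Retrieval requirement, Reciprocal Rank Fusion (RRF) can be sketched in a few lines. This is a minimal sketch, not a required implementation: the function name `reciprocal_rank_fusion`, the document IDs, and the constant k=60 (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document receives sum(1 / (k + rank)) over the lists in which it
    appears, with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers (IDs are made up).
bm25_ranking = ["d3", "d1", "d7"]
semantic_ranking = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why RRF needs no score normalization between BM25 and cosine similarity.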
 
2A. Evaluation Protocol: LLM-as-a-Judge 
Do not rely solely on human inspection. You must implement an automated evaluation 
pipeline using an LLM API on a fixed test set of 10-20 queries: 
Faithfulness: Implement Claim Extraction & Verification. 
• Extract claims from the generated answer. 
• Verify each claim against the retrieved context using an LLM. 
• Report the Faithfulness Score (% of claims supported). 
• In your report, show extracted claims and verification results for at least 3 example 
queries 
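
The scoring step of the faithfulness metric could look like the sketch below. It assumes the claims have already been extracted, and it takes the verifier as a parameter so that an LLM call can be plugged in. The names `faithfulness_score` and `toy_verifier` are hypothetical, and the substring check only stands in for the LLM judge so the example runs offline.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Return the percentage of claims that the verifier marks as supported."""
    if not claims:
        # Assumption: an answer with no extractable claims scores 0.
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return 100.0 * supported / len(claims)


def toy_verifier(claim: str, context: str) -> bool:
    """Stand-in for the LLM judge: a naive case-insensitive substring check."""
    return claim.lower() in context.lower()


# Made-up context and claims, used only to exercise the function.
context = (
    "Employees accrue 20 days of annual leave. "
    "Leave requests need manager approval."
)
claims = [
    "Employees accrue 20 days of annual leave",
    "Leave requests need HR approval",
]

score = faithfulness_score(claims, context, toy_verifier)
print(f"Faithfulness: {score:.1f}%")
```

In a real pipeline `is_supported` would prompt the judge LLM with the claim and the retrieved chunks and parse a yes/no verdict.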
Relevancy: Implement Alternate Query Generation. 
• Generate 3 questions from the generated answer. 
• Compute cosine similarity between generated questions and the original query. 
• Report the Average Relevancy Score (mean of 3 similarity scores). 
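
The relevancy computation above reduces to a mean of cosine similarities. The sketch below assumes the original query and the generated questions have already been embedded. The toy 2-dimensional vectors are placeholders for real embeddings, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_a * norm_b)


def relevancy_score(query_vec, question_vecs):
    """Mean cosine similarity between the query and each generated question."""
    if not question_vecs:
        raise ValueError("at least one generated question is required")
    sims = [cosine_similarity(query_vec, q) for q in question_vecs]
    return sum(sims) / len(sims)


# Toy vectors standing in for embeddings of the query and 3 generated questions.
query_vec = [1.0, 0.0]
question_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

score = relevancy_score(query_vec, question_vecs)
print(round(score, 4))
```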
2B. Ablation Study 
You must include a table in your report comparing performance across variations: 
• Chunking: Compare at least two strategies (e.g., Fixed vs. Recursive vs. Semantic). 
• Retrieval: Compare Semantic-Only vs. Hybrid + Re-ranking. 
• Metrics: Report Faithfulness and Relevancy scores for each variation. 
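
One of the chunking strategies named above, fixed-size chunking with overlap, could be sketched as follows. The function name `chunk_fixed`, the character-based sizing, and the default values are assumptions for illustration. A token-based or recursive splitter would follow the same sliding-window idea.

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighbouring chunk
    when it is shorter than the overlap.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_fixed("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

Varying `chunk_size` and `overlap` while holding the retriever fixed gives one row of the ablation table per setting.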
2C. Live Web Interface 
Deploy a working UI on Hugging Face Spaces or a free hosting site. The UI must allow a user 
to input a query and display: 
• The Generated Answer 
• The Retrieved Context (chunks) 
• Faithfulness and Relevancy scores
• (Optional) source citations 
 

3. Report Guidelines 
Organize your report into clearly defined sections. It should include the following
components so that someone else can replicate your process.
A. Platform Details: Specify the platform used for experimentation (e.g., local machine, 
Kaggle, Colab). If multiple platforms were used, clarify where each stage was 
executed. 
B. Data Details: Clearly state the source of the dataset, including the size and number of 
documents used in the corpus. 
C. Algorithms, Models, and Retrieval Methods: Clearly document the experimental
setup and results, highlighting insights gained from multiple trials.
o Describe the retrieval methods employed in your system. Did you use semantic 
search, keyword-based search, or another method? Justify your approach. 
o Specify the algorithms and large language models (LLMs) you used and explain 
your choices. 
o Explain your chunking strategy, including how you segmented the documents 
and whether different chunking approaches were tested. Discuss how chunk 
size and overlap affected retrieval and answer quality. 
o Explain the prompt structure used for LLM generation and for the
LLM-as-a-judge.
D. Performance Metrics: Compare results across different models, retrieval strategies, 
and parameter settings, providing insights into how various choices impact 
performance.  
o Implement and report evaluation metrics for generated answers, specifically 
faithfulness and relevance. You may use automated libraries like RAGAS but if 
they do not work reliably, design your own prompting method to evaluate 
these metrics. 
o These evaluation metrics should be analyzed across different retrieval 
strategies, LLM choices, chunking methods, etc. 
o Moreover, analyze the latency and computational efficiency by measuring 
inference time, retrieval time, and overall system response time. 
E. Best Model Selection: Justify your choice of best model using the performance
metrics described above.
F. Reproducibility: Your report must provide enough detail to enable others to replicate 
your work. Include any information that is critical for reproduction, such as 
preprocessing steps, system configuration, or model fine-tuning techniques. 
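
The latency measurements requested under Performance Metrics (retrieval time, inference time, overall response time) could be collected with a small timing wrapper such as the sketch below. The `retrieve` and `generate` callables are hypothetical stand-ins for the real pipeline stages, and the stubs exist only so the example runs offline.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def answer_with_timings(query, retrieve, generate):
    """Run retrieval then generation, recording the time spent in each stage."""
    overall_start = time.perf_counter()
    context, retrieval_s = timed(retrieve, query)
    answer, generation_s = timed(generate, query, context)
    timings = {
        "retrieval_s": retrieval_s,
        "generation_s": generation_s,
        "total_s": time.perf_counter() - overall_start,
    }
    return answer, timings


# Stub stages (assumptions): replace with the real retriever and LLM call.
def stub_retrieve(query):
    return ["chunk about " + query]


def stub_generate(query, context):
    return "answer based on " + str(len(context)) + " chunk(s)"


answer, timings = answer_with_timings("leave policy", stub_retrieve, stub_generate)
print(answer, timings)
```

Averaging these timings over the fixed test set gives more stable numbers than a single query.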
4. Urdu Low-Resource Language Bonus Track (+10%)
Groups may opt to build their RAG system for Urdu to earn a 10% bonus. This track is
optional but strict: partial attempts do not qualify. To qualify for the bonus, ALL of the
following criteria must be met:

• Working Live Demo: The HF Spaces link must accept native Urdu script (Nastaliq) 
queries and return answers. (Romanized Urdu does not qualify). 
• Urdu Embeddings: You must use a multilingual or Urdu-specific embedding model 
(e.g., paraphrase-multilingual-MiniLM-L12-v2, bge-multilingual-base). English-only 
models will not work for Urdu. 
• Hybrid Search: You must still implement Hybrid Search (BM25 + Semantic) with RRF 
or Re-ranking. 
• Evaluation Adaptation: Since LLM judges are English-biased, you must use a 
multilingual LLM (e.g., Qwen-7B, Aya-23) for evaluation. 
• Challenge Report: Include a 1-page Appendix documenting Urdu-specific challenges 
encountered (e.g., tokenization, script handling, data scarcity, judge bias) and how 
you mitigated them. 
5. Additional Instructions 
• Figures and Tables: Ensure that all figures and tables are properly numbered and cited 
in the text. Avoid vague references like “the figure below”; instead, use precise 
citations such as “Table 1 shows…” or “As shown in Figure 7…”. 
• Code Submission: Either include the working code in your submission or provide a link 
to a repository (e.g., GitHub) where the code can be accessed. The date of upload must 
be before the deadline. 
• References: If you have used external resources, such as blogs or GitHub repositories, 
ensure they are appropriately cited. Include a reference section before the appendix 
to acknowledge all sources and avoid any potential issues of plagiarism. Proper 
citation is a key part of your academic and professional training. 
• Submission File Name: Name the file after the group members; do not name it
Assignment1 or Project1. For example, if the group members are Aamna and Zaid,
name it Aamna_Zaid.docx.