# 🚀 Live Test Assignment — Hosted on Hugging Face
I have deployed my assignment on Hugging Face's free Spaces tier. You can explore the working demo via the link below:

Live App Link: https://huggingface.co/spaces/WalidAlHassan/AEA

---

# 🛠️ Full Guide for Setting Up the FastAPI Project on a Server
Open a new terminal, then follow the steps below.
# 1. Install Tesseract OCR with Bengali language pack
sudo apt install -y tesseract-ocr tesseract-ocr-ben

# 2. Clone the project
git clone https://huggingface.co/WalidAlHassan/AEA-10MS && cd AEA-10MS

# 3. Install Python (if not installed)
sudo apt install python3 python3-venv python3-pip

# 4. Create and activate a virtual environment (recommended)

python3 -m venv venv && source venv/bin/activate

# 5. Install required dependencies
pip install -r requirements.txt

# 6. Set your Google API key (get your api key from https://aistudio.google.com/app/apikey)
Open the .env file, then set your API key: GOOGLE_API_KEY="paste_your_google_api_key_here"
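For reference, loading the key in Python can be sketched with a minimal stdlib-only .env parser (a sketch only; the project itself may use python-dotenv instead, and the `.env.sample` file name below is just for the demo):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: parse KEY="value" lines into os.environ.

    Pure-stdlib sketch; python-dotenv handles more edge cases.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Demo: write a sample .env file and load it
with open(".env.sample", "w") as fh:
    fh.write('GOOGLE_API_KEY="paste_your_google_api_key_here"\n')
load_env(".env.sample")
print(os.environ["GOOGLE_API_KEY"])
```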

# 7. Parse the PDF and create the vector database
python pdf_parsing_n_create_vector_db.py

# 8. Run the project
python main_api.py

# 9. Auto-test via POST requests
Open a new terminal in the AEA-10MS directory, then run

source venv/bin/activate && python test_sample.py
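The shape of such a test can be sketched with the standard library alone. Note that the endpoint path `/query`, port `8000`, and the `query` field name are assumptions for illustration; check test_sample.py and main_api.py for the actual route and payload:

```python
import json
import urllib.request

# Hypothetical endpoint -- verify against the FastAPI app's actual route.
API_URL = "http://127.0.0.1:8000/query"

def build_query_request(question: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one query."""
    payload = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query_request("What is the main theme of the story?")
print(req.method, req.full_url)
```

Sending the request (with `urllib.request.urlopen(req)`) requires the server from step 8 to be running.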

---

Sample queries are in the sample_queries.py file

---

# Must Answer Questions
# 1. What method or library did you use to extract the text, and why? Did you face any formatting challenges with the PDF content?

Used PyMuPDF to render each page as an image and Tesseract OCR (via pytesseract) to extract the text, because the PDF is scanned rather than text-based. Formatting challenges such as broken line breaks and irregular spacing were handled by cleaning the OCR output.
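The cleaning step mentioned above might look like this (a simplified sketch, not the repo's exact logic):

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize OCR output: join words broken across lines,
    collapse runs of whitespace, and trim the result."""
    # Join words hyphenated across a line break (e.g. "exam-\nple")
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Single newlines inside a paragraph become spaces; blank lines stay
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

print(clean_ocr_text("an exam-\nple of OCR\noutput  with   noise"))
# -> "an example of OCR output with noise"
```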

# 2. What chunking strategy did you choose (e.g. paragraph-based, sentence-based, character limit)? Why do you think it works well for semantic retrieval?
Used fixed-size character chunks (400 characters with 50-character overlap) via RecursiveCharacterTextSplitter, which preserves context across chunk boundaries while keeping chunks small enough for precise semantic retrieval.
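A plain fixed-size splitter illustrates the idea (a simplified sketch; RecursiveCharacterTextSplitter additionally prefers natural boundaries such as paragraphs and sentences before falling back to raw character counts):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with its predecessor to preserve boundary context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000, size=400, overlap=50)
print(len(chunks), [len(c) for c in chunks])
# -> 3 [400, 400, 300]
```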

# 3. What embedding model did you use? Why did you choose it? How does it capture the meaning of the text?
Used GoogleGenerativeAIEmbeddings (models/embedding-001), chosen for its strong multilingual semantic representations, which suit the mixed Bengali and English text. It maps text to dense vectors in which semantically similar passages lie close together.

# 4. How are you comparing the query with your stored chunks? Why did you choose this similarity method and storage setup?
Embeddings are stored in FAISS, which provides fast nearest-neighbor search; the query embedding is compared against chunk embeddings by vector similarity, giving efficient semantic matching even as the index grows.
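What FAISS accelerates is, at its core, nearest-neighbor search over vectors. A brute-force toy version using cosine similarity (pure stdlib; the 3-d vectors below stand in for real 768-d embedding-001 outputs):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "chunk embeddings" -- real ones come from the embedding model
chunk_vectors = {
    "admission info":  [0.9, 0.1, 0.0],
    "refund policy":   [0.1, 0.8, 0.2],
    "course syllabus": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of a question about admission

# Retrieve the chunk whose embedding is most similar to the query
best = max(chunk_vectors, key=lambda name: cosine(query, chunk_vectors[name]))
print(best)  # -> "admission info"
```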

# 5. How do you ensure that the question and the document chunks are compared meaningfully? What would happen if the query is vague or missing context?
Using the same embedding model for queries and documents ensures both live in the same vector space, so comparisons are meaningful. Vague or underspecified queries may return less relevant chunks, but the prompt instructs the model to reply "answer not in context" when nothing relevant is found.
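The fallback can be sketched as a similarity cutoff on the top retrieval (the threshold value here is hypothetical and would need tuning on real data):

```python
THRESHOLD = 0.5  # hypothetical cutoff; tune against real queries

def answer_or_fallback(best_score: float, best_chunk: str) -> str:
    """Return the retrieved context only when similarity clears the
    cutoff; otherwise trigger the prompt's 'answer not in context' path."""
    if best_score < THRESHOLD:
        return "answer not in context"
    return best_chunk

print(answer_or_fallback(0.21, "some chunk"))  # -> "answer not in context"
```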

# 6. Do the results seem relevant? If not, what might improve them (e.g. better chunking, better embedding model, larger document)?
Relevance could be further improved by better chunking methods, enhanced OCR cleaning, trying other embedding models, or retrieving more candidate chunks per query.