# 🚀 Live Test Assignment — Hosted on Hugging Face
I have successfully deployed my assignment on Hugging Face using their free server. You can explore the working demo from the link below:
Live App Link: https://huggingface.co/spaces/WalidAlHassan/AEA
---
# 🛠️ Full Guideline for Setting Up the FastAPI Project on a Server
Open a new terminal, then follow the instructions below.
# 1. Install Tesseract OCR with Bengali language pack
sudo apt install -y tesseract-ocr tesseract-ocr-ben
# 2. Clone the project
git clone https://huggingface.co/WalidAlHassan/AEA-10MS && cd AEA-10MS
# 3. Install Python (if not installed)
sudo apt install python3 python3-venv python3-pip
# 4. Create and activate a virtual environment (recommended)
python3 -m venv venv && source venv/bin/activate
# 5. Install required dependencies
pip install -r requirements.txt
# 6. Set your Google API key (get your api key from https://aistudio.google.com/app/apikey)
Open the .env file and paste your API key into GOOGLE_API_KEY="paste_your_google_api_key_here"
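Once the key is in place, the API can read it at startup. A minimal sketch of that check, assuming the project loads the .env file with python-dotenv (the helper name here is hypothetical):

```python
# Hypothetical helper: read GOOGLE_API_KEY and fail fast if it is missing
# or still set to the placeholder value from the .env template.
import os

def get_google_api_key(env=None):
    """Return the GOOGLE_API_KEY, raising a clear error if it is unset."""
    source = env if env is not None else os.environ
    key = source.get("GOOGLE_API_KEY", "")
    if not key or key == "paste_your_google_api_key_here":
        raise RuntimeError("Set GOOGLE_API_KEY in the .env file first.")
    return key
```

Failing fast here gives a clearer error than letting the Google client fail later with an authentication message.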
# 7. Parse the PDF and create the vector database
python pdf_parsing_n_create_vector_db.py
# 8. Run the project
python main_api.py
# 9. Auto-test using the POST method
Open a new terminal in the AEA-10MS directory, then run:
source venv/bin/activate && python test_sample.py
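The test script posts questions to the running API. A hedged sketch of what such a call looks like; the endpoint path (`/ask`) and payload field (`query`) are assumptions here — check `main_api.py` and `test_sample.py` for the real names:

```python
# Sketch of a POST query to the locally running API. The "/ask" path and
# "query" field are illustrative assumptions, not confirmed from the code.
import json
import urllib.request

def build_payload(question):
    """Serialize the question as the JSON body sent to the API."""
    return json.dumps({"query": question}).encode("utf-8")

def ask(question, url="http://127.0.0.1:8000/ask"):
    """Send the question to the API and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Running `ask("...")` while `python main_api.py` is up should return the model's answer as JSON.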
---
Sample queries are in the sample_queries.py file.
---
# Must-Answer Questions
# 1. What method or library did you use to extract the text, and why? Did you face any formatting challenges with the PDF content?
Used PyMuPDF to render each page as an image and Tesseract OCR (pytesseract) to extract the text, because the PDF is scanned rather than text-based. Formatting challenges were handled by cleaning stray line breaks and whitespace.
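A minimal sketch of the kind of post-OCR cleanup described above: rejoin lines that OCR broke mid-paragraph and collapse run-on whitespace. The exact rules in `pdf_parsing_n_create_vector_db.py` may differ.

```python
# Illustrative post-OCR cleanup: join single line breaks inside a paragraph
# (keeping blank-line paragraph breaks) and collapse repeated spaces/tabs.
import re

def clean_ocr_text(raw):
    text = raw.replace("\r\n", "\n")
    # A lone newline (not part of a blank line) becomes a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces/tabs left behind by the OCR layout.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```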
# 2. What chunking strategy did you choose (e.g. paragraph-based, sentence-based, character limit)? Why do you think it works well for semantic retrieval?
Used fixed-size character chunks (400 chars with 50 overlap) via RecursiveCharacterTextSplitter to keep context while enabling precise semantic retrieval.
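In its simplest form, fixed-size chunking with overlap is a sliding window whose step is `chunk_size - overlap`. A sketch of that idea (RecursiveCharacterTextSplitter additionally prefers splitting at paragraph and sentence boundaries):

```python
# Naive fixed-size chunker: 400-character windows that each share 50
# characters with the previous chunk, so no sentence is cut without context.
def split_into_chunks(text, chunk_size=400, overlap=50):
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a fact straddling a chunk boundary still appears whole in at least one chunk, which helps retrieval.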
# 3. What embedding model did you use? Why did you choose it? How does it capture the meaning of the text?
Used GoogleGenerativeAIEmbeddings (models/embedding-001) for strong multilingual semantic representations suitable for Bengali and English text.
# 4. How are you comparing the query with your stored chunks? Why did you choose this similarity method and storage setup?
Stored embeddings in FAISS for fast approximate nearest-neighbor search, comparing the query embedding with chunk embeddings for efficient semantic matching.
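To illustrate the comparison underlying the FAISS search: the query embedding is scored against each stored chunk embedding, here with plain cosine similarity (FAISS itself uses optimized L2 or inner-product indexes over the same idea):

```python
# Pure-Python illustration of embedding similarity search.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```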
# 5. How do you ensure that the question and the document chunks are compared meaningfully? What would happen if the query is vague or missing context?
Using the same embedding model for both queries and documents ensures they are compared in the same vector space. Vague queries may return less relevant chunks, but the prompt instructs the model to reply "answer not in context" if nothing relevant is found.
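A hedged sketch of such a guarded prompt; the actual wording in the project may differ:

```python
# Illustrative prompt template that constrains the model to the retrieved
# context and gives it an explicit fallback phrase for vague queries.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, reply exactly: "
    "answer not in context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)

def build_prompt(context, question):
    return PROMPT_TEMPLATE.format(context=context, question=question)
```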
# 6. Do the results seem relevant? If not, what might improve them (e.g. better chunking, better embedding model, larger document)?
Better chunking methods, enhanced OCR cleaning, trying other embeddings, or increasing the number of search candidates could improve relevance. |