# 🚀 Live Test Assignment — Hosted on Hugging Face
I have successfully deployed my assignment on Hugging Face using their free server. You can explore the working demo via the link below:
Live App Link: https://huggingface.co/spaces/WalidAlHassan/AEA
---
# 🛠️ Full Guideline for Setting Up the FastAPI Project on a Server
Open a new terminal, then follow the instructions below.
# 1. Install Tesseract OCR with Bengali language pack
sudo apt install -y tesseract-ocr tesseract-ocr-ben
# 2. Clone the project
git clone https://huggingface.co/WalidAlHassan/AEA-10MS && cd AEA-10MS
# 3. Install Python (if not installed)
sudo apt install python3 python3-venv python3-pip
# 4. Create and activate a virtual environment (recommended)
python3 -m venv venv && source venv/bin/activate
# 5. Install required dependencies
pip install -r requirements.txt
# 6. Set your Google API key (get your API key from https://aistudio.google.com/app/apikey)
Open the .env file and set your API key: GOOGLE_API_KEY="paste_your_google_api_key_here"
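For reference, a minimal .env file is a single line of this form (the placeholder value below is illustrative; substitute the key you generated):

```
GOOGLE_API_KEY="your_google_api_key_here"
```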
# 7. Parse the PDF and create the vector database
python pdf_parsing_n_create_vector_db.py
# 8. Run the project
python main_api.py
# 9. Run the automated test (POST requests)
Open a new terminal in the AEA-10MS directory, then run:
source venv/bin/activate && python test_sample.py
---
Sample queries are in the sample_queries.py file.
---
# Must-Answer Questions
# 1. What method or library did you use to extract the text, and why? Did you face any formatting challenges with the PDF content?
PyMuPDF was used to render pages as images and Tesseract OCR (via pytesseract) to extract the text, since the PDF is scanned rather than text-based. Formatting challenges were handled by cleaning stray line breaks and extra spaces in the OCR output.
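The cleanup step mentioned above can be sketched as a small helper. This is an illustrative function, not the project's actual code; the name `clean_ocr_text` and its exact rules are assumptions:

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize OCR output: collapse extra spaces, join broken lines,
    but keep blank lines as paragraph breaks."""
    # Collapse runs of spaces/tabs into a single space
    text = re.sub(r"[ \t]+", " ", raw)
    # Join single line breaks inside a paragraph (not adjacent to another \n)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse multiple blank lines into one paragraph break
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()

print(clean_ocr_text("Hello   world\nthis is  OCR\n\nNew paragraph"))
```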
# 2. What chunking strategy did you choose (e.g. paragraph-based, sentence-based, character limit)? Why do you think it works well for semantic retrieval?
Fixed-size character chunks (400 characters with 50 characters of overlap) via RecursiveCharacterTextSplitter. The overlap preserves context across chunk boundaries while the small chunk size keeps semantic retrieval precise.
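A minimal sketch of the fixed-size strategy (400-character windows with 50 characters of overlap, as above). This plain-Python illustration is not RecursiveCharacterTextSplitter itself, which additionally prefers splitting at paragraph and sentence boundaries:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; each chunk overlaps the previous
    one by `overlap` characters so context at boundaries is not lost."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000)
# 1000 chars -> 3 chunks; consecutive chunks share 50 characters
```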
# 3. What embedding model did you use? Why did you choose it? How does it capture the meaning of the text?
GoogleGenerativeAIEmbeddings (models/embedding-001), chosen for its strong multilingual semantic representations, which suit the mixed Bengali and English text. It maps each passage to a dense vector in which semantically similar texts lie close together.
# 4. How are you comparing the query with your stored chunks? Why did you choose this similarity method and storage setup?
Chunk embeddings are stored in a FAISS index; at query time the query embedding is compared against them with fast approximate nearest-neighbor search, which keeps semantic matching efficient even for large document sets.
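Conceptually, the comparison FAISS performs is nearest-neighbor search over embedding vectors. A brute-force sketch of the same idea in plain Python (FAISS makes this fast and approximate at scale; the toy 3-dimensional vectors here are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "chunk embeddings" keyed by chunk id (real ones come from the embedding model)
index = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.7, 0.7, 0.0]}

def search(query_vec, k=2):
    """Return ids of the k chunks most similar to the query embedding."""
    ranked = sorted(index, key=lambda i: cosine(query_vec, index[i]), reverse=True)
    return ranked[:k]

print(search([0.9, 0.1, 0.0]))  # chunk 0 is closest, then chunk 2
```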
# 5. How do you ensure that the question and the document chunks are compared meaningfully? What would happen if the query is vague or missing context?
Using the same embedding model for both queries and documents ensures they are compared in the same vector space. A vague query may retrieve less relevant chunks, but the prompt instructs the model to reply "answer not in context" when nothing relevant is found.
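The fallback behavior described above can be sketched as prompt assembly with a guard for empty retrieval results. The template wording and the function name `build_prompt` are illustrative assumptions, not the project's actual prompt:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a RAG prompt; instruct the model to refuse rather than guess."""
    if not chunks:
        # Nothing retrieved: steer the model straight to the refusal answer
        return (
            "No relevant context was found. Reply exactly: answer not in context.\n"
            f"Question: {question}"
        )
    context = "\n---\n".join(chunks)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, reply: answer not in context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```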
# 6. Do the results seem relevant? If not, what might improve them (e.g. better chunking, better embedding model, larger document)?
Relevance could be improved further by smarter chunking (e.g. sentence-aware splitting), enhanced OCR cleaning, trying alternative embedding models, or increasing the number of retrieved candidates.