# 🚀 Live Test Assignment — Hosted on Hugging Face
I have successfully deployed my assignment on Hugging Face using their free server. You can explore the working demo via the link below:
Live App Link: https://huggingface.co/spaces/WalidAlHassan/AEA
---
# 🛠️ Full Guideline for Setting Up the FastAPI Project on a Server
Open a new terminal, then follow the instructions below.
# 1. Install Tesseract OCR with Bengali language pack
sudo apt install -y tesseract-ocr tesseract-ocr-ben
# 2. Clone the project
git clone https://huggingface.co/WalidAlHassan/AEA-10MS && cd AEA-10MS
# 3. Install Python (if not installed)
sudo apt install python3 python3-venv python3-pip
# 4. Create and activate a virtual environment (recommended)
python3 -m venv venv && source venv/bin/activate
# 5. Install required dependencies
pip install -r requirements.txt
# 6. Set your Google API key (get your API key from https://aistudio.google.com/app/apikey)
Open the .env file and set your API key: GOOGLE_API_KEY="paste_your_google_api_key_here"
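For reference, a minimal .env file is a single line of this form (the placeholder value below is illustrative; substitute the key you generated):

```
GOOGLE_API_KEY="your_google_api_key_here"
```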
# 7. Parse the PDF and create the vector database
python pdf_parsing_n_create_vector_db.py
# 8. Run the project
python main_api.py
# 9. Run the automated test (POST requests)
Open a new terminal in the AEA-10MS directory, then run:
source venv/bin/activate && python test_sample.py
---
Sample queries are in the sample_queries.py file.
---
# Must-Answer Questions
# 1. What method or library did you use to extract the text, and why? Did you face any formatting challenges with the PDF content?
PyMuPDF was used to render pages as images and Tesseract OCR (via pytesseract) to extract the text, since the PDF is scanned rather than text-based. Formatting challenges were handled by cleaning stray line breaks and extra spaces in the OCR output.
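The cleanup step mentioned above can be sketched as a small helper. This is an illustrative function, not the project's actual code; the name `clean_ocr_text` and its exact rules are assumptions:

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize OCR output: collapse extra spaces, join broken lines,
    but keep blank lines as paragraph breaks."""
    # Collapse runs of spaces/tabs into a single space
    text = re.sub(r"[ \t]+", " ", raw)
    # Join single line breaks inside a paragraph (not adjacent to another \n)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse multiple blank lines into one paragraph break
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text.strip()

print(clean_ocr_text("Hello   world\nthis is  OCR\n\nNew paragraph"))
```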
# 2. What chunking strategy did you choose (e.g. paragraph-based, sentence-based, character limit)? Why do you think it works well for semantic retrieval?
Fixed-size character chunks (400 characters with 50 characters of overlap) via RecursiveCharacterTextSplitter. The overlap preserves context across chunk boundaries while the small chunk size keeps semantic retrieval precise.
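A minimal sketch of the fixed-size strategy (400-character windows with 50 characters of overlap, as above). This plain-Python illustration is not RecursiveCharacterTextSplitter itself, which additionally prefers splitting at paragraph and sentence boundaries:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; each chunk overlaps the previous
    one by `overlap` characters so context at boundaries is not lost."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000)
# 1000 chars -> 3 chunks; consecutive chunks share 50 characters
```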
# 3. What embedding model did you use? Why did you choose it? How does it capture the meaning of the text?
GoogleGenerativeAIEmbeddings (models/embedding-001), chosen for its strong multilingual semantic representations, which suit the mixed Bengali and English text. It maps each passage to a dense vector in which semantically similar texts lie close together.
# 4. How are you comparing the query with your stored chunks? Why did you choose this similarity method and storage setup?
Chunk embeddings are stored in a FAISS index; at query time the query embedding is compared against them with fast approximate nearest-neighbor search, which keeps semantic matching efficient even for large document sets.
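Conceptually, the comparison FAISS performs is nearest-neighbor search over embedding vectors. A brute-force sketch of the same idea in plain Python (FAISS makes this fast and approximate at scale; the toy 3-dimensional vectors here are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "chunk embeddings" keyed by chunk id (real ones come from the embedding model)
index = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.7, 0.7, 0.0]}

def search(query_vec, k=2):
    """Return ids of the k chunks most similar to the query embedding."""
    ranked = sorted(index, key=lambda i: cosine(query_vec, index[i]), reverse=True)
    return ranked[:k]

print(search([0.9, 0.1, 0.0]))  # chunk 0 is closest, then chunk 2
```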
# 5. How do you ensure that the question and the document chunks are compared meaningfully? What would happen if the query is vague or missing context?
Using the same embedding model for both queries and documents ensures they are compared in the same vector space. A vague query may retrieve less relevant chunks, but the prompt instructs the model to reply "answer not in context" when nothing relevant is found.
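The fallback behavior described above can be sketched as prompt assembly with a guard for empty retrieval results. The template wording and the function name `build_prompt` are illustrative assumptions, not the project's actual prompt:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a RAG prompt; instruct the model to refuse rather than guess."""
    if not chunks:
        # Nothing retrieved: steer the model straight to the refusal answer
        return (
            "No relevant context was found. Reply exactly: answer not in context.\n"
            f"Question: {question}"
        )
    context = "\n---\n".join(chunks)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, reply: answer not in context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```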
# 6. Do the results seem relevant? If not, what might improve them (e.g. better chunking, better embedding model, larger document)?
Relevance could be improved further by smarter chunking (e.g. sentence-aware splitting), enhanced OCR cleaning, trying alternative embedding models, or increasing the number of retrieved candidates.