---
title: AskMyPDF
emoji: ⚡
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: false
---

# AskMyPDF

Query your PDFs using modern LLM-based techniques. This tool extracts PDF content, embeds it into a vector store, and answers natural-language questions with context-rich responses. Built on LangChain, FAISS, and Hugging Face embeddings, it provides flexible and fast semantic search over document chunks.

## Features

- **PDF Parsing & Splitting:**
  Automatically loads PDF content and splits it into chunks sized for all-MiniLM embeddings.

- **Semantic Embeddings & Vector Store:**
  Uses `sentence-transformers/all-MiniLM-L6-v2` embeddings to represent text as vectors, with FAISS storage for efficient similarity search.

- **Few-Shot Prompting & Structured Answers:**
  Integrates few-shot examples to guide the model toward a specific output format, returning answers as structured JSON.

- **Chain Orchestration with LangChain:**
  Uses LangChain's `LLMChain` and prompt templates for controlled, reproducible queries.

- **Token-Safe Implementation:**
  Custom token splitting and truncation keep inputs within model token limits, avoiding errors.
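The retrieval flow behind these features can be sketched in miniature. This is a toy illustration only: a bag-of-words counter stands in for the all-MiniLM embeddings, and a brute-force cosine search stands in for the FAISS index; the function names are ours, not the app's.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for sentence-transformer embeddings:
    # a sparse bag-of-words vector keyed by word.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def split_into_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    # Sliding word window, stepping by chunk_size - overlap.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Brute-force similarity search; FAISS does this at scale.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In the real pipeline, `embed` is replaced by the MiniLM model and `top_k` by a FAISS index lookup, but the chunk-embed-retrieve shape is the same.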

## Installation

This project requires **Python 3.11**. We recommend using a virtual environment to keep dependencies isolated.

1. **Clone the repository**
   ```bash
   git clone https://huggingface.co/spaces/agoyal496/AskMyPDF
   cd AskMyPDF
   ```

2. **Set up a Python 3.11 environment** (optional but recommended)
   ```bash
   python3.11 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the app**
   ```bash
   gradio app.py
   ```

## Output

The system will:
- Parse and split the PDF into token-limited chunks.
- Embed the chunks using all-MiniLM embeddings.
- Store them in a FAISS index.
- Retrieve the top chunks relevant to your query.
- Use the language model to produce a final JSON-structured answer.
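The last step, producing a JSON-structured answer, hinges on a few-shot prompt plus strict parsing of the reply. The sketch below is illustrative: the prompt text and the `answer`/`sources` field names are assumptions, not the app's actual template.

```python
import json

# Hypothetical few-shot prompt: the worked example shows the model
# the exact JSON shape we expect back. Double braces escape str.format.
FEW_SHOT_PROMPT = """\
Answer strictly as JSON with keys "answer" and "sources".

Example:
Question: Who wrote the report?
Context: [chunk 3] The report was written by Jane Doe.
Output: {{"answer": "Jane Doe", "sources": [3]}}

Question: {question}
Context: {context}
Output:"""

def build_prompt(question: str, context: str) -> str:
    return FEW_SHOT_PROMPT.format(question=question, context=context)

def parse_answer(raw: str) -> dict:
    """Parse the model's reply, failing loudly if it is not valid JSON."""
    result = json.loads(raw)
    assert {"answer", "sources"} <= result.keys(), "missing expected keys"
    return result
```

In the app itself, a prompt like this would live in a LangChain `PromptTemplate` and the parse step would run on the chain's output.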

## Implementation Details

- **Token-Based Splitting:**
  The PDF text is tokenized with Hugging Face's `AutoTokenizer` for the all-MiniLM model. Maintaining a `chunk_size` and `chunk_overlap`, plus truncation at the embedding stage, ensures the embedding model's maximum token length is respected.
- **Vector Store & Retrieval:**
  FAISS indexing makes similarity search fast and scalable. Queries are answered by referencing only the relevant chunks, ensuring context-aware responses.
- **Few-Shot Prompting:**
  The prompt includes a few-shot example demonstrating how the model should respond with a JSON-formatted answer, guiding the LLM to produce consistent, machine-readable output.
- **Chain Invocation:**
  Instead of `chain.run()`, we use `chain.invoke({})`. This approach is more flexible and allows passing parameters in a structured manner if needed later.
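The token-safe splitting described above can be sketched as a sliding window plus a truncation safety net. Here a whitespace tokenizer stands in for the model's `AutoTokenizer` (real word-piece counts differ), and `MAX_TOKENS` is an assumed limit.

```python
MAX_TOKENS = 256  # assumed embedding-model limit; the real one comes from the tokenizer config

def tokenize(text: str) -> list[str]:
    # Stand-in for AutoTokenizer; word-piece tokenization yields more tokens.
    return text.split()

def split_tokens(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[list[str]]:
    """Slide a chunk_size window over the tokens, stepping by chunk_size - chunk_overlap
    so consecutive chunks share chunk_overlap tokens of context."""
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, max(len(tokens), 1), step)]

def truncate(tokens: list[str], limit: int = MAX_TOKENS) -> list[str]:
    """Safety net applied at the embedding stage: never exceed the model limit."""
    return tokens[:limit]
```

Because `chunk_size` stays below `MAX_TOKENS`, truncation only fires if word-piece expansion pushes a chunk over the limit.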

## Improvements

- **Multi-File Support:**
  - Extend the script to handle multiple PDFs at once.
  - Aggregate or differentiate embeddings by metadata, ensuring queries can target specific documents or sections.
- **Model Agnosticism:**
  - Easily switch embeddings or language models.
  - Try different Sentence Transformers models or local LLMs like LLaMA or Falcon.
- **Caching & Persistence:**
  - Store FAISS indexes on disk for instant reloads without re-embedding.
  - Implement caching of embeddings and query results to speed up repeated queries.
- **Advanced Prompt Engineering:**
  - Experiment with different few-shot examples, chain-of-thought prompting, system messages, and instructions to improve answer quality and formatting.
- **Chunking Strategies:**
  - Use semantic chunking to divide text based on meaning or coherence rather than fixed sizes.
  - Include options for overlapping chunks to improve retrieval precision.
  - Integrate hierarchical chunking to preserve context across sections (e.g., chapters, headings, subheadings).
- **Improved Retrieval Techniques:**
  - Leverage Approximate Nearest Neighbor (ANN) algorithms to accelerate similarity search.
  - Integrate with advanced vector databases (e.g., Pinecone, Weaviate, Milvus) for efficient and scalable retrieval.
  - Use hybrid retrieval models, combining vector similarity with traditional keyword-based retrieval for better query coverage.
- **Cross-Encoder Reranker:**
  - Apply a fine-tuned cross-encoder model to rerank top candidates from the initial vector search.
  - Use a pre-trained or task-specific cross-encoder (e.g., Hugging Face models like `cross-encoder/ms-marco-TinyBERT-L-6`).
  - Improve relevance by jointly encoding the query and candidate passages, allowing contextual alignment and a more accurate similarity score.
  - Dynamically adjust the balance between retrieval speed and reranking quality by tuning the number of top candidates to rerank.
- **Graph-Based Retrieval Augmentation:**
  - Adopt GraphRAG approaches: represent documents and queries as nodes in a graph for relational context.
  - Use graph-based algorithms to enhance retrieval by modeling relationships (e.g., citations, semantic links).
  - Introduce parent-document retrievers that prioritize and rank content based on its originating document or source reliability.
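Of the roadmap items above, hybrid retrieval is easy to prototype as a simple score fusion. This is a sketch under stated assumptions: the `alpha` weight, the term-overlap scorer (BM25 would be the realistic choice), and the precomputed `vector_scores` input are all illustrative.

```python
def keyword_score(query: str, chunk: str) -> float:
    # Fraction of query terms present in the chunk; a crude BM25 stand-in.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def hybrid_rank(query: str, chunks: list[str], vector_scores: list[float],
                alpha: float = 0.5) -> list[str]:
    """Blend vector similarity with keyword overlap; alpha weights the vector side."""
    fused = [
        alpha * v + (1 - alpha) * keyword_score(query, c)
        for c, v in zip(chunks, vector_scores)
    ]
    order = sorted(range(len(chunks)), key=lambda i: fused[i], reverse=True)
    return [chunks[i] for i in order]
```

The payoff is that an exact-term match can outrank a chunk whose embedding happens to score higher, improving coverage for keyword-heavy queries.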

With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.