# AskMyPDF
A comprehensive solution to query your PDFs using modern LLM-based techniques. This tool extracts and embeds PDF contents into a vector store, enabling natural language queries and context-rich answers. By leveraging LangChain, FAISS, and HuggingFace embeddings, it provides flexible and fast semantic search over document chunks.
## Features

- **PDF Parsing & Splitting:**
  Automatically load PDF content and break it down into chunks suitable for all-MiniLM embeddings.

- **Semantic Embeddings & Vector Store:**
  Use `sentence-transformers/all-MiniLM-L6-v2` embeddings to represent text as vectors, stored in FAISS for efficient similarity search.

- **Few-Shot Prompting & Structured Answers:**
  Integrate few-shot examples to guide the model towards a specific output format, returning answers in structured JSON.

- **Chain Orchestration with LangChain:**
  Utilize LangChain's `LLMChain` and prompt templates for controlled and reproducible queries.

- **Token-Safe Implementation:**
  Custom token splitting and truncation ensure input fits within model token limits, avoiding errors.
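The embedding-and-retrieval flow above can be sketched in a few lines. This is a minimal, dependency-free illustration: the bag-of-words `embed` function is a toy stand-in for the all-MiniLM model, and the brute-force cosine search stands in for a FAISS index — the real project swaps in the actual embeddings and vector store.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for sentence-transformers/all-MiniLM-L6-v2:
    # a bag-of-words "vector" keyed by lowercased tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Brute-force similarity search standing in for a FAISS index.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Invoices are due within 30 days of receipt.",
    "The warranty covers manufacturing defects only.",
    "Payment can be made by bank transfer or card.",
]
print(top_chunks("When are invoices due?", chunks, k=1))
# ['Invoices are due within 30 days of receipt.']
```

With real embeddings the ranking is semantic rather than lexical, but the retrieve-top-k shape of the pipeline is the same.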
## Installation

This project requires **Python 3.11**. We recommend using a virtual environment to keep dependencies isolated.

1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/AskMyPDF.git
   cd AskMyPDF
   ```

2. **Set up a Python 3.11 environment (optional but recommended)**

   ```bash
   python3.11 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

## Usage

Launch the Gradio app:

```bash
gradio app.py
```
## Output

The system will:

- Parse and split the PDF into token-limited chunks.
- Embed the chunks using all-MiniLM embeddings.
- Store them in a FAISS index.
- Retrieve the top chunks relevant to your query.
- Use the language model to produce a final JSON-structured answer.
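For illustration, a JSON-structured answer might look like the reply below and can be checked with the standard library. The `answer`/`sources` field names are illustrative assumptions, not a fixed schema of this project:

```python
import json

# Hypothetical model reply; the "answer" and "sources" fields are
# example names, not this project's guaranteed output schema.
reply = '{"answer": "Invoices are due within 30 days of receipt.", "sources": ["page 4"]}'

result = json.loads(reply)
print(result["answer"])  # Invoices are due within 30 days of receipt.
```

Parsing with `json.loads` (rather than scraping free text) is what makes the structured-output format machine-readable downstream.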
## Implementation Details

- **Token-Based Splitting:**
  We tokenize the PDF text using Hugging Face's `AutoTokenizer` for the all-MiniLM model. By maintaining a `chunk_size` and `chunk_overlap`, and adding truncation at the embedding stage, we ensure that the embedding model's maximum token length is respected.

- **Vector Store & Retrieval:**
  With FAISS indexing, similarity search is fast and scalable. Queries are answered by referencing only the relevant chunks, ensuring context-aware responses.

- **Few-Shot Prompting:**
  The prompt includes a few-shot example demonstrating how the model should respond with a JSON-formatted answer. This guides the LLM to produce consistent, machine-readable output.

- **Chain Invocation:**
  Instead of `chain.run()`, we use `chain.invoke({})`. This approach is more flexible and allows passing parameters in a structured manner if needed later.
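The token-based splitting described above amounts to a sliding window with overlap. In this sketch a whitespace split stands in for the Hugging Face `AutoTokenizer`; with the real tokenizer you would window over token IDs the same way. `chunk_size` and `chunk_overlap` are counted in tokens:

```python
def split_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    # Slide a window of chunk_size tokens; each step moves forward by
    # chunk_size - chunk_overlap so adjacent chunks share context.
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Whitespace tokenization as a stand-in for AutoTokenizer output.
tokens = "one two three four five six seven".split()
print(split_tokens(tokens, chunk_size=4, chunk_overlap=1))
# [['one', 'two', 'three', 'four'], ['four', 'five', 'six', 'seven']]
```

Because every chunk is at most `chunk_size` tokens, embedding-stage truncation becomes a safety net rather than a routine occurrence.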
## Improvements

- **Multi-File Support:**
  - Extend the script to handle multiple PDFs at once.
  - Aggregate or differentiate embeddings by metadata, ensuring queries can target specific documents or sections.

- **Model Agnosticism:**
  - Easily switch embeddings or language models.
  - Try different Sentence Transformers models or local LLMs such as LLaMA or Falcon.

- **User Interface:**
  - Add a simple command-line interface or a web UI (e.g., Streamlit or Gradio) for a more user-friendly querying experience.

- **Caching & Persistence:**
  - Store FAISS indexes on disk for instant reloads without re-embedding.
  - Implement caching of embeddings and query results to speed up repeated queries.

- **Advanced Prompt Engineering:**
  - Experiment with different few-shot examples, system messages, and instructions to improve answer quality and formatting.
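The embedding-cache idea above can be sketched with an in-memory dictionary keyed by chunk text; for on-disk persistence you would additionally save the FAISS index (e.g. with `faiss.write_index`) and reload it instead of re-embedding. The `embed` function here is a hypothetical placeholder for the real embedding call:

```python
import hashlib

_cache: dict[str, list[float]] = {}
calls = 0

def embed(text: str) -> list[float]:
    # Placeholder for the real (expensive) all-MiniLM embedding call;
    # a hash-derived vector keeps the sketch deterministic and offline.
    global calls
    calls += 1
    return [float(b) for b in hashlib.sha256(text.encode()).digest()[:4]]

def embed_cached(text: str) -> list[float]:
    # Each distinct chunk is embedded once; repeats hit the cache.
    if text not in _cache:
        _cache[text] = embed(text)
    return _cache[text]

v1 = embed_cached("hello world")
v2 = embed_cached("hello world")
print(calls)  # 1 -- the expensive embedding ran only once
```

The same pattern applies to query results: key the cache on the query string and invalidate it whenever the underlying index changes.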
With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.