# AskMyPDF
A comprehensive solution to query your PDFs using modern LLM-based techniques. This tool extracts and embeds PDF contents into a vector store, enabling natural language queries and context-rich answers. By leveraging LangChain, FAISS, and HuggingFace embeddings, it provides flexible and fast semantic search over document chunks.
## Features

- **PDF Parsing & Splitting:**
  Automatically load PDF content and break it down into chunks suitable for all-MiniLM embeddings.

- **Semantic Embeddings & Vector Store:**
  Use `sentence-transformers/all-MiniLM-L6-v2` embeddings to represent text as vectors, stored in FAISS for efficient similarity search.

- **Few-Shot Prompting & Structured Answers:**
  Integrate few-shot examples to guide the model towards a specific output format, returning answers in structured JSON.

- **Chain Orchestration with LangChain:**
  Utilize LangChain's `LLMChain` and prompt templates for controlled and reproducible queries.

- **Token-Safe Implementation:**
  Custom token splitting and truncation ensure input fits within model token limits, avoiding errors.
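The embedding-and-retrieval flow above can be sketched in a few lines. This is a minimal, dependency-free illustration: the bag-of-words `embed` function is a toy stand-in for the all-MiniLM model, and the brute-force cosine search stands in for a FAISS index — the real project swaps in the actual embeddings and vector store.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for sentence-transformers/all-MiniLM-L6-v2:
    # a bag-of-words "vector" keyed by lowercased tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Brute-force similarity search standing in for a FAISS index.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Invoices are due within 30 days of receipt.",
    "The warranty covers manufacturing defects only.",
    "Payment can be made by bank transfer or card.",
]
print(top_chunks("When are invoices due?", chunks, k=1))
# ['Invoices are due within 30 days of receipt.']
```

With real embeddings the ranking is semantic rather than lexical, but the retrieve-top-k shape of the pipeline is the same.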
## Installation

This project requires **Python 3.11**. We recommend using a virtual environment to keep dependencies isolated.

1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/AskMyPDF.git
   cd AskMyPDF
   ```

2. **Set up a Python 3.11 environment (optional but recommended)**

   ```bash
   python3.11 -m venv venv
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

## Usage

Launch the Gradio app:

```bash
gradio app.py
```
## Output

The system will:

- Parse and split the PDF into token-limited chunks.
- Embed the chunks using all-MiniLM embeddings.
- Store them in a FAISS index.
- Retrieve the top chunks relevant to your query.
- Use the language model to produce a final JSON-structured answer.
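For illustration, a JSON-structured answer might look like the reply below and can be checked with the standard library. The `answer`/`sources` field names are illustrative assumptions, not a fixed schema of this project:

```python
import json

# Hypothetical model reply; the "answer" and "sources" fields are
# example names, not this project's guaranteed output schema.
reply = '{"answer": "Invoices are due within 30 days of receipt.", "sources": ["page 4"]}'

result = json.loads(reply)
print(result["answer"])  # Invoices are due within 30 days of receipt.
```

Parsing with `json.loads` (rather than scraping free text) is what makes the structured-output format machine-readable downstream.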
## Implementation Details

- **Token-Based Splitting:**
  We tokenize the PDF text using Hugging Face's `AutoTokenizer` for the all-MiniLM model. By maintaining a `chunk_size` and `chunk_overlap`, and adding truncation at the embedding stage, we ensure that the embedding model's maximum token length is respected.

- **Vector Store & Retrieval:**
  With FAISS indexing, similarity search is fast and scalable. Queries are answered by referencing only the relevant chunks, ensuring context-aware responses.

- **Few-Shot Prompting:**
  The prompt includes a few-shot example demonstrating how the model should respond with a JSON-formatted answer. This guides the LLM to produce consistent, machine-readable output.

- **Chain Invocation:**
  Instead of `chain.run()`, we use `chain.invoke({})`. This approach is more flexible and allows passing parameters in a structured manner if needed later.
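The token-based splitting described above amounts to a sliding window with overlap. In this sketch a whitespace split stands in for the Hugging Face `AutoTokenizer`; with the real tokenizer you would window over token IDs the same way. `chunk_size` and `chunk_overlap` are counted in tokens:

```python
def split_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    # Slide a window of chunk_size tokens; each step moves forward by
    # chunk_size - chunk_overlap so adjacent chunks share context.
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Whitespace tokenization as a stand-in for AutoTokenizer output.
tokens = "one two three four five six seven".split()
print(split_tokens(tokens, chunk_size=4, chunk_overlap=1))
# [['one', 'two', 'three', 'four'], ['four', 'five', 'six', 'seven']]
```

Because every chunk is at most `chunk_size` tokens, embedding-stage truncation becomes a safety net rather than a routine occurrence.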
## Improvements

- **Multi-File Support:**
  - Extend the script to handle multiple PDFs at once.
  - Aggregate or differentiate embeddings by metadata, ensuring queries can target specific documents or sections.

- **Model Agnosticism:**
  - Easily switch embeddings or language models.
  - Try different Sentence Transformers models or local LLMs such as LLaMA or Falcon.

- **User Interface:**
  - Add a simple command-line interface or a web UI (e.g., Streamlit or Gradio) for a more user-friendly querying experience.

- **Caching & Persistence:**
  - Store FAISS indexes on disk for instant reloads without re-embedding.
  - Implement caching of embeddings and query results to speed up repeated queries.

- **Advanced Prompt Engineering:**
  - Experiment with different few-shot examples, system messages, and instructions to improve answer quality and formatting.
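The embedding-cache idea above can be sketched with an in-memory dictionary keyed by chunk text; for on-disk persistence you would additionally save the FAISS index (e.g. with `faiss.write_index`) and reload it instead of re-embedding. The `embed` function here is a hypothetical placeholder for the real embedding call:

```python
import hashlib

_cache: dict[str, list[float]] = {}
calls = 0

def embed(text: str) -> list[float]:
    # Placeholder for the real (expensive) all-MiniLM embedding call;
    # a hash-derived vector keeps the sketch deterministic and offline.
    global calls
    calls += 1
    return [float(b) for b in hashlib.sha256(text.encode()).digest()[:4]]

def embed_cached(text: str) -> list[float]:
    # Each distinct chunk is embedded once; repeats hit the cache.
    if text not in _cache:
        _cache[text] = embed(text)
    return _cache[text]

v1 = embed_cached("hello world")
v2 = embed_cached("hello world")
print(calls)  # 1 -- the expensive embedding ran only once
```

The same pattern applies to query results: key the cache on the query string and invalidate it whenever the underlying index changes.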
With AskMyPDF, harness the power of LLMs and embeddings to transform your PDFs into a fully interactive, queryable knowledge source.