Rafael Uzarowski committed
Commit e3ab7e8 · unverified · 1 Parent(s): 7206d38

feat: DocumentQuery initial version

prompts/default/agent.system.tool.document_query.md ADDED
@@ -0,0 +1,58 @@
+ ### document_query:
+ This tool can be used to read or analyze remote and local documents.
+ It can be used to:
+ * Get webpage or remote document text content
+ * Get local document text content
+ * Answer queries about a webpage, remote document, or local document
+ By default, when the "queries" argument is empty, this tool returns the text content of the document, retrieved using OCR where needed.
+ Additionally, you can pass a list of "queries" - in this case, the tool returns the answers to all the passed queries about the document.
+ !!! This is a universal document reader and query tool
+ !!! Supported document formats: HTML, PDF, Office documents (Word, Excel, PowerPoint), text files and many more.
+
+ #### Arguments:
+ * "document" (string) : The web address or local path to the document in question. Web documents need an "http://" or "https://" protocol prefix. For local files the "file:" protocol prefix is optional. Local files MUST be passed with a full filesystem path.
+ * "queries" (Optional, list[str]) : One or more queries to be answered about (and/or using) the document.
+
+ #### Usage example 1:
+ ##### Request:
+ ```json
+ {
+     "thoughts": [
+         "..."
+     ],
+     "tool_name": "document_query",
+     "tool_args": {
+         "document": "https://...somexample"
+     }
+ }
+ ```
+ ##### Response:
+ ```plaintext
+ ... Here is the entire content of the web document requested ...
+ ```
+
+ #### Usage example 2:
+ ##### Request:
+ ```json
+ {
+     "thoughts": [
+         "..."
+     ],
+     "tool_name": "document_query",
+     "tool_args": {
+         "document": "https://...somexample",
+         "queries": [
+             "What is the topic?",
+             "Who is the audience?"
+         ]
+     }
+ }
+ ```
+ ##### Response:
+ ```plaintext
+ # What is the topic?
+ ... Description of the document topic ...
+
+ # Who is the audience?
+ ... The intended document audience list with short descriptions ...
+ ```
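For illustration, a request against a local file follows the same shape (the path below is hypothetical; per the argument notes above, local files need a full filesystem path and the "file:" prefix is optional):

```json
{
    "thoughts": [
        "..."
    ],
    "tool_name": "document_query",
    "tool_args": {
        "document": "file:/home/user/reports/annual_report.pdf",
        "queries": [
            "What was the total revenue?"
        ]
    }
}
```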
prompts/default/agent.system.tools.md CHANGED
@@ -19,3 +19,5 @@
  {{ include './agent.system.tool.browser.md' }}
 
  {{ include './agent.system.tool.scheduler.md' }}
+
+ {{ include './agent.system.tool.document_query.md' }}
prompts/default/fw.document_query.optmimize_query.md ADDED
@@ -0,0 +1,7 @@
+ You are an AI assistant that is part of a larger RAG system based on vector similarity search.
+ Your job is to take a human-written question and convert it into a concise vector store search query.
+ The goal is to yield as many correct results and as few false positives as possible.
+ !! You will be given a "Search Query".
+ !! The response should ONLY contain the optimized search query.
+ !! Do not include any other text, confirmations or explanations. Do not prefix your response.
+ !! You are working as a tool, not as a conversational agent.
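To illustrate the intended contract (an invented example, not part of the commit), a conforming exchange would look like:

```plaintext
Input:   Search Query: "What did the company report about quarterly revenue growth?"
Output:  quarterly revenue growth report
```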
prompts/default/fw.document_query.system_prompt.md ADDED
@@ -0,0 +1,5 @@
+ You are an AI assistant who can answer questions about a given document text.
+ The assistant is part of a larger application that is used to answer questions about a document.
+ The assistant is given a document and a list of queries, and must answer the queries based on the document.
+ !! The response should be in markdown format.
+ !! The response should only include the queries as headings and the answers to the queries. The markdown should contain paragraphs with "#### <Query>" as headings (<Query> being the original query), each followed by the query answer as the paragraph text content.
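A response conforming to this format would look like the following (content invented for illustration):

```markdown
#### What is the topic?
The document outlines the company's 2024 product roadmap.

#### Who is the audience?
Engineering managers and product stakeholders evaluating the roadmap.
```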
prompts/default/{tool.knowledge.response.md → fw.knowledge_tool.response.md} RENAMED
File without changes
python/helpers/document_query.py ADDED
@@ -0,0 +1,702 @@
+ import mimetypes
+ import os
+ import asyncio
+ import aiohttp
+ import json
+
+ os.environ["USER_AGENT"] = "@mixedbread-ai/unstructured"  # noqa: E402
+ from langchain_unstructured import UnstructuredLoader  # noqa: E402
+
+ from urllib.parse import urlparse
+ from typing import Sequence, List, Optional, Tuple
+ from datetime import datetime
+
+ from langchain_community.document_loaders import AsyncHtmlLoader
+ from langchain_community.document_loaders.text import TextLoader
+ from langchain_community.document_loaders.pdf import PyMuPDFLoader
+ from langchain_community.document_transformers import MarkdownifyTransformer
+ from langchain_community.document_loaders.parsers.images import TesseractBlobParser
+
+ from langchain_core.documents import Document
+ from langchain.prompts import ChatPromptTemplate
+ from langchain.schema import SystemMessage, HumanMessage
+ from langchain.storage import LocalFileStore
+ from langchain.embeddings import CacheBackedEmbeddings
+
+ from langchain_community.vectorstores import FAISS
+ import faiss
+ from langchain_community.docstore.in_memory import InMemoryDocstore
+ from langchain_community.vectorstores.utils import (
+     DistanceStrategy,
+ )
+ from langchain_core.embeddings import Embeddings
+
+ from python.helpers.print_style import PrintStyle
+ from python.helpers import files
+ from agent import Agent
+ import models
+
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+
+ class DocumentQueryStore:
+     """
+     FAISS store for document query results.
+     Manages documents identified by URI for storage, retrieval, and searching.
+     """
+
+     # Default chunking parameters
+     DEFAULT_CHUNK_SIZE = 1000
+     DEFAULT_CHUNK_OVERLAP = 100
+
+     # Cache for initialized stores
+     _stores: dict[str, "DocumentQueryStore"] = {}
+
+     @staticmethod
+     async def get(agent: Agent):
+         """Get or create a DocumentQueryStore instance for the specified agent."""
+         if not agent or not agent.config:
+             raise ValueError("Agent and agent config must be provided")
+
+         memory_subdir = agent.config.memory_subdir or "default"
+         store_key = f"{memory_subdir}/document_query"
+
+         if store_key not in DocumentQueryStore._stores:
+             # Initialize embeddings model from agent config
+             embeddings_model = agent.get_embedding_model()
+
+             # Initialize store
+             store = DocumentQueryStore(agent, embeddings_model, memory_subdir)
+             DocumentQueryStore._stores[store_key] = store
+             return store
+         else:
+             return DocumentQueryStore._stores[store_key]
+
+     @staticmethod
+     async def reload(agent: Agent):
+         """Reload the DocumentQueryStore for the specified agent."""
+         memory_subdir = agent.config.memory_subdir or "default"
+         store_key = f"{memory_subdir}/document_query"
+
+         if store_key in DocumentQueryStore._stores:
+             del DocumentQueryStore._stores[store_key]
+
+         return await DocumentQueryStore.get(agent)
+
+     def __init__(
+         self,
+         agent: Agent,
+         embeddings_model: Embeddings,
+         memory_subdir: str,
+     ):
+         """Initialize a DocumentQueryStore instance."""
+         self.agent = agent
+         self.memory_subdir = memory_subdir
+
+         # Get directory paths
+         db_dir = self._get_db_dir()
+         em_dir = os.path.join(db_dir, "embeddings")
+
+         # Create directories
+         os.makedirs(db_dir, exist_ok=True)
+         os.makedirs(em_dir, exist_ok=True)
+
+         # Setup embeddings cache
+         store = LocalFileStore(em_dir)
+         self.embeddings = CacheBackedEmbeddings.from_bytes_store(
+             embeddings_model,
+             store,
+             namespace=f"document_query_{getattr(embeddings_model, 'model', getattr(embeddings_model, 'model_name', 'default'))}",
+         )
+
+         # Initialize vector store
+         index_path = os.path.join(db_dir, "index.faiss")
+         docstore_path = os.path.join(db_dir, "docstore.json")
+
+         if os.path.exists(index_path) and os.path.exists(docstore_path):
+             PrintStyle.standard(f"Loading existing vector store from {db_dir}")
+             try:
+                 self.vectorstore = FAISS.load_local(
+                     folder_path=db_dir,
+                     embeddings=self.embeddings,
+                     allow_dangerous_deserialization=True,
+                     distance_strategy=DistanceStrategy.COSINE,
+                 )
+             except Exception as e:
+                 PrintStyle.error(f"Error loading vector store: {str(e)}")
+                 self._initialize_new_vectorstore()
+         else:
+             PrintStyle.standard(f"Creating new vector store in '{db_dir}'")
+             self._initialize_new_vectorstore()
+
+     def _initialize_new_vectorstore(self):
+         """Initialize a new vector store."""
+         dimension = len(self.embeddings.embed_query("test"))
+         index = faiss.IndexFlatIP(dimension)
+         self.vectorstore = FAISS(
+             embedding_function=self.embeddings,
+             index=index,
+             docstore=InMemoryDocstore(),
+             index_to_docstore_id={},
+             distance_strategy=DistanceStrategy.COSINE,
+         )
+
+     def _get_db_dir(self) -> str:
+         """Get the absolute path to the database directory."""
+         return files.get_abs_path("memory", self.memory_subdir, "document_query")
+
+     def _save_vectorstore(self):
+         """Save the vector store to disk."""
+         db_dir = self._get_db_dir()
+         PrintStyle.standard(f"Saving vector store to {db_dir}")
+         self.vectorstore.save_local(folder_path=db_dir)
+         PrintStyle.standard(f"Vector store saved with {len(self.vectorstore.index_to_docstore_id)} documents")
+
+     @staticmethod
+     def _normalize_uri(uri: str) -> str:
+         """
+         Normalize a document URI to ensure consistent lookup.
+
+         Args:
+             uri: The URI to normalize
+
+         Returns:
+             Normalized URI
+         """
+         # Convert to lowercase
+         normalized = uri.lower()
+
+         # Parse the URL to get the scheme
+         parsed = urlparse(normalized)
+         scheme = parsed.scheme or "file"
+
+         # Normalize based on scheme
+         if scheme == "file":
+             if not normalized.startswith("file:"):
+                 normalized = "file:" + normalized
+             if normalized.startswith("file://"):
+                 normalized = normalized.replace("file://", "file:")
+         elif scheme in ["http", "https"]:
+             # Always use https for web URLs
+             normalized = normalized.replace("http://", "https://")
+
+         return normalized
+
+     async def add_document(self, text: str, document_uri: str, metadata: Optional[dict] = None) -> bool:
+         """
+         Add a document to the store with the given URI.
+
+         Args:
+             text: The document text content
+             document_uri: The URI that uniquely identifies this document
+             metadata: Optional metadata for the document
+
+         Returns:
+             True if successful, False otherwise
+         """
+         # Normalize the URI
+         document_uri = self._normalize_uri(document_uri)
+
+         # Delete existing document if it exists to avoid duplicates
+         await self.delete_document(document_uri)
+
+         # Initialize metadata
+         doc_metadata = metadata or {}
+         doc_metadata["document_uri"] = document_uri
+         doc_metadata["timestamp"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+
+         # Split text into chunks
+         text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=self.DEFAULT_CHUNK_SIZE,
+             chunk_overlap=self.DEFAULT_CHUNK_OVERLAP,
+         )
+         chunks = text_splitter.split_text(text)
+
+         # Create documents
+         docs = []
+         for i, chunk in enumerate(chunks):
+             chunk_metadata = doc_metadata.copy()
+             chunk_metadata["chunk_index"] = i
+             chunk_metadata["total_chunks"] = len(chunks)
+             docs.append(Document(page_content=chunk, metadata=chunk_metadata))
+
+         if not docs:
+             PrintStyle.error(f"No chunks created for document: {document_uri}")
+             return False
+
+         # Apply rate limiter
+         try:
+             docs_text = "".join(chunk.page_content for chunk in docs)
+             await self.agent.rate_limiter(
+                 model_config=self.agent.config.embeddings_model,
+                 input=docs_text,
+             )
+
+             # Add documents to vector store
+             self.vectorstore.add_documents(documents=docs)
+             self._save_vectorstore()
+             PrintStyle.standard(f"Added document '{document_uri}' with {len(docs)} chunks")
+             return True
+         except Exception as e:
+             PrintStyle.error(f"Error adding document '{document_uri}': {str(e)}")
+             return False
+
+     async def get_document(self, document_uri: str) -> Optional[Document]:
+         """
+         Retrieve a document by its URI.
+
+         Args:
+             document_uri: The URI of the document to retrieve
+
+         Returns:
+             The complete document if found, None otherwise
+         """
+         # Normalize the URI
+         document_uri = self._normalize_uri(document_uri)
+
+         # Get all chunks for this document
+         docs = await self._get_document_chunks(document_uri)
+         if not docs:
+             PrintStyle.error(f"Document not found: {document_uri}")
+             return None
+
+         # Combine chunks into a single document
+         chunks = sorted(docs, key=lambda x: x.metadata.get("chunk_index", 0))
+         full_content = "\n".join(chunk.page_content for chunk in chunks)
+
+         # Use metadata from the first chunk
+         metadata = chunks[0].metadata.copy()
+         metadata.pop("chunk_index", None)
+         metadata.pop("total_chunks", None)
+
+         return Document(page_content=full_content, metadata=metadata)
+
+     async def _get_document_chunks(self, document_uri: str) -> List[Document]:
+         """
+         Get all chunks for a document.
+
+         Args:
+             document_uri: The URI of the document
+
+         Returns:
+             List of document chunks
+         """
+         # Normalize the URI
+         document_uri = self._normalize_uri(document_uri)
+
+         # Access the docstore directly
+         chunks = []
+         for doc_id, doc in self.vectorstore.docstore._dict.items():  # type: ignore
+             if isinstance(doc.metadata, dict) and doc.metadata.get("document_uri") == document_uri:
+                 chunks.append(doc)
+
+         PrintStyle.standard(f"Found {len(chunks)} chunks for document: {document_uri}")
+         return chunks
+
+     async def document_exists(self, document_uri: str) -> bool:
+         """
+         Check if a document exists in the store.
+
+         Args:
+             document_uri: The URI of the document to check
+
+         Returns:
+             True if the document exists, False otherwise
+         """
+         # Normalize the URI
+         document_uri = self._normalize_uri(document_uri)
+
+         chunks = await self._get_document_chunks(document_uri)
+         return len(chunks) > 0
+
+     async def delete_document(self, document_uri: str) -> bool:
+         """
+         Delete a document from the store.
+
+         Args:
+             document_uri: The URI of the document to delete
+
+         Returns:
+             True if deleted, False if not found
+         """
+         # Normalize the URI
+         document_uri = self._normalize_uri(document_uri)
+
+         chunks = await self._get_document_chunks(document_uri)
+         if not chunks:
+             return False
+
+         # Collect IDs to delete
+         ids_to_delete = []
+         for chunk in chunks:
+             for doc_id, doc_ref in self.vectorstore.docstore._dict.items():  # type: ignore
+                 if doc_ref == chunk:
+                     ids_to_delete.append(doc_id)
+
+         # Delete from vector store
+         if ids_to_delete:
+             self.vectorstore.delete(ids_to_delete)
+             self._save_vectorstore()
+             PrintStyle.standard(f"Deleted document '{document_uri}' with {len(ids_to_delete)} chunks")
+             return True
+
+         return False
+
+     async def expire_documents(self, older_than_days: float) -> int:
+         """
+         Delete documents older than the specified number of days.
+
+         Args:
+             older_than_days: Number of days (can be fractional) before the current time
+
+         Returns:
+             Number of documents deleted
+         """
+         if older_than_days <= 0:
+             return 0
+
+         # Calculate the cutoff timestamp
+         cutoff_date = datetime.now().timestamp() - (older_than_days * 24 * 60 * 60)
+
+         # Find expired documents
+         expired_uris = set()
+
+         # Check all documents in the store
+         for doc_id, doc in self.vectorstore.docstore._dict.items():  # type: ignore
+             if not isinstance(doc.metadata, dict):
+                 continue
+
+             # Only process each document once (first chunk)
+             if doc.metadata.get("chunk_index", 0) != 0:
+                 continue
+
+             doc_uri = doc.metadata.get("document_uri")
+             if not doc_uri:
+                 continue
+
+             try:
+                 # Check the timestamp
+                 timestamp_str = doc.metadata.get("timestamp")
+                 if not timestamp_str:
+                     continue
+
+                 doc_timestamp = datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S").timestamp()
+                 if doc_timestamp < cutoff_date:
+                     expired_uris.add(doc_uri)
+             except (ValueError, TypeError):
+                 # Skip documents with invalid timestamps
+                 continue
+
+         # Delete expired documents
+         deleted_count = 0
+         for uri in expired_uris:
+             if await self.delete_document(uri):
+                 deleted_count += 1
+
+         PrintStyle.standard(f"Expired {deleted_count} documents older than {older_than_days} days")
+         return deleted_count
+
+     async def search_documents(self, query: str, limit: int = 10, threshold: float = 0.5) -> List[Document]:
+         """
+         Search for documents similar to the query across the entire store.
+
+         Args:
+             query: The search query string
+             limit: Maximum number of results to return
+             threshold: Minimum similarity score threshold (0-1)
+
+         Returns:
+             List of matching documents
+         """
+         # Handle empty query
+         if not query:
+             PrintStyle.standard("Empty search query, returning empty results")
+             return []
+
+         # Apply rate limiter
+         await self.agent.rate_limiter(
+             model_config=self.agent.config.embeddings_model,
+             input=query,
+         )
+
+         # Perform the search
+         try:
+             results = self.vectorstore.similarity_search_with_score(
+                 query=query,
+                 k=limit,
+                 score_threshold=threshold,
+             )
+
+             # Extract documents from the results (which are (doc, score) pairs)
+             docs = [doc for doc, score in results]
+             PrintStyle.standard(f"Search '{query}' returned {len(docs)} results")
+             return docs
+         except Exception as e:
+             PrintStyle.error(f"Error searching documents: {str(e)}")
+             return []
+
+     async def search_document(self, document_uri: str, query: str, limit: int = 10, threshold: float = 0.5) -> List[Document]:
+         """
+         Search for content within a specific document.
+
+         Args:
+             document_uri: The URI of the document to search within
+             query: The search query string
+             limit: Maximum number of results to return
+             threshold: Minimum similarity score threshold (0-1)
+
+         Returns:
+             List of matching document chunks
+         """
+         # Normalize the URI
+         document_uri = self._normalize_uri(document_uri)
+
+         # Handle empty query
+         if not query:
+             PrintStyle.standard("Empty search query, returning empty results")
+             return []
+
+         # Check if the document exists
+         if not await self.document_exists(document_uri):
+             PrintStyle.error(f"Document not found: {document_uri}")
+             return []
+
+         # Apply rate limiter
+         await self.agent.rate_limiter(
+             model_config=self.agent.config.embeddings_model,
+             input=query,
+         )
+
+         # Perform the search with a document filter
+         try:
+             # Create a metadata filter function
+             def filter_fn(doc_metadata):
+                 return doc_metadata.get("document_uri") == document_uri
+
+             results = self.vectorstore.similarity_search_with_score(
+                 query=query,
+                 k=limit,
+                 score_threshold=threshold,
+                 filter=filter_fn,
+             )
+
+             # Extract documents from the results
+             docs = [doc for doc, score in results]
+             PrintStyle.standard(f"Search '{query}' in document '{document_uri}' returned {len(docs)} results")
+
+             # Retry with a lower threshold if there were no results
+             if not docs and threshold > 0.3:
+                 PrintStyle.standard("No results found, trying with lower threshold (0.3)")
+                 results = self.vectorstore.similarity_search_with_score(
+                     query=query,
+                     k=limit,
+                     score_threshold=0.3,
+                     filter=filter_fn,
+                 )
+                 docs = [doc for doc, score in results]
+                 PrintStyle.standard(f"Retry search returned {len(docs)} results")
+
+             return docs
+         except Exception as e:
+             PrintStyle.error(f"Error searching within document: {str(e)}")
+             return []
+
+     async def list_documents(self) -> List[str]:
+         """
+         Get a list of all document URIs in the store.
+
+         Returns:
+             List of document URIs
+         """
+         # Extract unique URIs
+         uris = set()
+         for doc in self.vectorstore.docstore._dict.values():  # type: ignore
+             if isinstance(doc.metadata, dict):
+                 uri = doc.metadata.get("document_uri")
+                 if uri:
+                     uris.add(uri)
+
+         return sorted(list(uris))
+
+
+ class DocumentQueryHelper:
+
+     def __init__(self, agent: Agent):
+         self.agent = agent
+         self.store: DocumentQueryStore = asyncio.run(DocumentQueryStore.get(agent))
+
+     async def document_qa(self, document_uri: str, questions: Sequence[str]) -> Tuple[bool, str]:
+         # Ensure the document is fetched and indexed before querying
+         _ = await self.document_get_content(document_uri)
+         content = ""
+         for question in questions:
+             human_content = f'Search Query: "{question}"'
+             system_content = self.agent.parse_prompt("fw.document_query.optmimize_query.md")
+
+             optimized_query = await self.agent.call_utility_model(
+                 system=system_content,
+                 message=human_content,
+             )
+
+             chunks = await self.store.search_document(
+                 document_uri=document_uri,
+                 query=str(optimized_query),
+                 limit=10000,
+                 threshold=0.66,
+             )
+             # Only append non-empty results, so the "no content" check below can trigger
+             if chunks:
+                 content += "\n\n----\n\n".join([chunk.page_content for chunk in chunks]) + "\n\n----\n\n"
+
+         if not content:
+             content = f"!!! No content found for document: {document_uri} matching queries: {json.dumps(questions)}"
+             return False, content
+
+         questions_str = "\n".join([f" * {question}" for question in questions])
+
+         qa_system_message = self.agent.parse_prompt("fw.document_query.system_prompt.md")
+         qa_user_message = f"# Document:\n{content}\n\n# Queries:\n{questions_str}"
+
+         ai_response = await self.agent.call_chat_model(
+             prompt=ChatPromptTemplate.from_messages([
+                 SystemMessage(content=qa_system_message),
+                 HumanMessage(content=qa_user_message),
+             ])
+         )
+
+         return True, str(ai_response)
+
+     async def document_get_content(self, document_uri: str) -> str:
+         url = urlparse(document_uri)
+         scheme = url.scheme or "file"
+         mimetype, encoding = mimetypes.guess_type(document_uri)
+         mimetype = mimetype or "application/octet-stream"
+
+         # If the type cannot be guessed from the URI, ask the server via a HEAD request
+         if mimetype == "application/octet-stream":
+             if url.scheme in ["http", "https"]:
+                 response: aiohttp.ClientResponse | None = None
+                 retries = 0
+                 last_error = ""
+                 while not response and retries < 3:
+                     try:
+                         async with aiohttp.ClientSession() as session:
+                             response = await session.head(document_uri, timeout=aiohttp.ClientTimeout(total=2.0), allow_redirects=True)
+                             if response.status > 399:
+                                 raise Exception(response.status)
+                         break
+                     except Exception as e:
+                         response = None  # discard a failed response so the loop retries
+                         await asyncio.sleep(1)
+                         last_error = str(e)
+                         retries += 1
+
+                 if not response:
+                     raise ValueError(f"DocumentQueryHelper::document_get_content: Document fetch error: {document_uri} ({last_error})")
+
+                 mimetype = response.headers["content-type"]
+                 if "content-length" in response.headers:
+                     content_length = float(response.headers["content-length"]) / 1024 / 1024  # MB
+                     if content_length > 25.0:
+                         raise ValueError(f"Document content length exceeds max. 25MB: {content_length} MB ({document_uri})")
+                 if mimetype and "; charset=" in mimetype:
+                     mimetype = mimetype.split("; charset=")[0]
+
+         if scheme == "file":
+             try:
+                 document_uri = os.path.abspath(url.path)
+             except Exception as e:
+                 raise ValueError(f"Invalid document path '{url.path}'") from e
+
+         if encoding:
+             raise ValueError(f"Compressed documents are unsupported '{encoding}' ({document_uri})")
+
+         if mimetype == "application/octet-stream":
+             raise ValueError(f"Unsupported document mimetype '{mimetype}' ({document_uri})")
+
+         # Use the store's normalization method
+         document_uri_norm = self.store._normalize_uri(document_uri)
+
+         await self.store.expire_documents(7)
+         exists = await self.store.document_exists(document_uri_norm)
+         document_content = ""
+         if not exists:
+             # Dispatch to a type-specific handler, then cache the result in the store
+             if mimetype.startswith("image/"):
+                 document_content = self.handle_image_document(document_uri, scheme)
+             elif mimetype == "text/html":
+                 document_content = self.handle_html_document(document_uri, scheme)
+             elif mimetype.startswith("text/") or mimetype == "application/json":
+                 document_content = self.handle_text_document(document_uri, scheme)
+             elif mimetype == "application/pdf":
+                 document_content = self.handle_pdf_document(document_uri, scheme)
+             else:
+                 document_content = self.handle_unstructured_document(document_uri, scheme)
+             await self.store.add_document(document_content, document_uri_norm)
+         else:
+             doc = await self.store.get_document(document_uri_norm)
+             if doc:
+                 document_content = doc.page_content
+             else:
+                 raise ValueError(f"DocumentQueryHelper::document_get_content: Document not found: {document_uri_norm}")
+         return document_content
+
+     def handle_image_document(self, document: str, scheme: str) -> str:
+         # Images are OCR'd via the unstructured hi_res pipeline
+         return self.handle_unstructured_document(document, scheme)
+
+     def handle_html_document(self, document: str, scheme: str) -> str:
+         if scheme in ["http", "https"]:
+             loader = AsyncHtmlLoader(web_path=document)
+         elif scheme == "file":
+             loader = TextLoader(file_path=document)
+         else:
+             raise ValueError(f"Unsupported scheme: {scheme}")
+
+         parts: list[Document] = loader.load()
+         # Convert the loaded HTML to markdown text
+         return "\n".join([element.page_content for element in MarkdownifyTransformer().transform_documents(parts)])
+
+     def handle_text_document(self, document: str, scheme: str) -> str:
+         if scheme in ["http", "https"]:
+             loader = AsyncHtmlLoader(web_path=document)
+         elif scheme == "file":
+             loader = TextLoader(file_path=document)
+         else:
+             raise ValueError(f"Unsupported scheme: {scheme}")
+
+         elements: list[Document] = loader.load()
+         return "\n".join([element.page_content for element in elements])
+
+     def handle_pdf_document(self, document: str, scheme: str) -> str:
+         if scheme not in ["file", "http", "https"]:
+             raise ValueError(f"Unsupported scheme: {scheme}")
+
+         loader = PyMuPDFLoader(
+             document,
+             mode="single",
+             extract_tables="markdown",
+             extract_images=True,
+             images_inner_format="text",
+             images_parser=TesseractBlobParser(),
+             pages_delimiter="\n",
+         )
+
+         elements: list[Document] = loader.load()
+         return "\n".join([element.page_content for element in elements])
+
+     def handle_unstructured_document(self, document: str, scheme: str) -> str:
+         if scheme in ["http", "https"]:
+             # loader = UnstructuredURLLoader(urls=[document], mode="single")
+             loader = UnstructuredLoader(
+                 web_url=document,
+                 mode="single",
+                 partition_via_api=False,
+                 # chunking_strategy="by_page",
+                 strategy="hi_res",
+             )
+         elif scheme == "file":
+             loader = UnstructuredLoader(
+                 file_path=document,
+                 mode="single",
+                 partition_via_api=False,
+                 # chunking_strategy="by_page",
+                 strategy="hi_res",
+             )
+         else:
+             raise ValueError(f"Unsupported scheme: {scheme}")
+
+         elements: list[Document] = loader.load()
+         return "\n".join([element.page_content for element in elements])
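A minimal sketch of driving the helper directly, outside the tool wrapper. It assumes an already-configured `agent` instance (its construction is framework-specific and not shown in this commit), and the URL is a placeholder. Note that `DocumentQueryHelper.__init__` calls `asyncio.run`, which the project appears to accommodate via the pinned `nest-asyncio` dependency when construction happens inside a running event loop:

```python
import asyncio

from python.helpers.document_query import DocumentQueryHelper

# `agent` is assumed to be a fully configured Agent instance
helper = DocumentQueryHelper(agent)  # builds or loads the FAISS store

async def ask(uri: str, questions: list[str]) -> str:
    # With no questions you could call helper.document_get_content(uri) instead;
    # document_qa fetches/indexes the document, then runs retrieval plus QA
    found, answers = await helper.document_qa(uri, questions)
    return answers if found else "no matching content"

print(asyncio.run(ask("https://example.com/whitepaper.pdf",
                      ["What problem does the document address?"])))
```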
python/tools/document_query.py ADDED
@@ -0,0 +1,20 @@
+ from python.helpers.tool import Tool, Response
+ from python.helpers.document_query import DocumentQueryHelper
+
+
+ class DocumentQueryTool(Tool):
+
+     async def execute(self, **kwargs):
+         # "document" is required; "queries" is optional, and a singular "query" is also accepted
+         document_uri = kwargs.get("document") or None
+         queries = kwargs["queries"] if "queries" in kwargs else [kwargs["query"]] if ("query" in kwargs and kwargs["query"]) else []
+         if not isinstance(document_uri, str) or not document_uri:
+             return Response(message="Error: no document provided", break_loop=False)
+         try:
+             helper = DocumentQueryHelper(self.agent)
+             if not queries:
+                 content = await helper.document_get_content(document_uri)
+             else:
+                 _, content = await helper.document_qa(document_uri, queries)
+             return Response(message=content, break_loop=False)
+         except Exception as e:  # pylint: disable=broad-exception-caught
+             return Response(message=f"Error processing document: {e}", break_loop=False)
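Because `execute` wraps a singular "query" argument into a one-element "queries" list, a call like the following (illustrative values) behaves like usage example 2 in the prompt file:

```json
{
    "tool_name": "document_query",
    "tool_args": {
        "document": "https://example.com/page.html",
        "query": "What is the main topic?"
    }
}
```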
python/tools/knowledge_tool.py CHANGED
@@ -6,6 +6,7 @@ from python.helpers.print_style import PrintStyle
  from python.helpers.errors import handle_error
  from python.helpers.searxng import search as searxng
  from python.tools.memory_load import DEFAULT_THRESHOLD as DEFAULT_MEMORY_THRESHOLD
+ from python.helpers.document_query import DocumentQueryHelper
 
  SEARCH_ENGINE_RESULTS = 10
 
@@ -26,6 +27,9 @@ class Knowledge(Tool):
          # perplexity_result, duckduckgo_result, memory_result = results
          searxng_result, memory_result = results
 
+         # enrich results with qa
+         searxng_result = await self.searxng_document_qa(searxng_result, question)
+
          # Handle exceptions and format results
          # perplexity_result = self.format_result(perplexity_result, "Perplexity")
          # duckduckgo_result = self.format_result(duckduckgo_result, "DuckDuckGo")
@@ -33,7 +37,7 @@ class Knowledge(Tool):
          memory_result = self.format_result(memory_result, "Memory")
 
          msg = self.agent.read_prompt(
-             "tool.knowledge.response.md",
+             "fw.knowledge_tool.response.md",
              # online_sources = ((perplexity_result + "\n\n") if perplexity_result else "") + str(duckduckgo_result),
              online_sources=((searxng_result + "\n\n") if searxng_result else ""),
              memory=memory_result,
@@ -66,6 +70,30 @@ class Knowledge(Tool):
      async def searxng_search(self, question):
          return await searxng(question)
 
+     async def searxng_document_qa(self, result, query):
+         if isinstance(result, Exception) or not query or not result or not result["results"]:
+             return result
+
+         result["results"] = result["results"][:SEARCH_ENGINE_RESULTS]
+
+         tasks = []
+         helper = DocumentQueryHelper(self.agent)
+
+         for index, item in enumerate(result["results"]):
+             tasks.append(helper.document_qa(item["url"], [query]))
+
+         task_results = list(await asyncio.gather(*tasks, return_exceptions=True))
+
+         for index, item in enumerate(result["results"]):
+             if isinstance(task_results[index], BaseException):
+                 continue
+             found, qa = task_results[index]  # type: ignore
+             if not found:
+                 continue
+             result["results"][index]["qa"] = qa
+
+         return result
+
      async def mem_search(self, question: str):
          db = await memory.Memory.get(self.agent)
          docs = await db.search_similarity_threshold(
@@ -87,6 +115,20 @@ class Knowledge(Tool):
 
          outputs = []
          for item in result["results"]:
-             outputs.append(f"{item['title']}\n{item['url']}\n{item['content']}")
+             if "qa" in item:
+                 outputs.append(
+                     f"## Next Result\n"
+                     f"Title: {item['title'].strip()}\n"
+                     f"URL: {item['url'].strip()}\n"
+                     f"Search Engine Summary: {item['content'].strip()}\n"
+                     f"Query Result: {item['qa'].strip()}"
+                 )
+             else:
+                 outputs.append(
+                     f"## Next Result\n"
+                     f"Title: {item['title'].strip()}\n"
+                     f"URL: {item['url'].strip()}\n"
+                     f"Search Engine Summary: {item['content'].strip()}"
+                 )
 
          return "\n\n".join(outputs[:SEARCH_ENGINE_RESULTS]).strip()
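With this change, each search result that gained a "qa" entry is rendered by the loop above roughly as follows (values invented for illustration):

```plaintext
## Next Result
Title: Example article
URL: https://example.com/article
Search Engine Summary: A short snippet returned by the search engine...
Query Result: #### What is the main topic?
The article mainly discusses...
```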
requirements.txt CHANGED
@@ -19,6 +19,7 @@ langchain-huggingface==0.1.2
  langchain-mistralai==0.2.4
  langchain-ollama==0.2.2
  langchain-openai==0.3.1
+ langchain-unstructured[all-docs]==0.1.6
  openai-whisper==20240930
  lxml_html_clean==0.3.1
  markdown==3.7
@@ -31,8 +32,12 @@ python-dotenv==1.1.0
  pytz==2024.2
  sentence-transformers==3.0.1
  tiktoken==0.8.0
- unstructured==0.15.13
- unstructured-client==0.25.9
+ unstructured[all-docs]==0.16.23
+ unstructured-client==0.31.0
  webcolors==24.6.0
  nest-asyncio==1.6.0
+ markdownify==0.14.1
+ pymupdf==1.25.3
+ pytesseract==0.3.13
+ pdf2image==1.17.0
  crontab==1.0.1
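Note: several of the added packages wrap system binaries rather than pure-Python code. `pytesseract` requires a Tesseract OCR installation, `pdf2image` requires Poppler, and the `hi_res` strategy used with `unstructured[all-docs]` pulls in OCR and layout-detection dependencies of its own; these system packages must be available in the runtime image for the new tool's OCR paths to work.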