# **Document Search System**

## **Overview**
The **Document Search System** provides context-aware and secure responses to user queries by combining query analysis, document retrieval, semantic response generation, and blockchain-powered logging. The system also integrates Neo4j for storing and visualizing relationships between queries, documents, and responses.

---

## **Features**
1. **Query Classification:**
   - Detects malicious or inappropriate queries using a sentiment analysis model.
   - Blocks malicious queries and prevents them from further processing.
2. **Query Transformation:**
   - Rephrases or enhances ambiguous queries to improve retrieval accuracy.
   - Uses rule-based transformations and advanced text-to-text models.
3. **RAG Pipeline:**
   - Retrieves top-k documents based on semantic similarity.
   - Generates context-aware responses using generative models.
4. **Blockchain Integration (Chagu):**
   - Logs all stages of query processing into a blockchain for integrity and traceability.
   - Validates blockchain integrity.
5. **Neo4j Integration:**
   - Stores and visualizes relationships between queries, responses, and documents.
   - Allows detailed querying and visualization of the data flow.

---

## **Workflow**
The system processes each query in six stages to produce accurate, secure, and context-aware responses:

### **1. Input Query**
- A user submits a query, which may be a general question, an ambiguous statement, or a potentially malicious input.

---

### **2. Detection Module**
- **Purpose**: Classify the query as "bad" or "good."
- **Steps**:
  1. Use a sentiment analysis model (`distilbert-base-uncased-finetuned-sst-2-english`) to detect malicious or inappropriate intent.
  2. If the query is classified as "bad" (e.g., SQL injection or inappropriate tone), block further processing and return a warning message.
  3. If the query is classified as "good," proceed to the **Transformation Module**.
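
The gating logic above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the classifier returns results shaped like a Hugging Face text-classification pipeline (`[{"label": ..., "score": ...}]`), that a confident `NEGATIVE` label is what counts as "bad", and that `BLOCK_THRESHOLD`, `classify_query`, `handle_query`, and `fake_classifier` are hypothetical names. A stand-in classifier keeps the sketch runnable without downloading the model.

```python
from typing import Any, Callable, Dict, List

# Assumed result shape, modelled on a Hugging Face text-classification
# pipeline: a list with one {"label": ..., "score": ...} dict per input.
Classifier = Callable[[str], List[Dict[str, Any]]]

BLOCK_LABEL = "NEGATIVE"   # SST-2 models label text POSITIVE or NEGATIVE
BLOCK_THRESHOLD = 0.9      # hypothetical cut-off; tune on real queries


def classify_query(query: str, classifier: Classifier) -> str:
    """Return "bad" when the classifier is confidently negative, else "good"."""
    result = classifier(query)[0]
    is_bad = (result["label"] == BLOCK_LABEL
              and float(result["score"]) >= BLOCK_THRESHOLD)
    return "bad" if is_bad else "good"


def handle_query(query: str, classifier: Classifier) -> str:
    """Block "bad" queries with a warning; pass "good" ones downstream."""
    if classify_query(query, classifier) == "bad":
        return "Warning: this query was blocked."
    return f"Proceeding to transformation: {query}"


def fake_classifier(text: str) -> List[Dict[str, Any]]:
    """Stand-in for the real model so the sketch runs without downloads."""
    hostile = "drop table" in text.lower()
    return [{"label": "NEGATIVE" if hostile else "POSITIVE", "score": 0.99}]


# With transformers installed, the real model could be plugged in instead:
#   from transformers import pipeline
#   classifier = pipeline(
#       "sentiment-analysis",
#       model="distilbert-base-uncased-finetuned-sst-2-english")
print(handle_query("How to improve acting skills?", fake_classifier))
print(handle_query("'; DROP TABLE users; --", fake_classifier))
```

Passing the classifier in as an argument keeps the block/allow decision testable on its own and makes it easy to swap in a different detection model later.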

---

### **3. Transformation Module**
- **Purpose**: Rephrase or enhance ambiguous or poorly structured queries for better retrieval.
- **Steps**:
  1. Identify missing context or ambiguous phrasing.
  2. Transform the query using:
     - Rule-based transformations for simple fixes.
     - Text-to-text models (e.g., `google/flan-t5-small`) for more sophisticated rephrasing.
  3. Pass the transformed query to the **RAG Pipeline**.
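
As an illustration of the rule-based path, here is a small sketch. The specific rules (whitespace clean-up, an abbreviation table, a trailing question mark) and the names `ABBREVIATIONS`, `rule_based_transform`, and `transform_query` are assumptions made for the example; the project's real rules are not documented here. The optional `rephrase` callable is where a text-to-text model would be plugged in.

```python
import re
from typing import Callable, Optional

# Hypothetical abbreviation table; the project's actual rules are not shown.
ABBREVIATIONS = {"docs": "documents", "info": "information"}


def rule_based_transform(query: str) -> str:
    """Apply simple, deterministic clean-ups to a raw query."""
    text = re.sub(r"\s+", " ", query).strip()            # collapse whitespace
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split(" ")]
    text = " ".join(words)                                # expand abbreviations
    if text and text[-1] not in ".?!":
        text += "?"                                       # treat it as a question
    return text[:1].upper() + text[1:]                    # capitalise first letter


def transform_query(query: str,
                    rephrase: Optional[Callable[[str], str]] = None) -> str:
    """Run the rules first, then an optional text-to-text rephraser."""
    cleaned = rule_based_transform(query)
    return rephrase(cleaned) if rephrase else cleaned


# A model-backed rephraser could be passed in, for example (assuming a
# transformers version that provides the "text2text-generation" task):
#   from transformers import pipeline
#   t5 = pipeline("text2text-generation", model="google/flan-t5-small")
#   rephrase = lambda q: t5(f"Rephrase clearly: {q}")[0]["generated_text"]
print(transform_query("  find   info on acting docs "))
```

Running the cheap rules first means the model only sees normalized input, and simple queries need no model call at all.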

---

### **4. RAG Pipeline**
- **Purpose**: Retrieve relevant data and generate a context-aware response.
- **Steps**:
  1. **Document Retrieval**:
     - Encode the transformed query and documents into embeddings using `all-MiniLM-L6-v2`.
     - Compute semantic similarity between the query and stored documents.
     - Retrieve the top-k documents relevant to the query.
  2. **Response Generation**:
     - Use the retrieved documents as context.
     - Pass the query and context to a generative model (e.g., `distilgpt2`) to synthesize a meaningful response.
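
The retrieval step can be sketched with plain Python once embeddings exist. The example below assumes cosine similarity as the similarity measure (the README only says "semantic similarity") and uses toy 3-dimensional vectors in place of real `all-MiniLM-L6-v2` embeddings, which have 384 dimensions. The function names and document names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_documents(query_vec: Sequence[float],
                    doc_vecs: Dict[str, Sequence[float]],
                    k: int = 2) -> List[Tuple[str, float]]:
    """Rank documents by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumption for illustration).
docs = {
    "acting_tips.txt": [0.9, 0.1, 0.0],
    "film_review.txt": [0.6, 0.6, 0.1],
    "recipes.txt": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]

for name, score in top_k_documents(query, docs, k=2):
    print(f"{name}: {score:.3f}")
```

For larger collections, a vectorized library or a vector index would replace the Python loop, but the ranking logic stays the same.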

---

### **5. Semantic Response Generation**
- **Purpose**: Provide a concise and meaningful answer.
- **Steps**:
  1. Combine the retrieved documents into a coherent context.
  2. Generate a response tailored to the query using the generative model.
  3. Return the response to the user, ensuring clarity and relevance.
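
One way to combine the retrieved documents with the query is a simple prompt template, sketched below. The template wording, the `MAX_CONTEXT_CHARS` budget, and the function names are assumptions for illustration. The `generate` callable stands in for the generative model so the sketch runs without downloading it.

```python
from typing import Callable, List

MAX_CONTEXT_CHARS = 1000  # hypothetical budget to keep the prompt short


def build_context(documents: List[str],
                  max_chars: int = MAX_CONTEXT_CHARS) -> str:
    """Join non-empty documents into one context string, truncated to fit."""
    context = "\n\n".join(doc.strip() for doc in documents if doc.strip())
    return context[:max_chars]


def build_prompt(query: str, documents: List[str]) -> str:
    """Place the context before the question so the model can draw on it."""
    return f"Context:\n{build_context(documents)}\n\nQuestion: {query}\nAnswer:"


def generate_response(query: str, documents: List[str],
                      generate: Callable[[str], str]) -> str:
    """Build the prompt, call the model, and return only the new text."""
    prompt = build_prompt(query, documents)
    completion = generate(prompt)
    # Causal language models usually echo the prompt; strip it if present.
    if completion.startswith(prompt):
        completion = completion[len(prompt):]
    return completion.strip()


def echo_model(prompt: str) -> str:
    """Stand-in generator that echoes the prompt plus a canned answer."""
    return prompt + " Practice regularly and study performances."


# With transformers installed, distilgpt2 could be wrapped like this:
#   from transformers import pipeline
#   gpt = pipeline("text-generation", model="distilgpt2")
#   generate = lambda p: gpt(p, max_new_tokens=50)[0]["generated_text"]
print(generate_response("How to improve acting skills?",
                        ["Acting improves with rehearsal.", "  "],
                        echo_model))
```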

---

### **6. Logging and Storage**
- **Blockchain Logging:**
  - The query, transformed query, retrieved documents, and response are each logged to the blockchain for traceability.
  - Makes the records tamper-evident, so any later modification can be detected.
- **Neo4j Storage:**
  - Relationships between queries, responses, and retrieved documents are stored in Neo4j.
  - Enables detailed analysis and graph-based visualization.
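
To show why a chained log makes tampering detectable, here is a minimal hash-chain sketch. It is not the API of the project's blockchain logger. The block fields and the functions `append_block` and `is_chain_valid` are hypothetical, and a plain SHA-256 chain is assumed in place of whatever scheme the real logger uses.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def _block_hash(index: int, stage: str, payload: str, prev_hash: str) -> str:
    """Hash the block contents together with the previous block's hash."""
    record = json.dumps([index, stage, payload, prev_hash])
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict], stage: str, payload: str) -> Dict:
    """Append one pipeline stage to the chain and return the new block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    block = {"index": index, "stage": stage, "payload": payload,
             "prev_hash": prev_hash,
             "hash": _block_hash(index, stage, payload, prev_hash)}
    chain.append(block)
    return block


def is_chain_valid(chain: List[Dict]) -> bool:
    """Recompute every hash and check each link to the previous block."""
    prev_hash = GENESIS_HASH
    for i, block in enumerate(chain):
        expected = _block_hash(block["index"], block["stage"],
                               block["payload"], block["prev_hash"])
        if (block["index"] != i or block["prev_hash"] != prev_hash
                or block["hash"] != expected):
            return False
        prev_hash = block["hash"]
    return True


chain: List[Dict] = []
append_block(chain, "query", "how to improve acting skills")
append_block(chain, "transformed_query", "How to improve acting skills?")
append_block(chain, "retrieval", "acting_tips.txt, film_review.txt")
append_block(chain, "response", "Practice regularly.")
print(is_chain_valid(chain))         # True: untouched chain verifies

chain[1]["payload"] = "edited later"
print(is_chain_valid(chain))         # False: stored hash no longer matches
```

Because each block's hash covers the previous hash, changing any earlier record invalidates every check from that point on.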

---

## **Neo4j Visualization**
Here is an example of how the relationships between queries, responses, and documents appear in Neo4j:



- **Nodes**:
  - `Query`: Represents the user query.
  - `TransformedQuery`: Rephrased or improved query.
  - `Document`: Relevant documents retrieved based on the query.
  - `Response`: The generated response.
- **Relationships**:
  - `RETRIEVED`: Links the query to retrieved documents.
  - `TRANSFORMED_TO`: Links the original query to the transformed query.
  - `GENERATED`: Links the query to the generated response.
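
A graph with these nodes and relationships could be written with a single parameterized Cypher statement, sketched below in Python. The node labels and relationship types come from the lists above. The property names (`text`, `name`, `timestamp`), the `$` parameter names, and the helper `build_store_statement` are assumptions. The helper only builds the statement and its parameters, so it runs without a database.

```python
from typing import Any, Dict, List, Tuple

# MERGE avoids duplicate nodes when the same query or document recurs.
STORE_INTERACTION = """
MERGE (q:Query {text: $query})
  ON CREATE SET q.timestamp = datetime()
MERGE (t:TransformedQuery {text: $transformed})
MERGE (r:Response {text: $response})
MERGE (q)-[:TRANSFORMED_TO]->(t)
MERGE (q)-[:GENERATED]->(r)
WITH q
UNWIND $documents AS doc_name
MERGE (d:Document {name: doc_name})
MERGE (q)-[:RETRIEVED]->(d)
"""


def build_store_statement(query: str, transformed: str, response: str,
                          documents: List[str]) -> Tuple[str, Dict[str, Any]]:
    """Return the Cypher text and its parameters for one interaction."""
    params = {"query": query, "transformed": transformed,
              "response": response, "documents": list(documents)}
    return STORE_INTERACTION, params


cypher, params = build_store_statement(
    "how to improve acting skills",
    "How to improve acting skills?",
    "Practice regularly.",
    ["acting_tips.txt", "film_review.txt"],
)
print(params)

# With the neo4j driver installed and a reachable instance (URI and
# credentials below are placeholders), the statement could be run as:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("neo4j+s://<host>", auth=("<user>", "<password>"))
#   with driver.session() as session:
#       session.run(cypher, params)
#   driver.close()
```

Passing values as parameters rather than formatting them into the Cypher string avoids injection issues with user-supplied query text.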

---

## **Setup Instructions**
1. Clone the repository:
   ```bash
   git clone https://github.com/your-repo/document-search-system.git
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Initialize the Neo4j database:
   - Connect to your Neo4j Aura instance.
   - Set up credentials in the code.
4. Load the dataset:
   - Place your documents in the dataset directory (e.g., `data-sets/aclImdb/train`).
5. Run the system:
   ```bash
   python document_search_system.py
   ```
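
For the dataset step, a loader along the following lines could read the documents into memory. This is a sketch under assumptions: it expects plain-text `.txt` files anywhere below the dataset directory, and `load_documents` and its `limit` parameter are hypothetical names, not part of the project.

```python
from pathlib import Path
from typing import Dict


def load_documents(dataset_dir: str, limit: int = 100) -> Dict[str, str]:
    """Read up to `limit` .txt files below dataset_dir into {path: text}."""
    root = Path(dataset_dir)
    documents: Dict[str, str] = {}
    for path in sorted(root.rglob("*.txt")):     # deterministic order
        if len(documents) >= limit:
            break
        # Keys are paths relative to the dataset root.
        documents[str(path.relative_to(root))] = path.read_text(
            encoding="utf-8", errors="replace")
    return documents


if __name__ == "__main__":
    docs = load_documents("data-sets/aclImdb/train", limit=5)
    print(f"Loaded {len(docs)} documents")
```

The `limit` argument is useful for a quick smoke test before embedding the full dataset.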

---

## **Neo4j Queries**

### **Retrieve All Queries Logged**
```cypher
MATCH (q:Query)
RETURN q.text AS query, q.timestamp AS timestamp
ORDER BY timestamp DESC
```

### **Visualize Query Relationships**
```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m
```

### **Find Documents for a Query**
```cypher
MATCH (q:Query {text: "How to improve acting skills?"})-[:RETRIEVED]->(d:Document)
RETURN d.name AS document_name
```

---

## **Key Technologies**
1. **Machine Learning Models:**
   - `distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis.
   - `google/flan-t5-small` for query transformation.
   - `distilgpt2` for response generation.
2. **Vector Similarity Search:**
   - `all-MiniLM-L6-v2` embeddings for document retrieval.
3. **Blockchain Logging:**
   - Powered by `chainguard.blockchain_logger`.
4. **Graph-Based Storage:**
   - Relationships visualized and queried via Neo4j.