---
title: Chagu Demo
emoji: π
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
license: mit
short_description: 'this is demo for chain guard protocol, assistant, RAG '
---
# **AI-Powered Document Search with Malicious Query Detection**

This project implements a semantic search engine for documents with **AI-based malicious query detection**. Users can search movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.
## **Features**

- **Semantic Search**: Uses fuzzy matching for normal queries, so inexact wording can still match relevant documents.
- **AI-Based Malicious Query Detection**: Uses a pre-trained NLP model (`DistilBERT`) to flag queries with malicious intent, such as potential SQL injection, and blocks them.
- **Flexible Document Ingestion**: Loads documents from the IMDB dataset and from additional `.txt` files.
- **Efficient Path Handling**: Resolves dataset paths automatically from the `HOME` environment variable.
## **Technologies Used**

- **Python 3.8+**
- **Transformers**: For NLP-based malicious query detection.
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
- **Pathlib**: For robust file and path handling.
## **Project Structure**

├── rag_chagu_demo.py      # Main script containing the DocumentSearcher class
├── README.md              # This file
└── data-sets/             # This part has been moved to $HOME
    ├── aclImdb/
    │   └── train/
    │       ├── pos/       # Positive movie reviews
    │       └── neg/       # Negative movie reviews
    └── txt-files/         # Additional .txt files for document search
## **Installation**

Make sure Python 3.8 or higher is installed. Then install the required dependencies. The Hugging Face pipeline needs a backend such as PyTorch in addition to `transformers`:

```bash
pip install transformers torch
```
### **Dataset Setup**

Place the IMDB dataset in the following structure:

- `$HOME/data-sets/aclImdb/train/pos/`
- `$HOME/data-sets/aclImdb/train/neg/`

Optionally, place additional `.txt` files under:

- `$HOME/data-sets/txt-files/`
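
The directory layout above could be resolved and read with `pathlib` roughly as follows. This is a minimal sketch, not the code in `rag_chagu_demo.py`: the helper names `dataset_dirs` and `load_documents` and the `limit` parameter are assumptions made for illustration.

```python
from pathlib import Path


def dataset_dirs(home=None):
    """Return the three document directories described above.

    `home` is a hypothetical override for testing; by default the
    user's home directory is used.
    """
    base = (Path(home) if home else Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",
        "neg": base / "aclImdb" / "train" / "neg",
        "txt": base / "txt-files",
    }


def load_documents(directory, limit=None):
    """Read every .txt file in `directory`; a missing directory yields []."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    files = sorted(directory.glob("*.txt"))
    if limit is not None:
        files = files[:limit]
    # errors="ignore" skips undecodable bytes instead of raising.
    return [f.read_text(encoding="utf-8", errors="ignore") for f in files]
```

Calling `load_documents(dataset_dirs()["pos"], limit=2500)` would then read up to 2500 positive reviews. Returning an empty list for a missing directory keeps the optional `txt-files/` folder truly optional.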
## **Usage**

Run the script with the following command:

```bash
python rag_chagu_demo.py
```
## **Example Output**

```
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
Loaded 5000 movie reviews from IMDB dataset.
Normal Query Results:
Document: This movie had great acting and a compelling storyline. The characters were well-developed...
Malicious Query Detected - Confidence: 0.95
Malicious Query Results:
Document: ANOMALY: Query blocked due to detected malicious intent.
```
## How It Works

1. The script initializes the `DocumentSearcher` class, which loads the movie reviews and any additional `.txt` documents.
2. The `is_query_malicious()` method runs each query through a pre-trained sentiment-analysis model to detect potential malicious intent.
3. If a query is flagged as malicious, it is blocked and an anomaly message is returned.
4. Otherwise, a fuzzy search runs over the documents and returns the most relevant matches.
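
The steps above can be sketched as follows. This is an illustrative reconstruction, not the code in `rag_chagu_demo.py`: the constructor arguments, the `search` method name, the `threshold` value, the rule that a confident `NEGATIVE` label counts as malicious, and the use of `difflib.SequenceMatcher` for fuzzy matching are all assumptions. The classifier is passed in as a callable so the sketch runs without downloading a model; `keyword_stub` is a hypothetical stand-in for it.

```python
from difflib import SequenceMatcher

ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


class DocumentSearcher:
    """Illustrative stand-in for the class of the same name in the script."""

    def __init__(self, documents, classifier, threshold=0.9):
        # `classifier` is any callable returning [{"label": str, "score": float}],
        # the shape a Hugging Face text-classification pipeline returns.
        self.documents = list(documents)
        self.classifier = classifier
        self.threshold = threshold  # assumed cut-off, not taken from the script

    def is_query_malicious(self, query):
        result = self.classifier(query)[0]
        # Assumption: a confident NEGATIVE sentiment is treated as malicious.
        return result["label"] == "NEGATIVE" and result["score"] >= self.threshold

    def search(self, query, top_k=3):
        if self.is_query_malicious(query):
            return [ANOMALY_MESSAGE]
        q = query.lower()
        # Rank documents by character-level similarity to the query.
        scored = [
            (SequenceMatcher(None, q, doc.lower()).ratio(), doc)
            for doc in self.documents
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[:top_k] if score > 0]


def keyword_stub(query):
    """Hypothetical classifier so the sketch runs offline."""
    bad = any(tok in query.lower() for tok in ("drop table", "delete from"))
    return [{"label": "NEGATIVE" if bad else "POSITIVE", "score": 0.95}]


if __name__ == "__main__":
    docs = ["great acting and a compelling storyline", "boring plot"]
    searcher = DocumentSearcher(docs, keyword_stub)
    print(searcher.search("great acting", top_k=1))
    print(searcher.search("DROP TABLE users;"))
```

Injecting the classifier keeps the blocking logic testable on its own. Whole-document `SequenceMatcher` ratios are a simple placeholder and degrade on long reviews, so a real implementation would likely score shorter passages.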
## AI Model Used

The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face for detecting malicious queries based on sentiment analysis.
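
Loading this model through the Hugging Face `pipeline` API might look like the sketch below. The helper names `load_classifier` and `describe` are hypothetical; only the model name comes from this README. The import is deferred into the function so the module can be imported before `transformers` is installed.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def load_classifier(model_name=MODEL_NAME):
    """Build the sentiment pipeline; the model is downloaded on first use."""
    # Lazy import: requires `transformers` plus a backend such as PyTorch.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name)


def describe(prediction):
    """Format one prediction dict, e.g. {"label": "NEGATIVE", "score": 0.95}."""
    return f"{prediction['label']} (confidence {prediction['score']:.2f})"


# Example usage (needs network access for the first download):
#   classifier = load_classifier()
#   print(describe(classifier("DROP TABLE users; --")[0]))
```

The pipeline returns one dict per input with a `label` of `POSITIVE` or `NEGATIVE` and a `score` between 0 and 1. How those values map to "malicious" is a decision made by the script, not by the model.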
## Why Use AI for Malicious Query Detection?

Traditional pattern matching only catches the attack patterns it was written for, so it can miss more sophisticated or novel ones. A pre-trained NLP model draws on the semantic content of the text, which allows the system to detect a wider range of harmful queries.
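
To illustrate the limitation of pattern matching only, here is a small hypothetical regex blocklist. The patterns and the `matches_blocklist` name are assumptions and are not part of this project. It catches a textbook injection but misses the same statement once the whitespace is replaced by an SQL comment.

```python
import re

# Hypothetical blocklist of well-known SQL injection fragments.
BLOCKLIST = [
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),
    re.compile(r"'\s*or\s+1\s*=\s*1", re.IGNORECASE),
]


def matches_blocklist(query):
    """Return True if any blocklist pattern occurs in the query."""
    return any(pattern.search(query) for pattern in BLOCKLIST)


if __name__ == "__main__":
    print(matches_blocklist("DROP TABLE users;"))     # plain form is caught
    print(matches_blocklist("DROP/**/TABLE users;"))  # comment-obfuscated form is missed
```

Each new obfuscation needs another hand-written pattern, which is the maintenance burden this section describes. The sketch does not show that a sentiment model handles such inputs better; that would need to be measured separately.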
#### Improvements and Future Work

- **Custom Fine-Tuning**: The current version uses a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
- **Real-Time Query Monitoring**: A real-time monitoring system could detect and log malicious queries for further analysis.
#### Contributing

Feel free to fork this repository and submit pull requests. Contributions are welcome!
#### License

This project is licensed under the MIT License. See the LICENSE file for details.
#### Contact

For any questions or issues, please contact the project maintainer:

- Name: Talex Maxim
- Email: taimax13@gmail.com
- GitHub: taimax13