talexm
committed on
Commit
·
aeb8626
1
Parent(s):
595bead
descriptive doc
README.md
CHANGED
|
@@ -11,5 +11,98 @@ license: mit
|
|
| 11 |
short_description: 'this is demo for chain guard protocol, assistant, RAG '
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 14 |
+
# **AI-Powered Document Search with Malicious Query Detection**
|
| 15 |
+
|
| 16 |
+
This project implements a semantic search engine for documents with **AI-based malicious query detection**. Users can search through movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.
|
| 17 |
+
|
| 18 |
+
## **Features**
|
| 19 |
+
- **Semantic Search**: Uses fuzzy matching for normal queries, allowing context-aware searches.
|
| 20 |
+
- **AI-Based Malicious Query Detection**: Utilizes a pre-trained NLP model (`DistilBERT`) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
|
| 21 |
+
- **Flexible Document Ingestion**: Supports loading documents from the IMDB dataset and additional `.txt` files.
|
| 22 |
+
- **Efficient Path Handling**: Automatically handles dataset paths using the `HOME` environment variable.
|
| 23 |
+
|
| 24 |
+
## **Technologies Used**
|
| 25 |
+
- **Python 3.8+**
|
| 26 |
+
- **Transformers**: For NLP-based malicious query detection.
|
| 27 |
+
- **Hugging Face Pipeline**: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
|
| 28 |
+
- **Pathlib**: For robust file and path handling.
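
As a minimal sketch of the path handling described above, the dataset directories could be derived from the home directory with `pathlib`. The helper name `dataset_paths` is hypothetical, and the directory layout is the one given in the dataset setup instructions of this README; the actual script may build these paths differently.

```python
import os
from pathlib import Path


def dataset_paths(home=None):
    """Return the expected dataset directories under the home directory.

    `home` defaults to the HOME environment variable, falling back to
    Path.home() when HOME is not set (for example on Windows).
    """
    base = Path(home or os.environ.get("HOME") or Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",
        "neg": base / "aclImdb" / "train" / "neg",
        "txt": base / "txt-files",
    }


if __name__ == "__main__":
    # Show where the script would look and whether each directory exists.
    for name, path in dataset_paths().items():
        print(f"{name}: {path} (exists: {path.is_dir()})")
```

Building paths with the `/` operator avoids hand-written separators, so the same code works regardless of where the home directory lives.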
|
| 29 |
+
|
| 30 |
+
## **Project Structure**
|
| 31 |
+
├── rag_chagu_demo.py # Main script containing the DocumentSearcher class
|
| 32 |
+
├── README.md # This file
|
| 33 |
+
├── data-sets/ # Moved to $HOME/data-sets/
|
| 34 |
+
│   ├── aclImdb/
|
| 35 |
+
│   │   ├── train/
|
| 36 |
+
│   │   │   ├── pos/ # Positive movie reviews
|
| 37 |
+
│   │   │   ├── neg/ # Negative movie reviews
|
| 38 |
+
│   ├── txt-files/ # Additional .txt files for document search
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
## **Installation**
|
| 42 |
+
Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:
|
| 43 |
+
|
| 44 |
+
```bash
|
| 45 |
+
# transformers pipelines also need a backend framework such as PyTorch
pip install transformers torch
|
| 46 |
+
```
|
| 47 |
+
## Dataset Setup
|
| 48 |
+
Place the IMDB dataset in the following structure:
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
- `$HOME/data-sets/aclImdb/train/pos/`
|
| 53 |
+
- `$HOME/data-sets/aclImdb/train/neg/`
|
| 54 |
+
Optionally, place additional `.txt` files under:
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
- `$HOME/data-sets/txt-files/`
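
A minimal sketch of how documents laid out this way could be read into memory. The helper name `load_documents` and its `limit` parameter are hypothetical, used here only for illustration; the loader inside `rag_chagu_demo.py` may work differently.

```python
from pathlib import Path


def load_documents(directories, limit=None):
    """Read every .txt file in the given directories into a list of strings.

    Missing directories are skipped, so the optional txt-files folder may be
    absent. `limit` caps the number of files read per directory.
    """
    documents = []
    for directory in map(Path, directories):
        if not directory.is_dir():
            continue
        files = sorted(directory.glob("*.txt"))
        for path in files[:limit]:
            # Ignore undecodable bytes rather than failing on one bad file.
            documents.append(path.read_text(encoding="utf-8", errors="ignore"))
    return documents


if __name__ == "__main__":
    base = Path.home() / "data-sets"
    docs = load_documents([
        base / "aclImdb" / "train" / "pos",
        base / "aclImdb" / "train" / "neg",
        base / "txt-files",
    ])
    print(f"Loaded {len(docs)} documents.")
```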
|
| 59 |
+
## Usage
|
| 60 |
+
Run the script with the following command:
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
```bash
|
| 64 |
+
python rag_chagu_demo.py
|
| 65 |
+
```
|
| 66 |
+
## Example Output
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
|
| 70 |
+
Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
|
| 71 |
+
Loaded 5000 movie reviews from IMDB dataset.
|
| 72 |
+
|
| 73 |
+
Normal Query Results:
|
| 74 |
+
Document: This movie had great acting and a compelling storyline. The characters were well-developed...
|
| 75 |
+
|
| 76 |
+
Malicious Query Detected - Confidence: 0.95
|
| 77 |
+
Malicious Query Results:
|
| 78 |
+
|
| 79 |
+
Document: ANOMALY: Query blocked due to detected malicious intent.
|
| 80 |
+
|
| 81 |
+
```
|
| 82 |
+
## How It Works
|
| 83 |
+
- The script initializes the `DocumentSearcher` class, which loads movie reviews and additional `.txt` documents.
|
| 84 |
+
- The `is_query_malicious()` method uses a pre-trained NLP model to detect queries with potential malicious intent based on sentiment analysis.
|
| 85 |
+
- If a query is flagged as malicious, it is blocked and an anomaly message is returned.
|
| 86 |
+
- For normal queries, the searcher performs a fuzzy search through the documents and returns the most relevant matches.
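
The flow above can be sketched as follows. This is an illustration under stated assumptions, not the code in `rag_chagu_demo.py`: the function names `fuzzy_score` and `search`, the `top_k` and `min_score` parameters, and the use of the standard-library `difflib` for fuzzy matching are all choices made for this example. The malicious-query check is passed in as a callable so the gating logic can be read on its own.

```python
from difflib import SequenceMatcher

BLOCKED_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


def fuzzy_score(query, document):
    """Average, over query terms, of the best word-level similarity in the document."""
    words = document.lower().split()
    terms = query.lower().split()
    if not words or not terms:
        return 0.0
    best = [max(SequenceMatcher(None, t, w).ratio() for w in words) for t in terms]
    return sum(best) / len(best)


def search(query, documents, is_malicious, top_k=3, min_score=0.6):
    """Block flagged queries; otherwise return the best fuzzy matches."""
    if is_malicious(query):
        return [BLOCKED_MESSAGE]
    scored = [(fuzzy_score(query, doc), doc) for doc in documents]
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


if __name__ == "__main__":
    docs = [
        "This movie had great acting and a compelling storyline.",
        "A dull plot with flat characters.",
    ]
    # The misspelled query still finds the first review.
    print(search("grate acting", docs, is_malicious=lambda q: False))
    print(search("anything", docs, is_malicious=lambda q: True))
```

Comparing every query term against every word is quadratic and slow on thousands of reviews; it is used here only because it keeps the example dependency-free.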
|
| 87 |
+
## AI Model Used
|
| 88 |
+
The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face, a sentiment-analysis classifier, to detect malicious queries.
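
A hedged sketch of how this model can be called through the Hugging Face `pipeline` API. The README does not state the exact decision rule, so the one below (flag a query when the label is `NEGATIVE` and the score reaches a `threshold` of 0.9) is an assumption, as is the `build_classifier` helper and the `(flagged, score)` return shape. The classifier is passed in as an argument so the rule can be exercised without downloading the model.

```python
def build_classifier():
    """Load the sentiment pipeline named in this README (downloads the model)."""
    # Imported lazily: requires `transformers` plus a backend such as PyTorch.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


def is_query_malicious(query, classifier, threshold=0.9):
    """Assumed rule: flag a query classified NEGATIVE with high confidence.

    Returns a (flagged, score) tuple so the caller can report the confidence.
    """
    result = classifier(query)[0]  # a dict with "label" and "score" keys
    flagged = result["label"] == "NEGATIVE" and result["score"] >= threshold
    return flagged, result["score"]


if __name__ == "__main__":
    clf = build_classifier()
    for text in ["great acting and storyline", "DROP TABLE users; --"]:
        flagged, score = is_query_malicious(text, clf)
        print(f"{text!r}: flagged={flagged} confidence={score:.2f}")
```

Because the model was trained on movie-review sentiment rather than attack data, negative sentiment is only a rough proxy for malicious intent; benign but negative queries may be flagged and neutral-sounding attacks may pass.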
|
| 89 |
+
|
| 90 |
+
## Why Use AI for Malicious Query Detection?
|
| 91 |
+
Traditional pattern matching for detecting malicious queries is limited and can miss more sophisticated or novel attack patterns. By using a pre-trained NLP model, we can leverage the semantic understanding of the text, allowing the system to detect a wider range of harmful queries.
|
| 92 |
+
|
| 93 |
+
#### Improvements and Future Work
|
| 94 |
+
- **Custom Fine-Tuning**: The current version relies on a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
|
| 95 |
+
- **Integration with Vector Search (FAISS)**: For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
|
| 96 |
+
- **Real-Time Query Monitoring**: Adding a real-time monitoring system to detect and log malicious queries for further analysis.
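
As one possible starting point for the query-monitoring idea, blocked queries could be written to a log with the standard `logging` module. This is only a sketch: the logger name `query_monitor`, the `log_blocked_query` helper, and the message format are hypothetical and not part of the current script.

```python
import logging

logger = logging.getLogger("query_monitor")


def log_blocked_query(query, confidence):
    """Record a blocked query with its confidence for later analysis."""
    # %r writes the repr of the query, so embedded newlines cannot forge log lines.
    logger.warning("blocked query=%r confidence=%.2f", query, confidence)


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    log_blocked_query("' OR 1=1 --", 0.95)
```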
|
| 97 |
+
#### Contributing
|
| 98 |
+
Feel free to fork this repository and submit pull requests. Contributions are welcome!
|
| 99 |
+
|
| 100 |
+
#### License
|
| 101 |
+
This project is licensed under the MIT License - see the LICENSE file for details.
|
| 102 |
+
|
| 103 |
+
#### Contact
|
| 104 |
+
For any questions or issues, please contact the project maintainer:
|
| 105 |
+
|
| 106 |
+
- Name: Talex Maxim
|
| 107 |
+
- Email: taimax13@gmail.com
|
| 108 |
+
- GitHub: taimax13
|