
	---
	title: Chagu Demo
	emoji: 📊
	colorFrom: pink
	colorTo: purple
	sdk: streamlit
	sdk_version: 1.40.1
	app_file: app.py
	pinned: false
	license: mit
	short_description: 'this is demo for chain guard protocol, assistant, RAG '
	---

	# AI-Powered Document Search with Malicious Query Detection

	This project implements a document search engine with AI-based malicious query detection. Users can search through movie reviews (IMDB dataset) and additional `.txt` files, while a pre-trained NLP model identifies and blocks potentially malicious queries.

	## Features
	- Semantic Search: Uses fuzzy matching for normal queries, so inexact or partially matching phrasings still return relevant documents.
	- AI-Based Malicious Query Detection: Utilizes a pre-trained NLP model (`DistilBERT`) to detect queries with malicious intent, blocking potential SQL injection and other harmful queries.
	- Flexible Document Ingestion: Supports loading documents from the IMDB dataset and additional `.txt` files.
	- Efficient Path Handling: Automatically handles dataset paths using the `HOME` environment variable.

	## Technologies Used
	- Python 3.8+
	- Transformers: For NLP-based malicious query detection.
	- Hugging Face Pipeline: Uses the `distilbert-base-uncased-finetuned-sst-2-english` model for sentiment analysis.
	- Pathlib: For robust file and path handling.

	## Project Structure
	├── rag_chagu_demo.py   # Main script containing the DocumentSearcher class
	├── README.md           # This file
	└── data-sets/          # Moved to $HOME, not kept in the repository
	    ├── aclImdb/
	    │   └── train/
	    │       ├── pos/    # Positive movie reviews
	    │       └── neg/    # Negative movie reviews
	    └── txt-files/      # Additional .txt files for document search


	## Installation
	Make sure you have Python installed (version 3.8 or higher). Then, install the required dependencies:

	```bash
	pip install transformers
	```

	## Dataset Setup
	Place the IMDB dataset in the following structure:

	- `$HOME/data-sets/aclImdb/train/pos/`
	- `$HOME/data-sets/aclImdb/train/neg/`

	Optionally, place additional `.txt` files under `$HOME/data-sets/txt-files/`.
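
The layout above could be resolved and loaded roughly as follows. This is a minimal sketch, not the project's actual loader: `dataset_paths`, `load_documents`, and the `limit` parameter are hypothetical names introduced for illustration, and it assumes every document is a `.txt` file.

```python
from pathlib import Path
from typing import Dict, List, Optional


def dataset_paths(home: Optional[Path] = None) -> Dict[str, Path]:
    """Build the expected dataset directories under the user's home directory."""
    base = (home or Path.home()) / "data-sets"
    return {
        "pos": base / "aclImdb" / "train" / "pos",
        "neg": base / "aclImdb" / "train" / "neg",
        "txt": base / "txt-files",
    }


def load_documents(paths: Dict[str, Path], limit: Optional[int] = None) -> List[str]:
    """Read the .txt files found in the given directories, skipping missing ones."""
    documents: List[str] = []
    for directory in paths.values():
        if not directory.is_dir():
            continue  # e.g. the optional txt-files directory is absent
        for file_path in sorted(directory.glob("*.txt")):
            if limit is not None and len(documents) >= limit:
                return documents
            documents.append(file_path.read_text(encoding="utf-8", errors="ignore"))
    return documents
```

Calling `load_documents(dataset_paths())` would read from the current user's home directory; passing an explicit `home` makes the helper easy to point at a test fixture.
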

	## Usage
	Run the script with the following command:

	```bash
	python rag_chagu_demo.py
	```

	## Example Output
	```

	Looking for positive reviews in: /home/user/data-sets/aclImdb/train/pos
	Looking for negative reviews in: /home/user/data-sets/aclImdb/train/neg
	Loaded 5000 movie reviews from IMDB dataset.

	Normal Query Results:
	Document: This movie had great acting and a compelling storyline. The characters were well-developed...

	Malicious Query Detected - Confidence: 0.95
	Malicious Query Results:

	Document: ANOMALY: Query blocked due to detected malicious intent.

	```
	## How It Works
	- The script initializes the `DocumentSearcher` class, which loads the movie reviews and any additional `.txt` documents.
	- The `is_query_malicious()` method runs each query through a pre-trained sentiment-analysis model and uses the result to flag potentially malicious intent.
	- If a query is flagged as malicious, it is blocked and an anomaly message is returned.
	- For normal queries, a fuzzy search runs over the documents and returns the most relevant matches.
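
The flow above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `rag_chagu_demo.py`: the class name `DocumentSearcherSketch`, the injected `is_malicious` callable, the `top_k` and `min_score` parameters, and the use of the standard library's `difflib` for fuzzy matching are all choices made for this example.

```python
import difflib
from typing import Callable, List

ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


class DocumentSearcherSketch:
    """Illustrative stand-in for the project's DocumentSearcher class."""

    def __init__(self, documents: List[str], is_malicious: Callable[[str], bool]):
        self.documents = documents
        self.is_malicious = is_malicious  # injected detector (assumption)

    def _score(self, query: str, document: str) -> float:
        """Fraction of query words that closely match some word in the document."""
        doc_words = document.lower().split()
        query_words = query.lower().split()
        if not doc_words or not query_words:
            return 0.0
        hits = sum(
            1
            for word in query_words
            if difflib.get_close_matches(word, doc_words, n=1, cutoff=0.8)
        )
        return hits / len(query_words)

    def search(self, query: str, top_k: int = 3, min_score: float = 0.5) -> List[str]:
        """Block flagged queries; otherwise return the best fuzzy matches."""
        if self.is_malicious(query):
            return [ANOMALY_MESSAGE]
        scored = [(self._score(query, doc), doc) for doc in self.documents]
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]
```

Injecting the detector keeps the search logic testable without downloading a model; in the real script the check is performed by `is_query_malicious()`.
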

	## AI Model Used
	The project uses the DistilBERT model (`distilbert-base-uncased-finetuned-sst-2-english`) from Hugging Face for detecting malicious queries based on sentiment analysis.
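
As a rough sketch of how the sentiment output might be turned into a block/allow decision: the helper names `build_classifier` and `interpret_sentiment`, the `0.9` threshold, and the rule that a confident `NEGATIVE` label counts as malicious are assumptions for illustration, not confirmed details of the project.

```python
from typing import Dict, List, Tuple

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def build_classifier():
    """Load the sentiment-analysis pipeline (downloads the model on first use)."""
    # Imported lazily so interpret_sentiment() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME)


def interpret_sentiment(result: List[Dict], threshold: float = 0.9) -> Tuple[bool, float]:
    """Map a pipeline result such as [{'label': 'NEGATIVE', 'score': 0.95}] to a flag.

    Assumption: a NEGATIVE label at or above the threshold is treated as malicious.
    """
    top = result[0]
    score = float(top["score"])
    flagged = top["label"] == "NEGATIVE" and score >= threshold
    return flagged, score
```

With the model available, `interpret_sentiment(build_classifier()("some query"))` would return the flag together with the confidence value, similar to the confidence shown in the example output.
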

	## Why Use AI for Malicious Query Detection?
	Traditional pattern matching is limited: it only catches the attack strings it was written for and can miss more sophisticated or novel attack patterns. A pre-trained NLP model works from the semantics of the whole query text rather than fixed strings, allowing the system to detect a wider range of harmful queries.
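
To make the limitation of pattern matching concrete, here is a small sketch of a regex blocklist. The patterns and the name `matches_blocklist` are hypothetical and are not part of this project; they only show how a trivially obfuscated query can slip past fixed patterns.

```python
import re

# Hypothetical blocklist of classic SQL injection fragments
BLOCKLIST = [
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),
    re.compile(r"\bor\s+1\s*=\s*1\b", re.IGNORECASE),
    re.compile(r";\s*drop\s+table\b", re.IGNORECASE),
]


def matches_blocklist(query: str) -> bool:
    """Return True if any blocklist pattern occurs in the query."""
    return any(pattern.search(query) for pattern in BLOCKLIST)


# Caught: the literal fragment is present
caught = matches_blocklist("1 UNION SELECT password FROM users")

# Missed: an inline SQL comment replaces the whitespace the pattern expects
missed = matches_blocklist("1 UNION/**/SELECT password FROM users")
```

Each new obfuscation needs another pattern, which is the maintenance burden the model-based approach is meant to reduce.
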

	#### Improvements and Future Work
	- Custom Fine-Tuning: The project currently relies on a pre-trained sentiment analysis model. In future versions, a custom model fine-tuned on a dataset of malicious queries could provide better results.
	- Integration with Vector Search (FAISS): For larger datasets, integrating a vector search engine like FAISS could speed up document retrieval.
	- Real-Time Query Monitoring: Adding a real-time monitoring system to detect and log malicious queries for further analysis.
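
As one possible starting point for the query monitoring idea, blocked queries could be written to a dedicated logger. This is only a sketch: the logger name `chagu.query_monitor`, the function `monitored_search`, and the shape of the `detect` and `search` callables are assumptions, not existing parts of the project.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("chagu.query_monitor")  # hypothetical logger name

ANOMALY_MESSAGE = "ANOMALY: Query blocked due to detected malicious intent."


def monitored_search(
    query: str,
    detect: Callable[[str], Tuple[bool, float]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run detection first, log the decision, then search only if allowed."""
    flagged, confidence = detect(query)
    if flagged:
        logger.warning("Blocked query %r (confidence %.2f)", query, confidence)
        return [ANOMALY_MESSAGE]
    logger.info("Allowed query %r", query)
    return search(query)
```

Attaching a file or network handler to this logger would give a persistent record of blocked queries for later analysis.
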

	#### Contributing
	Feel free to fork this repository and submit pull requests. Contributions are welcome!

	#### License
	This project is licensed under the MIT License - see the LICENSE file for details.

	#### Contact
	For any questions or issues, please contact the project maintainer:

	- Name: Talex Maxim
	- Email: taimax13@gmail.com
	- GitHub: taimax13