	# RAG Demo: AI-Powered Document Search with Generative Response
	This project showcases a Retrieval-Augmented Generation (RAG) implementation that uses
	SentenceTransformer for semantic search and GPT-2 (or a similar generative model)
	for response generation. The system retrieves the documents most relevant to a query
	from a collection of text files and generates an answer based on them.

	## Project Overview
	The Chagu RAG Demo addresses efficient document retrieval and generates contextual
	responses using generative AI. It supports secure document search and adds protection
	against malicious queries by screening them before retrieval. The project is built with the following goals:

	- Semantic Search: Retrieve the most relevant documents based on user queries using embeddings.
	- Generative AI Response: Generate a coherent and context-aware answer using a pre-trained text generation model.
	- Anomaly Detection: Detect potentially harmful queries (e.g., SQL injections) and block them.

	## Features
	- Embedding-based Document Ingestion: Efficiently process and store text document embeddings in a local SQLite database.
	- Semantic Search: Uses cosine similarity with SentenceTransformer embeddings for accurate information retrieval.
	- Text Generation: Leverages GPT-2 or distilgpt2 for generating responses based on the retrieved context.
	- Security: Includes basic query validation to prevent malicious input (e.g., SQL injection detection).
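
The basic query validation listed above could look something like the following sketch. The pattern list and the function name `is_suspicious_query` are assumptions made for illustration, not the project's actual rules. A handful of regular expressions like this catches only the most obvious injection attempts and will produce both false positives and false negatives.

```python
import re

# Illustrative patterns only (assumption): they flag a few classic
# SQL-injection shapes and are not the project's actual rule set.
SUSPICIOUS_PATTERNS = [
    re.compile(r"('|\")\s*or\s+\d+\s*=\s*\d+", re.IGNORECASE),  # ' OR 1=1
    re.compile(r";\s*(drop|delete|insert|update|alter)\s", re.IGNORECASE),  # ; DROP ...
    re.compile(r"\bunion\s+(all\s+)?select\b", re.IGNORECASE),  # UNION SELECT
    re.compile(r"--\s*$"),  # trailing SQL comment marker
]


def is_suspicious_query(query: str) -> bool:
    """Return True if the query matches any of the illustrative patterns."""
    return any(pattern.search(query) for pattern in SUSPICIOUS_PATTERNS)
```

A caller would run this check before embedding the query and refuse to continue when it returns `True`. An ordinary question that merely mentions SQL injection does not match any of these patterns.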

	## Technologies Used
	- SentenceTransformer: Generates semantic embeddings of text documents.
	- Transformers: Provides the generative model, e.g. distilgpt2 (many alternatives are listed at https://huggingface.co/models?sort=trending&search=distilgpt2).
	- SQLite: A lightweight database for storing embeddings and document content.
	- Scikit-learn: Used for calculating cosine similarity.
	- NumPy: Efficient numerical operations.
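
To make the cosine-similarity step concrete, here is a minimal dependency-free sketch of the computation and of ranking documents by it. It is not the project's code: the project uses scikit-learn for this, and the function names and toy vectors below are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_documents([1.0, 0.1, 0.0], docs)
print(ranking[0][0])  # index of the best-matching document
```

With real embeddings, `sklearn.metrics.pairwise.cosine_similarity` computes the same quantity for whole matrices at once.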

	## Installation

	Clone the Repository:

	```bash
	git clone https://github.com/yourusername/chagu-rag-demo.git
	cd chagu-rag-demo
	```
	Create a Virtual Environment:

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	```
	Install Dependencies:

	```bash
	pip install -r requirements.txt
	```
	Authenticate with Hugging Face (if needed):

	```bash
	huggingface-cli login
	```

	## Setup and Dataset
	Download and Prepare the Dataset:

	You can use the IMDB Movie Reviews dataset or any other text files.
	Place your .txt files in the documents/ directory or specify a custom path.

	Ingest Files:

	The script will process all .txt files in the specified directory and store embeddings in a local SQLite database.
	```bash
	python embededGeneratorRAG.py
	```
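
The source does not show what the ingestion script does internally. As a rough sketch of the idea, the code below reads each .txt file, turns it into a vector, and stores the text and vector in SQLite. The table schema, the function names, and the `toy_embed` placeholder are all assumptions. A real run would presumably use a SentenceTransformer model instead, for example `SentenceTransformer("all-MiniLM-L6-v2").encode(text)`.

```python
import sqlite3
from array import array
from pathlib import Path


def toy_embed(text):
    """Placeholder embedder (assumption): a 26-dim letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def ingest_files(directory, db_path, embed=toy_embed):
    """Embed every .txt file in `directory` and store it in SQLite."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema; the project's actual table layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "filename TEXT PRIMARY KEY, content TEXT, embedding BLOB)"
    )
    count = 0
    for path in sorted(Path(directory).glob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="ignore")
        # Serialize the vector as raw float32 bytes for the BLOB column.
        blob = array("f", embed(content)).tobytes()
        # Parameterized query: values are never spliced into the SQL text.
        conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
            (path.name, content, blob),
        )
        count += 1
    conn.commit()
    conn.close()
    return count


def load_embedding(blob):
    """Inverse of the serialization used in ingest_files."""
    vec = array("f")
    vec.frombytes(blob)
    return list(vec)
```

Calling `ingest_files("documents", "embeddings.db")` returns the number of files stored. Re-running it overwrites existing rows rather than duplicating them, because the filename is the primary key.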

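	## Usage
	### Ingest Documents
	Ingest .txt files from the documents/ directory:

	```python
	from embededGeneratorRAG import EmbeddingGenerator  # module name taken from the file structure

	embedding_generator = EmbeddingGenerator()
	embedding_generator.ingest_files("documents")
	```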

	### Perform a Search Query
	Run a semantic search query and generate a response:

	```python
	query = "How can I secure my database against SQL injection?"
	response = embedding_generator.find_most_similar_and_generate(query)
	print("Generated Response:")
	print(response)
	```
	### Example Output
	```text
	Generated Response:
	To prevent SQL injection, you should use prepared statements and parameterized queries.
	Avoid constructing SQL queries directly using user input.
	```
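
The body of `find_most_similar_and_generate` is not shown in this README. One plausible shape is: embed the query, pick the stored document with the highest cosine similarity, and pass that document plus the query to the generative model as a prompt. The sketch below follows that outline. The function name `answer_query`, the prompt template, and the `embed` and `generate` callables are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_query(query, documents, embed, generate):
    """Retrieve the closest document, then generate a response from it.

    documents: list of (text, embedding) pairs
    embed:     callable mapping text to a vector
    generate:  callable mapping a prompt string to generated text
    """
    query_vec = embed(query)
    best_text, _ = max(documents, key=lambda doc: cosine(query_vec, doc[1]))
    # Hypothetical prompt template; the project's wording may differ.
    prompt = f"Context: {best_text}\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```

With Hugging Face Transformers, `generate` could wrap a text-generation pipeline, for instance `generator = pipeline("text-generation", model="distilgpt2")` followed by `lambda p: generator(p, max_new_tokens=50)[0]["generated_text"]`. Passing the two steps in as callables keeps retrieval testable without downloading a model.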
	## File Structure
	```text
	chagu-rag-demo/
	├── embeddings.db           # SQLite database for storing embeddings
	├── documents/              # Directory containing .txt files for ingestion
	├── rag_chagu_demo.py       # Main script with RAG implementation
	├── embededGeneratorRAG.py  # Core Embedding Generator class
	├── requirements.txt        # Python dependencies
	└── README.md               # Project documentation
	```

	## Configuration
	You can update the following configurations in the EmbeddingGenerator class:

	- Model Names: Change model_name or gen_model to use different embedding or generative models.
	- Database Path: Specify a custom path for the SQLite database.

	```python
	embedding_generator = EmbeddingGenerator(model_name="all-MiniLM-L6-v2", gen_model="distilgpt2", db_path="custom_embeddings.db")
	```

	## Potential Improvements
	- FAISS Integration for Scalability: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search.
	- Enhanced Security: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
	- Deployment on Hugging Face Spaces: Create an interactive demo using Streamlit or Gradio for showcasing the project on Hugging Face Spaces.

	## Known Issues
	- Input Truncation Warning: If the input text is too long, you may see a warning about truncation. The code passes truncation=True, so over-long input is cut to the model's maximum length; for very long queries this can drop part of the input.
	- Model Availability: Make sure the model you use is publicly available on Hugging Face. If you encounter a 404 Not Found error, check the model identifier.
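
GPT-2 and distilgpt2 have a fixed context window of 1024 tokens, so one way to soften the truncation issue noted above is to shorten the retrieved context yourself and leave the query intact. The sketch below is an illustration under assumptions, not the project's behavior: the function name `trim_context` and the default budget are hypothetical, and whitespace-separated words are used as a rough stand-in for tokens.

```python
def trim_context(context, query, max_words=700):
    """Shorten `context` so that context plus query fits in `max_words`.

    Words are split on whitespace as a rough proxy for model tokens
    (assumption); a real implementation would count tokenizer tokens.
    """
    query_words = query.split()
    budget = max(max_words - len(query_words), 0)
    context_words = context.split()
    if len(context_words) <= budget:
        return context  # already short enough, return unchanged
    return " ".join(context_words[:budget])
```

Trimming the context rather than the whole prompt means the question always reaches the model, at the cost of losing the tail of long documents.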

	## Contributing
	Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.

	1. Fork the repository.
	2. Create a new feature branch.
	3. Submit your changes via a pull request.

	## License
	This project is licensed under the MIT License - see the LICENSE file for details.

	## Acknowledgments
	- Hugging Face for the amazing models and NLP tools.
	- Scikit-learn for efficient similarity computation.
	- SQLite for providing a lightweight database solution.