| ### RAG Demo: AI-Powered Document Search with Generative Response | |
| This project showcases a Retrieval-Augmented Generation (RAG) implementation using | |
| SentenceTransformer for semantic search and GPT-2 (or a similar generative model) | |
| for response generation. The system combines the power of semantic search with AI-driven text generation, | |
| providing relevant answers based on a collection of text documents. | |
| ## Project Overview | |
| The Chagu RAG Demo aims to solve the problem of efficient document retrieval and provide contextual | |
| responses using Generative AI. It supports secure document search and offers additional protection | |
| against malicious queries using semantic analysis. The project is built with the following goals: | |
| - Semantic Search: Retrieve the most relevant documents based on user queries using embeddings. | |
| - Generative AI Response: Generate a coherent, context-aware answer using a pre-trained text generation model. | |
| - Anomaly Detection: Detect potentially harmful queries (e.g., SQL injections) and block them. | |
| ### Features | |
| - Embedding-based Document Ingestion: Efficiently process and store text document embeddings in a local SQLite database. | |
| - Semantic Search: Uses cosine similarity with SentenceTransformer embeddings for accurate information retrieval. | |
| - Text Generation: Leverages GPT-2 or distilgpt2 for generating responses based on the retrieved context. | |
| - Security: Includes basic query validation to prevent malicious input (e.g., SQL injection detection). | |
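As an illustration of the basic query validation listed above, the following is a minimal sketch that flags queries matching a few injection-style patterns. The pattern list and the `is_suspicious` function are hypothetical and are not taken from the project's code; a real deployment would likely need a broader rule set.

```python
import re

# Hypothetical patterns for illustration; the project's actual rules
# are not shown in this README.
SUSPICIOUS_PATTERNS = [
    r"\b(union\s+select|drop\s+table|insert\s+into|delete\s+from)\b",
    r"\bor\s+1\s*=\s*1\b",
    r"--|;\s*$",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS]


def is_suspicious(query: str) -> bool:
    """Return True if the query matches any injection-style pattern."""
    return any(p.search(query) for p in _COMPILED)
```

Pattern matching of this kind can produce false positives (for example, a legitimate question that contains `--`), so it is best treated as a first-pass filter in front of the search step.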
| ## Technologies Used | |
| - SentenceTransformer: For generating semantic embeddings of text documents. | |
| - Transformers: Provides the generative model (e.g., distilgpt2; a wide range of variants is listed at https://huggingface.co/models?sort=trending&search=distilgpt2). | |
| - SQLite: A lightweight database for storing embeddings and document content. | |
| - Scikit-learn: Used for calculating cosine similarity. | |
| - NumPy: Efficient numerical operations. | |
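To make the retrieval step concrete, here is a dependency-free sketch of cosine-similarity ranking. The project itself relies on scikit-learn for this computation; the helper names `cosine_similarity` and `rank_documents` below, and the idea of passing embeddings as plain lists, are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=1):
    """Return the top_k (doc_id, score) pairs, highest similarity first.

    doc_vecs maps a document id to its embedding vector.
    """
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With real embeddings, `query_vec` would be the encoded user query and `doc_vecs` the stored document embeddings; the highest-scoring documents then serve as context for the generative model.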
| ## Installation | |
| Clone the Repository: | |
| ```bash | |
| git clone https://github.com/yourusername/chagu-rag-demo.git | |
| cd chagu-rag-demo | |
| ``` | |
| Create a Virtual Environment: | |
| ```bash | |
| python3 -m venv .venv | |
| source .venv/bin/activate | |
| ``` | |
| Install Dependencies: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| Authenticate with Hugging Face (if needed): | |
| ```bash | |
| huggingface-cli login | |
| ``` | |
| ## Setup and Dataset | |
| Download and Prepare the Dataset: | |
| You can use the IMDB Movie Reviews dataset or any other text files. | |
| Place your .txt files in the documents/ directory or specify a custom path. | |
| Ingest Files: | |
| The script will process all .txt files in the specified directory and store embeddings in a local SQLite database. | |
| ```bash | |
| python embededGeneratorRAG.py | |
| ``` | |
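The storage side of ingestion can be sketched as follows, assuming each embedding is serialized to a BLOB column. The table name `documents`, its columns, and the helper functions are hypothetical; the actual schema used by `embededGeneratorRAG.py` is not shown in this README.

```python
import sqlite3
from array import array


def init_db(conn):
    """Create the (hypothetical) documents table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, filename TEXT, content TEXT, embedding BLOB)"
    )


def store_document(conn, filename, content, embedding):
    """Store one document with its embedding packed as 64-bit floats."""
    blob = array("d", embedding).tobytes()
    # Parameterized query: values are never interpolated into the SQL string.
    conn.execute(
        "INSERT INTO documents (filename, content, embedding) VALUES (?, ?, ?)",
        (filename, content, blob),
    )
    conn.commit()


def load_embeddings(conn):
    """Return a dict mapping filename to its embedding as a list of floats."""
    result = {}
    for filename, blob in conn.execute(
        "SELECT filename, embedding FROM documents"
    ):
        vec = array("d")
        vec.frombytes(blob)
        result[filename] = vec.tolist()
    return result
```

Passing `sqlite3.connect(":memory:")` as `conn` is a convenient way to try this out without creating a database file.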
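| ## Usage | |
| ### Ingest Documents | |
| Ingest .txt files from the documents/ directory: | |
| ```python | |
| # EmbeddingGenerator is defined in embededGeneratorRAG.py (see File Structure). | |
| from embededGeneratorRAG import EmbeddingGenerator | |
| embedding_generator = EmbeddingGenerator() | |
| embedding_generator.ingest_files("documents") | |
| ``` | |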
| ### Perform a Search Query | |
| Run a semantic search query and generate a response: | |
| ```python | |
| query = "How can I secure my database against SQL injection?" | |
| response = embedding_generator.find_most_similar_and_generate(query) | |
| print("Generated Response:") | |
| print(response) | |
| ``` | |
| ### Example Output | |
| ```text | |
| Generated Response: | |
| To prevent SQL injection, you should use prepared statements and parameterized queries. | |
| Avoid constructing SQL queries directly using user input. | |
| ``` | |
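How the retrieved documents reach the generative model is not detailed here. One plausible approach, sketched under that assumption, is to join the retrieved text into a prompt and cap its length so it fits the model's input limit. The `build_prompt` function, its prompt layout, and the character-based cap are hypothetical and do not describe the internals of `find_most_similar_and_generate`.

```python
def build_prompt(query, context_docs, max_context_chars=1000):
    """Combine retrieved documents and the user query into one prompt.

    The context is cut to max_context_chars so that long documents do not
    crowd out the question (a rough stand-in for token-level truncation).
    """
    context = "\n\n".join(context_docs)[:max_context_chars]
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The resulting string would then be passed to the text generation model, which continues the text after `Answer:`.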
| ## File Structure | |
| ```text | |
| chagu-rag-demo/ | |
| ├── embeddings.db            # SQLite database for storing embeddings | |
| ├── documents/               # Directory containing .txt files for ingestion | |
| ├── rag_chagu_demo.py        # Main script with RAG implementation | |
| ├── embededGeneratorRAG.py   # Core Embedding Generator class | |
| ├── requirements.txt         # Python dependencies | |
| └── README.md                # Project documentation | |
| ``` | |
| ## Configuration | |
| You can update the following configurations in the EmbeddingGenerator class: | |
| Model Names: Change model_name or gen_model to use different embedding or generative models. | |
| Database Path: Specify a custom path for the SQLite database. | |
| ```python | |
| embedding_generator = EmbeddingGenerator(model_name="all-MiniLM-L6-v2", gen_model="distilgpt2", db_path="custom_embeddings.db") | |
| ``` | |
| ### Potential Improvements | |
| - FAISS Integration for Scalability: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search. | |
| - Enhanced Security: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs. | |
| - Deployment on Hugging Face Spaces: Create an interactive demo using Streamlit or Gradio for showcasing the project on Hugging Face Spaces. | |
| ## Known Issues | |
| - Input Truncation Warning: If the input text is too long, you may see a truncation warning. The input is passed with truncation=True, so text beyond the model's maximum input length is cut off, which may drop part of a very long query. | |
| - Model Availability: Ensure you are using a publicly available model from Hugging Face. If you encounter a 404 Not Found error, check the model identifier. | |
| ## Contributing | |
| Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project. | |
| 1. Fork the repository. | |
| 2. Create a new feature branch. | |
| 3. Submit your changes via a pull request. | |
| ## License | |
| This project is licensed under the MIT License - see the LICENSE file for details. | |
| ## Acknowledgments | |
| - Hugging Face for the amazing models and NLP tools. | |
| - Scikit-learn for efficient similarity computation. | |
| - SQLite for providing a lightweight database solution. | |