Spaces: vikee/chagu-dev

talexm committed on Nov 16, 2024

Commit

fdc732d

1 Parent(s): aeb8626

adding LLM for RAG

Files changed (4)

falocon_api/README.md +146 -0
falocon_api/__init__.py +0 -0
falocon_api/embeddingGenerator.py +101 -0
falocon_api/embededGeneratorRAG.py +116 -0

falocon_api/README.md ADDED

	@@ -0,0 +1,146 @@

+# RAG Demo: AI-Powered Document Search with Generative Response
+This project showcases a Retrieval-Augmented Generation (RAG) implementation using
+SentenceTransformer for semantic search and GPT-2 (or a similar generative model)
+for response generation. The system combines the power of semantic search with AI-driven text generation,
+providing relevant answers based on a collection of text documents.
+## Project Overview
+The Chagu RAG Demo aims to solve the problem of efficient document retrieval and provide contextual
+responses using Generative AI. It supports secure document search and offers additional protection
+against malicious queries using semantic analysis. The project is built with the following goals:
+- Semantic Search: Retrieve the most relevant documents based on user queries using embeddings.
+- Generative AI Response: Generate a coherent and context-aware answer using a pre-trained text generation model.
+- Anomaly Detection: Detect potentially harmful queries (e.g., SQL injections) and block them.
+## Features
+- Embedding-based Document Ingestion: Efficiently process and store text document embeddings in a local SQLite database.
+- Semantic Search: Uses cosine similarity with SentenceTransformer embeddings for accurate information retrieval.
+- Text Generation: Leverages GPT-2 or distilgpt2 for generating responses based on the retrieved context.
+- Security: Includes basic query validation to prevent malicious input (e.g., SQL injection detection).
+## Technologies Used
+- SentenceTransformer: For generating semantic embeddings of text documents.
+- Transformers: Provides the generative model (e.g., distilgpt2; a wide range of models is listed at https://huggingface.co/models?sort=trending&search=distilgpt2).
+- SQLite: A lightweight database for storing embeddings and document content.
+- Scikit-learn: Used for calculating cosine similarity.
+- NumPy: Efficient numerical operations.
+## Installation
+Clone the Repository:
+```bash
+git clone https://github.com/yourusername/chagu-rag-demo.git
+cd chagu-rag-demo
+```
+Create a Virtual Environment:
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+```
+Install Dependencies:
+```bash
+pip install -r requirements.txt
+```
+Authenticate with Hugging Face (if needed):
+```bash
+huggingface-cli login
+```
+## Setup and Dataset
+Download and Prepare the Dataset:
+You can use the IMDB Movie Reviews dataset or any other text files.
+Place your .txt files in the documents/ directory or specify a custom path.
+
+Ingest Files:
+The script will process all .txt files in the specified directory and store embeddings in a local SQLite database.
+```bash
+python embededGeneratorRAG.py
+```
+## Usage
+### Ingest Documents
+Ingest .txt files from the documents/ directory:
+```python
+# Assumes the script is run from the directory that contains embededGeneratorRAG.py
+from embededGeneratorRAG import EmbeddingGenerator
+
+embedding_generator = EmbeddingGenerator()
+embedding_generator.ingest_files("documents")
+```
+### Perform a Search Query
+Run a semantic search query and generate a response:
+```python
+query = "How can I secure my database against SQL injection?"
+response = embedding_generator.find_most_similar_and_generate(query)
+print("Generated Response:")
+print(response)
+```
+### Example Output
+```text
+Generated Response:
+To prevent SQL injection, you should use prepared statements and parameterized queries.
+Avoid constructing SQL queries directly using user input.
+```
+## File Structure
+```text
+chagu-rag-demo/
+├── embeddings.db             # SQLite database for storing embeddings
+├── documents/                # Directory containing .txt files for ingestion
+├── rag_chagu_demo.py         # Main script with RAG implementation
+├── embededGeneratorRAG.py    # Core Embedding Generator class
+├── requirements.txt          # Python dependencies
+└── README.md                 # Project documentation
+```
+## Configuration
+You can update the following configurations in the EmbeddingGenerator class:
+- Model Names: Change model_name or gen_model to use different embedding or generative models.
+- Database Path: Specify a custom path for the SQLite database with db_path.
+```python
+embedding_generator = EmbeddingGenerator(model_name="all-MiniLM-L6-v2", gen_model="distilgpt2", db_path="custom_embeddings.db")
+```
+## Potential Improvements
+- FAISS Integration for Scalability: Replace the current SQLite-based retrieval with FAISS for efficient and scalable vector search.
+- Enhanced Security: Implement more robust query validation using a fine-tuned BERT model to detect harmful or suspicious inputs.
+- Deployment on Hugging Face Spaces: Create an interactive demo using Streamlit or Gradio for showcasing the project on Hugging Face Spaces.
+## Known Issues
+- Input Truncation Warning: If the input text is too long, you may see a warning about truncation. This is handled using truncation=True, but very long inputs lose the text that is cut off.
+- Model Availability: Ensure you are using a publicly available model from Hugging Face. If you encounter a 404 Not Found error, check the model identifier.
+## Contributing
+Contributions are welcome! Please open an issue or submit a pull request if you would like to improve the project.
+1. Fork the repository.
+2. Create a new feature branch.
+3. Submit your changes via a pull request.
+## License
+This project is licensed under the MIT License. See the LICENSE file for details.
+## Acknowledgments
+- Hugging Face for the amazing models and NLP tools.
+- Scikit-learn for efficient similarity computation.
+- SQLite for providing a lightweight database solution.
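The feature list mentions basic query validation (for example SQL injection detection), but neither module added in this commit contains such a check. Below is a minimal sketch of one possible approach, assuming a small regex deny-list that runs before the query is embedded. The function name `is_suspicious_query` and the patterns are hypothetical, not taken from the project, and a deny-list like this will both miss attacks and flag some harmless text.

```python
import re

# Hypothetical deny-list of fragments that often show up in SQL injection attempts.
# Illustrative only; these patterns are assumptions, not the project's rules.
_SUSPICIOUS_PATTERNS = [
    re.compile(r"['\"]\s*(?:or|and)\s+\d+\s*=\s*\d+", re.IGNORECASE),       # ' OR 1=1
    re.compile(r";\s*(?:drop|delete|insert|update|alter)\b", re.IGNORECASE),  # ; DROP ...
    re.compile(r"\bunion\s+(?:all\s+)?select\b", re.IGNORECASE),             # UNION SELECT
    re.compile(r"--|/\*"),                                                    # SQL comment markers
]


def is_suspicious_query(query: str) -> bool:
    """Return True if the query matches any deny-list pattern."""
    return any(pattern.search(query) for pattern in _SUSPICIOUS_PATTERNS)


if __name__ == "__main__":
    for q in ["How can I secure my database against SQL injection?",
              "admin' OR 1=1 --"]:
        print(q, "->", "blocked" if is_suspicious_query(q) else "allowed")
```

A caller could run this check at the top of `find_most_similar_and_generate` and return early when it fires. In the modules shown here the user query is only embedded and never interpolated into SQL, so a check like this acts more as anomaly flagging than as injection prevention.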

falocon_api/__init__.py ADDED

File without changes

falocon_api/embeddingGenerator.py ADDED

	@@ -0,0 +1,101 @@

+import os
+import sqlite3
+import numpy as np
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+from typing import List, Dict
+class EmbeddingGenerator:
+    def __init__(self, model_name: str = "all-MiniLM-L6-v2", db_path: str = "embeddings.db"):
+        self.model = SentenceTransformer(model_name)
+        self.db_path = db_path
+        self._initialize_db()
+        print(f"Loaded embedding model: {model_name}")
+    def _initialize_db(self):
+        # Connect to SQLite database and create table
+        self.conn = sqlite3.connect(self.db_path)
+        self.cursor = self.conn.cursor()
+        self.cursor.execute("""
+            CREATE TABLE IF NOT EXISTS embeddings (
+                filename TEXT PRIMARY KEY,
+                content TEXT,
+                embedding BLOB
+            )
+        """)
+        self.conn.commit()
+    def generate_embedding(self, text: str) -> np.ndarray:
+        try:
+            embedding = self.model.encode(text, convert_to_numpy=True)
+            return embedding
+        except Exception as e:
+            print(f"Error generating embedding: {str(e)}")
+            return np.array([])
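+    def ingest_files(self, directory: str):
+        # Embed and store every .txt file found directly inside `directory` (non-recursive).
+        for filename in os.listdir(directory):
+            if filename.endswith(".txt"):
+                file_path = os.path.join(directory, filename)
+                # Read as UTF-8 explicitly so behavior does not depend on the platform default.
+                with open(file_path, 'r', encoding='utf-8') as f:
+                    content = f.read()
+                embedding = self.generate_embedding(content)
+                # generate_embedding returns an empty array on failure; skip instead of storing it.
+                if embedding.size == 0:
+                    continue
+                self._store_embedding(filename, content, embedding)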
+    def _store_embedding(self, filename: str, content: str, embedding: np.ndarray):
+        try:
+            self.cursor.execute("INSERT OR REPLACE INTO embeddings (filename, content, embedding) VALUES (?, ?, ?)",
+                                (filename, content, embedding.tobytes()))
+            self.conn.commit()
+        except Exception as e:
+            print(f"Error storing embedding: {str(e)}")
+    def load_embeddings(self) -> List[Dict]:
+        self.cursor.execute("SELECT filename, content, embedding FROM embeddings")
+        rows = self.cursor.fetchall()
+        documents = []
+        for filename, content, embedding_blob in rows:
+            embedding = np.frombuffer(embedding_blob, dtype=np.float32)
+            documents.append({"filename": filename, "content": content, "embedding": embedding})
+        return documents
+    def compute_similarity(self, query_embedding: np.ndarray, document_embeddings: List[np.ndarray]) -> List[float]:
+        try:
+            similarities = cosine_similarity([query_embedding], document_embeddings)[0]
+            return similarities.tolist()
+        except Exception as e:
+            print(f"Error computing similarity: {str(e)}")
+            return []
+    def find_most_similar(self, query: str, top_k: int = 5) -> List[Dict]:
+        query_embedding = self.generate_embedding(query)
+        documents = self.load_embeddings()
+        if query_embedding.size == 0 or len(documents) == 0:
+            print("Error: Invalid embeddings or no documents found.")
+            return []
+        document_embeddings = [doc["embedding"] for doc in documents]
+        similarities = self.compute_similarity(query_embedding, document_embeddings)
+        ranked_results = sorted(
+            [{"filename": doc["filename"], "content": doc["content"][:100], "similarity": sim}
+             for doc, sim in zip(documents, similarities)],
+            key=lambda x: x["similarity"],
+            reverse=True
+        )
+        return ranked_results[:top_k]
+# Example Usage
+if __name__ == "__main__":
+    # Initialize the embedding generator and ingest .txt files from the IMDB training directory
+    embedding_generator = EmbeddingGenerator()
+    embedding_generator.ingest_files(os.path.expanduser("~/data-sets/aclImdb/train/"))
+    # Perform a search query
+    query = "What can be used for document search?"
+    results = embedding_generator.find_most_similar(query, top_k=3)
+    print("Search Results:")
+    for result in results:
+        print(f"Filename: {result['filename']}, Similarity: {result['similarity']:.4f}")
+        print(f"Snippet: {result['content']}\n")
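The example above points `ingest_files` at `~/data-sets/aclImdb/train/`, and `ingest_files` only looks at files directly inside the directory it is given. Assuming the standard layout of the IMDB dataset, where the review files sit in subdirectories such as `pos/` and `neg/`, a recursive walk would be needed to reach them. A minimal sketch follows; `iter_text_files` is a hypothetical helper, not part of this commit.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_text_files(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (relative_path, content) for every .txt file under root, recursively."""
    root_path = Path(root).expanduser()
    # sorted() makes the ingestion order deterministic across runs.
    for path in sorted(root_path.rglob("*.txt")):
        if not path.is_file():
            continue
        # errors="replace" keeps ingestion going if a file is not valid UTF-8.
        content = path.read_text(encoding="utf-8", errors="replace")
        # The path relative to root (e.g. "pos/1.txt") serves as a unique key.
        yield path.relative_to(root_path).as_posix(), content
```

Each yielded pair could then be passed to `generate_embedding` and `_store_embedding`, with the relative path used as the `filename` key so that same-named files in different subdirectories do not overwrite each other.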

falocon_api/embededGeneratorRAG.py ADDED

	@@ -0,0 +1,116 @@

+import os
+import sqlite3
+import numpy as np
+from sentence_transformers import SentenceTransformer
+from sklearn.metrics.pairwise import cosine_similarity
+from transformers import pipeline
+from typing import List, Dict
+class EmbeddingGenerator:
+    def __init__(self, model_name: str = "all-MiniLM-L6-v2", gen_model: str = "distilgpt2", db_path: str = "embeddings.db"):
+        self.model = SentenceTransformer(model_name)
+        self.generator = pipeline("text-generation", model=gen_model)
+        self.db_path = db_path
+        self._initialize_db()
+        print(f"Loaded embedding model: {model_name}")
+        print(f"Loaded generative model: {gen_model}")
+    def _initialize_db(self):
+        # Connect to SQLite database and create table
+        self.conn = sqlite3.connect(self.db_path)
+        self.cursor = self.conn.cursor()
+        self.cursor.execute("""
+            CREATE TABLE IF NOT EXISTS embeddings (
+                filename TEXT PRIMARY KEY,
+                content TEXT,
+                embedding BLOB
+            )
+        """)
+        self.conn.commit()
+    def generate_embedding(self, text: str) -> np.ndarray:
+        try:
+            embedding = self.model.encode(text, convert_to_numpy=True)
+            return embedding
+        except Exception as e:
+            print(f"Error generating embedding: {str(e)}")
+            return np.array([])
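+    def ingest_files(self, directory: str):
+        # Embed and store every .txt file found directly inside `directory` (non-recursive).
+        for filename in os.listdir(directory):
+            if filename.endswith(".txt"):
+                file_path = os.path.join(directory, filename)
+                # Read as UTF-8 explicitly so behavior does not depend on the platform default.
+                with open(file_path, 'r', encoding='utf-8') as f:
+                    content = f.read()
+                embedding = self.generate_embedding(content)
+                # generate_embedding returns an empty array on failure; skip instead of storing it.
+                if embedding.size == 0:
+                    continue
+                self._store_embedding(filename, content, embedding)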
+    def _store_embedding(self, filename: str, content: str, embedding: np.ndarray):
+        try:
+            self.cursor.execute("INSERT OR REPLACE INTO embeddings (filename, content, embedding) VALUES (?, ?, ?)",
+                                (filename, content, embedding.tobytes()))
+            self.conn.commit()
+        except Exception as e:
+            print(f"Error storing embedding: {str(e)}")
+    def load_embeddings(self) -> List[Dict]:
+        self.cursor.execute("SELECT filename, content, embedding FROM embeddings")
+        rows = self.cursor.fetchall()
+        documents = []
+        for filename, content, embedding_blob in rows:
+            embedding = np.frombuffer(embedding_blob, dtype=np.float32)
+            documents.append({"filename": filename, "content": content, "embedding": embedding})
+        return documents
+    def compute_similarity(self, query_embedding: np.ndarray, document_embeddings: List[np.ndarray]) -> List[float]:
+        try:
+            similarities = cosine_similarity([query_embedding], document_embeddings)[0]
+            return similarities.tolist()
+        except Exception as e:
+            print(f"Error computing similarity: {str(e)}")
+            return []
+    def find_most_similar(self, query: str, top_k: int = 5) -> List[Dict]:
+        query_embedding = self.generate_embedding(query)
+        documents = self.load_embeddings()
+        if query_embedding.size == 0 or len(documents) == 0:
+            print("Error: Invalid embeddings or no documents found.")
+            return []
+        document_embeddings = [doc["embedding"] for doc in documents]
+        similarities = self.compute_similarity(query_embedding, document_embeddings)
+        ranked_results = sorted(
+            [{"filename": doc["filename"], "content": doc["content"][:100], "similarity": sim}
+             for doc, sim in zip(documents, similarities)],
+            key=lambda x: x["similarity"],
+            reverse=True
+        )
+        return ranked_results[:top_k]
+    def generate_response(self, query: str, top_k_docs: List[str]) -> str:
+        # Combine the query with the retrieved documents for context
+        context = " ".join(top_k_docs)
+        input_text = f"Query: {query}\nContext: {context}\nAnswer:"
+        # Generate a response using the generative model.
+        # Note: max_length counts the prompt tokens as well as the newly generated ones.
+        response = self.generator(input_text, max_length=1000, num_return_sequences=1)
+        # generated_text contains the prompt followed by the model's continuation.
+        return response[0]["generated_text"]
+    def find_most_similar_and_generate(self, query: str, top_k: int = 5) -> str:
+        top_k_results = self.find_most_similar(query, top_k)
+        top_k_docs = [result["content"] for result in top_k_results]
+        response = self.generate_response(query, top_k_docs)
+        return response
+# Example Usage
+if __name__ == "__main__":
+    # Initialize the embedding generator with RAG capabilities and ingest .txt files from the IMDB training directory
+    embedding_generator = EmbeddingGenerator()
+    embedding_generator.ingest_files(os.path.expanduser("~/data-sets/aclImdb/train/"))
+    # Perform a search query with RAG response generation
+    query = "find user comments tt0118866"
+    response = embedding_generator.find_most_similar_and_generate(query)
+    print("Generated Response:")
+    print(response)
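`generate_response` returns `generated_text` exactly as the text-generation pipeline produces it, and by default that string starts with the prompt itself (the `Query: ... Context: ... Answer:` text) before the continuation. If only the answer should be shown, the prompt can be stripped afterwards. The sketch below is one way to do that; `build_prompt`, `extract_answer` and the `max_context_chars` cap are hypothetical names and values, not part of this commit.

```python
from typing import List


def build_prompt(query: str, docs: List[str], max_context_chars: int = 1500) -> str:
    """Assemble the 'Query / Context / Answer' prompt with a capped context length."""
    # The character cap is an assumed safeguard against overly long prompts.
    context = " ".join(docs)[:max_context_chars]
    return f"Query: {query}\nContext: {context}\nAnswer:"


def extract_answer(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt, if present, and return only the continuation."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


if __name__ == "__main__":
    prompt = build_prompt("How do I prevent SQL injection?", ["Use parameterized queries."])
    # Stand-in for a model output: the prompt followed by a continuation.
    fake_output = prompt + " Prefer prepared statements."
    print(extract_answer(fake_output, prompt))
```

The Transformers text-generation pipeline also accepts `return_full_text=False`, which returns only the newly generated text and makes the manual stripping unnecessary; the helper above is mainly useful when the full text is needed elsewhere, for example for logging.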