Spaces:

Harshdhsvguyt
/

policy_rag_assistant


Harshdhsvguyt committed on Feb 6

Commit

754d8d3

verified ·

1 Parent(s): 5516a25

Upload 19 files

Files changed (19)

LICENSE +201 -0
README.md +40 -14
app.py +198 -0
main.py +106 -0
requirements.txt +6 -0
src/__pycache__/chunking.cpython-313.pyc +0 -0
src/__pycache__/evaluation.cpython-313.pyc +0 -0
src/__pycache__/loader.cpython-313.pyc +0 -0
src/__pycache__/prompts.cpython-313.pyc +0 -0
src/__pycache__/rag_pipeline.cpython-313.pyc +0 -0
src/__pycache__/utils.cpython-313.pyc +0 -0
src/__pycache__/vectorstore.cpython-313.pyc +0 -0
src/chunking.py +61 -0
src/evaluation.py +44 -0
src/loader.py +60 -0
src/prompts.py +69 -0
src/rag_pipeline.py +171 -0
src/utils.py +106 -0
src/vectorstore.py +93 -0

LICENSE ADDED

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

README.md CHANGED

@@ -1,14 +1,40 @@
----
-title: Policy Rag Assistant
-emoji: 📊
-colorFrom: indigo
-colorTo: purple
-sdk: gradio
-sdk_version: 6.5.1
-app_file: app.py
-pinned: false
-license: mit
-short_description: Policy RAG assistant with prompt comparison, grounded QA
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Policy_RAG_Assistant
+A minimal Retrieval-Augmented Generation (RAG) system that answers questions about company policy documents using grounded retrieval and structured prompting.
+This project focuses on **prompt engineering, hallucination reduction, and evaluation** rather than a complex UI or heavy frameworks.
+---
+## Overview
+The Policy RAG Assistant allows users to upload policy documents (PDF, TXT, MD) and ask questions about them.
+The system:
+- Retrieves relevant document chunks using semantic search
+- Generates grounded answers using an LLM
+- Reduces hallucinations through strict prompt design
+- Provides structured evaluation metrics for responses
+---
+## Architecture Overview
+User Question
+│
+▼
+Semantic Retrieval (ChromaDB)
+│
+▼
+Top-K Relevant Chunks
+│
+▼
+Prompt Builder (Initial / Improved)
+│
+▼
+Groq LLM (Llama 3.1)
+│
+▼
+Structured JSON Response + Evaluation
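
The flow in the diagram above can be sketched end to end with stand-ins. This is a minimal illustration under stated assumptions, not the code in this commit: `retrieve`, `build_prompt`, `fake_llm`, and `answer` are hypothetical names, word-overlap scoring stands in for ChromaDB's semantic search, and a stub replaces the Groq call.

```python
import json
from typing import Dict, List


def retrieve(question: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Rank chunks by shared lowercase words (a stand-in for semantic search)."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top_k chunks that share at least one word with the question
    return [c for score, c in scored[:top_k] if score > 0]


def build_prompt(context: List[str], question: str) -> str:
    """Join the retrieved chunks and the question into one prompt string."""
    return "CONTEXT:\n" + "\n".join(context) + "\nQUESTION:\n" + question


def fake_llm(prompt: str) -> str:
    """Hypothetical stub model: always returns a fixed JSON reply."""
    return json.dumps({"answer": "stub", "evidence": [], "confidence": "Low"})


def answer(question: str, chunks: List[str]) -> Dict:
    """Retrieve, build the prompt, call the model, and parse the JSON reply."""
    context = retrieve(question, chunks)
    if not context:
        # Nothing retrieved: refuse rather than guess
        return {
            "answer": "I don't know based on the provided documents.",
            "evidence": [],
            "confidence": "Low",
        }
    return json.loads(fake_llm(build_prompt(context, question)))
```

The early return mirrors the "nothing retrieved" branch in `src/rag_pipeline.py`: when retrieval comes back empty, the model is never called.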

app.py ADDED

	@@ -0,0 +1,198 @@

+import streamlit as st
+from src.loader import load_documents
+from src.chunking import chunk_documents
+from src.vectorstore import VectorStore
+from src.rag_pipeline import RAGPipeline
+from src.utils import ensure_directories
+from src.evaluation import analyze_confidence_distribution
+import os
+import tempfile
+from pathlib import Path
+# Page config
+st.set_page_config(page_title="Policy RAG Assistant", layout="wide")
+# Initialize
+ensure_directories()
+# Check API key
+if not os.getenv("GROQ_API_KEY"):
+    st.error("GROQ_API_KEY not set. Please set it as an environment variable.")
+    st.stop()
+# Initialize session state
+if "vector_store" not in st.session_state:
+    st.session_state.vector_store = None
+if "rag_pipeline" not in st.session_state:
+    st.session_state.rag_pipeline = None
+if "uploaded_files_count" not in st.session_state:
+    st.session_state.uploaded_files_count = 0
+# Title
+st.title("Policy RAG Assistant")
+st.markdown("Ask questions about company policies")
+# Sidebar
+with st.sidebar:
+    st.header("Setup")
+    upload_method = st.radio(
+        "Choose upload method:",
+        ["Upload files here", "Load from data/policies/"],
+        key="upload_method"
+    )
+    if upload_method == "Upload files here":
+        uploaded_files = st.file_uploader(
+            "Upload policy documents",
+            type=["pdf", "txt", "md"],
+            accept_multiple_files=True,
+        )
+        if uploaded_files and st.button("Process Uploaded Files"):
+            with st.spinner("Processing uploaded files..."):
+                from src.loader import load_pdf, load_text
+                docs = []
+                for uploaded_file in uploaded_files:
+                    tmp_path = None
+                    try:
+                        suffix = Path(uploaded_file.name).suffix
+                        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
+                            tmp_path = Path(tmp_file.name)
+                            tmp_file.write(uploaded_file.getvalue())
+                        if tmp_path.suffix.lower() == ".pdf":
+                            text = load_pdf(tmp_path)
+                        elif tmp_path.suffix.lower() in [".txt", ".md"]:
+                            text = load_text(tmp_path)
+                        else:
+                            continue
+                        if text.strip():
+                            docs.append({
+                                "text": text,
+                                "metadata": {
+                                    "source": uploaded_file.name,
+                                    "type": tmp_path.suffix[1:]
+                                }
+                            })
+                    except Exception as e:
+                        st.error(f"Error processing {uploaded_file.name}: {e}")
+                    finally:
+                        # Remove the temp file even if loading failed or the type was skipped
+                        if tmp_path is not None and tmp_path.exists():
+                            tmp_path.unlink()
+                if docs:
+                    chunked = chunk_documents(docs, chunk_size=500, overlap=100)
+                    vector_store = VectorStore()
+                    vector_store.reset()
+                    vector_store.add_documents(chunked)
+                    st.session_state.vector_store = vector_store
+                    st.session_state.rag_pipeline = RAGPipeline(vector_store)
+                    st.session_state.uploaded_files_count = len(docs)
+                    st.success(f"Processed {len(docs)} documents, {len(chunked)} chunks")
+                else:
+                    st.warning("No valid documents were processed")
+    else:
+        if st.button("Load Documents from Folder"):
+            with st.spinner("Loading documents..."):
+                docs = load_documents()
+                if docs:
+                    chunked = chunk_documents(docs, chunk_size=500, overlap=100)
+                    vector_store = VectorStore()
+                    vector_store.reset()
+                    vector_store.add_documents(chunked)
+                    st.session_state.vector_store = vector_store
+                    st.session_state.rag_pipeline = RAGPipeline(vector_store)
+                    st.session_state.uploaded_files_count = len(docs)
+                    st.success(f"Loaded {len(docs)} documents, {len(chunked)} chunks")
+                else:
+                    st.warning("No documents found in data/policies/")
+    if st.session_state.vector_store:
+        st.divider()
+        col1, col2 = st.columns(2)
+        with col1:
+            st.metric("Documents", st.session_state.uploaded_files_count)
+        with col2:
+            st.metric("Total Chunks", st.session_state.vector_store.count())
+    st.divider()
+    st.header("Analytics")
+    if st.button("View Stats"):
+        stats = analyze_confidence_distribution()
+        st.json(stats)
+# Main area
+if st.session_state.rag_pipeline is None:
+    st.info("Upload documents or load from folder in the sidebar to get started")
+else:
+    col1, col2 = st.columns([3, 1])
+    with col1:
+        question = st.text_input("Ask a question:", placeholder="e.g., What is the vacation policy?")
+    with col2:
+        prompt_type = st.selectbox("Prompt:", ["improved", "initial", "compare"])
+    if question:
+        if prompt_type == "compare":
+            colA, colB = st.columns(2)
+            with colA:
+                st.subheader("Initial Prompt Result")
+                result_initial = st.session_state.rag_pipeline.query(question, prompt_type="initial")
+                st.write(result_initial["answer"])
+                st.metric("Confidence", result_initial.get("confidence", "N/A"))
+                if result_initial.get("evaluation"):
+                    st.json(result_initial["evaluation"])
+            with colB:
+                st.subheader("Improved Prompt Result")
+                result_improved = st.session_state.rag_pipeline.query(question, prompt_type="improved")
+                st.write(result_improved["answer"])
+                st.metric("Confidence", result_improved.get("confidence", "N/A"))
+                if result_improved.get("evaluation"):
+                    st.json(result_improved["evaluation"])
+            display_chunks = result_improved["retrieved_chunks"]
+        else:
+            with st.spinner("Searching..."):
+                response = st.session_state.rag_pipeline.query(question, prompt_type=prompt_type)
+            st.markdown("### Answer")
+            st.write(response["answer"])
+            col1, col2 = st.columns(2)
+            with col1:
+                st.metric("Confidence", response.get("confidence", "N/A"))
+            with col2:
+                st.metric("Sources Used", len(response["retrieved_chunks"]))
+            if response.get("evaluation"):
+                st.subheader("Evaluation")
+                st.json(response["evaluation"])
+            if response.get("evidence"):
+                with st.expander("Evidence"):
+                    for i, ev in enumerate(response["evidence"], 1):
+                        st.markdown(f"{i}. {ev}")
+            display_chunks = response["retrieved_chunks"]
+        with st.expander("Retrieved Chunks"):
+            for i, chunk in enumerate(display_chunks, 1):
+                st.markdown(f"Chunk {i} (score: {chunk.get('score', 0):.4f})")
+                st.markdown(f"Source: {chunk.get('metadata', {}).get('source', 'Unknown')}")
+                st.text((chunk["text"][:300] + "...") if len(chunk["text"]) > 300 else chunk["text"])
+                st.divider()

main.py ADDED

	@@ -0,0 +1,106 @@

+import sys
+import os
+from dotenv import load_dotenv
+from src.loader import load_documents
+from src.chunking import chunk_documents
+from src.vectorstore import VectorStore
+from src.rag_pipeline import RAGPipeline
+from src.utils import ensure_directories
+# Load environment variables
+load_dotenv()
+def setup_vector_store():
+    """Initialize and populate vector store."""
+    print("Loading documents...")
+    docs = load_documents()
+    if not docs:
+        print("No documents found in data/policies/")
+        sys.exit(1)
+    print(f"Loaded {len(docs)} documents")
+    print("Chunking documents...")
+    chunked = chunk_documents(docs, chunk_size=500, overlap=100)
+    print(f"Created {len(chunked)} chunks")
+    print("Initializing vector store...")
+    vector_store = VectorStore()
+    vector_store.reset()
+    vector_store.add_documents(chunked)
+    print("Setup complete!")
+    return vector_store
+def main():
+    """CLI interface for RAG pipeline."""
+    ensure_directories()
+    # ------------------------------------------------
+    # Check API key
+    # ------------------------------------------------
+    if not os.getenv("GROQ_API_KEY"):
+        print("Error: GROQ_API_KEY environment variable not set")
+        sys.exit(1)
+    # ------------------------------------------------
+    # Get question from command line
+    # ------------------------------------------------
+    if len(sys.argv) < 2:
+        print("Usage: python main.py 'Your question here'")
+        sys.exit(1)
+    question = " ".join(sys.argv[1:])
+    # ------------------------------------------------
+    # Setup RAG pipeline
+    # ------------------------------------------------
+    vector_store = setup_vector_store()
+    rag_pipeline = RAGPipeline(vector_store)
+    # ------------------------------------------------
+    # Query
+    # ------------------------------------------------
+    print(f"\nQuestion: {question}\n")
+    response = rag_pipeline.query(question, prompt_type="improved")
+    # ------------------------------------------------
+    # Display Results
+    # ------------------------------------------------
+    print("=" * 80)
+    print("ANSWER:")
+    print(response["answer"])
+    print("\n" + "=" * 80)
+    print(f"Confidence: {response.get('confidence', 'N/A')}")
+    print(f"Sources Retrieved: {len(response['retrieved_chunks'])}")
+    # Show a short preview of each retrieved chunk
+    if response.get("retrieved_chunks"):
+        print("\nRETRIEVED CONTEXT PREVIEW:")
+        for i, chunk in enumerate(response["retrieved_chunks"], 1):
+            preview = chunk["text"][:120].replace("\n", " ")
+            print(f"{i}. {preview}...")
+    if response.get("evidence"):
+        print("\nEVIDENCE:")
+        for i, ev in enumerate(response["evidence"], 1):
+            print(f"{i}. {ev}")
+    # Evaluation metrics
+    if response.get("evaluation"):
+        print("\n" + "=" * 80)
+        print("EVALUATION:")
+        for k, v in response["evaluation"].items():
+            print(f"{k}: {v}")
+    print("\n" + "=" * 80)
+if __name__ == "__main__":
+    main()

requirements.txt ADDED

	@@ -0,0 +1,6 @@

+streamlit
+chromadb
+sentence-transformers
+groq
+python-dotenv
+PyPDF2

src/__pycache__/chunking.cpython-313.pyc ADDED

Binary file (1.97 kB)

src/__pycache__/evaluation.cpython-313.pyc ADDED

Binary file (2.03 kB)

src/__pycache__/loader.cpython-313.pyc ADDED

Binary file (2.84 kB)

src/__pycache__/prompts.cpython-313.pyc ADDED

Binary file (2.05 kB)

src/__pycache__/rag_pipeline.cpython-313.pyc ADDED

Binary file (5.08 kB)

src/__pycache__/utils.cpython-313.pyc ADDED

Binary file (3.76 kB)

src/__pycache__/vectorstore.cpython-313.pyc ADDED

Binary file (4.37 kB)

src/chunking.py ADDED

	@@ -0,0 +1,61 @@

+from typing import List
+def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
+    """
+    Split text into overlapping chunks based on word count.
+    Args:
+        text: Input text to chunk
+        chunk_size: Number of words per chunk
+        overlap: Number of overlapping words between chunks
+    Returns:
+        List of text chunks
+    """
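+    if chunk_size <= 0:
+        raise ValueError("chunk_size must be positive")
+    if not 0 <= overlap < chunk_size:
+        # Otherwise `start = end - overlap` never advances and the loop never ends
+        raise ValueError("overlap must be non-negative and smaller than chunk_size")
+    words = text.split()
+    chunks = []
+    if len(words) <= chunk_size:
+        return [text]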
+    start = 0
+    while start < len(words):
+        end = start + chunk_size
+        chunk_words = words[start:end]
+        chunks.append(" ".join(chunk_words))
+        if end >= len(words):
+            break
+        start = end - overlap
+    return chunks
+def chunk_documents(documents: List[dict], chunk_size: int = 500, overlap: int = 100) -> List[dict]:
+    """
+    Chunk multiple documents while preserving metadata.
+    Returns:
+        List of dicts with 'text' and 'metadata' keys
+    """
+    chunked_docs = []
+    for doc in documents:
+        text = doc["text"]
+        metadata = doc.get("metadata", {})
+        chunks = chunk_text(text, chunk_size, overlap)
+        for i, chunk in enumerate(chunks):
+            chunked_docs.append({
+                "text": chunk,
+                "metadata": {
+                    **metadata,
+                    "chunk_id": i,
+                    "total_chunks": len(chunks)
+                }
+            })
+    return chunked_docs
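
The sliding window in `chunk_text` advances by `chunk_size - overlap` words per step, so the number of chunks can be predicted in closed form. The sketch below is illustrative only: `chunk_words` re-implements the loop on a list of words rather than a string, and `expected_chunks` is a hypothetical helper that is not part of this commit.

```python
import math
from typing import List


def chunk_words(words: List[str], chunk_size: int = 500, overlap: int = 100) -> List[List[str]]:
    """Sliding window over words, mirroring the loop in chunk_text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if len(words) <= chunk_size:
        return [words]
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(words[start:end])
        if end >= len(words):
            break
        start = end - overlap  # advance by the stride: chunk_size - overlap
    return chunks


def expected_chunks(n_words: int, chunk_size: int = 500, overlap: int = 100) -> int:
    """Closed form for the number of windows produced by chunk_words."""
    if n_words <= chunk_size:
        return 1
    return math.ceil((n_words - overlap) / (chunk_size - overlap))


words = [f"w{i}" for i in range(1000)]
chunks = chunk_words(words)
print(len(chunks), [len(c) for c in chunks])
```

For 1,000 words with the defaults, the windows start at words 0, 400, and 800, giving chunks of 500, 500, and 200 words.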

src/evaluation.py ADDED

	@@ -0,0 +1,44 @@

+import json
+from pathlib import Path
+from typing import List, Dict
+def load_queries_log(log_file: str = "logs/queries.jsonl") -> List[Dict]:
+    """Load all logged queries."""
+    queries = []
+    if not Path(log_file).exists():
+        return queries
+    with open(log_file, "r") as f:
+        for line in f:
+            line = line.strip()
+            if line:  # skip blank lines, which json.loads would reject
+                queries.append(json.loads(line))
+    return queries
+def analyze_confidence_distribution(log_file: str = "logs/queries.jsonl") -> Dict:
+    """Analyze confidence score distribution from logs."""
+    queries = load_queries_log(log_file)
+    confidence_counts = {"High": 0, "Medium": 0, "Low": 0, "N/A": 0}
+    for query in queries:
+        confidence = query.get("response", {}).get("confidence", "N/A")
+        confidence_counts[confidence] = confidence_counts.get(confidence, 0) + 1
+    return {
+        "total_queries": len(queries),
+        "confidence_distribution": confidence_counts
+    }
+def compare_prompts(question: str, rag_pipeline) -> Dict:
+    """Compare initial vs improved prompt responses."""
+    initial_response = rag_pipeline.query(question, prompt_type="initial")
+    improved_response = rag_pipeline.query(question, prompt_type="improved")
+    return {
+        "question": question,
+        "initial": initial_response,
+        "improved": improved_response
+    }
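
The confidence tally reads one JSON object per line and looks only at `response.confidence`. Since `log_query` is not shown in this part of the diff, the record shape below is an assumption inferred from `query.get("response", {}).get("confidence", "N/A")`. The `question` field and the sample values are hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical log records; only response.confidence is read by the tally
records = [
    {"question": "q1", "response": {"confidence": "High"}},
    {"question": "q2", "response": {"confidence": "Low"}},
    {"question": "q3", "response": {}},  # missing confidence counts as "N/A"
]

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "queries.jsonl"
    # JSONL: one JSON object per line
    log_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")

    counts = Counter()
    with open(log_path, "r") as f:
        for line in f:
            if line.strip():  # ignore blank lines
                entry = json.loads(line)
                counts[entry.get("response", {}).get("confidence", "N/A")] += 1

print(dict(counts))
```

A `Counter` is used here for brevity. The function in the diff instead pre-seeds the four expected labels, so they appear in the output even when their count is zero.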

src/loader.py ADDED

	@@ -0,0 +1,60 @@

+import os
+from pathlib import Path
+from typing import List
+import PyPDF2
+def load_documents(directory: str = "data/policies") -> List[dict]:
+    """
+    Load all documents from the policies directory.
+    Supports PDF, TXT, and MD files.
+    Returns:
+        List of dicts with 'text' and 'metadata' keys
+    """
+    documents = []
+    policy_dir = Path(directory)
+    if not policy_dir.exists():
+        print(f"Warning: {directory} does not exist")
+        return documents
+    for file_path in policy_dir.iterdir():
+        if file_path.is_file():
+            try:
+                if file_path.suffix.lower() == ".pdf":
+                    text = load_pdf(file_path)
+                elif file_path.suffix.lower() in [".txt", ".md"]:
+                    text = load_text(file_path)
+                else:
+                    continue
+                if text.strip():
+                    documents.append({
+                        "text": text,
+                        "metadata": {
+                            "source": file_path.name,
+                            "type": file_path.suffix[1:]
+                        }
+                    })
+                    print(f"Loaded: {file_path.name}")
+            except Exception as e:
+                print(f"Error loading {file_path.name}: {e}")
+    return documents
+def load_pdf(file_path: Path) -> str:
+    """Extract text from PDF file."""
+    text = []
+    with open(file_path, "rb") as f:
+        reader = PyPDF2.PdfReader(f)
+        for page in reader.pages:
+            # Guard against pages with no extractable text
+            text.append(page.extract_text() or "")
+    return "\n".join(text)
+def load_text(file_path: Path) -> str:
+    """Load text from TXT or MD file."""
+    with open(file_path, "r", encoding="utf-8") as f:
+        return f.read()

src/prompts.py ADDED

	@@ -0,0 +1,69 @@

+INITIAL_PROMPT = """You are a helpful assistant that answers questions about company policies.
+Context:
+{context}
+Question: {question}
+Answer the question based on the context provided above."""
+IMPROVED_PROMPT = """You are a RETRIEVAL-GROUNDED Policy Question Answering Assistant.
+Your job is to answer strictly using the provided CONTEXT.
+You are NOT allowed to use outside knowledge.
+Follow these steps internally:
+1. Read the context carefully.
+2. Identify exact sentences that answer the question.
+3. If no supporting sentences exist, reply:
+   "I don't know based on the provided documents."
+STRICT RULES:
+- Do NOT guess.
+- Do NOT add new information.
+- Every claim MUST be supported by a quote from CONTEXT.
+- Evidence MUST be SHORT DIRECT QUOTES copied exactly from the context.
+- If evidence is missing → answer must be "I don't know based on the provided documents."
+CONTEXT:
+{context}
+QUESTION:
+{question}
+Return ONLY valid JSON:
+{{
+  "answer": "Grounded answer or 'I don't know based on the provided documents.'",
+  "evidence": ["exact short quote 1", "exact short quote 2"],
+  "confidence": "High|Medium|Low"
+}}
+Confidence Guidelines:
+- High → Answer explicitly stated in one place
+- Medium → Requires combining multiple context sections
+- Low → Weak or partial support
+JSON Response:"""
+def get_prompt(prompt_type: str, context: str, question: str) -> str:
+    """
+    Get formatted prompt.
+    Args:
+        prompt_type: "initial" or "improved"
+        context: Retrieved document context
+        question: User question
+    Returns:
+        Formatted prompt string
+    """
+    if prompt_type == "initial":
+        template = INITIAL_PROMPT
+    else:
+        template = IMPROVED_PROMPT
+    return template.format(context=context, question=question)
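
`IMPROVED_PROMPT` doubles the braces around its JSON skeleton because `str.format` treats single braces as replacement fields. A small sketch of that behavior follows; `TEMPLATE` is a shortened, hypothetical stand-in for the real prompt and the sample values are made up.

```python
TEMPLATE = """CONTEXT:
{context}
QUESTION:
{question}
Return ONLY valid JSON:
{{
  "answer": "..."
}}"""

# Only the template is scanned for replacement fields. Braces inside the
# substituted values are inserted verbatim and are not parsed again.
prompt = TEMPLATE.format(context="Rule {A}: 20 days.", question="How many days?")
print(prompt)
```

The doubled braces come out as single literal braces in the final prompt, and retrieved text that happens to contain `{` or `}` passes through unchanged.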

src/rag_pipeline.py ADDED

	@@ -0,0 +1,171 @@

+from groq import Groq
+from typing import List, Dict
+from src.vectorstore import VectorStore
+from src.prompts import get_prompt
+from src.utils import safe_json_parse, log_query, get_groq_api_key, evaluate_response
+import os
+from dotenv import load_dotenv
+load_dotenv()
+class RAGPipeline:
+    """Main RAG pipeline for question answering."""
+    def __init__(self, vector_store: VectorStore, model: str = "llama-3.1-8b-instant"):
+        """Initialize RAG pipeline."""
+        self.vector_store = vector_store
+        self.model = model
+        self.client = Groq(api_key=get_groq_api_key())
+    def query(self, question: str, prompt_type: str = "improved", top_k: int = 5) -> Dict:
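+        """
+        Answer a question using RAG.
+        Args:
+            question: User question
+            prompt_type: "initial" or "improved"
+            top_k: Number of chunks to retrieve from the vector store
+        Returns:
+            Dict with 'answer', 'evidence', 'confidence', 'retrieved_chunks', and 'evaluation' keys
+        """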
+        # ------------------------------------------------
+        # 1️⃣ Retrieve relevant documents
+        # ------------------------------------------------
+        retrieved_chunks = self.vector_store.search(question, top_k=top_k)
+        # Apply simple reranking (BONUS FEATURE)
+        if retrieved_chunks:
+            retrieved_chunks = self.rerank_simple(retrieved_chunks, question)
+        # ------------------------------------------------
+        # 2️⃣ Handle case where nothing retrieved
+        # ------------------------------------------------
+        if not retrieved_chunks:
+            response = {
+                "answer": "I don't know based on the provided documents.",
+                "evidence": [],
+                "confidence": "Low",
+                "retrieved_chunks": []
+            }
+            #  Add evaluation metrics
+            evaluation = evaluate_response(question, response, prompt_type)
+            response["evaluation"] = evaluation
+            log_query(question, [], response, prompt_type)
+            return response
+        # ------------------------------------------------
+        # 3️⃣ Build context
+        # ------------------------------------------------
+        context = self._build_context(retrieved_chunks)
+        # Safety cap: keep at most 4000 characters of context (this may cut the last chunk mid-sentence)
+        context = context[:4000]
+        # ------------------------------------------------
+        # 4️⃣ Create prompt
+        # ------------------------------------------------
+        prompt = get_prompt(prompt_type, context, question)
+        # ------------------------------------------------
+        # 5️⃣ Call Groq API
+        # ------------------------------------------------
+        try:
+            completion = self.client.chat.completions.create(
+                model=self.model,
+                messages=[{"role": "user", "content": prompt}],
+                temperature=0.0,  # keep output as deterministic as possible for RAG
+                max_tokens=1024
+            )
+            response_text = completion.choices[0].message.content
+            # ------------------------------------------------
+            # 6️⃣ Parse response
+            # ------------------------------------------------
+            if prompt_type == "improved":
+                parsed = safe_json_parse(response_text)
+                if parsed:
+                    response = {
+                        "answer": parsed.get("answer", response_text),
+                        "evidence": parsed.get("evidence", []),
+                        "confidence": parsed.get("confidence", "Medium"),
+                        "retrieved_chunks": retrieved_chunks
+                    }
+                else:
+                    # Fallback if JSON parsing fails
+                    response = {
+                        "answer": response_text,
+                        "evidence": [],
+                        "confidence": "Medium",
+                        "retrieved_chunks": retrieved_chunks
+                    }
+            else:
+                response = {
+                    "answer": response_text,
+                    "evidence": [],
+                    "confidence": "N/A",
+                    "retrieved_chunks": retrieved_chunks
+                }
+            # ------------------------------------------------
+            #  7️⃣ Add Evaluation Metrics (NEW)
+            # ------------------------------------------------
+            evaluation = evaluate_response(question, response, prompt_type)
+            response["evaluation"] = evaluation
+            # ------------------------------------------------
+            # 8️⃣ Log Query
+            # ------------------------------------------------
+            log_query(question, retrieved_chunks, response, prompt_type)
+            return response
+        except Exception as e:
+            print(f"Error calling LLM: {e}")
+            response = {
+                "answer": "The system encountered an error while generating a response.",
+                "evidence": [],
+                "confidence": "Low",
+                "retrieved_chunks": retrieved_chunks
+            }
+            evaluation = evaluate_response(question, response, prompt_type)
+            response["evaluation"] = evaluation
+            return response
+    # ------------------------------------------------
+    # Helper: Build Context
+    # ------------------------------------------------
+    def _build_context(self, chunks: List[Dict]) -> str:
+        """Build context string from retrieved chunks."""
+        context_parts = []
+        for i, chunk in enumerate(chunks, 1):
+            source = chunk.get("metadata", {}).get("source", "Unknown")
+            text = chunk["text"]
+            context_parts.append(f"[Document {i} - {source}]\n{text}\n")
+        return "\n".join(context_parts)
+    # ------------------------------------------------
+    # BONUS: Simple Reranker
+    # ------------------------------------------------
+    def rerank_simple(self, chunks: List[Dict], question: str) -> List[Dict]:
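+        """
+        Simple reranking based on keyword overlap.
+        Chunks sharing more words with the question come first; ties are broken
+        by the lower vector-store distance. Adds a 'keyword_score' key to each chunk.
+        """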
+        question_words = set(question.lower().split())
+        for chunk in chunks:
+            text_words = set(chunk["text"].lower().split())
+            overlap = len(question_words & text_words)
+            chunk["keyword_score"] = overlap
+        reranked = sorted(
+            chunks,
+            key=lambda x: (x.get("keyword_score", 0), -x.get("score", 0)),
+            reverse=True
+        )
+        return reranked
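
Reviewer note: `rerank_simple` splits on whitespace, so a question token such as `policy?` will not match `policy` in a chunk. The sketch below is one possible variant, not the method used in this commit: it tokenizes with a regular expression before counting overlap and breaks ties by the lower distance. `tokenize` and `rerank_by_overlap` are hypothetical names, and the `score` values are made-up distances.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and keep only alphanumeric runs, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_by_overlap(chunks: List[Dict], question: str) -> List[Dict]:
    """Sort by keyword overlap (descending), then by distance (ascending)."""
    question_words = tokenize(question)
    scored = [
        {**chunk, "keyword_score": len(question_words & tokenize(chunk["text"]))}
        for chunk in chunks
    ]
    return sorted(scored, key=lambda c: (-c["keyword_score"], c.get("score", 0.0)))


chunks = [
    {"text": "Office hours are nine to five.", "score": 0.20},
    {"text": "The refund policy allows returns within 30 days.", "score": 0.35},
    {"text": "Refund requests need a receipt.", "score": 0.30},
]
ranked = rerank_by_overlap(chunks, "What is the refund policy?")
for chunk in ranked:
    print(chunk["keyword_score"], chunk["text"])
```

Unlike the method in the diff, this variant copies each chunk instead of mutating it, which keeps the input list reusable; whether that matters depends on how callers use the retrieved chunks.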

src/utils.py ADDED

	@@ -0,0 +1,106 @@

+import os
+import json
+from datetime import datetime
+from pathlib import Path
+def ensure_directories():
+    """Create necessary directories if they don't exist."""
+    Path("data/policies").mkdir(parents=True, exist_ok=True)
+    Path("logs").mkdir(parents=True, exist_ok=True)
+    Path("chroma_db").mkdir(parents=True, exist_ok=True)
+def log_query(question, retrieved_chunks, response, prompt_type="improved"):
+    """Log query details to JSONL file."""
+    log_entry = {
+        "timestamp": datetime.now().isoformat(),
+        "question": question,
+        "prompt_type": prompt_type,
+        "num_chunks_retrieved": len(retrieved_chunks),
+        "chunks": [
+            {
+                "text": chunk["text"][:200] + "..." if len(chunk["text"]) > 200 else chunk["text"],
+                "metadata": chunk.get("metadata", {})
+            }
+            for chunk in retrieved_chunks
+        ],
+        "response": response
+    }
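+    log_file = Path("logs/queries.jsonl")
+    # Create the log directory on demand so logging works even if ensure_directories() was not called
+    log_file.parent.mkdir(parents=True, exist_ok=True)
+    with open(log_file, "a", encoding="utf-8") as f:
+        f.write(json.dumps(log_entry, ensure_ascii=False) + "\n")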
+def get_groq_api_key():
+    """Get Groq API key from environment."""
+    api_key = os.getenv("GROQ_API_KEY")
+    if not api_key:
+        raise ValueError("GROQ_API_KEY environment variable not set")
+    return api_key
+def safe_json_parse(text):
+    """Safely parse JSON from LLM response."""
+    try:
+        # Try to find JSON in the response
+        start = text.find("{")
+        end = text.rfind("}") + 1
+        if start != -1 and end > start:
+            json_str = text[start:end]
+            return json.loads(json_str)
+        return None
+    except Exception:
+        return None
+# ============================================================
+# ⭐ NEW: Simple RAG Evaluation Metrics
+# ============================================================
+def evaluate_response(question: str, response: dict, prompt_type: str) -> dict:
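+    """
+    Generate simple evaluation metrics for RAG output.
+    These are heuristics computed from the response alone, not checks against ground truth.
+    Metrics:
+    - Accuracy ("⚠️" if the answer starts with "I don't know", otherwise "✅")
+    - Groundedness ("✅" if any evidence quotes are present, otherwise "⚠️")
+    - Hallucination Risk ("LOW" for refusals or answers with evidence, otherwise "MEDIUM")
+    - Prompt Version
+    """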
+    answer = response.get("answer", "")
+    evidence = response.get("evidence", [])
+    # ---------------------------
+    # Accuracy (simple heuristic)
+    # ---------------------------
+    if isinstance(answer, str) and answer.startswith("I don't know"):
+        accuracy = "⚠️"
+    else:
+        accuracy = "✅"
+    # ---------------------------
+    # Groundedness
+    # ---------------------------
+    groundedness = "✅" if evidence else "⚠️"
+    # ---------------------------
+    # Hallucination Risk
+    # ---------------------------
+    if isinstance(answer, str) and answer.startswith("I don't know"):
+        hallucination = "LOW"
+    elif evidence:
+        hallucination = "LOW"
+    else:
+        hallucination = "MEDIUM"
+    evaluation = {
+        "Accuracy": accuracy,
+        "Groundedness": groundedness,
+        "Hallucination Risk": hallucination,
+        "Prompt Version": prompt_type
+    }
+    return evaluation
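
Reviewer note: `safe_json_parse` slices from the first `{` to the last `}` before calling `json.loads`, which tolerates chatty text around a single JSON object but not two separate objects in one reply. The standalone sketch below mirrors that logic under a hypothetical name (`extract_json`) to make the behaviour easy to check; the sample replies are invented.

```python
import json
from typing import Optional


def extract_json(text: str) -> Optional[dict]:
    """Parse the span between the first '{' and the last '}' as JSON, or return None."""
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# A reply with prose around one JSON object parses cleanly.
reply = 'Sure, here is the JSON: {"answer": "30 days", "evidence": ["within 30 days"], "confidence": "High"} Hope that helps.'
parsed = extract_json(reply)

# No braces at all, or two separate objects, both fall back to None.
no_json = extract_json("I don't know based on the provided documents.")
two_objects = extract_json('{"a": 1} and {"b": 2}')
print(parsed, no_json, two_objects)
```

When the parse fails, the pipeline in this commit falls back to the raw text with `"Medium"` confidence, so a reply containing two objects would surface to the user unparsed.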

src/vectorstore.py ADDED

	@@ -0,0 +1,93 @@

+import chromadb
+from chromadb.config import Settings
+from sentence_transformers import SentenceTransformer
+from typing import List
+class VectorStore:
+    """Simple ChromaDB wrapper for document storage and retrieval."""
+    def __init__(self, collection_name: str = "policy_docs", persist_directory: str = "./chroma_db"):
+        """Initialize ChromaDB and embedding model."""
+        self.client = chromadb.PersistentClient(
+            path=persist_directory,
+            settings=Settings(anonymized_telemetry=False)
+        )
+        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
+        self.collection_name = collection_name
+        # Get or create collection
+        self.collection = self.client.get_or_create_collection(
+            name=collection_name,
+            metadata={"hnsw:space": "cosine"}
+        )
+    def add_documents(self, documents: List[dict]):
+        """
+        Add documents to the vector store.
+        Args:
+            documents: List of dicts with 'text' and 'metadata' keys
+        """
+        if not documents:
+            print("No documents to add")
+            return
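+        texts = [doc["text"] for doc in documents]
+        metadatas = [doc.get("metadata", {}) for doc in documents]
+        # Offset IDs by the current collection size so repeated calls do not reuse "doc_0", "doc_1", ...
+        start = self.collection.count()
+        ids = [f"doc_{start + i}" for i in range(len(documents))]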
+        # Generate embeddings
+        embeddings = self.embedding_model.encode(texts).tolist()
+        # Add to ChromaDB
+        self.collection.add(
+            embeddings=embeddings,
+            documents=texts,
+            metadatas=metadatas,
+            ids=ids
+        )
+        print(f"Added {len(documents)} chunks to vector store")
+    def search(self, query: str, top_k: int = 5) -> List[dict]:
+        """
+        Search for relevant documents.
+        Returns:
+            List of dicts with 'text', 'metadata', and 'score' keys.
+            'score' is the cosine distance reported by ChromaDB (lower means more similar).
+        """
+        # Generate query embedding
+        query_embedding = self.embedding_model.encode([query]).tolist()
+        # Search
+        results = self.collection.query(
+            query_embeddings=query_embedding,
+            n_results=top_k
+        )
+        # Format results
+        documents = []
+        if results["documents"] and results["documents"][0]:
+            for i, doc in enumerate(results["documents"][0]):
+                documents.append({
+                    "text": doc,
+                    "metadata": results["metadatas"][0][i] if results["metadatas"] else {},
+                    "score": results["distances"][0][i] if results["distances"] else 0
+                })
+        return documents
+    def reset(self):
+        """Delete and recreate the collection."""
+        self.client.delete_collection(self.collection_name)
+        self.collection = self.client.create_collection(
+            name=self.collection_name,
+            metadata={"hnsw:space": "cosine"}
+        )
+        print("Vector store reset")
+    def count(self) -> int:
+        """Get count of documents in collection."""
+        return self.collection.count()
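
Reviewer note: the collection is created with `"hnsw:space": "cosine"`, so the `score` returned by `search` is a distance, where smaller values mean closer matches. The sketch below illustrates that ordering with plain Python, assuming the usual definition of cosine distance as one minus cosine similarity; the vectors and names are made up, and no ChromaDB call is involved.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


query = [1.0, 0.0]
docs = {
    "same_direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}

# Ascending sort: the smallest distance is the best match.
ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
for name in ranked:
    print(name, round(cosine_distance(query, docs[name]), 4))
```

Any code that consumes `score` as if it were a similarity would rank results backwards, which is why the tie-breaker in `rerank_simple` negates it before sorting in descending order.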