jme-datasci committed
Commit d7dee9d · 1 Parent(s): a4fc9ce

updated app and readme

Files changed (2)
  1. README.md +224 -0
  2. app.py +131 -55
README.md CHANGED
@@ -15,3 +15,227 @@ short_description: RAG Enabled ChatBot for Charlottesville Municipal Code
  ---
 
  An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
+
+ # Charlottesville Local Ordinance Assistant
+
+ ## 1. Introduction
+
+ Local laws are often written in dense legal terminology that the average person struggles to interpret, turning simple questions about parking or zoning into a maze of irrelevant sections and complex jargon. While current Large Language Models (LLMs) like ChatGPT have seen municipal codes, they are trained on codes from across the country, leading to generalized answers that may blend details from different jurisdictions and hallucinate non-existent regulations. To solve this, I developed the **Charlottesville Local Ordinance Assistant**, a system designed specifically to answer questions about the Charlottesville, VA municipal code in plain English. This project uses a Retrieval-Augmented Generation (RAG) pipeline to ensure legal accuracy by retrieving up-to-date ordinances, coupled with a specific system prompt designed to translate that "legalese" into clear, accessible language without the need for computationally expensive fine-tuning. The results demonstrate that constraining the model to local data and applying strong prompt engineering significantly reduce hallucinations compared to off-the-shelf generalist models.
+
+ ## 2. Data
+
+ For the RAG pipeline, the knowledge base consists of the unedited Charlottesville Municipal Code text, scraped from [Municode](https://library.municode.com/va/charlottesville/codes/code_of_ordinances) and pre-processed into chunks. The chunks were not rephrased, ensuring that the retrieval mechanism pulls the exact letter of the law. To evaluate the pipeline, I used a set of question-and-answer pairs generated from the original sections of the municipal code to validate retrieval accuracy, i.e., checking whether the retrieved node matched the ground-truth node for a given query; a minimal sketch of this check appears below.
+
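+ The retrieval check reduces to a simple hit@k metric. Below is a minimal sketch, assuming `rag` is the `MyRAGPipeline` instance built in Section 5 and that each generated question is paired with the `Section` id of the chunk it was written from; the function name and the sample pair are illustrative, not the exact evaluation script.
+
+ ```python
+ # Hypothetical retrieval-accuracy (hit@k) check for the Q&A eval set.
+ eval_set = [
+     # (generated question, Section id of the ground-truth chunk) -- illustrative pair
+     ("Who is responsible for keeping the city's official seal?", "Sec. 2-100"),
+ ]
+
+ def retrieval_hit_rate(rag, eval_set, k=5):
+     hits = 0
+     for question, gold_section in eval_set:
+         docs = rag.retrieve(question, num_docs=k)
+         retrieved_sections = {doc.metadata.get("Section") for doc in docs}
+         hits += gold_section in retrieved_sections
+     return hits / len(eval_set)
+ ```
+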
+ ## 3. Methodology
+
+ For the RAG methodology, I implemented a dense retrieval system. I selected **Qwen3-Embedding-0.6B** as the embedding model due to the relatively small size of the RAG corpus (the municipal code); it allows for high-precision retrieval without the latency of larger embedding models. The retrieved context is passed to the **Llama-3.2-1B** generator to synthesize the final answer. A sketch of the offline index build follows.
+
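+ The FAISS index itself is built offline. Below is a minimal sketch using LangChain's FAISS wrapper, assuming the scraped ordinances have already been chunked into `Document`s with `Section`/`Subtitle` metadata; the sample chunk is illustrative, and the actual chunking script is not shown here.
+
+ ```python
+ from langchain_core.documents import Document
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ chunks = [
+     Document(
+         page_content="The clerk of the council shall be custodian of the city seal...",
+         metadata={"Section": "Sec. 2-100", "Subtitle": "City seal"},  # illustrative
+     ),
+     # ... one Document per pre-processed section of the municipal code
+ ]
+
+ embeddings = HuggingFaceEmbeddings(
+     model_name="Qwen/Qwen3-Embedding-0.6B",
+     encode_kwargs={"normalize_embeddings": True},  # normalized vectors -> cosine similarity
+ )
+
+ vector_db = FAISS.from_documents(chunks, embeddings)
+ vector_db.save_local("index/")  # the folder MyRAGPipeline loads at startup
+ ```
+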
+ ## 4. Evaluation
+
+ ### Benchmark Results
+
+ To strictly evaluate the legal reasoning and retrieval capabilities of the model, I utilized three established benchmarks: [LegalBench-RAG](https://github.com/hazyresearch/legalbench), [RAGBench](https://arxiv.org/abs/2306.16092), and [RAGTruth](https://arxiv.org/abs/2401.00396). I chose these because they specifically target the weaknesses of legal LLMs: the ability to reason over specific documents and the frequency of hallucinations.
+
+ LegalBench-RAG, RAGBench, and my custom test split were all evaluated using **meta-llama/Llama-3.1-8B** as the judge across eight metrics:
+
+ * **Context Relevance**: Measures the proportion of retrieved information that is actually pertinent to the user's query.
+ * **Context Recall**: Assesses whether the retrieved context contains all the ground-truth information required to answer.
+ * **Chunk Relevance**: Evaluates the precision of individual retrieved document segments relative to the input query.
+ * **Faithfulness**: Checks whether the generated answer is factually derived solely from the retrieved context (hallucination detection).
+ * **Answer Relevance**: Determines how well the generated response directly addresses the user's original prompt.
+ * **Answer Correctness**: Scores the accuracy of the generated answer against a known gold-standard reference.
+ * **Answer Completeness**: Checks whether the response addresses all parts of the query without omitting key details.
+ * **Safety**: Measures the model's ability to refuse to generate harmful, toxic, or inappropriate content.
+
+ Finally, because RAGTruth retrievals are frozen, only faithfulness was evaluated on that benchmark. Benchmark results are shown below; a sketch of the judging setup follows the table.
+
+ | Metric              | LegalBench-RAG | RAGBench | RAGTruth-QA | Custom Test Split |
+ |---------------------|:--------------:|:--------:|:-----------:|:-----------------:|
+ | Context Relevance   | 87.33          | 22.17    | -           |                   |
+ | Context Recall      | 47.63          | 20.56    | -           | 78.87             |
+ | Chunk Relevance     | 85.76          | 23.88    | -           |                   |
+ | Faithfulness        | 60.72          | 69.68    | 74.11       | 72.64             |
+ | Answer Relevance    | 76.71          | 65.82    | -           | 61.87             |
+ | Answer Correctness  | 41.20          | 10.33    | -           | 38.17             |
+ | Answer Completeness | 75.59          | 66.49    | -           | 76.46             |
+ | Safety              | 97.12          | 97.49    | -           | 98.70             |
+
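+ Each metric above was scored by prompting the judge model with a rubric. The snippet below is an illustrative reconstruction of that setup for faithfulness, not the exact judging prompt; it assumes the judge model fits on the available GPU.
+
+ ```python
+ from transformers import pipeline
+
+ judge_pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B", device_map="auto")
+
+ def judge_faithfulness(context, answer):
+     # Ask the judge for a single 0-100 score and parse it back.
+     prompt = (
+         "You are grading a RAG system. On a scale of 0 to 100, how faithfully is the "
+         "ANSWER supported by the CONTEXT alone? Reply with only the number.\n\n"
+         f"CONTEXT:\n{context}\n\nANSWER:\n{answer}\n\nSCORE:"
+     )
+     out = judge_pipe(prompt, max_new_tokens=5, do_sample=False, return_full_text=False)
+     try:
+         return float(out[0]["generated_text"].strip().split()[0])
+     except (ValueError, IndexError):
+         return None  # unparseable judgment; skipped when averaging
+ ```
+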
+ I qualitatively compared my primary model (Llama-3.2-1B) against **Qwen3-0.6B** (chosen as a smaller, efficient baseline) and **Qwen3-4B-Instruct-2507** (chosen as a larger, more capable baseline). In these comparisons, the Llama-3.2-1B-based Ordinance Assistant performed well, and the larger Qwen 4B model did not give noticeably better answers.
+
+ ## 5. Usage and Intended Uses
+
+ The intended use case for this model is to assist residents of Charlottesville, VA, in understanding local ordinances regarding zoning, parking, and noise complaints without needing a legal background. It is **not** a replacement for a lawyer but rather a tool for accessibility.
+
+ Below is an example of how the RAG pipeline class is constructed and used to generate responses with retrieval.
+
+ ```python
+ import torch
+ import numpy as np
+ import faiss
+ from tqdm import tqdm
+ from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+
+ class MyRAGPipeline:
+     '''
+     Wrapper class for the RAG pipeline.
+     '''
+     def __init__(self, model_name: str, embedding_model_name: str, vector_db_path: str,
+                  tokenizer_name=None, MAX_NEW_TOKENS=500, TEMPERATURE=0.9, DO_SAMPLE=True):
+         if tokenizer_name is None:
+             tokenizer_name = model_name  # default behavior: use the same tokenizer as the model
+
+         self.embedding_model_name = embedding_model_name
+         self.max_new_tokens = MAX_NEW_TOKENS
+         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+         self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
+         self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
+         self.tokenizer.padding_side = "left"
+
+         self.embedding_model = HuggingFaceEmbeddings(
+             model_name=self.embedding_model_name,
+             multi_process=True,
+             model_kwargs={"device": "cuda"},
+             encode_kwargs={"normalize_embeddings": True},  # `True` for cosine similarity
+         )
+
+         self.vector_db = FAISS.load_local(vector_db_path, self.embedding_model, allow_dangerous_deserialization=True)
+
+         # Move the FAISS index onto the GPU when one is available
+         if torch.cuda.is_available():
+             res = faiss.StandardGpuResources()
+             co = faiss.GpuClonerOptions()
+             co.useFloat16 = True
+             self.vector_db.index = faiss.index_cpu_to_gpu(res, 0, self.vector_db.index, co)
+
+         self.pipe = pipeline(
+             'text-generation',
+             model=self.model,
+             torch_dtype=torch.bfloat16,
+             device_map='auto',
+             tokenizer=self.tokenizer,
+             max_new_tokens=self.max_new_tokens,
+             temperature=TEMPERATURE,
+             do_sample=DO_SAMPLE,
+             pad_token_id=self.tokenizer.eos_token_id,
+             batch_size=8,
+         )
+
+     def retrieve(self, query, num_docs=5):
+         '''
+         Returns the k most similar documents to the query.
+         '''
+         retrieved_docs = self.vector_db.similarity_search(query, k=num_docs)
+         return retrieved_docs
+
+     def _format_prompt(self, query, retrieved_docs):
+         context = "\nExtracted documents:\n"
+         context += "".join([f"{doc.metadata['Section']} - {doc.metadata['Subtitle']}:::\n" + doc.page_content + "\n\n" for doc in retrieved_docs])
+
+         prompt = f'''
+         You are a helpful legal interpreter.
+         You are given the following context:
+         {context}\n\n
+         Using the information contained in the context,
+         give a comprehensive answer to the question.
+         Respond only to the question asked. Your response should be concise and relevant to the question.
+         Always provide the section number and title of the source document.
+         Now please answer the following question in plain English using less than {self.max_new_tokens} words.
+         Question: {query}
+         '''
+         return prompt
+
+     def _simple_format(self, query, retrieved_docs):
+         context = "\nExtracted documents:\n"
+         context += "".join([f"{doc.page_content}" + "\n\n" for doc in retrieved_docs])
+         prompt = f'''
+         You are given the following context:
+         {context}\n\n
+         Using the information contained in the context,
+         give a comprehensive answer to the question.
+         Respond only to the question asked. Your response should be concise and relevant to the question.
+         Now please answer the following question in plain English using less than {self.max_new_tokens} words.
+         Question: {query}
+         '''
+         return prompt
+
+     def easy_generate(self, query, num_docs=5):
+         retrieved_docs = self.retrieve(query, num_docs=num_docs)
+         prompt = self._format_prompt(query, retrieved_docs)
+         return self.pipe(prompt)[0]['generated_text']
+
+     def generate(self, query, retrieved_docs):
+         prompt = self._simple_format(query, retrieved_docs)
+         return self.pipe(prompt)[0]['generated_text']
+
+     def batch_generate(self, prompt_list, batch_size=8):
+         return self.pipe(prompt_list, return_full_text=False, batch_size=batch_size)
+
+     def batch_retrieve(self, queries, num_docs=5, batch_size=256):
+         """
+         Retrieves documents for many queries at once with a progress bar,
+         searching the FAISS index directly in chunks so progress can be
+         monitored without sacrificing speed.
+         """
+         all_retrieved_docs = []
+         docstore = self.vector_db.docstore
+         index_to_id = self.vector_db.index_to_docstore_id
+
+         for i in tqdm(range(0, len(queries), batch_size), desc="Batch Search"):
+             batch_queries = queries[i : i + batch_size]
+             query_vectors = self.embedding_model.embed_documents(batch_queries)
+             query_matrix = np.array(query_vectors, dtype=np.float32)
+             D, I = self.vector_db.index.search(query_matrix, num_docs)
+
+             for row_indices in I:
+                 docs_for_query = []
+                 for idx in row_indices:
+                     if idx == -1:
+                         continue
+                     _id = index_to_id[idx]
+                     doc = docstore.search(_id)
+                     docs_for_query.append(doc)
+                 all_retrieved_docs.append(docs_for_query)
+         return all_retrieved_docs
+
+
+ model_name = 'meta-llama/Llama-3.2-1B-Instruct'
+ embedding_name = 'Qwen/Qwen3-Embedding-0.6B'
+ vecdb_path = 'index/'
+
+ rag = MyRAGPipeline(model_name, embedding_name, vecdb_path)
+
+ prompt = "Can the mayor move outside of the city limits?"
+
+ print(rag.easy_generate(prompt))
+ ```
+
+ ## Prompt Format
+
+ The model relies on a strict prompt template to keep the output simplified but factually accurate; the retrieved RAG context is injected directly into the prompt (see `_format_prompt` above):
+
+ ```
+ You are a helpful legal interpreter.
+ You are given the following context:
+ {context}
+
+ Using the information contained in the context,
+ give a comprehensive answer to the question.
+ Respond only to the question asked. Your response should be concise and relevant to the question.
+ Always provide the section number and title of the source document.
+ Now please answer the following question in plain English using less than {max_new_tokens} words.
+ Question: {query}
+ ```
+
+ ## Expected Output Format
+
+ The model is expected to output a plain-English translation of the input text, simplifying sentence structure while retaining critical entities (dates, fines, locations).
+
+ ```
+ The Clerk of the Council is responsible for keeping the city's official seal.
+ They must stamp this seal on any papers or documents when the Council's laws
+ or decisions require it.
+ ```
+
+ ## Limitations
+
+ The primary limitation of this model is that while it reduces hallucinations, it does not eliminate them; users should verify important legal details with the official [Municode](https://library.municode.com/va/charlottesville/codes/code_of_ordinances) source. Additionally, the model is strictly limited to the Charlottesville context; applying it to Albemarle County or other jurisdictions will result in incorrect information. Finally, because the model was not fine-tuned, it may occasionally slip back into dense terminology if the retrieved ordinance is exceptionally complex.
app.py CHANGED
@@ -1,70 +1,146 @@
 
 
  import gradio as gr
- from huggingface_hub import InferenceClient
-
-
- def respond(
-     message,
-     history: list[dict[str, str]],
-     system_message,
-     max_tokens,
-     temperature,
-     top_p,
-     hf_token: gr.OAuthToken,
- ):
-     """
-     For more information on `huggingface_hub` Inference API support, please check the docs: https://huggingface.co/docs/huggingface_hub/v0.22.2/en/guides/inference
-     """
-     client = InferenceClient(token=hf_token.token, model="openai/gpt-oss-20b")
-
-     messages = [{"role": "system", "content": system_message}]
-
-     messages.extend(history)
-
-     messages.append({"role": "user", "content": message})
-
-     response = ""
-
-     for message in client.chat_completion(
-         messages,
-         max_tokens=max_tokens,
-         stream=True,
-         temperature=temperature,
-         top_p=top_p,
-     ):
-         choices = message.choices
-         token = ""
-         if len(choices) and choices[0].delta.content:
-             token = choices[0].delta.content
-
-         response += token
-         yield response
-
-
- """
- For information on how to customize the ChatInterface, peruse the gradio docs: https://www.gradio.app/docs/chatinterface
- """
- chatbot = gr.ChatInterface(
-     respond,
-     type="messages",
-     additional_inputs=[
-         gr.Textbox(value="You are a friendly Chatbot.", label="System message"),
-         gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
-         gr.Slider(minimum=0.1, maximum=4.0, value=0.7, step=0.1, label="Temperature"),
-         gr.Slider(
-             minimum=0.1,
-             maximum=1.0,
-             value=0.95,
-             step=0.05,
-             label="Top-p (nucleus sampling)",
-         ),
-     ],
- )
-
- with gr.Blocks() as demo:
-     with gr.Sidebar():
-         gr.LoginButton()
-     chatbot.render()
 
  if __name__ == "__main__":
-     demo.launch()
+ import os
+ import torch
  import gradio as gr
+ import faiss
+ import numpy as np
+ from tqdm import tqdm
+ from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ # Ensure an HF token is present for gated models (like Llama 3)
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+ class MyRAGPipeline:
+     '''
+     Wrapper class for the RAG pipeline.
+     '''
+     def __init__(self, model_name: str, embedding_model_name: str, vector_db_path: str, tokenizer_name=None, MAX_NEW_TOKENS=500, TEMPERATURE=0.7, DO_SAMPLE=True):
+         if tokenizer_name is None:
+             tokenizer_name = model_name
+
+         self.embedding_model_name = embedding_model_name
+         self.max_new_tokens = MAX_NEW_TOKENS
+
+         print(f"Loading Model: {model_name}...")
+         self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, token=HF_TOKEN)
+         self.model = AutoModelForCausalLM.from_pretrained(
+             model_name,
+             device_map="auto",
+             torch_dtype=torch.bfloat16,
+             token=HF_TOKEN
+         )
+         self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
+         self.tokenizer.padding_side = "left"
+
+         print("Loading Embeddings...")
+         self.embedding_model = HuggingFaceEmbeddings(
+             model_name=self.embedding_model_name,
+             multi_process=False,  # Set to False for stability in Spaces
+             model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
+             encode_kwargs={"normalize_embeddings": True},
+         )
+
+         print(f"Loading Vector DB from {vector_db_path}...")
+         # Check that the index exists to prevent a crash
+         if not os.path.exists(vector_db_path):
+             raise FileNotFoundError(f"Could not find vector DB at {vector_db_path}. Please upload your 'index' folder.")
+
+         self.vector_db = FAISS.load_local(vector_db_path, self.embedding_model, allow_dangerous_deserialization=True)
+
+         # FAISS GPU optimization (if available)
+         if torch.cuda.is_available():
+             try:
+                 res = faiss.StandardGpuResources()
+                 co = faiss.GpuClonerOptions()
+                 co.useFloat16 = True
+                 self.vector_db.index = faiss.index_cpu_to_gpu(res, 0, self.vector_db.index, co)
+             except Exception as e:
+                 print(f"Could not load FAISS to GPU, running on CPU: {e}")
+
+         # Initialize the generation pipeline
+         self.pipe = pipeline(
+             'text-generation',
+             model=self.model,
+             torch_dtype=torch.bfloat16,
+             device_map='auto',
+             tokenizer=self.tokenizer,
+             max_new_tokens=self.max_new_tokens,
+             temperature=TEMPERATURE,
+             do_sample=DO_SAMPLE,
+             pad_token_id=self.tokenizer.eos_token_id,
+             # return_full_text=False is critical for chatbots so the prompt is not echoed back
+             return_full_text=False
+         )
+
+     def retrieve(self, query, num_docs=3):
+         '''
+         Returns the k most similar documents to the query.
+         '''
+         retrieved_docs = self.vector_db.similarity_search(query, k=num_docs)
+         return retrieved_docs
+
+     def _format_prompt(self, query, retrieved_docs):
+         context = "\nExtracted documents:\n"
+         # Handle missing metadata keys gracefully
+         for doc in retrieved_docs:
+             section = doc.metadata.get('Section', 'N/A')
+             subtitle = doc.metadata.get('Subtitle', 'Context')
+             context += f"{section} - {subtitle}:::\n{doc.page_content}\n\n"
+
+         prompt = f'''
+         You are a helpful legal interpreter.
+         You are given the following context:
+         {context}\n\n
+         Using the information contained in the context,
+         give a comprehensive answer to the question.
+         Respond only to the question asked. Your response should be concise and relevant to the question.
+         Always provide the section number and title of the source document.
+
+         Question: {query}
+         '''
+         return prompt
+
+     def easy_generate(self, query, num_docs=3):
+         retrieved_docs = self.retrieve(query, num_docs=num_docs)
+         prompt = self._format_prompt(query, retrieved_docs)
+
+         # Because the pipeline uses return_full_text=False,
+         # this returns only the answer.
+         result = self.pipe(prompt)[0]['generated_text']
+         return result
+
+ # --- INITIALIZATION ---
+ # Using standard paths and models
+ MODEL_NAME = 'meta-llama/Llama-3.2-1B-Instruct'
+ EMBEDDING_NAME = 'Qwen/Qwen3-Embedding-0.6B'
+ VECDB_PATH = 'index/'  # Make sure you upload this folder to your Space!
+
+ # Initialize the RAG system globally so it doesn't reload on every message
+ try:
+     rag = MyRAGPipeline(MODEL_NAME, EMBEDDING_NAME, VECDB_PATH)
+ except Exception as e:
+     rag = None
+     print(f"Error initializing RAG: {e}")
+
+ # --- GRADIO INTERFACE ---
+ def chat_function(message, history):
+     if rag is None:
+         return "System Error: The RAG pipeline failed to initialize. Check the logs and ensure the 'index/' folder is uploaded."
+
+     try:
+         response = rag.easy_generate(message)
+         return response
+     except Exception as e:
+         return f"An error occurred: {str(e)}"
+
+ demo = gr.ChatInterface(
+     fn=chat_function,
+     type="messages",
+     title="Legal RAG Assistant",
+     description="Ask a question about the legal documents indexed in the database.",
+     examples=["Can the mayor move outside of the city limits?", "What are the zoning laws?"],
+ )
+
  if __name__ == "__main__":
+     demo.launch()