---
title: Cville Assistant
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - inference-api
license: apache-2.0
short_description: RAG Enabled ChatBot for Charlottesville Municipal Code
---

An example chatbot using Gradio, huggingface_hub, and the Hugging Face Inference API.

# Charlottesville Local Ordinance Assistant

## 1. Introduction

Local laws are often written in dense legal terminology that the average person struggles to interpret, turning simple questions about parking or zoning into a maze of irrelevant sections and complex jargon. While current Large Language Models (LLMs) like ChatGPT have seen municipal codes during training, they have seen codes from across the country, which leads to generalized answers that blend details from different jurisdictions and hallucinate non-existent regulations. To solve this, I developed the Charlottesville Local Ordinance Assistant, a system designed specifically to answer questions about the Charlottesville, VA municipal code in plain English. The project uses a Retrieval-Augmented Generation (RAG) pipeline to ensure legal accuracy by retrieving up-to-date ordinances, paired with a system prompt that translates the "legalese" into clear, accessible language without the need for computationally expensive fine-tuning. The results demonstrate that constraining the model to local data and using strong prompt engineering significantly reduces hallucinations compared to off-the-shelf generalist models.

## 2. Data

For the RAG pipeline, the knowledge base consists of the unedited Charlottesville Municipal Code text, scraped and pre-processed from Municode and split into chunks. The chunks were not rephrased, ensuring that the retrieval mechanism pulls the exact letter of the law. To evaluate the RAG pipeline, I used a set of questions and answers generated from the original sections of the municipal code to validate retrieval accuracy, checking whether the retrieved node matched the ground-truth node for each query (see the sketch below).
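As a rough illustration, the snippet below shows how that retrieval check could be computed. `rag` refers to the `MyRAGPipeline` class defined in the Usage section; the `question`/`section` field names in `eval_set` are hypothetical stand-ins for the actual evaluation file, not the real schema.

```python
def retrieval_hit_rate(rag, eval_set, k=5):
    """Fraction of queries whose ground-truth section appears in the top-k retrieved chunks."""
    hits = 0
    for item in eval_set:  # each item: {"question": ..., "section": ...} (hypothetical field names)
        docs = rag.retrieve(item["question"], num_docs=k)
        retrieved_sections = {doc.metadata["Section"] for doc in docs}
        if item["section"] in retrieved_sections:
            hits += 1
    return hits / len(eval_set)
```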

## 3. Methodology

For the RAG methodology, I implemented a dense retrieval system. I selected Qwen3-Embedding-0.6B as the embedding model due to the relatively small size of the RAG corpus (municipal code). This model allows for high-precision retrieval without the latency of larger embedding models. The retrieved context is passed to the Llama-3.2-1B generator to synthesize the final answer.
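The FAISS index is built offline and saved to disk before the app loads it. Below is a minimal sketch of that step, assuming the scraped ordinances are already available as LangChain `Document` objects with `Section` and `Subtitle` metadata (matching the fields used by the pipeline class later in this README); the sample chunk text and section number are illustrative, while the `index/` output path matches the path loaded in the Usage section.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

embeddings = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    encode_kwargs={"normalize_embeddings": True},  # normalized vectors for cosine similarity
)

# One Document per ordinance chunk scraped from Municode; text and metadata here are illustrative.
chunks = [
    Document(
        page_content="The clerk of the council shall be custodian of the official seal of the city...",
        metadata={"Section": "Sec. 2-100", "Subtitle": "Duties of the clerk"},
    ),
]

vector_db = FAISS.from_documents(chunks, embeddings)
vector_db.save_local("index/")  # same path that MyRAGPipeline loads below
```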

## 4. Evaluation

### Benchmark Results

To rigorously evaluate the legal reasoning and retrieval capabilities of the model, I used three established benchmarks: LegalBench-RAG, RAGBench, and RAGTruth. I chose these because they target the main weaknesses of LLMs on legal tasks: the ability to reason over specific documents and the tendency to hallucinate.

LegalBench-RAG, RAGBench, and my custom test split were all evaluated with meta-llama/Llama-3.1-8B as the judge across eight metrics (a sketch of the judging setup follows the list):

- **Context Relevance:** Measures the proportion of retrieved information that is actually pertinent to the user's query.
- **Context Recall:** Assesses whether the retrieved context contains all the ground-truth information required to answer.
- **Chunk Relevance:** Evaluates the precision of individual retrieved document segments relative to the input query.
- **Faithfulness:** Checks whether the generated answer is factually derived solely from the retrieved context (hallucination detection).
- **Answer Relevance:** Determines how well the generated response directly addresses the user's original prompt.
- **Answer Correctness:** Scores the accuracy of the generated answer against a known gold-standard reference.
- **Answer Completeness:** Checks whether the response addresses all parts of the query without omitting key details.
- **Safety:** Measures the model's ability to refuse to generate harmful, toxic, or inappropriate content.
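As a rough illustration of the judging setup, the sketch below scores a single metric (faithfulness) with an LLM judge. The prompt wording and the 0/1 scoring and parsing are simplified assumptions rather than the exact evaluation harness.

```python
from transformers import pipeline

# Hedged sketch: LLM-as-judge scoring for one metric (faithfulness).
# The real harness used meta-llama/Llama-3.1-8B as the judge across all eight metrics.
judge = pipeline("text-generation", model="meta-llama/Llama-3.1-8B", max_new_tokens=5)

def judge_faithfulness(context: str, answer: str) -> int:
    prompt = (
        "You are grading a RAG system for hallucinations.\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Reply with 1 if every claim in the answer is supported by the context, "
        "otherwise reply with 0.\nGrade: "
    )
    reply = judge(prompt, return_full_text=False)[0]["generated_text"]
    return 1 if "1" in reply else 0
```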

Finally, because RAGTruth retrievals are frozen, only faithfulness was evaluated on that benchmark. Benchmark results are shown below.

| Metric | LegalBench-RAG | RAGBench | RAGTruth-QA | Custom Test Split |
| --- | --- | --- | --- | --- |
| Context Relevance | 87.33 | 22.17 | - | - |
| Context Recall | 47.63 | 20.56 | - | 78.87 |
| Chunk Relevance | 85.76 | 23.88 | - | - |
| Faithfulness | 60.72 | 69.68 | 74.11 | 72.64 |
| Answer Relevance | 76.71 | 65.82 | - | 61.87 |
| Answer Correctness | 41.20 | 10.33 | - | 38.17 |
| Answer Completeness | 75.59 | 66.49 | - | 76.46 |
| Safety | 97.12 | 97.49 | - | 98.70 |

I qualitatively compared my primary model (Llama-3.2-1B) against Qwen3-0.6B (chosen as a smaller, efficient baseline) and Qwen3-4B-Instruct-2507 (chosen as a larger, more capable baseline). In these comparisons, the Llama-3.2-1B-based Ordinance Assistant performed well, and the larger Qwen 4B model did not produce noticeably better answers.

## 5. Usage and Intended Uses

The intended use case for this model is to assist residents of Charlottesville, VA, in understanding local ordinances regarding zoning, parking, and noise complaints without needing a legal background. It is not a replacement for a lawyer but rather a tool for accessibility.

Below is an example of how the RAG pipeline class is constructed and used to generate responses with retrieval.

```python
import numpy as np
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import faiss as fai
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


class MyRAGPipeline:
    '''
    Wrapper class for the RAG pipeline.
    '''
    def __init__(self, model_name: str, embedding_model_name: str, vector_db_path: str,
                 tokenizer_name=None, MAX_NEW_TOKENS=500, TEMPERATURE=0.9, DO_SAMPLE=True):
        if tokenizer_name is None:
            tokenizer_name = model_name  # default behavior is to use the same tokenizer as the model

        self.embedding_model_name = embedding_model_name
        self.max_new_tokens = MAX_NEW_TOKENS
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", dtype=torch.bfloat16)
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
        self.tokenizer.padding_side = "left"

        self.embedding_model = HuggingFaceEmbeddings(
            model_name=self.embedding_model_name,
            multi_process=True,
            model_kwargs={"device": "cuda"},
            encode_kwargs={"normalize_embeddings": True},  # set `True` for cosine similarity
        )

        self.vector_db = FAISS.load_local(vector_db_path, self.embedding_model,
                                          allow_dangerous_deserialization=True)

        # Move the FAISS index to the GPU when one is available.
        if torch.cuda.is_available():
            res = fai.StandardGpuResources()
            co = fai.GpuClonerOptions()
            co.useFloat16 = True
            self.vector_db.index = fai.index_cpu_to_gpu(res, 0, self.vector_db.index, co)

        self.pipe = pipeline(
            'text-generation',
            model=self.model,
            dtype=torch.bfloat16,
            device_map='auto',
            tokenizer=self.tokenizer,
            max_new_tokens=self.max_new_tokens,
            temperature=TEMPERATURE,
            do_sample=DO_SAMPLE,
            pad_token_id=self.tokenizer.eos_token_id,
            batch_size=8,
        )

    def retrieve(self, query, num_docs=5):
        '''
        Returns the k most similar documents to the query.
        '''
        retrieved_docs = self.vector_db.similarity_search(query, k=num_docs)
        return retrieved_docs

    def _format_prompt(self, query, retrieved_docs):
        context = "\nExtracted documents:\n"
        context += "".join([f"{doc.metadata['Section']} - {doc.metadata['Subtitle']}:::\n" + doc.page_content + "\n\n" for doc in retrieved_docs])

        prompt = f'''
        You are a helpful legal interpreter.
        You are given the following context:
        {context}\n\n
        Using the information contained in the context,
        give a comprehensive answer to the question.
        Respond only to the question asked. Your response should be concise and relevant to the question.
        Always provide the section number and title of the source document.
        Now please answer the following question in plain English using less than {self.max_new_tokens} words.
        Question: {query}
        '''
        return prompt

    def _simple_format(self, query, retrieved_docs):
        context = "\nExtracted documents:\n"
        context += "".join([f"{doc.page_content}" + "\n\n" for doc in retrieved_docs])
        prompt = f'''
        You are given the following context:
        {context}\n\n
        Using the information contained in the context,
        give a comprehensive answer to the question.
        Respond only to the question asked. Your response should be concise and relevant to the question.
        Now please answer the following question in plain English using less than {self.max_new_tokens} words.
        Question: {query}
        '''
        return prompt

    def easy_generate(self, query, num_docs=5):
        retrieved_docs = self.retrieve(query, num_docs=num_docs)
        prompt = self._format_prompt(query, retrieved_docs)
        return self.pipe(prompt)[0]['generated_text']

    def generate(self, query, retrieved_docs):
        prompt = self._simple_format(query, retrieved_docs)
        return self.pipe(prompt)[0]['generated_text']

    def batch_generate(self, prompt_list, batch_size=8):
        return self.pipe(prompt_list, return_full_text=False, batch_size=batch_size)

    def batch_retrieve(self, queries, num_docs=5, batch_size=256):
        """
        Retrieves documents using GPU acceleration with a progress bar.
        Processes queries in chunks to allow monitoring without sacrificing speed.
        """
        all_retrieved_docs = []
        docstore = self.vector_db.docstore
        index_to_id = self.vector_db.index_to_docstore_id

        for i in tqdm(range(0, len(queries), batch_size), desc="Batch Search"):
            batch_queries = queries[i : i + batch_size]
            query_vectors = self.embedding_model.embed_documents(batch_queries)
            query_matrix = np.array(query_vectors, dtype=np.float32)
            D, I = self.vector_db.index.search(query_matrix, num_docs)

            for row_indices in I:
                docs_for_query = []
                for idx in row_indices:
                    if idx == -1:
                        continue
                    _id = index_to_id[idx]
                    doc = docstore.search(_id)
                    docs_for_query.append(doc)
                all_retrieved_docs.append(docs_for_query)
        return all_retrieved_docs


model_name = 'meta-llama/Llama-3.2-1B-Instruct'
embedding_name = 'Qwen/Qwen3-Embedding-0.6B'
vecdb_path = 'index/'

rag = MyRAGPipeline(model_name, embedding_name, vecdb_path)

prompt = "Can the mayor move outside of the city limits?"

print(rag.easy_generate(prompt))
```

### Prompt Format

The model relies on a strict system prompt to ensure the output is simplified but factually accurate. The prompt injects the retrieved RAG context directly into the system message.
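For reference, the template below illustrates the structure that `_format_prompt` assembles before generation with the default settings; the bracketed placeholders stand in for the retrieved chunks and the user's question.

```
You are a helpful legal interpreter.
You are given the following context:

Extracted documents:
[Section] - [Subtitle]:::
[ordinance chunk text]

Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked. Your response should be concise and relevant to the question.
Always provide the section number and title of the source document.
Now please answer the following question in plain English using less than 500 words.
Question: [user question]
```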


### Expected Output Format

The model is expected to output a plain-English translation of the retrieved ordinance text, simplifying sentence structure while retaining critical entities (dates, fines, locations). For example:

```
The Clerk of the Council is responsible for keeping the city's official seal.
They must stamp this seal on any papers or documents when the Council's laws
or decisions require it.
```

## Limitations

The primary limitation of this model is that while it reduces hallucinations, it does not eliminate them; users should verify important legal details with the official Municode source. Additionally, the model is strictly limited to the Charlottesville context; applying it to Albemarle County or other jurisdictions will result in incorrect information. Finally, because the model was not fine-tuned, it may occasionally slip back into dense terminology if the retrieved ordinance is exceptionally complex.