Spaces:
Sleeping
title: Cville Assistant
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
- inference-api
license: apache-2.0
short_description: RAG Enabled ChatBot for Charlottesville Municipal Code
An example chatbot using Gradio, huggingface_hub, and the Hugging Face Inference API.
Charlottesville Local Ordinance Assistant
1. Introduction
Local laws are often written in dense legal terminology that the average person struggles to interpret, turning simple questions about parking or zoning into a maze of irrelevant sections and complex jargon. While current Large Language Models (LLMs) like ChatGPT have seen municipal codes, they are trained on codes from across the country, leading to generalized answers that may blend details from different jurisdictions and hallucinate non-existent regulations. To solve this, I developed the Charlottesville Local Ordinance Assistant, a system designed specifically to answer questions about Charlottesville, VA municipal code in plain English. This project utilizes a Retrieval-Augmented Generation (RAG) pipeline to ensure legal accuracy by retrieving up-to-date ordinances, coupled with a specific system prompt designed to translate that "legalese" into clear, accessible language without the need for computationally expensive fine-tuning. The results demonstrate that constraining the model to local data and utilizing strong prompt engineering significantly reduces hallucinations compared to off-the-shelf generalist models.
2. Data
For the RAG pipeline, the knowledge base consists of the unedited Charlottesville Municipal Code text, scraped and pre-processed from Municode. These chunks were not rephrased, ensuring that the retrieval mechanism pulls the exact letter of the law. To evaluate the RAG pipeline, I utilized a set of questions and answers generated from the original sections of the municipal code to validate retrieval accuracy (checking if the retrieved node matched the ground truth node for a given query).
3. Methodology
For the RAG methodology, I implemented a dense retrieval system. I selected Qwen3-Embedding-0.6B as the embedding model due to the relatively small size of the RAG corpus (municipal code). This model allows for high-precision retrieval without the latency of larger embedding models. The retrieved context is passed to the Llama-3.2-1B generator to synthesize the final answer.
4. Evaluation
Benchmark Results
To strictly evaluate the legal reasoning and retrieval capabilities of the model, I utilized three established benchmarks: LegalBench-RAG, RAGBench, and RAGTruth. I chose these because they specifically target the weaknesses of legal LLMs: the ability to reason over specific documents and the frequency of hallucinations.
The LegalBench-RAG, RAGBench, and my custom test split were all evaluated using meta-llama/Llama-3.1-8B as judge for eight different metrics:
- Context Relevance: Measures the proportion of retrieved information that is actually pertinent to the user's query.
- Context Recall: Assesses if the retrieved context contains all the necessary ground-truth information required to answer.
- Chunk Relevance: Evaluates the precision of individual retrieved document segments relative to the input query.
- Faithfulness: Checks if the generated answer is factually derived solely from the retrieved context (hallucination detection).
- Answer Relevance: Determines how well the generated response directly addresses the user's original prompt.
- Answer Correctness: Scores the accuracy of the generated answer against a known gold-standard reference.
- Answer Completeness: Checks if the response addresses all parts of the query without omitting key details.
- Safety: Measures the model's ability to refuse generating harmful, toxic, or inappropriate content.
Finally, because RAGTruth retrievals are frozen, only the faithfulness was evaluated. Benchmark results are shown below.
| Metric | LegalBenchRag | RAGBench | RAGTruth-QA | Custom Test Split |
|---|---|---|---|---|
| Context Relevance | 87.33 | 22.17 | - | |
| Context Recall | 47.63 | 20.56 | - | 78.87 |
| Chunk Relevance | 85.76 | 23.88 | - | |
| Faithfulness | 60.72 | 69.68 | 74.11 | 72.64 |
| Answer Relevance | 76.71 | 65.82 | - | 61.87 |
| Answer Correctness | 41.20 | 10.33 | - | 38.17 |
| Answer Completeness | 75.59 | 66.49 | - | 76.46 |
| Safety | 97.12 | 97.49 | - | 98.70 |
I qualitatively compared my primary model (Llama-3.2-1B) against Qwen3-0.6B (chosen as a smaller, efficient baseline) and Qwen3-4B-Instruct-2507 (chosen as a larger, more capable baseline). Relative to these comparisons, the Llama-3.2-1B based Ordinance Assistant showed good performance whereas the larger Qwen 4B model didn't seem to give much better answers
5. Usage and Intended Uses
The intended use case for this model is to assist residents of Charlottesville, VA, in understanding local ordinances regarding zoning, parking, and noise complaints without needing a legal background. It is not a replacement for a lawyer but rather a tool for accessibility.
Below is an example of how the RAG pipeline class is constructed and used to generate responses with retrieval.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModel
import faiss as fai
from langchain_community.vectorstores import FAISS
import os
import numpy as np
import pandas as pd
import random
class MyRAGPipeline:
'''
Wrapper class for RAG pipeline.
'''
def __init__(self, model_name: str, embedding_model_name: str, vector_db_path: str, tokenizer_name = None, MAX_NEW_TOKENS = 500, TEMPERATURE = 0.9, DO_SAMPLE = True):
if tokenizer_name is None:
tokenizer_name = model_name # default behavior is use the same tokenizer as the model
self.embedding_model_name = embedding_model_name
self.max_new_tokens = MAX_NEW_TOKENS
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map = "auto", dtype = torch.bfloat16)
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
self.tokenizer.padding_side = "left"
self.embedding_model = HuggingFaceEmbeddings(
model_name=self.embedding_model_name,
multi_process=True,
model_kwargs={"device": "cuda"},
encode_kwargs={"normalize_embeddings": True}, # Set `True` for cosine similarity
)
self.vector_db = FAISS.load_local(vector_db_path, self.embedding_model,allow_dangerous_deserialization=True)
if torch.cuda.is_available():
res = fai.StandardGpuResources()
co = fai.GpuClonerOptions()
co.useFloat16 = True
self.vector_db.index = fai.index_cpu_to_gpu(res, 0, self.vector_db.index,co)
self.pipe = pipeline(
'text-generation',
model=self.model,
dtype = torch.bfloat16,
device_map = 'auto',
tokenizer = self.tokenizer,
max_new_tokens = self.max_new_tokens,
temperature = TEMPERATURE,
do_sample = DO_SAMPLE,
pad_token_id=self.tokenizer.eos_token_id,
batch_size = 8
)
def retrieve(self, query, num_docs=5):
'''
Returns the k most similar documents to the query
'''
retrieved_docs = self.vector_db.similarity_search(query, k=num_docs)
return retrieved_docs
def _format_prompt(self, query, retrieved_docs):
context = "\nExtracted documents:\n"
context += "".join([f"{doc.metadata['Section']} - {doc.metadata['Subtitle']}:::\n" + doc.page_content + "\n\n" for doc in retrieved_docs])
prompt = f'''
You are a helpful legal interpreter.
You are given the following context:
{context}\n\n
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked. Your response should be concise and relevant to the question.
Always provide the section number and title of the source document.
Now please answer the follwing question in plain English using less than {self.max_new_tokens} words.
Question: {query}"
'''
return prompt
def _simple_format(self, query, retrieved_docs):
context = "\nExtracted documents:\n"
context += "".join([f"{doc.page_content}" + "\n\n" for doc in retrieved_docs])
prompt = f'''
You are given the following context:
{context}\n\n
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked. Your response should be concise and relevant to the question.
Now please answer the follwing question in plain English using less than {self.max_new_tokens} words.
Question: {query}"
'''
return prompt
def easy_generate(self, query, num_docs = 5):
retrieved_docs = self.retrieve(query, num_docs=num_docs)
prompt = self._format_prompt(query, retrieved_docs)
return self.pipe(prompt)[0]['generated_text']
def generate(self, query, retrieved_docs):
prompt = self._simple_format(query, retrieved_docs)
return self.pipe(prompt)[0]['generated_text']
def batch_generate(self, prompt_list, batch_size = 8):
return self.pipe(prompt_list, return_full_text=False, batch_size = batch_size)
def batch_retrieve(self, queries, num_docs=5, batch_size=256):
"""
Retrieves documents using GPU acceleration with a progress bar.
Processes queries in chunks to allow monitoring without sacrificing speed.
"""
all_retrieved_docs = []
docstore = self.vector_db.docstore
index_to_id = self.vector_db.index_to_docstore_id
for i in tqdm(range(0, len(queries), batch_size), desc="Batch Search"):
batch_queries = queries[i : i + batch_size]
query_vectors = self.embedding_model.embed_documents(batch_queries)
query_matrix = np.array(query_vectors, dtype=np.float32)
D, I = self.vector_db.index.search(query_matrix, num_docs)
for row_indices in I:
docs_for_query = []
for idx in row_indices:
if idx == -1: continue
_id = index_to_id[idx]
doc = docstore.search(_id)
docs_for_query.append(doc)
all_retrieved_docs.append(docs_for_query)
return all_retrieved_docs
model_name = 'meta-llama/Llama-3.2-1B-Instruct'
embedding_name = 'Qwen/Qwen3-Embedding-0.6B'
vecdb_path = 'index/'
rag = MyRAGPipeline(model_name, embedding_name, vecdb_path)
prompt = "Can the mayor move outside of the city limits?"
print(rag.easy_generate(prompt))
Prompt Format
The model relies on a strict system prompt to ensure the output is simplified but factually accurate. The prompt injects the retrieved RAG context directly into the system message.
Expected Output Format
The model is expected to output a plain-English translation of the input text, simplifying sentence structure while retaining critical entities (dates, fines, locations).
The Clerk of the Council is responsible for keeping the city's official seal.
They must stamp this seal on any papers or documents when the Council's laws
or decisions require it.
Limitations
The primary limitation of this model is that while it reduces hallucinations, it does not eliminate them; users should verify important legal details with the official Municode source. Additionally, the model is strictly limited to the Charlottesville context; applying it to Albemarle County or other jurisdictions will result in incorrect information. Finally, because the model was not fine-tuned, it may occasionally slip back into dense terminology if the retrieved ordinance is exceptionally complex.