---
title: FinLLM RAG
emoji:
colorFrom: purple
colorTo: gray
sdk: streamlit
sdk_version: 1.40.1
app_file: app.py
pinned: false
---
# 💸 Finance Assistant
This project is a multi-functional financial assistant built with Streamlit. It leverages large language models and retrieval-augmented generation (RAG) to provide a suite of tools for financial analysis, compliance, and data retrieval.
## Features
The application is divided into several key functionalities:
* **Circular Compliance Assistant**: Checks user-provided scenarios for compliance with the RBI Master Circular on Management of Advances. It uses a FAISS vector database to retrieve relevant sections of the circular and a language model to generate a detailed compliance report.
* **Industry Classification Assistant**: Suggests appropriate industry classification codes based on user-provided keywords. This feature also utilizes a RAG pipeline to search through an industry classification master document.
* **Calculation Methodology**: Provides interactive calculators for key financial metrics:
    * **Maximum Permissible Bank Finance (MPBF)**
    * **Drawing Power (DP)**
* **Financial Data Assistant**: Answers questions about historical (1980-2015) state-wise financial data for India. It can retrieve specific metrics for a given state and year.
* **Model 1 Chat**: A general-purpose chat interface powered by the `gemma2-9b-it` model via the Groq API.
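This README does not state which formulas the two calculators under Calculation Methodology implement. As a rough sketch, assuming the Tandon Committee's first and second methods of lending for MPBF and a common margin-based convention for drawing power, the arithmetic looks like this. The function names and the default 25% margins are illustrative assumptions and are not taken from `app.py`; real margins come from the bank's sanction terms.

```python
def mpbf_method_1(current_assets, other_current_liabilities):
    """Tandon Method I: the bank finances 75% of the working capital gap.

    other_current_liabilities excludes bank borrowings.
    """
    working_capital_gap = current_assets - other_current_liabilities
    return 0.75 * working_capital_gap


def mpbf_method_2(current_assets, other_current_liabilities):
    """Tandon Method II: the borrower funds 25% of total current assets."""
    return 0.75 * current_assets - other_current_liabilities


def drawing_power(stock, debtors, creditors, stock_margin=0.25, debtor_margin=0.25):
    """Assumed convention: margin applied to paid-for stock and to debtors."""
    # Stock financed by trade creditors is not counted; floor at zero.
    paid_stock = max(stock - creditors, 0.0)
    return paid_stock * (1 - stock_margin) + debtors * (1 - debtor_margin)
```

For current assets of 1,000 and other current liabilities of 200, Method I gives 600 and Method II gives 550, so Method II is the stricter limit.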
## How It Works
The core of the "Circular Compliance" and "Industry Classification" assistants is a Retrieval-Augmented Generation (RAG) pipeline.
1. **Indexing**: Source documents (`Master Circular.pdf`, `Industry Classification Master.pdf`) are chunked, and the text chunks are converted into vector embeddings using a `SentenceTransformer` model. These embeddings are stored in a FAISS index for efficient similarity search.
2. **Retrieval**: When a user enters a query, the query is embedded, and the FAISS index is searched to find the most relevant document chunks.
3. **Generation**: The retrieved chunks are passed as context, along with the user's query, to a large language model (`gemma2-9b-it`), which generates a response grounded in that context.
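The retrieval and generation steps above can be sketched as follows. This is an illustration under assumptions, not the code in `app.py`: `retrieve` and `build_prompt` are hypothetical helper names, `model` is assumed to be a loaded `SentenceTransformer`, `index` a FAISS index built over the same `chunks` list, and the prompt wording is made up.

```python
def retrieve(query, model, index, chunks, k=3):
    """Return up to k chunks whose embeddings are closest to the query."""
    # SentenceTransformer.encode takes a list of strings and returns a 2-D array.
    query_vec = model.encode([query], convert_to_numpy=True)
    # FAISS search returns (distances, ids), one row per query vector.
    _distances, ids = index.search(query_vec, k)
    # FAISS pads the result with -1 when the index holds fewer than k vectors.
    return [chunks[i] for i in ids[0] if i != -1]


def build_prompt(query, context_chunks):
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

The resulting string would then be sent as the user message in a Groq chat completion request for `gemma2-9b-it`.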
The "Financial Data Assistant" works by directly parsing the user's query for state, year, and metric information and looking up the corresponding data from a pre-loaded data file.
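A minimal sketch of that parse-then-look-up flow is shown below. Everything named here is an assumption for illustration: the `STATES` and `METRICS` lists are small made-up subsets, `parse_query` and `lookup` are hypothetical helpers, and the data is modeled as a plain dictionary keyed by `(state, year)` rather than whatever structure the pre-loaded data file actually uses.

```python
import re

# Illustrative subsets only; the real values would come from the data file.
STATES = ["Maharashtra", "Gujarat", "Tamil Nadu"]
METRICS = ["fiscal deficit", "revenue receipts"]

# Matches four-digit years from 1980 through 2015, the range this README cites.
YEAR_PATTERN = re.compile(r"\b(19[89]\d|20(?:0\d|1[0-5]))\b")


def parse_query(query):
    """Extract state, year, and metric from a free-text question."""
    text = query.lower()
    year_match = YEAR_PATTERN.search(text)
    return {
        "state": next((s for s in STATES if s.lower() in text), None),
        "year": int(year_match.group(1)) if year_match else None,
        "metric": next((m for m in METRICS if m in text), None),
    }


def lookup(records, parsed):
    """records maps (state, year) to a dict of metric name -> value."""
    if None in parsed.values():
        return None  # the query did not name all three fields
    row = records.get((parsed["state"], parsed["year"]), {})
    return row.get(parsed["metric"])
```

Simple substring matching like this is brittle (abbreviations and misspellings will not match), which is one reason a production parser might normalize names first.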
## Setup and Installation
1. **Clone the Repository**:
```bash
git clone <your-repository-url>
cd <your-repository-directory>
```
2. **Install Dependencies**:
Install the necessary Python libraries using the `requirements.txt` file.
```bash
pip install -r requirements.txt
```
3. **Set Up Assets**:
The application requires pre-built FAISS indexes and data files.
    * Create a folder named `assets` in the root directory.
    * Generate and place the following files into the `assets` folder (You will need a separate script to process the source PDFs and JSON to create these files):
        * `industry_index.faiss`
        * `industry_chunks.pkl`
        * `circular_index.faiss`
        * `circular_chunks.pkl`
        * `financial_index.faiss`
        * `financial_statements.pkl`
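The separate script is not included in this README. One possible shape for it, for the `industry` and `circular` assets, is sketched below. The assumptions are significant: the text is already extracted from the PDF, chunks are word windows of 500 with an overlap of 50, the embedding model is `all-MiniLM-L6-v2`, and each `.pkl` file holds a plain list of chunk strings. Check what `app.py` expects before relying on any of these; `chunk_text` and `build_assets` are hypothetical names.

```python
import pickle
from pathlib import Path


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping windows of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def build_assets(text, prefix, assets_dir="assets", model_name="all-MiniLM-L6-v2"):
    """Embed the chunks and write <prefix>_index.faiss and <prefix>_chunks.pkl."""
    # Imported here so chunk_text stays usable without the heavy dependencies.
    import faiss
    from sentence_transformers import SentenceTransformer

    chunks = chunk_text(text)
    if not chunks:
        raise ValueError("no text to index")

    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")

    index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search
    index.add(embeddings)

    out_dir = Path(assets_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    faiss.write_index(index, str(out_dir / f"{prefix}_index.faiss"))
    with open(out_dir / f"{prefix}_chunks.pkl", "wb") as f:
        pickle.dump(chunks, f)
    return len(chunks)
```

Under these assumptions, calling `build_assets(circular_text, "circular")` would write `assets/circular_index.faiss` and `assets/circular_chunks.pkl`, where `circular_text` is the text you extracted from `Master Circular.pdf`.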
4. **API Key**:
Insert your Groq API key directly into the `app.py` file at the following line:
```python
GROQ_API_KEY = "your-groq-api-key-here"
```
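If you would rather not keep the key in a source file, one alternative is to read it from an environment variable. This is a sketch of that option, not what `app.py` currently does; the helper name `load_groq_api_key` is hypothetical.

```python
import os


def load_groq_api_key(env=None):
    """Read the Groq API key from the GROQ_API_KEY environment variable."""
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set the GROQ_API_KEY environment variable first.")
    return key
```

With this helper, the line in `app.py` would become `GROQ_API_KEY = load_groq_api_key()`, and the key would be exported in the shell before running the app.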
## Usage
1. **Run the Streamlit App**:
Execute the following command in your terminal:
```bash
streamlit run app.py
```
2. **Interact with the Application**:
* Open the URL provided by Streamlit (usually `http://localhost:8501`) in your web browser.
* Use the radio buttons at the top of the page to navigate between the different functionalities: "Calculation Methodology", "Circular Compliance", "Industry Classification", "Model 1", and "Model 2".
* Follow the on-screen instructions for each tool.
## Dependencies
This project relies on the following major libraries:
* `streamlit`: For creating the web application interface.
* `groq`: The client for accessing the Groq API.
* `sentence-transformers`: For generating text embeddings.
* `faiss-cpu`: For efficient similarity search in the vector database.
* `pandas`: For data manipulation, particularly in the financial data assistant.
* `numpy`: For numerical operations.
* `torch` & `transformers`: Core dependencies for the sentence transformer models.