## VisaBuddy – Canada Immigration Assistant
### A retrieval-augmented, QLoRA-fine-tuned LLM assistant for Canadian immigration guidance
---
### Badges (text)
- **License**: MIT
- **Model (Hugging Face)**: `<your-hf-username>/<your-model-repo>`
- **Tech**: `LLM • QLoRA • RAG • FastAPI • React • TypeScript • TailwindCSS • ChromaDB`
---
## Overview
VisaBuddy – Canada Immigration Assistant is an end-to-end system that combines a custom fine-tuned LLM with a modern web UI to help users explore Canadian immigration programs in a conversational way.
The assistant:
- retrieves information from cleaned IRCC (Immigration, Refugees and Citizenship Canada) documents,
- runs a **RAG (Retrieval-Augmented Generation)** pipeline over a **ChromaDB** vector store,
- and answers questions through a **FastAPI** backend and **React** chat frontend.
> **Important:** VisaBuddy is **not** an official source of immigration advice and is **not** a substitute for a licensed immigration consultant or lawyer. Always verify final decisions and requirements directly with official IRCC resources.
---
## Architecture
High-level flow from user to model:
```text
+-----------+      +--------------+      +-----------+      +--------------+      +---------------+
|  Browser  | ---> |   Frontend   | ---> |    API    | ---> |   Model +    | ---> |   Vector DB   |
| (User UI) |      |  (React/TS)  |      | (FastAPI) |      |   RAG Core   |      |  (ChromaDB)   |
+-----------+      +--------------+      +-----------+      +--------------+      +---------------+
      ^                                                            |
      +----------------- streamed / batched responses -------------+
```
Expanded data path:
```text
User Question
      |
      v
React Chat UI
      |  (POST /v1/chat)
      v
FastAPI Backend (Colab, behind Cloudflare Tunnel)
      |
      +--> RAG Retriever
      |      |- Chunked IRCC docs
      |      |- Embeddings (Sentence Transformers)
      |      '- ChromaDB vector search
      |
      +--> Prompt Builder (question + retrieved context)
      |
      +--> QLoRA-fine-tuned LLM (4-bit quantized)
      |
      v
Formatted Answer -> sent back to UI
```
---
## Features
### AI / Modeling
- **Custom LLM fine-tuned with QLoRA**
  - Built on a modern base model (e.g. LLaMA 3 / Mistral / Qwen; configurable in code).
  - 4-bit quantization for efficient inference.
  - Fine-tuned on an **instruction JSONL dataset** derived from IRCC content.
- **RAG (Retrieval-Augmented Generation)**
  - Cleans IRCC HTML + PDF documents.
  - Chunks long texts into overlapping segments.
  - Uses a **Sentence Transformers** embedding model.
  - Stores embeddings and metadata in **ChromaDB**.
  - Retrieves top-k relevant chunks per query and injects them into prompts.
- **Evaluation**
- Training **loss curve** and **perplexity**.
- Retrieval metrics: **precision@k**, **recall@k**, **MRR**.
- Answerβquality metrics: **factuality**, **hallucination rate**, **groundedness**, **context utilization**.
- Notebook dashboards for quick visual inspection.
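The retrieval metrics can be sketched in a few lines (a hypothetical helper, not the repo's actual evaluation code; `retrieved` is a ranked list of chunk IDs, `relevant` the gold set):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk IDs found in the top-k results."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

# Example: one query where the 2nd result is the first relevant hit.
retrieved = ["c3", "c1", "c7"]
relevant = {"c1", "c9"}
print(round(precision_at_k(retrieved, relevant, 3), 3))  # 1 of 3 hits -> 0.333
print(recall_at_k(retrieved, relevant, 3))               # 1 of 2 relevant -> 0.5
print(mrr([(retrieved, relevant)]))                      # first hit at rank 2 -> 0.5
```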
### Backend (FastAPI, Colab)
- **FastAPI server** running in **Google Colab**.
- Loads a **model ZIP** from **Google Drive** (or other storage) and extracts it.
- Uses **PEFT** LoRA adapters on top of the base model.
- Exposes an inference endpoint (e.g. `POST /v1/chat`) for the frontend.
- Tunnelled to the public internet via **Cloudflare Tunnel**.
### Frontend (React + TypeScript)
- Modern **chat interface** tailored to visa/immigration workflows.
- Built with **React**, **TypeScript**, and **TailwindCSS**.
- Configurable `VITE_API_URL` pointing to the backend tunnel URL.
- Multiple conversation threads, persistent history and smooth animations.
---
## Tech Stack
### Frontend
- **React**, **TypeScript**, **Vite**
- **TailwindCSS** for styling
- Optional animation and icon libraries
### Backend
- **Python 3.x**
- **FastAPI** (or compatible ASGI framework)
- **Uvicorn** for serving
- Runs inside **Google Colab**
- **Cloudflare Tunnel** for secure public access
### AI / ML
- **Transformers** + **PEFT**
- **QLoRA** (LoRA on quantized weights, 4-bit)
- **Sentence Transformers** for embeddings
- **ChromaDB** as vector database
### Infrastructure / Storage
- **Google Drive** for model ZIP storage
- **Google Colab** GPU runtime
- **Cloudflare Tunnel** for public HTTPS endpoint
---
## How the Model Works
### Data & Preprocessing
- Public IRCC documents (HTML and PDF) are:
- crawled or downloaded,
- parsed into raw text,
- cleaned (removing navigation, styling, boilerplate),
- normalized (whitespace, encoding artifacts).
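A minimal sketch of the normalization pass (hypothetical helper; the repo's actual cleaning code may differ):

```python
import re
import unicodedata

def clean_text(raw):
    """Normalize encoding artifacts and whitespace in extracted IRCC text."""
    text = unicodedata.normalize("NFKC", raw)   # fold compatibility characters
    text = text.replace("\u00a0", " ")          # non-breaking spaces -> spaces
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse runs of blank lines
    return text.strip()

print(clean_text("Express\u00a0Entry\n\n\n\napplications"))
```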
### Chunking
- Long documents are split into **overlapping chunks**:
  - Sentence-aware segmentation.
  - A target chunk size (e.g. 800–1,000 tokens or words).
  - Overlap between chunks (e.g. 150–200 tokens/words) to preserve context across boundaries.
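The chunking scheme above can be sketched as follows (a word-based approximation; the real pipeline may count tokens and split sentence-aware):

```python
def chunk_text(text, chunk_size=800, overlap=150):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(2000))
chunks = chunk_text(doc, chunk_size=800, overlap=150)
print(len(chunks))  # chunks start at words 0, 650, 1300 -> 3 chunks
```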
### Embeddings & Vector Store
- Each chunk is embedded using a **Sentence Transformers** model.
- Embeddings + metadata (source URL, section, document ID) are stored in **ChromaDB**.
- At query time:
- the user question is embedded,
- **k** nearest chunks are retrieved,
- these chunks become the **context** for RAG.
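Conceptually, the query-time step is nearest-neighbour search over chunk embeddings. A minimal pure-Python stand-in for the ChromaDB similarity query (toy 3-dimensional vectors and made-up chunk IDs; real embeddings come from Sentence Transformers):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """index: list of (chunk_id, embedding); returns the k best chunk IDs."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

index = [
    ("express_entry", [0.9, 0.1, 0.0]),
    ("study_permit",  [0.1, 0.9, 0.0]),
    ("citizenship",   [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.1], index, k=2))  # ['express_entry', 'study_permit']
```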
### QLoRA Fine-Tuning
- A base model such as **LLaMA 3**, **Mistral**, or **Qwen** (configurable) is used.
- **QLoRA** is applied:
  - The base model is loaded in **4-bit** quantized mode.
  - Low-rank adapters (LoRA) are trained on top of frozen base weights.
- Training uses an instruction dataset in JSONL with fields like:
  - `instruction`, `input`, `output`, plus metadata.
- Benefits:
  - Strong performance with limited GPU memory.
  - Safety: it's easy to discard or retrain adapters without touching base weights.
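A record in such an instruction JSONL might look like this (fields and values are illustrative, not taken from the actual dataset; the metadata field names are assumptions):

```jsonl
{"instruction": "Summarize the eligibility requirements for Express Entry.", "input": "", "output": "Express Entry manages applications for three federal programs...", "source": "ircc", "doc_id": "express-entry-001"}
```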
### Inference Flow
1. User asks a question via the web UI.
2. The frontend sends a request to the FastAPI endpoint.
3. Backend:
- embeds the question,
- retrieves topβk context chunks from ChromaDB,
- builds a prompt that includes:
- system instructions,
- retrieved context,
- the user question.
4. The QLoRA-fine-tuned model generates an answer.
5. The answer is returned to the frontend and displayed in the chat.
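Step 3's prompt assembly can be sketched as follows (template wording and section markers are illustrative, not the repo's exact prompt):

```python
def build_prompt(question, context_chunks):
    """Combine system instructions, retrieved context, and the user question."""
    system = (
        "You are VisaBuddy, an assistant for Canadian immigration questions. "
        "Answer using only the provided IRCC context and say so when unsure."
    )
    # Number the retrieved chunks so the model can cite them.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return f"{system}\n\n### Context\n{context}\n\n### Question\n{question}\n\n### Answer\n"

prompt = build_prompt(
    "Do I need a study permit for a 6-month course?",
    ["Most courses of six months or less do not require a study permit."],
)
print(prompt.splitlines()[0])  # the system instruction line
```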
---
## Installation & Setup
### 1. Clone the Repository
```bash
git clone https://github.com/<your-username>/<your-repo-name>.git
cd <your-repo-name>
```
---
### 2. Frontend Setup (React + Vite)
#### 2.1 Install Dependencies
```bash
cd frontend # or project root if frontend is at top-level
npm install
```
#### 2.2 Environment Variables
Create a `.env` (or `.env.local`) file in the frontend root:
```bash
VITE_API_URL=<your-cloudflare-tunnel-url>
```
Examples:
- `VITE_API_URL=https://your-tunnel-subdomain.trycloudflare.com`
This URL should point to the public endpoint that fronts your FastAPI server in Colab.
#### 2.3 Run the Dev Server
```bash
npm run dev
```
Open the printed URL in your browser (typically `http://localhost:5173`) to use VisaBuddy.
---
### 3. Backend Setup (Google Colab + FastAPI)
The backend is intended to run inside a **Google Colab** notebook.
#### 3.1 Upload Model ZIP to Google Drive
1. Train or download the QLoRA adapters and tokenizer.
2. Zip the model folder into something like:
- `visa-buddy-model.zip`
3. Upload the ZIP to Google Drive, for example:
- `MyDrive/visa_buddy/visa-buddy-model.zip`
#### 3.2 Mount Google Drive in Colab
In your Colab notebook:
```python
from google.colab import drive
drive.mount("/content/drive")
MODEL_ZIP_PATH = "/content/drive/MyDrive/visa_buddy/visa-buddy-model.zip"
LOCAL_MODEL_DIR = "/content/visa_buddy_model"
```
#### 3.3 Unzip and Load the Model
```python
import os
import zipfile

os.makedirs(LOCAL_MODEL_DIR, exist_ok=True)
with zipfile.ZipFile(MODEL_ZIP_PATH, "r") as zf:
    zf.extractall(LOCAL_MODEL_DIR)
```
Then load the base model and LoRA adapters with **Transformers** + **PEFT**:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

BASE_MODEL_NAME = "<base-model-name>"  # e.g. meta-llama/Meta-Llama-3-8B-Instruct
ADAPTER_PATH = LOCAL_MODEL_DIR         # folder with adapter_config / adapter_model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_NAME,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model.eval()
```
#### 3.4 FastAPI Server
Create a simple FastAPI app in the notebook:
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/v1/chat")
def chat(req: ChatRequest):
    # TODO: run RAG retrieval and LLM generation here
    # response_text = generate_answer(req.message)
    response_text = "This is a placeholder response."
    return JSONResponse({"answer": response_text})

def start_server():
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Start the server (in a background cell or with `nest_asyncio` / separate process).
#### 3.5 Expose Backend via Cloudflare Tunnel
1. Install cloudflared in Colab:
```bash
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!sudo dpkg -i cloudflared-linux-amd64.deb
```
2. Start a tunnel targeting the FastAPI port (8000). For a named tunnel, follow Cloudflare's docs to authenticate first; for an ad-hoc quick tunnel, run:
```bash
!cloudflared tunnel --url http://localhost:8000
```
Copy the printed `trycloudflare.com` URL into the frontend's `VITE_API_URL`.
---
### 4. Hugging Face Model Card Metadata
The model repository's `README.md` on Hugging Face opens with this YAML front matter:
```yaml
---
license: apache-2.0
language:
  - en
metrics:
  - precision
  - accuracy
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
  - visa
  - canada
  - rag
  - q-lora
  - lora
  - finetuned
  - google-colab
  - fastapi
  - fullstack
  - react
  - tailwind
  - figma
  - ux
  - ui
  - typescript
  - embeddings
  - chromadb
  - ai-assistant
  - instruct
---
```