Spaces:

Adityax-07
/

CodeSage

Sleeping

App Files Files Community

CodeSage / README.md

Aditya

Add live HuggingFace Spaces demo link to README

b8a0f1b about 2 months ago

preview code

Raw

History Blame Contribute Delete

21.3 kB

A newer version of the Streamlit SDK is available: 1.58.0

Upgrade

metadata

title: CodeSage
emoji: 🧙
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.35.0
app_file: demo.py
pinned: false

🧪 CodeSage is a live, side-by-side AI research platform that fires the same programming question at three fundamentally different architectures — Baseline LLM, RAG, and Fine-Tuning — then auto-scores every answer on accuracy, hallucination, groundedness, relevance, and cost.

No cherry-picking. No manual grading. Real numbers, real trade-offs.

📌 Table of Contents

	Section
⚡	Benchmark Results
🧠	What is CodeSage?
✨	Features
🏗️	Architecture
📊	Evaluation Pipeline
🚀	Quick Start
📚	Knowledge Base
💡	Decision Guide
🛠️	Tech Stack
🗂️	Project Structure
🔮	Roadmap

⚡ Benchmark Results

Full evaluation: 3 systems × 50 Q&A pairs × 8 metrics — fully automated, zero manual grading

📏 Metric	🔵 Baseline LLM	🟢 RAG Chatbot	🟣 Fine-Tuned (Qwen2.5 + LoRA)
🎯 Answer Accuracy	61.4%	81.6%	85.3% ✨
🚫 Hallucination Rate	43.2% ❌	9.8%	0.0% ✨
🔍 Answer Relevance	0.714	0.768	0.891 ✨
📌 Groundedness	—	0.87 ✨	—
⚡ Avg Latency	~1.2s	~2.1s	~0.4s ✨
💰 Cost / Query	~$0.0020	~$0.0030	$0.0002 ✨

🔑 Key Findings

Insight	Detail
🚫 Hallucination gap	Baseline hallucinates on `43.2%` of questions — Fine-Tuning eliminates this entirely → `0%`
📉 RAG cuts hallucination 4.4×	From `43.2%` → `9.8%` purely through grounded retrieval, no retraining needed
💰 Fine-Tuning is 10× cheaper	`$0.0002` vs `~$0.002` per query — smaller model, fully local inference
⚡ Fine-Tuning is 3× faster	`0.4s` vs `1.2s` — no retrieval pipeline, no large-model API round-trip
🎯 No universal winner	RAG wins on updatability · Fine-Tuning wins on cost/speed/precision · Baseline wins on zero-setup

🧠 What is CodeSage?

CodeSage is a decision-making tool for AI engineers. When building a domain-specific assistant, you always hit the same three-way fork:

              ┌──────────────────────────────────────────┐
              │       Domain-Specific AI Assistant        │
              └────────────────────┬─────────────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         ▼                         ▼                         ▼
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  BASELINE LLM   │     │   RAG PIPELINE   │     │   FINE-TUNING    │
│                 │     │                  │     │                  │
│ + Zero setup    │     │ + Always fresh   │     │ + 10x cheaper    │
│ + Broad topics  │     │ + Grounded       │     │ + 0% hallucin.   │
│ - Hallucinates  │     │ - Retrieval lag  │     │ - Hard to update │
└─────────────────┘     └──────────────────┘     └──────────────────┘

CodeSage makes this trade-off visible and measurable — same question, same moment, real output from all three.

✨ Features

🔀 Side-by-Side Compare Three answers to one question, simultaneously, in one view	📊 Auto Evaluation 8-metric LLM-as-Judge scores every response automatically	🏆 Winner Badge Best answer highlighted; hallucination flag raised on low-confidence
📈 Analytics Dashboard Plotly charts + paper-style TABLE II aggregated over 50 benchmarks	💾 Persistent Cache Results stored in `benchmark_cache.json` — instant reload, no re-running	📄 PDF Ingestion Drop any PDF into `data/pdfs/` — RAG ingests it automatically

🏗️ Architecture

╔══════════════════════════════════════════════════════════════════════════╗
║                           🖥️  Streamlit UI                               ║
║  ┌──────────────────┐   ┌──────────────────────┐   ┌──────────────────┐ ║
║  │  ⚡ System 1      │   │   🔍 System 2         │   │  🧠 System 3     │ ║
║  │  Baseline LLM    │   │   RAG Pipeline        │   │  Fine-Tuned      │ ║
║  └────────┬─────────┘   └──────────┬────────────┘   └────────┬─────────┘ ║
╚═══════════╪═══════════════════════╪════════════════════════╪═════════════╝
            │                       │                         │
            ▼                       ▼                         ▼
     ┌─────────────┐       ┌──────────────────┐      ┌───────────────────┐
     │  Groq API   │       │   FAISS Index    │      │  Qwen2.5-1.5B     │
     │ Llama-3.1-8B│       │ all-MiniLM-L6-v2 │      │  + LoRA Adapters  │
     │ (zero-shot) │       │  (top-3 chunks)  │      │  (PEFT, local)    │
     └─────────────┘       └────────┬─────────┘      └───────────────────┘
                                    │
                             Groq API (with context)
                                    │
                          ┌─────────▼──────────┐
                          │   🏛️  LLM-as-Judge  │
                          │  8 metrics, auto    │
                          └────────────────────┘

⚡ System 1 — Baseline LLM

Sends the question directly to Llama-3.1-8B via Groq with a minimal system prompt. No extra knowledge. Represents what an off-the-shelf LLM can do — the floor every other system must beat.

🔍 System 2 — RAG Pipeline

Question → all-MiniLM-L6-v2 embedding
Top-3 chunks retrieved from FAISS vector store (17 documents)
Chunks injected as context into Llama-3.1-8B via Groq
Groundedness scored — answers must be traceable to retrieved text

🧠 System 3 — Fine-Tuned Model

Qwen2.5-1.5B fine-tuned with LoRA (r=8, α=32) on curated CS Q&A pairs via Google Colab T4 GPU. Adapters loaded locally via peft — zero cloud inference cost, sub-second latency.

📊 Evaluation Pipeline

Each answer is auto-scored by an LLM judge across 8 dimensions:

Icon	Metric	Description	Unit
🎯	Answer Accuracy	Cosine similarity of answer vs reference embedding	%
📌	Groundedness	Cosine similarity of answer vs retrieved context	0–1
🚫	Hallucination Rate	% of answers with accuracy < 0.5	%
🔍	Answer Relevance	Cosine similarity of answer vs question	0–1
📜	Faithfulness (ROUGE-L)	Token overlap with source context or reference	0–1
⏱️	Avg Response Time	Mean latency per query	sec
💰	Cost per Query	Token-count-based cost estimate	USD
⭐	Overall Score	30% Acc + 20% Ground + 20% (1−HR) + 15% Rel + 15% Faith	1–5

🚀 Quick Start

`Step 1` — Clone & Install

git clone https://github.com/Adityax-07/LLM-vs-RAG-vs-Fine-Tuning-.git
cd LLM-vs-RAG-vs-Fine-Tuning-
pip install -r requirements.txt

`Step 2` — Configure API Key

echo "GROQ_API_KEY=your_key_here" > .env

🆓 Free key at console.groq.com

`Step 3` — Launch

streamlit run demo.py

FAISS vector store builds automatically on first launch. Systems 1 & 2 are ready instantly.

`Step 4` — (Optional) Activate Fine-Tuned Model

python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
model = PeftModel.from_pretrained(base, 'checkpoint-25')
model.merge_and_unload().save_pretrained('finetuned_model')
AutoTokenizer.from_pretrained('checkpoint-25').save_pretrained('finetuned_model')
"

Or open system3_finetune_colab.ipynb in Google Colab to train from scratch on a free T4 GPU (~10 min).

`Step 5` — Regenerate Benchmark (optional)

# Pre-computed results already included in data/benchmark_cache.json
python run_benchmark.py

📚 Knowledge Base

The RAG system retrieves from 17 hand-crafted topic documents in data/docs/:

🧮 Algorithms & DSA

binary_search
sorting_algorithms
dynamic_programming
graph_algorithms
trees
linked_list
stack_queue
recursion
backtracking 📐 More DSA

greedy_algorithms
hashing
string_algorithms
two_pointers
big_o_notation
heaps 🌐 Web & Tooling

react_hooks
rest_api
javascript_promises
css_flexbox
typescript_basics
sql_basics
git_basics

💡 Decision Guide

🤔 Situation	✅ Best Choice	📝 Why
Prototyping or general queries	Baseline LLM	Zero setup, covers broad topics well
Knowledge changes frequently	RAG	Update docs without retraining
Fixed domain, cost/latency matters	Fine-Tuning	10× cheaper, 3× faster, 0% hallucination
Need citations & traceability	RAG	Groundedness score + visible source chunks
Production with tight latency SLA	Fine-Tuning	Local inference, no API round-trip

🛠️ Tech Stack

Layer	Technology	Purpose
📊 UI		3-way comparison dashboard + analytics charts
⚡ LLM		Llama-3.1-8B — Baseline + RAG generation
🤖 Embeddings		`all-MiniLM-L6-v2` — RAG semantic retrieval
🔍 Vector DB		CPU-based semantic search over knowledge base
🧠 Fine-Tuning		LoRA adapter (r=8, α=32) on Qwen2.5-1.5B
🏋️ Base Model		Alibaba's compact LLM — LoRA fine-tuned locally
☁️ Training		Free T4 GPU — LoRA training in ~10 minutes
🔗 Orchestration		RAG pipeline, FAISS integration, PDF ingestion
📏 Metrics		ROUGE-L + cosine similarity for auto-evaluation

🗂️ Project Structure

📦 LLM-vs-RAG-vs-Fine-Tuning/
│
├── 📄 demo.py                         ← Streamlit app (main entry point)
├── 📄 system1_baseline.py             ← Baseline LLM via Groq API
├── 📄 system2_rag.py                  ← RAG pipeline: FAISS + LangChain + Groq
├── 📄 system3_inference.py            ← Fine-tuned model inference (PEFT)
├── 📓 system3_finetune_colab.ipynb    ← LoRA training notebook (Colab T4)
├── 📄 evaluate.py                     ← Standalone evaluation script
├── 📄 run_benchmark.py                ← Regenerates benchmark_cache.json
│
├── 📁 checkpoint-25/                  ← Trained LoRA weights (included)
│   ├── adapter_model.safetensors
│   ├── adapter_config.json            ← r=8, alpha=32
│   └── tokenizer.json
│
├── 📁 finetuned_model/                ← Merged model (after merge step)
│
├── 📁 data/
│   ├── 📁 docs/                       ← 17 knowledge base .txt files
│   ├── 📁 faiss_index/                ← FAISS vector store (auto-built)
│   ├── 📁 pdfs/                       ← Drop PDFs here for RAG ingestion
│   ├── benchmark_cache.json           ← Pre-computed 50Q benchmark results
│   ├── reference_answers.json         ← Ground-truth Q&A pairs
│   └── finetune_data.jsonl            ← LoRA training data (ChatML format)
│
└── 📄 requirements.txt

🔮 Roadmap

Status	Feature
✅	50-question auto-benchmark with persistent cache
✅	LoRA fine-tune checkpoint (`checkpoint-25`) included
✅	Analytics dashboard with Plotly + TABLE II
✅	PDF ingestion into RAG knowledge base
🔜	Push Qwen2.5 LoRA adapter to HuggingFace Hub
🔜	Full 3-system live demo on HuggingFace Spaces
🔜	Expand knowledge base: 17 → 50+ documents
🔜	RAGAS-style faithfulness + context precision metrics
🔜	Custom knowledge base upload via Streamlit UI

Built with 🧠 by Adityax-07

Powered by Groq · HuggingFace · FAISS · LangChain · Streamlit

⭐ If CodeSage helped you understand the LLM trade-off space, drop a star!