---
title: CodeSage
emoji: 🧙
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.35.0"
app_file: demo.py
pinned: false
---

🧪 CodeSage is a live, side-by-side AI research platform that fires the same programming question at three fundamentally different architectures — Baseline LLM, RAG, and Fine-Tuning — then auto-scores every answer on accuracy, hallucination, groundedness, relevance, and cost.

No cherry-picking. No manual grading. Real numbers, real trade-offs.

## 📌 Table of Contents

| | Section |
|:---:|:---|
| ⚡ | [Benchmark Results](#-benchmark-results) |
| 🧠 | [What is CodeSage?](#-what-is-codesage) |
| ✨ | [Features](#-features) |
| 🏗️ | [Architecture](#️-architecture) |
| 📊 | [Evaluation Pipeline](#-evaluation-pipeline) |
| 🚀 | [Quick Start](#-quick-start) |
| 📚 | [Knowledge Base](#-knowledge-base) |
| 💡 | [Decision Guide](#-decision-guide) |
| 🛠️ | [Tech Stack](#️-tech-stack) |
| 🗂️ | [Project Structure](#️-project-structure) |
| 🔮 | [Roadmap](#-roadmap) |

## ⚡ Benchmark Results

> **Full evaluation:** `3 systems` × `50 Q&A pairs` × `8 metrics`, fully automated, zero manual grading

| 📏 Metric | 🔵 Baseline LLM | 🟢 RAG Chatbot | 🟣 Fine-Tuned (Qwen2.5 + LoRA) |
|:---|:---:|:---:|:---:|
| 🎯 **Answer Accuracy** | 61.4% | 81.6% | **85.3% ✨** |
| 🚫 **Hallucination Rate** | 43.2% ❌ | 9.8% | **0.0% ✨** |
| 🔍 **Answer Relevance** | 0.714 | 0.768 | **0.891 ✨** |
| 📌 **Groundedness** | n/a | **0.87 ✨** | n/a |
| ⚡ **Avg Latency** | ~1.2s | ~2.1s | **~0.4s ✨** |
| 💰 **Cost / Query** | ~$0.0020 | ~$0.0030 | **$0.0002 ✨** |

### 🔑 Key Findings

| Insight | Detail |
|:---|:---|
| 🚫 **Hallucination gap** | Baseline hallucinates on `43.2%` of questions; Fine-Tuning brings this down to `0%` on the benchmark |
| 📉 **RAG cuts hallucination 4.4×** | From `43.2%` → `9.8%` purely through grounded retrieval, no retraining needed |
| 💰 **Fine-Tuning is 10× cheaper** | `$0.0002` vs `~$0.002` per query for Baseline: smaller model, fully local inference |
| ⚡ **Fine-Tuning is 3× faster** | `0.4s` vs `1.2s` for Baseline: no retrieval pipeline, no large-model API round-trip |
| 🎯 **No universal winner** | RAG wins on updatability · Fine-Tuning wins on cost/speed/precision · Baseline wins on zero-setup |

## 🧠 What is CodeSage?

CodeSage is a **decision-making tool** for AI engineers.
When building a domain-specific assistant, you always hit the same three-way fork:

```
              ┌──────────────────────────────────────────┐
              │       Domain-Specific AI Assistant       │
              └────────────────────┬─────────────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         ▼                         ▼                         ▼
┌─────────────────┐       ┌──────────────────┐      ┌──────────────────┐
│ BASELINE LLM    │       │ RAG PIPELINE     │      │ FINE-TUNING      │
│                 │       │                  │      │                  │
│ + Zero setup    │       │ + Always fresh   │      │ + 10x cheaper    │
│ + Broad topics  │       │ + Grounded       │      │ + 0% hallucin.   │
│ - Hallucinates  │       │ - Retrieval lag  │      │ - Hard to update │
└─────────────────┘       └──────────────────┘      └──────────────────┘
```

> CodeSage makes this trade-off **visible and measurable**: same question, same moment, real output from all three.

---

## ✨ Features

| Feature | What it does |
|:---|:---|
| 🔀 **Side-by-Side Compare** | Three answers to one question, simultaneously, in one view |
| 📊 **Auto Evaluation** | 8-metric LLM-as-Judge scores every response automatically |
| 🏆 **Winner Badge** | Best answer highlighted; hallucination flag raised on low-confidence answers |
| 📈 **Analytics Dashboard** | Plotly charts + paper-style TABLE II aggregated over the 50 benchmark questions |
| 💾 **Persistent Cache** | Results stored in `benchmark_cache.json`: instant reload, no re-running |
| 📄 **PDF Ingestion** | Drop any PDF into `data/pdfs/` and the RAG system ingests it automatically |
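
The persistent-cache feature boils down to a load-or-compute pattern. The sketch below is an illustration under assumptions, not the project's actual code: the helper `load_or_run`, the `fake_benchmark` callable, and the JSON layout are hypothetical; only the file name `benchmark_cache.json` comes from this README.

```python
import json
import tempfile
from pathlib import Path


def load_or_run(cache_path, run_benchmark):
    """Return cached results if the file exists, else compute and store them."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: reload from disk without re-running any model.
        return json.loads(path.read_text(encoding="utf-8"))
    results = run_benchmark()  # hypothetical callable returning a JSON-serializable dict
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results


calls = []  # records how often the (expensive) benchmark actually runs


def fake_benchmark():
    """Stand-in for a real benchmark run; the result layout is invented."""
    calls.append(1)
    return {"baseline": {"accuracy": 0.614}}


# Demo in a temporary directory so no real cache file is touched.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "data" / "benchmark_cache.json"
    first = load_or_run(cache_file, fake_benchmark)   # computes and writes the file
    second = load_or_run(cache_file, fake_benchmark)  # served from disk
```

In this sketch the second call never invokes the benchmark, which is the "instant reload" behavior the feature describes.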

---

## 🏗️ Architecture

```
╔══════════════════════════════════════════════════════════════════════════╗
║                             🖥️ Streamlit UI                              ║
║  ┌──────────────────┐  ┌───────────────────────┐  ┌──────────────────┐   ║
║  │ ⚡ System 1      │  │ 🔍 System 2           │  │ 🧠 System 3      │   ║
║  │    Baseline LLM  │  │    RAG Pipeline       │  │    Fine-Tuned    │   ║
║  └────────┬─────────┘  └──────────┬────────────┘  └────────┬─────────┘   ║
╚═══════════╪═══════════════════════╪════════════════════════╪═════════════╝
            │                       │                        │
            ▼                       ▼                        ▼
     ┌─────────────┐       ┌──────────────────┐    ┌───────────────────┐
     │ Groq API    │       │ FAISS Index      │    │ Qwen2.5-1.5B      │
     │ Llama-3.1-8B│       │ all-MiniLM-L6-v2 │    │ + LoRA Adapters   │
     │ (zero-shot) │       │ (top-3 chunks)   │    │ (PEFT, local)     │
     └─────────────┘       └────────┬─────────┘    └───────────────────┘
                                    │
                         Groq API (with context)
                                    │
                          ┌─────────▼──────────┐
                          │ 🏛️ LLM-as-Judge    │
                          │    8 metrics, auto │
                          └────────────────────┘
```

### ⚡ System 1 · Baseline LLM

Sends the question directly to **Llama-3.1-8B** via Groq with a minimal system prompt. No extra knowledge. Represents what an off-the-shelf LLM can do: the floor every other system must beat.

### 🔍 System 2 · RAG Pipeline

1. Question → `all-MiniLM-L6-v2` embedding
2. Top-3 chunks retrieved from **FAISS** vector store (17 documents)
3. Chunks injected as context into **Llama-3.1-8B** via Groq
4. Groundedness scored: answers must be traceable to retrieved text

### 🧠 System 3 · Fine-Tuned Model

**Qwen2.5-1.5B** fine-tuned with **LoRA** (`r=8, α=32`) on curated CS Q&A pairs on a Google Colab T4 GPU. Adapters loaded locally via `peft`: zero cloud inference cost, sub-second latency.
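
To make System 3's LoRA settings concrete: with `r=8` and `α=32`, the low-rank update is scaled by α/r = 4 before it is added to the frozen weight, i.e. W' = W + (α/r)·B·A in the standard LoRA formulation. The following is a minimal plain-Python sketch of that merge on toy matrices. The helper names and the matrix values are assumptions for illustration; in the project the real merge is delegated to `peft`.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)              # A has shape (r, k); B has shape (d, r)
    scale = alpha / r
    delta = matmul(b, a)    # low-rank update of shape (d, k)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# README settings: r=8, alpha=32, so the update is scaled by 4.
readme_scale = 32 / 8

# Toy example (rank 1, alpha 4, hence the same scale of 4) on a 2x2 weight.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
a = [[1.0, 2.0]]               # A: shape (1, 2)
b = [[0.5], [0.0]]             # B: shape (2, 1)
merged = merge_lora(w, a, b, alpha=4)
```

Only `A` and `B` are trained, which is why the adapter checkpoint stays small compared with the base model.
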

---

## 📊 Evaluation Pipeline

Each answer is auto-scored by an LLM judge across **8 dimensions**:

| Icon | Metric | Description | Unit |
|:---:|:---|:---|:---:|
| 🎯 | **Answer Accuracy** | Cosine similarity of answer vs reference embedding | % |
| 📌 | **Groundedness** | Cosine similarity of answer vs retrieved context | 0–1 |
| 🚫 | **Hallucination Rate** | % of answers with accuracy < 0.5 | % |
| 🔍 | **Answer Relevance** | Cosine similarity of answer vs question | 0–1 |
| 📜 | **Faithfulness (ROUGE-L)** | Token overlap with source context or reference | 0–1 |
| ⏱️ | **Avg Response Time** | Mean latency per query | sec |
| 💰 | **Cost per Query** | Token-count-based cost estimate | USD |
| ⭐ | **Overall Score** | 30% Acc + 20% Ground + 20% (1−HR) + 15% Rel + 15% Faith | 1–5 |

---

## 🚀 Quick Start

### `Step 1` · Clone & Install

```bash
git clone https://github.com/Adityax-07/LLM-vs-RAG-vs-Fine-Tuning-.git
cd LLM-vs-RAG-vs-Fine-Tuning-
pip install -r requirements.txt
```

### `Step 2` · Configure API Key

```bash
echo "GROQ_API_KEY=your_key_here" > .env
```

> 🆓 Free key at [console.groq.com](https://console.groq.com)

### `Step 3` · Launch

```bash
streamlit run demo.py
```

> FAISS vector store builds automatically on first launch. **Systems 1 & 2 are ready instantly.**

### `Step 4` · (Optional) Activate Fine-Tuned Model

```bash
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter from checkpoint-25
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
model = PeftModel.from_pretrained(base, 'checkpoint-25')

# Merge the adapter into the base weights and save model + tokenizer together
model.merge_and_unload().save_pretrained('finetuned_model')
AutoTokenizer.from_pretrained('checkpoint-25').save_pretrained('finetuned_model')
"
```

> Or open `system3_finetune_colab.ipynb` in **Google Colab** to train from scratch on a free T4 GPU (~10 min).

### `Step 5` · Regenerate Benchmark *(optional)*

```bash
# Pre-computed results already included in data/benchmark_cache.json
python run_benchmark.py
```

---

## 📚 Knowledge Base

The RAG system retrieves from **17 hand-crafted topic documents** in `data/docs/`:

| Category | Documents |
|:---|:---|
| 🧮 **Algorithms & DSA** | `binary_search`, `sorting_algorithms`, `dynamic_programming`, `graph_algorithms`, `trees`, `linked_list`, `stack_queue`, `recursion`, `backtracking` |
| 📐 **More DSA** | `greedy_algorithms`, `hashing`, `string_algorithms`, `two_pointers`, `big_o_notation`, `heaps` |
| 🌐 **Web & Tooling** | `react_hooks`, `rest_api`, `javascript_promises`, `css_flexbox`, `typescript_basics`, `sql_basics`, `git_basics` |
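
Retrieval over these documents can be pictured as "embed the question, rank documents by cosine similarity, keep the top 3". The sketch below is a simplified stand-in, not the project's pipeline: a toy bag-of-words vector replaces `all-MiniLM-L6-v2`, a brute-force sort replaces the FAISS index, and the one-line document contents are invented. Only the document names are taken from the list above.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(question, docs, k=3):
    """Return the names of the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]


# Hypothetical one-line contents; the real files live in data/docs/.
docs = {
    "binary_search": "binary search halves a sorted array each step",
    "hashing": "hash tables map keys to buckets",
    "react_hooks": "react hooks manage state in function components",
    "sql_basics": "sql select queries read rows from tables",
}
top = retrieve("how does binary search work on a sorted array", docs)
```

The retrieved chunks would then be placed into the prompt as context, which is what lets groundedness be scored against visible source text.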

---

## 💡 Decision Guide

| 🤔 Situation | ✅ Best Choice | 📝 Why |
|:---|:---:|:---|
| Prototyping or general queries | **Baseline LLM** | Zero setup, covers broad topics well |
| Knowledge changes frequently | **RAG** | Update docs without retraining |
| Fixed domain, cost/latency matters | **Fine-Tuning** | 10× cheaper, 3× faster, 0% hallucination |
| Need citations & traceability | **RAG** | Groundedness score + visible source chunks |
| Production with tight latency SLA | **Fine-Tuning** | Local inference, no API round-trip |

---

## 🛠️ Tech Stack

| Layer | Technology | Purpose |
|:---|:---|:---|
| 📊 UI | Streamlit | 3-way comparison dashboard + analytics charts |
| ⚡ LLM | Groq API | Llama-3.1-8B: Baseline + RAG generation |
| 🤖 Embeddings | | `all-MiniLM-L6-v2`: RAG semantic retrieval |
| 🔍 Vector DB | FAISS | CPU-based semantic search over knowledge base |
| 🧠 Fine-Tuning | PEFT | LoRA adapter (r=8, α=32) on Qwen2.5-1.5B |
| 🏋️ Base Model | Qwen2.5-1.5B | Alibaba's compact LLM, fine-tuned with LoRA and run locally |
| ☁️ Training | Google Colab | Free T4 GPU: LoRA training in ~10 minutes |
| 🔗 Orchestration | LangChain | RAG pipeline, FAISS integration, PDF ingestion |
| 📏 Metrics | | ROUGE-L + cosine similarity for auto-evaluation |
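
The metric definitions in the Evaluation Pipeline table can be sketched in a few lines. This is an illustrative reading of those definitions, not the code in `evaluate.py`: the ROUGE-L F1 variant over whitespace tokens is an assumption, the sample inputs are invented, and the weighted sum is left on a 0–1 scale because the README does not say how it is mapped to the 1–5 unit or how a missing groundedness value is handled.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (the F1 variant is an assumption)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(accuracies, threshold=0.5):
    """Share of answers whose accuracy falls below the threshold."""
    return sum(acc < threshold for acc in accuracies) / len(accuracies)


def overall_score(acc, ground, hr, rel, faith):
    """Weighted sum from the Evaluation Pipeline table, on a 0-1 scale."""
    return 0.30 * acc + 0.20 * ground + 0.20 * (1 - hr) + 0.15 * rel + 0.15 * faith


# Invented inputs, only to exercise the functions.
faith = rouge_l("binary search runs in log time",
                "binary search runs in logarithmic time")
hr = hallucination_rate([0.9, 0.4, 0.7, 0.3])
score = overall_score(acc=1.0, ground=1.0, hr=0.0, rel=1.0, faith=1.0)
```

Because the weights sum to 100%, a system that is perfect on every component reaches the top of the 0–1 range in this sketch.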

---

## 🗂️ Project Structure

```
📦 LLM-vs-RAG-vs-Fine-Tuning/
│
├── 📄 demo.py                          ← Streamlit app (main entry point)
├── 📄 system1_baseline.py              ← Baseline LLM via Groq API
├── 📄 system2_rag.py                   ← RAG pipeline: FAISS + LangChain + Groq
├── 📄 system3_inference.py             ← Fine-tuned model inference (PEFT)
├── 📓 system3_finetune_colab.ipynb     ← LoRA training notebook (Colab T4)
├── 📄 evaluate.py                      ← Standalone evaluation script
├── 📄 run_benchmark.py                 ← Regenerates benchmark_cache.json
│
├── 📁 checkpoint-25/                   ← Trained LoRA weights (included)
│   ├── adapter_model.safetensors
│   ├── adapter_config.json             ← r=8, alpha=32
│   └── tokenizer.json
│
├── 📁 finetuned_model/                 ← Merged model (after merge step)
│
├── 📁 data/
│   ├── 📁 docs/                        ← 17 knowledge base .txt files
│   ├── 📁 faiss_index/                 ← FAISS vector store (auto-built)
│   ├── 📁 pdfs/                        ← Drop PDFs here for RAG ingestion
│   ├── benchmark_cache.json            ← Pre-computed 50Q benchmark results
│   ├── reference_answers.json          ← Ground-truth Q&A pairs
│   └── finetune_data.jsonl             ← LoRA training data (ChatML format)
│
└── 📄 requirements.txt
```

---

## 🔮 Roadmap

| Status | Feature |
|:---:|:---|
| ✅ | 50-question auto-benchmark with persistent cache |
| ✅ | LoRA fine-tune checkpoint (`checkpoint-25`) included |
| ✅ | Analytics dashboard with Plotly + TABLE II |
| ✅ | PDF ingestion into RAG knowledge base |
| 🔜 | Push Qwen2.5 LoRA adapter to HuggingFace Hub |
| 🔜 | Full 3-system live demo on HuggingFace Spaces |
| 🔜 | Expand knowledge base: 17 → 50+ documents |
| 🔜 | RAGAS-style faithfulness + context precision metrics |
| 🔜 | Custom knowledge base upload via Streamlit UI |

---

Built with 🧠 by Adityax-07
Powered by Groq · HuggingFace · FAISS · LangChain · Streamlit

⭐ If CodeSage helped you understand the LLM trade-off space, drop a star!