---
title: CodeSage
emoji: 🧙
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.35.0"
app_file: demo.py
pinned: false
---


🧪 CodeSage is a live, side-by-side AI research platform that fires the same programming question at three fundamentally different architectures (Baseline LLM, RAG, and Fine-Tuning), then auto-scores every answer on accuracy, hallucination, groundedness, relevance, and cost.

No cherry-picking. No manual grading. Real numbers, real trade-offs.

## 📌 Table of Contents

| | Section |
|:---:|:---|
| ⚡ | [Benchmark Results](#-benchmark-results) |
| 🧠 | [What is CodeSage?](#-what-is-codesage) |
| ✨ | [Features](#-features) |
| 🏗️ | [Architecture](#️-architecture) |
| 📊 | [Evaluation Pipeline](#-evaluation-pipeline) |
| 🚀 | [Quick Start](#-quick-start) |
| 📚 | [Knowledge Base](#-knowledge-base) |
| 💡 | [Decision Guide](#-decision-guide) |
| 🛠️ | [Tech Stack](#️-tech-stack) |
| 🗂️ | [Project Structure](#️-project-structure) |
| 🔮 | [Roadmap](#-roadmap) |

## ⚡ Benchmark Results

> **Full evaluation:** `3 systems` × `50 Q&A pairs` × `8 metrics`, fully automated, zero manual grading

| 📏 Metric | 🔵 Baseline LLM | 🟢 RAG Chatbot | 🟣 Fine-Tuned (Qwen2.5 + LoRA) |
|:---|:---:|:---:|:---:|
| 🎯 **Answer Accuracy** | 61.4% | 81.6% | **85.3% ✨** |
| 🚫 **Hallucination Rate** | 43.2% ❌ | 9.8% | **0.0% ✨** |
| 🔍 **Answer Relevance** | 0.714 | 0.768 | **0.891 ✨** |
| 📌 **Groundedness** | n/a | **0.87 ✨** | n/a |
| ⚡ **Avg Latency** | ~1.2s | ~2.1s | **~0.4s ✨** |
| 💰 **Cost / Query** | ~$0.0020 | ~$0.0030 | **$0.0002 ✨** |

### 🔑 Key Findings

| Insight | Detail |
|:---|:---|
| 🚫 **Hallucination gap** | Baseline hallucinates on `43.2%` of questions; Fine-Tuning eliminates this entirely → `0%` |
| 📉 **RAG cuts hallucination 4.4×** | From `43.2%` → `9.8%` purely through grounded retrieval, no retraining needed |
| 💰 **Fine-Tuning is 10× cheaper** | `$0.0002` vs `~$0.002` per query: smaller model, fully local inference |
| ⚡ **Fine-Tuning is 3× faster** | `0.4s` vs `1.2s`: no retrieval pipeline, no large-model API round-trip |
| 🎯 **No universal winner** | RAG wins on updatability · Fine-Tuning wins on cost/speed/precision · Baseline wins on zero-setup |

## 🧠 What is CodeSage?

CodeSage is a **decision-making tool** for AI engineers.
When building a domain-specific assistant, you always hit the same three-way fork:

```
                ┌──────────────────────────────────────────┐
                │       Domain-Specific AI Assistant       │
                └────────────────────┬─────────────────────┘
                                     │
           ┌─────────────────────────┼─────────────────────────┐
           ▼                         ▼                         ▼
  ┌─────────────────┐       ┌──────────────────┐      ┌──────────────────┐
  │  BASELINE LLM   │       │  RAG PIPELINE    │      │  FINE-TUNING     │
  │                 │       │                  │      │                  │
  │ + Zero setup    │       │ + Always fresh   │      │ + 10x cheaper    │
  │ + Broad topics  │       │ + Grounded       │      │ + 0% hallucin.   │
  │ - Hallucinates  │       │ - Retrieval lag  │      │ - Hard to update │
  └─────────────────┘       └──────────────────┘      └──────────────────┘
```

> CodeSage makes this trade-off **visible and measurable**: same question, same moment, real output from all three.

---

## ✨ Features

| Feature | Description |
|:---|:---|
| 🔀 **Side-by-Side Compare** | Three answers to one question, simultaneously, in one view |
| 📊 **Auto Evaluation** | 8-metric LLM-as-Judge scores every response automatically |
| 🏆 **Winner Badge** | Best answer highlighted; hallucination flag raised on low-confidence answers |
| 📈 **Analytics Dashboard** | Plotly charts + paper-style TABLE II aggregated over the 50 benchmark questions |
| 💾 **Persistent Cache** | Results stored in `benchmark_cache.json` for instant reload, no re-running |
| 📄 **PDF Ingestion** | Drop any PDF into `data/pdfs/` and RAG ingests it automatically |

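The persistent cache can be pictured as a load-or-compute step. The sketch below is only an illustration under stated assumptions, not the project's actual code: the helper name `load_or_compute`, the `compute` callback, and the shape of the cached dictionary are hypothetical; only the file name `benchmark_cache.json` comes from this README.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(cache_path: Path, compute: Callable[[], dict]) -> dict:
    """Return cached results if the file exists, otherwise compute and store them."""
    if cache_path.exists():
        # Cache hit: skip the (slow) benchmark run entirely.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    results = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results


if __name__ == "__main__":
    # Hypothetical stand-in for a real benchmark run.
    def fake_benchmark() -> dict:
        print("running benchmark...")
        return {"rag": {"accuracy": 0.816}}

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "benchmark_cache.json"
        load_or_compute(path, fake_benchmark)  # computes and writes the file
        load_or_compute(path, fake_benchmark)  # served from the cache
```

With this pattern, deleting the cache file is enough to force a fresh run.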
---

## 🏗️ Architecture

```
╔════════════════════════════════════════════════════════════════════════╗
║                            🖥️ Streamlit UI                             ║
║  ┌──────────────────┐  ┌──────────────────────┐  ┌──────────────────┐  ║
║  │  ⚡ System 1     │  │  🔍 System 2         │  │  🧠 System 3     │  ║
║  │  Baseline LLM    │  │  RAG Pipeline        │  │  Fine-Tuned      │  ║
║  └────────┬─────────┘  └──────────┬───────────┘  └────────┬─────────┘  ║
╚═══════════╪═══════════════════════╪═══════════════════════╪════════════╝
            │                       │                       │
            ▼                       ▼                       ▼
     ┌─────────────┐      ┌──────────────────┐    ┌───────────────────┐
     │  Groq API   │      │  FAISS Index     │    │  Qwen2.5-1.5B     │
     │ Llama-3.1-8B│      │ all-MiniLM-L6-v2 │    │  + LoRA Adapters  │
     │ (zero-shot) │      │ (top-3 chunks)   │    │  (PEFT, local)    │
     └─────────────┘      └─────────┬────────┘    └───────────────────┘
                                    │
                         Groq API (with context)
                                    │
                          ┌─────────▼──────────┐
                          │  🏛️ LLM-as-Judge   │
                          │  8 metrics, auto   │
                          └────────────────────┘
```

### ⚡ System 1: Baseline LLM

Sends the question directly to **Llama-3.1-8B** via Groq with a minimal system prompt. No extra knowledge.
Represents what an off-the-shelf LLM can do: the floor every other system must beat.

### 🔍 System 2: RAG Pipeline

1. Question → `all-MiniLM-L6-v2` embedding
2. Top-3 chunks retrieved from **FAISS** vector store (17 documents)
3. Chunks injected as context into **Llama-3.1-8B** via Groq
4. Groundedness scored: answers must be traceable to retrieved text

### 🧠 System 3: Fine-Tuned Model

**Qwen2.5-1.5B** fine-tuned with **LoRA** (`r=8, α=32`) on curated CS Q&A pairs on a Google Colab T4 GPU. Adapters are loaded locally via `peft`: zero cloud inference cost, sub-second latency.

---

## 📊 Evaluation Pipeline

Each answer is auto-scored by an LLM judge across **8 dimensions**:

| Icon | Metric | Description | Unit |
|:---:|:---|:---|:---:|
| 🎯 | **Answer Accuracy** | Cosine similarity of answer vs reference embedding | % |
| 📌 | **Groundedness** | Cosine similarity of answer vs retrieved context | 0–1 |
| 🚫 | **Hallucination Rate** | % of answers with accuracy < 0.5 | % |
| 🔍 | **Answer Relevance** | Cosine similarity of answer vs question | 0–1 |
| 📜 | **Faithfulness (ROUGE-L)** | Token overlap with source context or reference | 0–1 |
| ⏱️ | **Avg Response Time** | Mean latency per query | sec |
| 💰 | **Cost per Query** | Token-count-based cost estimate | USD |
| ⭐ | **Overall Score** | 30% Acc + 20% Ground + 20% (1−HR) + 15% Rel + 15% Faith | 1–5 |

---

## 🚀 Quick Start

### `Step 1`: Clone & Install

```bash
git clone https://github.com/Adityax-07/LLM-vs-RAG-vs-Fine-Tuning-.git
cd LLM-vs-RAG-vs-Fine-Tuning-
pip install -r requirements.txt
```

### `Step 2`: Configure API Key

```bash
echo "GROQ_API_KEY=your_key_here" > .env
```

> 🆓 Free key at [console.groq.com](https://console.groq.com)

### `Step 3`: Launch

```bash
streamlit run demo.py
```

> FAISS vector store builds automatically on first launch.
> **Systems 1 & 2 are ready instantly.**

### `Step 4`: (Optional) Activate Fine-Tuned Model

```bash
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
model = PeftModel.from_pretrained(base, 'checkpoint-25')
model.merge_and_unload().save_pretrained('finetuned_model')
AutoTokenizer.from_pretrained('checkpoint-25').save_pretrained('finetuned_model')
"
```

> Or open `system3_finetune_colab.ipynb` in **Google Colab** to train from scratch on a free T4 GPU (~10 min).

### `Step 5`: Regenerate Benchmark *(optional)*

```bash
# Pre-computed results already included in data/benchmark_cache.json
python run_benchmark.py
```

---

## 📚 Knowledge Base

The RAG system retrieves from **17 hand-crafted topic documents** in `data/docs/`:

| Category | Topics |
|:---|:---|
| 🧮 **Algorithms & DSA** | `binary_search`, `sorting_algorithms`, `dynamic_programming`, `graph_algorithms`, `trees`, `linked_list`, `stack_queue`, `recursion`, `backtracking` |
| 📐 **More DSA** | `greedy_algorithms`, `hashing`, `string_algorithms`, `two_pointers`, `big_o_notation`, `heaps` |
| 🌐 **Web & Tooling** | `react_hooks`, `rest_api`, `javascript_promises`, `css_flexbox`, `typescript_basics`, `sql_basics`, `git_basics` |

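To make the retrieval step over these topic documents concrete, here is a minimal sketch of top-3 retrieval by cosine similarity followed by context injection. It is a toy under loud assumptions: a bag-of-words `Counter` stands in for the `all-MiniLM-L6-v2` embedding, a plain sort stands in for the FAISS index, and the function names, prompt layout, and sample texts are all hypothetical rather than taken from `system2_rag.py`.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (stand-in for a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Return the names of the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: dict[str, str], k: int = 3) -> str:
    """Inject the retrieved documents as context ahead of the question."""
    context = "\n\n".join(docs[name] for name in top_k(question, docs, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Hypothetical miniature knowledge base keyed by topic name.
    sample_docs = {
        "binary_search": "binary search halves a sorted array each step",
        "heaps": "a heap is a tree that keeps the smallest item at the root",
        "git_basics": "git commit records a snapshot of the staged changes",
        "css_flexbox": "flexbox aligns items along a main axis",
    }
    print(build_prompt("how does binary search work on a sorted array", sample_docs))
```

A real pipeline would swap `embed` for the sentence-embedding model and the sort for a vector-index lookup; the ranking-then-injection flow stays the same.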
---

## 💡 Decision Guide

| 🤔 Situation | ✅ Best Choice | 📝 Why |
|:---|:---:|:---|
| Prototyping or general queries | **Baseline LLM** | Zero setup, covers broad topics well |
| Knowledge changes frequently | **RAG** | Update docs without retraining |
| Fixed domain, cost/latency matters | **Fine-Tuning** | 10× cheaper, 3× faster, 0% hallucination |
| Need citations & traceability | **RAG** | Groundedness score + visible source chunks |
| Production with tight latency SLA | **Fine-Tuning** | Local inference, no API round-trip |

---

## 🛠️ Tech Stack

| Layer | Technology | Purpose |
|:---|:---|:---|
| 📊 UI | Streamlit | 3-way comparison dashboard + analytics charts |
| ⚡ LLM | Groq | Llama-3.1-8B: Baseline + RAG generation |
| 🤖 Embeddings | all-MiniLM-L6-v2 | RAG semantic retrieval |
| 🔍 Vector DB | FAISS | CPU-based semantic search over knowledge base |
| 🧠 Fine-Tuning | PEFT | LoRA adapter (r=8, α=32) on Qwen2.5-1.5B |
| 🏋️ Base Model | Qwen2.5-1.5B | Alibaba's compact LLM, LoRA fine-tuned locally |
| ☁️ Training | Google Colab | Free T4 GPU, LoRA training in ~10 minutes |
| 🔗 Orchestration | LangChain | RAG pipeline, FAISS integration, PDF ingestion |
| 📏 Metrics | ROUGE-L + cosine similarity | Auto-evaluation |

---

## 🗂️ Project Structure

```
📦 LLM-vs-RAG-vs-Fine-Tuning/
│
├── 📄 demo.py                        ← Streamlit app (main entry point)
├── 📄 system1_baseline.py            ← Baseline LLM via Groq API
├── 📄 system2_rag.py                 ← RAG pipeline: FAISS + LangChain + Groq
├── 📄 system3_inference.py           ← Fine-tuned model inference (PEFT)
├── 📓 system3_finetune_colab.ipynb   ← LoRA training notebook (Colab T4)
├── 📄 evaluate.py                    ← Standalone evaluation script
├── 📄 run_benchmark.py               ← Regenerates benchmark_cache.json
│
├── 📁 checkpoint-25/                 ← Trained LoRA weights (included)
│   ├── adapter_model.safetensors
│   ├── adapter_config.json           ← r=8, alpha=32
│   └── tokenizer.json
│
├── 📁 finetuned_model/               ← Merged model (after merge step)
│
├── 📁 data/
│   ├── 📁 docs/                      ← 17 knowledge base .txt files
│   ├── 📁 faiss_index/               ← FAISS vector store (auto-built)
│   ├── 📁 pdfs/                      ← Drop PDFs here for RAG ingestion
│   ├── benchmark_cache.json          ← Pre-computed 50Q benchmark results
│   ├── reference_answers.json        ← Ground-truth Q&A pairs
│   └── finetune_data.jsonl           ← LoRA training data (ChatML format)
│
└── 📄 requirements.txt
```

---

## 🔮 Roadmap

| Status | Feature |
|:---:|:---|
| ✅ | 50-question auto-benchmark with persistent cache |
| ✅ | LoRA fine-tune checkpoint (`checkpoint-25`) included |
| ✅ | Analytics dashboard with Plotly + TABLE II |
| ✅ | PDF ingestion into RAG knowledge base |
| 🔜 | Push Qwen2.5 LoRA adapter to HuggingFace Hub |
| 🔜 | Full 3-system live demo on HuggingFace Spaces |
| 🔜 | Expand knowledge base: 17 → 50+ documents |
| 🔜 | RAGAS-style faithfulness + context precision metrics |
| 🔜 | Custom knowledge base upload via Streamlit UI |

---

Built with 🧠 by Adityax-07

Powered by Groq · HuggingFace · FAISS · LangChain · Streamlit

⭐ If CodeSage helped you understand the LLM trade-off space, drop a star!