CodeSage / README.md
Aditya
Add live HuggingFace Spaces demo link to README
b8a0f1b
|
Raw
History Blame Contribute Delete
21.3 kB

A newer version of the Streamlit SDK is available: 1.58.0

Upgrade
metadata
title: CodeSage
emoji: ๐Ÿง™
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.35.0
app_file: demo.py
pinned: false
Typing SVG


๐Ÿงช CodeSage is a live, side-by-side AI research platform that fires the same programming question at three fundamentally different architectures โ€” Baseline LLM, RAG, and Fine-Tuning โ€” then auto-scores every answer on accuracy, hallucination, groundedness, relevance, and cost.

No cherry-picking. No manual grading. Real numbers, real trade-offs.

๐Ÿ“Œ Table of Contents

Section
โšก Benchmark Results
๐Ÿง  What is CodeSage?
โœจ Features
๐Ÿ—๏ธ Architecture
๐Ÿ“Š Evaluation Pipeline
๐Ÿš€ Quick Start
๐Ÿ“š Knowledge Base
๐Ÿ’ก Decision Guide
๐Ÿ› ๏ธ Tech Stack
๐Ÿ—‚๏ธ Project Structure
๐Ÿ”ฎ Roadmap

โšก Benchmark Results

Full evaluation: 3 systems ร— 50 Q&A pairs ร— 8 metrics โ€” fully automated, zero manual grading

๐Ÿ“ Metric ๐Ÿ”ต Baseline LLM ๐ŸŸข RAG Chatbot ๐ŸŸฃ Fine-Tuned (Qwen2.5 + LoRA)
๐ŸŽฏ Answer Accuracy 61.4% 81.6% 85.3% โœจ
๐Ÿšซ Hallucination Rate 43.2% โŒ 9.8% 0.0% โœจ
๐Ÿ” Answer Relevance 0.714 0.768 0.891 โœจ
๐Ÿ“Œ Groundedness โ€” 0.87 โœจ โ€”
โšก Avg Latency ~1.2s ~2.1s ~0.4s โœจ
๐Ÿ’ฐ Cost / Query ~$0.0020 ~$0.0030 $0.0002 โœจ

๐Ÿ”‘ Key Findings

Insight Detail
๐Ÿšซ Hallucination gap Baseline hallucinates on 43.2% of questions โ€” Fine-Tuning eliminates this entirely โ†’ 0%
๐Ÿ“‰ RAG cuts hallucination 4.4ร— From 43.2% โ†’ 9.8% purely through grounded retrieval, no retraining needed
๐Ÿ’ฐ Fine-Tuning is 10ร— cheaper $0.0002 vs ~$0.002 per query โ€” smaller model, fully local inference
โšก Fine-Tuning is 3ร— faster 0.4s vs 1.2s โ€” no retrieval pipeline, no large-model API round-trip
๐ŸŽฏ No universal winner RAG wins on updatability ยท Fine-Tuning wins on cost/speed/precision ยท Baseline wins on zero-setup

๐Ÿง  What is CodeSage?

CodeSage is a decision-making tool for AI engineers. When building a domain-specific assistant, you always hit the same three-way fork:

              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
              โ”‚       Domain-Specific AI Assistant        โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                   โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ–ผ                         โ–ผ                         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  BASELINE LLM   โ”‚     โ”‚   RAG PIPELINE   โ”‚     โ”‚   FINE-TUNING    โ”‚
โ”‚                 โ”‚     โ”‚                  โ”‚     โ”‚                  โ”‚
โ”‚ + Zero setup    โ”‚     โ”‚ + Always fresh   โ”‚     โ”‚ + 10x cheaper    โ”‚
โ”‚ + Broad topics  โ”‚     โ”‚ + Grounded       โ”‚     โ”‚ + 0% hallucin.   โ”‚
โ”‚ - Hallucinates  โ”‚     โ”‚ - Retrieval lag  โ”‚     โ”‚ - Hard to update โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

CodeSage makes this trade-off visible and measurable โ€” same question, same moment, real output from all three.


โœจ Features

๐Ÿ”€ Side-by-Side Compare

Three answers to one question,
simultaneously, in one view
๐Ÿ“Š Auto Evaluation

8-metric LLM-as-Judge scores
every response automatically
๐Ÿ† Winner Badge

Best answer highlighted;
hallucination flag raised on low-confidence
๐Ÿ“ˆ Analytics Dashboard

Plotly charts + paper-style TABLE II
aggregated over 50 benchmarks
๐Ÿ’พ Persistent Cache

Results stored in benchmark_cache.json
โ€” instant reload, no re-running
๐Ÿ“„ PDF Ingestion

Drop any PDF into data/pdfs/
โ€” RAG ingests it automatically

๐Ÿ—๏ธ Architecture

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘                           ๐Ÿ–ฅ๏ธ  Streamlit UI                               โ•‘
โ•‘  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ•‘
โ•‘  โ”‚  โšก System 1      โ”‚   โ”‚   ๐Ÿ” System 2         โ”‚   โ”‚  ๐Ÿง  System 3     โ”‚ โ•‘
โ•‘  โ”‚  Baseline LLM    โ”‚   โ”‚   RAG Pipeline        โ”‚   โ”‚  Fine-Tuned      โ”‚ โ•‘
โ•‘  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
            โ”‚                       โ”‚                         โ”‚
            โ–ผ                       โ–ผ                         โ–ผ
     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
     โ”‚  Groq API   โ”‚       โ”‚   FAISS Index    โ”‚      โ”‚  Qwen2.5-1.5B     โ”‚
     โ”‚ Llama-3.1-8Bโ”‚       โ”‚ all-MiniLM-L6-v2 โ”‚      โ”‚  + LoRA Adapters  โ”‚
     โ”‚ (zero-shot) โ”‚       โ”‚  (top-3 chunks)  โ”‚      โ”‚  (PEFT, local)    โ”‚
     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                    โ”‚
                             Groq API (with context)
                                    โ”‚
                          โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                          โ”‚   ๐Ÿ›๏ธ  LLM-as-Judge  โ”‚
                          โ”‚  8 metrics, auto    โ”‚
                          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

โšก System 1 โ€” Baseline LLM

Sends the question directly to Llama-3.1-8B via Groq with a minimal system prompt. No extra knowledge. Represents what an off-the-shelf LLM can do โ€” the floor every other system must beat.

๐Ÿ” System 2 โ€” RAG Pipeline

  1. Question โ†’ all-MiniLM-L6-v2 embedding
  2. Top-3 chunks retrieved from FAISS vector store (17 documents)
  3. Chunks injected as context into Llama-3.1-8B via Groq
  4. Groundedness scored โ€” answers must be traceable to retrieved text

๐Ÿง  System 3 โ€” Fine-Tuned Model

Qwen2.5-1.5B fine-tuned with LoRA (r=8, ฮฑ=32) on curated CS Q&A pairs via Google Colab T4 GPU. Adapters loaded locally via peft โ€” zero cloud inference cost, sub-second latency.


๐Ÿ“Š Evaluation Pipeline

Each answer is auto-scored by an LLM judge across 8 dimensions:

Icon Metric Description Unit
๐ŸŽฏ Answer Accuracy Cosine similarity of answer vs reference embedding %
๐Ÿ“Œ Groundedness Cosine similarity of answer vs retrieved context 0โ€“1
๐Ÿšซ Hallucination Rate % of answers with accuracy < 0.5 %
๐Ÿ” Answer Relevance Cosine similarity of answer vs question 0โ€“1
๐Ÿ“œ Faithfulness (ROUGE-L) Token overlap with source context or reference 0โ€“1
โฑ๏ธ Avg Response Time Mean latency per query sec
๐Ÿ’ฐ Cost per Query Token-count-based cost estimate USD
โญ Overall Score 30% Acc + 20% Ground + 20% (1โˆ’HR) + 15% Rel + 15% Faith 1โ€“5

๐Ÿš€ Quick Start

Step 1 โ€” Clone & Install

git clone https://github.com/Adityax-07/LLM-vs-RAG-vs-Fine-Tuning-.git
cd LLM-vs-RAG-vs-Fine-Tuning-
pip install -r requirements.txt

Step 2 โ€” Configure API Key

echo "GROQ_API_KEY=your_key_here" > .env

๐Ÿ†“ Free key at console.groq.com

Step 3 โ€” Launch

streamlit run demo.py

FAISS vector store builds automatically on first launch. Systems 1 & 2 are ready instantly.

Step 4 โ€” (Optional) Activate Fine-Tuned Model

python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
model = PeftModel.from_pretrained(base, 'checkpoint-25')
model.merge_and_unload().save_pretrained('finetuned_model')
AutoTokenizer.from_pretrained('checkpoint-25').save_pretrained('finetuned_model')
"

Or open system3_finetune_colab.ipynb in Google Colab to train from scratch on a free T4 GPU (~10 min).

Step 5 โ€” Regenerate Benchmark (optional)

# Pre-computed results already included in data/benchmark_cache.json
python run_benchmark.py

๐Ÿ“š Knowledge Base

The RAG system retrieves from 17 hand-crafted topic documents in data/docs/:

๐Ÿงฎ Algorithms & DSA

binary_search
sorting_algorithms
dynamic_programming
graph_algorithms
trees
linked_list
stack_queue
recursion
backtracking
๐Ÿ“ More DSA

greedy_algorithms
hashing
string_algorithms
two_pointers
big_o_notation
heaps
๐ŸŒ Web & Tooling

react_hooks
rest_api
javascript_promises
css_flexbox
typescript_basics
sql_basics
git_basics

๐Ÿ’ก Decision Guide

๐Ÿค” Situation โœ… Best Choice ๐Ÿ“ Why
Prototyping or general queries Baseline LLM Zero setup, covers broad topics well
Knowledge changes frequently RAG Update docs without retraining
Fixed domain, cost/latency matters Fine-Tuning 10ร— cheaper, 3ร— faster, 0% hallucination
Need citations & traceability RAG Groundedness score + visible source chunks
Production with tight latency SLA Fine-Tuning Local inference, no API round-trip

๐Ÿ› ๏ธ Tech Stack

Layer Technology Purpose
๐Ÿ“Š UI 3-way comparison dashboard + analytics charts
โšก LLM Llama-3.1-8B โ€” Baseline + RAG generation
๐Ÿค– Embeddings all-MiniLM-L6-v2 โ€” RAG semantic retrieval
๐Ÿ” Vector DB CPU-based semantic search over knowledge base
๐Ÿง  Fine-Tuning LoRA adapter (r=8, ฮฑ=32) on Qwen2.5-1.5B
๐Ÿ‹๏ธ Base Model Alibaba's compact LLM โ€” LoRA fine-tuned locally
โ˜๏ธ Training Free T4 GPU โ€” LoRA training in ~10 minutes
๐Ÿ”— Orchestration RAG pipeline, FAISS integration, PDF ingestion
๐Ÿ“ Metrics ROUGE-L + cosine similarity for auto-evaluation

๐Ÿ—‚๏ธ Project Structure

๐Ÿ“ฆ LLM-vs-RAG-vs-Fine-Tuning/
โ”‚
โ”œโ”€โ”€ ๐Ÿ“„ demo.py                         โ† Streamlit app (main entry point)
โ”œโ”€โ”€ ๐Ÿ“„ system1_baseline.py             โ† Baseline LLM via Groq API
โ”œโ”€โ”€ ๐Ÿ“„ system2_rag.py                  โ† RAG pipeline: FAISS + LangChain + Groq
โ”œโ”€โ”€ ๐Ÿ“„ system3_inference.py            โ† Fine-tuned model inference (PEFT)
โ”œโ”€โ”€ ๐Ÿ““ system3_finetune_colab.ipynb    โ† LoRA training notebook (Colab T4)
โ”œโ”€โ”€ ๐Ÿ“„ evaluate.py                     โ† Standalone evaluation script
โ”œโ”€โ”€ ๐Ÿ“„ run_benchmark.py                โ† Regenerates benchmark_cache.json
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ checkpoint-25/                  โ† Trained LoRA weights (included)
โ”‚   โ”œโ”€โ”€ adapter_model.safetensors
โ”‚   โ”œโ”€โ”€ adapter_config.json            โ† r=8, alpha=32
โ”‚   โ””โ”€โ”€ tokenizer.json
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ finetuned_model/                โ† Merged model (after merge step)
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ data/
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ docs/                       โ† 17 knowledge base .txt files
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ faiss_index/                โ† FAISS vector store (auto-built)
โ”‚   โ”œโ”€โ”€ ๐Ÿ“ pdfs/                       โ† Drop PDFs here for RAG ingestion
โ”‚   โ”œโ”€โ”€ benchmark_cache.json           โ† Pre-computed 50Q benchmark results
โ”‚   โ”œโ”€โ”€ reference_answers.json         โ† Ground-truth Q&A pairs
โ”‚   โ””โ”€โ”€ finetune_data.jsonl            โ† LoRA training data (ChatML format)
โ”‚
โ””โ”€โ”€ ๐Ÿ“„ requirements.txt

๐Ÿ”ฎ Roadmap

Status Feature
โœ… 50-question auto-benchmark with persistent cache
โœ… LoRA fine-tune checkpoint (checkpoint-25) included
โœ… Analytics dashboard with Plotly + TABLE II
โœ… PDF ingestion into RAG knowledge base
๐Ÿ”œ Push Qwen2.5 LoRA adapter to HuggingFace Hub
๐Ÿ”œ Full 3-system live demo on HuggingFace Spaces
๐Ÿ”œ Expand knowledge base: 17 โ†’ 50+ documents
๐Ÿ”œ RAGAS-style faithfulness + context precision metrics
๐Ÿ”œ Custom knowledge base upload via Streamlit UI

Built with ๐Ÿง  by Adityax-07


Powered by Groq ยท HuggingFace ยท FAISS ยท LangChain ยท Streamlit





โญ If CodeSage helped you understand the LLM trade-off space, drop a star!