Spaces:
Sleeping
A newer version of the Streamlit SDK is available: 1.58.0
title: CodeSage
emoji: ๐ง
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.35.0
app_file: demo.py
pinned: false
๐งช CodeSage is a live, side-by-side AI research platform that fires the same programming question at three fundamentally different architectures โ Baseline LLM, RAG, and Fine-Tuning โ then auto-scores every answer on accuracy, hallucination, groundedness, relevance, and cost.
No cherry-picking. No manual grading. Real numbers, real trade-offs.
๐ Table of Contents
| Section | |
|---|---|
| โก | Benchmark Results |
| ๐ง | What is CodeSage? |
| โจ | Features |
| ๐๏ธ | Architecture |
| ๐ | Evaluation Pipeline |
| ๐ | Quick Start |
| ๐ | Knowledge Base |
| ๐ก | Decision Guide |
| ๐ ๏ธ | Tech Stack |
| ๐๏ธ | Project Structure |
| ๐ฎ | Roadmap |
โก Benchmark Results
Full evaluation:
3 systemsร50 Q&A pairsร8 metricsโ fully automated, zero manual grading
| ๐ Metric | ๐ต Baseline LLM | ๐ข RAG Chatbot | ๐ฃ Fine-Tuned (Qwen2.5 + LoRA) |
|---|---|---|---|
| ๐ฏ Answer Accuracy | 61.4% | 81.6% | 85.3% โจ |
| ๐ซ Hallucination Rate | 43.2% โ | 9.8% | 0.0% โจ |
| ๐ Answer Relevance | 0.714 | 0.768 | 0.891 โจ |
| ๐ Groundedness | โ | 0.87 โจ | โ |
| โก Avg Latency | ~1.2s | ~2.1s | ~0.4s โจ |
| ๐ฐ Cost / Query | ~$0.0020 | ~$0.0030 | $0.0002 โจ |
๐ Key Findings
| Insight | Detail |
|---|---|
| ๐ซ Hallucination gap | Baseline hallucinates on 43.2% of questions โ Fine-Tuning eliminates this entirely โ 0% |
| ๐ RAG cuts hallucination 4.4ร | From 43.2% โ 9.8% purely through grounded retrieval, no retraining needed |
| ๐ฐ Fine-Tuning is 10ร cheaper | $0.0002 vs ~$0.002 per query โ smaller model, fully local inference |
| โก Fine-Tuning is 3ร faster | 0.4s vs 1.2s โ no retrieval pipeline, no large-model API round-trip |
| ๐ฏ No universal winner | RAG wins on updatability ยท Fine-Tuning wins on cost/speed/precision ยท Baseline wins on zero-setup |
๐ง What is CodeSage?
CodeSage is a decision-making tool for AI engineers. When building a domain-specific assistant, you always hit the same three-way fork:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Domain-Specific AI Assistant โ
โโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโ
โผ โผ โผ
โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
โ BASELINE LLM โ โ RAG PIPELINE โ โ FINE-TUNING โ
โ โ โ โ โ โ
โ + Zero setup โ โ + Always fresh โ โ + 10x cheaper โ
โ + Broad topics โ โ + Grounded โ โ + 0% hallucin. โ
โ - Hallucinates โ โ - Retrieval lag โ โ - Hard to update โ
โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
CodeSage makes this trade-off visible and measurable โ same question, same moment, real output from all three.
โจ Features
|
๐ Side-by-Side Compare Three answers to one question, simultaneously, in one view |
๐ Auto Evaluation 8-metric LLM-as-Judge scores every response automatically |
๐ Winner Badge Best answer highlighted; hallucination flag raised on low-confidence |
|
๐ Analytics Dashboard Plotly charts + paper-style TABLE II aggregated over 50 benchmarks |
๐พ Persistent Cache Results stored in benchmark_cache.jsonโ instant reload, no re-running |
๐ PDF Ingestion Drop any PDF into data/pdfs/โ RAG ingests it automatically |
๐๏ธ Architecture
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ ๐ฅ๏ธ Streamlit UI โ
โ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โ
โ โ โก System 1 โ โ ๐ System 2 โ โ ๐ง System 3 โ โ
โ โ Baseline LLM โ โ RAG Pipeline โ โ Fine-Tuned โ โ
โ โโโโโโโโโโฌโโโโโโโโโโ โโโโโโโโโโโโฌโโโโโโโโโโโโโ โโโโโโโโโโฌโโโโโโโโโโ โ
โโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโ
โ โ โ
โผ โผ โผ
โโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโ
โ Groq API โ โ FAISS Index โ โ Qwen2.5-1.5B โ
โ Llama-3.1-8Bโ โ all-MiniLM-L6-v2 โ โ + LoRA Adapters โ
โ (zero-shot) โ โ (top-3 chunks) โ โ (PEFT, local) โ
โโโโโโโโโโโโโโโ โโโโโโโโโโฌโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโ
โ
Groq API (with context)
โ
โโโโโโโโโโโผโโโโโโโโโโโ
โ ๐๏ธ LLM-as-Judge โ
โ 8 metrics, auto โ
โโโโโโโโโโโโโโโโโโโโโโ
โก System 1 โ Baseline LLM
Sends the question directly to Llama-3.1-8B via Groq with a minimal system prompt. No extra knowledge. Represents what an off-the-shelf LLM can do โ the floor every other system must beat.
๐ System 2 โ RAG Pipeline
- Question โ
all-MiniLM-L6-v2embedding - Top-3 chunks retrieved from FAISS vector store (17 documents)
- Chunks injected as context into Llama-3.1-8B via Groq
- Groundedness scored โ answers must be traceable to retrieved text
๐ง System 3 โ Fine-Tuned Model
Qwen2.5-1.5B fine-tuned with LoRA (r=8, ฮฑ=32) on curated CS Q&A pairs via Google Colab T4 GPU. Adapters loaded locally via peft โ zero cloud inference cost, sub-second latency.
๐ Evaluation Pipeline
Each answer is auto-scored by an LLM judge across 8 dimensions:
| Icon | Metric | Description | Unit |
|---|---|---|---|
| ๐ฏ | Answer Accuracy | Cosine similarity of answer vs reference embedding | % |
| ๐ | Groundedness | Cosine similarity of answer vs retrieved context | 0โ1 |
| ๐ซ | Hallucination Rate | % of answers with accuracy < 0.5 | % |
| ๐ | Answer Relevance | Cosine similarity of answer vs question | 0โ1 |
| ๐ | Faithfulness (ROUGE-L) | Token overlap with source context or reference | 0โ1 |
| โฑ๏ธ | Avg Response Time | Mean latency per query | sec |
| ๐ฐ | Cost per Query | Token-count-based cost estimate | USD |
| โญ | Overall Score | 30% Acc + 20% Ground + 20% (1โHR) + 15% Rel + 15% Faith | 1โ5 |
๐ Quick Start
Step 1 โ Clone & Install
git clone https://github.com/Adityax-07/LLM-vs-RAG-vs-Fine-Tuning-.git
cd LLM-vs-RAG-vs-Fine-Tuning-
pip install -r requirements.txt
Step 2 โ Configure API Key
echo "GROQ_API_KEY=your_key_here" > .env
๐ Free key at console.groq.com
Step 3 โ Launch
streamlit run demo.py
FAISS vector store builds automatically on first launch. Systems 1 & 2 are ready instantly.
Step 4 โ (Optional) Activate Fine-Tuned Model
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-1.5B-Instruct')
model = PeftModel.from_pretrained(base, 'checkpoint-25')
model.merge_and_unload().save_pretrained('finetuned_model')
AutoTokenizer.from_pretrained('checkpoint-25').save_pretrained('finetuned_model')
"
Or open
system3_finetune_colab.ipynbin Google Colab to train from scratch on a free T4 GPU (~10 min).
Step 5 โ Regenerate Benchmark (optional)
# Pre-computed results already included in data/benchmark_cache.json
python run_benchmark.py
๐ Knowledge Base
The RAG system retrieves from 17 hand-crafted topic documents in data/docs/:
๐งฎ Algorithms & DSAbinary_searchsorting_algorithmsdynamic_programminggraph_algorithmstreeslinked_liststack_queuerecursionbacktracking
|
๐ More DSAgreedy_algorithmshashingstring_algorithmstwo_pointersbig_o_notationheaps
|
๐ Web & Toolingreact_hooksrest_apijavascript_promisescss_flexboxtypescript_basicssql_basicsgit_basics
|
๐ก Decision Guide
| ๐ค Situation | โ Best Choice | ๐ Why |
|---|---|---|
| Prototyping or general queries | Baseline LLM | Zero setup, covers broad topics well |
| Knowledge changes frequently | RAG | Update docs without retraining |
| Fixed domain, cost/latency matters | Fine-Tuning | 10ร cheaper, 3ร faster, 0% hallucination |
| Need citations & traceability | RAG | Groundedness score + visible source chunks |
| Production with tight latency SLA | Fine-Tuning | Local inference, no API round-trip |
๐ ๏ธ Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| ๐ UI |
|
3-way comparison dashboard + analytics charts |
| โก LLM | Llama-3.1-8B โ Baseline + RAG generation | |
| ๐ค Embeddings | all-MiniLM-L6-v2 โ RAG semantic retrieval |
|
| ๐ Vector DB | CPU-based semantic search over knowledge base | |
| ๐ง Fine-Tuning |
|
LoRA adapter (r=8, ฮฑ=32) on Qwen2.5-1.5B |
| ๐๏ธ Base Model | Alibaba's compact LLM โ LoRA fine-tuned locally | |
| โ๏ธ Training | Free T4 GPU โ LoRA training in ~10 minutes | |
| ๐ Orchestration | RAG pipeline, FAISS integration, PDF ingestion | |
| ๐ Metrics | ROUGE-L + cosine similarity for auto-evaluation |
๐๏ธ Project Structure
๐ฆ LLM-vs-RAG-vs-Fine-Tuning/
โ
โโโ ๐ demo.py โ Streamlit app (main entry point)
โโโ ๐ system1_baseline.py โ Baseline LLM via Groq API
โโโ ๐ system2_rag.py โ RAG pipeline: FAISS + LangChain + Groq
โโโ ๐ system3_inference.py โ Fine-tuned model inference (PEFT)
โโโ ๐ system3_finetune_colab.ipynb โ LoRA training notebook (Colab T4)
โโโ ๐ evaluate.py โ Standalone evaluation script
โโโ ๐ run_benchmark.py โ Regenerates benchmark_cache.json
โ
โโโ ๐ checkpoint-25/ โ Trained LoRA weights (included)
โ โโโ adapter_model.safetensors
โ โโโ adapter_config.json โ r=8, alpha=32
โ โโโ tokenizer.json
โ
โโโ ๐ finetuned_model/ โ Merged model (after merge step)
โ
โโโ ๐ data/
โ โโโ ๐ docs/ โ 17 knowledge base .txt files
โ โโโ ๐ faiss_index/ โ FAISS vector store (auto-built)
โ โโโ ๐ pdfs/ โ Drop PDFs here for RAG ingestion
โ โโโ benchmark_cache.json โ Pre-computed 50Q benchmark results
โ โโโ reference_answers.json โ Ground-truth Q&A pairs
โ โโโ finetune_data.jsonl โ LoRA training data (ChatML format)
โ
โโโ ๐ requirements.txt
๐ฎ Roadmap
| Status | Feature |
|---|---|
| โ | 50-question auto-benchmark with persistent cache |
| โ | LoRA fine-tune checkpoint (checkpoint-25) included |
| โ | Analytics dashboard with Plotly + TABLE II |
| โ | PDF ingestion into RAG knowledge base |
| ๐ | Push Qwen2.5 LoRA adapter to HuggingFace Hub |
| ๐ | Full 3-system live demo on HuggingFace Spaces |
| ๐ | Expand knowledge base: 17 โ 50+ documents |
| ๐ | RAGAS-style faithfulness + context precision metrics |
| ๐ | Custom knowledge base upload via Streamlit UI |
Built with ๐ง by Adityax-07
Powered by Groq ยท HuggingFace ยท FAISS ยท LangChain ยท Streamlit
โญ If CodeSage helped you understand the LLM trade-off space, drop a star!