title: Philosopher Chat
emoji: 🏛️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.15.1
app_file: app.py
pinned: false
license: mit
Philosopher Chat
A RAG (Retrieval-Augmented Generation) chatbot grounded in Western philosophical primary texts. Ask questions about nihilism, existentialism, epistemology, ethics, and more; answers are cited directly from 12 primary texts (~5,700 chunks).
Live demo: fikri0o0/philosopher-chat on HuggingFace Spaces
Features
| Feature | Detail |
|---|---|
| Query rewriting | Multi-query expansion (LLM paraphrases) fused with RRF for better recall |
| Two-stage retrieval | Hybrid (dense + BM25) → RRF → cross-encoder reranking |
| Corrective RAG | Abstains when retrieval confidence is low instead of hallucinating |
| RAGAS evaluation | 4 metrics, 3-stage ablation; each component quantified, not assumed |
| Streaming | Token-by-token via Google / Groq / OpenRouter |
| 15 LLMs | Gemma 4, Gemini, Llama 4, Qwen3, DeepSeek, Nemotron (all free tier) |
| Think blocks | Qwen3 / DeepSeek reasoning rendered as collapsible chains-of-thought |
| UMAP viz | 2D projection of all 5,700+ embeddings coloured by philosopher |
| Model comparison | Side-by-side latency + quality comparison across any two models |
| Extendable KB | Upload your own PDF/TXT to add new philosophers |
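The "Think blocks" row can be illustrated with a minimal sketch. It assumes the reasoning models wrap their chain-of-thought in `<think>...</think>` tags and that the UI accepts an HTML `<details>` element for the collapsible part; the function name `render_think_blocks` is hypothetical and is not taken from app.py.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in the model output.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def render_think_blocks(text: str) -> str:
    """Replace each complete <think> block with a collapsible <details> element."""

    def to_details(match: re.Match) -> str:
        reasoning = match.group(1).strip()
        return (
            "<details><summary>Reasoning</summary>\n\n"
            f"{reasoning}\n\n"
            "</details>\n\n"
        )

    return THINK_RE.sub(to_details, text)


raw = "<think>Recall the madman passage first.</think>Nietzsche announces it in The Gay Science."
print(render_think_blocks(raw))
```

This sketch only handles complete blocks. During token-by-token streaming, a block whose closing tag has not arrived yet is left as plain text until the tag shows up.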
Knowledge Base
| Philosopher | Works |
|---|---|
| Nietzsche | Thus Spoke Zarathustra, Beyond Good and Evil, On the Genealogy of Morality, The Birth of Tragedy |
| Schopenhauer | Essays of Arthur Schopenhauer |
| Hume | An Enquiry Concerning Human Understanding |
| Russell | The Problems of Philosophy |
| Marcus Aurelius | Meditations |
| Plato | The Republic |
| Mill | Utilitarianism |
| Epictetus | The Enchiridion |
| Kant | Fundamental Principles of the Metaphysic of Morals |
All texts are public domain, sourced from Project Gutenberg.
Tech Stack
| Layer | Tool |
|---|---|
| LLM routing | 15 models via Google AI Studio, Groq, OpenRouter (all free tier) |
| Embeddings | google/embeddinggemma-300m (HuggingFace, 768-dim) |
| Query transform | Multi-query rewriting (LLM paraphrases → RRF) |
| Retrieval | Hybrid (dense + BM25) → RRF fusion → cross-encoder rerank |
| Reranker | BAAI/bge-reranker-v2-m3 (multilingual cross-encoder) |
| Guardrail | Corrective RAG: cosine-gated abstention on out-of-corpus queries |
| Evaluation | RAGAS metrics (faithfulness, relevancy, context precision/recall) |
| RAG Framework | LangChain LCEL (no chains, direct composition) |
| UI | Gradio 6 |
| Deployment | HuggingFace Spaces |
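The RRF fusion step named in the Retrieval row can be sketched as follows. This is a generic Reciprocal Rank Fusion over lists of chunk IDs, not the code in rag_chain.py; the function name `rrf_fuse`, the constant `k=60` (the value conventionally used for RRF) and the example chunk IDs are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=20):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in fused[:top_n]]


# Hypothetical rankings from the dense and the BM25 retriever.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
print(rrf_fuse([dense, sparse]))  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF uses only ranks, the dense cosine scores and the BM25 scores never need to be put on a common scale. The same function can fuse the result lists of several query paraphrases.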
Retrieval Architecture
Question
   │
   ├─ Query rewriting (LLM → original + paraphrases) ──────────┐
   │     each variant ↓                                        │
   ├─ Dense retrieval (EmbeddingGemma-300M → ChromaDB cosine) ─┤→ RRF fusion → top-20 pool
   ├─ Sparse retrieval (BM25 / rank-bm25) ─────────────────────┘
   │
   ├─ Cross-encoder rerank (BGE-reranker-v2-m3) → top-6
   │
   ├─ Corrective gate (cosine < threshold → abstain)
   │
   └─ LLM answer (grounded + cited from top-6 chunks)
The pattern follows modern production RAG: cheap recall first (multi-query hybrid), a precise cross-encoder rerank of the small pool, and an abstention gate so out-of-corpus questions get an honest "I don't know" instead of a hallucination.
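The abstention gate can be sketched in a few lines. The README does not say which similarity is gated or what the threshold is, so this sketch assumes the gate compares the query embedding with the best retrieved chunk embedding; the name `should_abstain` and the value `0.35` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_abstain(query_vec, chunk_vecs, threshold=0.35):
    """Return True when no retrieved chunk is similar enough to the query.

    threshold is an illustrative value, not the one configured in config.py.
    """
    if not chunk_vecs:
        return True  # nothing retrieved, nothing to ground an answer in
    best = max(cosine_similarity(query_vec, vec) for vec in chunk_vecs)
    return best < threshold


# Toy 2-D embeddings: an orthogonal chunk triggers abstention, an aligned one does not.
print(should_abstain([1.0, 0.0], [[0.0, 1.0]]))  # True
print(should_abstain([1.0, 0.0], [[1.0, 0.1]]))  # False
```

When the gate fires, the app can skip the LLM call and return a fixed "I don't know" message, which also saves a free-tier request.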
Evaluation
The pipeline is measured, not assumed. evaluate.py generates answers for a curated question set with reference answers across a 3-stage ablation (baseline → + reranker → + query rewrite), then an LLM judge scores four RAGAS metrics. Results render live in the app's Evaluation tab; full analysis in the evaluation notebook.
Measuring each component (12 questions, LLM-as-judge)
| Metric | Baseline (Hybrid) | + Reranker | + Query Rewrite | Δ (full) |
|---|---|---|---|---|
| Faithfulness | 0.40 | 0.44 | 0.43 | +0.03 |
| Answer Relevancy | 0.94 | 0.90 | 0.91 | −0.03 |
| Context Precision | 1.00 | 1.00 | 0.99 | −0.01 |
| Context Recall | 0.38 | 0.51 | 0.43 | +0.06 |
The reranker is the clear win: Context Recall jumps +0.14 (0.38 → 0.51) and Faithfulness rises (0.40 → 0.44), with only a small drop in Answer Relevancy (0.94 → 0.90). Query rewriting did not help this corpus: it reduced recall (0.51 → 0.43) and nudged Faithfulness down (0.44 → 0.43), while Answer Relevancy stayed roughly flat (0.90 → 0.91). That is the point of measuring: the data says ship the reranker and treat query rewriting as corpus-dependent rather than assuming more components are always better. Two-phase eval (generation, then judging) keeps it reproducible:
pip install -r requirements.txt
pip install --no-deps ragas && pip install -r requirements-eval.txt
python evaluate.py --generate # phase A: real retrieval + generation → eval_samples.json
# phase B: an LLM judge scores eval_samples.json → eval_results.json
Local Setup
1. Clone and install
git clone https://github.com/Fikri645/philosopher-chat
cd philosopher-chat
pip install -r requirements.txt
2. Set up API keys
# Create .env with your keys:
GOOGLE_API_KEY=... # https://ai.google.dev (free)
GROQ_API_KEY=... # https://console.groq.com (free)
OPENROUTER_API_KEY=... # https://openrouter.ai (free)
HF_TOKEN=... # https://huggingface.co/settings/tokens (for gated EmbeddingGemma)
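Before building the vectorstore, it can help to confirm that the keys are visible to Python. The sketch below only inspects environment variables and does not read .env itself; how the app loads .env is not shown in this README. Treating all four keys as required is an assumption, and `missing_keys` is a hypothetical helper.

```python
import os

# Key names as listed in the .env template above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "GROQ_API_KEY", "OPENROUTER_API_KEY", "HF_TOKEN")


def missing_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All keys are set.")
```

If only one LLM provider is used, the corresponding subset of keys plus HF_TOKEN may be enough; adjust `REQUIRED_KEYS` accordingly.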
3. Build the vectorstore (run once)
python ingest.py
Downloads 12 texts from Project Gutenberg, chunks them, embeds them with EmbeddingGemma-300M, and persists the result to vectorstore/. Takes ~5–10 min on first run (model download + embedding).
4. Run the app
python app.py
Open http://localhost:7860 in your browser.
Deploying to HuggingFace Spaces
- Fork or push to a new Space (SDK: Gradio)
- In Space Settings → Variables and Secrets, add:
  - GOOGLE_API_KEY
  - GROQ_API_KEY
  - OPENROUTER_API_KEY
  - HF_TOKEN (your HF token, needed to download the gated EmbeddingGemma model)
- On first boot the Space auto-ingests all 12 texts (~10 min); subsequent boots load the cached vectorstore.
Project Structure
philosopher-chat/
├── app.py            ← Gradio UI + event handlers
├── rag_chain.py      ← LangChain RAG pipeline (retrieval + LLM routing)
├── ingest.py         ← Data ingestion from Project Gutenberg
├── config.py         ← LLM options, embedding model, RAG parameters
├── requirements.txt
├── .gitignore
└── README.md