
	# Telecom RAG - Interview Guide

	A Q&A guide for explaining this project's technical concepts in interviews.

	---

	## Section 1: Project Overview Questions

	### Q: "Tell me about your project."

	> I built a Telecom RAG (Retrieval-Augmented Generation) system that provides AI-powered answers for telecom operations support - covering 5G networks, 3GPP standards, troubleshooting, and network KPIs.
	>
	> It's not just a simple RAG - it implements 6 key innovations:
	> 1. Hybrid Search - combines semantic (dense) vectors with keyword (BM25) search using Reciprocal Rank Fusion
	> 2. Neural Query Router - classifies query intent to pick the optimal search strategy
	> 3. RAGAS Evaluation - 6-metric quality assessment with automatic abstention for low-confidence answers
	> 4. Semantic Caching - deduplicates similar queries via embedding similarity
	> 5. Glossary Enhancement - expands 81+ telecom acronyms for better retrieval
	> 6. Graceful Degradation - keeps returning something useful when individual components fail
	>
	> It's deployed on Google Cloud Run and handles a knowledge base of 12,500+ documents from 7 data sources.

	### Q: "Why did you build this? What problem does it solve?"

	> Telecom engineers deal with massive documentation - 3GPP specs alone are thousands of pages. They need quick, accurate answers when troubleshooting network issues or checking standards compliance.
	>
	> The problem with generic LLMs is they hallucinate telecom-specific details. My system grounds every answer in actual retrieved documents, evaluates faithfulness, and refuses to answer when it can't verify the response - which is critical in a domain where wrong answers can cause network outages.

	---

	## Section 2: RAG Architecture Questions

	### Q: "What is RAG and why did you use it?"

	> RAG = Retrieval-Augmented Generation. Instead of relying solely on an LLM's parametric knowledge (which can hallucinate), RAG first retrieves relevant documents from a knowledge base, then feeds them as context to the LLM.
	>
	> Pipeline: Query → Retrieve relevant docs → Build context → LLM generates grounded answer
	>
	> Why RAG over fine-tuning?
	> - No retraining needed when documents update (just re-index)
	> - Can cite specific sources (traceability)
	> - Can detect when it doesn't have enough information (abstention)
	> - Much cheaper than fine-tuning large models

	### Q: "Walk me through what happens when a user asks a question."

	> 9-step pipeline:
	>
	> 1. Rate Limit Check - Sliding window (50 req/min) prevents abuse
	> 2. Cache Check - Embed query, search for similar cached queries (threshold: 0.95 cosine similarity). If hit, return cached answer in <100ms
	> 3. Query Routing - Neural router classifies intent: factual (dense search), procedural (hybrid), or keyword (BM25). Also predicts document category
	> 4. Query Enhancement - Glossary expands acronyms: "HARQ" → "HARQ (Hybrid Automatic Repeat Request)"
	> 5. Hybrid Retrieval - Dense search (ChromaDB) + BM25 keyword search, merged with Reciprocal Rank Fusion
	> 6. Context Building - Top-6 results assembled into a context string with a 1500-token budget
	> 7. LLM Generation - GPT-4o-mini generates a grounded answer using a structured prompt with question repetition
	> 8. RAGAS Evaluation - 6 metrics computed: faithfulness, relevancy, context precision/recall, confidence, trust score. If below threshold → abstain
	> 9. Cache & Return - Cache the response, display with metrics and source citations
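
Step 1 of the pipeline above can be sketched as a small in-memory sliding-window limiter. This is only an illustration under stated assumptions: the class name `SlidingWindowRateLimiter`, the `allow` method, and the single shared window (rather than one per user) are hypothetical and not taken from the project's code. Only the 50 requests per minute default comes from the text.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=50, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of accepted requests, oldest first

    def allow(self, now=None):
        # `now` is injectable so the behaviour can be checked without sleeping.
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True


# Tiny limit so the effect is visible: 2 requests per 60 seconds.
limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=60.0)
decisions = [limiter.allow(now=t) for t in (0.0, 1.0, 2.0, 60.5)]
print(decisions)  # [True, True, False, True]
```

The third request is rejected because two requests are already inside the window. The fourth is accepted because the request at t=0 has aged out by then.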

	### Q: "Why hybrid search instead of just semantic search?"

	> Semantic (dense) search excels at understanding meaning - "How to fix antenna issues" matches documents about "VSWR troubleshooting." But it struggles with exact terms - error codes, alarm IDs, 3GPP specification numbers.
	>
	> BM25 (keyword) search is the opposite - great for exact matches but misses semantic similarity.
	>
	> Hybrid search combines both. I use Reciprocal Rank Fusion (RRF) to merge the results:
	> ```
	> score(doc) = 1/(60+rank_in_dense) + 1/(60+rank_in_BM25)
	> ```
	> This gives documents that appear in both result sets a higher score, while still surfacing documents that are strong in either signal.
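
A minimal sketch of the fusion formula above, assuming each retriever returns an ordered list of document IDs with 1-based ranks. The function name `reciprocal_rank_fusion` and the sample document IDs are illustrative and not taken from the project's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs. A doc missing from a list adds nothing for it."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID to keep output deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense (semantic) ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 (keyword) ranking

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` and `doc_a` appear in both lists and so outrank `doc_d` and `doc_c`, which each appear in only one.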

	---

	## Section 3: Algorithm Deep Dives

	### Q: "Explain Reciprocal Rank Fusion. Why k=60?"

	> Problem: Dense similarity scores (typically 0 to 1) and BM25 scores (non-negative, with no upper bound) are on different scales. You can't simply add them.
	>
	> RRF Solution: Instead of using raw scores, use rank positions:
	> ```
	> RRF(doc) = Σ 1/(k + rank_i)
	> ```
	>
	> Why k=60? It's a smoothing constant. A smaller k (like 1) would heavily favor top-ranked documents. k=60 flattens the curve: rank 1 contributes 1/61 and rank 5 contributes 1/65, so the gap between them is small and both dense and BM25 signals contribute meaningfully. The value comes from the original RRF paper (Cormack, Clarke and Büttcher, SIGIR 2009), which used k=60 in its experiments on TREC collections, and it has been the conventional default since.
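
To make the effect of k concrete, the short calculation below compares how much more a rank-1 hit is worth than a rank-5 hit under k=1 and k=60. The helper name `rrf_term` is illustrative.

```python
def rrf_term(rank, k):
    """Contribution of a single ranked list to a document's RRF score."""
    return 1.0 / (k + rank)


ratios = {k: rrf_term(1, k) / rrf_term(5, k) for k in (1, 60)}
for k, ratio in ratios.items():
    print(f"k={k}: rank 1 is worth {ratio:.2f}x rank 5")
# k=1: rank 1 is worth 3.00x rank 5
# k=60: rank 1 is worth 1.07x rank 5
```

With k=1, the top hit of one retriever can outweigh everything else. With k=60, agreement between the two retrievers matters more than the exact position in either list.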

	### Q: "How does your evaluation system work?"

	> I implemented a RAGAS-style evaluation with 6 metrics:
	>
	> 1. Faithfulness - Extract factual claims from the answer, check each against the context via keyword overlap. Score = supported/total claims
	> 2. Relevancy - Measure keyword overlap between question terms and answer terms (technical terms weighted 2x)
	> 3. Context Precision - What fraction of retrieved chunks were actually relevant (similarity > 0.5)
	> 4. Context Recall - What fraction of answer claims are covered by the context
	> 5. Confidence - Average of top-3 retrieval similarity scores
	> 6. Trust Score - Weighted combination: 40% faithfulness + 30% relevancy + 20% precision + 10% confidence
	>
	> Abstention Logic: If any metric drops below 0.3 (very low), the system refuses to answer and explains why. This prevents hallucinated responses in a safety-critical domain.
	>
	> There's also an optional LLM-as-judge mode where GPT-4o-mini itself scores faithfulness and relevancy (more accurate but slower).
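
The trust score weights and the abstention rule described above can be sketched as follows. The `Metrics` dataclass, the function names, and the choice to include the trust score itself in the threshold check are assumptions made for illustration. Only the weights and the 0.3 threshold come from the text.

```python
from dataclasses import dataclass

ABSTAIN_THRESHOLD = 0.3  # "any metric below 0.3" rule from the text


@dataclass
class Metrics:
    faithfulness: float
    relevancy: float
    context_precision: float
    context_recall: float
    confidence: float


def trust_score(m):
    """40% faithfulness + 30% relevancy + 20% precision + 10% confidence."""
    return (0.4 * m.faithfulness + 0.3 * m.relevancy
            + 0.2 * m.context_precision + 0.1 * m.confidence)


def should_abstain(m, threshold=ABSTAIN_THRESHOLD):
    """Return (abstain, names of the metrics that fell below the threshold)."""
    scores = {
        "faithfulness": m.faithfulness,
        "relevancy": m.relevancy,
        "context_precision": m.context_precision,
        "context_recall": m.context_recall,
        "confidence": m.confidence,
        "trust_score": trust_score(m),  # assumption: the combined score is checked too
    }
    low = [name for name, value in scores.items() if value < threshold]
    return bool(low), low


good = Metrics(0.9, 0.8, 0.7, 0.6, 0.5)
weak = Metrics(0.2, 0.8, 0.7, 0.6, 0.5)  # low faithfulness only
print(round(trust_score(good), 2), should_abstain(good))
print(round(trust_score(weak), 2), should_abstain(weak))
```

Returning the names of the failing metrics is what lets the system explain why it declined to answer.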

	### Q: "What is HyDE and why did you implement it?"

	> HyDE = Hypothetical Document Embeddings.
	>
	> Problem: Short queries like "What is HARQ?" don't have much semantic content for embedding-based search.
	>
	> Solution: Ask the LLM to generate a hypothetical ideal answer, then search for documents similar to that hypothetical answer.
	>
	> Why it works: The hypothetical answer is closer in embedding space to actual documents about HARQ than the short query alone. It bridges the "query-document gap."
	>
	> Trade-off: +2-3.5% accuracy but adds an extra LLM call (~1s latency). That's why I made it optional and disabled by default.
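
The HyDE flow above can be sketched with three injected callables. All of them (`generate`, `embed`, `search`) are hypothetical placeholders for the LLM, the embedding model, and the vector store. The stubs below exist only so the example runs without any external service.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], List[str]],
    top_k: int = 6,
) -> List[str]:
    """Retrieve using the embedding of a hypothetical answer instead of the query."""
    prompt = f"Write a short, factual passage that answers: {query}"
    hypothetical_answer = generate(prompt)  # this is the extra LLM call
    return search(embed(hypothetical_answer), top_k)


# Stub callables (assumptions, not real services).
seen = {}


def fake_generate(prompt):
    return "HARQ combines forward error correction with retransmissions."


def fake_embed(text):
    seen["embedded_text"] = text  # record what was actually embedded
    return [float(len(text))]


def fake_search(vector, top_k):
    return [f"doc_{i}" for i in range(top_k)]


results = hyde_search("What is HARQ?", fake_generate, fake_embed, fake_search, top_k=2)
print(seen["embedded_text"])
print(results)
```

The key point is that the text passed to `embed` is the generated passage, not the original short query.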

	### Q: "How does the query router work?"

	> It's a prototype-based nearest neighbor classifier using embedding similarity:
	>
	> 1. Pre-compute embeddings for 6 prototype questions per strategy (DENSE, HYBRID, KEYWORD)
	> 2. When a query arrives, embed it
	> 3. Compute cosine similarity against all prototypes
	> 4. Pick the strategy whose prototypes have the highest max similarity
	>
	> Example:
	> - "What is 5G NR?" → highest sim with DENSE prototypes → pure semantic search
	> - "How to fix VSWR alarm?" → highest sim with HYBRID prototypes → hybrid search
	> - "Error 5301" → highest sim with KEYWORD prototypes → BM25 search
	>
	> This is lightweight (no ML training needed) and effective because telecom queries follow predictable patterns.
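
A minimal sketch of the four routing steps above. Toy 3-dimensional vectors stand in for real embeddings, there are two prototypes per strategy instead of six, and the names `cosine`, `route`, and `PROTOTYPES` are illustrative rather than taken from the project's code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, prototypes):
    """Pick the strategy whose best-matching prototype is closest to the query."""
    best = {
        strategy: max(cosine(query_vec, proto) for proto in vectors)
        for strategy, vectors in prototypes.items()
    }
    return max(best, key=best.get)


# Hypothetical pre-computed prototype embeddings.
PROTOTYPES = {
    "DENSE": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "HYBRID": [[0.1, 1.0, 0.1], [0.2, 0.9, 0.0]],
    "KEYWORD": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}

print(route([0.05, 0.1, 0.95], PROTOTYPES))  # KEYWORD
```

In a real deployment the prototype vectors would be computed once at startup, so routing costs one embedding call plus a handful of dot products per query.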

	---

	## Section 4: Infrastructure & Deployment

	### Q: "How did you deploy this?"

	> Google Cloud Run - serverless container platform.
	>
	> Key decisions:
	> - Scale-to-zero: Min instances = 0, so no cost when idle
	> - Pre-built indexes: ChromaDB and BM25 indexes are compressed into the Docker image so cold starts don't require rebuilding
	> - Feature flags: Reranker disabled on cloud (saves ~1GB memory), Redis disabled (no managed instance), hybrid search enabled
	> - Environment variables: API keys passed via `--set-env-vars`, not baked into the image
	>
	> Deployment flow: `deploy-cloudbuild.sh` → sources `.env` → `gcloud run deploy --source=.` → Cloud Build creates Docker image → deploys to Cloud Run

	### Q: "How do you handle cold starts?"

	> Cloud Run cold starts are a challenge. My mitigations:
	>
	> 1. Pre-built ChromaDB - compressed as `chroma_db.tar.gz` in the Docker image, extracted at build time
	> 2. Cached BM25 index - serialized as `bm25_index.pkl`, loaded from cache on startup instead of rebuilding from documents
	> 3. OpenAI embeddings - no local model download needed (vs sentence-transformers which needs ~1GB download)
	> 4. Lazy reranker - cross-encoder model only loaded on first use (not at startup)
	> 5. Streamlit caching - `@st.cache_resource` ensures components initialize once per instance
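
Mitigations 2 and 4 above follow two generic patterns: load a serialized index if it exists, and defer expensive model loading until first use. The sketch below shows both under stated assumptions. `load_or_build`, `get_reranker`, and the toy index contents are hypothetical, and the reranker is a stand-in object rather than a real cross-encoder.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


def load_or_build(cache_path, build):
    """Load a pickled index if present; otherwise build it and write the cache."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)  # only unpickle files you produced yourself
    index = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(index, fh)
    return index


@lru_cache(maxsize=1)
def get_reranker():
    """Stand-in for an expensive model load; runs once, on first call."""
    return {"name": "reranker-placeholder"}


build_calls = []


def build_index():
    build_calls.append(1)  # count how often the slow path runs
    return {"harq": [0, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "bm25_index.pkl"
    first = load_or_build(cache_file, build_index)   # builds and writes the cache
    second = load_or_build(cache_file, build_index)  # loads from the cache

print(len(build_calls), first == second)
```

The second call never touches `build_index`, which is the same saving a container gets when the index file is already baked into the image.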

	### Q: "Explain your graceful degradation strategy."

	> Every component is designed to fail gracefully:
	>
	> | Component Failure | Fallback |
	> |---|---|
	> | Redis down | In-memory cache (non-persistent) |
	> | LLM unavailable | Returns raw retrieved context |
	> | Reranker fails to load | Skips reranking, returns hybrid results |
	> | BM25 index missing | Falls back to dense-only search |
	> | HuggingFace rate limit | Falls back to public dataset, then built-in KB |
	> | OpenAI API key invalid | Shows helpful error with diagnostic command |
	>
	> The system always tries to provide something useful rather than crashing. In production, this means the service stays up even when external dependencies have issues.
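
Two of the fallbacks in the table (BM25 index missing, LLM unavailable) can be sketched as thin wrappers around the failing call. The function names and the stub callables are hypothetical. The real project may structure its error handling differently.

```python
import logging

logger = logging.getLogger("rag_fallback_sketch")


def retrieve_with_fallback(query, dense_search, bm25_search, fuse):
    """Return (hits, mode), where mode is 'hybrid' or 'dense-only'."""
    dense_hits = dense_search(query)
    try:
        keyword_hits = bm25_search(query)
    except Exception as exc:  # e.g. the BM25 index file is missing
        logger.warning("BM25 unavailable, falling back to dense-only: %s", exc)
        return dense_hits, "dense-only"
    return fuse(dense_hits, keyword_hits), "hybrid"


def answer_with_fallback(query, context, llm):
    """Return the LLM answer, or the raw retrieved context if the LLM call fails."""
    try:
        return llm(query, context)
    except Exception as exc:
        logger.warning("LLM unavailable, returning raw context: %s", exc)
        return context


# Stubs that simulate both failures at once.
def broken_bm25(query):
    raise FileNotFoundError("bm25_index.pkl")


def broken_llm(query, context):
    raise ConnectionError("LLM endpoint unreachable")


hits, mode = retrieve_with_fallback(
    "VSWR alarm", lambda q: ["doc_a", "doc_b"], broken_bm25, lambda d, k: d + k
)
reply = answer_with_fallback("VSWR alarm", "\n".join(hits), broken_llm)
print(mode)
print(reply)
```

Returning the mode alongside the hits lets the UI tell the user which degraded path produced the answer.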

	---

	## Section 5: Design Decisions

	### Q: "Why ChromaDB instead of Pinecone/Weaviate/FAISS?"

	> ChromaDB was chosen because:
	> - Embedded - runs inside the application process, no separate database server needed
	> - Persistent - data survives container restarts via disk storage
	> - Lightweight - fits within the 4GB of memory allocated to the Cloud Run instance
	> - Python-native - first-class Python API, easy to integrate
	>
	> Why not alternatives:
	> - Pinecone - managed service, adds latency for network calls, costs money per vector
	> - FAISS - no built-in persistence, metadata filtering is manual
	> - Weaviate - requires separate server, overkill for this scale

	### Q: "Why GPT-4o-mini instead of GPT-4o?"

	> Cost vs quality trade-off:
	> - GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens
	> - GPT-4o: ~$5/1M input tokens, ~$15/1M output tokens
	>
	> For RAG, the quality difference is small because answer quality depends more on the retrieved context than on the model's parametric knowledge. At the prices above, GPT-4o-mini is roughly 25-33x cheaper (33x on input, 25x on output) and 2-3x faster, while being sufficient for summarizing and citing retrieved documents.
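
As a worked example of the cost comparison, the snippet below prices a single query using the per-million-token figures quoted above. The 1,500 input tokens match the context budget mentioned earlier (prompt overhead is ignored), and the 300 output tokens are an assumed answer length, not a measured value.

```python
# USD per 1M tokens, as quoted in the answer above.
PRICES_PER_MILLION = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def query_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed prices."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


mini = query_cost("gpt-4o-mini", 1500, 300)  # 300 output tokens is an assumption
full = query_cost("gpt-4o", 1500, 300)
print(f"gpt-4o-mini: ${mini:.6f}  gpt-4o: ${full:.6f}  ratio: {full / mini:.1f}x")
```

Because input and output are discounted by different factors, the blended ratio depends on how long the answers are relative to the context.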

	### Q: "Why 125-token chunks?"

	> Based on the Telco-RAG paper and empirical testing:
	> - Too small (50 tokens): loses context, fragments sentences
	> - Too large (500 tokens): reduces retrieval precision, wastes token budget
	> - 125 tokens is optimal for Q&A-style telecom documents
	>
	> I also use dynamic chunk sizing based on category:
	> - Standards docs: 125 tokens (concise, factual)
	> - Network operations: 250 tokens (event-based, needs more context)
	> - Performance data: 500 tokens (time-series, needs surrounding context)
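
Category-based chunk sizing can be sketched as a lookup followed by a fixed-size split. The category keys, the fallback size, and the use of whitespace-separated words as a rough stand-in for model tokens are all assumptions. A real pipeline would count tokens with the model's tokenizer and might add overlap between chunks.

```python
# Chunk sizes from the list above; the dictionary keys are hypothetical labels.
CHUNK_TOKENS = {
    "standards": 125,
    "network_operations": 250,
    "performance": 500,
}
DEFAULT_CHUNK_TOKENS = 125


def chunk_text(text, category):
    """Split text into consecutive chunks sized for the document category."""
    size = CHUNK_TOKENS.get(category, DEFAULT_CHUNK_TOKENS)
    tokens = text.split()  # whitespace words approximate tokens here
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


sample = " ".join(f"w{i}" for i in range(300))  # a 300-"token" document
for category in CHUNK_TOKENS:
    chunks = chunk_text(sample, category)
    print(category, [len(c.split()) for c in chunks])
# standards [125, 125, 50]
# network_operations [250, 50]
# performance [300]
```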

	### Q: "Why did you set the cache similarity threshold at 0.95?"

	> It's intentionally strict because we want cache hits only for nearly identical questions:
	> - "What is HARQ in 5G?" and "What is HARQ in 5G NR?" → same intent, should cache
	> - "What is HARQ?" and "What is MIMO?" → different intent, should NOT cache
	>
	> At 0.95, only near-paraphrases match. At 0.90, you'd get false positives that return answers for different questions. The cost of a cache miss (re-running the pipeline, ~3s) is much lower than the cost of returning a wrong cached answer.
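
A minimal in-memory sketch of the threshold check described above. The `SemanticCache` class, its linear scan, and the 2-dimensional toy vectors are illustrative assumptions. A production cache would store real query embeddings and could use a vector index for lookup.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (query_embedding, answer)

    def put(self, query_vec, answer):
        self._entries.append((list(query_vec), answer))

    def get(self, query_vec):
        """Return the answer of the most similar cached query, or None on a miss."""
        best_score, best_answer = -1.0, None
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached answer about HARQ")
near_paraphrase = cache.get([0.99, 0.05])  # cosine is about 0.999, so this hits
different_topic = cache.get([0.7, 0.7])    # cosine is about 0.707, so this misses
print(near_paraphrase, different_topic)
```

Lowering `threshold` trades fewer pipeline runs for a higher chance of serving an answer to a different question, which is the trade-off the answer above argues against.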

	---

	## Section 6: Challenges & Learnings

	### Q: "What was the hardest challenge?"

	> Balancing retrieval precision vs recall in a specialized domain.
	>
	> Telecom has thousands of similar-sounding concepts (PDSCH/PUSCH, gNB/eNB, FR1/FR2). Pure semantic search would often retrieve the wrong related concept. The hybrid search with BM25 solved this because keyword matching catches exact acronyms that embeddings might confuse.
	>
	> Another challenge was empty answers from HuggingFace datasets - many TeleQnA entries had answer indices pointing to empty choices. I had to implement validation to filter these out (hence the "data quality report" in the ingestion pipeline).

	### Q: "What would you do differently if starting over?"

	> 1. Use a vector DB with built-in hybrid search (like Qdrant or Weaviate) instead of managing separate BM25 + ChromaDB
	> 2. Implement streaming responses - currently the UI waits for the full pipeline, but users would benefit from seeing partial results
	> 3. Use Google Cloud Secret Manager instead of environment variables for API keys
	> 4. Add automated evaluation with a test suite of known Q&A pairs to catch regressions

	---

	## Section 7: Metrics & Performance

	### Q: "What are the performance characteristics?"

	> | Metric | Value |
	> |--------|-------|
	> | Cold start | ~10-15s (Cloud Run) |
	> | Warm query (cache miss) | ~2-4s |
	> | Cache hit | <100ms |
	> | Knowledge base | 12,500+ docs |
	> | Index size | ~500MB |
	> | Memory usage | ~2-3GB |
	> | Concurrent users | ~10-20 (per instance) |

	### Q: "How would you scale this?"

	> Horizontal: Cloud Run auto-scales up to a maximum of 10 instances. Each instance handles its own requests independently since ChromaDB is embedded and read-only after ingestion.
	>
	> If I needed more:
	> 1. Move to a managed vector DB (Pinecone/Qdrant) for shared state across instances
	> 2. Add a Redis cluster for shared caching (currently per-instance)
	> 3. Use Cloud CDN for static Streamlit assets
	> 4. Implement async LLM calls to handle more concurrent requests per instance
	> 5. Consider Google Cloud Memorystore for managed Redis

	---

	## Quick Reference: Key Technical Terms

	| Term | What It Is | Where It's Used |
	|------|-----------|-----------------|
	| RAG | Retrieval-Augmented Generation | Core architecture pattern |
	| RRF | Reciprocal Rank Fusion | Merging dense + BM25 results |
	| HyDE | Hypothetical Document Embeddings | Optional retrieval improvement |
	| BM25 | Best Matching 25 (TF-IDF variant) | Keyword/sparse search |
	| HNSW | Hierarchical Navigable Small World | ChromaDB's vector index |
	| RAGAS | RAG Assessment framework | Answer quality evaluation |
	| TLM | Trustworthy Language Model | Trust score computation |
	| SON | Self-Organizing Networks | Telecom domain concept |
	| HARQ | Hybrid Automatic Repeat Request | Common telecom example query |
	| Cross-Encoder | Model scoring query-doc pairs | Reranking stage |