telecom-rag / docs /INTERVIEW_GUIDE.md
ajaymauryabbn's picture
feat: Telecom RAG System - Production Ready
eb731f7

Telecom RAG - Interview Guide

A comprehensive Q&A guide to explain this project's technical concepts in interviews.


Section 1: Project Overview Questions

Q: "Tell me about your project."

I built a Telecom RAG (Retrieval-Augmented Generation) system that provides AI-powered answers for telecom operations support - covering 5G networks, 3GPP standards, troubleshooting, and network KPIs.

It's not just a simple RAG - it implements 6 key innovations:

  1. Hybrid Search - combines semantic (dense) vectors with keyword (BM25) search using Reciprocal Rank Fusion
  2. Neural Query Router - classifies query intent to pick the optimal search strategy
  3. RAGAS Evaluation - 6-metric quality assessment with automatic abstention for low-confidence answers
  4. Semantic Caching - deduplicates similar queries via embedding similarity
  5. Glossary Enhancement - expands 81+ telecom acronyms for better retrieval
  6. Graceful Degradation - works at every level even when components fail

It's deployed on Google Cloud Run and handles a knowledge base of 12,500+ documents from 7 data sources.

Q: "Why did you build this? What problem does it solve?"

Telecom engineers deal with massive documentation - 3GPP specs alone are thousands of pages. They need quick, accurate answers when troubleshooting network issues or checking standards compliance.

The problem with generic LLMs is they hallucinate telecom-specific details. My system grounds every answer in actual retrieved documents, evaluates faithfulness, and refuses to answer when it can't verify the response - which is critical in a domain where wrong answers can cause network outages.


Section 2: RAG Architecture Questions

Q: "What is RAG and why did you use it?"

RAG = Retrieval-Augmented Generation. Instead of relying solely on an LLM's parametric knowledge (which can hallucinate), RAG first retrieves relevant documents from a knowledge base, then feeds them as context to the LLM.

Pipeline: Query β†’ Retrieve relevant docs β†’ Build context β†’ LLM generates grounded answer

Why RAG over fine-tuning?

  • No retraining needed when documents update (just re-index)
  • Can cite specific sources (traceability)
  • Can detect when it doesn't have enough information (abstention)
  • Much cheaper than fine-tuning large models

Q: "Walk me through what happens when a user asks a question."

9-step pipeline:

  1. Rate Limit Check - Sliding window (50 req/min) prevents abuse
  2. Cache Check - Embed query, search for similar cached queries (threshold: 0.95 cosine similarity). If hit, return cached answer in <100ms
  3. Query Routing - Neural router classifies intent: factual (dense search), procedural (hybrid), or keyword (BM25). Also predicts document category
  4. Query Enhancement - Glossary expands acronyms: "HARQ" β†’ "HARQ (Hybrid Automatic Repeat Request)"
  5. Hybrid Retrieval - Dense search (ChromaDB) + BM25 keyword search, merged with Reciprocal Rank Fusion
  6. Context Building - Top-6 results assembled into a context string with a 1500-token budget
  7. LLM Generation - GPT-4o-mini generates a grounded answer using a structured prompt with question repetition
  8. RAGAS Evaluation - 6 metrics computed: faithfulness, relevancy, context precision/recall, confidence, trust score. If below threshold β†’ abstain
  9. Cache & Return - Cache the response, display with metrics and source citations

Q: "Why hybrid search instead of just semantic search?"

Semantic (dense) search excels at understanding meaning - "How to fix antenna issues" matches documents about "VSWR troubleshooting." But it struggles with exact terms - error codes, alarm IDs, 3GPP specification numbers.

BM25 (keyword) search is the opposite - great for exact matches but misses semantic similarity.

Hybrid search combines both. I use Reciprocal Rank Fusion (RRF) to merge the results:

score(doc) = 1/(60+rank_in_dense) + 1/(60+rank_in_BM25)

This gives documents that appear in both result sets a higher score, while still surfacing documents that are strong in either signal.


Section 3: Algorithm Deep Dives

Q: "Explain Reciprocal Rank Fusion. Why k=60?"

Problem: Dense similarity scores (0 to 1) and BM25 scores (0 to unbounded) are on different scales. You can't simply add them.

RRF Solution: Instead of using raw scores, use rank positions:

RRF(doc) = Ξ£ 1/(k + rank_i)

Why k=60? It's a smoothing constant. A smaller k (like 1) would heavily favor top-ranked documents. k=60 makes the fusion more democratic - the difference between rank 1 and rank 5 is smaller, so both dense and BM25 signals contribute meaningfully. This value was empirically validated in the Telco-RAG paper and is also standard in NIST TREC benchmarks.

Q: "How does your evaluation system work?"

I implemented a RAGAS-style evaluation with 6 metrics:

  1. Faithfulness - Extract factual claims from the answer, check each against the context via keyword overlap. Score = supported/total claims
  2. Relevancy - Measure keyword overlap between question terms and answer terms (technical terms weighted 2x)
  3. Context Precision - What fraction of retrieved chunks were actually relevant (similarity > 0.5)
  4. Context Recall - What fraction of answer claims are covered by the context
  5. Confidence - Average of top-3 retrieval similarity scores
  6. Trust Score - Weighted combination: 40% faithfulness + 30% relevancy + 20% precision + 10% confidence

Abstention Logic: If any metric drops below 0.3 (very low), the system refuses to answer and explains why. This prevents hallucinated responses in a safety-critical domain.

There's also an optional LLM-as-judge mode where GPT-4o-mini itself scores faithfulness and relevancy (more accurate but slower).

Q: "What is HyDE and why did you implement it?"

HyDE = Hypothetical Document Embeddings.

Problem: Short queries like "What is HARQ?" don't have much semantic content for embedding-based search.

Solution: Ask the LLM to generate a hypothetical ideal answer, then search for documents similar to that hypothetical answer.

Why it works: The hypothetical answer is closer in embedding space to actual documents about HARQ than the short query alone. It bridges the "query-document gap."

Trade-off: +2-3.5% accuracy but adds an extra LLM call (~1s latency). That's why I made it optional and disabled by default.

Q: "How does the query router work?"

It's a prototype-based nearest neighbor classifier using embedding similarity:

  1. Pre-compute embeddings for 6 prototype questions per strategy (DENSE, HYBRID, KEYWORD)
  2. When a query arrives, embed it
  3. Compute cosine similarity against all prototypes
  4. Pick the strategy whose prototypes have the highest max similarity

Example:

  • "What is 5G NR?" β†’ highest sim with DENSE prototypes β†’ pure semantic search
  • "How to fix VSWR alarm?" β†’ highest sim with HYBRID prototypes β†’ hybrid search
  • "Error 5301" β†’ highest sim with KEYWORD prototypes β†’ BM25 search

This is lightweight (no ML training needed) and effective because telecom queries follow predictable patterns.


Section 4: Infrastructure & Deployment

Q: "How did you deploy this?"

Google Cloud Run - serverless container platform.

Key decisions:

  • Scale-to-zero: Min instances = 0, so no cost when idle
  • Pre-built indexes: ChromaDB and BM25 indexes are compressed into the Docker image so cold starts don't require rebuilding
  • Feature flags: Reranker disabled on cloud (saves ~1GB memory), Redis disabled (no managed instance), hybrid search enabled
  • Environment variables: API keys passed via --set-env-vars, not baked into the image

Deployment flow: deploy-cloudbuild.sh β†’ sources .env β†’ gcloud run deploy --source=. β†’ Cloud Build creates Docker image β†’ deploys to Cloud Run

Q: "How do you handle cold starts?"

Cloud Run cold starts are a challenge. My mitigations:

  1. Pre-built ChromaDB - compressed as chroma_db.tar.gz in the Docker image, extracted at build time
  2. Cached BM25 index - serialized as bm25_index.pkl, loaded from cache on startup instead of rebuilding from documents
  3. OpenAI embeddings - no local model download needed (vs sentence-transformers which needs ~1GB download)
  4. Lazy reranker - cross-encoder model only loaded on first use (not at startup)
  5. Streamlit caching - @st.cache_resource ensures components initialize once per instance

Q: "Explain your graceful degradation strategy."

Every component is designed to fail gracefully:

Component Failure Fallback
Redis down In-memory cache (non-persistent)
LLM unavailable Returns raw retrieved context
Reranker fails to load Skips reranking, returns hybrid results
BM25 index missing Falls back to dense-only search
HuggingFace rate limit Falls back to public dataset, then built-in KB
OpenAI API key invalid Shows helpful error with diagnostic command

The system always tries to provide something useful rather than crashing. In production, this means the service stays up even when external dependencies have issues.


Section 5: Design Decisions

Q: "Why ChromaDB instead of Pinecone/Weaviate/FAISS?"

ChromaDB was chosen because:

  • Embedded - runs inside the application process, no separate database server needed
  • Persistent - data survives container restarts via disk storage
  • Lightweight - fits within Cloud Run's 4GB memory limit
  • Python-native - first-class Python API, easy to integrate

Why not alternatives:

  • Pinecone - managed service, adds latency for network calls, costs money per vector
  • FAISS - no built-in persistence, metadata filtering is manual
  • Weaviate - requires separate server, overkill for this scale

Q: "Why GPT-4o-mini instead of GPT-4o?"

Cost vs quality trade-off:

  • GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens
  • GPT-4o: ~$5/1M input tokens, ~$15/1M output tokens

For RAG, the quality difference is minimal because the answer quality depends more on the retrieved context than the model's parametric knowledge. GPT-4o-mini is 30x cheaper and 2-3x faster while being sufficient for summarizing and citing retrieved documents.

Q: "Why 125-token chunks?"

Based on the Telco-RAG paper and empirical testing:

  • Too small (50 tokens): loses context, fragments sentences
  • Too large (500 tokens): reduces retrieval precision, wastes token budget
  • 125 tokens is optimal for Q&A-style telecom documents

I also use dynamic chunk sizing based on category:

  • Standards docs: 125 tokens (concise, factual)
  • Network operations: 250 tokens (event-based, needs more context)
  • Performance data: 500 tokens (time-series, needs surrounding context)

Q: "Why did you set the cache similarity threshold at 0.95?"

It's intentionally strict because we want cache hits only for nearly identical questions:

  • "What is HARQ in 5G?" and "What is HARQ in 5G NR?" β†’ same intent, should cache
  • "What is HARQ?" and "What is MIMO?" β†’ different intent, should NOT cache

At 0.95, only near-paraphrases match. At 0.90, you'd get false positives that return answers for different questions. The cost of a cache miss (re-running the pipeline, ~3s) is much lower than the cost of returning a wrong cached answer.


Section 6: Challenges & Learnings

Q: "What was the hardest challenge?"

Balancing retrieval precision vs recall in a specialized domain.

Telecom has thousands of similar-sounding concepts (PDSCH/PUSCH, gNB/eNB, FR1/FR2). Pure semantic search would often retrieve the wrong related concept. The hybrid search with BM25 solved this because keyword matching catches exact acronyms that embeddings might confuse.

Another challenge was empty answers from HuggingFace datasets - many TeleQnA entries had answer indices pointing to empty choices. I had to implement validation to filter these out (hence the "data quality report" in the ingestion pipeline).

Q: "What would you do differently if starting over?"

  1. Use a vector DB with built-in hybrid search (like Qdrant or Weaviate) instead of managing separate BM25 + ChromaDB
  2. Implement streaming responses - currently the UI waits for the full pipeline, but users would benefit from seeing partial results
  3. Use Google Cloud Secret Manager instead of environment variables for API keys
  4. Add automated evaluation with a test suite of known Q&A pairs to catch regressions

Section 7: Metrics & Performance

Q: "What are the performance characteristics?"

Metric Value
Cold start ~10-15s (Cloud Run)
Warm query (cache miss) ~2-4s
Cache hit <100ms
Knowledge base 12,500+ docs
Index size ~500MB
Memory usage ~2-3GB
Concurrent users ~10-20 (per instance)

Q: "How would you scale this?"

Horizontal: Cloud Run auto-scales to 10 instances. Each instance handles its own requests independently since ChromaDB is embedded and read-only after ingestion.

If I needed more:

  1. Move to a managed vector DB (Pinecone/Qdrant) for shared state across instances
  2. Add a Redis cluster for shared caching (currently per-instance)
  3. Use Cloud CDN for static Streamlit assets
  4. Implement async LLM calls to handle more concurrent requests per instance
  5. Consider Google Cloud Memorystore for managed Redis

Quick Reference: Key Technical Terms

Term What It Is Where It's Used
RAG Retrieval-Augmented Generation Core architecture pattern
RRF Reciprocal Rank Fusion Merging dense + BM25 results
HyDE Hypothetical Document Embeddings Optional retrieval improvement
BM25 Best Matching 25 (TF-IDF variant) Keyword/sparse search
HNSW Hierarchical Navigable Small World ChromaDB's vector index
RAGAS RAG Assessment framework Answer quality evaluation
TLM Trustworthy Language Model Trust score computation
SON Self-Organizing Networks Telecom domain concept
HARQ Hybrid Automatic Repeat Request Common telecom example query
Cross-Encoder Model scoring query-doc pairs Reranking stage