Telecom RAG - Interview Guide

A comprehensive Q&A guide to explain this project's technical concepts in interviews.

Section 1: Project Overview Questions

Q: "Tell me about your project."

I built a Telecom RAG (Retrieval-Augmented Generation) system that provides AI-powered answers for telecom operations support - covering 5G networks, 3GPP standards, troubleshooting, and network KPIs.

It's not just a simple RAG - it implements 6 key innovations:

Hybrid Search - combines semantic (dense) vectors with keyword (BM25) search using Reciprocal Rank Fusion

Neural Query Router - classifies query intent to pick the optimal search strategy

RAGAS Evaluation - 6-metric quality assessment with automatic abstention for low-confidence answers

Semantic Caching - deduplicates similar queries via embedding similarity

Glossary Enhancement - expands 81+ telecom acronyms for better retrieval

Graceful Degradation - works at every level even when components fail

It's deployed on Google Cloud Run and handles a knowledge base of 12,500+ documents from 7 data sources.

Q: "Why did you build this? What problem does it solve?"

Telecom engineers deal with massive documentation - 3GPP specs alone are thousands of pages. They need quick, accurate answers when troubleshooting network issues or checking standards compliance.

The problem with generic LLMs is they hallucinate telecom-specific details. My system grounds every answer in actual retrieved documents, evaluates faithfulness, and refuses to answer when it can't verify the response - which is critical in a domain where wrong answers can cause network outages.

Section 2: RAG Architecture Questions

Q: "What is RAG and why did you use it?"

RAG = Retrieval-Augmented Generation. Instead of relying solely on an LLM's parametric knowledge (which can hallucinate), RAG first retrieves relevant documents from a knowledge base, then feeds them as context to the LLM.

Pipeline: Query → Retrieve relevant docs → Build context → LLM generates grounded answer

Why RAG over fine-tuning?

No retraining needed when documents update (just re-index)

Can cite specific sources (traceability)

Can detect when it doesn't have enough information (abstention)

Much cheaper than fine-tuning large models

Q: "Walk me through what happens when a user asks a question."

9-step pipeline:

Rate Limit Check - Sliding window (50 req/min) prevents abuse

Cache Check - Embed query, search for similar cached queries (threshold: 0.95 cosine similarity). If hit, return cached answer in <100ms

Query Routing - Neural router classifies intent: factual (dense search), procedural (hybrid), or keyword (BM25). Also predicts document category

Query Enhancement - Glossary expands acronyms: "HARQ" → "HARQ (Hybrid Automatic Repeat Request)"

Hybrid Retrieval - Dense search (ChromaDB) + BM25 keyword search, merged with Reciprocal Rank Fusion

Context Building - Top-6 results assembled into a context string with a 1500-token budget

LLM Generation - GPT-4o-mini generates a grounded answer using a structured prompt with question repetition

RAGAS Evaluation - 6 metrics computed: faithfulness, relevancy, context precision/recall, confidence, trust score. If below threshold → abstain

Cache & Return - Cache the response, display with metrics and source citations
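As an illustration of the first step, here is a minimal sketch of a sliding-window rate limiter. The class name, the `allow` method, and the injectable `clock` argument are assumptions made for this example, not the project's actual code; only the 50 requests per minute figure comes from the description above.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` calls within any `window_seconds` span."""

    def __init__(self, max_requests=50, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._timestamps = deque()  # times of accepted requests, oldest first

    def allow(self):
        now = self.clock()
        # Drop accepted requests that have slid out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False  # window is full, reject this request
        self._timestamps.append(now)
        return True
```

A caller would check `limiter.allow()` before running the rest of the pipeline and return a "too many requests" message when it is false.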

Q: "Why hybrid search instead of just semantic search?"

Semantic (dense) search excels at understanding meaning - "How to fix antenna issues" matches documents about "VSWR troubleshooting." But it struggles with exact terms - error codes, alarm IDs, 3GPP specification numbers.

BM25 (keyword) search is the opposite - great for exact matches but misses semantic similarity.

Hybrid search combines both. I use Reciprocal Rank Fusion (RRF) to merge the results:
score(doc) = 1/(60+rank_in_dense) + 1/(60+rank_in_BM25)
This gives documents that appear in both result sets a higher score, while still surfacing documents that are strong in either signal.
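The fusion formula can be sketched in a few lines. This is an illustrative version that assumes each retriever returns a best-first list of document IDs; the function name `rrf_fuse` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first (ranks are 1-based).
    Returns (doc_id, score) pairs, highest fused score first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by doc ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense results
bm25 = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 results
fused = rrf_fuse([dense, bm25])
```

In this toy input, `doc_b` and `doc_a` appear in both lists and therefore outrank `doc_d` and `doc_c`, which each appear in only one.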

Section 3: Algorithm Deep Dives

Q: "Explain Reciprocal Rank Fusion. Why k=60?"

Problem: Dense similarity scores (0 to 1) and BM25 scores (0 to unbounded) are on different scales. You can't simply add them.

RRF Solution: Instead of using raw scores, use rank positions:
RRF(doc) = Σ 1/(k + rank_i)
Why k=60? It's a smoothing constant. A smaller k (like 1) would heavily favor top-ranked documents. k=60 flattens the curve - the gap between rank 1 and rank 5 shrinks, so both dense and BM25 signals contribute meaningfully. The value is the default proposed in the original RRF paper (Cormack et al., 2009), which tuned it on TREC collections, and it has since become the conventional choice in hybrid search implementations.
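To make the effect of k concrete, the short calculation below compares how much more a rank-1 document contributes than a rank-5 document under k=1 and k=60. The helper name `rank_ratio` is made up for this illustration.

```python
def rank_ratio(k, better_rank=1, worse_rank=5):
    """Ratio of RRF contributions: [1/(k+better_rank)] / [1/(k+worse_rank)]."""
    return (k + worse_rank) / (k + better_rank)


ratio_k1 = rank_ratio(1)    # 6 / 2: rank 1 counts three times as much as rank 5
ratio_k60 = rank_ratio(60)  # 65 / 61: rank 1 counts only about 7% more than rank 5
```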

Q: "How does your evaluation system work?"

I implemented a RAGAS-style evaluation with 6 metrics:

Faithfulness - Extract factual claims from the answer, check each against the context via keyword overlap. Score = supported/total claims

Relevancy - Measure keyword overlap between question terms and answer terms (technical terms weighted 2x)

Context Precision - What fraction of retrieved chunks were actually relevant (similarity > 0.5)

Context Recall - What fraction of answer claims are covered by the context

Confidence - Average of top-3 retrieval similarity scores

Trust Score - Weighted combination: 40% faithfulness + 30% relevancy + 20% precision + 10% confidence

Abstention Logic: If any metric drops below 0.3 (very low), the system refuses to answer and explains why. This prevents hallucinated responses in a safety-critical domain.

There's also an optional LLM-as-judge mode where GPT-4o-mini itself scores faithfulness and relevancy (more accurate but slower).
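The trust score weighting and the abstention rule described above can be sketched as follows. The weights and the 0.3 threshold come from the text; the metric key names and function names are assumptions for this example.

```python
# Weights from the description above: 40/30/20/10.
WEIGHTS = {
    "faithfulness": 0.4,
    "relevancy": 0.3,
    "context_precision": 0.2,
    "confidence": 0.1,
}
ABSTAIN_THRESHOLD = 0.3


def trust_score(metrics):
    """Weighted combination of the four metrics that feed the trust score."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def low_metrics(metrics, threshold=ABSTAIN_THRESHOLD):
    """Names of metrics below the threshold; a non-empty list means abstain."""
    return sorted(name for name, value in metrics.items() if value < threshold)


metrics = {
    "faithfulness": 0.9,
    "relevancy": 0.8,
    "context_precision": 0.5,
    "context_recall": 0.7,
    "confidence": 0.6,
}
score = trust_score(metrics)    # 0.36 + 0.24 + 0.10 + 0.06 = 0.76
failing = low_metrics(metrics)  # empty list, so the answer is returned
```

Returning the names of the failing metrics, rather than a bare boolean, makes it easy to explain to the user why the system abstained.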

Q: "What is HyDE and why did you implement it?"

HyDE = Hypothetical Document Embeddings.

Problem: Short queries like "What is HARQ?" don't have much semantic content for embedding-based search.

Solution: Ask the LLM to generate a hypothetical ideal answer, then search for documents similar to that hypothetical answer.

Why it works: The hypothetical answer is closer in embedding space to actual documents about HARQ than the short query alone. It bridges the "query-document gap."

Trade-off: +2-3.5% accuracy but adds an extra LLM call (~1s latency). That's why I made it optional and disabled by default.
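A minimal sketch of the HyDE flow, assuming the LLM call, the embedding call, and the vector search are passed in as plain callables. The function name `hyde_search`, its parameters, and the prompt wording are hypothetical and only illustrate the order of operations.

```python
def hyde_search(query, generate, embed, search, top_k=6):
    """Search with the embedding of a hypothetical answer instead of the raw query.

    generate: str -> str             (the extra LLM call)
    embed:    str -> vector          (embedding model)
    search:   (vector, int) -> list  (vector store lookup)
    """
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), top_k)
```

Because the three dependencies are injected, the same function can be exercised with stubs in tests and switched off in production without touching the retrieval code.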

Q: "How does the query router work?"

It's a prototype-based nearest neighbor classifier using embedding similarity:

Pre-compute embeddings for 6 prototype questions per strategy (DENSE, HYBRID, KEYWORD)

When a query arrives, embed it

Compute cosine similarity against all prototypes

Pick the strategy whose prototypes have the highest max similarity

Example:

"What is 5G NR?" → highest sim with DENSE prototypes → pure semantic search

"How to fix VSWR alarm?" → highest sim with HYBRID prototypes → hybrid search

"Error 5301" → highest sim with KEYWORD prototypes → BM25 search

This is lightweight (no ML training needed) and effective because telecom queries follow predictable patterns.
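The routing rule above (highest maximum similarity wins) can be sketched with toy vectors. Real embeddings would come from the embedding model; the 3-dimensional prototypes and the function names here are placeholders for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, prototypes):
    """Pick the strategy whose prototypes have the highest max similarity.

    prototypes: dict mapping strategy name -> list of prototype vectors.
    """
    return max(
        prototypes,
        key=lambda strategy: max(cosine(query_vec, p) for p in prototypes[strategy]),
    )


# Toy prototype embeddings (assumed values, one axis per strategy).
PROTOTYPES = {
    "DENSE": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "HYBRID": [[0.0, 1.0, 0.0]],
    "KEYWORD": [[0.0, 0.0, 1.0]],
}
strategy = route([0.1, 0.2, 0.9], PROTOTYPES)
```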

Section 4: Infrastructure & Deployment

Q: "How did you deploy this?"

Google Cloud Run - serverless container platform.

Key decisions:

Scale-to-zero: Min instances = 0, so no cost when idle

Pre-built indexes: ChromaDB and BM25 indexes are compressed into the Docker image so cold starts don't require rebuilding

Feature flags: Reranker disabled on cloud (saves ~1GB memory), Redis disabled (no managed instance), hybrid search enabled

Environment variables: API keys passed via --set-env-vars, not baked into the image

Deployment flow: deploy-cloudbuild.sh → sources .env → gcloud run deploy --source=. → Cloud Build creates Docker image → deploys to Cloud Run

Q: "How do you handle cold starts?"

Cloud Run cold starts are a challenge. My mitigations:

Pre-built ChromaDB - compressed as chroma_db.tar.gz in the Docker image, extracted at build time

Cached BM25 index - serialized as bm25_index.pkl, loaded from cache on startup instead of rebuilding from documents

OpenAI embeddings - no local model download needed (vs sentence-transformers which needs ~1GB download)

Lazy reranker - cross-encoder model only loaded on first use (not at startup)

Streamlit caching - @st.cache_resource ensures components initialize once per instance

Q: "Explain your graceful degradation strategy."

Every component is designed to fail gracefully:

Component Failure	Fallback
Redis down	In-memory cache (non-persistent)
LLM unavailable	Returns raw retrieved context
Reranker fails to load	Skips reranking, returns hybrid results
BM25 index missing	Falls back to dense-only search
HuggingFace rate limit	Falls back to public dataset, then built-in KB
OpenAI API key invalid	Shows helpful error with diagnostic command

The system always tries to provide something useful rather than crashing. In production, this means the service stays up even when external dependencies have issues.
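As one example of this pattern, the sketch below shows how the "BM25 index missing" case could degrade to dense-only retrieval. The function name, the returned mode label, and the naive merge are assumptions for illustration; the real pipeline merges with Reciprocal Rank Fusion.

```python
def retrieve_with_fallback(query, dense_search, bm25_search=None):
    """Return (hits, mode); fall back to dense-only if BM25 is missing or fails."""
    dense_hits = dense_search(query)
    if bm25_search is None:
        return dense_hits, "dense-only"  # index was never loaded
    try:
        keyword_hits = bm25_search(query)
    except Exception:
        return dense_hits, "dense-only"  # index is present but broken
    # Placeholder merge: dense hits first, then any new keyword hits.
    merged = dense_hits + [hit for hit in keyword_hits if hit not in dense_hits]
    return merged, "hybrid"
```

Returning the mode alongside the hits lets the UI or logs show which fallback path was taken.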

Section 5: Design Decisions

Q: "Why ChromaDB instead of Pinecone/Weaviate/FAISS?"

ChromaDB was chosen because:

Embedded - runs inside the application process, no separate database server needed

Persistent - data survives container restarts via disk storage

Lightweight - fits within the 4GB of memory allocated to the Cloud Run instance

Python-native - first-class Python API, easy to integrate

Why not alternatives:

Pinecone - managed service, adds latency for network calls, costs money per vector

FAISS - no built-in persistence, metadata filtering is manual

Weaviate - requires separate server, overkill for this scale

Q: "Why GPT-4o-mini instead of GPT-4o?"

Cost vs quality trade-off:

GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens

GPT-4o: ~$5/1M input tokens, ~$15/1M output tokens

For RAG, the quality difference is minimal because the answer quality depends more on the retrieved context than the model's parametric knowledge. At the prices above, GPT-4o-mini is roughly 25-33x cheaper (25x on output tokens, about 33x on input tokens) and 2-3x faster while being sufficient for summarizing and citing retrieved documents.
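The cost gap can be checked with a quick per-query calculation using the prices quoted above. The token counts (2,000 input, 300 output) are assumed values for a typical RAG query, not measurements from this system.

```python
# USD per 1M tokens, as quoted above.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def query_cost(model, input_tokens, output_tokens):
    """Cost in USD of a single call at the listed per-million-token prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Assumed workload: 2,000 prompt tokens (context + question), 300 answer tokens.
mini_cost = query_cost("gpt-4o-mini", 2000, 300)  # about $0.00048
full_cost = query_cost("gpt-4o", 2000, 300)       # about $0.0145
ratio = full_cost / mini_cost                     # about 30x for this token mix
```

The exact multiple depends on the input/output mix: input-heavy RAG prompts push it toward 33x, while long answers pull it toward 25x.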

Q: "Why 125-token chunks?"

Based on the Telco-RAG paper and empirical testing:

Too small (50 tokens): loses context, fragments sentences

Too large (500 tokens): reduces retrieval precision, wastes token budget

125 tokens is optimal for Q&A-style telecom documents

I also use dynamic chunk sizing based on category:

Standards docs: 125 tokens (concise, factual)

Network operations: 250 tokens (event-based, needs more context)

Performance data: 500 tokens (time-series, needs surrounding context)
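A minimal sketch of category-based chunk sizing, using whitespace-separated words as a stand-in for real tokenizer tokens. The category keys and the function name are hypothetical; only the 125/250/500 sizes come from the list above.

```python
# Chunk sizes per category, taken from the list above.
CHUNK_TOKENS = {
    "standards": 125,
    "network_operations": 250,
    "performance": 500,
}


def chunk_text(text, category, default_tokens=125):
    """Split text into fixed-size chunks whose size depends on the category."""
    size = CHUNK_TOKENS.get(category, default_tokens)
    tokens = text.split()  # assumption: whitespace words approximate tokens
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


sample = " ".join(f"w{i}" for i in range(300))  # a 300-word dummy document
standards_chunks = chunk_text(sample, "standards")             # 125 + 125 + 50 words
operations_chunks = chunk_text(sample, "network_operations")   # 250 + 50 words
performance_chunks = chunk_text(sample, "performance")         # one 300-word chunk
```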

Q: "Why did you set the cache similarity threshold at 0.95?"

It's intentionally strict because we want cache hits only for nearly identical questions:

"What is HARQ in 5G?" and "What is HARQ in 5G NR?" → same intent, should cache

"What is HARQ?" and "What is MIMO?" → different intent, should NOT cache

At 0.95, only near-paraphrases match. At 0.90, you'd get false positives that return answers for different questions. The cost of a cache miss (re-running the pipeline, ~3s) is much lower than the cost of returning a wrong cached answer.
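The threshold check can be sketched as a small in-memory cache keyed by query embeddings. The class and method names are assumptions, and the linear scan stands in for whatever index the real cache uses; the 2-dimensional vectors are toy values.

```python
import math


class SemanticCache:
    """Return a cached answer only when a stored query is similar enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (query_embedding, answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def put(self, query_vec, answer):
        self._entries.append((list(query_vec), answer))

    def get(self, query_vec):
        best_answer, best_sim = None, -1.0
        for vec, answer in self._entries:
            sim = self._cosine(query_vec, vec)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        # A miss is cheap (re-run the pipeline); a wrong hit is not.
        return best_answer if best_sim >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached HARQ answer")
hit = cache.get([0.99, 0.05])  # near-paraphrase, similarity about 0.999
miss = cache.get([0.8, 0.6])   # different intent, similarity 0.8
```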

Section 6: Challenges & Learnings

Q: "What was the hardest challenge?"

Balancing retrieval precision vs recall in a specialized domain.

Telecom has thousands of similar-sounding concepts (PDSCH/PUSCH, gNB/eNB, FR1/FR2). Pure semantic search would often retrieve the wrong related concept. The hybrid search with BM25 solved this because keyword matching catches exact acronyms that embeddings might confuse.

Another challenge was empty answers from HuggingFace datasets - many TeleQnA entries had answer indices pointing to empty choices. I had to implement validation to filter these out (hence the "data quality report" in the ingestion pipeline).

Q: "What would you do differently if starting over?"

Use a vector DB with built-in hybrid search (like Qdrant or Weaviate) instead of managing separate BM25 + ChromaDB

Implement streaming responses - currently the UI waits for the full pipeline, but users would benefit from seeing partial results

Use Google Cloud Secret Manager instead of environment variables for API keys

Add automated evaluation with a test suite of known Q&A pairs to catch regressions

Section 7: Metrics & Performance

Q: "What are the performance characteristics?"

Metric	Value
Cold start	~10-15s (Cloud Run)
Warm query (cache miss)	~2-4s
Cache hit	<100ms
Knowledge base	12,500+ docs
Index size	~500MB
Memory usage	~2-3GB
Concurrent users	~10-20 (per instance)

Q: "How would you scale this?"

Horizontal: Cloud Run auto-scales up to a configured maximum of 10 instances. Each instance handles its own requests independently since ChromaDB is embedded and read-only after ingestion.

If I needed more:

Move to a managed vector DB (Pinecone/Qdrant) for shared state across instances

Add a Redis cluster for shared caching (currently per-instance)

Use Cloud CDN for static Streamlit assets

Implement async LLM calls to handle more concurrent requests per instance

Consider Google Cloud Memorystore for managed Redis

Quick Reference: Key Technical Terms

Term	What It Is	Where It's Used
RAG	Retrieval-Augmented Generation	Core architecture pattern
RRF	Reciprocal Rank Fusion	Merging dense + BM25 results
HyDE	Hypothetical Document Embeddings	Optional retrieval improvement
BM25	Best Matching 25 (TF-IDF variant)	Keyword/sparse search
HNSW	Hierarchical Navigable Small World	ChromaDB's vector index
RAGAS	RAG Assessment framework	Answer quality evaluation
TLM	Trustworthy Language Model	Trust score computation
SON	Self-Organizing Networks	Telecom domain concept
HARQ	Hybrid Automatic Repeat Request	Common telecom example query
Cross-Encoder	Model scoring query-doc pairs	Reranking stage