# Telecom RAG - Interview Guide
A comprehensive Q&A guide to explain this project's technical concepts in interviews.
---
## Section 1: Project Overview Questions
### Q: "Tell me about your project."
> I built a **Telecom RAG (Retrieval-Augmented Generation) system** that provides AI-powered answers for telecom operations support - covering 5G networks, 3GPP standards, troubleshooting, and network KPIs.
>
> It's not just a simple RAG - it implements **6 key innovations**:
> 1. **Hybrid Search** - combines semantic (dense) vectors with keyword (BM25) search using Reciprocal Rank Fusion
> 2. **Neural Query Router** - classifies query intent to pick the optimal search strategy
> 3. **RAGAS Evaluation** - 6-metric quality assessment with automatic abstention for low-confidence answers
> 4. **Semantic Caching** - deduplicates similar queries via embedding similarity
> 5. **Glossary Enhancement** - expands 81+ telecom acronyms for better retrieval
> 6. **Graceful Degradation** - works at every level even when components fail
>
> It's deployed on Google Cloud Run and handles a knowledge base of 12,500+ documents from 7 data sources.
### Q: "Why did you build this? What problem does it solve?"
> Telecom engineers deal with massive documentation - 3GPP specs alone are thousands of pages. They need quick, accurate answers when troubleshooting network issues or checking standards compliance.
>
> The problem with generic LLMs is they **hallucinate** telecom-specific details. My system grounds every answer in actual retrieved documents, evaluates faithfulness, and **refuses to answer** when it can't verify the response - which is critical in a domain where wrong answers can cause network outages.
---
## Section 2: RAG Architecture Questions
### Q: "What is RAG and why did you use it?"
> **RAG = Retrieval-Augmented Generation.** Instead of relying solely on an LLM's parametric knowledge (which can hallucinate), RAG first retrieves relevant documents from a knowledge base, then feeds them as context to the LLM.
>
> **Pipeline:** Query → Retrieve relevant docs → Build context → LLM generates grounded answer
>
> **Why RAG over fine-tuning?**
> - No retraining needed when documents update (just re-index)
> - Can cite specific sources (traceability)
> - Can detect when it doesn't have enough information (abstention)
> - Much cheaper than fine-tuning large models
### Q: "Walk me through what happens when a user asks a question."
> 9-step pipeline:
>
> 1. **Rate Limit Check** - Sliding window (50 req/min) prevents abuse
> 2. **Cache Check** - Embed query, search for similar cached queries (threshold: 0.95 cosine similarity). If hit, return cached answer in <100ms
> 3. **Query Routing** - Neural router classifies intent: factual (dense search), procedural (hybrid), or keyword (BM25). Also predicts document category
> 4. **Query Enhancement** - Glossary expands acronyms: "HARQ" → "HARQ (Hybrid Automatic Repeat Request)"
> 5. **Hybrid Retrieval** - Dense search (ChromaDB) + BM25 keyword search, merged with Reciprocal Rank Fusion
> 6. **Context Building** - Top-6 results assembled into a context string with a 1500-token budget
> 7. **LLM Generation** - GPT-4o-mini generates a grounded answer using a structured prompt with question repetition
> 8. **RAGAS Evaluation** - 6 metrics computed: faithfulness, relevancy, context precision/recall, confidence, trust score. If below threshold → abstain
> 9. **Cache & Return** - Cache the response, display with metrics and source citations
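
Step 1 of the pipeline above can be sketched as follows. This is a minimal illustration of a sliding-window limiter, not the project's actual code: the class name `SlidingWindowRateLimiter` and the injectable `clock` parameter are assumptions made for the example; only the 50 requests per minute figure comes from the answer.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` calls within any `window_seconds` span."""

    def __init__(self, max_requests=50, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the limiter can be tested
        self.timestamps = deque()   # accepted request times, oldest first

    def allow(self):
        now = self.clock()
        # Drop timestamps that have slid out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            return False            # over the limit: reject without recording
        self.timestamps.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps hammering the endpoint does not extend its own lockout.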
### Q: "Why hybrid search instead of just semantic search?"
> Semantic (dense) search excels at understanding meaning - "How to fix antenna issues" matches documents about "VSWR troubleshooting." But it struggles with **exact terms** - error codes, alarm IDs, 3GPP specification numbers.
>
> BM25 (keyword) search is the opposite - great for exact matches but misses semantic similarity.
>
> **Hybrid search combines both.** I use Reciprocal Rank Fusion (RRF) to merge the results:
> ```
> score(doc) = 1/(60+rank_in_dense) + 1/(60+rank_in_BM25)
> ```
> This gives documents that appear in both result sets a higher score, while still surfacing documents that are strong in either signal.
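
The formula above can be sketched in a few lines. This is an illustrative version, assuming each retriever returns an ordered list of document IDs with 1-based ranks; the function name and the sample IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs into (doc_id, score) pairs; ranks are 1-based."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Here `doc_b` and `doc_a` appear in both lists and end up ahead of `doc_d` and `doc_c`, which appear in only one.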
---
## Section 3: Algorithm Deep Dives
### Q: "Explain Reciprocal Rank Fusion. Why k=60?"
> **Problem:** Dense similarity scores (0 to 1) and BM25 scores (0 to unbounded) are on different scales. You can't simply add them.
>
> **RRF Solution:** Instead of using raw scores, use rank positions:
> ```
> RRF(doc) = Σ 1/(k + rank_i)
> ```
>
> **Why k=60?** It's a smoothing constant. A smaller k (like 1) would heavily favor top-ranked documents: with k=1, rank 1 scores 1/2 = 0.5 and rank 5 scores 1/6 ≈ 0.167, a 3x gap. With k=60, rank 1 scores 1/61 ≈ 0.0164 and rank 5 scores 1/65 ≈ 0.0154, only about 6% lower, so both dense and BM25 signals contribute meaningfully. k=60 is the default proposed in the original RRF paper (Cormack, Clarke, and Büttcher, SIGIR 2009), which evaluated it on TREC collections, and it has become the conventional choice.
### Q: "How does your evaluation system work?"
> I implemented a **RAGAS-style evaluation** with 6 metrics:
>
> 1. **Faithfulness** - Extract factual claims from the answer, check each against the context via keyword overlap. Score = supported/total claims
> 2. **Relevancy** - Measure keyword overlap between question terms and answer terms (technical terms weighted 2x)
> 3. **Context Precision** - What fraction of retrieved chunks were actually relevant (similarity > 0.5)
> 4. **Context Recall** - What fraction of answer claims are covered by the context
> 5. **Confidence** - Average of top-3 retrieval similarity scores
> 6. **Trust Score** - Weighted combination: 40% faithfulness + 30% relevancy + 20% precision + 10% confidence
>
> **Abstention Logic:** If any metric drops below 0.3 (very low), the system refuses to answer and explains why. This prevents hallucinated responses in a safety-critical domain.
>
> There's also an optional **LLM-as-judge** mode where GPT-4o-mini itself scores faithfulness and relevancy (more accurate but slower).
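
The trust score and abstention rule can be sketched as below. The weights and the 0.3 floor come from the answer above; the metric key names, the `EvalResult` type, and the choice to apply the floor to every supplied metric are assumptions for illustration.

```python
from dataclasses import dataclass

# Weights as stated in the answer: 40% / 30% / 20% / 10%.
TRUST_WEIGHTS = {
    "faithfulness": 0.4,
    "relevancy": 0.3,
    "context_precision": 0.2,
    "confidence": 0.1,
}
ABSTAIN_BELOW = 0.3


@dataclass
class EvalResult:
    trust_score: float
    abstain: bool
    failing: list


def evaluate(metrics):
    """Combine metric scores (each in [0, 1]) and decide whether to abstain."""
    trust = sum(weight * metrics[name] for name, weight in TRUST_WEIGHTS.items())
    # Assumption: the floor applies to every metric passed in.
    failing = sorted(name for name, value in metrics.items() if value < ABSTAIN_BELOW)
    return EvalResult(trust_score=round(trust, 4), abstain=bool(failing), failing=failing)
```

Returning the names of the failing metrics is what lets the UI explain why the system declined to answer.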
### Q: "What is HyDE and why did you implement it?"
> **HyDE = Hypothetical Document Embeddings.**
>
> **Problem:** Short queries like "What is HARQ?" don't have much semantic content for embedding-based search.
>
> **Solution:** Ask the LLM to generate a hypothetical ideal answer, then search for documents similar to that hypothetical answer.
>
> **Why it works:** The hypothetical answer is closer in embedding space to actual documents about HARQ than the short query alone. It bridges the "query-document gap."
>
> **Trade-off:** +2-3.5% accuracy but adds an extra LLM call (~1s latency). That's why I made it optional and disabled by default.
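
A minimal sketch of the HyDE flow, under the assumption that the LLM, the embedder, and the vector search are passed in as plain callables. The names `generate`, `embed`, and `search`, and the prompt wording, are hypothetical and not taken from the project.

```python
def hyde_search(query, generate, embed, search, top_k=6):
    """Retrieve with the embedding of a hypothetical answer instead of the query.

    generate(prompt) -> str, embed(text) -> list[float], and
    search(vector, top_k) -> list are injected, hypothetical callables.
    """
    prompt = f"Write a short, factual passage that answers: {query}"
    hypothetical_answer = generate(prompt)   # the extra LLM call
    vector = embed(hypothetical_answer)      # embed the answer, not the query
    return search(vector, top_k)
```

The hypothetical answer is used only for retrieval and is discarded afterwards; the final answer is still generated from the retrieved documents.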
### Q: "How does the query router work?"
> It's a **prototype-based nearest neighbor classifier** using embedding similarity:
>
> 1. Pre-compute embeddings for 6 prototype questions per strategy (DENSE, HYBRID, KEYWORD)
> 2. When a query arrives, embed it
> 3. Compute cosine similarity against all prototypes
> 4. Pick the strategy whose prototypes have the highest max similarity
>
> **Example:**
> - "What is 5G NR?" → highest sim with DENSE prototypes → pure semantic search
> - "How to fix VSWR alarm?" → highest sim with HYBRID prototypes → hybrid search
> - "Error 5301" → highest sim with KEYWORD prototypes → BM25 search
>
> This is lightweight (no ML training needed) and effective because telecom queries follow predictable patterns.
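
The four routing steps can be sketched as follows. This assumes the prototype embeddings are already computed and supplied as plain lists of floats; the `Strategy` enum and the function names are illustrative, and a real setup would embed the prototypes with the same model used for queries.

```python
import math
from enum import Enum


class Strategy(Enum):
    DENSE = "dense"
    HYBRID = "hybrid"
    KEYWORD = "keyword"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, prototypes):
    """Pick the strategy whose prototypes contain the single best match.

    prototypes maps Strategy -> list of prototype embedding vectors.
    """
    best = {
        strategy: max(cosine(query_vec, proto) for proto in vecs)
        for strategy, vecs in prototypes.items()
    }
    return max(best, key=best.get)
```

Using the maximum similarity per strategy, rather than the mean, means one closely matching prototype is enough to select that strategy.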
---
## Section 4: Infrastructure & Deployment
### Q: "How did you deploy this?"
> **Google Cloud Run** - serverless container platform.
>
> **Key decisions:**
> - **Scale-to-zero:** Min instances = 0, so no cost when idle
> - **Pre-built indexes:** ChromaDB and BM25 indexes are compressed into the Docker image so cold starts don't require rebuilding
> - **Feature flags:** Reranker disabled on cloud (saves ~1GB memory), Redis disabled (no managed instance), hybrid search enabled
> - **Environment variables:** API keys passed via `--set-env-vars`, not baked into the image
>
> **Deployment flow:** `deploy-cloudbuild.sh` → sources `.env` → `gcloud run deploy --source=.` → Cloud Build creates Docker image → deploys to Cloud Run
### Q: "How do you handle cold starts?"
> Cloud Run cold starts are a challenge. My mitigations:
>
> 1. **Pre-built ChromaDB** - compressed as `chroma_db.tar.gz` in the Docker image, extracted at build time
> 2. **Cached BM25 index** - serialized as `bm25_index.pkl`, loaded from cache on startup instead of rebuilding from documents
> 3. **OpenAI embeddings** - no local model download needed (vs sentence-transformers which needs ~1GB download)
> 4. **Lazy reranker** - cross-encoder model only loaded on first use (not at startup)
> 5. **Streamlit caching** - `@st.cache_resource` ensures components initialize once per instance
### Q: "Explain your graceful degradation strategy."
> Every component is designed to fail gracefully:
>
> | Component Failure | Fallback |
> |---|---|
> | Redis down | In-memory cache (non-persistent) |
> | LLM unavailable | Returns raw retrieved context |
> | Reranker fails to load | Skips reranking, returns hybrid results |
> | BM25 index missing | Falls back to dense-only search |
> | HuggingFace rate limit | Falls back to public dataset, then built-in KB |
> | OpenAI API key invalid | Shows helpful error with diagnostic command |
>
> The system always tries to provide something useful rather than crashing. In production, this means the service stays up even when external dependencies have issues.
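
Two rows of the table above (BM25 index missing, LLM unavailable) can be sketched as follows. The function names and the callable-based interfaces are assumptions for illustration, and the merge step is a simple de-duplication standing in for the rank fusion described earlier.

```python
import logging

logger = logging.getLogger("telecom_rag")


def retrieve_with_fallback(query, dense_search, bm25_search=None):
    """Return (results, mode); fall back to dense-only if BM25 is unavailable."""
    dense = dense_search(query)
    if bm25_search is None:
        return dense, "dense-only"
    try:
        sparse = bm25_search(query)
    except Exception as exc:  # broad on purpose: any BM25 failure degrades
        logger.warning("BM25 failed, using dense-only: %s", exc)
        return dense, "dense-only"
    # Order-preserving de-duplication; a real pipeline would fuse ranks instead.
    merged = list(dict.fromkeys(dense + sparse))
    return merged, "hybrid"


def answer_with_fallback(query, context, llm=None):
    """Return the LLM answer, or the raw retrieved context if the LLM fails."""
    if llm is not None:
        try:
            return llm(query, context)
        except Exception as exc:
            logger.warning("LLM unavailable, returning raw context: %s", exc)
    return "Retrieved context (no generated answer):\n" + "\n".join(context)
```

Returning the mode alongside the results lets the UI tell the user which degraded path produced the answer.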
---
## Section 5: Design Decisions
### Q: "Why ChromaDB instead of Pinecone/Weaviate/FAISS?"
> **ChromaDB** was chosen because:
> - **Embedded** - runs inside the application process, no separate database server needed
> - **Persistent** - data survives container restarts via disk storage
> - **Lightweight** - fits within the 4GB memory limit configured for the Cloud Run service
> - **Python-native** - first-class Python API, easy to integrate
>
> **Why not alternatives:**
> - **Pinecone** - managed service, adds latency for network calls, costs money per vector
> - **FAISS** - no built-in persistence, metadata filtering is manual
> - **Weaviate** - requires separate server, overkill for this scale
### Q: "Why GPT-4o-mini instead of GPT-4o?"
> **Cost vs quality trade-off:**
> - GPT-4o-mini: ~$0.15/1M input tokens, ~$0.60/1M output tokens
> - GPT-4o: ~$5/1M input tokens, ~$15/1M output tokens
>
> For RAG, the quality difference is minimal because answer quality depends more on the retrieved context than on the model's parametric knowledge. At these prices GPT-4o-mini is roughly 25-33x cheaper (33x on input, 25x on output) and 2-3x faster, while being sufficient for summarizing and citing retrieved documents.
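
The cost comparison can be checked with a short calculation. The per-token prices are the ones quoted above; the workload of 1,800 input tokens and 300 output tokens per query is an assumption chosen to resemble a RAG request with a 1,500-token context.

```python
PRICES_PER_MILLION = {          # USD per 1M tokens, as quoted in the answer
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 5.00, "output": 15.00},
}


def cost_per_query(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed prices."""
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Assumed workload: 1,800 input tokens and 300 output tokens per query.
mini = cost_per_query("gpt-4o-mini", 1800, 300)
full = cost_per_query("gpt-4o", 1800, 300)
ratio = full / mini
```

Under these assumed token counts the ratio works out to about 30x; it shifts toward 33x for input-heavy requests and toward 25x for output-heavy ones.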
### Q: "Why 125-token chunks?"
> Based on the **Telco-RAG paper** and empirical testing:
> - Too small (50 tokens): loses context, fragments sentences
> - Too large (500 tokens): reduces retrieval precision, wastes token budget
> - **125 tokens** is optimal for Q&A-style telecom documents
>
> I also use **dynamic chunk sizing** based on category:
> - Standards docs: 125 tokens (concise, factual)
> - Network operations: 250 tokens (event-based, needs more context)
> - Performance data: 500 tokens (time-series, needs surrounding context)
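
Category-based chunk sizing can be sketched as follows. The three sizes come from the list above; the category keys are hypothetical, and tokens are approximated by whitespace-separated words, whereas a real pipeline would count tokens with the model's tokenizer.

```python
CHUNK_TOKENS = {
    "standards": 125,
    "network_operations": 250,
    "performance": 500,
}
DEFAULT_CHUNK_TOKENS = 125


def chunk_text(text, category):
    """Split text into chunks sized by category (tokens ~ whitespace words here)."""
    size = CHUNK_TOKENS.get(category, DEFAULT_CHUNK_TOKENS)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

This sketch splits on fixed boundaries with no overlap; whether the project overlaps adjacent chunks is not stated in the answer.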
### Q: "Why did you set the cache similarity threshold at 0.95?"
> It's intentionally strict because we want cache hits only for **nearly identical questions**:
> - "What is HARQ in 5G?" and "What is HARQ in 5G NR?" → same intent, should cache
> - "What is HARQ?" and "What is MIMO?" → different intent, should NOT cache
>
> At 0.95, only near-paraphrases match. At 0.90, you'd get false positives that return answers for different questions. The cost of a cache miss (re-running the pipeline, ~3s) is much lower than the cost of returning a wrong cached answer.
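
The threshold check can be sketched as a small in-memory cache. This is an illustration only: the `SemanticCache` class and its linear scan are assumptions, and `embed` stands for whatever embedding function the application uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer only when a stored query is a near-paraphrase."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: text -> list[float]
        self.threshold = threshold
        self.entries = []           # list of (embedding, answer)

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))

    def get(self, query):
        query_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:   # linear scan; fine for small caches
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

A miss returns `None`, so the caller falls through to the full pipeline and then calls `put` with the fresh answer.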
---
## Section 6: Challenges & Learnings
### Q: "What was the hardest challenge?"
> **Balancing retrieval precision vs recall in a specialized domain.**
>
> Telecom has thousands of similar-sounding concepts (PDSCH/PUSCH, gNB/eNB, FR1/FR2). Pure semantic search would often retrieve the wrong related concept. The hybrid search with BM25 solved this because keyword matching catches exact acronyms that embeddings might confuse.
>
> Another challenge was **empty answers from HuggingFace datasets** - many TeleQnA entries had answer indices pointing to empty choices. I had to implement validation to filter these out (hence the "data quality report" in the ingestion pipeline).
### Q: "What would you do differently if starting over?"
> 1. **Use a vector DB with built-in hybrid search** (like Qdrant or Weaviate) instead of managing separate BM25 + ChromaDB
> 2. **Implement streaming responses** - currently the UI waits for the full pipeline, but users would benefit from seeing partial results
> 3. **Use Google Cloud Secret Manager** instead of environment variables for API keys
> 4. **Add automated evaluation** with a test suite of known Q&A pairs to catch regressions
---
## Section 7: Metrics & Performance
### Q: "What are the performance characteristics?"
> | Metric | Value |
> |--------|-------|
> | Cold start | ~10-15s (Cloud Run) |
> | Warm query (cache miss) | ~2-4s |
> | Cache hit | <100ms |
> | Knowledge base | 12,500+ docs |
> | Index size | ~500MB |
> | Memory usage | ~2-3GB |
> | Concurrent users | ~10-20 (per instance) |
### Q: "How would you scale this?"
> **Horizontal:** Cloud Run auto-scales up to 10 instances. Each instance handles its own requests independently since ChromaDB is embedded and read-only after ingestion.
>
> **If I needed more:**
> 1. Move to a **managed vector DB** (Pinecone/Qdrant) for shared state across instances
> 2. Add a **Redis cluster** for shared caching (currently per-instance)
> 3. Use **Cloud CDN** for static Streamlit assets
> 4. Implement **async LLM calls** to handle more concurrent requests per instance
> 5. Consider **Google Cloud Memorystore** for managed Redis
---
## Quick Reference: Key Technical Terms
| Term | What It Is | Where It's Used |
|------|-----------|-----------------|
| **RAG** | Retrieval-Augmented Generation | Core architecture pattern |
| **RRF** | Reciprocal Rank Fusion | Merging dense + BM25 results |
| **HyDE** | Hypothetical Document Embeddings | Optional retrieval improvement |
| **BM25** | Best Matching 25 (TF-IDF variant) | Keyword/sparse search |
| **HNSW** | Hierarchical Navigable Small World | ChromaDB's vector index |
| **RAGAS** | RAG Assessment framework | Answer quality evaluation |
| **TLM** | Trustworthy Language Model | Trust score computation |
| **SON** | Self-Organizing Networks | Telecom domain concept |
| **HARQ** | Hybrid Automatic Repeat Request | Common telecom example query |
| **Cross-Encoder** | Model scoring query-doc pairs | Reranking stage |