# GenAI Systems Interview Guide: Comprehensive Q&A

**Last Updated:** January 2026
**Scope:** LLM Systems, RAG, Vector Databases, Evaluation, and Production Deployment

---

## 1. MODELS (LLMs)

### Q1: When would you choose a 13B model over a 70B model?

I choose a 13B model when I care more about **latency, throughput, and cost** than chasing the last few points of accuracy. Medium models (7–13B) typically deliver much higher tokens-per-second and require far less GPU memory than 70B models, making them easier and cheaper to deploy at scale. For workloads like classification, routing, FAQ-style Q&A, simple summarization, and entity extraction, a 13B model plus strong RAG and prompts often matches the user-perceived quality of a 70B model while serving more users per dollar. I reserve 70B-class models for highly ambiguous questions, complex multi-step reasoning, long-context synthesis, or high-stakes decisions where marginal improvements in reasoning justify 2–5x higher latency and resource use.

### Q2: How do you reduce hallucinations without changing the model?

I treat hallucinations as a **grounding and process problem** rather than purely a model problem. First, I improve retrieval: better chunking, ranking, filters, and hybrid search so the model always sees highly relevant context, which is the core idea behind RAG-based mitigation. Second, I tighten prompts: explicit instructions to answer only from the provided context, to say "I don't know" when evidence is missing, and to cite or reference supporting snippets. Third, I add verification: use an LLM-as-judge or rule-based checks to compare the answer to retrieved documents and flag unsupported claims.
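The verification step can be sketched as a crude, rule-based stand-in for an LLM-as-judge: flag answer sentences whose content words barely overlap the retrieved context. This is a toy lexical heuristic, not a production check, and all names are illustrative:

```python
import re

def flag_unsupported_claims(answer: str, context: str, threshold: float = 0.5) -> list[str]:
    """Flag answer sentences whose content words are mostly absent from the
    retrieved context. A crude lexical stand-in for an LLM-as-judge check."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        # Keep only longer content words; short tokens add noise.
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:  # most content words lack grounding
            flagged.append(sentence)
    return flagged

context = "The invoice total was $1,200, due on March 1."
answer = "The invoice total was $1,200. Payment was received early."
print(flag_unsupported_claims(answer, context))  # → ['Payment was received early.']
```

In practice I back this kind of filter with an LLM-as-judge pass; the lexical check is just a cheap first pass that catches obvious fabrications.
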
Finally, for critical domains (legal, finance, medical), I combine all of this with trusted, curated knowledge bases and human-in-the-loop review, because even strong RAG still leaves a residual hallucination rate.

### Q3: When is fine-tuning the wrong choice?

Fine-tuning is a poor choice when the model already understands the **task format** and you mainly need to inject domain knowledge. In that case, RAG or tools are better: they are cheaper, easier to update, and avoid locking changing facts into weights that require periodic retraining. Fine-tuning is also problematic when you lack a large, high-quality labeled dataset and robust evaluation, because you risk overfitting and silently degrading performance. I reserve fine-tuning for systematic behavior changes (tone, style, safety policies, structured output formats, tool usage patterns) where you want the behavior "baked in" and stable across queries, not for storing fast-changing knowledge.

### Q4: How would you approach selecting between open-source and proprietary LLM providers?

I evaluate this trade-off on **cost, latency, privacy, and control**. Open-source models (Llama, Mistral, DeepSeek) give me full control: I can self-host, fine-tune, and avoid vendor lock-in, but I own the operational burden and must manage updates and security. Proprietary APIs (OpenAI, Anthropic, Google) offer reliability, fast model updates, and no ops overhead, but cost more and require sending data to external providers. For sensitive data or high-volume production workloads, self-hosting open-source models with careful versioning and quantization can be cost-effective. For rapid prototyping, or when I need the very best reasoning, API access to frontier models is worth the cost and vendor risk.
I typically use proprietary models in early development, then migrate critical workloads to cost-optimized open-source models as traffic scales.

### Q5: What strategies do you use to handle context windows and long documents?

For documents longer than my model's context window (typically 4k–200k tokens), I chunk and retrieve rather than stuffing in the entire document. I use either sliding-window retrieval (rank and take the top-k relevant chunks) or hierarchical retrieval (summarize sections first, then drill into the relevant ones), depending on whether I need broad context or specific details. For very long narratives, I sometimes build a summary index: chunk the document into sections, generate summary embeddings for each, then retrieve the most relevant summaries before fetching the full sections. I also apply the "lost in the middle" research, which shows that LLMs often miss information in the middle of long contexts, so I strategically place the most important context at the beginning and end of my prompts.

### Q6: How do you manage token counting and budget planning for inference?

I manage token budgets by pre-computing token counts for all relevant content: documents, prompts, few-shot examples, and expected responses. I use tools like `tiktoken` or model-specific tokenizers to estimate costs before sending requests. In production, I set per-request token limits and monitor actual token usage per query to detect anomalies (e.g., when a user query or retrieved context is unexpectedly large). For long contexts, I use aggressive summarization and filtering to reduce the tokens sent to the model. I also use prompt templates with a fixed structure so I can predict token usage with confidence.

---

## 2. FRAMEWORKS (ORCHESTRATION)

### Q7: When does a multi-agent system become worse than a single agent?

Multi-agent systems become counterproductive when **coordination overhead and failure modes** outweigh any specialization benefits.
This often happens when agent roles are poorly defined, context is redundantly shared, or agents repeatedly hand tasks back and forth, causing loops and latency explosions. Multi-agent setups are also harmful when you lack strong observability: debugging cross-agent failures is much harder than analyzing a single-agent chain. I start with a single, well-instrumented agent and only add agents when their responsibilities are clearly separable, independently verifiable, and shown (via experiments) to improve quality or determinism.

### Q8: How do you prevent tool-calling loops?

I prevent tool-calling loops with a mix of **guardrails, state, and monitoring**. At the prompt and orchestration level, I define explicit stopping criteria, per-request tool call limits, and timeouts so the agent must either finalize, fall back, or escalate after a small number of iterations. I maintain explicit state about which tools have been called, with what inputs and outputs, and mark "no-progress" iterations so the agent is forced to change strategy instead of repeating the same sequence. Finally, I log and monitor tool usage patterns (calls per request, loop signatures) so pathological behaviors can be detected and fixed offline, then guarded against with deterministic checks.

### Q9: How do you debug a bad answer?

I debug a bad answer by working **layer by layer** through the pipeline. First, I inspect retrieval: did we fetch the right documents, with enough coverage, and were they up to date? If retrieval looks good, I inspect the prompt: system instructions, examples, role clarity, and how the context is framed or formatted. Next, I verify routing and tools: was the right model used, were tools called correctly, and did the agent misinterpret tool outputs?
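Logging a structured trace per request makes that layer-by-layer inspection tractable; a minimal sketch, with a hypothetical schema and a deliberately cheap first-pass triage:

```python
from dataclasses import dataclass, field

@dataclass
class RequestTrace:
    """One record per request, capturing each pipeline layer for inspection.
    Field names here are illustrative, not a standard schema."""
    query: str
    retrieved_doc_ids: list[str] = field(default_factory=list)
    prompt: str = ""
    model: str = ""
    tool_calls: list[dict] = field(default_factory=list)
    answer: str = ""

    def suspect_layers(self) -> list[str]:
        """Cheap first-pass triage: flag layers with obviously missing data."""
        suspects = []
        if not self.retrieved_doc_ids:
            suspects.append("retrieval")   # nothing fetched at all
        if self.query not in self.prompt:
            suspects.append("prompt")      # user query never reached the model
        if any(c.get("error") for c in self.tool_calls):
            suspects.append("tools")       # a tool call failed
        return suspects

trace = RequestTrace(query="refund policy?", retrieved_doc_ids=["doc7"],
                     prompt="Context: ...\nQuestion: refund policy?",
                     tool_calls=[{"name": "search", "error": "timeout"}])
print(trace.suspect_layers())  # → ['tools']
```

Real triage needs human judgment on top, but a trace like this turns "the answer is bad" into "which layer do I open first."
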
Only after those checks do I attribute issues to model limitations and consider changes like a stronger model, task decomposition, or a dedicated tool.

### Q10: How would you design a system to handle concurrent agent requests with shared state?

For concurrent agents sharing state, I use **immutable snapshots** and **event logging** rather than direct state mutation. Each agent sees a consistent view of state at the start of its execution and records all actions as immutable events. I use a distributed transaction log (or a simple version with timestamps and read-write locks) to ensure agents don't conflict. For high concurrency, I shard state by query ID or user so most requests don't contend. I avoid shared mutable state entirely: instead, agents write their outputs to a queue or event log, and a central orchestrator reads and applies state changes sequentially. This keeps reasoning within an agent deterministic while allowing concurrent execution across agents.

### Q11: How do you evaluate agent behavior and measure reliability?

I measure agent reliability on multiple dimensions. First, **task success rate**: does the agent achieve the intended goal without human intervention? Second, **efficiency**: how many steps, tool calls, or tokens does it take? Third, **safety and compliance**: does it refuse unsafe requests, avoid hallucinations, and stay within policy? Fourth, **cost**: given model and tool call costs, what is the per-request cost? I build fixed test suites covering nominal cases, edge cases, and adversarial inputs, then run them before and after changes. For human-facing agents, I sample real requests, have humans rate answer quality, and track improvement over time.

---

## 3. VECTOR DATABASES

### Q12: How do you choose chunk size?

I choose chunk size to balance **semantic completeness** with retrieval precision.
For many use cases, 300–500 tokens per chunk is a good starting point: large enough to capture a single coherent idea without pulling in too much noise. I adapt chunk size to document type: tightly structured docs (APIs, specs, contracts) may benefit from smaller, section- or heading-aligned chunks, while narrative docs can tolerate slightly larger chunks with overlap. I validate my choice empirically using retrieval benchmarks (recall, precision, nDCG) and user-centric tests, iterating if I see fragmented answers or irrelevant context.

### Q13: When does semantic search fail?

Semantic search struggles when queries require **exact matches or strict filters** more than conceptual similarity. Examples include IDs, codes, email addresses, formulas, specific error messages, and numeric constraints where even minor deviations are unacceptable. It can also underperform on very short or ambiguous queries, and on highly structured data where SQL or graph queries are more appropriate. In those cases, I use hybrid retrieval: combine embeddings with keyword/BM25 search, metadata filters, and/or structured queries to get both semantic and exact-match behavior.

### Q14: How do you handle metadata and filtering in vector search?

Metadata enables powerful filtering without sacrificing semantic search quality. I attach structured metadata to every chunk: source document, section, author, creation date, version, tags, and domain-specific fields. At query time, I filter on metadata (e.g., "only documents from 2025" or "only from trusted authors") before or after the semantic search, depending on whether the metadata filters dramatically reduce the search space. Pre-filtering is faster; post-filtering gives more control over ranking. For hierarchical metadata (e.g., document → section → paragraph), I sometimes use multi-level indexing so queries can navigate from broad to specific.
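The pre- vs. post-filtering trade-off can be shown with a toy in-memory index; real vector databases expose the same choice through their filter APIs, and the field names here are illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

index = [
    {"id": "c1", "vec": [1.0, 0.0], "year": 2025},
    {"id": "c2", "vec": [0.9, 0.1], "year": 2023},
    {"id": "c3", "vec": [0.0, 1.0], "year": 2025},
]

def pre_filter_search(query_vec, k, pred):
    # Filter first, then rank: scans fewer vectors, but the filter must be cheap.
    candidates = [c for c in index if pred(c)]
    return sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)[:k]

def post_filter_search(query_vec, k, pred):
    # Rank everything, then filter: full ranking control, but may return fewer than k hits.
    ranked = sorted(index, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c for c in ranked if pred(c)][:k]

only_2025 = lambda c: c["year"] == 2025
print([c["id"] for c in pre_filter_search([1.0, 0.0], 2, only_2025)])  # → ['c1', 'c3']
```
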
Metadata filtering is often a more effective lever than tuning embedding models.

### Q15: How would you design a multi-tenant vector database?

For multi-tenant scenarios, I have two main strategies. **Index-level isolation**: each tenant's data lives in a separate collection or schema, with strict access controls. This is simple and safe but uses more storage and operational overhead. **Shared index with tenant tagging**: all tenants' data lives in one index with a tenant ID in the metadata, and queries are automatically filtered by tenant. This is more efficient but requires careful validation that queries always include the tenant filter, otherwise data can leak across tenants. I prefer separate indices for large tenants and shared indices for small tenants to balance cost and isolation. I also implement strict authorization checks at query time and audit all cross-tenant access to prevent even accidental breaches.

---

## 4. DATA EXTRACTION & INGESTION

### Q16: How do you handle tables in PDFs?

I treat tables as **structured data**, not prose. I extract them into formats like CSV or JSON, preserving row/column relationships, headers, and data types so downstream retrieval and reasoning can operate at the cell or column level. In the index, I often store each row (or logical group of rows) as a separate record, with metadata such as table name, units, and source page, to support precise retrieval and aggregation. This approach aligns with best practices for document AI and significantly improves accuracy over naive text flattening.

### Q17: How do you keep embeddings in sync?

I keep embeddings in sync by **versioning documents and vectors together**. Any meaningful update to the source content, the embedding model, or the chunking/ingestion pipeline triggers re-embedding and invalidation of the old vectors for that document version. I track index schema and ingestion code versions so retrieval issues can be traced to specific changes and rolled back if needed.
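A minimal sketch of that discipline: key each stored vector by a hash of everything that defines the embedding space, so any change to content, model, or chunker automatically marks the old vector stale. The identifiers below are hypothetical:

```python
import hashlib

EMBED_MODEL = "embed-v2"       # hypothetical embedding model identifier
CHUNKER_VERSION = "chunker-3"  # bump whenever chunking logic changes

def embedding_key(chunk_text: str) -> str:
    """Hash of content + model + chunker version; if any of these change,
    the key changes and the stored vector no longer matches."""
    payload = f"{EMBED_MODEL}|{CHUNKER_VERSION}|{chunk_text}"
    return hashlib.sha256(payload.encode()).hexdigest()

# Keys recorded at index time, alongside the vectors they describe.
stored = {"doc1#0": embedding_key("Refunds are processed in 5 days.")}

def needs_reembedding(chunk_id: str, current_text: str) -> bool:
    return stored.get(chunk_id) != embedding_key(current_text)

print(needs_reembedding("doc1#0", "Refunds are processed in 5 days."))  # → False
print(needs_reembedding("doc1#0", "Refunds are processed in 7 days."))  # → True
```
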
Without such discipline, mixed embeddings accumulate in the same index and gradually degrade similarity quality and retrieval reliability.

### Q18: How do you handle incremental updates and document versioning?

For incremental updates, I use **versioned document snapshots** and **delta indexing**. When a document changes, I create a new version rather than mutating the old one. I then re-chunk and re-embed only the changed sections, invalidating the embeddings for the affected chunks. If the change is minor (typos, formatting), I can sometimes skip re-embedding, if I'm confident the vectors won't shift meaningfully. For large corpora, I batch re-embeddings into off-peak jobs and use async updates so the old index stays online until the new one is ready. I also maintain a document changelog so I can trace the history of changes and understand whether a retrieval failure is due to stale data or retrieval logic.

### Q19: How do you validate data quality in an ingestion pipeline?

I validate data at multiple stages. **On ingest**: check for duplicates, missing fields, encoding errors, and format compliance against a schema. **During chunking**: verify that chunk sizes are within expected ranges, overlaps are consistent, and metadata is present and well-formed. **After embedding**: spot-check embeddings for NaNs, ensure similarity scores are in expected ranges, and verify that known-similar documents have high similarity. **In production**: sample retrieved chunks regularly to ensure they're relevant and up to date. I also run periodic audits comparing source data to indexed data to catch drift or corruption. Strong validation catches issues early and prevents cascading failures.

---

## 5. LLM ACCESS & INFERENCE

### Q20: When would you self-host models?

I self-host models when I need **tight control** over data, latency, and cost at scale, and I can justify the operational overhead.
Self-hosting is particularly valuable when compliance or customer requirements prohibit sending data to third-party APIs or demand on-prem/VPC deployments. It lets me exploit batching, quantization, and custom hardware to significantly reduce per-token cost, and quantization can shrink memory needs enough to run 70B models on commodity GPUs. For low-volume or rapidly evolving workloads, I usually prefer managed APIs, because their elasticity and operational maturity outweigh the cost savings from owning the stack.

### Q21: How do you reduce inference cost?

I reduce inference cost at **multiple layers**. At the model layer, I choose smaller or specialized models for simple tasks, and use quantization and efficient runtimes (like vLLM) to improve throughput per GPU. At the request layer, I batch compatible prompts, cache frequent or deterministic responses, and aggressively trim context to only what is needed. At the system level, I add routing so cheap models handle the majority of simple queries while expensive models are reserved for complex or high-value cases, which often yields better cost reductions than model changes alone.

### Q22: How would you design request batching and queue management for high-throughput inference?

For high throughput, I use **dynamic batching** with deadline constraints. Requests wait briefly (e.g., 100ms) for other requests to arrive so I can batch them together, but never long enough to miss SLAs. I group requests by model, input length, and other relevant features to minimize padding and maximize GPU utilization. I use a priority queue so time-sensitive or high-value requests jump ahead. For streaming responses, I generate tokens for one batch, stream them immediately, then start the next batch to hide latency.
I also monitor queue depth and request latency to auto-scale workers and prevent queuing from becoming a bottleneck.

### Q23: How do you handle rate limiting and backpressure from model providers?

I implement **exponential backoff with jitter** for API rate limits, and circuit breakers for provider outages. When I hit a rate limit, I pause requests, wait with exponential backoff (starting at 1s and doubling up to 60s), then retry. Jitter prevents a thundering herd when multiple clients hit limits simultaneously. If errors persist, I trip a circuit breaker that stops sending requests for a period, logging the issue so on-call engineers notice. For known quota limits, I track usage per request type and proactively shed load (e.g., downgrade to a cheaper model or queue non-critical requests) before hitting hard limits. I also monitor provider status pages and request quota increases in advance.

---

## 6. EMBEDDINGS

### Q24: When do you re-embed data?

I re-embed data whenever anything that defines the **embedding space** changes. That includes switching embedding models, changing their parameters, altering chunking strategies, or significantly updating the underlying content. Mixing embeddings from different configurations in one index makes similarity scores unreliable and directly harms retrieval quality. For large corpora, I often re-embed incrementally by version and keep a reindexing plan so updates don't disrupt production traffic.

### Q25: How do you evaluate embeddings?

I evaluate embeddings using **task-focused retrieval benchmarks**. I build a set of queries with known relevant documents and measure metrics like recall@k, precision@k, and nDCG, comparing candidate embedding models or configurations. I complement this with manual inspection of the top results for representative queries to ensure the retrieved passages actually support correct answers.
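The core benchmark metrics are only a few lines each; a sketch:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / k

ranked = ["d3", "d1", "d9", "d4", "d2"]   # retriever output, best first
relevant = {"d1", "d2"}                    # ground-truth labels for this query
print(recall_at_k(ranked, relevant, 3))    # d1 found, d2 missed → 0.5
```

nDCG additionally discounts relevant hits by their rank position, which matters once the gold labels are graded rather than binary.
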
For critical systems, I track downstream answer quality when swapping embedding models to confirm that retrieval improvements translate into better end-to-end outcomes.

### Q26: How do you choose between dense, sparse, and hybrid embeddings?

**Dense embeddings** (from transformer encoders) capture semantic similarity well and scale to massive corpora, but struggle with exact matches and rare terms. **Sparse representations** (BM25 term weights, SPLADE) excel at exact-match and keyword retrieval but don't capture semantics well. **Hybrid retrieval** combines both: run dense and sparse search in parallel, then re-rank the results using a learned combiner or a simple weighted average. For most real-world applications, I start with dense retrieval and add sparse/hybrid only if I see specific failures on exact-match queries. The overhead of hybrid retrieval (roughly 2x query work) is worth it for critical systems where recall is paramount.

### Q27: How would you debug poor embedding quality?

I debug poor embeddings by tracing back to the source. First, I check whether the **embedding model** is appropriate: is it trained on my domain (medical embeddings for medical text, for example)? Second, I verify **chunking quality**: are chunks coherent units or random fragments? Bad chunking will produce bad embeddings even with a good model. Third, I inspect **retrieval failures** manually: sample failed queries and check whether the top-ranked documents are actually relevant or just high-scoring noise. Fourth, I measure **embedding stability**: do similar queries produce similar embeddings, or is there high variance? Finally, I test candidate new models on my task-specific benchmarks before swapping them into production.

---

## 7. EVALUATION

### Q28: How do you evaluate hallucinations?

I define hallucinations as **claims not supported** by the available evidence and evaluate directly against that definition.
I run pipelines that compare the model's answer to the retrieved context, using automated heuristics and LLM-as-judge evaluations to label statements as supported, contradicted, or unsupported. For high-risk domains, I add human review and domain-specific test sets where the ground truth is known, so I can measure both faithfulness and factual correctness. This gives me a quantitative hallucination rate that I can track over time as I adjust retrieval, prompts, or models.

### Q29: How do you catch regressions?

I catch regressions by maintaining **fixed evaluation suites** and running them before and after any change to ingestion, retrieval, prompts, models, or orchestration. These suites measure retrieval metrics, answer quality, hallucination rate, and guardrail behavior, so I can spot silent regressions early. In production, I pair this with canary or shadow deployments and monitoring of user-facing metrics (error rates, dissatisfaction signals, override usage) to detect real-world degradations. Changes only roll out fully once they pass both the offline evaluations and limited-scope production checks.

### Q30: How do you build a gold-standard evaluation dataset?

I build evaluation datasets iteratively. First, I collect **real user queries** or representative examples from subject matter experts (SMEs). Second, I create **ground truth answers** by having SMEs label correct answers, noting which documents/chunks support each claim. Third, I pair queries with **expected retrieval results** so I can evaluate retrieval independently from answer quality. Fourth, I iterate: run my system on these test queries, identify failures, add them to the test set, and repeat. For high-stakes domains, I maintain separate **human review queues** where SMEs regularly audit both successful and failed cases to catch subtle errors.
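One convenient record shape pairs each query with its gold answer and the SME-labeled supporting chunks, so retrieval can be scored independently of generation. The field names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    query: str
    gold_answer: str
    supporting_chunk_ids: frozenset[str]  # SME-labeled evidence for the answer

def retrieval_hit(case: EvalCase, retrieved_ids: list[str], k: int = 5) -> bool:
    """Retrieval 'passes' if at least one labeled supporting chunk appears in
    the top k, regardless of what answer the model later generates."""
    return bool(case.supporting_chunk_ids & set(retrieved_ids[:k]))

case = EvalCase(
    query="What is the refund window?",
    gold_answer="30 days from purchase.",
    supporting_chunk_ids=frozenset({"policy#12"}),
)
print(retrieval_hit(case, ["faq#3", "policy#12", "policy#9"]))  # → True
```
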
A good test set is never complete; it evolves as the system and user needs change.

### Q31: How do you measure end-to-end system quality?

I measure end-to-end quality using **task-specific metrics** that matter to users. For Q&A, I measure correctness, completeness (does the answer fully address the question?), and faithfulness (are all claims supported?). For summarization, I measure factuality, relevance, and conciseness. For open-ended generation, I use a combination of automatic metrics (ROUGE, BERTScore) and human judgments. I also track **user satisfaction** (ratings, override rates, feedback) because automatic metrics sometimes miss important quality dimensions. Finally, I decompose failures to understand whether they originate in retrieval, grounding, tool use, or model reasoning, so I can improve the right component.

---

## 8. SYSTEM DESIGN & PRODUCTION

### Q32: How would you design a RAG system for high availability and fault tolerance?

For high availability, I use **replicated index shards** across multiple machines and **load-balanced retrieval**. I maintain a primary and a secondary index, so if the primary fails, queries automatically route to the secondary. For large indices, I shard by document or query type so failures affect only a subset of queries. I also implement **graceful degradation**: if semantic search is slow or fails, fall back to keyword search; if retrieval fails entirely, pass the model the user query alone and let it reason without external context. I use **health checks** and monitoring to detect failures early and trigger failover before users notice. Finally, I implement **circuit breakers** for dependencies so a slow embedding service doesn't cascade into a system-wide outage.

### Q33: How do you design the UX for a slow or uncertain AI system?

For systems that are inherently slow (e.g., 5+ seconds of latency), I use **streaming responses** to show progress and give users confidence that the system is working.
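At the application layer, streaming reduces to yielding partial output as it arrives instead of returning one blob at the end; a minimal sketch with a stubbed token source (real streaming LLM clients yield chunks in a similar shape):

```python
from typing import Iterator

def fake_model_stream(answer: str) -> Iterator[str]:
    """Stand-in for a streaming LLM client, yielding one word at a time."""
    for token in answer.split(" "):
        yield token + " "

def render_streaming(tokens: Iterator[str]) -> str:
    """Flush each chunk as soon as it arrives; here we just collect them."""
    shown = []
    for chunk in tokens:
        shown.append(chunk)  # in a real UI: append to the visible answer now
    return "".join(shown).rstrip()

print(render_streaming(fake_model_stream("Refunds take 30 days")))
```

The payoff is perceived latency: the user sees the first words within the model's time-to-first-token rather than waiting for the full generation.
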
I also **surface uncertainty**: show confidence scores, cite sources, and explicitly say "I'm not confident in this answer" when appropriate. For uncertain answers, I either offer a human-in-the-loop fallback or present alternative answers so users can choose. I also **surface limitations transparently**: "I can answer questions about documents from 2024; for earlier data, please check the archive." This builds trust even when the system isn't perfect. Finally, I always provide a way for users to give feedback (thumbs up/down, corrections) so the system can learn and improve over time.

### Q34: How would you handle prompt injection and adversarial inputs?

Prompt injection is when a user tries to trick the system into ignoring its instructions by embedding conflicting instructions in the input. I defend against this with **strict input validation**: reject or sanitize inputs that contain prompt-like patterns (e.g., "ignore previous instructions", "system prompt", "instead do"). I also **separate user input from instructions** at the prompt level by using strong delimiters and making it clear to the model which parts are user-provided. I add **output validation**: check that the model's response follows the intended structure and contains no injected instructions. For critical systems, I log and audit all suspicious inputs and run security reviews for any novel injection attempts. I also set user expectations: make it clear that the system has fixed instructions that prompts can't override.

### Q35: How do you monitor and alert on LLM system quality in production?

I monitor both **system health** and **output quality**. System health includes latency, error rates, throughput, and resource usage. Output quality includes hallucination rate, user satisfaction, refusal rate, and safety metrics. I set alerts on sudden changes: e.g., if latency jumps from 2s to 10s, or the hallucination rate rises from 2% to 10%, I get paged.
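A sudden-change alert like that can be sketched as a rolling failure rate checked against a threshold; the window and threshold values below are illustrative, not recommendations:

```python
from collections import deque

class RollingRateAlert:
    """Track a pass/fail signal over the last `window` requests and fire
    when the failure rate reaches `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.10):
        self.events = deque(maxlen=window)  # True = this request failed the check
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one outcome; return True if the alert should fire."""
        self.events.append(failed)
        rate = sum(self.events) / len(self.events)
        # Only fire once the window is full, to avoid paging on tiny samples.
        return len(self.events) == self.events.maxlen and rate >= self.threshold

alert = RollingRateAlert(window=10, threshold=0.2)
fired = [alert.record(i in (7, 8, 9)) for i in range(10)]  # 3 failures out of 10
print(fired[-1])  # → True: 30% failure rate crossed the 20% threshold
```

The same shape works for latency SLO breaches or refusal spikes: anything you can reduce to a per-request boolean.
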
I also track trends: if the hallucination rate is slowly climbing, that's a sign the knowledge base is drifting or retrieval is degrading. I use **anomaly detection** to flag unusual patterns (e.g., a user asking the same question 100 times in a row, which might indicate a test or an adversarial attack). Finally, I implement **user feedback loops**: users rate answers, and I track which features correlate with high satisfaction, then double down on those.

---

## Final Principles

GenAI is fundamentally a **systems engineering problem**. Reliable, high-quality systems require:

1. **Disciplined retrieval**: Strong ranking, filtering, and validation of context
2. **Clean ingestion**: Structured data extraction, versioning, and quality checks
3. **Smart orchestration**: Clear agent boundaries, explicit state, comprehensive monitoring
4. **Evaluation rigor**: Fixed test sets, end-to-end metrics, and continuous regression detection
5. **Production readiness**: Fault tolerance, graceful degradation, transparent UX, and security

**Prompt tuning and model selection alone are insufficient.** The best results come from thoughtful system design, empirical validation, and operational discipline.

---

## Additional Resources

- **DataCamp RAG Guide**: https://www.datacamp.com/blog/rag-interview-questions
- **Generative AI System Design Interview**: https://igotanoffer.com/en/advice/generative-ai-system-design-interview
- **Document Chunking Strategies (Dataquest)**: https://www.dataquest.io/blog/document-chunking-strategies-for-vector-databases/
- **Chunking Strategies (Pinecone)**: https://www.pinecone.io/learn/chunking-strategies/
- **vLLM Performance Update**: https://blog.vllm.ai/2024/09/05/perf-update.html

---

**Prepared for:** Technical Interview Preparation
**Audience:** ML Engineers, AI Systems Engineers, GenAI Specialists
**Difficulty:** Intermediate to Advanced