Spaces:

arjun10g
/

RAG-PSYCH

Running

App Files Files Community

RAG-PSYCH / docs /interview-talking-points.md

arjun10g

Initial deploy to Hugging Face Spaces

08fc97e about 1 month ago

preview code

raw

history blame contribute delete

9.07 kB

	# Interview Talking Points

	Rehearsed answers mapped to likely questions. Each answer is anchored in a
	concrete decision you made in this codebase, so you can always point at code
	when pressed for specifics.

	## "Walk me through your RAG pipeline."

	Ingestion is a three-step transform. I pull MTSamples psychiatry notes from
	a CSV, chunk them with a section-aware regex that recognizes clinical headers
	like HPI, ASSESSMENT, and PLAN — with a recursive character splitter as a
	fallback for notes that don't follow the template. Each chunk gets embedded
	with S-PubMedBert-MS-MARCO, a BERT-family encoder pretrained on biomedical
	text, producing a 768-dimensional vector. Chunks, metadata, and vectors land
	in a Postgres table with an HNSW index for approximate nearest neighbor
	search and a GIN index on a tsvector column for BM25.

	At query time, I embed the user query with the same model, run hybrid search
	— cosine similarity plus BM25, weighted 70/30 — and take the top 5 chunks.
	Those get formatted into a prompt with a strict system instruction: only use
	the provided context, cite every claim with `[chunk_id]`, and refuse with a
	specific string if the answer isn't in the context. After generation, I
	validate that every cited chunk_id actually appears in the retrieved set —
	if one doesn't, the model hallucinated a citation, and we flag the response.

	FastAPI serves this via a `/query` endpoint with Pydantic validation and
	audit-style logging, Streamlit demos it with a side-by-side view of the
	answer and retrieved chunks, and Docker Compose ties it all together.

	## "How do you prevent hallucinations?"

	Three layers of defense.

	Layer one is retrieval quality. Better chunks mean less hallucination
	pressure on the LLM. I use a domain-specific embedder, hybrid search to
	catch both semantic matches and exact strings, and I evaluate precision@5
	on a hand-labeled set so I know when retrieval is failing.

	Layer two is prompt engineering. The system prompt forces citations for
	every claim and specifies an exact refusal string when the context doesn't
	support an answer. Exact-string refusal is deliberate — soft refusals like
	"I'm not sure but..." still hallucinate. An exact string is machine-checkable.

	Layer three is post-generation validation. I parse the citations out of the
	response and verify each cited chunk_id appears in the retrieved set. An
	invalid citation is a strong signal the model fabricated content. In
	production I'd extend this to faithfulness scoring — checking whether each
	claim is actually supported by its cited chunk, probably via a cross-encoder
	or a second LLM call.

	## "Why pgvector over Pinecone or Weaviate?"

	For a dataset that fits on a laptop, pgvector wins on operational simplicity
	— one image, one query language, metadata filtering and vector search in the
	same SQL statement. Transactional guarantees come free. At multi-million-chunk
	scale, the tradeoffs shift and a managed service might win on operational
	overhead. But for a prototype, pulling in a managed vector DB is
	over-engineering, and it signals "I follow tutorials" rather than "I make
	considered choices."

	## "How do you handle PHI and HIPAA?"

	This project uses MTSamples only — public, de-identified data — so HIPAA
	doesn't technically apply. But I built it as if it did, because those habits
	need to exist before real data shows up.

	Specifically: API keys live only in a gitignored `.env` file, never in code.
	Audit logs contain query hashes, not raw query text — because in production
	queries might reference patient identifiers. All SQL is parameterized, so
	query content can't be injected. Containers run as non-root. CORS is locked
	down to the UI origin only. The health endpoint reveals nothing about
	internals.

	For a real deployment with PHI: BAA with Anthropic first, or swap the
	generator for a self-hosted Llama via vLLM in a VPC. Add de-identification
	in ingestion — NER for names, dates, MRNs. Encrypt the vector DB volume at
	rest. Immutable audit logging to a WORM bucket. Role-based access controls
	with short-lived credentials. Network segmentation with an egress proxy
	logging every outbound API call.

	## "Why didn't you fine-tune the embedder?"

	Two reasons. First, without a labeled evaluation set big enough to measure
	a real delta, fine-tuning is theatrical — you produce a model and a feeling,
	not a measurable improvement. Second, the base model is already
	transfer-learned twice: BERT → PubMedBERT for biomedical domain,
	PubMedBERT → S-PubMedBERT-MS-MARCO for the sentence-embedding objective.
	It's domain-appropriate out of the box.

	That said, here's exactly what I'd do next if I had two more days. Generate
	synthetic query-chunk pairs with Claude — maybe 1000-2000 triplets of
	(query, relevant_chunk, irrelevant_chunk). Have a clinician validate a
	random sample. Fine-tune with MultipleNegativesRankingLoss. Measure
	precision@5 and MRR against the current baseline before shipping. If the
	delta is under 3 points, keep the off-the-shelf model and spend the effort
	elsewhere. That's the production mindset — measure, then decide.

	## "What's the difference between a BERT embedder and a GPT-style LLM?"

	Different architectures for different jobs. BERT is a bi-directional encoder
	— every token attends to every other token in both directions, and the model
	produces a single fixed-size vector per input. Trained with masked language
	modeling and then contrastive objectives for embedders. It's good at "what
	does this text mean?" in a single vector.

	GPT-style models are autoregressive decoders — left-to-right, next-token
	prediction. They're good at generation, not at producing single
	representations. You can extract embeddings from them, but they're generally
	worse than purpose-built bi-directional encoders for retrieval.

	In this pipeline, the BERT-family encoder handles retrieval, the Claude
	decoder handles generation. Right tool for each job.

	## "How would you scale this to millions of notes?"

	Two levers. Ingestion becomes embarrassingly parallel — PySpark or Ray
	across a cluster, embedding in batches on GPU nodes. I'd keep pgvector for
	moderate scale (tens of millions of chunks) with proper HNSW tuning and
	partitioning, or move to a managed service like Databricks Vector Search
	beyond that.

	At query time, the bottleneck moves to two places: the ANN index (tune
	`ef_search` for the precision/latency tradeoff), and the LLM call (add a
	reranker so you can retrieve top-50 cheaply and reranker-narrow to top-5
	before paying the generation cost). Prompt caching on the system prompt
	saves real money at scale.

	Monitoring becomes essential. Track retrieval precision, faithfulness,
	latency percentiles, refusal rate. Drift in any of these is the first
	signal that the index or the corpus has changed in ways that hurt quality.

	## "How do you work with clinical stakeholders?"

	Demo-driven. A Streamlit prototype in a meeting is worth ten slides.
	Clinicians tell you almost instantly what's wrong with retrieval when they
	can see both the answer and the sources side-by-side — which is why my UI
	shows retrieved chunks alongside every answer. Transparency isn't a
	nice-to-have in healthcare; it's the feature that makes clinicians trust
	the tool enough to use it.

	Iteration cycles: build minimum viable retrieval, show it, collect specific
	failure cases from the clinicians, bake those into the eval set, improve
	against the eval, show the delta. Repeat. Every "the system got this wrong"
	becomes a test case that prevents regression.

	## "What would you do differently next time?"

	Three things, roughly in order.

	Build the eval harness first. I built the pipeline and then the eval, which
	meant I had to retrofit test queries around what the system already did.
	Writing the eval first would have forced clearer thinking about what
	"success" meant.

	Invest earlier in reranking. Cross-encoder reranking on top-20 candidates
	improves precision significantly for marginal latency cost. I left it as a
	"next step" but it should have been in V1.

	Make evaluation continuous. Right now `run_eval.py` is a one-shot script.
	In production I'd want these metrics tracked over time, alerting on
	regression, with a versioned eval set that grows as clinicians flag new
	failure cases.

	## Bridge lines for your existing experience

	- Data pipelines: "I've built ETL pipelines in Python — ingestion here is
	the same pattern, with a neural network doing the transform step."
	- PostgreSQL: "I know Postgres well, so pgvector let me reuse all my
	existing indexing and query optimization intuition."
	- FastAPI: "This is my normal API stack — the only difference is the
	endpoint returns both the answer and the sources for auditability."
	- Docker: "Standard multi-container setup. The interesting constraint
	was container hardening — non-root users, no secrets baked in."
	- Streamlit: "I use Streamlit for demos that need to land with
	non-technical stakeholders. Here it's showing clinical users both the
	answer and the evidence."