Spaces:

BrejBala
/

rag-agent-workbench-api

Sleeping

App Files Files Community

rag-agent-workbench-api / docs /CONTEXT.md

BrejBala

final changes with API key

b09b8a3 about 1 month ago

preview code

raw

history blame contribute delete

27.9 kB

	# RAG Agent Workbench – Context and Design

	## Project Purpose

	RAG Agent Workbench is a lightweight experimentation backend for retrieval-augmented generation (RAG). It focuses on:
	- Fast ingestion of documents into a Pinecone index with integrated embeddings.
	- Simple, production-style APIs for search and chat-style question answering.
	- Keeping the backend slim: no local embedding or LLM models, relying instead on managed services.

	---

	## Current Architecture

	- Client(s)
	- Any HTTP client (curl, scripts in `scripts/`, future UI) talks to the FastAPI backend.

	- Backend (FastAPI, `backend/app`)
	- `routers/`
	- `health.py` – service status.
	- `ingest.py` – /ingest/wiki, /ingest/openalex, /ingest/arxiv.
	- `documents.py` – manual uploads and stats.
	- `search.py` – semantic search over Pinecone.
	- `chat.py` – agentic RAG chat using LangGraph + LangChain.
	- `services/`
	- `ingestors/` – fetch content from arXiv, OpenAlex, Wikipedia.
	- `chunking.py` – chunk documents into Pinecone-ready records.
	- `dedupe.py` – in-memory duplicate record removal.
	- `normalize.py` – text normalisation and doc id generation.
	- `pinecone_store.py` – Pinecone init, search, upsert, stats.
	- `llm/groq_llm.py` – Groq-backed chat model wrapper.
	- `tools/tavily_tool.py` – Tavily web search integration.
	- `prompts/rag_prompt.py` – RAG system + user prompts.
	- `chat/graph.py` – LangGraph state graph for /chat.
	- `core/`
	- `config.py` – env-driven configuration.
	- `errors.py` – app-specific exceptions + handlers.
	- `logging.py` – basic logging setup.
	- `tracing.py` – LangSmith / LangChain tracing helper.
	- `schemas/` – Pydantic models for all endpoints.

	- Vector Store
	- Pinecone index with integrated embeddings.
	- Text field configurable via `PINECONE_TEXT_FIELD`.

	- LLM and Tools
	- Groq OpenAI-compatible chat model via `langchain-openai`.
	- Tavily web search via `langchain-community` tool (optional).
	- LangGraph orchestrates retrieval → routing → web search → generation.

	---

	## Implemented Endpoints

	\| HTTP Method \| Path \| Description \|
	\|------------\|-------------------------\|------------------------------------------------------------------\|
	\| GET \| `/health` \| Health check with service name and version. \|
	\| POST \| `/ingest/arxiv` \| Ingest recent arXiv entries matching a query. \|
	\| POST \| `/ingest/openalex` \| Ingest OpenAlex works matching a query. \|
	\| POST \| `/ingest/wiki` \| Ingest Wikipedia pages by title. \|
	\| POST \| `/documents/upload-text`\| Upload raw/manual text or Docling-converted content. \|
	\| GET \| `/documents/stats` \| Get vector counts per namespace from Pinecone. \|
	\| POST \| `/search` \| Semantic search over Pinecone using integrated embeddings. \|
	\| POST \| `/chat` \| Production-style RAG chat using LangGraph + Groq + Pinecone. \|
	\| POST \| `/chat/stream` \| SSE streaming variant of `/chat`. \|

	---

	## Key Design Decisions

	- Integrated embeddings only
	- No local embedding models; Pinecone is configured with integrated embeddings.
	- Backend stays light and easy to deploy in constrained environments.

	- OpenAI-compatible LLM interface
	- Groq is accessed via the OpenAI-compatible API (`langchain-openai`).
	- Avoids additional provider-specific SDKs and keeps integration simple.

	- Agentic RAG flow using LangGraph
	- Chat pipeline is modelled as a state graph:
	1. `normalize_input` – set defaults, normalise chat history.
	2. `retrieve_context` – Pinecone retrieval.
	3. `decide_next` – route to web search or generation.
	4. `web_search` – Tavily search (optional).
	5. `generate_answer` – Groq LLM with RAG prompts.
	6. `format_response` – reserved for post-processing.
	- This makes the flow explicit and easy to extend.

	- Web search as a conditional fallback
	- Tavily web search is used only when:
	- Retrieval returns no hits, or
	- Top score is below a threshold (`min_score`), and
	- `use_web_fallback=true` and `TAVILY_API_KEY` is configured.
	- When Tavily is not configured, the system degrades gracefully to retrieval-only.

	- LangSmith tracing via environment flags
	- Tracing is enabled purely via environment:
	- `LANGCHAIN_TRACING_V2=true`
	- `LANGCHAIN_API_KEY` set
	- Optional: `LANGCHAIN_PROJECT`
	- `core/tracing.py` exposes helper functions that:
	- Check if tracing is enabled.
	- Construct callback handlers (`LangChainTracer`) for LangGraph/LangChain.
	- Expose trace metadata in API responses.

	- Error handling boundary
	- External dependencies (Pinecone, Groq, Tavily) are wrapped so that:
	- Configuration errors return 500s with clear messages.
	- Upstream service failures raise `UpstreamServiceError` and surface as HTTP 502.
	- This keeps failure modes explicit for clients.

	---

	## Work Package History

	### Work Package A

	- Scope
	- Initial backend setup with FastAPI, Pinecone integration, and ingestion/search endpoints.
	- Highlights
	- `/ingest/wiki`, `/ingest/openalex`, `/ingest/arxiv` for sourcing content.
	- `/documents/upload-text` for manual/Docling-based uploads.
	- `/search` and `/documents/stats` endpoints to query and inspect the index.
	- How to test
	- Use `scripts/seed_ingest.py` and `scripts/smoke_arxiv.py` to seed and smoke-test ingestion.

	### Work Package B (this change)

	- Scope
	- Add a production-style `/chat` RAG endpoint using LangGraph and LangChain.
	- Integrate Groq as the LLM and Tavily as an optional web search fallback.
	- Introduce LangSmith tracing hooks and update documentation and smoke tests.

	- Functional changes
	- New router: `backend/app/routers/chat.py`
	- `POST /chat`
	- Runs a LangGraph state graph:
	1. Normalises inputs and defaults.
	2. Retrieves context from Pinecone.
	3. Decides whether to invoke web search.
	4. Runs Tavily web search when enabled and needed.
	5. Calls Groq LLM with a RAG prompt to generate the answer.
	6. Returns answer, sources, timings, and trace metadata.
	- `POST /chat/stream`
	- Same pipeline as `/chat` but returns Server-Sent Events (SSE).
	- Streams tokens from the final answer plus a terminating event with the full JSON payload.

	- New schemas: `backend/app/schemas/chat.py`
	- `ChatRequest` with:
	- `query`, `namespace`, `top_k`, `use_web_fallback`,
	`min_score`, `max_web_results`, and `chat_history`.
	- `SourceHit` representing document/web snippets.
	- `ChatTimings` and `ChatTraceMetadata` for timings and LangSmith info.
	- `ChatResponse` combining answer, sources, timings, and trace metadata.

	- New services:
	- `backend/app/services/llm/groq_llm.py`
	- `get_llm()` returns a Groq-backed `ChatOpenAI` with:
	- `base_url` = `GROQ_BASE_URL` (default `https://api.groq.com/openai/v1`).
	- `model` = `GROQ_MODEL` (default `llama-3.1-8b-instant`).
	- Timeouts and retries from HTTP settings.
	- Raises a configuration error if `GROQ_API_KEY` is missing.

	- `backend/app/services/tools/tavily_tool.py`
	- `is_tavily_configured()` checks `TAVILY_API_KEY`.
	- `get_tavily_tool(max_results)` wraps `TavilySearchResults` from
	`langchain-community`.
	- Logs a warning and returns `None` when Tavily is not configured, disabling web fallback gracefully.

	- `backend/app/services/prompts/rag_prompt.py`
	- Defines RAG system and user prompts.
	- `build_rag_messages(chat_history, question, sources)` builds
	LangChain messages that:
	- Use only supplied context.
	- Label context snippets as `[1]`, `[2]`, etc., and instruct the model
	to cite them inline.

	- `backend/app/services/chat/graph.py`
	- Implements the LangGraph `ChatState` and state graph with nodes:
	- `normalize_input`
	- `retrieve_context`
	- `decide_next`
	- `web_search`
	- `generate_answer`
	- `format_response`
	- Uses Pinecone search for retrieval and Tavily for optional web search.
	- Calls the Groq LLM via `get_llm()` with LangChain Runnable config
	(`callbacks`) so LangSmith traces are collected when enabled.
	- Records `retrieve_ms`, `web_ms`, and `generate_ms` in `timings`.

	- New core utility:
	- `backend/app/core/tracing.py`
	- `is_tracing_enabled()` checks `LANGCHAIN_TRACING_V2` and `LANGCHAIN_API_KEY`.
	- `get_tracing_callbacks()` returns a `LangChainTracer` callback list when enabled.
	- `get_tracing_response_metadata()` returns `{langsmith_project, trace_enabled}`.

	- Configuration changes:
	- `backend/app/core/config.py` adds:
	- `GROQ_API_KEY`, `GROQ_BASE_URL`, `GROQ_MODEL`.
	- `TAVILY_API_KEY`.
	- `RAG_DEFAULT_TOP_K`, `RAG_MIN_SCORE`, `RAG_MAX_WEB_RESULTS`.
	- `backend/.env.example` updated with the new env vars, including LangSmith options.

	- Error handling:
	- `backend/app/core/errors.py` introduces `UpstreamServiceError`.
	- Centralised handler converts `UpstreamServiceError` into HTTP 502 responses.

	- Documentation and scripts:
	- `backend/README.md` updated with `/chat` and `/chat/stream` usage,
	env vars, and a local test checklist.
	- New scripts:
	- `scripts/smoke_chat.py` – uses `/ingest/wiki` and `/chat` for a local smoke test.
	- `scripts/smoke_chat_web.py` – tests `/chat` with `use_web_fallback=true`
	and a query that should trigger web search.

	- How to test
	1. Start the backend:
	```bash
	cd backend
	uvicorn app.main:app --reload --port 8000
	```
	2. Ingest some Wikipedia pages:
	```bash
	python ../scripts/smoke_chat.py --backend-url http://localhost:8000 --namespace dev
	```
	3. Test web fallback (requires `TAVILY_API_KEY`):
	```bash
	python ../scripts/smoke_chat_web.py --backend-url http://localhost:8000 --namespace dev
	```
	4. Verify LangSmith traces:
	- Set `LANGCHAIN_TRACING_V2=true`, `LANGCHAIN_API_KEY`, and optionally `LANGCHAIN_PROJECT`.
	- Run `/chat` again and confirm traces appear in LangSmith.

	---

	## Known Issues / Limits

	- No local models
	- The backend intentionally does not host local embedding or LLM models.
	- All intelligence is delegated to Pinecone (integrated embeddings), Groq, and Tavily.

	- Retrieval quality depends on ingestion
	- The usefulness of `/chat` depends heavily on the quality and coverage of the ingested documents.
	- For some queries, even the best matching chunks may not be sufficient to answer without web fallback.

	- Best-effort web search
	- Tavily integration is optional and depends on the external Tavily API.
	- When Tavily is unavailable or misconfigured, the backend falls back to retrieval-only answers.

	- Simple SSE streaming
	- `/chat/stream` streams tokens derived from the final answer string rather than streaming directly from the LLM.
	- This keeps implementation simple while still providing a streaming interface.

	---

	## Work Package C

	### Scope

	- Make the backend deploy-ready on Hugging Face Spaces using Docker.
	- Add a minimal Streamlit frontend suitable for Streamlit Community Cloud (no Docker).
	- Add production polish: basic API protection, rate limiting, caching, metrics, and a small benchmarking script.
	- Keep configuration sane by default, with environment variables as overrides rather than hard requirements.

	### Backend changes (HF Spaces deploy + runtime)

	- Docker / port behaviour
	- `backend/Dockerfile` now:
	- Exposes port 7860 (the default for many Hugging Face Spaces deployments).
	- Uses a shell-form `CMD` so `PORT` can be honoured when set:
	- `uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-7860}`
	- New helper: `backend/app/core/runtime.py`
	- `get_port()`:
	- Reads `PORT` from the environment.
	- Defaults to `7860` when unset or invalid.
	- Logs: `Starting on port=<port> hf_spaces_mode=<bool>` using a simple heuristic (`SPACE_ID` / `SPACE_REPO_ID` env vars).
	- Called from `app.main` at import time so the log line is visible in container logs during startup.

	### API key protection and CORS

	- API key protection
	- New module: `backend/app/core/auth.py`
	- Defines `require_api_key` FastAPI dependency using `APIKeyHeader` (`X-API-Key`).
	- `validate_api_key_configuration()` runs at startup and enforces:
	- In production-like environments (`ENV=production` or on Hugging Face Spaces via `SPACE_ID` / `HF_HOME`):
	- `API_KEY` must be set or the backend fails fast with a clear error.
	- In local development:
	- If `API_KEY` is missing, the backend runs open but logs a prominent warning.
	- `require_api_key` behaviour:
	- If `API_KEY` is not configured (dev mode), the dependency is a no-op.
	- If `API_KEY` is configured:
	- Missing or mismatched `X-API-Key` results in HTTP 403.
	- Wiring:
	- All routers except `/health` are registered with `dependencies=[Depends(require_api_key)]`.
	- Docs and OpenAPI endpoints are explicitly secured:
	- `GET /openapi.json` – returns `app.openapi()`, protected by `require_api_key`.
	- `GET /docs` – Swagger UI via `get_swagger_ui_html`, protected by `require_api_key`.
	- `GET /redoc` – ReDoc UI via `get_redoc_html`, protected by `require_api_key`.
	- Effect:
	- In HF Spaces / production:
	- `/docs`, `/redoc`, `/openapi.json`, `/chat`, `/search`, `/documents/`, `/ingest/`, `/metrics` all require `X-API-Key`.
	- `/health` remains public for simple uptime checks.
	- In local dev with no `API_KEY`:
	- All endpoints (including docs) are accessible without a key for convenience.

	- CORS configuration
	- `backend/app/core/security.py` now focuses solely on CORS:
	- Reads `ALLOWED_ORIGINS` env var as a comma-separated list.
	- If unset or empty:
	- Defaults to `["*"]` (permissive, useful for local dev and quick demos).
	- Applies FastAPI `CORSMiddleware` with:
	- `allow_origins=origins`
	- `allow_methods=["*"]`
	- `allow_headers=["*"]`
	- API key enforcement is handled entirely via `core/auth.py` and router/dependency wiring.

	### Rate limiting (SlowAPI)

	- New module: `backend/app/core/rate_limit.py`
	- Uses `slowapi.Limiter` with `get_remote_address` as the key function.
	- `setup_rate_limiter(app)`:
	- Reads `RATE_LIMIT_ENABLED` from `Settings` (defaults to `True`).
	- If disabled:
	- Logs `"Rate limiting is disabled via settings."`
	- Does not attach middleware (decorators become no-ops at runtime).
	- If enabled:
	- Attaches SlowAPI middleware: `app.middleware("http")(limiter.middleware)`.
	- Registers a custom `RateLimitExceeded` handler returning JSON:
	- HTTP `429`
	- Body: `{"detail": "Rate limit exceeded. Please slow down your requests.", "retry_after": ...}` when available.
	- Logs violations with client IP and path.

	- Endpoint-specific limits (per IP):
	- `/chat` and `/chat/stream`:
	- Decorated with `@limiter.limit("30/minute")`.
	- `/ingest` endpoints:
	- `/ingest/arxiv`, `/ingest/openalex`, `/ingest/wiki`:
	- `@limiter.limit("10/minute")`.
	- `/search`:
	- `@limiter.limit("60/minute")`.

	- Operational toggle:
	- New config flag in `Settings`:
	- `RATE_LIMIT_ENABLED: bool = True`
	- `.env.example`:
	- `RATE_LIMIT_ENABLED=true` (set to `false` to disable entirely).

	### Caching (cachetools, in-memory)

	- New module: `backend/app/core/cache.py`
	- Uses `cachetools.TTLCache` with short in-memory TTLs (no external store):
	- Search cache:
	- `TTL = 60s`, `maxsize = 1024`.
	- Keys: `(namespace, query, top_k, filters_json)` where `filters_json` is a JSON-serialised, sorted representation of the `filters` dict.
	- Chat cache:
	- `TTL = 60s`, `maxsize = 512`.
	- Keys: `(namespace, query, top_k, min_score, use_web_fallback)`.
	- Only used when no chat history is provided.

	- API:
	- `cache_enabled() -> bool` (reads `CACHE_ENABLED` from settings, default `True`).
	- `get_search_cached(...)` / `set_search_cached(...)`.
	- `get_chat_cached(...)` / `set_chat_cached(...)`.
	- `get_cache_stats()` returns hit/miss counters:
	- `search_hits`, `search_misses`, `chat_hits`, `chat_misses`.

	- Hit/miss logging:
	- Each cache lookup logs a hit or miss with namespace and query for observability.

	- Integration into endpoints:
	- `/search` (`backend/app/routers/search.py`):
	- On each request:
	1. Check `get_search_cached(...)`.
	2. If hit: use cached `hits_raw` list.
	3. If miss: call Pinecone search and then `set_search_cached(...)`.
	- Response construction (mapping text field to `chunk_text`) remains unchanged.

	- `/chat` (`backend/app/routers/chat.py`):
	- Caching is only considered when `chat_history` is empty and caching is enabled.
	- Flow:
	1. Test `cache_enabled()` and `not payload.chat_history`.
	2. Attempt `get_chat_cached(...)`.
	3. On hit:
	- Log and return the cached `ChatResponse`.
	- Still call `record_chat_timings(...)` so `/metrics` reflects cached responses.
	4. On miss:
	- Run the LangGraph pipeline as before.
	- Record timings via `record_chat_timings(...)`.
	- Store the `ChatResponse` in the chat cache via `set_chat_cached(...)`.

	- Operational toggle:
	- New config flag in `Settings`:
	- `CACHE_ENABLED: bool = True`
	- `.env.example`:
	- `CACHE_ENABLED=true` (set to `false` to fully disable caching).

	### Metrics and observability

	- New module: `backend/app/core/metrics.py`
	- In-memory metrics only, with a small footprint and no external dependencies beyond stdlib.
	- Tracks:
	- Request counts by path:
	- `_request_counts[path]` incremented for every request, via `metrics_middleware`.
	- Error counts by path:
	- `_error_counts[path]` incremented for any response with `status_code >= 400` or for unhandled exceptions.
	- Chat timing metrics:
	- Focused on `/chat` and `/chat/stream`.
	- Expected fields:
	- `retrieve_ms`, `web_ms`, `generate_ms`, `total_ms`.
	- Stored in:
	- `_timing_samples`: `deque(maxlen=20)` for the last 20 samples.
	- `_timing_sums` and `_timing_count` for averages.

	- Middleware:
	- `metrics_middleware(request, call_next)`:
	- Records per-path request and error counts.
	- Logs debug-level timing for each request.

	- API functions:
	- `record_chat_timings(timings: Mapping[str, float])`:
	- Updates sums, counts, and the ring buffer.
	- Called from both `/chat` and `/chat/stream` after timings are known.
	- `get_metrics_snapshot()`:
	- Builds a snapshot dictionary containing:
	- `requests_by_path`
	- `errors_by_path`
	- `timings`:
	- `average_ms` for each timing field.
	- `p50_ms` and `p95_ms` based on the last 20 samples.
	- `cache`:
	- `search_hits`, `search_misses`, `chat_hits`, `chat_misses` from `core.cache`.
	- `sample_count` and `samples` (the last 20 timing entries).

	- `/metrics` endpoint
	- New router: `backend/app/routers/metrics.py`
	- `GET /metrics` returns `get_metrics_snapshot()` as JSON.
	- Registered in `app.main` with tag `["metrics"]`.
	- Left public (not behind API key) to simplify monitoring and demos.

	- App wiring (`backend/app/main.py`)
	- After creating the FastAPI app:
	- `configure_security(app)` – CORS + optional API key.
	- `setup_rate_limiter(app)` – SlowAPI middleware when enabled.
	- `setup_metrics(app)` – metrics middleware.
	- Routers:
	- `health`, `ingest`, `search`, `documents`, `chat`, `metrics` all included.
	- Exception handlers:
	- Still configured via `setup_exception_handlers(app)`.

	### Benchmarking script

	- New script: `scripts/bench_local.py`
	- Purpose:
	- Provide a simple, cross-platform (including Windows) asyncio load tester for the backend.
	- Focused on `/chat`, with optional `/search` benchmarking.
	- Implementation:
	- Uses `httpx.AsyncClient` and `asyncio`.
	- Command-line arguments:
	- `--backend-url` (default: `http://localhost:8000`)
	- `--namespace` (default: `dev`)
	- `--concurrency` (default: `10`)
	- `--requests` (default: `50`)
	- `--include-search` (optional flag to also benchmark `/search`)
	- `--api-key` (optional `X-API-Key` value)
	- For each benchmark:
	- Issues the specified number of requests with the provided concurrency.
	- Records per-request latency (ms) and whether an error occurred.
	- Outputs:
	- Total requests, successes, errors, and error rate.
	- Average latency.
	- p50 and p95 latencies.
	- Entrypoint:
	- `python scripts/bench_local.py --backend-url http://localhost:8000 --namespace dev --concurrency 10 --requests 50`

	### Streamlit frontend (Streamlit Community Cloud)

	- New directory: `frontend/`
	- Main app: `frontend/app.py`
	- Dependencies:
	- `streamlit`
	- `httpx`
	- Backend configuration:
	- Reads `BACKEND_BASE_URL` from `st.secrets["BACKEND_BASE_URL"]` or the `BACKEND_BASE_URL` environment variable.
	- Reads `API_KEY` from `st.secrets["API_KEY"]` or the `API_KEY` environment variable.
	- Sidebar ("Backend" + settings):
	- Shows backend URL and API key status.
	- "Ping /health" button that calls the backend and shows the JSON response.
	- `top_k` slider, `min_score` slider, `use_web_fallback` checkbox.
	- "Show sources" toggle and "Clear chat" button.
	- "Recent uploads" section with quick actions:
	- For each recent upload, displays title, namespace, timestamp.
	- A "Search this document" button pre-fills the chat input with a prompt such as `Summarize: <title>`.
	- Chatbot UI:
	- Uses `st.chat_message` and `st.chat_input` with conversation stored in `st.session_state.messages`.
	- When the user sends a message:
	- Appends it to history and displays it.
	- Calls `/chat/stream` with `X-API-Key` (if available) and streams tokens into the UI.
	- If `/chat/stream` is unavailable (e.g. 404), falls back to `/chat`.
	- Assistant messages:
	- Display the answer text.
	- Optionally show sources in an expandable "Sources" section with titles, URLs, scores, and truncated snippets.
	- If `API_KEY` is not configured in secrets or environment:
	- The app warns and disables sending messages to the protected backend.
	- UI document upload:
	- A top-level “📄 Upload Document” button opens a `@st.dialog` modal.
	- Inside the dialog:
	- `st.file_uploader` for `.pdf`, `.md`, `.txt`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.htm`.
	- Inputs for title (defaulting to filename), namespace, source label, tags, and notes.
	- A checkbox to allow uploading even when extracted text is very short.
	- On submit:
	- The frontend converts the file to text/markdown (using Docling when installed, or raw text for `.md`/`.txt`).
	- Calls backend `POST /documents/upload-text` with `X-API-Key`.
	- On success, records the upload in `st.session_state.recent_uploads` and triggers a rerun to close the dialog.

	- Root-level `requirements.txt`
	- Added to support Streamlit Community Cloud, where the root requirements file is used:
	- `streamlit`
	- `httpx`
	- Backend Docker image continues to use `backend/requirements.txt`, keeping the backend container small and independent.

	---

	## Operational Runbook

	### Rotating keys and secrets

	- Backend (Hugging Face Spaces or other container hosts)
	- Update environment variables / secrets:
	- `PINECONE_API_KEY`, `PINECONE_HOST`, `PINECONE_INDEX_NAME`, `PINECONE_NAMESPACE`, `PINECONE_TEXT_FIELD`
	- `GROQ_API_KEY`, `GROQ_BASE_URL`, `GROQ_MODEL`
	- `TAVILY_API_KEY`
	- `LANGCHAIN_API_KEY`, `LANGCHAIN_TRACING_V2`, `LANGCHAIN_PROJECT`
	- `API_KEY` for HTTP clients
	- Redeploy or restart the Space to apply changes.
	- Verify:
	- `GET /health` returns `status: ok`.
	- `/chat` and `/search` work as expected.
	- `/metrics` shows traffic and cache counters updating.

	- Frontend (Streamlit Community Cloud)
	- Use Streamlit Secrets manager (no secrets in repo):
	- `BACKEND_BASE_URL` – full URL of the backend (e.g. HF Spaces URL).
	- `API_KEY` – must match backend `API_KEY` if API protection is enabled.
	- After rotating backend keys:
	- If `API_KEY` changed, update it in Streamlit secrets.
	- No code changes required.

	### Disabling rate limiting and caching

	- Rate limiting
	- Set `RATE_LIMIT_ENABLED=false` in the backend environment (or `.env` for local).
	- Restart the backend.
	- SlowAPI middleware will not be attached; `@limiter.limit(...)` decorators become effectively no-op for enforcement.
	- `/metrics` will still track request counts and errors.

	- Caching
	- Set `CACHE_ENABLED=false` in the backend environment.
	- Restart the backend.
	- Search and chat endpoints will bypass in-memory TTL caches entirely.
	- `get_cache_stats()` will still report counters, which will stop increasing.

	### Diagnosing common deployment issues

	- Symptom: 404 / connection errors on Hugging Face Spaces
	- Check:
	- The Space is configured as Docker and points to the `backend/` subdirectory (or uses the provided `backend/Dockerfile`).
	- Logs show the startup message:
	- `"Starting on port=... hf_spaces_mode=..."`.
	- HF Spaces sets `PORT` automatically; the Docker `CMD` will honour it.
	- Verify:
	- Open `/docs` and `/health` in the browser using the Space URL.
	- If 404/500 persists:
	- Ensure `PINECONE_*` and `GROQ_API_KEY` are set.
	- Check logs for `PineconeIndexConfigError` or missing LLM configuration.

	- Symptom: 401 Unauthorized from frontend
	- Ensure:
	- Backend `API_KEY` is set and matches the `API_KEY` in Streamlit secrets.
	- Requests include `X-API-Key` header (Streamlit app does this automatically when `API_KEY` is present).
	- Confirm `/health` is still reachable without a key (by design).

	- Symptom: 429 Too Many Requests
	- Indicates SlowAPI rate limiting is active.
	- Options:
	- Reduce load (e.g. from `bench_local.py`).
	- Temporarily set `RATE_LIMIT_ENABLED=false` for heavy local testing.
	- Inspect `/metrics`:
	- Check request counts and error counts for affected paths.

	- Symptom: Stale results after ingestion
	- By default, caches are short-lived (60 seconds) but may briefly serve stale results:
	- When ingesting new documents, `/search` or `/chat` responses may not immediately reflect new content.
	- Workarounds:
	- Wait a minute for TTL expiry.
	- For strict freshness, disable caching with `CACHE_ENABLED=false`.

	- Symptom: Streamlit frontend cannot reach backend
	- Verify:
	- `BACKEND_BASE_URL` in Streamlit secrets is correct and publicly reachable.
	- CORS config on the backend:
	- For debugging, keep `ALLOWED_ORIGINS` unset (defaults to `"*"`).
	- For locked-down deployment, ensure the Streamlit app origin is included.
	- Use the Connectivity panel:
	- Click "Ping /health" and inspect the response or error message.

	---