
RAG Agent Workbench – Context and Design

Project Purpose

RAG Agent Workbench is a lightweight experimentation backend for retrieval-augmented generation (RAG). It focuses on:

  • Fast ingestion of documents into a Pinecone index with integrated embeddings.
  • Simple, production-style APIs for search and chat-style question answering.
  • Keeping the backend slim: no local embedding or LLM models, relying instead on managed services.

Current Architecture

  • Client(s)

    • Any HTTP client (curl, scripts in scripts/, future UI) talks to the FastAPI backend.
  • Backend (FastAPI, backend/app)

    • routers/
      • health.py – service status.
      • ingest.py – /ingest/wiki, /ingest/openalex, /ingest/arxiv.
      • documents.py – manual uploads and stats.
      • search.py – semantic search over Pinecone.
      • chat.py – agentic RAG chat using LangGraph + LangChain.
    • services/
      • ingestors/ – fetch content from arXiv, OpenAlex, Wikipedia.
      • chunking.py – chunk documents into Pinecone-ready records.
      • dedupe.py – in-memory duplicate record removal.
      • normalize.py – text normalisation and doc id generation.
      • pinecone_store.py – Pinecone init, search, upsert, stats.
      • llm/groq_llm.py – Groq-backed chat model wrapper.
      • tools/tavily_tool.py – Tavily web search integration.
      • prompts/rag_prompt.py – RAG system + user prompts.
      • chat/graph.py – LangGraph state graph for /chat.
    • core/
      • config.py – env-driven configuration.
      • errors.py – app-specific exceptions + handlers.
      • logging.py – basic logging setup.
      • tracing.py – LangSmith / LangChain tracing helper.
    • schemas/ – Pydantic models for all endpoints.
  • Vector Store

    • Pinecone index with integrated embeddings.
    • Text field configurable via PINECONE_TEXT_FIELD.
  • LLM and Tools

    • Groq OpenAI-compatible chat model via langchain-openai.
    • Tavily web search via langchain-community tool (optional).
    • LangGraph orchestrates retrieval → routing → web search → generation.

Implemented Endpoints

Method  Path                    Description
GET     /health                 Health check with service name and version.
POST    /ingest/arxiv           Ingest recent arXiv entries matching a query.
POST    /ingest/openalex        Ingest OpenAlex works matching a query.
POST    /ingest/wiki            Ingest Wikipedia pages by title.
POST    /documents/upload-text  Upload raw/manual text or Docling-converted content.
GET     /documents/stats        Get vector counts per namespace from Pinecone.
POST    /search                 Semantic search over Pinecone using integrated embeddings.
POST    /chat                   Production-style RAG chat using LangGraph + Groq + Pinecone.
POST    /chat/stream            SSE streaming variant of /chat.

Key Design Decisions

  • Integrated embeddings only

    • No local embedding models; Pinecone is configured with integrated embeddings.
    • Backend stays light and easy to deploy in constrained environments.
  • OpenAI-compatible LLM interface

    • Groq is accessed via the OpenAI-compatible API (langchain-openai).
    • Avoids additional provider-specific SDKs and keeps integration simple.
  • Agentic RAG flow using LangGraph

    • Chat pipeline is modelled as a state graph:
      1. normalize_input – set defaults, normalise chat history.
      2. retrieve_context – Pinecone retrieval.
      3. decide_next – route to web search or generation.
      4. web_search – Tavily search (optional).
      5. generate_answer – Groq LLM with RAG prompts.
      6. format_response – reserved for post-processing.
    • This makes the flow explicit and easy to extend.
  • Web search as a conditional fallback

    • Tavily web search is used only when both of the following hold:
      • Retrieval returns no hits, or the top score is below the threshold (min_score), and
      • use_web_fallback=true and TAVILY_API_KEY is configured.
    • When Tavily is not configured, the system degrades gracefully to retrieval-only.
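
A minimal sketch of that routing rule (field and function names here are illustrative, not the exact ones in services/chat/graph.py):

      import os

      def decide_next(state: dict) -> str:
          """Route to web search only when retrieval was weak and fallback is allowed."""
          hits = state.get("hits", [])
          top_score = max((h.get("score", 0.0) for h in hits), default=0.0)
          weak_retrieval = not hits or top_score < state["min_score"]
          tavily_configured = bool(os.environ.get("TAVILY_API_KEY"))
          if weak_retrieval and state["use_web_fallback"] and tavily_configured:
              return "web_search"
          return "generate_answer"
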
  • LangSmith tracing via environment flags

    • Tracing is enabled purely via environment:
      • LANGCHAIN_TRACING_V2=true
      • LANGCHAIN_API_KEY set
      • Optional: LANGCHAIN_PROJECT
    • core/tracing.py exposes helper functions that:
      • Check if tracing is enabled.
      • Construct callback handlers (LangChainTracer) for LangGraph/LangChain.
      • Expose trace metadata in API responses.
  • Error handling boundary

    • External dependencies (Pinecone, Groq, Tavily) are wrapped so that:
      • Configuration errors return 500s with clear messages.
      • Upstream service failures raise UpstreamServiceError and surface as HTTP 502.
    • This keeps failure modes explicit for clients.

Work Package History

Work Package A

  • Scope
    • Initial backend setup with FastAPI, Pinecone integration, and ingestion/search endpoints.
  • Highlights
    • /ingest/wiki, /ingest/openalex, /ingest/arxiv for sourcing content.
    • /documents/upload-text for manual/Docling-based uploads.
    • /search and /documents/stats endpoints to query and inspect the index.
  • How to test
    • Use scripts/seed_ingest.py and scripts/smoke_arxiv.py to seed and smoke-test ingestion.

Work Package B (this change)

  • Scope

    • Add a production-style /chat RAG endpoint using LangGraph and LangChain.
    • Integrate Groq as the LLM and Tavily as an optional web search fallback.
    • Introduce LangSmith tracing hooks and update documentation and smoke tests.
  • Functional changes

    • New router: backend/app/routers/chat.py

      • POST /chat
        • Runs a LangGraph state graph:
          1. Normalises inputs and defaults.
          2. Retrieves context from Pinecone.
          3. Decides whether to invoke web search.
          4. Runs Tavily web search when enabled and needed.
          5. Calls Groq LLM with a RAG prompt to generate the answer.
          6. Returns answer, sources, timings, and trace metadata.
      • POST /chat/stream
        • Same pipeline as /chat but returns Server-Sent Events (SSE).
        • Streams tokens from the final answer plus a terminating event with the full JSON payload.
    • New schemas: backend/app/schemas/chat.py

      • ChatRequest with:
        • query, namespace, top_k, use_web_fallback, min_score, max_web_results, and chat_history.
      • SourceHit representing document/web snippets.
      • ChatTimings and ChatTraceMetadata for timings and LangSmith info.
      • ChatResponse combining answer, sources, timings, and trace metadata.
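
A sketch of the request model; the ChatMessage shape and the literal defaults are assumptions (the real defaults come from the RAG_* settings):

      from typing import Optional
      from pydantic import BaseModel, Field

      class ChatMessage(BaseModel):
          role: str      # "user" or "assistant" (assumed history item shape)
          content: str

      class ChatRequest(BaseModel):
          query: str
          namespace: Optional[str] = None         # falls back to the configured namespace
          top_k: Optional[int] = None             # falls back to RAG_DEFAULT_TOP_K
          use_web_fallback: bool = False
          min_score: Optional[float] = None       # falls back to RAG_MIN_SCORE
          max_web_results: Optional[int] = None   # falls back to RAG_MAX_WEB_RESULTS
          chat_history: list[ChatMessage] = Field(default_factory=list)
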
    • New services:

      • backend/app/services/llm/groq_llm.py

        • get_llm() returns a Groq-backed ChatOpenAI with:
          • base_url = GROQ_BASE_URL (default https://api.groq.com/openai/v1).
          • model = GROQ_MODEL (default llama-3.1-8b-instant).
          • Timeouts and retries from HTTP settings.
        • Raises a configuration error if GROQ_API_KEY is missing.
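
A minimal sketch of get_llm() consistent with the above; the timeout/retry values are placeholders for the real HTTP settings, and the app raises its own configuration error type rather than RuntimeError:

      import os
      from langchain_openai import ChatOpenAI

      def get_llm() -> ChatOpenAI:
          api_key = os.environ.get("GROQ_API_KEY")
          if not api_key:
              raise RuntimeError("GROQ_API_KEY is not configured")
          return ChatOpenAI(
              api_key=api_key,
              base_url=os.environ.get("GROQ_BASE_URL", "https://api.groq.com/openai/v1"),
              model=os.environ.get("GROQ_MODEL", "llama-3.1-8b-instant"),
              timeout=30,      # placeholder: real values come from HTTP settings
              max_retries=2,   # placeholder
          )
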
      • backend/app/services/tools/tavily_tool.py

        • is_tavily_configured() checks TAVILY_API_KEY.
        • get_tavily_tool(max_results) wraps TavilySearchResults from langchain-community.
        • Logs a warning and returns None when Tavily is not configured, disabling web fallback gracefully.
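
Roughly, the wrapper amounts to the following (the real module also emits the warning described above):

      import os
      from langchain_community.tools.tavily_search import TavilySearchResults

      def is_tavily_configured() -> bool:
          return bool(os.environ.get("TAVILY_API_KEY"))

      def get_tavily_tool(max_results: int):
          # Returning None lets callers disable web fallback gracefully.
          if not is_tavily_configured():
              return None
          return TavilySearchResults(max_results=max_results)
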
      • backend/app/services/prompts/rag_prompt.py

        • Defines RAG system and user prompts.
        • build_rag_messages(chat_history, question, sources) builds LangChain messages that:
          • Use only supplied context.
          • Label context snippets as [1], [2], etc., and instruct the model to cite them inline.
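
A sketch of the message-building shape, assuming each source carries a text field and chat_history items are already LangChain messages; the actual prompt wording lives in prompts/rag_prompt.py:

      from langchain_core.messages import HumanMessage, SystemMessage

      SYSTEM_PROMPT = ("Answer using only the numbered context snippets. "
                       "Cite them inline as [1], [2], etc.")

      def build_rag_messages(chat_history: list, question: str, sources: list) -> list:
          context = "\n\n".join(
              f"[{i}] {src['text']}" for i, src in enumerate(sources, start=1)
          )
          messages = [SystemMessage(content=SYSTEM_PROMPT)]
          messages.extend(chat_history)
          messages.append(HumanMessage(content=f"Context:\n{context}\n\nQuestion: {question}"))
          return messages
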
      • backend/app/services/chat/graph.py

        • Implements the LangGraph ChatState and state graph with nodes:
          • normalize_input
          • retrieve_context
          • decide_next
          • web_search
          • generate_answer
          • format_response
        • Uses Pinecone search for retrieval and Tavily for optional web search.
        • Calls the Groq LLM via get_llm() with LangChain Runnable config (callbacks) so LangSmith traces are collected when enabled.
        • Records retrieve_ms, web_ms, and generate_ms in timings.
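
A condensed sketch of the graph wiring with stub nodes; here decide_next is shown as the conditional router between retrieval and generation, which may differ in detail from the real module:

      from typing import TypedDict
      from langgraph.graph import END, StateGraph

      class ChatState(TypedDict, total=False):
          query: str
          hits: list
          answer: str

      # Stubs standing in for the real node implementations in chat/graph.py.
      def normalize_input(state: ChatState) -> ChatState: return state
      def retrieve_context(state: ChatState) -> ChatState: return state
      def web_search(state: ChatState) -> ChatState: return state
      def generate_answer(state: ChatState) -> ChatState: return state
      def format_response(state: ChatState) -> ChatState: return state
      def decide_next(state: ChatState) -> str: return "generate_answer"

      graph = StateGraph(ChatState)
      for name, node in [("normalize_input", normalize_input),
                         ("retrieve_context", retrieve_context),
                         ("web_search", web_search),
                         ("generate_answer", generate_answer),
                         ("format_response", format_response)]:
          graph.add_node(name, node)
      graph.set_entry_point("normalize_input")
      graph.add_edge("normalize_input", "retrieve_context")
      graph.add_conditional_edges("retrieve_context", decide_next,
                                  {"web_search": "web_search",
                                   "generate_answer": "generate_answer"})
      graph.add_edge("web_search", "generate_answer")
      graph.add_edge("generate_answer", "format_response")
      graph.add_edge("format_response", END)
      chat_graph = graph.compile()
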
    • New core utility:

      • backend/app/core/tracing.py
        • is_tracing_enabled() checks LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY.
        • get_tracing_callbacks() returns a LangChainTracer callback list when enabled.
        • get_tracing_response_metadata() returns {langsmith_project, trace_enabled}.
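
A sketch of those helpers, assuming plain environment lookups:

      import os
      from langchain_core.tracers import LangChainTracer

      def is_tracing_enabled() -> bool:
          return (os.environ.get("LANGCHAIN_TRACING_V2", "").lower() == "true"
                  and bool(os.environ.get("LANGCHAIN_API_KEY")))

      def get_tracing_callbacks() -> list:
          return [LangChainTracer()] if is_tracing_enabled() else []

      def get_tracing_response_metadata() -> dict:
          return {"langsmith_project": os.environ.get("LANGCHAIN_PROJECT"),
                  "trace_enabled": is_tracing_enabled()}
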
    • Configuration changes:

      • backend/app/core/config.py adds:
        • GROQ_API_KEY, GROQ_BASE_URL, GROQ_MODEL.
        • TAVILY_API_KEY.
        • RAG_DEFAULT_TOP_K, RAG_MIN_SCORE, RAG_MAX_WEB_RESULTS.
      • backend/.env.example updated with the new env vars, including LangSmith options.
    • Error handling:

      • backend/app/core/errors.py introduces UpstreamServiceError.
      • Centralised handler converts UpstreamServiceError into HTTP 502 responses.
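
A minimal sketch of that boundary (the exception's constructor fields are assumptions):

      from fastapi import FastAPI, Request
      from starlette.responses import JSONResponse

      class UpstreamServiceError(Exception):
          def __init__(self, service: str, detail: str):
              self.service = service
              self.detail = detail

      def setup_exception_handlers(app: FastAPI) -> None:
          @app.exception_handler(UpstreamServiceError)
          async def handle_upstream(request: Request, exc: UpstreamServiceError) -> JSONResponse:
              # Surface upstream failures (Pinecone, Groq, Tavily) as HTTP 502.
              return JSONResponse(status_code=502,
                                  content={"detail": f"{exc.service}: {exc.detail}"})
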
    • Documentation and scripts:

      • backend/README.md updated with /chat and /chat/stream usage, env vars, and a local test checklist.
      • New scripts:
        • scripts/smoke_chat.py – uses /ingest/wiki and /chat for a local smoke test.
        • scripts/smoke_chat_web.py – tests /chat with use_web_fallback=true and a query that should trigger web search.
  • How to test

    1. Start the backend:
      cd backend
      uvicorn app.main:app --reload --port 8000
      
    2. Ingest some Wikipedia pages:
      python ../scripts/smoke_chat.py --backend-url http://localhost:8000 --namespace dev
      
    3. Test web fallback (requires TAVILY_API_KEY):
      python ../scripts/smoke_chat_web.py --backend-url http://localhost:8000 --namespace dev
      
    4. Verify LangSmith traces:
      • Set LANGCHAIN_TRACING_V2=true, LANGCHAIN_API_KEY, and optionally LANGCHAIN_PROJECT.
      • Run /chat again and confirm traces appear in LangSmith.

Known Issues / Limits

  • No local models

    • The backend intentionally does not host local embedding or LLM models.
    • All intelligence is delegated to Pinecone (integrated embeddings), Groq, and Tavily.
  • Retrieval quality depends on ingestion

    • The usefulness of /chat depends heavily on the quality and coverage of the ingested documents.
    • For some queries, even the best matching chunks may not be sufficient to answer without web fallback.
  • Best-effort web search

    • Tavily integration is optional and depends on the external Tavily API.
    • When Tavily is unavailable or misconfigured, the backend falls back to retrieval-only answers.
  • Simple SSE streaming

    • /chat/stream streams tokens derived from the final answer string rather than streaming directly from the LLM.
    • This keeps implementation simple while still providing a streaming interface.

Work Package C

Scope

  • Make the backend deploy-ready on Hugging Face Spaces using Docker.
  • Add a minimal Streamlit frontend suitable for Streamlit Community Cloud (no Docker).
  • Add production polish: basic API protection, rate limiting, caching, metrics, and a small benchmarking script.
  • Keep configuration sane by default, with environment variables as overrides rather than hard requirements.

Backend changes (HF Spaces deploy + runtime)

  • Docker / port behaviour
    • backend/Dockerfile now:
      • Exposes port 7860 (the default for many Hugging Face Spaces deployments).
      • Uses a shell-form CMD so PORT can be honoured when set:
        • uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-7860}
    • New helper: backend/app/core/runtime.py
      • get_port():
        • Reads PORT from the environment.
        • Defaults to 7860 when unset or invalid.
        • Logs: Starting on port=<port> hf_spaces_mode=<bool> using a simple heuristic (SPACE_ID / SPACE_REPO_ID env vars).
      • Called from app.main at import time so the log line is visible in container logs during startup.
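
A sketch of get_port() consistent with the behaviour above:

      import logging
      import os

      logger = logging.getLogger(__name__)

      def get_port(default: int = 7860) -> int:
          try:
              port = int(os.environ.get("PORT", ""))
          except ValueError:
              port = default  # unset or invalid PORT falls back to 7860
          hf_spaces_mode = bool(os.environ.get("SPACE_ID") or os.environ.get("SPACE_REPO_ID"))
          logger.info("Starting on port=%s hf_spaces_mode=%s", port, hf_spaces_mode)
          return port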

API key protection and CORS

  • API key protection

    • New module: backend/app/core/auth.py
      • Defines require_api_key FastAPI dependency using APIKeyHeader (X-API-Key).
      • validate_api_key_configuration() runs at startup and enforces:
        • In production-like environments (ENV=production or on Hugging Face Spaces via SPACE_ID / HF_HOME):
          • API_KEY must be set or the backend fails fast with a clear error.
        • In local development:
          • If API_KEY is missing, the backend runs open but logs a prominent warning.
      • require_api_key behaviour:
        • If API_KEY is not configured (dev mode), the dependency is a no-op.
        • If API_KEY is configured:
          • Missing or mismatched X-API-Key results in HTTP 403.
    • Wiring:
      • All routers except /health and /metrics are registered with dependencies=[Depends(require_api_key)].
      • Docs and OpenAPI endpoints are explicitly secured:
        • GET /openapi.json – returns app.openapi(), protected by require_api_key.
        • GET /docs – Swagger UI via get_swagger_ui_html, protected by require_api_key.
        • GET /redoc – ReDoc UI via get_redoc_html, protected by require_api_key.
      • Effect:
        • In HF Spaces / production:
          • /docs, /redoc, /openapi.json, /chat, /search, /documents/*, /ingest/* all require X-API-Key.
          • /health remains public for simple uptime checks, and /metrics is left public for monitoring (see the metrics section below).
        • In local dev with no API_KEY:
          • All endpoints (including docs) are accessible without a key for convenience.
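
A minimal sketch of the dependency (startup validation omitted):

      import os
      from typing import Optional

      from fastapi import HTTPException, Security
      from fastapi.security import APIKeyHeader

      api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

      async def require_api_key(provided: Optional[str] = Security(api_key_header)) -> None:
          expected = os.environ.get("API_KEY")
          if not expected:
              return  # dev mode: backend runs open (warning logged at startup)
          if provided != expected:
              raise HTTPException(status_code=403, detail="Invalid or missing API key")
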
  • CORS configuration

    • backend/app/core/security.py now focuses solely on CORS:
      • Reads ALLOWED_ORIGINS env var as a comma-separated list.
      • If unset or empty:
        • Defaults to ["*"] (permissive, useful for local dev and quick demos).
      • Applies FastAPI CORSMiddleware with:
        • allow_origins=origins
        • allow_methods=["*"]
        • allow_headers=["*"]
    • API key enforcement is handled entirely via core/auth.py and router/dependency wiring.
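
The CORS wiring, roughly:

      import os
      from fastapi import FastAPI
      from fastapi.middleware.cors import CORSMiddleware

      def configure_cors(app: FastAPI) -> None:
          raw = os.environ.get("ALLOWED_ORIGINS", "")
          origins = [o.strip() for o in raw.split(",") if o.strip()] or ["*"]
          app.add_middleware(CORSMiddleware,
                             allow_origins=origins,
                             allow_methods=["*"],
                             allow_headers=["*"])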

Rate limiting (SlowAPI)

  • New module: backend/app/core/rate_limit.py

    • Uses slowapi.Limiter with get_remote_address as the key function.
    • setup_rate_limiter(app):
      • Reads RATE_LIMIT_ENABLED from Settings (defaults to True).
      • If disabled:
        • Logs "Rate limiting is disabled via settings."
        • Does not attach middleware (decorators become no-ops at runtime).
      • If enabled:
        • Attaches SlowAPI middleware: app.middleware("http")(limiter.middleware).
        • Registers a custom RateLimitExceeded handler returning JSON:
          • HTTP 429
          • Body: {"detail": "Rate limit exceeded. Please slow down your requests.", "retry_after": ...} when available.
        • Logs violations with client IP and path.
  • Endpoint-specific limits (per IP):

    • /chat and /chat/stream:
      • Decorated with @limiter.limit("30/minute").
    • /ingest endpoints:
      • /ingest/arxiv, /ingest/openalex, /ingest/wiki:
        • @limiter.limit("10/minute").
    • /search:
      • @limiter.limit("60/minute").
  • Operational toggle:

    • New config flag in Settings:
      • RATE_LIMIT_ENABLED: bool = True
    • .env.example:
      • RATE_LIMIT_ENABLED=true (set to false to disable entirely).
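
A sketch using SlowAPI's stock middleware class; the project attaches the middleware slightly differently (see above), and SlowAPI requires decorated endpoints to accept a Request parameter:

      from fastapi import FastAPI, Request
      from slowapi import Limiter
      from slowapi.errors import RateLimitExceeded
      from slowapi.middleware import SlowAPIMiddleware
      from slowapi.util import get_remote_address
      from starlette.responses import JSONResponse

      limiter = Limiter(key_func=get_remote_address)

      def setup_rate_limiter(app: FastAPI) -> None:
          app.state.limiter = limiter
          app.add_middleware(SlowAPIMiddleware)

          @app.exception_handler(RateLimitExceeded)
          async def handle_429(request: Request, exc: RateLimitExceeded) -> JSONResponse:
              return JSONResponse(
                  status_code=429,
                  content={"detail": "Rate limit exceeded. Please slow down your requests."},
              )

      # In a router:
      # @router.post("/chat")
      # @limiter.limit("30/minute")
      # async def chat(request: Request, payload: ChatRequest): ...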

Caching (cachetools, in-memory)

  • New module: backend/app/core/cache.py

    • Uses cachetools.TTLCache with short in-memory TTLs (no external store):

      • Search cache:
        • TTL = 60s, maxsize = 1024.
        • Keys: (namespace, query, top_k, filters_json) where filters_json is a JSON-serialised, sorted representation of the filters dict.
      • Chat cache:
        • TTL = 60s, maxsize = 512.
        • Keys: (namespace, query, top_k, min_score, use_web_fallback).
        • Only used when no chat history is provided.
    • API:

      • cache_enabled() -> bool (reads CACHE_ENABLED from settings, default True).
      • get_search_cached(...) / set_search_cached(...).
      • get_chat_cached(...) / set_chat_cached(...).
      • get_cache_stats() returns hit/miss counters:
        • search_hits, search_misses, chat_hits, chat_misses.
    • Hit/miss logging:

      • Each cache lookup logs a hit or miss with namespace and query for observability.
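
A sketch of the search-side cache (the chat cache is analogous; hit/miss counters and logging omitted):

      import json
      from typing import Any, Optional

      from cachetools import TTLCache

      _search_cache: TTLCache = TTLCache(maxsize=1024, ttl=60)

      def _search_key(namespace: str, query: str, top_k: int, filters: Optional[dict]) -> tuple:
          # JSON-serialise filters with sorted keys so equal dicts produce equal keys.
          filters_json = json.dumps(filters or {}, sort_keys=True)
          return (namespace, query, top_k, filters_json)

      def get_search_cached(namespace: str, query: str, top_k: int,
                            filters: Optional[dict] = None) -> Optional[Any]:
          return _search_cache.get(_search_key(namespace, query, top_k, filters))

      def set_search_cached(namespace: str, query: str, top_k: int,
                            filters: Optional[dict], hits: Any) -> None:
          _search_cache[_search_key(namespace, query, top_k, filters)] = hits
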
  • Integration into endpoints:

    • /search (backend/app/routers/search.py):

      • On each request:
        1. Check get_search_cached(...).
        2. If hit: use cached hits_raw list.
        3. If miss: call Pinecone search and then set_search_cached(...).
      • Response construction (mapping text field to chunk_text) remains unchanged.
    • /chat (backend/app/routers/chat.py):

      • Caching is only considered when chat_history is empty and caching is enabled.
      • Flow:
        1. Check that cache_enabled() is true and payload.chat_history is empty.
        2. Attempt get_chat_cached(...).
        3. On hit:
          • Log and return the cached ChatResponse.
          • Still call record_chat_timings(...) so /metrics reflects cached responses.
        4. On miss:
          • Run the LangGraph pipeline as before.
          • Record timings via record_chat_timings(...).
          • Store the ChatResponse in the chat cache via set_chat_cached(...).
  • Operational toggle:

    • New config flag in Settings:
      • CACHE_ENABLED: bool = True
    • .env.example:
      • CACHE_ENABLED=true (set to false to fully disable caching).

Metrics and observability

  • New module: backend/app/core/metrics.py

    • In-memory metrics only, with a small footprint and no external dependencies beyond stdlib.

    • Tracks:

      • Request counts by path:
        • _request_counts[path] incremented for every request, via metrics_middleware.
      • Error counts by path:
        • _error_counts[path] incremented for any response with status_code >= 400 or for unhandled exceptions.
      • Chat timing metrics:
        • Focused on /chat and /chat/stream.
        • Expected fields:
          • retrieve_ms, web_ms, generate_ms, total_ms.
        • Stored in:
          • _timing_samples: deque(maxlen=20) for the last 20 samples.
          • _timing_sums and _timing_count for averages.
    • Middleware:

      • metrics_middleware(request, call_next):
        • Records per-path request and error counts.
        • Logs debug-level timing for each request.
    • API functions:

      • record_chat_timings(timings: Mapping[str, float]):
        • Updates sums, counts, and the ring buffer.
        • Called from both /chat and /chat/stream after timings are known.
      • get_metrics_snapshot():
        • Builds a snapshot dictionary containing:
          • requests_by_path
          • errors_by_path
          • timings:
            • average_ms for each timing field.
            • p50_ms and p95_ms based on the last 20 samples.
          • cache:
            • search_hits, search_misses, chat_hits, chat_misses from core.cache.
          • sample_count and samples (the last 20 timing entries).
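
A condensed sketch of the timing bookkeeping; for brevity it computes p50/p95 over total_ms only and omits request/error counting and the cache counters:

      from collections import Counter, deque
      from typing import Mapping

      _timing_samples: deque = deque(maxlen=20)  # last 20 chat timing dicts
      _timing_sums: Counter = Counter()
      _timing_count = 0

      def record_chat_timings(timings: Mapping[str, float]) -> None:
          global _timing_count
          _timing_samples.append(dict(timings))
          for field, value in timings.items():
              _timing_sums[field] += value
          _timing_count += 1

      def _percentile(values: list, pct: float) -> float:
          ordered = sorted(values)
          return ordered[min(int(len(ordered) * pct), len(ordered) - 1)]

      def get_metrics_snapshot() -> dict:
          samples = list(_timing_samples)
          totals = [s.get("total_ms", 0.0) for s in samples]
          averages = ({f: _timing_sums[f] / _timing_count for f in _timing_sums}
                      if _timing_count else {})
          return {"timings": {"average_ms": averages,
                              "p50_ms": _percentile(totals, 0.50) if totals else None,
                              "p95_ms": _percentile(totals, 0.95) if totals else None},
                  "sample_count": len(samples),
                  "samples": samples}
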
  • /metrics endpoint

    • New router: backend/app/routers/metrics.py
      • GET /metrics returns get_metrics_snapshot() as JSON.
    • Registered in app.main with tag ["metrics"].
    • Left public (not behind API key) to simplify monitoring and demos.
  • App wiring (backend/app/main.py)

    • After creating the FastAPI app:
      • configure_security(app) – CORS + optional API key.
      • setup_rate_limiter(app) – SlowAPI middleware when enabled.
      • setup_metrics(app) – metrics middleware.
    • Routers:
      • health, ingest, search, documents, chat, metrics all included.
    • Exception handlers:
      • Still configured via setup_exception_handlers(app).

Benchmarking script

  • New script: scripts/bench_local.py
    • Purpose:
      • Provide a simple, cross-platform (including Windows) asyncio load tester for the backend.
      • Focused on /chat, with optional /search benchmarking.
    • Implementation:
      • Uses httpx.AsyncClient and asyncio.
      • Command-line arguments:
        • --backend-url (default: http://localhost:8000)
        • --namespace (default: dev)
        • --concurrency (default: 10)
        • --requests (default: 50)
        • --include-search (optional flag to also benchmark /search)
        • --api-key (optional X-API-Key value)
      • For each benchmark:
        • Issues the specified number of requests with the provided concurrency.
        • Records per-request latency (ms) and whether an error occurred.
      • Outputs:
        • Total requests, successes, errors, and error rate.
        • Average latency.
        • p50 and p95 latencies.
    • Entrypoint:
      • python scripts/bench_local.py --backend-url http://localhost:8000 --namespace dev --concurrency 10 --requests 50
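
The core request loop amounts to something like this (argument parsing, error tallying, and stats reporting omitted; the query string is a placeholder):

      import asyncio
      import time
      from typing import Optional

      import httpx

      async def bench_chat(backend_url: str, namespace: str, n_requests: int,
                           concurrency: int, api_key: Optional[str] = None) -> list:
          """Fire n_requests POST /chat calls with bounded concurrency; return latencies in ms."""
          sem = asyncio.Semaphore(concurrency)
          headers = {"X-API-Key": api_key} if api_key else {}
          latencies: list = []

          async with httpx.AsyncClient(base_url=backend_url, timeout=60) as client:
              async def one() -> None:
                  async with sem:
                      start = time.perf_counter()
                      resp = await client.post("/chat", headers=headers,
                                               json={"query": "What is RAG?",
                                                     "namespace": namespace})
                      resp.raise_for_status()
                      latencies.append((time.perf_counter() - start) * 1000)

              await asyncio.gather(*(one() for _ in range(n_requests)))
          return latencies

      # asyncio.run(bench_chat("http://localhost:8000", "dev", 50, 10))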

Streamlit frontend (Streamlit Community Cloud)

  • New directory: frontend/

    • Main app: frontend/app.py
      • Dependencies:
        • streamlit
        • httpx
      • Backend configuration:
        • Reads BACKEND_BASE_URL from st.secrets["BACKEND_BASE_URL"] or the BACKEND_BASE_URL environment variable.
        • Reads API_KEY from st.secrets["API_KEY"] or the API_KEY environment variable.
      • Sidebar ("Backend" + settings):
        • Shows backend URL and API key status.
        • "Ping /health" button that calls the backend and shows the JSON response.
        • top_k slider, min_score slider, use_web_fallback checkbox.
        • "Show sources" toggle and "Clear chat" button.
        • "Recent uploads" section with quick actions:
          • For each recent upload, displays title, namespace, timestamp.
          • A "Search this document" button pre-fills the chat input with a prompt such as Summarize: <title>.
      • Chatbot UI:
        • Uses st.chat_message and st.chat_input with conversation stored in st.session_state.messages.
        • When the user sends a message:
          • Appends it to history and displays it.
          • Calls /chat/stream with X-API-Key (if available) and streams tokens into the UI (see the client sketch after this section).
          • If /chat/stream is unavailable (e.g. 404), falls back to /chat.
        • Assistant messages:
          • Display the answer text.
          • Optionally show sources in an expandable "Sources" section with titles, URLs, scores, and truncated snippets.
        • If API_KEY is not configured in secrets or environment:
          • The app warns and disables sending messages to the protected backend.
      • UI document upload:
        • A top-level "📄 Upload Document" button opens a @st.dialog modal.
        • Inside the dialog:
          • st.file_uploader for .pdf, .md, .txt, .docx, .pptx, .xlsx, .html, .htm.
          • Inputs for title (defaulting to filename), namespace, source label, tags, and notes.
          • A checkbox to allow uploading even when extracted text is very short.
          • On submit:
            • The frontend converts the file to text/markdown (using Docling when installed, or raw text for .md/.txt).
            • Calls backend POST /documents/upload-text with X-API-Key.
            • On success, records the upload in st.session_state.recent_uploads and triggers a rerun to close the dialog.
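
A sketch of the streaming client, assuming the SSE events carry token text in data: lines; the app falls back to plain POST /chat when the stream endpoint returns 404:

      from typing import Iterator, Optional

      import httpx

      def stream_chat(backend_url: str, api_key: Optional[str], payload: dict) -> Iterator[str]:
          headers = {"X-API-Key": api_key} if api_key else {}
          with httpx.stream("POST", f"{backend_url}/chat/stream",
                            headers=headers, json=payload, timeout=120) as resp:
              resp.raise_for_status()
              for line in resp.iter_lines():
                  if line.startswith("data: "):
                      yield line[len("data: "):]

      # In the app, the generator can be handed to st.write_stream(...) so tokens
      # render incrementally in the assistant chat bubble.
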
  • Root-level requirements.txt

    • Added to support Streamlit Community Cloud, where the root requirements file is used:
      • streamlit
      • httpx
    • Backend Docker image continues to use backend/requirements.txt, keeping the backend container small and independent.

Operational Runbook

Rotating keys and secrets

  • Backend (Hugging Face Spaces or other container hosts)

    • Update environment variables / secrets:
      • PINECONE_API_KEY, PINECONE_HOST, PINECONE_INDEX_NAME, PINECONE_NAMESPACE, PINECONE_TEXT_FIELD
      • GROQ_API_KEY, GROQ_BASE_URL, GROQ_MODEL
      • TAVILY_API_KEY
      • LANGCHAIN_API_KEY, LANGCHAIN_TRACING_V2, LANGCHAIN_PROJECT
      • API_KEY for HTTP clients
    • Redeploy or restart the Space to apply changes.
    • Verify:
      • GET /health returns status: ok.
      • /chat and /search work as expected.
      • /metrics shows traffic and cache counters updating.
  • Frontend (Streamlit Community Cloud)

    • Use Streamlit Secrets manager (no secrets in repo):
      • BACKEND_BASE_URL – full URL of the backend (e.g. HF Spaces URL).
      • API_KEY – must match backend API_KEY if API protection is enabled.
    • After rotating backend keys:
      • If API_KEY changed, update it in Streamlit secrets.
      • No code changes required.

Disabling rate limiting and caching

  • Rate limiting

    • Set RATE_LIMIT_ENABLED=false in the backend environment (or .env for local).
    • Restart the backend.
    • SlowAPI middleware will not be attached, so @limiter.limit(...) decorators are effectively no-ops for enforcement.
    • /metrics will still track request counts and errors.
  • Caching

    • Set CACHE_ENABLED=false in the backend environment.
    • Restart the backend.
    • Search and chat endpoints will bypass in-memory TTL caches entirely.
    • get_cache_stats() will still report counters, which will stop increasing.

Diagnosing common deployment issues

  • Symptom: 404 / connection errors on Hugging Face Spaces

    • Check:
      • The Space is configured as Docker and points to the backend/ subdirectory (or uses the provided backend/Dockerfile).
      • Logs show the startup message:
        • "Starting on port=... hf_spaces_mode=...".
      • HF Spaces sets PORT automatically; the Docker CMD will honour it.
    • Verify:
      • Open /docs and /health in the browser using the Space URL.
      • If 404/500 persists:
        • Ensure PINECONE_* and GROQ_API_KEY are set.
        • Check logs for PineconeIndexConfigError or missing LLM configuration.
  • Symptom: 403 Forbidden from frontend

    • Ensure:
      • Backend API_KEY is set and matches the API_KEY in Streamlit secrets.
      • Requests include X-API-Key header (Streamlit app does this automatically when API_KEY is present).
    • Confirm /health is still reachable without a key (by design).
  • Symptom: 429 Too Many Requests

    • Indicates SlowAPI rate limiting is active.
    • Options:
      • Reduce load (e.g. from bench_local.py).
      • Temporarily set RATE_LIMIT_ENABLED=false for heavy local testing.
    • Inspect /metrics:
      • Check request counts and error counts for affected paths.
  • Symptom: Stale results after ingestion

    • By default, caches are short-lived (60 seconds) but may briefly serve stale results:
      • When ingesting new documents, /search or /chat responses may not immediately reflect new content.
    • Workarounds:
      • Wait a minute for TTL expiry.
      • For strict freshness, disable caching with CACHE_ENABLED=false.
  • Symptom: Streamlit frontend cannot reach backend

    • Verify:
      • BACKEND_BASE_URL in Streamlit secrets is correct and publicly reachable.
      • CORS config on the backend:
        • For debugging, keep ALLOWED_ORIGINS unset (defaults to "*").
        • For locked-down deployment, ensure the Streamlit app origin is included.
    • Use the sidebar's connectivity check:
      • Click "Ping /health" and inspect the response or error message.