Space: 16bitSega/Agentic_RAG

Oleksii Obolonskyi committed on Jan 30

Commit 6f19c35 (1 parent: 123d866)

Persist FAISS indexes across restarts

Files changed (2)

README.md +19 -16
app.py +274 -102

README.md CHANGED

@@ -54,13 +54,12 @@ Set these environment variables (local dev or Hugging Face Spaces secrets):
 ```bash
 export HF_TOKEN=hf_your_token_here
-export RAG_HF_MODEL=HuggingFaceTB/SmolLM3-3B
-export RAG_HF_MODEL_FALLBACKS=HuggingFaceTB/SmolLM2-1.7B,HuggingFaceTB/SmolLM2-360M
-export RAG_HF_PROVIDER=hf-inference
-export RAG_LLM_BACKEND=hf
 ```
-Optional: set `RAG_HF_API_URL` for display/debug if you are using a custom endpoint.
 ### 3) Prepare sources
@@ -88,8 +87,8 @@ streamlit run app.py
 ```
 Open `http://localhost:8501`. On first run, the app builds FAISS indexes:
-- `data/normalized/index_books.faiss`
-- `data/normalized/index_articles.faiss`
 ## Configuration
@@ -98,16 +97,15 @@ You can override defaults via environment variables:
 ```bash
 export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
 export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
-export RAG_BOOK_INDEX_PATH=data/normalized/index_books.faiss
-export RAG_ARTICLE_INDEX_PATH=data/normalized/index_articles.faiss
 export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
 export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
 export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
 export HF_TOKEN=hf_your_token_here
-export RAG_HF_PROVIDER=hf-inference
-export RAG_HF_MODEL=HuggingFaceTB/SmolLM3-3B
-export RAG_HF_MODEL_FALLBACKS=HuggingFaceTB/SmolLM2-1.7B,HuggingFaceTB/SmolLM2-360M
-export RAG_LLM_BACKEND=hf
 export RAG_MAX_CONTEXT_TOKENS=6000
 export RAG_INJECT_MAX_CHUNKS=6
 export RAG_MAX_GENERATION_TOKENS=512
@@ -119,9 +117,14 @@ export RAG_ARTICLE_SOURCES=sources_articles.json
 ## Deploy to Hugging Face Spaces
 1. Create a new Space (Streamlit SDK) and push this repo.
-2. In Space Settings → Secrets, set `HF_TOKEN` (required) and optionally `GITHUB_TOKEN`.
-3. In Space Settings → Variables, set `RAG_HF_MODEL`, `RAG_LLM_BACKEND=hf`, and `RAG_HF_PROVIDER`.
-4. Optional: `RAG_HF_MODEL_FALLBACKS`, `RAG_INJECT_MAX_CHUNKS`, and `RAG_RETRIEVE_TOPK_MULT`.
 ## Common maintenance tasks

 ```bash
 export HF_TOKEN=hf_your_token_here
+export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
+export RAG_HF_PROVIDER_SUFFIX=featherless-ai
+export RAG_LLM_BACKEND=hf-router
 ```
+Optional: set `RAG_HF_PROVIDER_SUFFIX` if your model id is missing the provider suffix.
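
The suffix rule described above can be sketched as a small pure function. It mirrors the logic of `get_effective_hf_model` in the `app.py` part of this commit, but the name `effective_model_id` and its two parameters are illustrative stand-ins for the module-level `HF_MODEL_RAW` and `HF_MODEL_SUFFIX` values, not part of the app.

```python
def effective_model_id(model_raw: str, provider_suffix: str = "") -> str:
    """Append ':<provider>' only when the model id carries no suffix yet.

    Hypothetical helper for illustration; the app reads both values
    from RAG_HF_MODEL and RAG_HF_PROVIDER_SUFFIX instead.
    """
    model_raw = model_raw.strip()
    provider_suffix = provider_suffix.strip()
    if provider_suffix and ":" not in model_raw:
        return f"{model_raw}:{provider_suffix}"
    # Either no suffix was configured or the id already names a provider.
    return model_raw


# Bare id plus suffix: the suffix is appended.
print(effective_model_id("Qwen/Qwen2.5-7B-Instruct-1M", "featherless-ai"))
# Id that already has a provider: left untouched.
print(effective_model_id("Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai", "featherless-ai"))
```

Because an id that already contains `:` is returned unchanged, setting both `RAG_HF_MODEL=...:featherless-ai` and `RAG_HF_PROVIDER_SUFFIX=featherless-ai`, as the example environment does, should not double the suffix.
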
 ### 3) Prepare sources
 ```
 Open `http://localhost:8501`. On first run, the app builds FAISS indexes:
+- `data/cache/index_books.faiss` (local)
+- `data/cache/index_articles.faiss` (local)
 ## Configuration
 ```bash
 export RAG_BOOK_CHUNKS_PATH=data/normalized/chunks_books.jsonl
 export RAG_ARTICLE_CHUNKS_PATH=data/normalized/chunks_articles.jsonl
+export RAG_BOOK_INDEX_PATH=data/cache/index_books.faiss
+export RAG_ARTICLE_INDEX_PATH=data/cache/index_articles.faiss
 export RAG_BOOK_MANIFEST_PATH=data/normalized/manifest_books.json
 export RAG_ARTICLE_MANIFEST_PATH=data/normalized/manifest_articles.json
 export RAG_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
 export HF_TOKEN=hf_your_token_here
+export RAG_HF_MODEL=Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai
+export RAG_HF_PROVIDER_SUFFIX=featherless-ai
+export RAG_LLM_BACKEND=hf-router
 export RAG_MAX_CONTEXT_TOKENS=6000
 export RAG_INJECT_MAX_CHUNKS=6
 export RAG_MAX_GENERATION_TOKENS=512
 ## Deploy to Hugging Face Spaces
 1. Create a new Space (Streamlit SDK) and push this repo.
+2. Enable Persistent Storage and set caches:
+   - `HF_HOME=/data/.huggingface`
+   - `SENTENCE_TRANSFORMERS_HOME=/data/.sentence-transformers`
+3. In Space Settings → Secrets, set `HF_TOKEN` (required) and optionally `GITHUB_TOKEN`.
+4. In Space Settings → Variables, set `RAG_HF_MODEL` and `RAG_LLM_BACKEND=hf-router`.
+5. Optional: `RAG_HF_PROVIDER_SUFFIX`, `RAG_INJECT_MAX_CHUNKS`, and `RAG_RETRIEVE_TOPK_MULT`.
+With persistent storage enabled, FAISS indexes are stored in `/data/rag_cache` and reused across restarts. They rebuild only when the normalized chunk/manifest files change.
 ## Common maintenance tasks
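
The "rebuild only when inputs change" behaviour described in the README can be sketched without FAISS as a fingerprint stored in a sidecar `.meta.json` file. This is a simplified illustration under stated assumptions, not the app's code: `combined_fingerprint`, `load_or_build`, and `demo` are hypothetical names, the index is plain bytes, and the whole file is hashed, whereas `file_fingerprint` in this commit hashes size, mtime, and the first and last 1 MiB. In the commit, the fingerprint also covers the embedding model and index parameters, so changing those appears to trigger a rebuild as well.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def file_fingerprint(path: str) -> Optional[str]:
    """SHA-256 of the file's bytes, or None if the file is missing."""
    try:
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()
    except FileNotFoundError:
        return None


def combined_fingerprint(embed_model: str, chunks_path: str, manifest_path: str) -> str:
    """Hash everything the cached index depends on into one string."""
    payload = {
        "embed_model": embed_model,
        "chunks_fp": file_fingerprint(chunks_path),
        "manifest_fp": file_fingerprint(manifest_path),
    }
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def load_or_build(index_path: str, fingerprint: str,
                  build: Callable[[], bytes]) -> Tuple[bytes, bool]:
    """Return (index_bytes, rebuilt); reuse the cache when fingerprints match."""
    index = Path(index_path)
    meta = Path(index_path + ".meta.json")
    if index.exists() and index.stat().st_size > 0 and meta.exists():
        try:
            stored = json.loads(meta.read_text(encoding="utf-8"))
        except ValueError:
            stored = {}  # unreadable metadata is treated as a cache miss
        if stored.get("fingerprint") == fingerprint:
            return index.read_bytes(), False
    data = build()
    index.parent.mkdir(parents=True, exist_ok=True)
    index.write_bytes(data)
    # Write metadata to a temp file, then rename, so readers never see a partial file.
    tmp = Path(str(meta) + ".tmp")
    tmp.write_text(json.dumps({"fingerprint": fingerprint}), encoding="utf-8")
    os.replace(tmp, meta)
    return data, True


def demo() -> list:
    """Build once, reuse once, then rebuild after the chunks file changes."""
    with tempfile.TemporaryDirectory() as d:
        chunks = os.path.join(d, "chunks.jsonl")
        manifest = os.path.join(d, "manifest.json")
        index = os.path.join(d, "cache", "index.bin")
        Path(chunks).write_text('{"text": "a"}\n', encoding="utf-8")
        Path(manifest).write_text("{}", encoding="utf-8")
        rebuilt = []
        for _ in range(2):
            fp = combined_fingerprint("some-model", chunks, manifest)
            rebuilt.append(load_or_build(index, fp, lambda: b"index-bytes")[1])
        Path(chunks).write_text('{"text": "b"}\n', encoding="utf-8")
        fp = combined_fingerprint("some-model", chunks, manifest)
        rebuilt.append(load_or_build(index, fp, lambda: b"index-bytes")[1])
        return rebuilt


print(demo())
```

The first call builds, the second reuses the cached bytes, and the third rebuilds because the chunks file changed. Hashing full content is slower than the head-and-tail approach in the commit but does not depend on file modification times.
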

app.py CHANGED Viewed

@@ -1,6 +1,7 @@
 import os
 import re
 import json
 import html
 from dataclasses import dataclass
 from pathlib import Path
@@ -19,6 +20,16 @@ from sentence_transformers import SentenceTransformer
 load_dotenv(Path(__file__).resolve().parent / ".env", override=True)
 COMPANY_NAME = "O_O.inc"
 COMPANY_EMAIL = "o.obolonsky@proton.me"
 COMPANY_PHONE = "+380953555919"
@@ -49,8 +60,8 @@ CONFIG = AppConfig(
     article_chunks_path=os.environ.get("RAG_ARTICLE_CHUNKS_PATH", "data/normalized/chunks_articles.jsonl"),
     book_manifest_path=os.environ.get("RAG_BOOK_MANIFEST_PATH", "data/normalized/manifest_books.json"),
     article_manifest_path=os.environ.get("RAG_ARTICLE_MANIFEST_PATH", "data/normalized/manifest_articles.json"),
-    book_index_path=os.environ.get("RAG_BOOK_INDEX_PATH", "data/normalized/index_books.faiss"),
-    article_index_path=os.environ.get("RAG_ARTICLE_INDEX_PATH", "data/normalized/index_articles.faiss"),
     embed_model=os.environ.get("RAG_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
     max_context_tokens=int(os.getenv("RAG_MAX_CONTEXT_TOKENS", "6000")),
     inject_max_chunks=int(os.getenv("RAG_INJECT_MAX_CHUNKS", os.getenv("RAG_MAX_CHUNKS", "6"))),
@@ -70,6 +81,8 @@ BOOK_MANIFEST_PATH = CONFIG.book_manifest_path
 ARTICLE_MANIFEST_PATH = CONFIG.article_manifest_path
 BOOK_INDEX_PATH = CONFIG.book_index_path
 ARTICLE_INDEX_PATH = CONFIG.article_index_path
 EMBED_MODEL = CONFIG.embed_model
 MAX_CONTEXT_TOKENS = CONFIG.max_context_tokens
 INJECT_MAX_CHUNKS = CONFIG.inject_max_chunks
@@ -82,8 +95,10 @@ PER_DOC_CAP = CONFIG.per_doc_cap
 OVERLAP_FILTER = CONFIG.overlap_filter
 RETRIEVE_TOPK_MULT = CONFIG.retrieve_topk_mult
 HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
-HF_MODEL = os.getenv("RAG_HF_MODEL", "Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai").strip()
 OLLAMA_BASE_URL = os.environ.get("RAG_OLLAMA_URL", "http://localhost:11434").rstrip("/")
 OLLAMA_MODEL = os.environ.get("RAG_OLLAMA_MODEL", "llama3.2:1b")
@@ -330,23 +345,99 @@ def build_faiss_index(vectors: np.ndarray) -> faiss.Index:
     index.add(vectors)
     return index
 def load_or_build_index(
     chunks: List[Chunk],
     embedder: SentenceTransformer,
     index_path: str,
-    source_path: Optional[str] = None,
-) -> faiss.Index:
     p = Path(index_path)
-    src = Path(source_path) if source_path else None
-    if p.exists() and (not src or not src.exists() or p.stat().st_mtime >= src.stat().st_mtime):
-        return faiss.read_index(str(p))
     texts = [c.text for c in chunks]
-    vecs = embedder.encode(texts, batch_size=32, show_progress_bar=True, normalize_embeddings=True)
     vecs = np.asarray(vecs, dtype="float32")
     index = build_faiss_index(vecs)
     p.parent.mkdir(parents=True, exist_ok=True)
     faiss.write_index(index, str(p))
-    return index
 def retrieve(query: str, embedder: SentenceTransformer, index: faiss.Index, chunks: List[Chunk], k: int = 8) -> List[Tuple[float, Chunk]]:
     qv = embedder.encode([query], normalize_embeddings=True)
@@ -594,9 +685,16 @@ def answer_question(
         "chunks_cap": INJECT_MAX_CHUNKS,
         "context_cap": MAX_CONTEXT_TOKENS,
     }
-    answer, err = llm_chat(prompt)
     if err:
-        st.error(err)
         return f"Model error: {err}", citations, False
     if not answer:
         st.error("Empty response from model")
@@ -610,6 +708,34 @@ def system_message() -> str:
         "Keep answers concise. Cite sources using the provided citation tags exactly."
     )
 def is_running_on_spaces() -> bool:
     if os.environ.get("HF_SPACE_ID") or os.environ.get("SPACE_ID"):
         return True
@@ -617,28 +743,27 @@ def is_running_on_spaces() -> bool:
 @st.cache_resource(show_spinner=False)
 def get_hf_router_client() -> OpenAI:
-    return OpenAI(
-        base_url="https://router.huggingface.co/v1",
-        api_key=HF_TOKEN,
-    )
-def hf_chat(prompt: str) -> Tuple[str, Optional[str]]:
-    if not HF_TOKEN:
-        return "", "Missing HF_TOKEN (or HUGGINGFACEHUB_API_TOKEN)"
     try:
         client = get_hf_router_client()
         completion = client.chat.completions.create(
-            model=HF_MODEL,
             messages=[
-                {"role": "system", "content": "You are a helpful assistant."},
                 {"role": "user", "content": prompt},
             ],
             max_tokens=MAX_GENERATION_TOKENS,
             temperature=0.2,
         )
-        return (completion.choices[0].message.content or "").strip(), None
     except Exception as e:
-        return "", str(e)
 def ollama_chat(prompt: str, timeout: Tuple[int, int] = (10, 600)) -> Tuple[str, Optional[str]]:
     url = f"{OLLAMA_BASE_URL}/api/chat"
@@ -660,7 +785,7 @@ def ollama_chat(prompt: str, timeout: Tuple[int, int] = (10, 600)) -> Tuple[str,
     except Exception as e:
         return "", str(e)
-def llm_chat(prompt: str, timeout: Tuple[int, int] = (10, 600)) -> Tuple[str, Optional[str]]:
     """
     Routes generation to HF if configured; otherwise falls back to Ollama.
     Prefer explicit env var if you want:
@@ -669,14 +794,16 @@ def llm_chat(prompt: str, timeout: Tuple[int, int] = (10, 600)) -> Tuple[str, Op
     backend = (os.environ.get("RAG_LLM_BACKEND", "") or "").strip().lower()
     if backend == "hf-router":
-        return hf_chat(prompt)
     if backend == "ollama":
-        return ollama_chat(prompt)
     if is_running_on_spaces():
-        return hf_chat(prompt)
     if (HF_TOKEN or "").strip():
-        return hf_chat(prompt)
-    return ollama_chat(prompt)
 def github_create_issue(title: str, body: str, labels: Optional[List[str]] = None) -> Tuple[Optional[int], Optional[str]]:
     global _GITHUB_TOKEN_LOGGED
@@ -746,39 +873,6 @@ button[aria-label^="MCP •"]::before{content:"MCP";position:absolute;left:0.6re
 if "is_thinking" not in st.session_state:
     st.session_state["is_thinking"] = False
-with st.sidebar:
-    st.markdown(f"**Company:** {COMPANY_NAME}")
-    st.markdown(f"**Contact:** {COMPANY_EMAIL} · {COMPANY_PHONE}")
-    st.caption(COMPANY_ABOUT)
-    st.write("")
-    st.subheader("Support")
-    st.caption("If an answer is not found in the dataset, you can create a support ticket (GitHub issue).")
-    st.session_state.setdefault("open_ticket_ui", False)
-    if st.button("Open ticket form", use_container_width=True, disabled=st.session_state["is_thinking"]):
-        st.session_state["open_ticket_ui"] = True
-    st.write("")
-    st.subheader("LLM")
-    st.markdown(f"- Active model: `{HF_MODEL}`")
-    st.write("")
-    st.subheader("Embedding model (retrieval)")
-    st.code(EMBED_MODEL)
-    st.write("")
-    st.subheader("Retrieval settings")
-    st.caption(f"book_k={BOOK_K}, article_k={ARTICLE_K}, per_doc_cap={PER_DOC_CAP}, overlap_filter={OVERLAP_FILTER}")
-    st.markdown("### Dataset Stats")
-    ts = st.session_state.get("token_stats")
-    if ts:
-        st.markdown("**Token Consumption (est.)**")
-        st.markdown(f"- Context tokens: `{ts['context_tokens']}` / `{ts['context_cap']}`")
-        st.markdown(f"- Chunks used: `{ts['chunks_used']}` / `{ts['chunks_cap']}`")
-        st.markdown(f"- Prompt tokens: `{ts['prompt_tokens']}`")
-        st.markdown(f"- Generation tokens (max): `{ts['generation_tokens']}`")
-        st.markdown(f"- **Total per request (est.):** `{ts['total_tokens']}`")
-        if ts["context_tokens"] >= int(0.9 * ts["context_cap"]):
-            st.warning("Context near token limit; answers may truncate.")
-    else:
-        st.markdown("_Ask a question to see token usage._")
 @st.cache_data(show_spinner=False)
 def load_dataset(path: str) -> List[Chunk]:
     return read_chunks_jsonl(path)
@@ -811,8 +905,115 @@ doc_index = merge_doc_indexes(book_doc_index, article_doc_index)
 book_stats = compute_stats(book_chunks, book_manifest, book_doc_index)
 article_stats = compute_stats(article_chunks, article_manifest, article_doc_index)
 embedder = load_embedder(EMBED_MODEL)
-book_index = load_or_build_index(book_chunks, embedder, BOOK_INDEX_PATH, BOOK_CHUNKS_PATH)
-article_index = load_or_build_index(article_chunks, embedder, ARTICLE_INDEX_PATH, ARTICLE_CHUNKS_PATH)
 if "chat" not in st.session_state:
     st.session_state["chat"] = []
@@ -854,42 +1055,6 @@ def parse_generated_questions(text: str) -> List[str]:
             break
     return cleaned
-with st.sidebar:
-    st.write("")
-    st.markdown("**Books + MCP**")
-    st.write(f"Chunk length: min {book_stats['length_min']}, median {book_stats['length_median']}, max {book_stats['length_max']}")
-    st.write("")
-    st.markdown("**Articles**")
-    st.write(f"Chunk length: min {article_stats['length_min']}, median {article_stats['length_median']}, max {article_stats['length_max']}")
-    st.write("")
-    st.markdown("**By type (inferred)**")
-    for k in ["book", "mcp", "article"]:
-        total = 0
-        if k in book_stats["type_counts"]:
-            total += book_stats["type_counts"][k]
-        if k in article_stats["type_counts"]:
-            total += article_stats["type_counts"][k]
-        if total:
-            st.write(f"{k}: {total}")
-    st.write("")
-    st.session_state.setdefault("show_sources", False)
-    st.markdown('<div class="stacked-control sources-btn">', unsafe_allow_html=True)
-    if st.button("Sources (click to expand the list)", use_container_width=True, disabled=st.session_state["is_thinking"]):
-        st.session_state["show_sources"] = not st.session_state["show_sources"]
-    st.markdown("</div>", unsafe_allow_html=True)
-    if st.session_state["show_sources"]:
-        if book_stats["mcp_docs_count"]:
-            mcp_line = f"MCP: {book_stats['mcp_docs_count']} docs"
-            if book_stats["mcp_blocks_total"]:
-                mcp_line += f", {book_stats['mcp_blocks_total']} blocks"
-            st.write(mcp_line)
-        for line in book_stats["sources_lines"]:
-            st.write(line)
-        if article_stats["sources_lines"]:
-            st.write("")
-            st.markdown("**Article sources**")
-            for line in article_stats["sources_lines"]:
-                st.write(line)
 def run_enhance(question: str, enhanced_key: str):
     if not question or not enhanced_key:
@@ -925,9 +1090,16 @@ def run_regen():
         "chunks_cap": INJECT_MAX_CHUNKS,
         "context_cap": MAX_CONTEXT_TOKENS,
     }
-    text, err = llm_chat(gen_prompt)
     if err:
-        st.error(err)
         st.warning(f"LLM request failed: {err}")
         return
     if not text:

 import os
 import re
 import json
+import hashlib
 import html
 from dataclasses import dataclass
 from pathlib import Path
 load_dotenv(Path(__file__).resolve().parent / ".env", override=True)
+def get_persist_dir() -> str:
+    if os.path.isdir("/data") and os.access("/data", os.W_OK):
+        p = "/data/rag_cache"
+    else:
+        p = "data/cache"
+    os.makedirs(p, exist_ok=True)
+    return p
+PERSIST_DIR = get_persist_dir()
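
A testable variant of the directory selection above, offered as a sketch: `pick_persist_dir` and its parameters are hypothetical, introduced only so the same check can be exercised against a temporary directory instead of the real `/data` mount.

```python
import os
import tempfile


def pick_persist_dir(persistent_root: str = "/data", fallback: str = "data/cache") -> str:
    """Prefer <persistent_root>/rag_cache when the root exists and is writable."""
    if os.path.isdir(persistent_root) and os.access(persistent_root, os.W_OK):
        path = os.path.join(persistent_root, "rag_cache")
    else:
        # No persistent volume mounted (for example, local development).
        path = fallback
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    # An existing, writable root is used for the cache.
    print(pick_persist_dir(root, os.path.join(root, "fallback")))
    # A missing root falls back to the local cache directory.
    print(pick_persist_dir(os.path.join(root, "missing"), os.path.join(root, "fallback")))
```
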
 COMPANY_NAME = "O_O.inc"
 COMPANY_EMAIL = "o.obolonsky@proton.me"
 COMPANY_PHONE = "+380953555919"
     article_chunks_path=os.environ.get("RAG_ARTICLE_CHUNKS_PATH", "data/normalized/chunks_articles.jsonl"),
     book_manifest_path=os.environ.get("RAG_BOOK_MANIFEST_PATH", "data/normalized/manifest_books.json"),
     article_manifest_path=os.environ.get("RAG_ARTICLE_MANIFEST_PATH", "data/normalized/manifest_articles.json"),
+    book_index_path=os.environ.get("RAG_BOOK_INDEX_PATH", os.path.join(PERSIST_DIR, "index_books.faiss")),
+    article_index_path=os.environ.get("RAG_ARTICLE_INDEX_PATH", os.path.join(PERSIST_DIR, "index_articles.faiss")),
     embed_model=os.environ.get("RAG_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
     max_context_tokens=int(os.getenv("RAG_MAX_CONTEXT_TOKENS", "6000")),
     inject_max_chunks=int(os.getenv("RAG_INJECT_MAX_CHUNKS", os.getenv("RAG_MAX_CHUNKS", "6"))),
 ARTICLE_MANIFEST_PATH = CONFIG.article_manifest_path
 BOOK_INDEX_PATH = CONFIG.book_index_path
 ARTICLE_INDEX_PATH = CONFIG.article_index_path
+BOOK_META_PATH = BOOK_INDEX_PATH + ".meta.json"
+ARTICLE_META_PATH = ARTICLE_INDEX_PATH + ".meta.json"
 EMBED_MODEL = CONFIG.embed_model
 MAX_CONTEXT_TOKENS = CONFIG.max_context_tokens
 INJECT_MAX_CHUNKS = CONFIG.inject_max_chunks
 OVERLAP_FILTER = CONFIG.overlap_filter
 RETRIEVE_TOPK_MULT = CONFIG.retrieve_topk_mult
+HF_BASE_URL = "https://router.huggingface.co/v1"
+HF_MODEL_RAW = os.getenv("RAG_HF_MODEL", "Qwen/Qwen2.5-7B-Instruct-1M").strip()
+HF_MODEL_SUFFIX = os.getenv("RAG_HF_PROVIDER_SUFFIX", "").strip()
 HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
 OLLAMA_BASE_URL = os.environ.get("RAG_OLLAMA_URL", "http://localhost:11434").rstrip("/")
 OLLAMA_MODEL = os.environ.get("RAG_OLLAMA_MODEL", "llama3.2:1b")
     index.add(vectors)
     return index
+def file_fingerprint(path: str) -> Optional[str]:
+    try:
+        stinfo = os.stat(path)
+    except FileNotFoundError:
+        return None
+    h = hashlib.sha256()
+    h.update(f"{stinfo.st_size}:{int(stinfo.st_mtime)}".encode("utf-8"))
+    try:
+        with open(path, "rb") as f:
+            head = f.read(1024 * 1024)
+            h.update(head)
+            if stinfo.st_size > 1024 * 1024:
+                f.seek(max(0, stinfo.st_size - 1024 * 1024))
+                tail = f.read(1024 * 1024)
+                h.update(tail)
+    except OSError:
+        return None
+    return h.hexdigest()
+def compute_fingerprint(kind: str, embed_model: str, chunks_path: str, manifest_path: str, params: Dict) -> str:
+    payload = {
+        "kind": kind,
+        "embed_model": embed_model,
+        "chunks_fp": file_fingerprint(chunks_path),
+        "manifest_fp": file_fingerprint(manifest_path),
+        "params": params,
+    }
+    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
+    return hashlib.sha256(raw).hexdigest()
+def load_meta(path: str) -> Dict:
+    if not Path(path).exists():
+        return {}
+    try:
+        return json.loads(Path(path).read_text(encoding="utf-8"))
+    except Exception:
+        return {}
+def save_meta(path: str, meta: Dict) -> None:
+    tmp = f"{path}.tmp"
+    Path(tmp).write_text(json.dumps(meta, indent=2, sort_keys=True), encoding="utf-8")
+    os.replace(tmp, path)
 def load_or_build_index(
+    kind: str,
     chunks: List[Chunk],
     embedder: SentenceTransformer,
+    chunks_path: str,
+    manifest_path: str,
     index_path: str,
+    meta_path: str,
+    *,
+    params: Optional[Dict] = None,
+    fingerprint: Optional[str] = None,
+) -> Tuple[faiss.Index, Dict]:
     p = Path(index_path)
+    if params is None:
+        params = {
+            "normalize_embeddings": True,
+            "dim": getattr(embedder, "get_sentence_embedding_dimension", lambda: None)(),
+            "engine": "faiss",
+        }
+    if fingerprint is None:
+        fingerprint = compute_fingerprint(kind, EMBED_MODEL, chunks_path, manifest_path, params)
+    if p.exists() and p.stat().st_size > 0 and Path(meta_path).exists():
+        meta = load_meta(meta_path)
+        if meta.get("fingerprint") == fingerprint:
+            return faiss.read_index(str(p)), meta
     texts = [c.text for c in chunks]
+    show_progress = os.getenv("RAG_SHOW_EMBED_PROGRESS", "0") == "1"
+    with st.spinner(f"Building {kind} retrieval index (first run or dataset changed)..."):
+        vecs = embedder.encode(
+            texts,
+            batch_size=32,
+            show_progress_bar=show_progress,
+            normalize_embeddings=True,
+        )
     vecs = np.asarray(vecs, dtype="float32")
     index = build_faiss_index(vecs)
     p.parent.mkdir(parents=True, exist_ok=True)
     faiss.write_index(index, str(p))
+    meta = {
+        "fingerprint": fingerprint,
+        "kind": kind,
+        "embed_model": EMBED_MODEL,
+        "chunks_path": chunks_path,
+        "manifest_path": manifest_path,
+        "params": params,
+        "built_at": datetime.now(timezone.utc).isoformat(),
+    }
+    save_meta(meta_path, meta)
+    return index, meta
 def retrieve(query: str, embedder: SentenceTransformer, index: faiss.Index, chunks: List[Chunk], k: int = 8) -> List[Tuple[float, Chunk]]:
     qv = embedder.encode([query], normalize_embeddings=True)
         "chunks_cap": INJECT_MAX_CHUNKS,
         "context_cap": MAX_CONTEXT_TOKENS,
     }
+    answer, err, meta = llm_chat(prompt)
+    if meta and meta.get("model"):
+        st.session_state["active_model"] = meta["model"]
     if err:
+        if is_model_not_supported(err):
+            render_model_recommendations()
+            with st.expander("Model error details"):
+                st.code(err)
+        else:
+            st.error(err)
         return f"Model error: {err}", citations, False
     if not answer:
         st.error("Empty response from model")
         "Keep answers concise. Cite sources using the provided citation tags exactly."
     )
+def get_effective_hf_model() -> str:
+    if HF_MODEL_SUFFIX and ":" not in HF_MODEL_RAW:
+        return f"{HF_MODEL_RAW}:{HF_MODEL_SUFFIX}"
+    return HF_MODEL_RAW
+RECOMMENDED_MODELS = [
+    "Qwen/Qwen2.5-7B-Instruct-1M:featherless-ai",
+    "Qwen/Qwen2.5-7B-Instruct:featherless-ai",
+    "mistralai/Mistral-7B-Instruct-v0.3",
+    "HuggingFaceTB/SmolLM3-3B",
+    "google/gemma-2-9b-it",
+]
+def is_model_not_supported(err: str) -> bool:
+    s = (err or "").lower()
+    return "model_not_supported" in s or "not supported by any provider you have enabled" in s
+def render_model_recommendations() -> None:
+    st.error("HF Router: model is not supported by your enabled providers.")
+    st.markdown("**Fix options:**")
+    st.markdown("- Use the provider-suffixed model id shown on the model page (e.g. `...:featherless-ai`).")
+    st.markdown("- Or enable additional Inference Providers in your HF account settings.")
+    st.markdown("- Or switch to a model that is served by a provider you have enabled.")
+    st.markdown("**Try one of these model IDs:**")
+    for mid in RECOMMENDED_MODELS:
+        st.code(mid)
+    st.markdown("Set `RAG_HF_MODEL` to one of the above, or set `RAG_HF_PROVIDER_SUFFIX=featherless-ai` for Qwen.")
 def is_running_on_spaces() -> bool:
     if os.environ.get("HF_SPACE_ID") or os.environ.get("SPACE_ID"):
         return True
 @st.cache_resource(show_spinner=False)
 def get_hf_router_client() -> OpenAI:
+    token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+    if not token:
+        raise RuntimeError("HF_TOKEN is not set. Add it as a Hugging Face Secret.")
+    return OpenAI(base_url=HF_BASE_URL, api_key=token)
+def hf_router_chat(prompt: str) -> Tuple[str, Optional[str], Optional[Dict[str, str]]]:
+    model_id = get_effective_hf_model()
     try:
         client = get_hf_router_client()
         completion = client.chat.completions.create(
+            model=model_id,
             messages=[
+                {"role": "system", "content": "You are a helpful assistant. Follow the instructions and use provided context only when required."},
                 {"role": "user", "content": prompt},
             ],
             max_tokens=MAX_GENERATION_TOKENS,
             temperature=0.2,
         )
+        return (completion.choices[0].message.content or "").strip(), None, {"model": model_id}
     except Exception as e:
+        return "", str(e), {"model": model_id}
 def ollama_chat(prompt: str, timeout: Tuple[int, int] = (10, 600)) -> Tuple[str, Optional[str]]:
     url = f"{OLLAMA_BASE_URL}/api/chat"
     except Exception as e:
         return "", str(e)
+def llm_chat(prompt: str, timeout: Tuple[int, int] = (10, 600)) -> Tuple[str, Optional[str], Optional[Dict[str, str]]]:
     """
     Routes generation to HF if configured; otherwise falls back to Ollama.
     Prefer explicit env var if you want:
     backend = (os.environ.get("RAG_LLM_BACKEND", "") or "").strip().lower()
     if backend == "hf-router":
+        return hf_router_chat(prompt)
     if backend == "ollama":
+        text, err = ollama_chat(prompt)
+        return text, err, None
     if is_running_on_spaces():
+        return hf_router_chat(prompt)
     if (HF_TOKEN or "").strip():
+        return hf_router_chat(prompt)
+    text, err = ollama_chat(prompt)
+    return text, err, None
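
The routing order in `llm_chat` can be read as a pure function of the environment. The sketch below is an illustration only: `choose_backend` is a hypothetical name, and the Spaces check covers just the two variables visible in this diff (`HF_SPACE_ID`, `SPACE_ID`), while the real `is_running_on_spaces` may test more.

```python
from typing import Mapping


def choose_backend(env: Mapping[str, str]) -> str:
    """Return 'hf-router' or 'ollama' using the same precedence as llm_chat."""
    backend = (env.get("RAG_LLM_BACKEND", "") or "").strip().lower()
    # 1) An explicit setting wins.
    if backend in ("hf-router", "ollama"):
        return backend
    # 2) On Hugging Face Spaces, default to the router.
    if env.get("HF_SPACE_ID") or env.get("SPACE_ID"):
        return "hf-router"
    # 3) A configured token also selects the router.
    token = env.get("HF_TOKEN") or env.get("HUGGINGFACEHUB_API_TOKEN") or ""
    if token.strip():
        return "hf-router"
    # 4) Otherwise fall back to a local Ollama server.
    return "ollama"


print(choose_backend({"RAG_LLM_BACKEND": "ollama", "HF_TOKEN": "hf_example"}))
print(choose_backend({"HF_TOKEN": "hf_example"}))
print(choose_backend({}))
```

One consequence of this order is that an unrecognized `RAG_LLM_BACKEND` value, such as the older `hf`, is ignored rather than rejected, and selection falls through to the Spaces and token checks.
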
 def github_create_issue(title: str, body: str, labels: Optional[List[str]] = None) -> Tuple[Optional[int], Optional[str]]:
     global _GITHUB_TOKEN_LOGGED
 if "is_thinking" not in st.session_state:
     st.session_state["is_thinking"] = False
 @st.cache_data(show_spinner=False)
 def load_dataset(path: str) -> List[Chunk]:
     return read_chunks_jsonl(path)
 book_stats = compute_stats(book_chunks, book_manifest, book_doc_index)
 article_stats = compute_stats(article_chunks, article_manifest, article_doc_index)
 embedder = load_embedder(EMBED_MODEL)
+@st.cache_resource(show_spinner=False)
+def get_indexes(book_fp: str, article_fp: str) -> Tuple[faiss.Index, faiss.Index]:
+    params = {
+        "normalize_embeddings": True,
+        "dim": getattr(embedder, "get_sentence_embedding_dimension", lambda: None)(),
+        "engine": "faiss",
+    }
+    book_index, _ = load_or_build_index(
+        "books",
+        book_chunks,
+        embedder,
+        BOOK_CHUNKS_PATH,
+        BOOK_MANIFEST_PATH,
+        BOOK_INDEX_PATH,
+        BOOK_META_PATH,
+        params=params,
+        fingerprint=book_fp,
+    )
+    article_index, _ = load_or_build_index(
+        "articles",
+        article_chunks,
+        embedder,
+        ARTICLE_CHUNKS_PATH,
+        ARTICLE_MANIFEST_PATH,
+        ARTICLE_INDEX_PATH,
+        ARTICLE_META_PATH,
+        params=params,
+        fingerprint=article_fp,
+    )
+    return book_index, article_index
+index_params = {
+    "normalize_embeddings": True,
+    "dim": getattr(embedder, "get_sentence_embedding_dimension", lambda: None)(),
+    "engine": "faiss",
+}
+book_fp = compute_fingerprint("books", EMBED_MODEL, BOOK_CHUNKS_PATH, BOOK_MANIFEST_PATH, index_params)
+article_fp = compute_fingerprint("articles", EMBED_MODEL, ARTICLE_CHUNKS_PATH, ARTICLE_MANIFEST_PATH, index_params)
+book_index, article_index = get_indexes(book_fp, article_fp)
+with st.sidebar:
+    st.markdown(f"**Company:** {COMPANY_NAME}")
+    st.markdown(f"**Contact:** {COMPANY_EMAIL} · {COMPANY_PHONE}")
+    st.caption(COMPANY_ABOUT)
+    st.write("")
+    st.subheader("Support")
+    st.caption("If an answer is not found in the dataset, you can create a support ticket (GitHub issue).")
+    st.session_state.setdefault("open_ticket_ui", False)
+    if st.button("Open ticket form", use_container_width=True, disabled=st.session_state["is_thinking"]):
+        st.session_state["open_ticket_ui"] = True
+    st.write("")
+    st.subheader("LLM")
+    st.markdown(f"- Active model: `{st.session_state.get('active_model', get_effective_hf_model())}`")
+    st.write("")
+    st.subheader("Embedding model (retrieval)")
+    st.code(EMBED_MODEL)
+    st.write("")
+    st.subheader("Retrieval settings")
+    st.caption(f"book_k={BOOK_K}, article_k={ARTICLE_K}, per_doc_cap={PER_DOC_CAP}, overlap_filter={OVERLAP_FILTER}")
+    st.markdown("### Dataset Stats")
+    st.write("")
+    st.markdown("**Books + MCP**")
+    st.write(f"Chunk length: min {book_stats['length_min']}, median {book_stats['length_median']}, max {book_stats['length_max']}")
+    st.write("")
+    st.markdown("**Articles**")
+    st.write(f"Chunk length: min {article_stats['length_min']}, median {article_stats['length_median']}, max {article_stats['length_max']}")
+    st.write("")
+    st.markdown("**By type (inferred)**")
+    for k in ["book", "mcp", "article"]:
+        total = 0
+        if k in book_stats["type_counts"]:
+            total += book_stats["type_counts"][k]
+        if k in article_stats["type_counts"]:
+            total += article_stats["type_counts"][k]
+        if total:
+            st.write(f"{k}: {total}")
+    st.write("")
+    ts = st.session_state.get("token_stats")
+    if ts:
+        st.markdown("**Token Consumption (est.)**")
+        st.markdown(f"- Context tokens: `{ts['context_tokens']}` / `{ts['context_cap']}`")
+        st.markdown(f"- Chunks used: `{ts['chunks_used']}` / `{ts['chunks_cap']}`")
+        st.markdown(f"- Prompt tokens: `{ts['prompt_tokens']}`")
+        st.markdown(f"- Generation tokens (max): `{ts['generation_tokens']}`")
+        st.markdown(f"- **Total per request (est.):** `{ts['total_tokens']}`")
+        if ts["context_tokens"] >= int(0.9 * ts["context_cap"]):
+            st.warning("Context near token limit; answers may truncate.")
+    else:
+        st.markdown("_Ask a question to see token usage._")
+    st.write("")
+    st.session_state.setdefault("show_sources", False)
+    st.markdown('<div class="stacked-control sources-btn">', unsafe_allow_html=True)
+    if st.button("Sources (click to expand the list)", use_container_width=True, disabled=st.session_state["is_thinking"]):
+        st.session_state["show_sources"] = not st.session_state["show_sources"]
+    st.markdown("</div>", unsafe_allow_html=True)
+    if st.session_state["show_sources"]:
+        if book_stats["mcp_docs_count"]:
+            mcp_line = f"MCP: {book_stats['mcp_docs_count']} docs"
+            if book_stats["mcp_blocks_total"]:
+                mcp_line += f", {book_stats['mcp_blocks_total']} blocks"
+            st.write(mcp_line)
+        for line in book_stats["sources_lines"]:
+            st.write(line)
+        if article_stats["sources_lines"]:
+            st.write("")
+            st.markdown("**Article sources**")
+            for line in article_stats["sources_lines"]:
+                st.write(line)
 if "chat" not in st.session_state:
     st.session_state["chat"] = []
             break
     return cleaned
 def run_enhance(question: str, enhanced_key: str):
     if not question or not enhanced_key:
         "chunks_cap": INJECT_MAX_CHUNKS,
         "context_cap": MAX_CONTEXT_TOKENS,
     }
+    text, err, meta = llm_chat(gen_prompt)
+    if meta and meta.get("model"):
+        st.session_state["active_model"] = meta["model"]
     if err:
+        if is_model_not_supported(err):
+            render_model_recommendations()
+            with st.expander("Model error details"):
+                st.code(err)
+        else:
+            st.error(err)
         st.warning(f"LLM request failed: {err}")
         return
     if not text: