Spaces: quantumbit/chatbot-gitconnect

quantumbit Copilot committed 24 days ago

Commit

fdb66ba

0 Parent(s):

initial commit

Co-authored-by: Copilot <copilot@github.com>

Files changed (17)

.dockerignore +22 -0
.env.example +9 -0
.github/workflows/deploy-hf-space.yml +39 -0
.gitignore +46 -0
Dockerfile +18 -0
README.md +90 -0
app/__init__.py +0 -0
app/config.py +33 -0
app/main.py +166 -0
app/models.py +50 -0
app/services/__init__.py +0 -0
app/services/gemini_service.py +109 -0
app/services/pdf_service.py +55 -0
app/services/rag_service.py +130 -0
app/services/student_service.py +22 -0
app/vector_store.py +107 -0
requirements.txt +11 -0

.dockerignore ADDED

	@@ -0,0 +1,22 @@

+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+*.so
+*.egg-info/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+.venv/
+venv/
+env/
+.env
+app/.env
+.vscode/
+.git/
+.gitignore
+data/
+*.log

.env.example ADDED

	@@ -0,0 +1,9 @@

+GEMINI_API_KEY=your_gemini_api_key
+GEMINI_MODEL=gemini-2.5-flash
+EMBEDDING_MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2
+STUDENT_PERFORMANCE_URL_TEMPLATE=https://git-connect-backend-v2.vercel.app/api/student/{student_id}/performance
+VECTOR_DATA_DIR=data/vector_index
+RAW_TEXT_DIR=data/raw_text
+PDF_TIMEOUT_SEC=60
+PDF_MAX_RETRIES=3
+PDF_RETRY_BACKOFF_SEC=1.5

.github/workflows/deploy-hf-space.yml ADDED

	@@ -0,0 +1,39 @@

+name: Deploy To Hugging Face Space
+on:
+  push:
+    branches:
+      - main
+  workflow_dispatch:
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - name: Validate required secrets
+        shell: bash
+        run: |
+          test -n "${{ secrets.HF_TOKEN }}" || (echo "HF_TOKEN is missing" && exit 1)
+          test -n "${{ secrets.HF_SPACE_REPO_ID }}" || (echo "HF_SPACE_REPO_ID is missing" && exit 1)
+      - name: Configure git user
+        run: |
+          git config --global user.name "github-actions[bot]"
+          git config --global user.email "github-actions[bot]@users.noreply.github.com"
+      - name: Push repository to Hugging Face Space
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+          HF_SPACE_REPO_ID: ${{ secrets.HF_SPACE_REPO_ID }}
+        shell: bash
+        run: |
+          HF_URL="https://oauth2:${HF_TOKEN}@huggingface.co/spaces/${HF_SPACE_REPO_ID}"
+          git remote remove hf 2>/dev/null || true
+          git remote add hf "${HF_URL}"
+          git push hf HEAD:main --force

.gitignore ADDED

	@@ -0,0 +1,46 @@

+# Python
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+*.so
+*.egg-info/
+.pytest_cache/
+.mypy_cache/
+.ruff_cache/
+# Virtual environments
+.venv/
+venv/
+env/
+# Environment files and secrets
+.env
+app/.env
+*.env.local
+# IDE/editor
+.vscode/
+.idea/
+# OS files
+.DS_Store
+Thumbs.db
+# Runtime and local data
+data/raw_text/
+data/vector_index/
+*.faiss
+*.meta.json
+*.log
+# Hugging Face and model cache
+.cache/
+.huggingface/
+# Build artifacts
+dist/
+build/
+prompt.md
+stud_info.md
+test_db_to_api.py

Dockerfile ADDED

	@@ -0,0 +1,18 @@

+FROM python:3.12-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1
+WORKDIR /app
+COPY requirements.txt /app/requirements.txt
+RUN pip install --upgrade pip && pip install -r /app/requirements.txt
+COPY app /app/app
+COPY .env.example /app/.env.example
+COPY README.md /app/README.md
+EXPOSE 7860
+CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]

README.md ADDED

	@@ -0,0 +1,90 @@

+# GitConnect FastAPI Service
+FastAPI backend with two endpoints:
+- Syllabus processing from PDF URLs with FAISS vector indexing and multilingual course summaries.
+- Chatbot endpoint using student performance data + semester-scoped syllabus RAG with MMR.
+Embedding setup:
+- Uses local Hugging Face sentence embeddings (default: `sentence-transformers/all-MiniLM-L6-v2`).
+- Uses FAISS (`IndexFlatIP` with L2-normalized vectors) for similarity search.
+- Gemini is used for summarization and chatbot generation.
+## Setup
+1. Create a virtual environment and activate it.
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+3. Create `.env` from `.env.example` and fill `GEMINI_API_KEY`.
+4. Run server:
+```bash
+uvicorn app.main:app --reload
+```
+## Endpoints
+- `GET /health`
+- `POST /api/syllabus/process`
+- `POST /api/chat`
+## Deploy to Hugging Face Spaces
+This repo is configured for Docker Spaces deployment.
+Files used for deployment:
+- `Dockerfile`
+- `requirements.txt`
+- `app/`
+- `.github/workflows/deploy-hf-space.yml`
+Files intentionally not pushed from local machine:
+- `.env` and `app/.env`
+- `data/` generated files
+- local caches and `__pycache__/`
+GitHub Action deployment:
+- Trigger: push to `main` or manual run
+- Workflow file: `.github/workflows/deploy-hf-space.yml`
+Required GitHub repository secrets:
+- `HF_TOKEN`: Hugging Face write token
+- `HF_SPACE_REPO_ID`: in format `username/space-name`
+After setting secrets, pushing to `main` will sync this repo to your Hugging Face Space `main` branch.
+Student performance is fetched from the URL template in `STUDENT_PERFORMANCE_URL_TEMPLATE`, which defaults to:
+- `https://git-connect-backend-v2.vercel.app/api/student/{student_id}/performance`
+## Sample syllabus request
+```json
+[
+  {
+    "course_code": "22CS501",
+    "name": "Database Management Systems",
+    "course_type": "theory",
+    "syllabus_url": "https://example.com/dbms.pdf",
+    "semester": 5
+  }
+]
+```
+## Sample chat request
+```json
+{
+  "query": "How can I improve attendance this semester?",
+  "history": [
+    {"role": "user", "content": "Hi"},
+    {"role": "assistant", "content": "Hello"}
+  ],
+  "student_id": 123,
+  "lang_code": "en",
+  "semester": 5
+}
+```
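Review note: the sample chat request can be exercised from Python with only the standard library. The sketch below is not part of the commit. It builds the request without sending it, and it assumes the service is running locally on uvicorn's default port (`BASE_URL` is an assumption, as is the helper name `build_chat_request`). `student_id` is passed as an integer because `ChatRequest` in `app/models.py` declares it as `int` with `ge=1`.

```python
import json
import urllib.request

# Assumption: local server started with `uvicorn app.main:app --reload`.
BASE_URL = "http://127.0.0.1:8000"


def build_chat_request(
    query: str, student_id: int, lang_code: str, semester: int
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/chat."""
    payload = {
        "query": query,
        "history": [],             # optional; the model caps it at 5 messages
        "student_id": student_id,  # int >= 1 per app/models.py
        "lang_code": lang_code,    # one of "en", "hn", "mr", "kn"
        "semester": semester,      # 1..12
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("How can I improve attendance this semester?", 123, "en", 5)

# Sending it requires a running server:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp)["reply_markdown"])
```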

app/__init__.py ADDED

File without changes

app/config.py ADDED

	@@ -0,0 +1,33 @@

+import os
+from pathlib import Path
+from dotenv import load_dotenv
+# Load both potential env file locations.
+_APP_DIR = Path(__file__).resolve().parent
+_ROOT_ENV = _APP_DIR.parent / ".env"
+_APP_ENV = _APP_DIR / ".env"
+load_dotenv(dotenv_path=_ROOT_ENV, override=False)
+load_dotenv(dotenv_path=_APP_ENV, override=True)
+class Settings:
+    gemini_api_key: str = os.getenv("GEMINI_API_KEY", "")
+    embedding_model_name: str = os.getenv(
+        "EMBEDDING_MODEL_NAME",
+        "sentence-transformers/all-MiniLM-L6-v2",
+    )
+    student_performance_url_template: str = os.getenv(
+        "STUDENT_PERFORMANCE_URL_TEMPLATE",
+        "https://git-connect-backend-v2.vercel.app/api/student/{student_id}/performance",
+    )
+    vector_data_dir: str = os.getenv("VECTOR_DATA_DIR", "data/vector_index")
+    raw_text_dir: str = os.getenv("RAW_TEXT_DIR", "data/raw_text")
+    gemini_model: str = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
+    pdf_timeout_sec: int = int(os.getenv("PDF_TIMEOUT_SEC", "60"))
+    pdf_max_retries: int = int(os.getenv("PDF_MAX_RETRIES", "3"))
+    pdf_retry_backoff_sec: float = float(os.getenv("PDF_RETRY_BACKOFF_SEC", "1.5"))
+settings = Settings()
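Review note: the two `load_dotenv` calls give a three-level precedence. The root `.env` is loaded with `override=False`, so it only fills in variables the process does not already have. `app/.env` is loaded with `override=True`, so it wins over both. The simulation below is a sketch of that behaviour using plain dictionaries rather than python-dotenv itself; `apply_env_file` and `resolve` are hypothetical helper names.

```python
def apply_env_file(env: dict[str, str], file_values: dict[str, str], override: bool) -> None:
    """Mimic load_dotenv: set each key, skipping existing ones unless override=True."""
    for key, value in file_values.items():
        if override or key not in env:
            env[key] = value


def resolve(
    process_env: dict[str, str],
    root_env: dict[str, str],
    app_env: dict[str, str],
) -> dict[str, str]:
    """Apply the same load order as app/config.py to a copy of the process env."""
    env = dict(process_env)
    apply_env_file(env, root_env, override=False)  # root .env fills gaps only
    apply_env_file(env, app_env, override=True)    # app/.env overrides everything
    return env


resolved = resolve(
    process_env={"GEMINI_MODEL": "from-process"},
    root_env={"GEMINI_MODEL": "from-root", "PDF_TIMEOUT_SEC": "30"},
    app_env={"GEMINI_MODEL": "from-app"},
)
print(resolved)  # {'GEMINI_MODEL': 'from-app', 'PDF_TIMEOUT_SEC': '30'}
```

Because `Settings` reads `os.getenv` in its class body, the values are fixed when `app.config` is first imported; later changes to the environment are not picked up.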

app/main.py ADDED

	@@ -0,0 +1,166 @@

+import os
+from typing import List
+from fastapi import FastAPI, HTTPException
+from app.config import settings
+from app.models import (
+    ChatRequest,
+    ChatResponse,
+    CourseInput,
+    CourseProcessError,
+    CourseSummary,
+    SyllabusProcessResponse,
+)
+from app.services.gemini_service import GeminiService
+from app.services.pdf_service import chunk_text, fetch_pdf_text
+from app.services.rag_service import build_student_documents, mmr_select
+from app.services.student_service import fetch_student_info
+from app.vector_store import LocalVectorStore
+app = FastAPI(title="GitConnect Chatbot Service", version="0.1.0")
+@app.get("/health")
+def health() -> dict:
+    return {"status": "ok"}
+@app.post("/api/syllabus/process", response_model=SyllabusProcessResponse)
+def process_syllabus(courses: List[CourseInput]) -> SyllabusProcessResponse:
+    if not courses:
+        return SyllabusProcessResponse(
+            results=[],
+            failed=[],
+            total_received=0,
+            total_processed=0,
+            total_failed=0,
+        )
+    try:
+        gemini = GeminiService(
+            settings.gemini_api_key,
+            settings.gemini_model,
+            settings.embedding_model_name,
+        )
+    except ValueError as exc:
+        raise HTTPException(status_code=500, detail=str(exc)) from exc
+    vector_store = LocalVectorStore(settings.vector_data_dir)
+    os.makedirs(settings.raw_text_dir, exist_ok=True)
+    results: List[CourseSummary] = []
+    failed: List[CourseProcessError] = []
+    for course in courses:
+        try:
+            syllabus_text = fetch_pdf_text(
+                str(course.syllabus_url),
+                timeout=settings.pdf_timeout_sec,
+                max_retries=settings.pdf_max_retries,
+                backoff_sec=settings.pdf_retry_backoff_sec,
+            )
+            if not syllabus_text:
+                raise RuntimeError("No text extracted from PDF.")
+            raw_path = os.path.join(settings.raw_text_dir, f"{course.course_code}.txt")
+            with open(raw_path, "w", encoding="utf-8") as f:
+                f.write(syllabus_text)
+            chunks = chunk_text(syllabus_text)
+            if not chunks:
+                raise RuntimeError("Unable to create text chunks from syllabus content.")
+            embeddings = [
+                gemini.embed_text(chunk, task_type="retrieval_document")
+                for chunk in chunks
+            ]
+            vector_store.upsert_documents(course.semester, course.course_code, chunks, embeddings)
+            ai_summary = gemini.summarize_multilingual(course.name, syllabus_text)
+            results.append(CourseSummary(course_code=course.course_code, ai_summary=ai_summary))
+        except Exception as exc:
+            failed.append(CourseProcessError(course_code=course.course_code, error=str(exc)))
+    return SyllabusProcessResponse(
+        results=results,
+        failed=failed,
+        total_received=len(courses),
+        total_processed=len(results),
+        total_failed=len(failed),
+    )
+@app.post("/api/chat", response_model=ChatResponse)
+def chat(req: ChatRequest) -> ChatResponse:
+    try:
+        gemini = GeminiService(
+            settings.gemini_api_key,
+            settings.gemini_model,
+            settings.embedding_model_name,
+        )
+    except ValueError as exc:
+        raise HTTPException(status_code=500, detail=str(exc)) from exc
+    vector_store = LocalVectorStore(settings.vector_data_dir)
+    try:
+        student_info = fetch_student_info(
+            settings.student_performance_url_template,
+            req.student_id,
+        )
+    except Exception as exc:
+        raise HTTPException(status_code=502, detail=f"Student info fetch failed: {exc}") from exc
+    try:
+        query_embedding = gemini.embed_text(req.query, task_type="retrieval_query")
+        syllabus_hits = vector_store.search(req.semester, query_embedding, top_k=20)
+        for hit in syllabus_hits:
+            hit["source"] = "syllabus"
+        student_docs = build_student_documents(student_info)
+        for doc in student_docs:
+            doc["embedding"] = gemini.embed_text(
+                doc["chunk"],
+                task_type="retrieval_document",
+            )
+        combined_candidates = syllabus_hits + student_docs
+        hits = mmr_select(
+            query_embedding=query_embedding,
+            candidates=combined_candidates,
+            top_k=8,
+            lambda_param=0.7,
+        )
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"RAG retrieval failed: {exc}") from exc
+    rag_chunks = [f"[{h.get('source', 'unknown')}] {h['chunk']}" for h in hits]
+    retrieved_course_codes = sorted(
+        list(
+            {
+                h.get("course_code", "")
+                for h in hits
+                if str(h.get("course_code", "")).strip()
+            }
+        )
+    )
+    try:
+        reply = gemini.chat_with_context(
+            query=req.query,
+            lang_code=req.lang_code,
+            history=[msg.model_dump() for msg in req.history],
+            student_info=student_info,
+            rag_chunks=rag_chunks,
+        )
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"LLM response failed: {exc}") from exc
+    return ChatResponse(
+        reply_markdown=reply,
+        retrieved_course_codes=retrieved_course_codes,
+        student_info=student_info,
+    )

app/models.py ADDED

	@@ -0,0 +1,50 @@

+from typing import Dict, List, Literal, Optional
+from pydantic import BaseModel, Field, HttpUrl
+LangCode = Literal["en", "hn", "mr", "kn"]
+class CourseInput(BaseModel):
+    course_code: str = Field(..., min_length=1)
+    name: str = Field(..., min_length=1)
+    course_type: str = Field(..., min_length=1)
+    syllabus_url: HttpUrl
+    semester: int = Field(..., ge=1, le=12)
+class CourseSummary(BaseModel):
+    course_code: str
+    ai_summary: Dict[LangCode, str]
+class CourseProcessError(BaseModel):
+    course_code: str
+    error: str
+class SyllabusProcessResponse(BaseModel):
+    results: List[CourseSummary]
+    failed: List[CourseProcessError] = Field(default_factory=list)
+    total_received: int
+    total_processed: int
+    total_failed: int
+class ChatMessage(BaseModel):
+    role: Literal["user", "assistant", "system"]
+    content: str
+class ChatRequest(BaseModel):
+    query: str
+    history: List[ChatMessage] = Field(default_factory=list, max_length=5)
+    student_id: int = Field(..., ge=1)
+    lang_code: LangCode
+    semester: int = Field(..., ge=1, le=12)
+class ChatResponse(BaseModel):
+    reply_markdown: str
+    retrieved_course_codes: List[str]
+    student_info: Optional[dict] = None

app/services/__init__.py ADDED

File without changes

app/services/gemini_service.py ADDED

	@@ -0,0 +1,109 @@

+import json
+from typing import Dict, List
+import google.generativeai as genai
+from sentence_transformers import SentenceTransformer
+class GeminiService:
+    _embedding_model_cache: dict[str, SentenceTransformer] = {}
+    def __init__(self, api_key: str, model_name: str, embedding_model_name: str) -> None:
+        if not api_key:
+            raise ValueError("GEMINI_API_KEY is not set.")
+        genai.configure(api_key=api_key)
+        self._model = genai.GenerativeModel(model_name)
+        self._embedding_model = self._get_embedding_model(embedding_model_name)
+    def embed_text(self, text: str, task_type: str) -> List[float]:
+        # Small local HF embeddings for RAG; task_type kept for API compatibility.
+        _ = task_type
+        emb = self._embedding_model.encode(text, normalize_embeddings=False)
+        return emb.tolist()
+    def summarize_multilingual(self, course_name: str, syllabus_text: str) -> Dict[str, str]:
+        prompt = f"""
+You are an academic assistant.
+Summarize the course syllabus content for course: {course_name}.
+Return STRICT JSON with this exact schema and keys only:
+{{
+  "en": "English summary",
+  "mr": "Marathi summary",
+  "kn": "Kannada summary",
+  "hn": "Hindi summary"
+}}
+Rules:
+- Keep each summary clear for students and parents.
+- 80-140 words per language.
+- Use only the syllabus context below.
+Syllabus context:
+{syllabus_text[:12000]}
+"""
+        raw = self._model.generate_content(prompt).text
+        return self._safe_parse_summary_json(raw)
+    def chat_with_context(
+        self,
+        query: str,
+        lang_code: str,
+        history: List[dict],
+        student_info: dict,
+        rag_chunks: List[str],
+    ) -> str:
+        history_text = "\n".join(
+            [f"{msg.get('role', 'user')}: {msg.get('content', '')}" for msg in history]
+        )
+        syllabus_context = "\n\n---\n\n".join(rag_chunks[:8])
+        prompt = f"""
+You are a helpful college assistant chatbot for students and parents.
+Respond in language code: {lang_code}
+Supported codes: en, hn, mr, kn.
+Return the final answer in markdown.
+Student data (attendance, result etc.):
+{json.dumps(student_info, ensure_ascii=False)}
+Recent chat history:
+{history_text}
+Relevant syllabus context:
+{syllabus_context}
+User query:
+{query}
+Answer guidelines:
+- Be accurate and grounded in provided info.
+- If data is missing, state what is missing.
+- Keep response practical and concise.
+- Use markdown with bullets or short headings when useful.
+"""
+        return self._model.generate_content(prompt).text
+    def _safe_parse_summary_json(self, raw: str) -> Dict[str, str]:
+        text = raw.strip()
+        if text.startswith("```"):
+            text = text.strip("`")
+            if text.startswith("json"):
+                text = text[4:].strip()
+        parsed = json.loads(text)
+        return {
+            "en": str(parsed.get("en", "")),
+            "mr": str(parsed.get("mr", "")),
+            "kn": str(parsed.get("kn", "")),
+            "hn": str(parsed.get("hn", "")),
+        }
+    @classmethod
+    def _get_embedding_model(cls, embedding_model_name: str) -> SentenceTransformer:
+        if embedding_model_name not in cls._embedding_model_cache:
+            cls._embedding_model_cache[embedding_model_name] = SentenceTransformer(
+                embedding_model_name
+            )
+        return cls._embedding_model_cache[embedding_model_name]
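Review note: `_safe_parse_summary_json` handles a reply wrapped in a Markdown code fence, but it raises if the model adds any prose before or after the JSON. A more tolerant variant could extract the outermost `{...}` span first. The function below is a possible alternative, not the committed implementation; `parse_summary_json` and `LANGS` are hypothetical names, and the approach assumes the reply contains exactly one JSON object.

```python
import json
import re

# Language keys requested by the summarization prompt.
LANGS = ("en", "mr", "kn", "hn")


def parse_summary_json(raw: str) -> dict[str, str]:
    """Extract the first '{' through the last '}' and parse it as JSON."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output.")
    parsed = json.loads(match.group(0))
    # Missing languages become empty strings, matching the committed behaviour.
    return {lang: str(parsed.get(lang, "")) for lang in LANGS}


fenced = '```json\n{"en": "Intro to databases", "mr": "x"}\n```'
chatty = 'Here is the summary:\n{"en": "Intro to databases"}\nHope this helps.'
print(parse_summary_json(fenced)["en"])  # Intro to databases
print(parse_summary_json(chatty)["en"])  # Intro to databases
```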

app/services/pdf_service.py ADDED

	@@ -0,0 +1,55 @@

+import io
+import time
+import requests
+from pypdf import PdfReader
+def fetch_pdf_text(
+    pdf_url: str,
+    timeout: int = 60,
+    max_retries: int = 3,
+    backoff_sec: float = 1.5,
+) -> str:
+    last_exc: Exception | None = None
+    for attempt in range(max_retries):
+        try:
+            response = requests.get(pdf_url, timeout=timeout)
+            response.raise_for_status()
+            pdf_stream = io.BytesIO(response.content)
+            reader = PdfReader(pdf_stream)
+            extracted = []
+            for page in reader.pages:
+                text = page.extract_text() or ""
+                if text.strip():
+                    extracted.append(text)
+            return "\n\n".join(extracted).strip()
+        except Exception as exc:
+            last_exc = exc
+            if attempt < max_retries - 1:
+                sleep_sec = backoff_sec * (2 ** attempt)
+                time.sleep(sleep_sec)
+    raise RuntimeError(f"Failed to fetch PDF after {max_retries} attempts: {last_exc}")
+def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
+    if not text.strip():
+        return []
+    clean_text = " ".join(text.split())
+    chunks = []
+    start = 0
+    step = max(chunk_size - overlap, 1)
+    while start < len(clean_text):
+        end = min(start + chunk_size, len(clean_text))
+        chunks.append(clean_text[start:end])
+        if end >= len(clean_text):
+            break
+        start += step
+    return chunks
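Review note: `chunk_text` slides a fixed-size character window forward by `chunk_size - overlap` each step, so consecutive chunks share `overlap` characters. The helper below mirrors that loop but returns the `(start, end)` offsets instead of the text, which makes the window arithmetic easier to inspect. `chunk_spans` is a hypothetical name used only for this illustration.

```python
def chunk_spans(length: int, chunk_size: int = 1000, overlap: int = 150) -> list[tuple[int, int]]:
    """Return the (start, end) character offsets that chunk_text would slice."""
    spans: list[tuple[int, int]] = []
    start = 0
    step = max(chunk_size - overlap, 1)  # same guard as chunk_text against a zero step
    while start < length:
        end = min(start + chunk_size, length)
        spans.append((start, end))
        if end >= length:
            break
        start += step
    return spans


print(chunk_spans(10, chunk_size=4, overlap=1))  # [(0, 4), (3, 7), (6, 10)]
print(chunk_spans(2500))  # [(0, 1000), (850, 1850), (1700, 2500)]
```

With the defaults, a 2,500-character syllabus yields three chunks, and the last one is shorter than `chunk_size` because it is clipped at the end of the text.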

app/services/rag_service.py ADDED

	@@ -0,0 +1,130 @@

+from typing import Dict, List
+import numpy as np
+def build_student_documents(student_info: Dict) -> List[Dict]:
+    docs: List[Dict] = []
+    attendance = student_info.get("attendance", {})
+    results = student_info.get("results", {})
+    overall_attendance = attendance.get("overall_pct")
+    overall_status = attendance.get("overall_status")
+    current_cgpa = results.get("current_cgpa")
+    docs.append(
+        {
+            "source": "student_profile",
+            "course_code": "",
+            "chunk": (
+                f"Student profile overview: current CGPA is {current_cgpa}. "
+                f"Overall attendance is {overall_attendance}% with status {overall_status}."
+            ),
+        }
+    )
+    for subject in attendance.get("subjects", []):
+        docs.append(
+            {
+                "source": "attendance",
+                "course_code": str(subject.get("course_code", "")),
+                "chunk": (
+                    f"Attendance for {subject.get('subject', '')} ({subject.get('course_code', '')}): "
+                    f"{subject.get('attended_classes', 0)}/{subject.get('total_classes', 0)} classes, "
+                    f"{subject.get('attendance_pct', 0)}%, status {subject.get('status', 'unknown')}."
+                ),
+            }
+        )
+    for sem in results.get("semesters", []):
+        sem_no = sem.get("semester")
+        docs.append(
+            {
+                "source": "results_semester",
+                "course_code": "",
+                "chunk": (
+                    f"Semester {sem_no} performance summary: SGPA {sem.get('sgpa', 'NA')}, "
+                    f"CGPA {sem.get('cgpa', 'NA')}."
+                ),
+            }
+        )
+        for subject in sem.get("subjects", []):
+            docs.append(
+                {
+                    "source": "results_subject",
+                    "course_code": str(subject.get("course_code", "")),
+                    "chunk": (
+                        f"Result for {subject.get('subject', '')} ({subject.get('course_code', '')}) "
+                        f"in semester {sem_no}: total score {subject.get('total', 'NA')}"
+                    ),
+                }
+            )
+    return docs
+def mmr_select(
+    query_embedding: List[float],
+    candidates: List[Dict],
+    top_k: int = 8,
+    lambda_param: float = 0.7,
+) -> List[Dict]:
+    if not candidates:
+        return []
+    vectors = []
+    valid_candidates = []
+    for c in candidates:
+        emb = c.get("embedding")
+        if not emb:
+            continue
+        vec = np.array(emb, dtype=np.float32)
+        if np.linalg.norm(vec) == 0:
+            continue
+        vectors.append(vec)
+        valid_candidates.append(c)
+    if not valid_candidates:
+        return []
+    q = np.array(query_embedding, dtype=np.float32)
+    q_norm = np.linalg.norm(q)
+    if q_norm == 0:
+        return valid_candidates[:top_k]
+    vectors_np = np.stack(vectors)
+    vec_norms = np.linalg.norm(vectors_np, axis=1)
+    query_sims = (vectors_np @ q) / (vec_norms * q_norm)
+    selected_idx: List[int] = []
+    candidate_idx = list(range(len(valid_candidates)))
+    first_idx = int(np.argmax(query_sims))
+    selected_idx.append(first_idx)
+    candidate_idx.remove(first_idx)
+    while candidate_idx and len(selected_idx) < top_k:
+        best_idx = None
+        best_score = -1e9
+        for idx in candidate_idx:
+            relevance = float(query_sims[idx])
+            # Redundancy penalty: highest cosine similarity to any already-selected item.
+            diversity = max(
+                float(np.dot(vectors_np[idx], vectors_np[s]) / (vec_norms[idx] * vec_norms[s]))
+                for s in selected_idx
+            )
+            mmr_score = lambda_param * relevance - (1.0 - lambda_param) * diversity
+            if mmr_score > best_score:
+                best_score = mmr_score
+                best_idx = idx
+        if best_idx is None:
+            break
+        selected_idx.append(best_idx)
+        candidate_idx.remove(best_idx)
+    return [valid_candidates[i] for i in selected_idx]
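Review note: the effect of `lambda_param` in `mmr_select` is easiest to see on a toy example. The sketch below reimplements the same scoring rule (`lambda * relevance - (1 - lambda) * max similarity to selected`) in pure Python and returns candidate indices. It is an illustration, not the committed code; `cosine` and `mmr_order` are hypothetical names, and the 2-D vectors are made up.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mmr_order(query: list[float], vectors: list[list[float]], top_k: int, lam: float) -> list[int]:
    """Return indices chosen by maximal marginal relevance, most relevant first."""
    if not vectors or top_k < 1:
        return []
    relevance = [cosine(v, query) for v in vectors]
    selected = [max(range(len(vectors)), key=relevance.__getitem__)]
    remaining = [i for i in range(len(vectors)) if i != selected[0]]
    while remaining and len(selected) < top_k:
        def score(i: int) -> float:
            redundancy = max(cosine(vectors[i], vectors[s]) for s in selected)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Index 1 is a near-duplicate of index 0; index 2 is less relevant but different.
candidates = [[0.9, 0.1], [0.9, 0.12], [0.6, -0.8]]

print(mmr_order(query, candidates, top_k=2, lam=1.0))  # [0, 1]  pure relevance
print(mmr_order(query, candidates, top_k=2, lam=0.7))  # [0, 1]  near-duplicate still wins
print(mmr_order(query, candidates, top_k=2, lam=0.5))  # [0, 2]  diversity wins
```

In this toy setup the value used in `app/main.py` (`lambda_param=0.7`) still favours the near-duplicate, so the penalty only changes the ranking when candidates are both highly redundant and close in relevance.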

app/services/student_service.py ADDED

	@@ -0,0 +1,22 @@

+import requests
+def fetch_student_info(
+    student_performance_url_template: str,
+    student_id: int,
+    timeout: int = 20,
+) -> dict:
+    if "{student_id}" not in student_performance_url_template:
+        raise ValueError(
+            "STUDENT_PERFORMANCE_URL_TEMPLATE must include '{student_id}'."
+        )
+    student_url = student_performance_url_template.format(student_id=student_id)
+    get_resp = requests.get(student_url, timeout=timeout)
+    if not get_resp.ok:
+        raise RuntimeError(
+            f"Failed to fetch student info from {student_url}; "
+            f"status {get_resp.status_code}."
+        )
+    return get_resp.json()

app/vector_store.py ADDED

	@@ -0,0 +1,107 @@

+import json
+import os
+from typing import List
+import faiss
+import numpy as np
+class LocalVectorStore:
+    def __init__(self, base_dir: str) -> None:
+        self.base_dir = base_dir
+        os.makedirs(self.base_dir, exist_ok=True)
+    def upsert_documents(
+        self,
+        semester: int,
+        course_code: str,
+        chunks: List[str],
+        embeddings: List[List[float]],
+    ) -> None:
+        meta_path = self._semester_meta_path(semester)
+        records = self._load(meta_path)
+        records = [r for r in records if r.get("course_code") != course_code]
+        for chunk, vector in zip(chunks, embeddings):
+            records.append(
+                {
+                    "course_code": course_code,
+                    "chunk": chunk,
+                    "embedding": vector,
+                }
+            )
+        self._save(meta_path, records)
+        self._rebuild_faiss_index(semester, records)
+    def search(self, semester: int, query_embedding: List[float], top_k: int = 6) -> List[dict]:
+        meta_path = self._semester_meta_path(semester)
+        index_path = self._semester_index_path(semester)
+        records = self._load(meta_path)
+        if not records:
+            return []
+        if not os.path.exists(index_path):
+            self._rebuild_faiss_index(semester, records)
+        if not os.path.exists(index_path):
+            return []
+        index = faiss.read_index(index_path)
+        q = np.array(query_embedding, dtype=np.float32).reshape(1, -1)
+        faiss.normalize_L2(q)
+        k = min(top_k, len(records))
+        _, indices = index.search(q, k)
+        hits = []
+        for idx in indices[0].tolist():
+            if idx == -1:
+                continue
+            if 0 <= idx < len(records):
+                record = records[idx]
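+                hits.append(
+                    {
+                        "course_code": record.get("course_code", ""),
+                        "chunk": record.get("chunk", ""),
+                        # mmr_select skips candidates without an embedding, so pass it through.
+                        "embedding": record.get("embedding", []),
+                    }
+                )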
+        return hits
+    def _semester_meta_path(self, semester: int) -> str:
+        return os.path.join(self.base_dir, f"semester_{semester}.meta.json")
+    def _semester_index_path(self, semester: int) -> str:
+        return os.path.join(self.base_dir, f"semester_{semester}.faiss")
+    def _rebuild_faiss_index(self, semester: int, records: List[dict]) -> None:
+        index_path = self._semester_index_path(semester)
+        if not records:
+            if os.path.exists(index_path):
+                os.remove(index_path)
+            return
+        vectors = np.array([r["embedding"] for r in records], dtype=np.float32)
+        if vectors.ndim != 2 or vectors.shape[0] == 0:
+            return
+        faiss.normalize_L2(vectors)
+        dim = vectors.shape[1]
+        index = faiss.IndexFlatIP(dim)
+        index.add(vectors)
+        faiss.write_index(index, index_path)
+    @staticmethod
+    def _load(path: str) -> List[dict]:
+        if not os.path.exists(path):
+            return []
+        with open(path, "r", encoding="utf-8") as f:
+            return json.load(f)
+    @staticmethod
+    def _save(path: str, data: List[dict]) -> None:
+        with open(path, "w", encoding="utf-8") as f:
+            json.dump(data, f, ensure_ascii=False)
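Review note: the store normalizes vectors with `faiss.normalize_L2` before adding them to an `IndexFlatIP`, and normalizes the query the same way. The inner product of two unit-length vectors equals their cosine similarity, so the index effectively ranks by cosine. The check below demonstrates that identity in pure Python without FAISS; the helper names and the sample vectors are illustrative only.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: list[float], v: list[float]) -> float:
    return inner_product(u, v) / (
        math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    )


doc = [3.0, 4.0]
query = [4.0, 3.0]

ip_of_normalized = inner_product(l2_normalize(doc), l2_normalize(query))
cos = cosine_similarity(doc, query)
print(math.isclose(ip_of_normalized, cos))  # True (both are 24/25 = 0.96)
```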

requirements.txt ADDED

	@@ -0,0 +1,11 @@

+fastapi>=0.115.0
+uvicorn[standard]>=0.30.0
+requests>=2.32.0
+pydantic>=2.8.0
+python-dotenv>=1.0.1
+pypdf>=5.0.0
+numpy>=2.0.0
+google-generativeai>=0.7.2
+psycopg[binary]>=3.2.0
+sentence-transformers>=3.0.1
+faiss-cpu>=1.8.0.post1