Spaces: Nitish-py/Evaluator-core

jayeshdiro committed on Mar 28

Commit

facefda

0 Parent(s):

Initial commit

Files changed (12)

.gitattributes +35 -0
.gitignore +1 -0
DESIGN_NOTE.md +50 -0
Dockerfile +20 -0
README.md +117 -0
app.py +551 -0
description.text +91 -0
description.txt +91 -0
docker-compose.yml +60 -0
output.json +244 -0
prompts.py +87 -0
requirements.txt +13 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1 @@

+.env

DESIGN_NOTE.md ADDED

	@@ -0,0 +1,50 @@

+# Design Note: AI-Assisted Evaluation MVP
+## Goal
+Build a small AI-assisted evaluation system that can ingest multiple artefacts, create a unified understanding, cross-check claims, and produce structured scoring grounded in retrieved evidence.
+## Design Choice
+The MVP was built as a single Streamlit app with Milvus as the evidence store. The key architectural choice was to move from source-specific chat collections to one unified collection per submission/user so that all artefacts can contribute to one evaluation context.
+## Evidence Flow
+1. Ingest artefacts from document, code, URL, and video sources.
+2. Extract text and attach source metadata.
+3. Chunk and embed the content.
+4. Store all evidence in one Milvus collection for the current username.
+5. Retrieve evidence from the full collection during evaluation.
+6. Ask the LLM to return JSON only, including summary, claims, evidence, risks, and rubric scores.
+## Why This Approach
+- It keeps the implementation practical for a 4-5 hour assignment.
+- It demonstrates the core evidence-layer thinking the assignment asks for.
+- It supports multi-source reasoning without overbuilding infrastructure.
+- It makes the output traceable to retrieved evidence snippets.
+## Current Strengths
+- Unified evidence layer
+- Multi-source ingestion
+- Retrieval-backed evaluation
+- Claim extraction with support labels
+- Rubric-based scoring
+- Structured JSON output
+## Current Limitations
+- Prototype URL validation is limited to text extraction, not browser interaction.
+- Claim cross-checking is prompt-driven, not a dedicated comparison engine.
+- Code ingestion is file-upload based, not full repository traversal.
+- Code chunking is character-based rather than semantic.
+- Confidence and scoring are LLM-generated rather than calibrated.
+## Practical Tradeoffs
+- Preferred shipping a working evaluator skeleton over building incomplete automation-heavy features.
+- Kept the app single-file to maximize iteration speed during the assignment window.
+- Added explicit output structure and normalization to reduce brittle LLM formatting.
+## Next Steps
+1. Add lightweight prototype validation for URLs.
+2. Add explicit `claim_validation` output with claimed-in vs supported-by mapping.
+3. Improve code ingestion to accept repos/zips/folders.
+4. Add stronger evidence citation formatting and exportable result files.
+## Summary
+This MVP does not fully solve the end-state problem, but it establishes the correct system direction: unified evidence ingestion, retrieval-grounded evaluation, basic claim validation, and rubric scoring across multiple artefacts.
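
Next Steps item 2 proposes an explicit `claim_validation` output with a claimed-in vs supported-by mapping. One way that record could look is sketched below. The class name, field names, and the rule that maps evidence coverage onto the existing `supported | partial | uncertain` labels are all assumptions for illustration; none of them appear in this commit.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ClaimValidation:
    """Hypothetical record for one claim (illustrative field names)."""
    claim: str
    claimed_in: list = field(default_factory=list)    # sources that state the claim
    supported_by: list = field(default_factory=list)  # sources that back it up

    @property
    def status(self) -> str:
        """Map evidence coverage onto the existing support labels (assumed rule)."""
        if not self.supported_by:
            return "uncertain"
        # Support that only comes from where the claim was made is self-reported.
        independent = set(self.supported_by) - set(self.claimed_in)
        return "supported" if independent else "partial"

    def to_dict(self) -> dict:
        """Serialize the fields plus the derived status for JSON output."""
        data = asdict(self)
        data["status"] = self.status
        return data
```

A claim made in `README.md` and backed by `app.py` would come out as `supported`, while a claim whose only support is the document that made it would be `partial`.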

Dockerfile ADDED

	@@ -0,0 +1,20 @@

+FROM python:3.13.5-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    curl \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+# Install dependencies before copying the source so this layer stays cached
+COPY requirements.txt ./
+RUN pip3 install -r requirements.txt
+COPY ./ ./
+EXPOSE 8501
+HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

README.md ADDED

	@@ -0,0 +1,117 @@

+---
+title: Evaluator Core
+emoji: 🚀
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+app_port: 8501
+pinned: false
+---
+# Evaluator-core: AI-Assisted Evaluation MVP
+## Overview
+Evaluator-core is a Streamlit-based AI-assisted evaluation MVP that ingests multiple submission artefacts into a unified evidence layer, retrieves evidence from Milvus, and generates structured JSON evaluation output.
+This MVP is designed around the assignment goal of building an evidence-backed evaluator rather than a generic chatbot.
+## What It Supports
+- `DOCUMENT` uploads: `.txt`, `.md`, `.pdf`, `.pptx`
+- `CODE` uploads: common source/config text files such as `.py`, `.js`, `.ts`, `.tsx`, `.java`, `.go`, `.html`, `.css`, `.json`, `.yaml`, `.sql`, and others configured in the uploader
+- `URL` ingestion: extracts page/article text
+- `VIDEO` ingestion: YouTube link download plus Whisper transcription
+All uploaded artefacts for one username are stored in a single Milvus collection and evaluated together.
+## Current MVP Features
+- Unified ingestion across multiple artefact types
+- Single project collection per user
+- Source metadata attached to stored chunks
+- Source inventory shown before evaluation
+- Retrieval-backed evaluation over all uploaded evidence
+- Claim extraction with `supported | partial | uncertain`
+- Rubric-based scoring with:
+  - `Problem Understanding`
+  - `Technical Approach`
+  - `Implementation Quality`
+  - `Innovation / Originality`
+  - `Communication & Demo Clarity`
+  - `Claim vs Reality Alignment`
+  - `Prototype Functionality`
+- Structured JSON output
+## Architecture
+1. Artefacts are uploaded or linked through the Streamlit UI.
+2. Text is extracted and chunked by source type.
+3. Chunks are embedded with Hugging Face embeddings.
+4. Embeddings and metadata are stored in Milvus.
+5. Evaluation retrieves relevant evidence from the unified collection.
+6. A Hugging Face-hosted LLM generates structured JSON grounded in retrieved evidence.
+## Setup
+### Prerequisites
+- Python environment
+- Docker Desktop
+- Hugging Face token with inference access
+### Install
+```powershell
+conda activate nitish_sutra
+cd Evaluator-core
+python -m pip install -r requirements.txt
+```
+### Environment
+Create a `.env` file in the project root with:
+```env
+HF_TOKEN=your_huggingface_token_here
+```
+### Start Milvus
+Milvus can be started using the included Docker Compose file:
+```powershell
+docker compose -f docker-compose.yml up -d
+```
+### Run the App
+```powershell
+streamlit run app.py
+```
+## How To Use
+1. Log in with a username.
+2. Upload evidence under `DOCUMENT`, `CODE`, `URL`, and/or `VIDEO`.
+3. Open `Evaluate`.
+4. Review the source inventory.
+5. Run evaluation and inspect the JSON output.
+## Output Shape
+The evaluator currently returns JSON with sections such as:
+- `project_summary`
+- `sources_used`
+- `claims_detected`
+- `capabilities_detected`
+- `evidence`
+- `gaps_or_risks`
+- `scores`
+- `overall_assessment`
+## Tradeoffs
+- Uses a single-file Streamlit implementation for speed.
+- Uses prompt-based evidence synthesis rather than a separate deterministic scoring engine.
+- URL ingestion currently extracts text but does not yet perform browser-based prototype validation.
+- Code ingestion currently works on uploaded files rather than full repository crawl/zip ingestion.
+## Known Gaps
+- No live browser automation for working app validation yet
+- No explicit artefact-vs-artefact mismatch engine beyond prompt-guided claim validation
+- Code chunking is text-based, not AST-aware
+- No exported evaluation history or submission archive yet
+## Deliverable Framing
+For the assignment, this should be presented as:
+- a working MVP of the evidence layer
+- a unified multi-source evaluator
+- an intentionally scoped prototype with clear next steps for URL validation and stronger cross-artifact checking
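
The Known Gaps list notes that code chunking is text-based, not AST-aware. For Python files only, a syntax-aware alternative could be sketched with the standard-library `ast` module, as below. The function name `chunk_python_source` and the locator format are hypothetical. The sketch emits one chunk per top-level function or class, and it ignores module-level statements and decorator lines.

```python
import ast


def chunk_python_source(source: str, file_name: str):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
            start, end = node.lineno, node.end_lineno
            chunks.append({
                "text": "\n".join(lines[start - 1:end]),
                # Hypothetical locator format: file, symbol name, line range.
                "locator": f"{file_name}:{node.name}:L{start}-L{end}",
            })
    return chunks
```

Each chunk carries a symbol-level locator, so a citation could point at a function rather than at the whole file. Other languages would need their own parsers.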

app.py ADDED

	@@ -0,0 +1,551 @@

+import os
+import json
+import logging
+from dotenv import load_dotenv
+from PyPDF2 import PdfReader
+from pptx import Presentation
+from langchain.text_splitter import CharacterTextSplitter
+from goose3 import Goose
+import streamlit as st
+import whisper
+from pytube import YouTube
+from moviepy import VideoFileClip
+import time
+from langchain_community.vectorstores import Milvus
+from pymilvus import Collection, connections, utility
+from huggingface_hub import InferenceClient
+from prompts import build_evaluation_prompt
+EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+CHAT_MODEL = "deepseek-ai/DeepSeek-V3.2:novita"
+MILVUS_CONFIG = {"host": "localhost", "port": "19530"}
+DOCUMENT_CHUNK_SIZE = 1000
+PDF_CHUNK_SIZE = 2500
+PPTX_CHUNK_SIZE = 1800
+CODE_CHUNK_SIZE = 1200
+URL_CHUNK_SIZE = 1500
+VIDEO_CHUNK_SIZE = 1000
+CHUNK_OVERLAP = 150
+CODE_FILE_TYPES = [
+    "py", "js", "ts", "jsx", "tsx", "java", "c", "cpp", "cs", "go", "rs",
+    "php", "rb", "html", "css", "scss", "json", "yaml", "yml", "toml",
+    "ini", "sh", "sql", "xml"
+]
+load_dotenv()
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s [%(levelname)s] %(message)s"
+)
+connections.connect(alias="default", **MILVUS_CONFIG)
+HF_TOKEN = os.getenv("HF_TOKEN")
+def get_embeddings():
+    client = InferenceClient(api_key=HF_TOKEN)
+    def embed_documents(texts):
+        result = client.feature_extraction(texts, model=EMBEDDING_MODEL)
+        if isinstance(result, dict):
+            raise ValueError(f"Embedding API error: {result}")
+        return result
+    def embed_query(text):
+        result = client.feature_extraction(text, model=EMBEDDING_MODEL)
+        if isinstance(result, dict):
+            raise ValueError(f"Embedding API error: {result}")
+        return result
+    return type(
+        "EmbeddingAdapter",
+        (),
+        {
+            "embed_documents": staticmethod(embed_documents),
+            "embed_query": staticmethod(embed_query),
+        },
+    )()
+def run_llm(prompt):
+    client = InferenceClient(api_key=HF_TOKEN)
+    completion = client.chat.completions.create(
+        model=CHAT_MODEL,
+        messages=[
+            {
+                "role": "system",
+                "content": "Answer only from the given context. Be concise and accurate."
+            },
+            {
+                "role": "user",
+                "content": prompt
+            }
+        ],
+    )
+    return completion.choices[0].message.content
+def login():
+    st.title("🔐 Login")
+    user = st.text_input("Enter username")
+    if st.button("Login"):
+        if user:
+            st.session_state["user_id"] = user.strip().lower()
+            logging.info(f"Logged in as {st.session_state['user_id']}")
+            st.success(f"Logged in as {user}")
+            st.rerun()
+        else:
+            st.error("Enter username")
+def build_chunks(texts, metadatas, chunk_size):
+    if not texts:
+        return [], []
+    documents = CharacterTextSplitter(
+        separator="\n",
+        chunk_size=chunk_size,
+        chunk_overlap=CHUNK_OVERLAP
+    ).create_documents(texts, metadatas)
+    return [doc.page_content for doc in documents], [doc.metadata for doc in documents]
+def save_source_texts(user_id, source_type, source_name, texts, locators, chunk_size):
+    metadatas = [
+        {
+            "source_type": source_type,
+            "source_name": source_name,
+            "locator": locator
+        }
+        for locator in locators
+    ]
+    chunks, metadatas = build_chunks(texts, metadatas, chunk_size)
+    if not chunks:
+        st.warning("No readable content was extracted from this source.")
+        return
+    process.success("Chunking done")
+    logging.info(
+        f"Chunking complete for {source_type} source '{source_name}' with {len(chunks)} chunks"
+    )
+    collection_name = f"multigpt_{user_id}"
+    logging.info(f"Storing {len(chunks)} chunks in collection '{collection_name}'")
+    Milvus.from_texts(
+        chunks,
+        metadatas=metadatas,
+        embedding=get_embeddings(),
+        collection_name=collection_name,
+        connection_args=MILVUS_CONFIG
+    )
+    logging.info("Upload completed successfully")
+    process.success("Uploaded")
+def ingest_text_document(file):
+    user_id = st.session_state["user_id"]
+    logging.info(f"Reading text file '{file.name}'")
+    text = file.read().decode("utf-8", errors="ignore")
+    save_source_texts(user_id, "text", file.name, [text], [""], DOCUMENT_CHUNK_SIZE)
+def ingest_pdf_document(file):
+    user_id = st.session_state["user_id"]
+    logging.info(f"Reading PDF '{file.name}'")
+    reader = PdfReader(file)
+    texts = []
+    locators = []
+    for index, page in enumerate(reader.pages, start=1):
+        page_text = page.extract_text() or ""
+        if page_text.strip():
+            texts.append(page_text)
+            locators.append(f"page={index}")
+    save_source_texts(user_id, "pdf", file.name, texts, locators, PDF_CHUNK_SIZE)
+def ingest_pptx_document(file):
+    user_id = st.session_state["user_id"]
+    logging.info(f"Reading PPTX '{file.name}'")
+    presentation = Presentation(file)
+    texts = []
+    locators = []
+    for index, slide in enumerate(presentation.slides, start=1):
+        slide_parts = []
+        for shape in slide.shapes:
+            if hasattr(shape, "text") and shape.text:
+                slide_parts.append(shape.text)
+        slide_text = "\n".join(part.strip() for part in slide_parts if part.strip())
+        if slide_text:
+            texts.append(slide_text)
+            locators.append(f"slide={index}")
+    save_source_texts(user_id, "pptx", file.name, texts, locators, PPTX_CHUNK_SIZE)
+def ingest_code_files(files):
+    user_id = st.session_state["user_id"]
+    for file in files:
+        logging.info(f"Reading code file '{file.name}'")
+        text = file.read().decode("utf-8", errors="ignore")
+        save_source_texts(user_id, "code", file.name, [text], [file.name], CODE_CHUNK_SIZE)
+def ingest_url(url):
+    user_id = st.session_state["user_id"]
+    logging.info(f"Fetching URL '{url}'")
+    g = Goose()
+    text = g.extract(url=url).cleaned_text
+    save_source_texts(user_id, "url", url, [text], [url], URL_CHUNK_SIZE)
+def ingest_youtube_video(link):
+    user_id = st.session_state["user_id"]
+    logging.info(f"Starting video ingestion for '{link}'")
+    yt = YouTube(link).streams.get_highest_resolution()
+    yt.download(filename="video.mp4")
+    process.success("Downloading video")
+    logging.info("Video download completed")
+    while not os.path.exists("video.mp4"):
+        time.sleep(5)
+    video = VideoFileClip("video.mp4")
+    process.warning("Extracting audio")
+    logging.info("Extracting audio from video")
+    audio = video.audio
+    audio.write_audiofile("audio.mp3")
+    process.warning("Transcribing")
+    logging.info("Running Whisper transcription")
+    model = whisper.load_model("base")
+    result = model.transcribe("audio.mp3")
+    save_source_texts(user_id, "video", link, [result["text"]], [link], VIDEO_CHUNK_SIZE)
+def get_vector_store(collection_name):
+    return Milvus(
+        embedding_function=get_embeddings(),
+        collection_name=collection_name,
+        connection_args=MILVUS_CONFIG
+    )
+def collection_has_data(collection_name):
+    if not utility.has_collection(collection_name):
+        return False
+    return get_vector_store(collection_name).col.num_entities > 0
+def get_source_inventory(collection_name):
+    if not utility.has_collection(collection_name):
+        return []
+    collection = Collection(collection_name)
+    collection.load()
+    rows = collection.query(
+        expr="pk >= 0",
+        output_fields=["source_type", "source_name", "locator"]
+    )
+    summary = {}
+    for row in rows:
+        key = (row.get("source_type", "unknown"), row.get("source_name", "unknown"))
+        if key not in summary:
+            summary[key] = {
+                "source_type": key[0],
+                "source_name": key[1],
+                "chunks": 0,
+                "locators": set()
+            }
+        summary[key]["chunks"] += 1
+        if row.get("locator"):
+            summary[key]["locators"].add(row["locator"])
+    inventory = []
+    for item in summary.values():
+        inventory.append(
+            {
+                "source_type": item["source_type"],
+                "source_name": item["source_name"],
+                "chunks": item["chunks"],
+                "locators": sorted(item["locators"]) if item["locators"] else []
+            }
+        )
+    return sorted(inventory, key=lambda item: (item["source_type"], item["source_name"]))
+def render_evidence_inventory():
+    user_id = st.session_state["user_id"]
+    collection_name = f"multigpt_{user_id}"
+    st.subheader("Evidence Inventory")
+    if not utility.has_collection(collection_name):
+        logging.info(f"No collection found yet for '{collection_name}'")
+        st.info("No project data has been uploaded for this user yet.")
+        return
+    inventory = get_source_inventory(collection_name)
+    total_chunks = sum(item["chunks"] for item in inventory)
+    logging.info(
+        f"Loaded inventory for '{collection_name}' with {len(inventory)} sources and {total_chunks} chunks"
+    )
+    st.caption(f"{len(inventory)} sources indexed across {total_chunks} chunks")
+    if not inventory:
+        st.info("The collection exists, but no source records were found.")
+        return
+    table_rows = []
+    for item in inventory:
+        table_rows.append(
+            {
+                "Type": item["source_type"].upper(),
+                "Source": item["source_name"],
+                "Chunks": item["chunks"],
+                "Locators": len(item["locators"])
+            }
+        )
+    st.table(table_rows)
+def format_context(documents):
+    entries = []
+    for index, doc in enumerate(documents, start=1):
+        metadata = doc.metadata or {}
+        source_type = metadata.get("source_type", "unknown")
+        source_name = metadata.get("source_name", "unknown")
+        locator_text = metadata.get("locator", "unknown")
+        entries.append(
+            f"[Evidence {index}] source_type={source_type}; "
+            f"source_name={source_name}; locator={locator_text}\n"
+            f"{doc.page_content}"
+        )
+    return "\n\n".join(entries)
+def get_rubric_criteria():
+    return [
+        "Problem Understanding",
+        "Technical Approach",
+        "Implementation Quality",
+        "Innovation / Originality",
+        "Communication & Demo Clarity",
+        "Claim vs Reality Alignment",
+        "Prototype Functionality"
+    ]
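+def parse_json_response(raw_response):
+    """Parse the model reply as JSON, tolerating text around the object."""
+    try:
+        return json.loads(raw_response)
+    except json.JSONDecodeError:
+        # Fall back to the outermost {...} span, e.g. when the model wraps
+        # the JSON in a code fence or adds commentary around it.
+        start = raw_response.find("{")
+        end = raw_response.rfind("}")
+        if start != -1 and end != -1 and end > start:
+            return json.loads(raw_response[start:end + 1])
+        raise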
+def normalize_evaluation_response(data):
+    defaults = {
+        "project_summary": {
+            "purpose": "",
+            "high_level_description": ""
+        },
+        "sources_used": [],
+        "claims_detected": [],
+        "capabilities_detected": [],
+        "evidence": [],
+        "gaps_or_risks": [],
+        "scores": [],
+        "overall_assessment": {
+            "verdict": "",
+            "confidence": "low",
+            "reason": ""
+        }
+    }
+    if not isinstance(data, dict):
+        return defaults
+    normalized = defaults.copy()
+    normalized.update({key: value for key, value in data.items() if key in normalized})
+    if not isinstance(normalized["project_summary"], dict):
+        normalized["project_summary"] = defaults["project_summary"]
+    else:
+        normalized["project_summary"] = {
+            "purpose": normalized["project_summary"].get("purpose", ""),
+            "high_level_description": normalized["project_summary"].get("high_level_description", "")
+        }
+    if not isinstance(normalized["overall_assessment"], dict):
+        normalized["overall_assessment"] = defaults["overall_assessment"]
+    else:
+        normalized["overall_assessment"] = {
+            "verdict": normalized["overall_assessment"].get("verdict", ""),
+            "confidence": normalized["overall_assessment"].get("confidence", "low"),
+            "reason": normalized["overall_assessment"].get("reason", "")
+        }
+    for key in ["sources_used", "claims_detected", "capabilities_detected", "evidence", "gaps_or_risks", "scores"]:
+        if not isinstance(normalized[key], list):
+            normalized[key] = []
+    score_lookup = {}
+    for item in normalized["scores"]:
+        if not isinstance(item, dict):
+            continue
+        criterion = item.get("criterion")
+        if criterion:
+            score_lookup[criterion] = {
+                "criterion": criterion,
+                "score": max(1, min(5, int(item.get("score", 1)))) if str(item.get("score", "")).isdigit() else 1,
+                "reasoning": item.get("reasoning", ""),
+                "citations": item.get("citations", []) if isinstance(item.get("citations", []), list) else [],
+                "confidence": max(0.0, min(1.0, float(item.get("confidence", 0.0)))) if isinstance(item.get("confidence", 0.0), (int, float)) else 0.0
+            }
+    normalized["scores"] = []
+    for criterion in get_rubric_criteria():
+        normalized["scores"].append(
+            score_lookup.get(
+                criterion,
+                {
+                    "criterion": criterion,
+                    "score": 1,
+                    "reasoning": "",
+                    "citations": [],
+                    "confidence": 0.0
+                }
+            )
+        )
+    return normalized
+def run_evaluation():
+    user_id = st.session_state["user_id"]
+    collection_name = f"multigpt_{user_id}"
+    logging.info(f"Starting evaluation for collection '{collection_name}'")
+    if not collection_has_data(collection_name):
+        logging.info("Evaluation skipped because no uploaded project data was found")
+        st.warning("No uploaded project data found for this user yet.")
+        return
+    process.warning("Retrieving project evidence")
+    logging.info("Retrieving project evidence from Milvus")
+    db = get_vector_store(collection_name)
+    documents = db.similarity_search(
+        "Evaluate this software project using all available uploaded evidence. "
+        "Summarize capabilities, evidence, gaps, and overall assessment.",
+        k=16
+    )
+    if not documents:
+        logging.info("Evaluation stopped because no retrievable evidence was found")
+        st.warning("No retrievable evidence was found for evaluation.")
+        return
+    prompt = build_evaluation_prompt(format_context(documents), get_rubric_criteria())
+    process.warning("Running evaluation")
+    logging.info(f"Running evaluator on {len(documents)} retrieved evidence chunks")
+    raw_response = run_llm(prompt)
+    try:
+        parsed_response = normalize_evaluation_response(parse_json_response(raw_response))
+    except json.JSONDecodeError:
+        logging.warning("Model response was not valid JSON")
+        st.error("The model response was not valid JSON.")
+        st.code(raw_response, language="json")
+        return
+    logging.info("Evaluation completed successfully")
+    process.success("Evaluation ready")
+    st.json(parsed_response)
+def add_evidence_page():
+    placeholder.title("Add Evidence")
+    choice = st.sidebar.radio("Evidence Type", ['', 'DOCUMENT', 'CODE', 'URL', 'VIDEO'])
+    if choice == 'DOCUMENT':
+        st.caption("Upload decks, notes, specs, or README-style documents.")
+        file = st.file_uploader("Upload document", type=["txt", "md", "pdf", "pptx"])
+        if file:
+            extension = os.path.splitext(file.name)[1].lower()
+            if extension in [".txt", ".md"]:
+                ingest_text_document(file)
+            elif extension == ".pdf":
+                ingest_pdf_document(file)
+            elif extension == ".pptx":
+                ingest_pptx_document(file)
+            else:
+                st.error("Unsupported document type.")
+    elif choice == 'CODE':
+        st.caption("Upload source or configuration files that represent the implementation.")
+        files = st.file_uploader(
+            "Upload code files",
+            type=CODE_FILE_TYPES,
+            accept_multiple_files=True
+        )
+        if files:
+            ingest_code_files(files)
+    elif choice == 'URL':
+        st.caption("Add a product page, documentation page, or prototype URL.")
+        url = st.text_input("Enter URL")
+        if url:
+            ingest_url(url)
+    elif choice == 'VIDEO':
+        st.caption("Add a YouTube demo or walkthrough link.")
+        link = st.text_input("YouTube link")
+        if link:
+            ingest_youtube_video(link)
+def evaluate_page():
+    placeholder.title("Run Evaluation")
+    st.write("Generate a structured evaluation using all uploaded evidence for this submission.")
+    render_evidence_inventory()
+    if st.button("Run Evaluation"):
+        run_evaluation()
+def main():
+    global placeholder, process
+    placeholder = st.empty()
+    process = st.empty()
+    if "user_id" not in st.session_state:
+        login()
+        return
+    st.sidebar.write(f"👤 {st.session_state['user_id']}")
+    page = st.sidebar.radio("Navigate", ['Add Evidence', 'Evaluate', 'Logout'])
+    if page == "Add Evidence":
+        add_evidence_page()
+    elif page == "Evaluate":
+        evaluate_page()
+    elif page == "Logout":
+        logging.info("Logging out and clearing session")
+        st.session_state.clear()
+        st.rerun()
+if __name__ == "__main__":
+    main()
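
In `normalize_evaluation_response`, the score is accepted only when `str(score).isdigit()` is true, so a reply such as `"4.0"` or `4.5` falls back to 1. A more tolerant coercion is sketched below. The helper name `clamp_score` is hypothetical, and rounding fractional scores to the nearest integer is an assumption rather than the behaviour of this commit.

```python
import math


def clamp_score(value, low=1, high=5):
    """Coerce a model-supplied score to an int in [low, high]; fall back to low."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        # None, lists, or non-numeric strings such as "high".
        return low
    if not math.isfinite(number):
        # NaN and infinities cannot be rounded to a meaningful score.
        return low
    return max(low, min(high, int(round(number))))
```

For example, `clamp_score("4.0")` gives 4, `clamp_score("9")` is capped at 5, and `clamp_score(None)` falls back to 1.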

description.text ADDED

	@@ -0,0 +1,91 @@

+# Evaluator-core System Description
+## 1. Overview
+Evaluator-core is a lightweight AI-assisted evaluation MVP built with:
+- Streamlit
+- Hugging Face Inference APIs
+- Milvus
+- Whisper
+The system is designed to ingest multiple submission artefacts, store them in a shared evidence layer, and generate a structured evaluation output grounded in retrieved evidence.
+## 2. Current Goal
+The current MVP aims to:
+> ingest multiple artefacts, build a unified submission context, and return evidence-backed evaluation JSON
+## 3. Supported Inputs
+The current system supports:
+1. Documents
+   - `.txt`
+   - `.md`
+   - `.pdf`
+   - `.pptx`
+2. Code files
+3. URLs
+4. YouTube demo videos
+All artefacts uploaded under one username are stored in a single Milvus collection and evaluated together.
+## 4. Core Flow
+1. User logs in with a username.
+2. Artefacts are uploaded or linked through the UI.
+3. Text is extracted from each artefact.
+4. Extracted text is chunked and embedded.
+5. Chunks are stored in Milvus with source metadata.
+6. Evaluation retrieves evidence from the unified collection.
+7. A Hugging Face-hosted model returns structured JSON.
+## 5. What The Evaluator Produces
+The current output includes:
+- `project_summary`
+- `sources_used`
+- `claims_detected`
+- `capabilities_detected`
+- `evidence`
+- `gaps_or_risks`
+- `scores`
+- `overall_assessment`
+The scoring rubric currently includes:
+- Problem Understanding
+- Technical Approach
+- Implementation Quality
+- Innovation / Originality
+- Communication & Demo Clarity
+- Claim vs Reality Alignment
+- Prototype Functionality
+## 6. Current Strengths
+- Unified evidence storage across source types
+- Retrieval-backed evaluation
+- Structured JSON output
+- Basic claim extraction
+- Rubric-based scoring
+- Source inventory before evaluation
+## 7. Current Limitations
+- Prototype URL validation is still text-based, not interaction-based
+- Claim validation is prompt-driven, not a dedicated cross-artifact engine
+- Code ingestion is file-upload based, not full repository ingestion
+- Code chunking is still text-based rather than syntax-aware
+- Scores and confidence are model-generated rather than calibrated
+## 8. Architecture Direction
+This MVP is no longer a source-specific chatbot. It is now closer to an evidence-layer evaluator:
+> multi-source ingestion -> shared vector store -> retrieved evidence -> structured evaluation
+That makes it a practical early version of the assignment’s intended system, while still leaving prototype validation and stronger cross-checking as future work.
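
Section 3 says all artefacts for one username go into a single Milvus collection, and `app.py` builds that name as `multigpt_` plus the lower-cased username. The sketch below assumes Milvus restricts collection names to letters, digits, and underscores; under that assumption a username containing spaces, dots, or hyphens would be rejected. The helper name `safe_collection_name` is hypothetical.

```python
import re


def safe_collection_name(username: str, prefix: str = "multigpt_") -> str:
    """Build a collection name using only letters, digits, and underscores."""
    # Replace every character outside the assumed allowed set with "_".
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", username.strip().lower())
    if not cleaned:
        raise ValueError("username has no usable characters")
    return f"{prefix}{cleaned}"
```

Distinct usernames such as `a-b` and `a_b` map to the same collection under this scheme, so a real deployment would probably want a stable hash or an explicit user ID instead.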

description.txt ADDED

	@@ -0,0 +1,91 @@

+# Evaluator-core System Description
+## 1. Overview
+Evaluator-core is a lightweight AI-assisted evaluation MVP built with:
+- Streamlit
+- Hugging Face Inference APIs
+- Milvus
+- Whisper
+The system is designed to ingest multiple submission artefacts, store them in a shared evidence layer, and generate a structured evaluation output grounded in retrieved evidence.
+## 2. Current Goal
+The current MVP aims to:
+> ingest multiple artefacts, build a unified submission context, and return evidence-backed evaluation JSON
+## 3. Supported Inputs
+The current system supports:
+1. Documents
+   - `.txt`
+   - `.md`
+   - `.pdf`
+   - `.pptx`
+2. Code files
+3. URLs
+4. YouTube demo videos
+All artefacts uploaded under one username are stored in a single Milvus collection and evaluated together.
+## 4. Core Flow
+1. User logs in with a username.
+2. Artefacts are uploaded or linked through the UI.
+3. Text is extracted from each artefact.
+4. Extracted text is chunked and embedded.
+5. Chunks are stored in Milvus with source metadata.
+6. Evaluation retrieves evidence from the unified collection.
+7. A Hugging Face-hosted model returns structured JSON.
+## 5. What The Evaluator Produces
+The current output includes:
+- `project_summary`
+- `sources_used`
+- `claims_detected`
+- `capabilities_detected`
+- `evidence`
+- `gaps_or_risks`
+- `scores`
+- `overall_assessment`
+The scoring rubric currently includes:
+- Problem Understanding
+- Technical Approach
+- Implementation Quality
+- Innovation / Originality
+- Communication & Demo Clarity
+- Claim vs Reality Alignment
+- Prototype Functionality
+## 6. Current Strengths
+- Unified evidence storage across source types
+- Retrieval-backed evaluation
+- Structured JSON output
+- Basic claim extraction
+- Rubric-based scoring
+- Source inventory before evaluation
+## 7. Current Limitations
+- Prototype URL validation is still text-based, not interaction-based
+- Claim validation is prompt-driven, not a dedicated cross-artifact engine
+- Code ingestion is file-upload based, not full repository ingestion
+- Code chunking is still text-based rather than syntax-aware
+- Scores and confidence are model-generated rather than calibrated
+## 8. Architecture Direction
+This MVP is no longer a source-specific chatbot. It is now closer to an evidence-layer evaluator:
+> multi-source ingestion -> shared vector store -> retrieved evidence -> structured evaluation
+That makes it a practical early version of the assignment’s intended system, while still leaving prototype validation and stronger cross-checking as future work.
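
Steps 3 to 5 of the core flow in description.txt (extract, chunk, store with source metadata) can be sketched as follows. This is a minimal illustration and does not reproduce app.py: the `Chunk` dataclass, the `chunk_text` function, and the `chunk_size` and `overlap` defaults are assumptions chosen for the example, and the embedding call and the Milvus insert are left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One piece of extracted text plus the metadata needed to cite it later."""
    text: str
    source_name: str   # e.g. "description.txt" (hypothetical field name)
    source_type: str   # e.g. "text", "code", "url", "video" (hypothetical values)
    index: int         # position of the chunk within its source


def chunk_text(text, source_name, source_type, chunk_size=500, overlap=50):
    """Split text into fixed-width character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(text[start:start + chunk_size], source_name, source_type, index))
        # Stop once a window has reached the end of the text, so no
        # trailing chunk consists only of already-covered overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "Evaluator-core ingests artefacts and stores them in Milvus. " * 30
for chunk in chunk_text(sample, "description.txt", "text"):
    print(chunk.index, chunk.source_name, len(chunk.text))
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk. A syntax-aware splitter for code, which the limitations list names as missing, would replace the fixed-width window rather than extend it.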

docker-compose.yml ADDED

	@@ -0,0 +1,60 @@

+version: '3.5'
+services:
+  etcd:
+    container_name: milvus-etcd
+    image: quay.io/coreos/etcd:v3.5.5
+    environment:
+      - ETCD_AUTO_COMPACTION_MODE=revision
+      - ETCD_AUTO_COMPACTION_RETENTION=1000
+      - ETCD_QUOTA_BACKEND_BYTES=4294967296
+      - ETCD_SNAPSHOT_COUNT=50000
+    volumes:
+      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
+    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
+    healthcheck:
+      test: ["CMD", "etcdctl", "endpoint", "health"]
+      interval: 30s
+      timeout: 20s
+      retries: 3
+  minio:
+    container_name: milvus-minio
+    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
+    environment:
+      MINIO_ACCESS_KEY: minioadmin
+      MINIO_SECRET_KEY: minioadmin
+    volumes:
+      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
+    command: minio server /minio_data
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
+      interval: 30s
+      timeout: 20s
+      retries: 3
+  standalone:
+    container_name: milvus-standalone
+    image: milvusdb/milvus:v2.2.16
+    command: ["milvus", "run", "standalone"]
+    environment:
+      ETCD_ENDPOINTS: etcd:2379
+      MINIO_ADDRESS: minio:9000
+    volumes:
+      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
+      interval: 30s
+      start_period: 90s
+      timeout: 20s
+      retries: 3
+    ports:
+      - "19530:19530"
+      - "9091:9091"
+    depends_on:
+      - "etcd"
+      - "minio"
+networks:
+  default:
+    name: milvus
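
The compose file publishes port 9091 of the `standalone` service and uses `/healthz` on that port for the container healthcheck, with a 90 second start period. A client that starts alongside the stack may want to wait on the same endpoint before connecting. The sketch below is one hedged way to do that with the standard library only; `http_ok`, `wait_until_healthy`, and the retry defaults are hypothetical, and the `probe` and `sleep` parameters exist only so the loop can be exercised without a running container.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or malformed reply.
        return False


def wait_until_healthy(url="http://localhost:9091/healthz", attempts=10,
                       delay=3.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` after `docker compose up -d` would return `True` once Milvus reports healthy, or `False` after the attempts run out; the default URL assumes the port mapping shown in the file above.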

output.json ADDED

	@@ -0,0 +1,244 @@

+{
+    "project_summary": {
+      "purpose": "",
+      "high_level_description": ""
+    },
+    "sources_used": [
+      {
+        "source_type": "text",
+        "source_name": "description.txt",
+        "notes": ""
+      },
+      {
+        "source_type": "code",
+        "source_name": "app.py",
+        "notes": ""
+      }
+    ],
+    "claims_detected": [],
+    "capabilities_detected": [
+      {
+        "capability": "Supports multiple artefact types: Documents (.txt, .md, .pdf, .pptx), Code files, URLs, YouTube demo videos",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 3"
+        ]
+      },
+      {
+        "capability": "Text is extracted from artefacts, chunked and embedded",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 1",
+          "Evidence 3"
+        ]
+      },
+      {
+        "capability": "Chunks with metadata are stored in Milvus",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 1"
+        ]
+      },
+      {
+        "capability": "Evaluates based on retrieved evidence from a unified collection",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 1"
+        ]
+      },
+      {
+        "capability": "Generates structured JSON output including project_summary, sources_used, claims_detected, capabilities_detected, evidence, gaps_or_risks, scores, overall_assessment",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 1"
+        ]
+      },
+      {
+        "capability": "Evaluation uses a Hugging Face-hosted model",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 1"
+        ]
+      },
+      {
+        "capability": "Provides a source inventory before evaluation",
+        "status": "supported",
+        "evidence_refs": [
+          "Evidence 9"
+        ]
+      }
+    ],
+    "evidence": [
+      {
+        "claim_or_observation": "The system is a lightweight AI-assisted evaluation MVP built with Streamlit, Hugging Face Inference APIs, Milvus, Whisper",
+        "support_level": "supported",
+        "evidence_refs": [
+          "Evidence 3"
+        ]
+      },
+      {
+        "claim_or_observation": "Current MVP aims to ingest multiple artefacts, build a unified submission context, and return evidence-backed evaluation JSON",
+        "support_level": "supported",
+        "evidence_refs": [
+          "Evidence 3"
+        ]
+      },
+      {
+        "claim_or_observation": "Artefacts uploaded under one username are stored in a single Milvus collection",
+        "support_level": "supported",
+        "evidence_refs": [
+          "Evidence 3"
+        ]
+      },
+      {
+        "claim_or_observation": "Current scoring rubric includes Problem Understanding, Technical Approach, Implementation Quality, Innovation / Originality, Communication & Demo Clarity, Claim vs Reality Alignment, Prototype Functionality",
+        "support_level": "supported",
+        "evidence_refs": [
+          "Evidence 1",
+          "Evidence 6"
+        ]
+      },
+      {
+        "claim_or_observation": "Current strengths include Unified evidence storage across source types, Retrieval-backed evaluation, Structured JSON output, Basic claim extraction, Rubric-based scoring, Source inventory before evaluation",
+        "support_level": "supported",
+        "evidence_refs": [
+          "Evidence 1",
+          "Evidence 2"
+        ]
+      },
+      {
+        "claim_or_observation": "Current limitations include Prototype URL validation is still text-based, not interaction-based, Claim validation is prompt-driven, not a dedicated cross-artifact engine, Code ingestion is file-upload based, not full repository ingestion, Code chunking is still text-based rather than syntax-aware, Scores and confidence are model-generated rather than calibrated",
+        "support_level": "supported",
+        "evidence_refs": [
+          "Evidence 2"
+        ]
+      }
+    ],
+    "gaps_or_risks": [
+      {
+        "issue": "Evaluation depends on an LLM-generated JSON response; parsing may fail if response is invalid",
+        "reason": "Code shows a try-catch block for JSONDecodeError, and the system logs and displays error if JSON invalid",
+        "evidence_refs": [
+          "Evidence 4"
+        ]
+      },
+      {
+        "issue": "No actual prototype URL validation or interaction",
+        "reason": "Limitations text states prototype URL validation is still text-based, not interaction-based",
+        "evidence_refs": [
+          "Evidence 2"
+        ]
+      },
+      {
+        "issue": "Claim validation is prompt-driven, not a dedicated cross-artifact engine",
+        "reason": "Limitations text states claim validation is prompt-driven",
+        "evidence_refs": [
+          "Evidence 2"
+        ]
+      },
+      {
+        "issue": "Code chunking is text-based, not syntax-aware",
+        "reason": "Limitations text states code chunking is still text-based rather than syntax-aware",
+        "evidence_refs": [
+          "Evidence 2"
+        ]
+      },
+      {
+        "issue": "Scores and confidence are model-generated, not calibrated",
+        "reason": "Limitations text states scores and confidence are model-generated rather than calibrated",
+        "evidence_refs": [
+          "Evidence 2"
+        ]
+      }
+    ],
+    "scores": [
+      {
+        "criterion": "Problem Understanding",
+        "score": 4,
+        "reasoning": "System architecture is described as evidence-layer evaluator with clear purpose; limitations acknowledged",
+        "citations": [
+          "Evidence 1",
+          "Evidence 2",
+          "Evidence 3"
+        ],
+        "confidence": 0.8
+      },
+      {
+        "criterion": "Technical Approach",
+        "score": 3,
+        "reasoning": "Approach uses multi-source ingestion, shared vector store, retrieval, and structured evaluation; but limitations exist in claim validation, code chunking, and prototype validation",
+        "citations": [
+          "Evidence 1",
+          "Evidence 2",
+          "Evidence 3",
+          "Evidence 4"
+        ],
+        "confidence": 0.75
+      },
+      {
+        "criterion": "Implementation Quality",
+        "score": 3,
+        "reasoning": "Code shows concrete implementation for artefact ingestion, storage, retrieval, and evaluation; supports multiple file types; but error handling and dependency on LLM JSON are present",
+        "citations": [
+          "Evidence 4",
+          "Evidence 10",
+          "Evidence 11",
+          "Evidence 12",
+          "Evidence 14"
+        ],
+        "confidence": 0.8
+      },
+      {
+        "criterion": "Innovation / Originality",
+        "score": 2,
+        "reasoning": "Unified evidence storage and retrieval-backed evaluation are strengths; however, the approach is described as an MVP and lacks sophisticated validation",
+        "citations": [
+          "Evidence 1",
+          "Evidence 2",
+          "Evidence 8"
+        ],
+        "confidence": 0.6
+      },
+      {
+        "criterion": "Communication & Demo Clarity",
+        "score": 3,
+        "reasoning": "System description and code structure are clear; strengths and limitations are documented; UI components shown (Streamlit)",
+        "citations": [
+          "Evidence 1",
+          "Evidence 2",
+          "Evidence 3",
+          "Evidence 7"
+        ],
+        "confidence": 0.7
+      },
+      {
+        "criterion": "Claim vs Reality Alignment",
+        "score": 3,
+        "reasoning": "Supported capabilities and limitations are explicitly listed, aligning with implementation; claim validation noted as prompt-driven",
+        "citations": [
+          "Evidence 1",
+          "Evidence 2",
+          "Evidence 3",
+          "Evidence 9"
+        ],
+        "confidence": 0.8
+      },
+      {
+        "criterion": "Prototype Functionality",
+        "score": 2,
+        "reasoning": "Evidence shows a working system for artefact ingestion, storage, retrieval, and structured evaluation; but limitations indicate lack of interactive prototype validation and reliance on text-based URL processing",
+        "citations": [
+          "Evidence 2",
+          "Evidence 4",
+          "Evidence 5",
+          "Evidence 7"
+        ],
+        "confidence": 0.7
+      }
+    ],
+    "overall_assessment": {
+      "verdict": "The project is a functional MVP for evidence-backed software project evaluation using multi-source ingestion and retrieval, with clear strengths and acknowledged limitations.",
+      "confidence": "high",
+      "reason": "Evidence from both description and code files provides consistent and detailed support for core functionalities, flow, and current state."
+    }
+}
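
Because description.txt notes that scores and confidence are model-generated rather than calibrated, a structural check on the returned JSON may be worthwhile before it is displayed or stored. The sketch below is an assumption about how such a check could look and is not taken from app.py: `REQUIRED_KEYS` and `validate_evaluation` are hypothetical names, the key list follows the output fields listed in description.txt, and the ranges follow the rules in prompts.py (integer scores from 1 to 5, numeric confidence from 0 to 1, one score per rubric criterion).

```python
REQUIRED_KEYS = (
    "project_summary", "sources_used", "claims_detected", "capabilities_detected",
    "evidence", "gaps_or_risks", "scores", "overall_assessment",
)


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_evaluation(result, rubric_criteria):
    """Return a list of human-readable problems; an empty list means the shape is valid."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in result:
            problems.append(f"missing top-level key: {key}")

    scored = set()
    for item in result.get("scores", []):
        criterion = item.get("criterion")
        scored.add(criterion)
        score = item.get("score")
        if not (isinstance(score, int) and not isinstance(score, bool) and 1 <= score <= 5):
            problems.append(f"{criterion}: score must be an integer from 1 to 5")
        confidence = item.get("confidence")
        if not (_is_number(confidence) and 0 <= confidence <= 1):
            problems.append(f"{criterion}: confidence must be a number from 0 to 1")

    for criterion in rubric_criteria:
        if criterion not in scored:
            problems.append(f"no score for criterion: {criterion}")
    return problems
```

A check like this only confirms shape and ranges. It says nothing about whether a score is justified by the cited evidence, which remains the calibration gap the description already acknowledges.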

prompts.py ADDED

	@@ -0,0 +1,87 @@

+import json
+def build_evaluation_prompt(context, rubric_criteria):
+    rubric_json = json.dumps(rubric_criteria)
+    return f"""
+You are evaluating one software project using retrieved evidence from mixed uploaded sources.
+Use only the supplied evidence. Do not invent facts. If something is unclear, say it is uncertain.
+Extract concrete product or implementation claims when possible and label each one as supported, partial, or uncertain based only on the evidence.
+Score the submission using the rubric criteria provided below. Use retrieved evidence only.
+Return valid JSON only. No markdown, no code fences, no explanation outside the JSON.
+Use exactly this top-level structure:
+{{
+  "project_summary": {{
+    "purpose": "",
+    "high_level_description": ""
+  }},
+  "sources_used": [
+    {{
+      "source_type": "",
+      "source_name": "",
+      "notes": ""
+    }}
+  ],
+  "claims_detected": [
+    {{
+      "claim": "",
+      "status": "supported|partial|uncertain",
+      "reason": "",
+      "evidence_refs": ["Evidence 1"]
+    }}
+  ],
+  "capabilities_detected": [
+    {{
+      "capability": "",
+      "status": "supported|partial|uncertain",
+      "evidence_refs": ["Evidence 1"]
+    }}
+  ],
+  "evidence": [
+    {{
+      "claim_or_observation": "",
+      "support_level": "supported|partial|uncertain",
+      "evidence_refs": ["Evidence 1"]
+    }}
+  ],
+  "gaps_or_risks": [
+    {{
+      "issue": "",
+      "reason": "",
+      "evidence_refs": ["Evidence 1"]
+    }}
+  ],
+  "scores": [
+    {{
+      "criterion": "",
+      "score": 1,
+      "reasoning": "",
+      "citations": ["Evidence 1"],
+      "confidence": 0.5
+    }}
+  ],
+  "overall_assessment": {{
+    "verdict": "",
+    "confidence": "low|medium|high",
+    "reason": ""
+  }}
+}}
+Rules:
+- Keep claims specific and checkable.
+- Prefer 3 to 8 claims when enough evidence exists.
+- Mark a claim as "supported" only when the evidence directly backs it.
+- Mark a claim as "partial" when the evidence suggests the claim but does not fully prove it.
+- Mark a claim as "uncertain" when the claim is plausible but not verified by the retrieved evidence.
+- Every claim, capability, evidence item, and risk must include at least one evidence reference when possible.
+- Create one score item for each rubric criterion in this exact list: {rubric_json}
+- Score each criterion on an integer scale from 1 to 5.
+- `citations` must reference evidence ids such as "Evidence 1".
+- `confidence` must be a numeric value from 0 to 1.
+- If no URL or prototype evidence exists, score "Prototype Functionality" conservatively and explain the limited evidence.
+Evidence:
+{context}
+""".strip()
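
The prompt asks the model to cite passages as "Evidence 1", "Evidence 2", and so on, which only works if the `context` string labels each retrieved chunk that way. The sketch below shows one hypothetical way to build such a string; it does not reproduce what app.py does. The function name `format_evidence`, the header layout, and the chunk keys `source_name`, `source_type`, and `text` are all assumptions.

```python
def format_evidence(chunks):
    """Render retrieved chunks as numbered blocks that the model can cite by id."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        # The "Evidence N" prefix matches the id format the prompt asks for.
        header = f"Evidence {number} ({chunk['source_type']}: {chunk['source_name']})"
        blocks.append(f"{header}\n{chunk['text'].strip()}")
    return "\n\n".join(blocks)


retrieved = [
    {"source_name": "description.txt", "source_type": "text",
     "text": "Chunks are stored in Milvus with source metadata."},
    {"source_name": "app.py", "source_type": "code",
     "text": "def evaluate(): ...\n"},
]
context = format_evidence(retrieved)
print(context)
```

Passing the resulting `context` to `build_evaluation_prompt(context, rubric_criteria)` would let each `evidence_refs` or `citations` entry in the response be mapped back to a source name by its number.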

requirements.txt ADDED

	@@ -0,0 +1,13 @@

+goose3
+pydantic==1.10.12
+langchain==0.0.278
+langchain-community
+PyPDF2
+python-pptx
+python-dotenv
+streamlit
+moviepy
+pytube
+pymilvus
+huggingface_hub
+git+https://github.com/openai/whisper.git
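
Only two of the thirteen requirements above carry an exact version, and the Whisper entry installs from the repository's default branch, so a rebuild may resolve to different versions over time. The sketch below is a hypothetical helper, not part of the commit, that sorts requirement lines into exact pins, unpinned names, and VCS URLs. It deliberately handles only the simple forms that appear in this file (`name`, `name==version`, `git+URL`) and does not parse the full requirement specifier syntax.

```python
def classify_requirements(lines):
    """Group requirement lines into exact pins, unpinned names, and VCS installs."""
    report = {"pinned": [], "unpinned": [], "vcs": []}
    for raw in lines:
        line = raw.strip()
        # Skip blanks and whole-line comments; inline comments are not handled here.
        if not line or line.startswith("#"):
            continue
        if line.startswith("git+"):
            report["vcs"].append(line)
        elif "==" in line:
            report["pinned"].append(line)
        else:
            report["unpinned"].append(line)
    return report


requirements = """
goose3
pydantic==1.10.12
langchain==0.0.278
langchain-community
PyPDF2
python-pptx
python-dotenv
streamlit
moviepy
pytube
pymilvus
huggingface_hub
git+https://github.com/openai/whisper.git
""".splitlines()

for group, entries in classify_requirements(requirements).items():
    print(group, len(entries), entries)
```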