jayeshdiro committed on
Commit facefda · 0 Parent(s)

Initial commit

Files changed (12)
  1. .gitattributes +35 -0
  2. .gitignore +1 -0
  3. DESIGN_NOTE.md +50 -0
  4. Dockerfile +20 -0
  5. README.md +117 -0
  6. app.py +551 -0
  7. description.text +91 -0
  8. description.txt +91 -0
  9. docker-compose.yml +60 -0
  10. output.json +244 -0
  11. prompts.py +87 -0
  12. requirements.txt +13 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
+ .env
DESIGN_NOTE.md ADDED
@@ -0,0 +1,50 @@
+ # Design Note: AI-Assisted Evaluation MVP
+
+ ## Goal
+ Build a small AI-assisted evaluation system that can ingest multiple artefacts, create a unified understanding, cross-check claims, and produce structured scoring grounded in retrieved evidence.
+
+ ## Design Choice
+ The MVP was built as a single Streamlit app with Milvus as the evidence store. The key architectural choice was to move from source-specific chat collections to one unified collection per submission/user so that all artefacts can contribute to one evaluation context.
+
+ ## Evidence Flow
+ 1. Ingest artefacts from document, code, URL, and video sources.
+ 2. Extract text and attach source metadata.
+ 3. Chunk and embed the content.
+ 4. Store all evidence in one Milvus collection for the current username.
+ 5. Retrieve evidence from the full collection during evaluation.
+ 6. Ask the LLM to return JSON only, including summary, claims, evidence, risks, and rubric scores.
+
+ ## Why This Approach
+ - It keeps the implementation practical for a 4-5 hour assignment.
+ - It demonstrates the core evidence-layer thinking the assignment asks for.
+ - It supports multi-source reasoning without overbuilding infrastructure.
+ - It makes the output traceable to retrieved evidence snippets.
+
+ ## Current Strengths
+ - Unified evidence layer
+ - Multi-source ingestion
+ - Retrieval-backed evaluation
+ - Claim extraction with support labels
+ - Rubric-based scoring
+ - Structured JSON output
+
+ ## Current Limitations
+ - Prototype URL validation is limited to text extraction, not browser interaction.
+ - Claim cross-checking is prompt-driven, not a dedicated comparison engine.
+ - Code ingestion is file-upload based, not full repository traversal.
+ - Code chunking is character-based rather than semantic.
+ - Confidence and scoring are LLM-generated rather than calibrated.
+
+ ## Practical Tradeoffs
+ - Preferred shipping a working evaluator skeleton over building incomplete automation-heavy features.
+ - Kept the app single-file to maximize iteration speed during the assignment window.
+ - Added explicit output structure and normalization to reduce brittle LLM formatting.
+
+ ## Next Steps
+ 1. Add lightweight prototype validation for URLs.
+ 2. Add explicit `claim_validation` output with claimed-in vs supported-by mapping.
+ 3. Improve code ingestion to accept repos/zips/folders.
+ 4. Add stronger evidence citation formatting and exportable result files.
+
+ ## Summary
+ This MVP does not fully solve the end-state problem, but it establishes the correct system direction: unified evidence ingestion, retrieval-grounded evaluation, basic claim validation, and rubric scoring across multiple artefacts.
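Steps 2 and 3 of the Evidence Flow in DESIGN_NOTE.md (attach source metadata, then chunk) can be sketched without any dependencies. This is only an illustration, not the app's implementation: app.py uses LangChain's `CharacterTextSplitter`, while the helper name `chunk_with_metadata` and the fixed-width sliding window below are assumptions made for the example.

```python
def chunk_with_metadata(text, source_type, source_name, locator,
                        chunk_size=1000, overlap=150):
    """Split text into overlapping character windows, each tagged with its source."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append({
                "text": piece,
                "metadata": {
                    "source_type": source_type,
                    "source_name": source_name,
                    "locator": locator,
                },
            })
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_metadata("a" * 2500, "pdf", "deck.pdf", "page=1"):
        print(len(chunk["text"]), chunk["metadata"])
```

Because every chunk carries the same three metadata fields that app.py stores in Milvus, a retrieved snippet can be traced back to its file and page or slide.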
Dockerfile ADDED
@@ -0,0 +1,20 @@
+ FROM python:3.13.5-slim
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     curl \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt ./
+ RUN pip3 install -r requirements.txt
+
+ COPY ./ ./
+
+ EXPOSE 8501
+
+ HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+
+ ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ title: Evaluator Core
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ app_port: 8501
+ pinned: false
+ ---
+
+ # Evaluator-core: AI-Assisted Evaluation MVP
+
+ ## Overview
+ Evaluator-core is a Streamlit-based AI-assisted evaluation MVP that ingests multiple submission artefacts into a unified evidence layer, retrieves evidence from Milvus, and generates structured JSON evaluation output.
+
+ This MVP is designed around the assignment goal of building an evidence-backed evaluator rather than a generic chatbot.
+
+ ## What It Supports
+ - `DOCUMENT` uploads: `.txt`, `.md`, `.pdf`, `.pptx`
+ - `CODE` uploads: common source/config text files such as `.py`, `.js`, `.ts`, `.tsx`, `.java`, `.go`, `.html`, `.css`, `.json`, `.yaml`, `.sql`, and others configured in the uploader
+ - `URL` ingestion: extracts page/article text
+ - `VIDEO` ingestion: YouTube link download plus Whisper transcription
+
+ All uploaded artefacts for one username are stored in a single Milvus collection and evaluated together.
+
+ ## Current MVP Features
+ - Unified ingestion across multiple artefact types
+ - Single project collection per user
+ - Source metadata attached to stored chunks
+ - Source inventory shown before evaluation
+ - Retrieval-backed evaluation over all uploaded evidence
+ - Claim extraction with `supported | partial | uncertain`
+ - Rubric-based scoring with:
+   - `Problem Understanding`
+   - `Technical Approach`
+   - `Implementation Quality`
+   - `Innovation / Originality`
+   - `Communication & Demo Clarity`
+   - `Claim vs Reality Alignment`
+   - `Prototype Functionality`
+ - Structured JSON output
+
+ ## Architecture
+ 1. Artefacts are uploaded or linked through the Streamlit UI.
+ 2. Text is extracted and chunked by source type.
+ 3. Chunks are embedded with Hugging Face embeddings.
+ 4. Embeddings and metadata are stored in Milvus.
+ 5. Evaluation retrieves relevant evidence from the unified collection.
+ 6. A Hugging Face-hosted LLM generates structured JSON grounded in retrieved evidence.
+
+ ## Setup
+ ### Prerequisites
+ - Python environment
+ - Docker Desktop
+ - Hugging Face token with inference access
+
+ ### Install
+ ```powershell
+ conda activate nitish_sutra  # the author's environment; substitute your own
+ cd Evaluator-core
+ python -m pip install -r requirements.txt
+ ```
+
+ ### Environment
+ Create a `.env` file in the project root with:
+
+ ```env
+ HF_TOKEN=your_huggingface_token_here
+ ```
+
+ ### Start Milvus
+ From the project root, start Milvus with the included Docker Compose file:
+
+ ```powershell
+ docker compose -f docker-compose.yml up -d
+ ```
+
+ ### Run the App
+ ```powershell
+ streamlit run app.py
+ ```
+
+ ## How To Use
+ 1. Log in with a username.
+ 2. Upload evidence under `DOCUMENT`, `CODE`, `URL`, and/or `VIDEO`.
+ 3. Open `Evaluate`.
+ 4. Review the source inventory.
+ 5. Run evaluation and inspect the JSON output.
+
+ ## Output Shape
+ The evaluator currently returns JSON with sections such as:
+ - `project_summary`
+ - `sources_used`
+ - `claims_detected`
+ - `capabilities_detected`
+ - `evidence`
+ - `gaps_or_risks`
+ - `scores`
+ - `overall_assessment`
+
+ ## Tradeoffs
+ - Uses a single-file Streamlit implementation for speed.
+ - Uses prompt-based evidence synthesis rather than a separate deterministic scoring engine.
+ - URL ingestion currently extracts text but does not yet perform browser-based prototype validation.
+ - Code ingestion currently works on uploaded files rather than full repository crawl/zip ingestion.
+
+ ## Known Gaps
+ - No live browser automation for working app validation yet
+ - No explicit artifact-vs-artifact mismatch engine beyond prompt-guided claim validation
+ - Code chunking is text-based, not AST-aware
+ - No exported evaluation history or submission archive yet
+
+ ## Deliverable Framing
+ For the assignment, this should be presented as:
+ - a working MVP of the evidence layer
+ - a unified multi-source evaluator
+ - an intentionally scoped prototype with clear next steps for URL validation and stronger cross-artifact checking
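The "Output Shape" section of the README lists the top-level sections the evaluator returns. A consumer of that JSON might want a quick check that all of them are present. The sketch below assumes exactly the eight section names listed in the README; `EXPECTED_SECTIONS` and `missing_sections` are hypothetical names and are not part of the commit.

```python
import json

# Top-level sections named in the README's "Output Shape" list
EXPECTED_SECTIONS = [
    "project_summary",
    "sources_used",
    "claims_detected",
    "capabilities_detected",
    "evidence",
    "gaps_or_risks",
    "scores",
    "overall_assessment",
]


def missing_sections(raw_json):
    """Return the expected top-level keys that are absent from an evaluation result."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        # A non-object payload (e.g. a bare list) has none of the sections
        return list(EXPECTED_SECTIONS)
    return [key for key in EXPECTED_SECTIONS if key not in data]


if __name__ == "__main__":
    print(missing_sections('{"scores": [], "evidence": []}'))
```

An empty list means every documented section is present; it says nothing about whether the contents are well-formed, which app.py handles separately in `normalize_evaluation_response`.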
app.py ADDED
@@ -0,0 +1,551 @@
+ import os
+ import json
+ import logging
+ from dotenv import load_dotenv
+ from PyPDF2 import PdfReader
+ from pptx import Presentation
+ from langchain.text_splitter import CharacterTextSplitter
+ from goose3 import Goose
+ import streamlit as st
+ import whisper
+ from pytube import YouTube
+ from moviepy import VideoFileClip
+ import time
+
+ from langchain_community.vectorstores import Milvus
+ from pymilvus import Collection, connections, utility
+
+ from huggingface_hub import InferenceClient
+ from prompts import build_evaluation_prompt
+
+ EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+ CHAT_MODEL = "deepseek-ai/DeepSeek-V3.2:novita"
+ MILVUS_CONFIG = {"host": "localhost", "port": "19530"}
+ DOCUMENT_CHUNK_SIZE = 1000
+ PDF_CHUNK_SIZE = 2500
+ PPTX_CHUNK_SIZE = 1800
+ CODE_CHUNK_SIZE = 1200
+ URL_CHUNK_SIZE = 1500
+ VIDEO_CHUNK_SIZE = 1000
+ CHUNK_OVERLAP = 150
+ CODE_FILE_TYPES = [
+     "py", "js", "ts", "jsx", "tsx", "java", "c", "cpp", "cs", "go", "rs",
+     "php", "rb", "html", "css", "scss", "json", "yaml", "yml", "toml",
+     "ini", "sh", "sql", "xml"
+ ]
+
+ load_dotenv()
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s [%(levelname)s] %(message)s"
+ )
+
+ connections.connect(alias="default", **MILVUS_CONFIG)
+
+ HF_TOKEN = os.getenv("HF_TOKEN")
+
+
+ def get_embeddings():
+     client = InferenceClient(api_key=HF_TOKEN)
+
+     def embed_documents(texts):
+         result = client.feature_extraction(texts, model=EMBEDDING_MODEL)
+         if isinstance(result, dict):
+             raise ValueError(f"Embedding API error: {result}")
+         return result
+
+     def embed_query(text):
+         result = client.feature_extraction(text, model=EMBEDDING_MODEL)
+         if isinstance(result, dict):
+             raise ValueError(f"Embedding API error: {result}")
+         return result
+
+     return type(
+         "EmbeddingAdapter",
+         (),
+         {
+             "embed_documents": staticmethod(embed_documents),
+             "embed_query": staticmethod(embed_query),
+         },
+     )()
+
+ def run_llm(prompt):
+     client = InferenceClient(api_key=HF_TOKEN)
+     completion = client.chat.completions.create(
+         model=CHAT_MODEL,
+         messages=[
+             {
+                 "role": "system",
+                 "content": "Answer only from the given context. Be concise and accurate."
+             },
+             {
+                 "role": "user",
+                 "content": prompt
+             }
+         ],
+     )
+     return completion.choices[0].message.content
+
+ def login():
+     st.title("🔐 Login")
+
+     user = st.text_input("Enter username").strip().lower()
+
+     if st.button("Login"):
+         if user:
+             st.session_state["user_id"] = user
+             logging.info(f"Logged in as {user}")
+             st.success(f"Logged in as {user}")
+             st.rerun()
+         else:
+             st.error("Enter username")
+
+ def build_chunks(texts, metadatas, chunk_size):
+     if not texts:
+         return [], []
+
+     documents = CharacterTextSplitter(
+         separator="\n",
+         chunk_size=chunk_size,
+         chunk_overlap=CHUNK_OVERLAP
+     ).create_documents(texts, metadatas)
+     return [doc.page_content for doc in documents], [doc.metadata for doc in documents]
+
+ def save_source_texts(user_id, source_type, source_name, texts, locators, chunk_size):
+     metadatas = [
+         {
+             "source_type": source_type,
+             "source_name": source_name,
+             "locator": locator
+         }
+         for locator in locators
+     ]
+     chunks, metadatas = build_chunks(texts, metadatas, chunk_size)
+
+     if not chunks:
+         st.warning("No readable content was extracted from this source.")
+         return
+
+     process.success("Chunking done")
+     logging.info(
+         f"Chunking complete for {source_type} source '{source_name}' with {len(chunks)} chunks"
+     )
+     collection_name = f"multigpt_{user_id}"
+     logging.info(f"Storing {len(chunks)} chunks in collection '{collection_name}'")
+     Milvus.from_texts(
+         chunks,
+         metadatas=metadatas,
+         embedding=get_embeddings(),
+         collection_name=collection_name,
+         connection_args=MILVUS_CONFIG
+     )
+     logging.info("Upload completed successfully")
+     process.success("Uploaded")
+
+ def ingest_text_document(file):
+     user_id = st.session_state["user_id"]
+     logging.info(f"Reading text file '{file.name}'")
+
+     text = file.read().decode("utf-8", errors="ignore")
+     save_source_texts(user_id, "text", file.name, [text], [""], DOCUMENT_CHUNK_SIZE)
+
+ def ingest_pdf_document(file):
+     user_id = st.session_state["user_id"]
+     logging.info(f"Reading PDF '{file.name}'")
+
+     reader = PdfReader(file)
+     texts = []
+     locators = []
+
+     for index, page in enumerate(reader.pages, start=1):
+         page_text = page.extract_text() or ""
+         if page_text.strip():
+             texts.append(page_text)
+             locators.append(f"page={index}")
+
+     save_source_texts(user_id, "pdf", file.name, texts, locators, PDF_CHUNK_SIZE)
+
+ def ingest_pptx_document(file):
+     user_id = st.session_state["user_id"]
+     logging.info(f"Reading PPTX '{file.name}'")
+
+     presentation = Presentation(file)
+     texts = []
+     locators = []
+
+     for index, slide in enumerate(presentation.slides, start=1):
+         slide_parts = []
+         for shape in slide.shapes:
+             if hasattr(shape, "text") and shape.text:
+                 slide_parts.append(shape.text)
+
+         slide_text = "\n".join(part.strip() for part in slide_parts if part.strip())
+         if slide_text:
+             texts.append(slide_text)
+             locators.append(f"slide={index}")
+
+     save_source_texts(user_id, "pptx", file.name, texts, locators, PPTX_CHUNK_SIZE)
+
+ def ingest_code_files(files):
+     user_id = st.session_state["user_id"]
+
+     for file in files:
+         logging.info(f"Reading code file '{file.name}'")
+         text = file.read().decode("utf-8", errors="ignore")
+         save_source_texts(user_id, "code", file.name, [text], [file.name], CODE_CHUNK_SIZE)
+
+ def ingest_url(url):
+     user_id = st.session_state["user_id"]
+     logging.info(f"Fetching URL '{url}'")
+
+     g = Goose()
+     text = g.extract(url=url).cleaned_text
+     save_source_texts(user_id, "url", url, [text], [url], URL_CHUNK_SIZE)
+
+ def ingest_youtube_video(link):
+     user_id = st.session_state["user_id"]
+     logging.info(f"Starting video ingestion for '{link}'")
+
+     process.warning("Downloading video")
+     yt = YouTube(link).streams.get_highest_resolution()
+     yt.download(filename="video.mp4")
+
+     logging.info("Video download completed")
+
+     while not os.path.exists("video.mp4"):
+         time.sleep(5)
+
+     video = VideoFileClip("video.mp4")
+
+     process.warning("Extracting audio")
+     logging.info("Extracting audio from video")
+     audio = video.audio
+     audio.write_audiofile("audio.mp3")
+
+     process.warning("Transcribing")
+     logging.info("Running Whisper transcription")
+     model = whisper.load_model("base")
+     result = model.transcribe("audio.mp3")
+
+     save_source_texts(user_id, "video", link, [result["text"]], [link], VIDEO_CHUNK_SIZE)
+
+ def get_vector_store(collection_name):
+     return Milvus(
+         embedding_function=get_embeddings(),
+         collection_name=collection_name,
+         connection_args=MILVUS_CONFIG
+     )
+
+ def collection_has_data(collection_name):
+     if not utility.has_collection(collection_name):
+         return False
+
+     return get_vector_store(collection_name).col.num_entities > 0
+
+ def get_source_inventory(collection_name):
+     if not utility.has_collection(collection_name):
+         return []
+
+     collection = Collection(collection_name)
+     collection.load()
+     rows = collection.query(
+         expr="pk >= 0",
+         output_fields=["source_type", "source_name", "locator"]
+     )
+
+     summary = {}
+     for row in rows:
+         key = (row.get("source_type", "unknown"), row.get("source_name", "unknown"))
+         if key not in summary:
+             summary[key] = {
+                 "source_type": key[0],
+                 "source_name": key[1],
+                 "chunks": 0,
+                 "locators": set()
+             }
+
+         summary[key]["chunks"] += 1
+         if row.get("locator"):
+             summary[key]["locators"].add(row["locator"])
+
+     inventory = []
+     for item in summary.values():
+         inventory.append(
+             {
+                 "source_type": item["source_type"],
+                 "source_name": item["source_name"],
+                 "chunks": item["chunks"],
+                 "locators": sorted(item["locators"]) if item["locators"] else []
+             }
+         )
+
+     return sorted(inventory, key=lambda item: (item["source_type"], item["source_name"]))
+
+ def render_evidence_inventory():
+     user_id = st.session_state["user_id"]
+     collection_name = f"multigpt_{user_id}"
+
+     st.subheader("Evidence Inventory")
+
+     if not utility.has_collection(collection_name):
+         logging.info(f"No collection found yet for '{collection_name}'")
+         st.info("No project data has been uploaded for this user yet.")
+         return
+
+     inventory = get_source_inventory(collection_name)
+     total_chunks = sum(item["chunks"] for item in inventory)
+     logging.info(
+         f"Loaded inventory for '{collection_name}' with {len(inventory)} sources and {total_chunks} chunks"
+     )
+
+     st.caption(f"{len(inventory)} sources indexed across {total_chunks} chunks")
+
+     if not inventory:
+         st.info("The collection exists, but no source records were found.")
+         return
+
+     table_rows = []
+     for item in inventory:
+         table_rows.append(
+             {
+                 "Type": item["source_type"].upper(),
+                 "Source": item["source_name"],
+                 "Chunks": item["chunks"],
+                 "Locators": len(item["locators"])
+             }
+         )
+
+     st.table(table_rows)
+
+ def format_context(documents):
+     entries = []
+
+     for index, doc in enumerate(documents, start=1):
+         metadata = doc.metadata or {}
+         source_type = metadata.get("source_type", "unknown")
+         source_name = metadata.get("source_name", "unknown")
+         locator_text = metadata.get("locator") or "unknown"
+         entries.append(
+             f"[Evidence {index}] source_type={source_type}; "
+             f"source_name={source_name}; locator={locator_text}\n"
+             f"{doc.page_content}"
+         )
+
+     return "\n\n".join(entries)
+
+ def get_rubric_criteria():
+     return [
+         "Problem Understanding",
+         "Technical Approach",
+         "Implementation Quality",
+         "Innovation / Originality",
+         "Communication & Demo Clarity",
+         "Claim vs Reality Alignment",
+         "Prototype Functionality"
+     ]
+
+ def parse_json_response(raw_response):
+     try:
+         return json.loads(raw_response)
+     except json.JSONDecodeError:
+         start = raw_response.find("{")
+         end = raw_response.rfind("}")
+         if start != -1 and end != -1 and end > start:
+             return json.loads(raw_response[start:end + 1])
+         raise
+
+ def normalize_evaluation_response(data):
+     defaults = {
+         "project_summary": {
+             "purpose": "",
+             "high_level_description": ""
+         },
+         "sources_used": [],
+         "claims_detected": [],
+         "capabilities_detected": [],
+         "evidence": [],
+         "gaps_or_risks": [],
+         "scores": [],
+         "overall_assessment": {
+             "verdict": "",
+             "confidence": "low",
+             "reason": ""
+         }
+     }
+
+     if not isinstance(data, dict):
+         return defaults
+
+     normalized = defaults.copy()
+     normalized.update({key: value for key, value in data.items() if key in normalized})
+
+     if not isinstance(normalized["project_summary"], dict):
+         normalized["project_summary"] = defaults["project_summary"]
+     else:
+         normalized["project_summary"] = {
+             "purpose": normalized["project_summary"].get("purpose", ""),
+             "high_level_description": normalized["project_summary"].get("high_level_description", "")
+         }
+
+     if not isinstance(normalized["overall_assessment"], dict):
+         normalized["overall_assessment"] = defaults["overall_assessment"]
+     else:
+         normalized["overall_assessment"] = {
+             "verdict": normalized["overall_assessment"].get("verdict", ""),
+             "confidence": normalized["overall_assessment"].get("confidence", "low"),
+             "reason": normalized["overall_assessment"].get("reason", "")
+         }
+
+     for key in ["sources_used", "claims_detected", "capabilities_detected", "evidence", "gaps_or_risks", "scores"]:
+         if not isinstance(normalized[key], list):
+             normalized[key] = []
+
+     score_lookup = {}
+     for item in normalized["scores"]:
+         if not isinstance(item, dict):
+             continue
+
+         criterion = item.get("criterion")
+         if criterion:
+             score_lookup[criterion] = {
+                 "criterion": criterion,
+                 "score": max(1, min(5, int(item.get("score", 1)))) if str(item.get("score", "")).isdigit() else 1,
+                 "reasoning": item.get("reasoning", ""),
+                 "citations": item.get("citations", []) if isinstance(item.get("citations", []), list) else [],
+                 "confidence": max(0.0, min(1.0, float(item.get("confidence", 0.0)))) if isinstance(item.get("confidence", 0.0), (int, float)) else 0.0
+             }
+
+     normalized["scores"] = []
+     for criterion in get_rubric_criteria():
+         normalized["scores"].append(
+             score_lookup.get(
+                 criterion,
+                 {
+                     "criterion": criterion,
+                     "score": 1,
+                     "reasoning": "",
+                     "citations": [],
+                     "confidence": 0.0
+                 }
+             )
+         )
+
+     return normalized
+
+ def run_evaluation():
+     user_id = st.session_state["user_id"]
+     collection_name = f"multigpt_{user_id}"
+     logging.info(f"Starting evaluation for collection '{collection_name}'")
+
+     if not collection_has_data(collection_name):
+         logging.info("Evaluation skipped because no uploaded project data was found")
+         st.warning("No uploaded project data found for this user yet.")
+         return
+
+     process.warning("Retrieving project evidence")
+     logging.info("Retrieving project evidence from Milvus")
+     db = get_vector_store(collection_name)
+     documents = db.similarity_search(
+         "Evaluate this software project using all available uploaded evidence. "
+         "Summarize capabilities, evidence, gaps, and overall assessment.",
+         k=16
+     )
+
+     if not documents:
+         logging.info("Evaluation stopped because no retrievable evidence was found")
+         st.warning("No retrievable evidence was found for evaluation.")
+         return
+
+     prompt = build_evaluation_prompt(format_context(documents), get_rubric_criteria())
+
+     process.warning("Running evaluation")
+     logging.info(f"Running evaluator on {len(documents)} retrieved evidence chunks")
+     raw_response = run_llm(prompt)
+
+     try:
+         parsed_response = normalize_evaluation_response(parse_json_response(raw_response))
+     except json.JSONDecodeError:
+         logging.info("Model response was not valid JSON")
+         st.error("The model response was not valid JSON.")
+         st.code(raw_response, language="json")
+         return
+
+     logging.info("Evaluation completed successfully")
+     process.success("Evaluation ready")
+     st.json(parsed_response)
+
+ def add_evidence_page():
+     placeholder.title("Add Evidence")
+
+     choice = st.sidebar.radio("Evidence Type", ['', 'DOCUMENT', 'CODE', 'URL', 'VIDEO'])
+
+     if choice == 'DOCUMENT':
+         st.caption("Upload decks, notes, specs, or README-style documents.")
+         file = st.file_uploader("Upload document", type=["txt", "md", "pdf", "pptx"])
+         if file:
+             extension = os.path.splitext(file.name)[1].lower()
+
+             if extension in [".txt", ".md"]:
+                 ingest_text_document(file)
+             elif extension == ".pdf":
+                 ingest_pdf_document(file)
+             elif extension == ".pptx":
+                 ingest_pptx_document(file)
+             else:
+                 st.error("Unsupported document type.")
+
+     elif choice == 'CODE':
+         st.caption("Upload source or configuration files that represent the implementation.")
+         files = st.file_uploader(
+             "Upload code files",
+             type=CODE_FILE_TYPES,
+             accept_multiple_files=True
+         )
+         if files:
+             ingest_code_files(files)
+
+     elif choice == 'URL':
+         st.caption("Add a product page, documentation page, or prototype URL.")
+         url = st.text_input("Enter URL")
+         if url:
+             ingest_url(url)
+
+     elif choice == 'VIDEO':
+         st.caption("Add a YouTube demo or walkthrough link.")
+         link = st.text_input("YouTube link")
+         if link:
+             ingest_youtube_video(link)
+
+ def evaluate_page():
+     placeholder.title("Run Evaluation")
+     st.write("Generate a structured evaluation using all uploaded evidence for this submission.")
+     render_evidence_inventory()
+
+     if st.button("Run Evaluation"):
+         run_evaluation()
+
+ def main():
+     global placeholder, process
+
+     placeholder = st.empty()
+     process = st.empty()
+
+     if "user_id" not in st.session_state:
+         login()
+         return
+
+     st.sidebar.write(f"👤 {st.session_state['user_id']}")
+
+     page = st.sidebar.radio("Navigate", ['Add Evidence', 'Evaluate', 'Logout'])
+
+     if page == "Add Evidence":
+         add_evidence_page()
+     elif page == "Evaluate":
+         evaluate_page()
+     elif page == "Logout":
+         logging.info("Logging out and clearing session")
+         st.session_state.clear()
+         st.rerun()
+
+ if __name__ == "__main__":
+     main()
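One detail of `normalize_evaluation_response` in app.py is worth illustrating: a score is kept only when its string form is all digits, so a model reply of `4.0` or `"4.6"` falls back to 1. A more tolerant coercion could look like the sketch below. `coerce_score` is a hypothetical helper that is not part of the commit, and rounding to the nearest integer is an assumption about the intended behaviour.

```python
import math


def coerce_score(value, low=1, high=5):
    """Clamp a model-supplied score into [low, high]; fall back to `low` if unusable."""
    if isinstance(value, bool):
        # bool is a subclass of int, but True/False is not a meaningful score
        return low
    try:
        number = float(value)
    except (TypeError, ValueError):
        return low
    if not math.isfinite(number):
        # NaN and infinities cannot be rounded to an int
        return low
    return max(low, min(high, int(round(number))))


if __name__ == "__main__":
    for raw in [4, "4", 4.0, "4.6", 9, None, "high"]:
        print(repr(raw), "->", coerce_score(raw))
```

Falling back to the lowest score keeps the conservative default used in app.py, so malformed output can never inflate a rubric result.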
description.text ADDED
@@ -0,0 +1,91 @@
+ # Evaluator-core System Description
+
+ ## 1. Overview
+
+ Evaluator-core is a lightweight AI-assisted evaluation MVP built with:
+
+ - Streamlit
+ - Hugging Face Inference APIs
+ - Milvus
+ - Whisper
+
+ The system is designed to ingest multiple submission artefacts, store them in a shared evidence layer, and generate a structured evaluation output grounded in retrieved evidence.
+
+ ## 2. Current Goal
+
+ The current MVP aims to:
+
+ > ingest multiple artefacts, build a unified submission context, and return evidence-backed evaluation JSON
+
+ ## 3. Supported Inputs
+
+ The current system supports:
+
+ 1. Documents
+    - `.txt`
+    - `.md`
+    - `.pdf`
+    - `.pptx`
+ 2. Code files
+ 3. URLs
+ 4. YouTube demo videos
+
+ All artefacts uploaded under one username are stored in a single Milvus collection and evaluated together.
+
+ ## 4. Core Flow
+
+ 1. User logs in with a username.
+ 2. Artefacts are uploaded or linked through the UI.
+ 3. Text is extracted from each artefact.
+ 4. Extracted text is chunked and embedded.
+ 5. Chunks are stored in Milvus with source metadata.
+ 6. Evaluation retrieves evidence from the unified collection.
+ 7. A Hugging Face-hosted model returns structured JSON.
+
+ ## 5. What The Evaluator Produces
+
+ The current output includes:
+
+ - `project_summary`
+ - `sources_used`
+ - `claims_detected`
+ - `capabilities_detected`
+ - `evidence`
+ - `gaps_or_risks`
+ - `scores`
+ - `overall_assessment`
+
+ The scoring rubric currently includes:
+
+ - Problem Understanding
+ - Technical Approach
+ - Implementation Quality
+ - Innovation / Originality
+ - Communication & Demo Clarity
+ - Claim vs Reality Alignment
+ - Prototype Functionality
+
+ ## 6. Current Strengths
+
+ - Unified evidence storage across source types
+ - Retrieval-backed evaluation
+ - Structured JSON output
+ - Basic claim extraction
+ - Rubric-based scoring
+ - Source inventory before evaluation
+
+ ## 7. Current Limitations
+
+ - Prototype URL validation is still text-based, not interaction-based
+ - Claim validation is prompt-driven, not a dedicated cross-artifact engine
+ - Code ingestion is file-upload based, not full repository ingestion
+ - Code chunking is still text-based rather than syntax-aware
+ - Scores and confidence are model-generated rather than calibrated
+
+ ## 8. Architecture Direction
+
+ This MVP is no longer a source-specific chatbot. It is now closer to an evidence-layer evaluator:
+
+ > multi-source ingestion -> shared vector store -> retrieved evidence -> structured evaluation
+
+ That makes it a practical early version of the assignment’s intended system, while still leaving prototype validation and stronger cross-checking as future work.
description.txt ADDED
@@ -0,0 +1,91 @@
+ # Evaluator-core System Description
+
+ ## 1. Overview
+
+ Evaluator-core is a lightweight AI-assisted evaluation MVP built with:
+
+ - Streamlit
+ - Hugging Face Inference APIs
+ - Milvus
+ - Whisper
+
+ The system is designed to ingest multiple submission artefacts, store them in a shared evidence layer, and generate a structured evaluation output grounded in retrieved evidence.
+
+ ## 2. Current Goal
+
+ The current MVP aims to:
+
+ > ingest multiple artefacts, build a unified submission context, and return evidence-backed evaluation JSON
+
+ ## 3. Supported Inputs
+
+ The current system supports:
+
+ 1. Documents
+    - `.txt`
+    - `.md`
+    - `.pdf`
+    - `.pptx`
+ 2. Code files
+ 3. URLs
+ 4. YouTube demo videos
+
+ All artefacts uploaded under one username are stored in a single Milvus collection and evaluated together.
+
+ ## 4. Core Flow
+
+ 1. User logs in with a username.
+ 2. Artefacts are uploaded or linked through the UI.
+ 3. Text is extracted from each artefact.
+ 4. Extracted text is chunked and embedded.
+ 5. Chunks are stored in Milvus with source metadata.
+ 6. Evaluation retrieves evidence from the unified collection.
+ 7. A Hugging Face-hosted model returns structured JSON.
+
+ ## 5. What The Evaluator Produces
+
+ The current output includes:
+
+ - `project_summary`
+ - `sources_used`
+ - `claims_detected`
+ - `capabilities_detected`
+ - `evidence`
+ - `gaps_or_risks`
+ - `scores`
+ - `overall_assessment`
+
+ The scoring rubric currently includes:
+
+ - Problem Understanding
+ - Technical Approach
+ - Implementation Quality
+ - Innovation / Originality
+ - Communication & Demo Clarity
+ - Claim vs Reality Alignment
+ - Prototype Functionality
+
+ ## 6. Current Strengths
+
+ - Unified evidence storage across source types
+ - Retrieval-backed evaluation
+ - Structured JSON output
+ - Basic claim extraction
+ - Rubric-based scoring
+ - Source inventory before evaluation
+
+ ## 7. Current Limitations
+
+ - Prototype URL validation is still text-based, not interaction-based
+ - Claim validation is prompt-driven, not a dedicated cross-artifact engine
+ - Code ingestion is file-upload based, not full repository ingestion
+ - Code chunking is still text-based rather than syntax-aware
+ - Scores and confidence are model-generated rather than calibrated
+
+ ## 8. Architecture Direction
+
+ This MVP is no longer a source-specific chatbot. It is now closer to an evidence-layer evaluator:
+
+ > multi-source ingestion -> shared vector store -> retrieved evidence -> structured evaluation
+
+ That makes it a practical early version of the assignment’s intended system, while still leaving prototype validation and stronger cross-checking as future work.
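Steps 3 to 5 of the core flow in `description.txt` (extract, chunk, store with source metadata) could look roughly like the sketch below. It is an illustration only: the function name `chunk_text`, the character-window strategy, and the default `chunk_size` and `overlap` values are assumptions, and the real `app.py` may well use a LangChain splitter instead. It does match the stated limitation that chunking is text-based rather than syntax-aware.

```python
from typing import Dict, List


def chunk_text(text: str, source_name: str, source_type: str,
               chunk_size: int = 500, overlap: int = 50) -> List[Dict[str, object]]:
    """Split text into overlapping character windows tagged with source metadata.

    Hypothetical helper: names and defaults are illustrative, not taken from app.py.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[Dict[str, object]] = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks.append({
                "chunk_index": len(chunks),
                "text": piece,
                "source_name": source_name,
                "source_type": source_type,
            })
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks
```

Each returned dictionary carries the metadata that would be stored next to the embedding, so retrieved evidence can later be traced back to its artefact.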
docker-compose.yml ADDED
@@ -0,0 +1,60 @@
+ version: '3.5'
+
+ services:
+   etcd:
+     container_name: milvus-etcd
+     image: quay.io/coreos/etcd:v3.5.5
+     environment:
+       - ETCD_AUTO_COMPACTION_MODE=revision
+       - ETCD_AUTO_COMPACTION_RETENTION=1000
+       - ETCD_QUOTA_BACKEND_BYTES=4294967296
+       - ETCD_SNAPSHOT_COUNT=50000
+     volumes:
+       - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
+     command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
+     healthcheck:
+       test: ["CMD", "etcdctl", "endpoint", "health"]
+       interval: 30s
+       timeout: 20s
+       retries: 3
+
+   minio:
+     container_name: milvus-minio
+     image: minio/minio:RELEASE.2023-03-20T20-16-18Z
+     environment:
+       MINIO_ACCESS_KEY: minioadmin
+       MINIO_SECRET_KEY: minioadmin
+     volumes:
+       - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
+     command: minio server /minio_data
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
+       interval: 30s
+       timeout: 20s
+       retries: 3
+
+   standalone:
+     container_name: milvus-standalone
+     image: milvusdb/milvus:v2.2.16
+     command: ["milvus", "run", "standalone"]
+     environment:
+       ETCD_ENDPOINTS: etcd:2379
+       MINIO_ADDRESS: minio:9000
+     volumes:
+       - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
+       interval: 30s
+       start_period: 90s
+       timeout: 20s
+       retries: 3
+     ports:
+       - "19530:19530"
+       - "9091:9091"
+     depends_on:
+       - "etcd"
+       - "minio"
+
+ networks:
+   default:
+     name: milvus
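The compose file publishes ports 19530 and 9091 from the `standalone` service. Before starting the app, it can help to confirm that those ports are reachable. The following is a minimal standard-library sketch; `is_port_open` is a hypothetical helper and not part of this commit, and a successful TCP connection only shows that something is listening, not that Milvus is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `is_port_open("localhost", 19530)` after `docker compose up -d` gives a quick signal on whether the published Milvus port is accepting connections; the `healthcheck` entries in the compose file remain the more reliable indicator of service health.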
output.json ADDED
@@ -0,0 +1,244 @@
+ {
+   "project_summary": {
+     "purpose": "",
+     "high_level_description": ""
+   },
+   "sources_used": [
+     {
+       "source_type": "text",
+       "source_name": "description.txt",
+       "notes": ""
+     },
+     {
+       "source_type": "code",
+       "source_name": "app.py",
+       "notes": ""
+     }
+   ],
+   "claims_detected": [],
+   "capabilities_detected": [
+     {
+       "capability": "Supports multiple artefact types: Documents (.txt, .md, .pdf, .pptx), Code files, URLs, YouTube demo videos",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 3"
+       ]
+     },
+     {
+       "capability": "Text is extracted from artefacts, chunked and embedded",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 1",
+         "Evidence 3"
+       ]
+     },
+     {
+       "capability": "Chunks with metadata are stored in Milvus",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 1"
+       ]
+     },
+     {
+       "capability": "Evaluates based on retrieved evidence from a unified collection",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 1"
+       ]
+     },
+     {
+       "capability": "Generates structured JSON output including project_summary, sources_used, claims_detected, capabilities_detected, evidence, gaps_or_risks, scores, overall_assessment",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 1"
+       ]
+     },
+     {
+       "capability": "Evaluation uses a Hugging Face-hosted model",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 1"
+       ]
+     },
+     {
+       "capability": "Provides a source inventory before evaluation",
+       "status": "supported",
+       "evidence_refs": [
+         "Evidence 9"
+       ]
+     }
+   ],
+   "evidence": [
+     {
+       "claim_or_observation": "The system is a lightweight AI-assisted evaluation MVP built with Streamlit, Hugging Face Inference APIs, Milvus, Whisper",
+       "support_level": "supported",
+       "evidence_refs": [
+         "Evidence 3"
+       ]
+     },
+     {
+       "claim_or_observation": "Current MVP aims to ingest multiple artefacts, build a unified submission context, and return evidence-backed evaluation JSON",
+       "support_level": "supported",
+       "evidence_refs": [
+         "Evidence 3"
+       ]
+     },
+     {
+       "claim_or_observation": "Artefacts uploaded under one username are stored in a single Milvus collection",
+       "support_level": "supported",
+       "evidence_refs": [
+         "Evidence 3"
+       ]
+     },
+     {
+       "claim_or_observation": "Current scoring rubric includes Problem Understanding, Technical Approach, Implementation Quality, Innovation / Originality, Communication & Demo Clarity, Claim vs Reality Alignment, Prototype Functionality",
+       "support_level": "supported",
+       "evidence_refs": [
+         "Evidence 1",
+         "Evidence 6"
+       ]
+     },
+     {
+       "claim_or_observation": "Current strengths include Unified evidence storage across source types, Retrieval-backed evaluation, Structured JSON output, Basic claim extraction, Rubric-based scoring, Source inventory before evaluation",
+       "support_level": "supported",
+       "evidence_refs": [
+         "Evidence 1",
+         "Evidence 2"
+       ]
+     },
+     {
+       "claim_or_observation": "Current limitations include Prototype URL validation is still text-based, not interaction-based, Claim validation is prompt-driven, not a dedicated cross-artifact engine, Code ingestion is file-upload based, not full repository ingestion, Code chunking is still text-based rather than syntax-aware, Scores and confidence are model-generated rather than calibrated",
+       "support_level": "supported",
+       "evidence_refs": [
+         "Evidence 2"
+       ]
+     }
+   ],
+   "gaps_or_risks": [
+     {
+       "issue": "Evaluation depends on an LLM-generated JSON response; parsing may fail if response is invalid",
+       "reason": "Code shows a try-catch block for JSONDecodeError, and the system logs and displays error if JSON invalid",
+       "evidence_refs": [
+         "Evidence 4"
+       ]
+     },
+     {
+       "issue": "No actual prototype URL validation or interaction",
+       "reason": "Limitations text states prototype URL validation is still text-based, not interaction-based",
+       "evidence_refs": [
+         "Evidence 2"
+       ]
+     },
+     {
+       "issue": "Claim validation is prompt-driven, not a dedicated cross-artifact engine",
+       "reason": "Limitations text states claim validation is prompt-driven",
+       "evidence_refs": [
+         "Evidence 2"
+       ]
+     },
+     {
+       "issue": "Code chunking is text-based, not syntax-aware",
+       "reason": "Limitations text states code chunking is still text-based rather than syntax-aware",
+       "evidence_refs": [
+         "Evidence 2"
+       ]
+     },
+     {
+       "issue": "Scores and confidence are model-generated, not calibrated",
+       "reason": "Limitations text states scores and confidence are model-generated rather than calibrated",
+       "evidence_refs": [
+         "Evidence 2"
+       ]
+     }
+   ],
+   "scores": [
+     {
+       "criterion": "Problem Understanding",
+       "score": 4,
+       "reasoning": "System architecture is described as evidence-layer evaluator with clear purpose; limitations acknowledged",
+       "citations": [
+         "Evidence 1",
+         "Evidence 2",
+         "Evidence 3"
+       ],
+       "confidence": 0.8
+     },
+     {
+       "criterion": "Technical Approach",
+       "score": 3,
+       "reasoning": "Approach uses multi-source ingestion, shared vector store, retrieval, and structured evaluation; but limitations exist in claim validation, code chunking, and prototype validation",
+       "citations": [
+         "Evidence 1",
+         "Evidence 2",
+         "Evidence 3",
+         "Evidence 4"
+       ],
+       "confidence": 0.75
+     },
+     {
+       "criterion": "Implementation Quality",
+       "score": 3,
+       "reasoning": "Code shows concrete implementation for artefact ingestion, storage, retrieval, and evaluation; supports multiple file types; but error handling and dependency on LLM JSON are present",
+       "citations": [
+         "Evidence 4",
+         "Evidence 10",
+         "Evidence 11",
+         "Evidence 12",
+         "Evidence 14"
+       ],
+       "confidence": 0.8
+     },
+     {
+       "criterion": "Innovation / Originality",
+       "score": 2,
+       "reasoning": "Unified evidence storage and retrieval-backed evaluation are strengths; however, the approach is described as an MVP and lacks sophisticated validation",
+       "citations": [
+         "Evidence 1",
+         "Evidence 2",
+         "Evidence 8"
+       ],
+       "confidence": 0.6
+     },
+     {
+       "criterion": "Communication & Demo Clarity",
+       "score": 3,
+       "reasoning": "System description and code structure are clear; strengths and limitations are documented; UI components shown (Streamlit)",
+       "citations": [
+         "Evidence 1",
+         "Evidence 2",
+         "Evidence 3",
+         "Evidence 7"
+       ],
+       "confidence": 0.7
+     },
+     {
+       "criterion": "Claim vs Reality Alignment",
+       "score": 3,
+       "reasoning": "Supported capabilities and limitations are explicitly listed, aligning with implementation; claim validation noted as prompt-driven",
+       "citations": [
+         "Evidence 1",
+         "Evidence 2",
+         "Evidence 3",
+         "Evidence 9"
+       ],
+       "confidence": 0.8
+     },
+     {
+       "criterion": "Prototype Functionality",
+       "score": 2,
+       "reasoning": "Evidence shows a working system for artefact ingestion, storage, retrieval, and structured evaluation; but limitations indicate lack of interactive prototype validation and reliance on text-based URL processing",
+       "citations": [
+         "Evidence 2",
+         "Evidence 4",
+         "Evidence 5",
+         "Evidence 7"
+       ],
+       "confidence": 0.7
+     }
+   ],
+   "overall_assessment": {
+     "verdict": "The project is a functional MVP for evidence-backed software project evaluation using multi-source ingestion and retrieval, with clear strengths and acknowledged limitations.",
+     "confidence": "high",
+     "reason": "Evidence from both description and code files provides consistent and detailed support for core functionalities, flow, and current state."
+   }
+ }
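Because `output.json` is model-generated, a light structural check can catch drift from the schema that `prompts.py` requests (eight top-level keys, integer scores from 1 to 5, numeric confidence from 0 to 1, at least one citation per score). The sketch below is one possible check; `validate_evaluation` and `REQUIRED_KEYS` are hypothetical names and are not part of this commit.

```python
from typing import Any, Dict, List

# Top-level keys requested by the evaluation prompt.
REQUIRED_KEYS = (
    "project_summary", "sources_used", "claims_detected", "capabilities_detected",
    "evidence", "gaps_or_risks", "scores", "overall_assessment",
)


def validate_evaluation(data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]

    for item in data.get("scores", []):
        name = item.get("criterion", "<unnamed>")

        score = item.get("score")
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            problems.append(f"{name}: score must be an integer from 1 to 5")

        confidence = item.get("confidence")
        if (isinstance(confidence, bool) or not isinstance(confidence, (int, float))
                or not 0 <= confidence <= 1):
            problems.append(f"{name}: confidence must be a number from 0 to 1")

        if not item.get("citations"):
            problems.append(f"{name}: at least one citation is expected")

    return problems
```

A check like this only verifies shape and ranges. It says nothing about whether the scores are calibrated, which the description lists as a current limitation.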
prompts.py ADDED
@@ -0,0 +1,87 @@
+ import json
+
+
+ def build_evaluation_prompt(context, rubric_criteria):
+     rubric_json = json.dumps(rubric_criteria)
+     return f"""
+ You are evaluating one software project using retrieved evidence from mixed uploaded sources.
+ Use only the supplied evidence. Do not invent facts. If something is unclear, say it is uncertain.
+ Extract concrete product or implementation claims when possible and label each one as supported, partial, or uncertain based only on the evidence.
+ Score the submission using the rubric criteria provided below. Use retrieved evidence only.
+
+ Return valid JSON only. No markdown, no code fences, no explanation outside the JSON.
+
+ Use exactly this top-level structure:
+ {{
+   "project_summary": {{
+     "purpose": "",
+     "high_level_description": ""
+   }},
+   "sources_used": [
+     {{
+       "source_type": "",
+       "source_name": "",
+       "notes": ""
+     }}
+   ],
+   "claims_detected": [
+     {{
+       "claim": "",
+       "status": "supported|partial|uncertain",
+       "reason": "",
+       "evidence_refs": ["Evidence 1"]
+     }}
+   ],
+   "capabilities_detected": [
+     {{
+       "capability": "",
+       "status": "supported|partial|uncertain",
+       "evidence_refs": ["Evidence 1"]
+     }}
+   ],
+   "evidence": [
+     {{
+       "claim_or_observation": "",
+       "support_level": "supported|partial|uncertain",
+       "evidence_refs": ["Evidence 1"]
+     }}
+   ],
+   "gaps_or_risks": [
+     {{
+       "issue": "",
+       "reason": "",
+       "evidence_refs": ["Evidence 1"]
+     }}
+   ],
+   "scores": [
+     {{
+       "criterion": "",
+       "score": 1,
+       "reasoning": "",
+       "citations": ["Evidence 1"],
+       "confidence": 0.5
+     }}
+   ],
+   "overall_assessment": {{
+     "verdict": "",
+     "confidence": "low|medium|high",
+     "reason": ""
+   }}
+ }}
+
+ Rules:
+ - Keep claims specific and checkable.
+ - Prefer 3 to 8 claims when enough evidence exists.
+ - Mark a claim as "supported" only when the evidence directly backs it.
+ - Mark a claim as "partial" when the evidence suggests the claim but does not fully prove it.
+ - Mark a claim as "uncertain" when the claim is plausible but not verified by the retrieved evidence.
+ - Every claim, capability, evidence item, and risk must include at least one evidence reference when possible.
+ - Create one score item for each rubric criterion in this exact list: {rubric_json}
+ - Score each criterion on an integer scale from 1 to 5.
+ - `citations` must reference evidence ids such as "Evidence 1".
+ - `confidence` must be a numeric value from 0 to 1.
+ - If no URL or prototype evidence exists, score "Prototype Functionality" conservatively and explain the limited evidence.
+
+ Evidence:
+ {context}
+ """.strip()
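The prompt asks for JSON only, yet hosted models sometimes wrap the reply in a Markdown code fence anyway, and `output.json` itself flags invalid JSON as a risk. A tolerant parser along the following lines could reduce those failures. This is a sketch under assumptions: `parse_model_json` is a hypothetical helper, and how `app.py` actually handles `JSONDecodeError` is not shown in this part of the diff.

```python
import json
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_model_json(raw: str) -> Optional[Any]:
    """Parse a model reply as JSON, tolerating one surrounding code fence.

    Returns the parsed value, or None when the reply is not valid JSON.
    """
    text = raw.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the opening fence line.
        first_newline = body.find("\n")
        text = body[first_newline + 1:] if first_newline != -1 else body
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising keeps the calling code simple: the UI can show the raw reply and an error message whenever parsing fails.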
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ goose3
+ pydantic==1.10.12
+ langchain==0.0.278
+ langchain-community
+ PyPDF2
+ python-pptx
+ python-dotenv
+ streamlit
+ moviepy
+ pytube
+ pymilvus
+ huggingface_hub
+ git+https://github.com/openai/whisper.git