Spaces: HindHammad/NutriBud


HindHammad committed on Nov 30, 2025

Commit

2c13490

1 Parent(s): 5c2a339

added app.py, requirements, and cleaned code

Files changed (3)

.gitignore +3 -0
app.py +187 -0
requirements.txt +5 -0

.gitignore ADDED

	@@ -0,0 +1,3 @@

+venv
+data
+README.md

app.py ADDED

	@@ -0,0 +1,187 @@

+import os
+import glob
+import numpy as np
+import gradio as gr
+from sentence_transformers import SentenceTransformer
+# At the top, we set up some basic configuration for our RAG system.
+# We decided to keep all our trusted nutrition documents as plain .txt files inside a folder called "data".
+# That way, if we want to update NutriBud later, we can just drop more files into that folder without touching the code.
+DATA_DIR = "data"
+TOP_K = 3  # this controls how many chunks we retrieve for each question
+# Here we load the sentence embedding model that we are using for retrieval.
+# We chose "all-MiniLM-L6-v2" because it is light enough to run on CPU but still gives good-quality embeddings.
+# This model is what lets us convert both our document chunks and the user’s question into vectors in the same space.
+print("Loading embedding model...")
+embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+# In this function, we are loading all the text files from the data folder and turning them into smaller chunks.
+# We chose to split on double newlines so that we roughly stay at the paragraph level instead of entire pages.
+# We also skip very short chunks so the retrieval stays focused on meaningful pieces of text.
+def load_corpus_and_chunks(data_dir: str):
+    texts = []
+    file_paths = glob.glob(os.path.join(data_dir, "*.txt"))
+    print(f"Found {len(file_paths)} files in {data_dir}")
+    for path in file_paths:
+        try:
+            with open(path, "r", encoding="utf-8") as f:
+                content = f.read()
+        except UnicodeDecodeError:
+            # If UTF-8 decoding fails, we fall back to latin-1, which can decode any byte sequence; text exported from PDFs sometimes arrives in odd encodings.
+            with open(path, "r", encoding="latin-1") as f:
+                content = f.read()
+        # Here we actually split the file into chunks.
+        # We keep it simple and split on blank lines, which works nicely for guidelines that are written in short sections.
+        for chunk in content.split("\n\n"):
+            chunk = chunk.strip()
+            # We decided to ignore very short chunks because they usually do not carry enough context.
+            if len(chunk) < 100:
+                continue
+            texts.append(chunk)
+    print(f"Total chunks: {len(texts)}")
+    return texts
+# When the app starts, we load all the chunks and precompute their embeddings.
+# We do this once at startup so the user does not have to wait for embedding every document on every question.
+corpus_chunks = load_corpus_and_chunks(DATA_DIR)
+corpus_embeddings = embed_model.encode(corpus_chunks, convert_to_numpy=True, show_progress_bar=True)
+# After we get the embeddings, we normalize them.
+# We decided to normalize so that cosine similarity becomes a simple dot product, which makes the retrieval simpler.
+corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
+# This helper function is the core of the retrieval step in our RAG pipeline.
+# We take the user’s question, embed it, normalize it, and then compare it to every chunk using a dot product.
+# Then we sort all the scores and pick the top k chunks as our context.
+def retrieve_relevant_chunks(question: str, k: int = TOP_K):
+    q_emb = embed_model.encode([question], convert_to_numpy=True)[0]
+    q_emb = q_emb / np.linalg.norm(q_emb)
+    scores = np.dot(corpus_embeddings, q_emb)
+    top_indices = np.argsort(scores)[::-1][:k]
+    results = [corpus_chunks[i] for i in top_indices]
+    return results
+# Before we generate an answer, we added a small safety layer around the questions.
+# Our idea here is that NutriBud should not try to act like a doctor or dietitian,
+# especially for conditions like diabetes or questions about rapid weight loss.
+# So we wrote a simple keyword-based filter that flags questions as high-risk.
+def is_high_risk_question(question: str) -> bool:
+    q = question.lower()
+    # We decided to keep this list of risky keywords simple and readable,
+    # since in our assignment the main goal is to show our thinking about safety, not to build a perfect classifier.
+    risky_keywords = [
+        "exact calories",
+        "calorie meal plan",
+        "meal plan",
+        "lose 20 pounds",
+        "lose 10 pounds",
+        "rapid weight loss",
+        "crash diet",
+        "diabetes",
+        "diabetic",
+        "blood sugar",
+        "keto",
+        "intermittent fasting",
+        "dizzy",
+        "faint",
+        "fainting",
+        "lightheaded",
+        "eating disorder",
+        "anorexia",
+        "bulimia",
+    ]
+    return any(word in q for word in risky_keywords)
+# This is the message we return whenever our safety check decides that the question is too high-risk.
+# We wrote it in a way that clearly says what NutriBud can and cannot do, and encourages the user
+# to talk to a health professional instead of relying on an AI for personal medical issues.
+def safety_response(question: str) -> str:
+    return (
+        "I’m NutriBud, a general nutrition helper based on public health guidelines. "
+        "I can’t give medical advice, personalized meal plans, or recommendations for specific "
+        "conditions like diabetes, dizziness with fasting, or rapid weight loss. "
+        "It’s really important to talk to a doctor or a registered dietitian for guidance "
+        "that is safe for your health. "
+        "If you’d like, you can ask me more general questions about healthy eating patterns, "
+        "like ways to eat more vegetables, choose healthier drinks, or limit highly processed foods."
+    )
+# This function is responsible for building the final answer to non-risky questions using our RAG setup.
+# Instead of calling a large generative model, we decided to keep it more transparent and deterministic.
+# We retrieve the most relevant chunks and then stitch them into a friendly, short answer.
+def build_rag_answer(question: str) -> str:
+    # First we get the top K chunks from our corpus based on similarity.
+    contexts = retrieve_relevant_chunks(question, k=TOP_K)
+    # We start the answer with a short intro so the user knows the answer is coming from general guidelines.
+    intro = (
+        "Here’s a general answer based on the trusted nutrition sources we loaded "
+        "(like Canada’s Food Guide and similar public health guidance):\n\n"
+    )
+    # Then we join the retrieved chunks with spacing so they are readable.
+    body = "\n\n".join(contexts)
+    # We also decided to limit the total length so that NutriBud’s responses stay compact inside the chat window.
+    full_text = intro + body
+    max_len = 1200
+    if len(full_text) > max_len:
+        truncated = full_text[:max_len]
+        if "." in truncated:
+            truncated = truncated.rsplit(".", 1)[0] + "."
+        full_text = truncated
+    return full_text
+# This is the main function that Gradio calls every time the user sends a new message.
+# The "history" parameter is part of the ChatInterface API, but in our design we decided not to use it
+# directly inside the retrieval step because we are focusing on single-turn questions for this assignment.
+def nutri_chat(message: str, history: list):
+    # First we check if the question looks high-risk according to our keyword filter.
+    if is_high_risk_question(message):
+        return safety_response(message)
+    # If not high-risk, we go through the normal RAG pipeline and return a context-based answer.
+    return build_rag_answer(message)
+# For the user interface, at first we tried using more customized layouts with Blocks and custom CSS.
+# While we were exploring that, we read the Gradio documentation on custom CSS and JS here:
+# https://www.gradio.app/guides/custom-CSS-and-JS
+# However, because Gradio 6 changed some of the arguments (for example, css and theme moved from the gr.Blocks() constructor to launch()),
+# those experiments started causing errors during our local testing.
+# To make sure NutriBud is stable and easy to deploy on Hugging Face Spaces,
+# we decided to switch to ChatInterface, which is much simpler but still looks clean.
+demo = gr.ChatInterface(
+    fn=nutri_chat,
+    title="NutriBud — Healthy Nutrition RAG Chatbot",
+    description=(
+        "Ask general questions about healthy eating, like:\n"
+        "• How can I eat more vegetables every day?\n"
+        "• What are healthier drink choices instead of sugary drinks?\n"
+        "NutriBud uses a retrieval-augmented approach on trusted public health documents.\n"
+        "It does not give medical advice or personalized meal plans."
+    ),
+)
+# At the end, we launch the app in the usual Gradio way.
+# When we are testing locally on our laptops, we run `python app.py` and open the local URL
+# (for example http://127.0.0.1:7860) in our browser.
+# On Hugging Face Spaces, we keep this structure because the Space will call this file as the main entry point.
+if __name__ == "__main__":
+    demo.launch()
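
The comments around `retrieve_relevant_chunks` in app.py explain that normalizing both sides turns cosine similarity into a plain dot product, and that the top `k` chunks are always returned. As written, an off-topic question would still receive three chunks, however weak the match. The sketch below illustrates the same idea on toy vectors and adds a minimum-score cutoff. It is dependency-free so it can run without the embedding model; the names `normalize` and `top_k_with_threshold` and the `min_score=0.3` value are assumptions for illustration, not part of this commit and not a tuned setting.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_with_threshold(query, corpus, k=3, min_score=0.3):
    """Return (index, score) pairs for the k most similar corpus vectors.

    Both sides are unit-normalized first, so the dot product equals the
    cosine similarity. Pairs scoring below min_score are dropped.
    min_score is a hypothetical cutoff chosen only for this example.
    """
    q = normalize(query)
    scored = []
    for i, vec in enumerate(corpus):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        if score >= min_score:
            scored.append((i, score))
    # Highest similarity first, then keep at most k results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" standing in for real sentence vectors.
corpus = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
hits = top_k_with_threshold([2.0, 0.0], corpus, k=2)
```

With a cutoff like this, an empty result could be mapped to a short "nothing relevant found" reply instead of stitching unrelated chunks into an answer. Whether that fits NutriBud depends on how the real embedding scores are distributed, which would need checking against the actual documents.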

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+gradio
+torch
+transformers
+sentence-transformers
+numpy
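
A note on the chunking in `load_corpus_and_chunks` (app.py): splitting on the exact string `"\n\n"` only recognizes blank lines that are completely empty, so a line containing stray spaces or tabs would not separate two paragraphs. The following is a minimal sketch of a more tolerant split, assuming the same 100-character minimum used in the commit. The function name `split_into_chunks` and its `min_chars` parameter are hypothetical.

```python
import re


def split_into_chunks(content, min_chars=100):
    """Split text on blank lines and keep chunks of at least min_chars characters."""
    # \n\s*\n also treats lines that hold only spaces or tabs as blank lines.
    pieces = re.split(r"\n\s*\n", content)
    chunks = []
    for piece in pieces:
        piece = piece.strip()
        # Same rule as the commit: very short pieces carry too little context.
        if len(piece) >= min_chars:
            chunks.append(piece)
    return chunks


# Two long paragraphs separated by a whitespace-only line, plus one short piece.
sample = ("A" * 120) + "\n   \n" + "Too short." + "\n\n" + ("B" * 150)
chunks = split_into_chunks(sample)
```

Here the whitespace-only line still acts as a paragraph break, and the short middle piece is skipped, leaving two chunks.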