Nestor eduardo Sanchez ospina committed
Commit · c627f4d
Parent(s): 25fab06

Add application file

Files changed:
- .DS_Store +0 -0
- 2nd_clean_comida_dogs_enriched_multilingual.pkl +0 -0
- 2nd_clean_comida_dogs_filtered.pkl +0 -0
- 3rd_clean_comida_dogs_enriched_multilingual_2.pkl +0 -0
- README.md +88 -13
- __pycache__/qa_backend.cpython-311.pyc +0 -0
- app.py +189 -0
- bm25_index.pkl +0 -0
- checkpoints/checkpoint_batch_5.pkl +0 -0
- create_vector_stores.py +171 -0
- data_cleansing/2nd_clean_comida_dogs_filtered.pkl +0 -0
- data_cleansing/clean_comida_dogs_categoria_cleaned.pkl +0 -0
- diagnostic_notebook.ipynb +202 -0
- enriching_description.py +237 -0
- qa_backend.py +296 -0
- raw_data/clean_comida_dogs_categoria.pkl +0 -0
- raw_data/veterinarias_processed.pkl +0 -0
.DS_Store
ADDED
Binary file (6.15 kB)

2nd_clean_comida_dogs_enriched_multilingual.pkl
ADDED
Binary file (11.2 kB)

2nd_clean_comida_dogs_filtered.pkl
ADDED
Binary file (26.9 kB)

3rd_clean_comida_dogs_enriched_multilingual_2.pkl
ADDED
Binary file (147 kB)
README.md
CHANGED
@@ -1,13 +1,88 @@
# Dog Food Product QA System

A hybrid search and question-answering system for dog food products, combining BM25 keyword search with vector search (ChromaDB).

## Core Components

### Essential Scripts

1. `qa_backend.py`
   - Main backend implementation
   - Contains the `DogFoodQASystem` class
   - Implements hybrid search and answer generation
   - Key features: BM25 search, vector search, result combination, multilingual support

2. `app.py`
   - Streamlit frontend interface
   - Displays search results with source indicators
   - Shows statistics and product details
   - Handles user queries in English and Spanish

### Important Notebooks

1. `test_retrieval.ipynb`
   - Reference implementation for hybrid search
   - Used for testing and validating search functionality
   - Contains working examples of both BM25 and ChromaDB searches

   ```python:Initial_trial_RAG/test_retrieval.ipynb
   startLine: 64
   endLine: 117
   ```

2. `diagnose_qa_system.ipynb`
   - Diagnostic tool for system components
   - Tests vector store connectivity
   - Validates search result combination
   - Useful for debugging and system verification

### Supporting Files

- `bm25_index.pkl`: Serialized BM25 index and associated data
- `chroma_db/`: Directory containing the ChromaDB vector store
- `.env`: Environment variables (OpenAI API key)

### Less Critical Components

1. `trial_enriching_description.ipynb`
   - Used for the initial data enrichment
   - Not needed for regular system operation
   - Reference for future data processing

   ```python:Initial_trial_RAG/trial_enriching_decription.ipynb
   startLine: 26
   endLine: 37
   ```

## System Architecture

1. **Search Components**
   - BM25 for keyword matching
   - ChromaDB for semantic search
   - Smart result combination with duplicate detection

2. **Result Processing**
   - Source tracking (BM25, Vector, or both)
   - Score preservation for transparency
   - Metadata-aware result presentation

3. **User Interface**
   - Color-coded results by source:
     - 🔵 Blue: BM25 results
     - 🟢 Green: Vector results
     - 🔄 Purple: found by both sources
   - Detailed statistics display
   - Bilingual support (EN/ES)

## Usage

1. Start the application (`streamlit run app.py`)
2. Enter queries in English or Spanish
3. View combined results with source indicators
4. Check the statistics for the result distribution

## Development Notes

- The BM25 and vector searches each return their top 5 results
- Duplicates are automatically detected and merged
- All unique results are passed to the LLM as context
- Scores are displayed but not used for filtering
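The merge step described in the development notes (each retriever returns its top 5, duplicates are detected and merged, and both original scores are preserved) can be sketched roughly as follows. This is an illustration, not the actual `qa_backend.py` implementation; the dict shapes (`id`, `text`, `score` keys) are assumptions chosen to match the fields the frontend displays:

```python
def merge_results(bm25_results, vector_results):
    """Merge two ranked result lists, tagging each unique document with the
    source(s) that retrieved it and preserving each source's original score.

    Each input item is assumed to be a dict: {"id": ..., "text": ..., "score": ...}.
    """
    merged = {}
    for source, results in (("BM25", bm25_results), ("Vector", vector_results)):
        for r in results:
            # setdefault keeps the first-seen text and accumulates sources/scores
            entry = merged.setdefault(
                r["id"], {"text": r["text"], "sources": [], "original_scores": {}}
            )
            entry["sources"].append(source)
            entry["original_scores"][source] = r["score"]
    return list(merged.values())
```

A document found by both retrievers ends up with `sources == ["BM25", "Vector"]` and both scores, which is exactly the shape the purple 🔄 badge and the statistics panel key off.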
__pycache__/qa_backend.cpython-311.pyc
ADDED
Binary file (16.4 kB)
app.py
ADDED
@@ -0,0 +1,189 @@
import streamlit as st
from qa_backend import DogFoodQASystem
import time
from typing import Dict, Any, List, Tuple
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Configure page settings
st.set_page_config(
    page_title="Dog Food Advisor",
    page_icon="🐕",
    layout="wide"
)

# Custom CSS for better styling
st.markdown("""
<style>
.stAlert {
    padding: 1rem;
    margin: 1rem 0;
    border-radius: 0.5rem;
}
.search-result {
    padding: 1rem;
    margin: 0.5rem 0;
    border: 1px solid #ddd;
    border-radius: 0.5rem;
}
.debug-info {
    font-size: small;
    color: gray;
    padding: 0.5rem;
    background-color: #f0f0f0;
    border-radius: 0.3rem;
}
</style>
""", unsafe_allow_html=True)

@st.cache_resource
def load_qa_system() -> Tuple[DogFoodQASystem, bool]:
    """Initialize and cache the QA system along with its vector store status."""
    qa_system = DogFoodQASystem()
    # Run diagnostics
    vector_store_status = qa_system.diagnose_vector_store()
    return qa_system, vector_store_status

def display_search_result(result: Dict[str, Any], index: int) -> None:
    """Display a single search result with enhanced source and score information."""
    with st.container():
        # Source indicator and styling
        sources = result['sources']
        if len(sources) > 1:
            source_color = "#9C27B0"  # Purple for both sources
            source_badge = "🔄 Found in Both Sources"
            scores_text = f"BM25: {result['original_scores']['BM25']:.3f}, Vector: {result['original_scores']['Vector']:.3f}"
        elif 'Vector' in sources:
            source_color = "#2E7D32"  # Green for vector search
            source_badge = "🟢 Vector Search"
            scores_text = f"Score: {result['original_scores']['Vector']:.3f}"
        else:
            source_color = "#1565C0"  # Blue for BM25
            source_badge = "🔵 BM25 Search"
            scores_text = f"Score: {result['original_scores']['BM25']:.3f}"

        # Display header with source and score information
        st.markdown(f"""
        <div class="search-result">
            <h4 style="color: {source_color}">
                Result {index + 1} | {source_badge} | {scores_text}
            </h4>
        </div>
        """, unsafe_allow_html=True)

        # Display product details
        col1, col2 = st.columns(2)
        with col1:
            st.write("**Product Details:**")
            st.write(f"• Brand: {result['metadata']['brand']}")
            st.write(f"• Product: {result['metadata']['product_name']}")
            st.write(f"• Price: ${result['metadata']['price']:.2f}")

        with col2:
            st.write("**Additional Information:**")
            st.write(f"• Weight: {result['metadata']['weight']}kg")
            st.write(f"• Dog Type: {result['metadata']['dog_type']}")
            if 'reviews' in result['metadata']:
                st.write(f"• Reviews: {result['metadata']['reviews']}")

        st.markdown("**Description:**")
        st.write(result['text'])
        st.markdown("---")

def display_search_stats(results: List[Dict[str, Any]]) -> None:
    """Display detailed statistics about search results."""
    total_results = len(results)
    duplicates = sum(1 for r in results if len(r['sources']) > 1)
    vector_only = sum(1 for r in results if r['sources'] == ['Vector'])
    bm25_only = sum(1 for r in results if r['sources'] == ['BM25'])

    st.markdown("#### Search Results Statistics")
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        st.metric("Total Unique Results", total_results)
    with col2:
        st.metric("Found in Both Sources", duplicates, "🔄")
    with col3:
        st.metric("Vector Only", vector_only, "🟢")
    with col4:
        st.metric("BM25 Only", bm25_only, "🔵")

def main():
    # Header
    st.title("🐕 Dog Food Advisor")
    st.markdown("""
    Ask questions about dog food products in English or Spanish.
    The system will provide relevant recommendations based on your query.
    """)

    # Initialize QA system with diagnostics
    qa_system, vector_store_status = load_qa_system()

    # Display system status
    with st.sidebar:
        st.markdown("### System Status")
        if vector_store_status:
            st.success("Vector Store: Connected")
        else:
            st.error("Vector Store: Not Connected")
            st.warning("Only BM25 search will be available")

    # Query input
    query = st.text_input(
        "Enter your question:",
        placeholder="e.g., 'What's the best food for puppies?' or '¿Cuál es la mejor comida para perros adultos?'"
    )

    # Add a search button
    search_button = st.button("Search")

    if query and search_button:
        with st.spinner("Processing your query..."):
            try:
                # Process the query and time it
                start_time = time.time()
                result = qa_system.process_query(query)
                processing_time = time.time() - start_time

                # Display answer
                st.markdown("### Answer")
                st.write(result["answer"])

                # Display search stats
                display_search_stats(result["search_results"])

                # Display processing information
                st.markdown(f"""
                <div class='debug-info'>
                    Language detected: {result['language']} |
                    Processing time: {processing_time:.2f}s
                </div>
                """, unsafe_allow_html=True)

                # Display search results in an expander
                with st.expander("View Relevant Products", expanded=False):
                    st.markdown("### Search Results")
                    for i, search_result in enumerate(result["search_results"]):
                        display_search_result(search_result, i)

            except Exception as e:
                st.error(f"An error occurred: {e}")
                logging.error(f"Error processing query: {e}", exc_info=True)

    # Add footer with usage tips
    st.markdown("---")
    with st.expander("Usage Tips"):
        st.markdown("""
        - Ask questions in English or Spanish
        - Be specific about your dog's needs (age, size, special requirements)
        - Include price preferences (e.g., 'affordable', 'premium')
        - Results are ranked by relevance and include price, brand, and product details
        - Results are color-coded:
            - 🔵 Blue: BM25 Search Results
            - 🟢 Green: Vector Search Results
        """)

if __name__ == "__main__":
    main()
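The badge selection in `display_search_result` above is a pure decision on the `sources` list, so it can be factored into a small helper that is easy to unit-test away from Streamlit. A minimal sketch (the strings mirror the ones the app renders; the function name is hypothetical, not part of app.py):

```python
def badge_for(sources):
    """Pick the source badge the way app.py does: purple when a result was
    found by both retrievers, green for vector-only, blue for BM25-only."""
    if len(sources) > 1:
        return "🔄 Found in Both Sources"
    if "Vector" in sources:
        return "🟢 Vector Search"
    return "🔵 BM25 Search"
```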
bm25_index.pkl
ADDED
Binary file (280 kB)

checkpoints/checkpoint_batch_5.pkl
ADDED
Binary file (98.6 kB)
create_vector_stores.py
ADDED
@@ -0,0 +1,171 @@
import logging
from typing import List, Dict, Any
import pickle
import nltk
from nltk.tokenize import word_tokenize
from rank_bm25 import BM25Okapi
import chromadb
from chromadb.config import Settings
from openai import OpenAI
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
import os

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class VectorStoreCreator:
    """Create and manage the vector stores used for dog food product search."""

    def __init__(self, data_path: str):
        """
        Initialize the VectorStoreCreator.

        Args:
            data_path: Path to the pickle file containing the product data
        """
        # Load environment variables
        load_dotenv()

        # Initialize OpenAI client
        self.client = OpenAI()

        # Download NLTK resources
        nltk.download('punkt', quiet=True)

        # Load data
        self.df = pd.read_pickle(data_path)

        # Initialize stores
        self.bm25_model = None
        self.chroma_collection = None
        self.chunks = []
        self.metadata = []

    def prepare_data(self) -> None:
        """Prepare data for BM25 and embeddings."""
        logging.info("Preparing data for vector stores...")

        # Log initial DataFrame info
        total_rows = len(self.df)
        logging.info(f"Total rows in DataFrame: {total_rows}")

        for _, row in self.df.iterrows():
            # Combine English and Spanish descriptions
            combined_text = f"{row['description_en']} {row['description_es']}"
            self.chunks.append(combined_text)

            # Create metadata
            metadata = {
                "product_name": row["product_name"],
                "brand": row["brand"],
                "dog_type": row["dog_type"],
                "food_type": row["food_type"],
                "weight": float(row["weight"]),
                "price": float(row["price"]),
                "reviews": float(row["reviews"]) if pd.notna(row["reviews"]) else 0.0
            }
            self.metadata.append(metadata)

        # Log final chunk info
        logging.info(f"Total chunks created: {len(self.chunks)}")
        if len(self.chunks) != total_rows:
            logging.warning(f"Mismatch between DataFrame rows ({total_rows}) and chunks created ({len(self.chunks)})")

        # Log a sample of the first chunk
        if self.chunks:
            logging.info(f"Sample of first chunk: {self.chunks[0][:200]}...")

    def create_bm25_index(self, save_path: str = "bm25_index.pkl") -> None:
        """
        Create and save the BM25 index.

        Args:
            save_path: Path to save the BM25 index
        """
        logging.info("Creating BM25 index...")

        # Tokenize chunks
        tokenized_chunks = [word_tokenize(chunk.lower()) for chunk in self.chunks]

        # Create BM25 model
        self.bm25_model = BM25Okapi(tokenized_chunks)

        # Save the model and related data
        with open(save_path, 'wb') as f:
            pickle.dump({
                'model': self.bm25_model,
                'chunks': self.chunks,
                'metadata': self.metadata
            }, f)

        logging.info(f"BM25 index saved to {save_path}")

    def create_chroma_db(self, db_path: str = "chroma_db") -> None:
        """
        Create the ChromaDB database.

        Args:
            db_path: Path to save the ChromaDB
        """
        logging.info("Creating ChromaDB database...")

        # Initialize ChromaDB with the new client syntax
        client = chromadb.PersistentClient(path=db_path)

        # Create or get the collection
        self.chroma_collection = client.get_or_create_collection(
            name="dog_food_descriptions"
        )

        # Add documents in batches
        batch_size = 10
        for i in tqdm(range(0, len(self.chunks), batch_size)):
            batch_chunks = self.chunks[i:i + batch_size]
            batch_metadata = self.metadata[i:i + batch_size]
            batch_ids = [str(idx) for idx in range(i, min(i + batch_size, len(self.chunks)))]

            # Get embeddings for the batch
            embeddings = []
            for chunk in batch_chunks:
                response = self.client.embeddings.create(
                    model="text-embedding-ada-002",
                    input=chunk
                )
                embeddings.append(response.data[0].embedding)

            # Add the batch to the collection
            self.chroma_collection.add(
                embeddings=embeddings,
                metadatas=batch_metadata,
                documents=batch_chunks,
                ids=batch_ids
            )

        logging.info(f"ChromaDB saved to {db_path}")

def main():
    """Main execution function."""
    try:
        # Initialize the creator
        creator = VectorStoreCreator("3rd_clean_comida_dogs_enriched_multilingual_2.pkl")

        # Prepare data
        creator.prepare_data()

        # Create indices
        creator.create_bm25_index()
        creator.create_chroma_db()

        logging.info("Vector stores created successfully!")

    except Exception as e:
        logging.error(f"An error occurred: {e}")
        raise

if __name__ == "__main__":
    main()
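`create_bm25_index` above delegates the actual scoring to `rank_bm25.BM25Okapi`, which ranks each tokenized chunk by the Okapi BM25 formula (the library's defaults are k1 = 1.5, b = 0.75). For intuition, here is a stdlib-only sketch of that scoring; it uses a common non-negative IDF variant rather than rank_bm25's exact IDF with negative-value smoothing, so treat it as illustrative, not a drop-in replacement:

```python
import math

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` (lists of tokens) against `query`
    with the Okapi BM25 formula: idf(term) * saturated, length-normalized tf."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    scores = []
    for doc in corpus:
        score = 0.0
        for term in query:
            tf = doc.count(term)
            df = sum(1 for d in corpus if term in d)
            # +1 inside the log keeps idf non-negative for very common terms
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Terms absent from a document contribute zero, and the k1/b terms dampen repeated occurrences and penalize longer documents, which is why BM25 complements the semantic ChromaDB search well for exact brand and product-name matches.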
data_cleansing/2nd_clean_comida_dogs_filtered.pkl
ADDED
Binary file (26.9 kB)

data_cleansing/clean_comida_dogs_categoria_cleaned.pkl
ADDED
Binary file (89.4 kB)
diagnostic_notebook.ipynb
ADDED
@@ -0,0 +1,202 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "from qa_backend import DogFoodQASystem\n",
    "\n",
    "# Configure logging to show everything\n",
    "logging.basicConfig(\n",
    "    level=logging.INFO,\n",
    "    format='%(asctime)s - %(levelname)s - %(message)s'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-01-19 17:56:19,823 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Initializing QA System...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-01-19 17:56:20,074 - INFO - \n",
      "Diagnosing Vector Store:\n",
      "2025-01-19 17:56:20,082 - INFO - Collection name: dog_food_descriptions\n",
      "2025-01-19 17:56:20,082 - INFO - Number of documents: 84\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Running Vector Store Diagnostics...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-01-19 17:56:21,233 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
      "2025-01-19 17:56:21,259 - INFO - ✅ Vector store test query successful\n"
     ]
    }
   ],
   "source": [
    "# Initialize the QA system\n",
    "print(\"Initializing QA System...\")\n",
    "qa_system = DogFoodQASystem()\n",
    "\n",
    "# Run diagnostics\n",
    "print(\"\\nRunning Vector Store Diagnostics...\")\n",
    "vector_store_status = qa_system.diagnose_vector_store()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Testing with query: What's the best premium food for adult dogs?\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-01-19 17:56:37,332 - INFO - \n",
      "==================================================\n",
      "Starting hybrid search for query: What's the best premium food for adult dogs?\n",
      "2025-01-19 17:56:37,335 - INFO - ChromaDB collection info:\n",
      "2025-01-19 17:56:37,336 - INFO - - Number of documents: 84\n",
      "2025-01-19 17:56:37,336 - INFO - - Collection name: dog_food_descriptions\n",
      "2025-01-19 17:56:37,341 - INFO - \n",
      "BM25 Search Results:\n",
      "2025-01-19 17:56:37,342 - INFO - Found 5 results\n",
      "2025-01-19 17:56:37,342 - INFO - \n",
      "Generating embedding for query...\n",
      "2025-01-19 17:56:38,091 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n",
      "2025-01-19 17:56:38,093 - INFO - Embedding generated successfully. Dimension: 1536\n",
      "2025-01-19 17:56:38,094 - INFO - \n",
      "Performing ChromaDB search...\n",
      "2025-01-19 17:56:38,099 - INFO - ChromaDB raw results:\n",
      "2025-01-19 17:56:38,100 - INFO - - Number of results: 5\n",
      "2025-01-19 17:56:38,100 - INFO - - Keys in results: dict_keys(['ids', 'distances', 'metadatas', 'embeddings', 'documents', 'uris', 'data'])\n",
      "2025-01-19 17:56:38,100 - INFO - \n",
      "Vector result 1:\n",
      "2025-01-19 17:56:38,100 - INFO - - Score: 0.6637\n",
      "2025-01-19 17:56:38,100 - INFO - - Text preview: **Introducing Dowolf Snack Para Perro Galletas - The Premium Treat for Your Adult Dog!**\n",
      "\n",
      "**Brand:**...\n",
      "2025-01-19 17:56:38,101 - INFO - \n",
      "Vector result 2:\n",
      "2025-01-19 17:56:38,101 - INFO - - Score: 0.6391\n",
      "2025-01-19 17:56:38,101 - INFO - - Text preview: ### Dogourmet Alimento Seco Para Perro Adulto Carne Parrilla 4kg\n",
      "\n",
      "**Elevate Your Dog’s Dining Experi...\n",
      "2025-01-19 17:56:38,102 - INFO - \n",
      "Vector result 3:\n",
      "2025-01-19 17:56:38,102 - INFO - - Score: 0.6388\n",
      "2025-01-19 17:56:38,102 - INFO - - Text preview: ### Discover the Ultimate in Canine Nutrition with Chunky Alimento Seco Para Perro Adulto Nuggets De...\n",
      "2025-01-19 17:56:38,103 - INFO - \n",
      "Vector result 4:\n",
      "2025-01-19 17:56:38,103 - INFO - - Score: 0.6338\n",
      "2025-01-19 17:56:38,103 - INFO - - Text preview: **Unleash the Gourmet Experience with Dogourmet Alimento Seco Para Perros Pavo Y Pollo**\n",
      "\n",
      "Elevate yo...\n",
      "2025-01-19 17:56:38,104 - INFO - \n",
      "Vector result 5:\n",
      "2025-01-19 17:56:38,104 - INFO - - Score: 0.6328\n",
      "2025-01-19 17:56:38,104 - INFO - - Text preview: **Introducing Chunky Snack Para Perro Bombonera Deli Dent – The Ultimate Gourmet Snack for Adult Dog...\n",
      "2025-01-19 17:56:38,105 - INFO - \n",
      "Processed 5 vector results\n",
      "2025-01-19 17:56:38,105 - INFO - \n",
      "Final results distribution:\n",
      "2025-01-19 17:56:38,105 - INFO - - BM25 results: 5\n",
      "2025-01-19 17:56:38,106 - INFO - - Vector results: 0\n",
      "2025-01-19 17:56:38,106 - INFO - ==================================================\n",
      "2025-01-19 17:56:39,662 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Results Distribution:\n",
      "- BM25 Results: 5\n",
      "- Vector Results: 0\n"
     ]
    }
   ],
   "source": [
    "# Test with a sample query\n",
    "test_query = \"What's the best premium food for adult dogs?\"\n",
    "print(f\"\\nTesting with query: {test_query}\")\n",
    "\n",
    "result = qa_system.process_query(test_query)\n",
    "\n",
    "# Display results statistics\n",
    "bm25_count = sum(1 for r in result['search_results'] if r['source'] == 'BM25')\n",
    "vector_count = sum(1 for r in result['search_results'] if r['source'] == 'Vector')\n",
    "\n",
    "print(f\"\\nResults Distribution:\")\n",
    "print(f\"- BM25 Results: {bm25_count}\")\n",
    "print(f\"- Vector Results: {vector_count}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "chats_langchain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
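The diagnostic run above logs "Processed 5 vector results" yet reports "Vector results: 0" in the final distribution, and the notebook counts with `r['source']` (a string) while app.py reads `r['sources']` (a list). A tolerant counter like the sketch below handles both shapes; the key names are taken from those two files, but whether `qa_backend.py` actually emits one or the other is an assumption worth verifying:

```python
def source_distribution(search_results, key="sources"):
    """Count how many results came from each retriever. `key` defaults to
    "sources" (a list per result, as app.py expects); pass key="source"
    if the backend stores a single string instead."""
    counts = {"BM25": 0, "Vector": 0, "Both": 0}
    for r in search_results:
        value = r.get(key)
        # Normalize a bare string into a one-element list
        sources = [value] if isinstance(value, str) else list(value or [])
        if len(sources) > 1:
            counts["Both"] += 1
        elif sources:
            counts[sources[0]] += 1
    return counts
```

Running both key variants against a real `process_query` result would quickly confirm whether the zero vector count in the log reflects the merge logic or just a key-name mismatch in the notebook's counting cell.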
enriching_description.py
ADDED
@@ -0,0 +1,237 @@
import logging
import time
import os
import glob

import openai
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv, find_dotenv
from langsmith import Client, traceable
from langsmith.wrappers import wrap_openai

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Load environment variables
_ = load_dotenv(find_dotenv())

# Initialize LangSmith client
langsmith_client = Client()
# Wrap the OpenAI module with LangSmith tracing
openai = wrap_openai(openai)


@traceable(run_type="chain")
def create_product_prompt(row: pd.Series, language: str) -> str:
    """
    Create a detailed prompt for product description generation.

    Args:
        row: DataFrame row containing product information
        language: Target language ('en' or 'es')

    Returns:
        str: Formatted prompt for the LLM
    """
    base_prompts = {
        'en': """Create a compelling and detailed marketing description for a premium dog food product.
Include the following information and expand with your knowledge:

• Brand: {brand}
• Product Name: {product_name}
• Specifically designed for: {dog_type}
• Type: {food_type}
• Package Size: {weight} kg
• Price Point: ${price:.2f}

Focus on:
1. Key nutritional benefits
2. Quality of ingredients
3. Health advantages
4. Why it's perfect for the specified dog type
5. Value proposition

Make it engaging and persuasive while maintaining accuracy.""",

        'es': """Crea una descripción comercial detallada y convincente para un producto premium de alimentación canina.
Incluye la siguiente información y expándela con tu conocimiento:

• Marca: {brand}
• Nombre del Producto: {product_name}
• Diseñado específicamente para: {dog_type}
• Tipo: {food_type}
• Tamaño del Paquete: {weight} kg
• Precio: ${price:.2f}

Enfócate en:
1. Beneficios nutricionales clave
2. Calidad de los ingredientes
3. Ventajas para la salud
4. Por qué es perfecto para el tipo de perro especificado
5. Propuesta de valor

Hazlo atractivo y persuasivo mientras mantienes la precisión."""
    }
    prompt = base_prompts[language].format(**row.to_dict())
    return prompt


@traceable(run_type="chain")
def generate_description(row: pd.Series, language: str, retry_attempts: int = 3) -> str:
    """
    Generate a product description using OpenAI's API with retry logic.

    Args:
        row: DataFrame row containing product information
        language: Target language ('en' or 'es')
        retry_attempts: Number of retry attempts on failure

    Returns:
        str: Generated description or error message
    """
    prompt = create_product_prompt(row, language)

    for attempt in range(retry_attempts):
        try:
            response = openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150,
                temperature=0.7,
                presence_penalty=0.3,
                frequency_penalty=0.3
            )
            return response.choices[0].message.content.strip()

        except Exception as e:
            logging.error(f"Attempt {attempt + 1} failed: {str(e)}")
            if attempt == retry_attempts - 1:
                return f"Error generating {language} description: {e}"
            time.sleep(2 ** attempt)  # Exponential backoff


def save_checkpoint(df: pd.DataFrame, batch_num: int, checkpoint_dir: str = 'checkpoints') -> None:
    """
    Save a checkpoint of the current DataFrame.

    Args:
        df: DataFrame to checkpoint
        batch_num: Current batch number
        checkpoint_dir: Directory to store checkpoints
    """
    # Create checkpoint directory if it doesn't exist
    os.makedirs(checkpoint_dir, exist_ok=True)

    checkpoint_path = os.path.join(checkpoint_dir, f'checkpoint_batch_{batch_num}.pkl')
    df.to_pickle(checkpoint_path)
    logging.info(f"Saved checkpoint at batch {batch_num}")


def load_latest_checkpoint(checkpoint_dir: str = 'checkpoints') -> tuple[pd.DataFrame | None, int]:
    """
    Load the most recent checkpoint if it exists.

    Args:
        checkpoint_dir: Directory containing checkpoints

    Returns:
        tuple: (DataFrame or None, last completed batch number)
    """
    if not os.path.exists(checkpoint_dir):
        return None, 0

    checkpoint_files = glob.glob(os.path.join(checkpoint_dir, 'checkpoint_batch_*.pkl'))
    if not checkpoint_files:
        return None, 0

    # Pick the highest batch number numerically: a plain max() compares paths
    # lexicographically, so checkpoint_batch_10 would sort before checkpoint_batch_5
    batch_number = lambda path: int(path.split('_')[-1].split('.')[0])
    latest_checkpoint = max(checkpoint_files, key=batch_number)
    batch_num = batch_number(latest_checkpoint)

    logging.info(f"Loading checkpoint from batch {batch_num}")
    return pd.read_pickle(latest_checkpoint), batch_num


@traceable(run_type="chain")
def enrich_descriptions(df: pd.DataFrame, batch_size: int = 10, checkpoint_frequency: int = 5) -> pd.DataFrame:
    """
    Enrich DataFrame with product descriptions in both languages.

    Args:
        df: Input DataFrame
        batch_size: Number of items to process in each batch
        checkpoint_frequency: Number of batches between checkpoints

    Returns:
        pd.DataFrame: Enriched DataFrame with new description columns
    """
    logging.info("Starting description generation process...")

    initial_row_count = len(df)
    df = df.copy()

    # Try to resume from the latest checkpoint
    checkpoint_df, last_batch = load_latest_checkpoint()
    if checkpoint_df is not None:
        df = checkpoint_df
        start_idx = (last_batch + 1) * batch_size
        logging.info(f"Resuming from batch {last_batch + 1}")
    else:
        start_idx = 0

    for batch_num, i in enumerate(tqdm(range(start_idx, len(df), batch_size)), start=last_batch + 1):
        batch = df.iloc[i:i + batch_size]

        df.loc[batch.index, 'description_en'] = batch.apply(
            lambda row: generate_description(row, 'en'), axis=1
        )
        df.loc[batch.index, 'description_es'] = batch.apply(
            lambda row: generate_description(row, 'es'), axis=1
        )

        if batch_num % checkpoint_frequency == 0:
            save_checkpoint(df, batch_num)

        time.sleep(1)  # Rate limiting

    # Validate row counts and description completeness
    final_row_count = len(df)
    if final_row_count != initial_row_count:
        raise ValueError(f"Row count mismatch: started with {initial_row_count} rows, ended with {final_row_count} rows")

    # Check for missing descriptions
    missing_en = df['description_en'].isna().sum()
    missing_es = df['description_es'].isna().sum()
    if missing_en > 0 or missing_es > 0:
        logging.warning(f"Missing descriptions detected: English: {missing_en}, Spanish: {missing_es}")

    return df


def main():
    """Main execution function."""
    try:
        # Load the dataset
        file_path = '2nd_clean_comida_dogs_filtered.pkl'
        data = pd.read_pickle(file_path)
        initial_count = len(data)
        logging.info(f"Loaded dataset with {initial_count} records")

        # Enrich with descriptions
        enriched_data = enrich_descriptions(data)

        # Final validation before saving
        if len(enriched_data) != initial_count:
            raise ValueError(f"Row count mismatch: original had {initial_count} rows, enriched has {len(enriched_data)} rows")

        # Save the enriched dataset
        output_path = '3rd_clean_comida_dogs_enriched_multilingual_2.pkl'
        enriched_data.to_pickle(output_path)
        logging.info(f"Enriched dataset saved to {output_path}")

    except Exception as e:
        logging.error(f"An error occurred: {e}")
        raise


if __name__ == "__main__":
    main()
qa_backend.py
ADDED
@@ -0,0 +1,296 @@
import logging
import os
import pickle
from typing import Dict, List, Any

import chromadb
import numpy as np
from nltk.tokenize import word_tokenize
from openai import OpenAI
from dotenv import load_dotenv
from langsmith import Client, traceable
from langsmith.wrappers import wrap_openai

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load environment variables
load_dotenv()

# Initialize LangSmith client
langsmith_client = Client()
# Wrap OpenAI client with LangSmith
openai = wrap_openai(OpenAI())


def detect_language(text: str) -> str:
    """
    Simple language detection for English/Spanish based on common words.

    Args:
        text: Input text whose language to detect

    Returns:
        str: 'es' for Spanish, 'en' for English
    """
    # Common Spanish words
    spanish_indicators = {'qué', 'cuál', 'cómo', 'dónde', 'por', 'para', 'perro', 'comida',
                          'mejor', 'precio', 'barato', 'caro', 'cachorro', 'adulto'}

    # Convert to lowercase for comparison
    text_lower = text.lower()

    # Count Spanish indicators (substring containment, so short words
    # like 'por' can also match inside longer English words)
    spanish_count = sum(1 for word in spanish_indicators if word in text_lower)

    # If any Spanish indicator appears, classify as Spanish; otherwise default to English
    return 'es' if spanish_count > 0 else 'en'


class DogFoodQASystem:
    def __init__(self):
        """Initialize the QA system with vector stores and models."""
        self.load_stores()

    def load_stores(self) -> None:
        """Load the BM25 index and the ChromaDB collection."""
        with open('bm25_index.pkl', 'rb') as f:
            self.bm25_data = pickle.load(f)

        self.chroma_client = chromadb.PersistentClient(path="chroma_db")
        self.collection = self.chroma_client.get_collection("dog_food_descriptions")

    @traceable(run_type="chain")
    def hybrid_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        """
        Hybrid search: take the top_k results from each source and combine the unique ones.
        """
        logging.info(f"\n{'='*50}\nStarting hybrid search for query: {query}")

        # BM25 search - get top_k results
        tokenized_query = word_tokenize(query.lower())
        bm25_scores = self.bm25_data['model'].get_scores(tokenized_query)
        bm25_indices = np.argsort(bm25_scores)[::-1][:top_k]

        bm25_results = [
            {
                'score': float(bm25_scores[idx]),
                'text': self.bm25_data['chunks'][idx],
                'metadata': self.bm25_data['metadata'][idx],
                'source': 'BM25'
            }
            for idx in bm25_indices
        ]
        logging.info(f"Retrieved {len(bm25_results)} results from BM25")

        # Vector search - get top_k results
        try:
            embedding_response = openai.embeddings.create(
                model="text-embedding-ada-002",
                input=query
            )
            query_embedding = embedding_response.data[0].embedding

            chroma_results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=top_k,
                include=["documents", "metadatas", "distances"]
            )

            processed_vector_results = [
                {
                    'score': float(1 - distance),  # Convert distance to a similarity score
                    'text': doc,
                    'metadata': meta,
                    'source': 'Vector'
                }
                for doc, meta, distance in zip(
                    chroma_results['documents'][0],
                    chroma_results['metadatas'][0],
                    chroma_results['distances'][0]
                )
            ]
            logging.info(f"Retrieved {len(processed_vector_results)} results from vector search")

        except Exception as e:
            logging.error(f"Error in vector search: {str(e)}", exc_info=True)
            processed_vector_results = []

        # Combine results
        all_results = self._smart_combine_results(bm25_results, processed_vector_results, query)
        return all_results

    def _smart_combine_results(self, bm25_results: List[Dict], vector_results: List[Dict], query: str) -> List[Dict]:
        """
        Combine results from both sources, tracking duplicates and their origins.
        """
        logging.info("\nCombining search results...")

        # Keyed by result text so duplicates across sources collapse into one entry
        combined_dict = {}

        # Process vector results
        for result in vector_results:
            text = result['text']
            if text not in combined_dict:
                result['sources'] = ['Vector']
                result['original_scores'] = {'Vector': result['score']}
                combined_dict[text] = result
                logging.info(f"Added Vector result (score: {result['score']:.4f})")
            else:
                combined_dict[text]['sources'].append('Vector')
                combined_dict[text]['original_scores']['Vector'] = result['score']
                logging.info(f"Marked existing result as found by Vector (score: {result['score']:.4f})")

        # Process BM25 results
        for result in bm25_results:
            text = result['text']
            if text not in combined_dict:
                result['sources'] = ['BM25']
                result['original_scores'] = {'BM25': result['score']}
                combined_dict[text] = result
                logging.info(f"Added BM25 result (score: {result['score']:.4f})")
            else:
                combined_dict[text]['sources'].append('BM25')
                combined_dict[text]['original_scores']['BM25'] = result['score']
                logging.info(f"Marked existing result as found by BM25 (score: {result['score']:.4f})")

        # Convert to list
        all_results = list(combined_dict.values())

        # Calculate statistics
        total_results = len(all_results)
        duplicates = sum(1 for r in all_results if len(r['sources']) > 1)
        vector_only = sum(1 for r in all_results if r['sources'] == ['Vector'])
        bm25_only = sum(1 for r in all_results if r['sources'] == ['BM25'])

        logging.info("\nResults statistics:")
        logging.info(f"- Total unique results: {total_results}")
        logging.info(f"- Duplicates (found by both): {duplicates}")
        logging.info(f"- Vector only: {vector_only}")
        logging.info(f"- BM25 only: {bm25_only}")

        return all_results

    def _adjust_score_with_metadata(self, result: Dict, query: str) -> float:
        """Adjust a search score based on metadata relevance."""
        base_score = result['score']
        metadata = result['metadata']

        # Initialize boost factor
        boost = 1.0

        # Boost based on reviews (social proof)
        if metadata.get('reviews', 0) > 20:
            boost *= 1.2

        # Boost based on price range mentions
        query_lower = query.lower()
        if ('affordable' in query_lower or 'barato' in query_lower) and metadata.get('price', 0) < 50:
            boost *= 1.3
        elif 'premium' in query_lower and metadata.get('price', 0) > 100:
            boost *= 1.3

        # Boost based on specific dog type matches
        dog_types = ['puppy', 'adult', 'senior', 'cachorro', 'adulto']
        for dog_type in dog_types:
            if dog_type in query_lower and dog_type in metadata.get('dog_type', '').lower():
                boost *= 1.25
                break

        return base_score * boost

    @traceable(run_type="chain")
    def generate_answer(self, query: str, search_results: List[Dict]) -> str:
        """Generate a natural language answer based on search results."""
        # Detect query language
        query_lang = detect_language(query)

        # Prepare context from search results
        context = self._prepare_context(search_results)

        # Create prompt based on language, falling back to the English prompt
        system_prompts = {
            'es': """Eres un experto en nutrición canina. Responde a la pregunta utilizando solo el contexto proporcionado.
Si no puedes responder con el contexto dado, indícalo. Incluye información sobre precios y características
específicas de los productos cuando sea relevante.""",
            'en': """You are a dog nutrition expert. Answer the question using only the provided context.
If you cannot answer from the given context, say so. Include pricing and specific product
features when relevant."""
        }
        system_prompt = system_prompts.get(query_lang, system_prompts['en'])

        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
            ],
            temperature=0.7,
            max_tokens=300
        )

        return response.choices[0].message.content.strip()

    def _prepare_context(self, search_results: List[Dict]) -> str:
        """Format search results as context for the LLM."""
        context_parts = []
        for result in search_results:
            metadata = result['metadata']
            context_parts.append(
                f"Product: {metadata['product_name']}\n"
                f"Brand: {metadata['brand']}\n"
                f"Price: ${metadata['price']}\n"
                f"Weight: {metadata['weight']}kg\n"
                f"Dog Type: {metadata['dog_type']}\n"
                f"Description: {result['text']}\n"
            )
        return "\n---\n".join(context_parts)

    @traceable(run_type="chain")
    def process_query(self, query: str) -> Dict[str, Any]:
        """Process a user query and return both search results and an answer."""
        search_results = self.hybrid_search(query)
        answer = self.generate_answer(query, search_results)

        return {
            "answer": answer,
            "search_results": search_results,
            "language": detect_language(query)
        }

    def diagnose_vector_store(self):
        """Diagnose the vector store setup."""
        try:
            logging.info("\nDiagnosing vector store:")
            collection_info = self.collection.get()

            # Basic collection info
            doc_count = len(collection_info['ids'])
            logging.info(f"Collection name: {self.collection.name}")
            logging.info(f"Number of documents: {doc_count}")

            # Sample query test
            if doc_count > 0:
                test_query = "test query for diagnosis"
                test_embedding = openai.embeddings.create(
                    model="text-embedding-ada-002",
                    input=test_query
                ).data[0].embedding

                test_results = self.collection.query(
                    query_embeddings=[test_embedding],
                    n_results=1
                )

                if len(test_results['ids'][0]) > 0:
                    logging.info("✅ Vector store test query successful")
                    return True
                else:
                    logging.error("❌ Vector store returned no results for test query")
                    return False
            else:
                logging.error("❌ Vector store is empty")
                return False

        except Exception as e:
            logging.error(f"❌ Error accessing vector store: {str(e)}")
            return False
raw_data/clean_comida_dogs_categoria.pkl
ADDED
Binary file (100 kB). View file

raw_data/veterinarias_processed.pkl
ADDED
Binary file (12.6 kB). View file