Spaces: fguryel/scikit-rag

fguryel committed on Sep 28, 2025

Commit 95d9c92

1 Parent(s): 34e78be

db fixed

Files changed (8)

.gitattributes +4 -0
README.md +19 -0
ab7fa527-b151-425e-9f81-9aa3f7b65f1d/data_level0.bin +3 -0
ab7fa527-b151-425e-9f81-9aa3f7b65f1d/header.bin +3 -0
ab7fa527-b151-425e-9f81-9aa3f7b65f1d/length.bin +3 -0
ab7fa527-b151-425e-9f81-9aa3f7b65f1d/link_lists.bin +0 -0
app.py +135 -5
chroma.sqlite3 +1 -1

.gitattributes CHANGED

@@ -33,4 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 chroma.sqlite3 filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+# Large database and data files
+*.sqlite3 filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
 chroma.sqlite3 filter=lfs diff=lfs merge=lfs -text
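
The two added patterns route every `.sqlite3` and `.json` file through Git LFS. Below is a minimal sketch of which file names they would catch. It assumes that Python's `fnmatchcase` is a close enough stand-in for how gitattributes matches slash-free patterns against a file's basename (Git's own matcher differs in edge cases). `tracked_by_lfs` is a hypothetical helper, not part of the repository; the candidate names are files mentioned elsewhere in this commit.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns added to .gitattributes in this commit.
NEW_LFS_PATTERNS = ["*.sqlite3", "*.json"]


def tracked_by_lfs(path: str, patterns=NEW_LFS_PATTERNS) -> bool:
    """Approximate check: slash-free patterns are compared with the basename only."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for candidate in ["chroma.sqlite3", "chunks.json", "app.py"]:
    print(candidate, tracked_by_lfs(candidate))
```

Under this approximation the existing `chroma.sqlite3` entry is already covered by the broader `*.sqlite3` pattern, and `chunks.json` would also be handled by LFS if it is committed.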

README.md CHANGED

@@ -10,10 +10,29 @@ pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 # Scikit-learn Documentation Q&A Bot 🤖
 A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.
 ## Features
 - **🔍 Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity

 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+---
+title: Scikit-learn Documentation Q&A Bot
+emoji: 🤖
+colorFrom: blue
+colorTo: green
+sdk: streamlit
+sdk_version: 1.50.0
+app_file: app.py
+pinned: false
+license: mit
+---
 # Scikit-learn Documentation Q&A Bot 🤖
 A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Scikit-learn using the official documentation.
+## How to Use on Hugging Face Spaces
+1. **Enter OpenAI API Key**: In the sidebar, enter your OpenAI API key
+2. **Ask Questions**: Type any question about Scikit-learn functionality
+3. **Get Answers**: Receive detailed responses with source documentation links
+4. **Explore**: Use the example questions or browse chat history
 ## Features
 - **🔍 Smart Retrieval**: Searches through 1,249+ documentation chunks using semantic similarity

ab7fa527-b151-425e-9f81-9aa3f7b65f1d/data_level0.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f97547c2466889737fdadcd740478420160f9c7094c36b6ae29c71d75887824e
+size 167600

ab7fa527-b151-425e-9f81-9aa3f7b65f1d/header.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a0e81c3b22454233bc12d0762f06dcca48261a75231cf87c79b75e69a6c00150
+size 100

ab7fa527-b151-425e-9f81-9aa3f7b65f1d/length.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5
+size 400
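
The three `.bin` files above are committed as Git LFS pointer files rather than binary content: each pointer is three `key value` lines (`version`, `oid`, `size`). A minimal sketch of reading one follows. `LfsPointer` and `parse_lfs_pointer` are hypothetical names introduced for illustration, and the input is the `length.bin` pointer shown above.

```python
from typing import NamedTuple


class LfsPointer(NamedTuple):
    version: str
    hash_algo: str
    digest: str
    size: int


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return LfsPointer(fields["version"], hash_algo, digest, int(fields["size"]))


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:7a12e561363385e9dfeeab326368731c030ed4b374e7f5897ac819159d2884c5\n"
    "size 400\n"
)
print(pointer.hash_algo, pointer.size)
```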

ab7fa527-b151-425e-9f81-9aa3f7b65f1d/link_lists.bin ADDED

File without changes

app.py CHANGED

@@ -11,6 +11,8 @@ Date: September 2025
 """
 import os
 import logging
 from typing import List, Dict, Any, Optional, Tuple
 import streamlit as st
@@ -65,16 +67,29 @@ class RAGChatbot:
         Initialize ChromaDB client and embedding model for retrieval.
         """
         try:
             # Initialize ChromaDB client
             self.chroma_client = chromadb.PersistentClient(
                 path=self.db_path,
                 settings=Settings(anonymized_telemetry=False)
             )
-            # Get collection
-            self.collection = self.chroma_client.get_collection(
-                name=self.collection_name
-            )
             # Load embedding model (same as used for building the database)
             self.embedding_model = SentenceTransformer(self.embedding_model_name)
@@ -83,6 +98,85 @@ class RAGChatbot:
         except Exception as e:
             logger.error(f"Failed to initialize retrieval system: {e}")
             raise
     def set_openai_client(self, api_key: str) -> bool:
@@ -297,9 +391,35 @@ def initialize_session_state():
     """Initialize Streamlit session state variables."""
     if 'chatbot' not in st.session_state:
         try:
             st.session_state.chatbot = RAGChatbot()
         except Exception as e:
-            st.error(f"Failed to initialize chatbot: {e}")
             st.stop()
     if 'openai_initialized' not in st.session_state:
@@ -325,6 +445,14 @@ def main():
     # Main title and description
     st.title("🤖 Scikit-learn Documentation Q&A Bot")
     st.markdown("""
     Welcome to the **Scikit-learn Documentation Q&A Bot**! This intelligent assistant can answer your questions about Scikit-learn using the official documentation.
@@ -332,6 +460,8 @@ def main():
     1. 🔍 **Retrieval**: Searches through 1,249+ documentation chunks
     2. 📝 **Augmentation**: Provides relevant context to the AI
     3. 🤖 **Generation**: Uses OpenAI to generate accurate answers
     """)
     # Sidebar for API key and settings

 """
 import os
+import sys
+import json
 import logging
 from typing import List, Dict, Any, Optional, Tuple
 import streamlit as st
         Initialize ChromaDB client and embedding model for retrieval.
         """
         try:
+            # Check if we're in Hugging Face Spaces environment
+            if os.path.exists('chroma.sqlite3'):
+                # We're likely in HF Spaces - use current directory
+                self.db_path = '.'
             # Initialize ChromaDB client
             self.chroma_client = chromadb.PersistentClient(
                 path=self.db_path,
                 settings=Settings(anonymized_telemetry=False)
             )
+            # Get or create collection
+            try:
+                self.collection = self.chroma_client.get_collection(
+                    name=self.collection_name
+                )
+            except Exception:
+                # If collection doesn't exist, try to recreate it from chunks
+                if os.path.exists('chunks.json'):
+                    st.warning("Database collection not found. Rebuilding from chunks...")
+                    self._rebuild_collection_from_chunks()
+                else:
+                    raise Exception("Neither database collection nor chunks.json found. Please build the database first.")
             # Load embedding model (same as used for building the database)
             self.embedding_model = SentenceTransformer(self.embedding_model_name)
         except Exception as e:
             logger.error(f"Failed to initialize retrieval system: {e}")
+            # In Streamlit, show user-friendly error
+            if 'streamlit' in sys.modules:
+                st.error(f"❌ Database initialization failed: {e}")
+                st.info("💡 This might be the first run. The database needs to be built from the scraped content.")
+            raise
+    def _rebuild_collection_from_chunks(self) -> None:
+        """
+        Rebuild the ChromaDB collection from chunks.json file.
+        This is useful for Hugging Face Spaces deployment.
+        """
+        try:
+            st.info("🔄 Rebuilding database collection from chunks...")
+            # Load chunks
+            with open('chunks.json', 'r', encoding='utf-8') as f:
+                chunks = json.load(f)
+            # Create collection
+            try:
+                self.chroma_client.delete_collection(name=self.collection_name)
+            except:
+                pass  # Collection might not exist
+            self.collection = self.chroma_client.create_collection(
+                name=self.collection_name,
+                metadata={"description": "Scikit-learn documentation embeddings"}
+            )
+            # Load embedding model if not loaded
+            if not hasattr(self, 'embedding_model') or self.embedding_model is None:
+                self.embedding_model = SentenceTransformer(self.embedding_model_name)
+            # Process chunks in batches
+            batch_size = 100
+            progress_bar = st.progress(0)
+            status_text = st.empty()
+            for i in range(0, len(chunks), batch_size):
+                batch_chunks = chunks[i:i + batch_size]
+                # Prepare data
+                texts = [chunk['page_content'] for chunk in batch_chunks]
+                metadatas = []
+                for chunk in batch_chunks:
+                    metadata = {
+                        'url': chunk['metadata']['url'],
+                        'chunk_index': str(chunk['metadata']['chunk_index']),
+                        'source': chunk['metadata'].get('source', 'scikit-learn-docs'),
+                        'content_length': str(len(chunk['page_content']))
+                    }
+                    metadatas.append(metadata)
+                # Create embeddings
+                embeddings = self.embedding_model.encode(texts).tolist()
+                # Generate IDs
+                ids = [f"chunk_{i+j}" for j in range(len(batch_chunks))]
+                # Add to collection
+                self.collection.add(
+                    ids=ids,
+                    documents=texts,
+                    metadatas=metadatas,
+                    embeddings=embeddings
+                )
+                # Update progress
+                progress = (i + batch_size) / len(chunks)
+                progress_bar.progress(min(progress, 1.0))
+                status_text.text(f"Processing chunks: {min(i + batch_size, len(chunks))}/{len(chunks)}")
+            progress_bar.empty()
+            status_text.empty()
+            st.success(f"✅ Successfully rebuilt collection with {len(chunks)} chunks!")
+        except Exception as e:
+            st.error(f"❌ Failed to rebuild collection: {e}")
             raise
     def set_openai_client(self, api_key: str) -> bool:
     """Initialize Streamlit session state variables."""
     if 'chatbot' not in st.session_state:
         try:
+            # Show initialization message
+            init_placeholder = st.empty()
+            init_placeholder.info("🔄 Initializing RAG system...")
             st.session_state.chatbot = RAGChatbot()
+            init_placeholder.empty()
         except Exception as e:
+            st.error(f"❌ Failed to initialize chatbot: {e}")
+            # Provide helpful instructions
+            st.markdown("""
+            ### 🔧 Troubleshooting
+            This error typically occurs when:
+            1. **First deployment**: The database hasn't been built yet
+            2. **Missing files**: Required data files are not available
+            ### 📋 Required Files
+            Make sure these files are present:
+            - `chunks.json` (processed text chunks)
+            - `chroma.sqlite3` (database file) OR `chroma_db/` directory
+            ### 🚀 Quick Fix for Hugging Face Spaces
+            If you're running this on Hugging Face Spaces, make sure you've uploaded:
+            1. All Python files (`app.py`, `build_vector_db.py`, etc.)
+            2. Data files (`chunks.json`, `scraped_content.json`)
+            3. Database files (`chroma.sqlite3` or the `chroma_db/` folder)
+            """)
             st.stop()
     if 'openai_initialized' not in st.session_state:
     # Main title and description
     st.title("🤖 Scikit-learn Documentation Q&A Bot")
+    # Show database status
+    try:
+        collection_count = st.session_state.chatbot.collection.count()
+        st.success(f"✅ Database ready with {collection_count:,} documentation chunks")
+    except:
+        st.warning("⚠️ Database status unknown")
     st.markdown("""
     Welcome to the **Scikit-learn Documentation Q&A Bot**! This intelligent assistant can answer your questions about Scikit-learn using the official documentation.
     1. 🔍 **Retrieval**: Searches through 1,249+ documentation chunks
     2. 📝 **Augmentation**: Provides relevant context to the AI
     3. 🤖 **Generation**: Uses OpenAI to generate accurate answers
+    **👈 To get started**: Enter your OpenAI API key in the sidebar!
     """)
     # Sidebar for API key and settings
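
The batching logic in `_rebuild_collection_from_chunks` can be exercised without ChromaDB, Streamlit, or an embedding model. The sketch below isolates only the slicing, ID generation, metadata flattening, and progress arithmetic from the diff. `iter_batches` and the synthetic `fake_chunks` list are hypothetical and exist only for illustration; the real method additionally encodes each batch and calls `collection.add`.

```python
from typing import Any, Dict, Iterator, List, Tuple

Batch = Tuple[List[str], List[str], List[Dict[str, str]], float]


def iter_batches(chunks: List[Dict[str, Any]], batch_size: int = 100) -> Iterator[Batch]:
    """Yield (ids, texts, metadatas, progress) per batch, mirroring the rebuild loop."""
    total = len(chunks)
    for start in range(0, total, batch_size):
        batch = chunks[start:start + batch_size]
        # IDs are global positions, so they stay unique across batches.
        ids = [f"chunk_{start + offset}" for offset in range(len(batch))]
        texts = [chunk["page_content"] for chunk in batch]
        metadatas = [
            {
                "url": chunk["metadata"]["url"],
                "chunk_index": str(chunk["metadata"]["chunk_index"]),
                "source": chunk["metadata"].get("source", "scikit-learn-docs"),
                "content_length": str(len(chunk["page_content"])),
            }
            for chunk in batch
        ]
        # The last batch may overshoot the total, hence the clamp to 1.0.
        progress = min((start + batch_size) / total, 1.0)
        yield ids, texts, metadatas, progress


# Synthetic stand-in for chunks.json (made-up data, not from the repository).
fake_chunks = [
    {"page_content": f"text {n}", "metadata": {"url": f"https://example.org/{n}", "chunk_index": n}}
    for n in range(250)
]
batches = list(iter_batches(fake_chunks))
print(len(batches), batches[-1][0][-1], [b[3] for b in batches])
```

Keeping this part free of side effects makes the ID scheme and the final partial batch easy to check before running the slower embedding step.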

chroma.sqlite3 CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9f0872c151b5912b9d3bfc3b9d1aef9b3d8770366e7d42ffc3f2a1044407e181
 size 13283328

 version https://git-lfs.github.com/spec/v1
+oid sha256:5641e3ed4b6a48b08f13e2b125000fe62c3eec109367b5c2c40799c25517e0ff
 size 13283328
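
In this pointer the `size` stays at 13283328 while the `oid` changes, so the database file was replaced by different content of the same length. A minimal sketch of checking a checked-out file against a pointer's `oid` and `size` follows. `matches_pointer` is a hypothetical helper, and the demonstration uses a throwaway temporary file rather than the real `chroma.sqlite3`.

```python
import hashlib
import os
import tempfile


def matches_pointer(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the file's size and SHA-256 digest match the pointer fields."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB blocks so large database files are not loaded at once.
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected_sha256


# Demonstration on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
demo_oid = hashlib.sha256(b"hello").hexdigest()

ok = matches_pointer(demo_path, demo_oid, 5)
wrong_size = matches_pointer(demo_path, demo_oid, 6)
os.remove(demo_path)
print(ok, wrong_size)
```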