Upload 9 files

Commit message: "For Testing use and explore the web data as you like"
- DEPLOYMENT.md +100 -0
- README.md +74 -14
- app.py +396 -0
- instagram_scraper.py +378 -0
- instagram_scraper_v2.py +184 -0
- requirements.txt +6 -2
- requirements_hf.txt +5 -0
- scraper.py +320 -0
- youtube_scraper.py +215 -0
DEPLOYMENT.md
ADDED
@@ -0,0 +1,100 @@
# 🚀 Deploy to Hugging Face Spaces

## Step-by-Step Guide

### 1. Create GitHub Repository

1. Go to [GitHub](https://github.com) and create a new repository
2. Name it something like `crawl4ai-streamlit-app`
3. Make it public (required for free Hugging Face Spaces)
4. Don't initialize with README (we already have one)

### 2. Push Code to GitHub

```bash
# Add your GitHub repository as remote
git remote add origin https://github.com/YOUR_USERNAME/crawl4ai-streamlit-app.git

# Push to GitHub
git branch -M main
git push -u origin main
```
### 3. Create Hugging Face Space

1. Go to [Hugging Face Spaces](https://huggingface.co/spaces)
2. Click "Create new Space"
3. Fill in the details:
   - **Owner**: Your username
   - **Space name**: `crawl4ai-web-scraper` (or any name you like)
   - **License**: MIT
   - **SDK**: Select "Streamlit"
   - **Space hardware**: CPU (free tier)
4. Click "Create Space"

### 4. Connect GitHub Repository

1. In your new Space, click "Settings"
2. Under "Repository", click "Connect to existing repository"
3. Select your GitHub repository
4. Set the path to `app.py`
5. Click "Connect"

### 5. Configure Environment

The Space will automatically:
- Install dependencies from `requirements.txt`
- Run `streamlit run app.py`
- Deploy your app

### 6. Access Your App

Your app will be available at:
`https://huggingface.co/spaces/YOUR_USERNAME/crawl4ai-web-scraper`
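The automatic install in step 5 depends on a `requirements.txt` at the repository root. The exact pins are up to you; a minimal sketch listing the packages this commit's modules import (`streamlit`, `pandas`, `openpyxl` for the Excel export, `requests`, `beautifulsoup4`) might look like:

```text
streamlit
pandas
openpyxl
requests
beautifulsoup4
```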
## 🛠️ Troubleshooting

### Common Issues:

1. **Dependencies not found**: Make sure `requirements.txt` is in the root directory
2. **App not loading**: Check the logs in the Space settings
3. **Selenium issues**: Hugging Face Spaces may have limitations with browser automation
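The Selenium caveat above usually comes down to Chrome needing container-friendly flags. A minimal sketch of how a headless driver is typically configured on constrained hosts — the `make_driver` helper and flag list are illustrative, not part of this repo, and the import is deferred so the snippet loads even where Selenium is not installed:

```python
# Flags commonly needed for Chrome inside containers (illustrative list).
HEADLESS_FLAGS = ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]

def make_driver():
    # Deferred import: requires `pip install selenium` plus a Chrome binary.
    from selenium import webdriver
    opts = webdriver.ChromeOptions()
    for flag in HEADLESS_FLAGS:
        opts.add_argument(flag)
    return webdriver.Chrome(options=opts)
```

Whether these flags suffice depends on the Space's base image; if browser automation still fails, the requests/BeautifulSoup path in `scraper.py` is the safer default.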
## 📋 Files Required for Deployment

- ✅ `app.py` - Main Streamlit app
- ✅ `requirements.txt` - Dependencies
- ✅ `scraper.py` - Web scraping module
- ✅ `youtube_scraper.py` - YouTube scraping module
- ✅ `README.md` - Documentation
- ✅ `.gitignore` - Git ignore rules

## 🎯 Your App URL

Once deployed, your app will be live at:
`https://huggingface.co/spaces/YOUR_USERNAME/crawl4ai-web-scraper`

## 🔧 Customization

You can customize your Space:
- **Title**: Change in Space settings
- **Description**: Add in README.md
- **Tags**: Add relevant tags for discoverability
- **Hardware**: Upgrade if needed (paid tier)

## 📊 Monitoring

- **Logs**: Check Space settings for runtime logs
- **Usage**: Monitor in Space analytics
- **Updates**: Push to GitHub to auto-deploy

---

**Need help?** Check the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces) or ask in the community!
README.md
CHANGED
@@ -1,20 +1,80 @@
Removed from the old version (front-matter fields the template had left empty, plus placeholder body lines):

- emoji:
- colorFrom:
- colorTo:
- sdk:
- pinned: false
- short_description: A powerful Streamlit app to scrape data from any website, in
- license: mit
- #
---
title: Scrape Anythings
emoji: ✨
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.35.0"
python_version: "3.9"
app_file: app.py
---

# ✨ Scrape Anythings

A user-friendly Streamlit web application for extracting data from any website, including special support for YouTube and Instagram.

## 🌟 Features

- **Scrape Any URL**: Paste any website, YouTube, or Instagram URL to start.
- **Multiple Data Types**: Extract text, images, links, tables, numbers, and metadata.
- **Social Media Support**: Scrape YouTube video info & comments, and Instagram profile details & posts.
- **Rich Data Export**: Download your data in JSON, CSV, TXT, and structured Excel (.xlsx) formats.
- **Modern UI**: A clean and simple interface for a smooth user experience.

## 🚀 How to Deploy on Hugging Face Spaces

1. **Create a Hugging Face Account**: If you don't have one, sign up at [huggingface.co](https://huggingface.co/).
2. **Create a New Space**:
   * Go to [huggingface.co/new-space](https://huggingface.co/new-space).
   * Enter a **Space name** (e.g., `scrape-anythings`).
   * Select **Streamlit** as the Space SDK.
   * Choose **Create a new repository for this Space**.
   * Click **Create Space**.
3. **Upload Your Files**:
   * In your new Space, go to the **Files** tab.
   * Click **Upload files**.
   * Drag and drop all the files from your project folder:
     * `app.py`
     * `scraper.py`
     * `youtube_scraper.py`
     * `instagram_scraper.py`
     * `instagram_scraper_v2.py`
     * `requirements.txt`
     * `README.md`
   * Commit the files directly to the `main` branch.
4. **Done!** Hugging Face will automatically build and launch your application. You can share the URL of your Space with anyone.
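The drag-and-drop upload in step 3 can also be scripted. A hedged sketch using `huggingface_hub` — the `repo_id` is a placeholder, and the snippet assumes `pip install huggingface_hub` and a prior `huggingface-cli login`:

```python
def deploy_to_space(folder=".", repo_id="YOUR_USERNAME/scrape-anythings"):
    """Upload a local project folder to a Hugging Face Space."""
    # Deferred import: requires `pip install huggingface_hub` and a login token.
    from huggingface_hub import HfApi
    HfApi().upload_folder(folder_path=folder, repo_id=repo_id, repo_type="space")
```

Re-running it after local changes pushes a new commit and triggers a rebuild, which is handy once the manual upload flow gets repetitive.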
## 📋 How to Use the App

1. **Enter a URL**: Paste the URL of the website, YouTube video, or Instagram profile you want to scrape.
2. **Select Data Types**: Choose the data you want to extract.
3. **Click Scrape!**: Let the app do the work.
4. **View & Download**: See the results directly in the app and download them in your preferred format.
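Step 1 works because the app routes each URL to the right scraper with a simple substring check. A minimal sketch of that detection logic, mirroring what `app.py` does in its sidebar:

```python
def detect_platform(url: str) -> str:
    """Classify a URL as 'youtube', 'instagram', or 'generic'."""
    u = url.lower()
    if "youtube.com" in u or "youtu.be" in u:
        return "youtube"
    if "instagram.com" in u:
        return "instagram"
    return "generic"
```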
## 🗺️ Roadmap

- [ ] Real-time scraping status
- [ ] Custom CSS selectors
- [ ] Proxy support
- [ ] Multi-language support

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Streamlit team for the amazing web app framework
- BeautifulSoup and Selenium communities
- Hugging Face for hosting capabilities

---

**Made with ❤️ for the AI/ML community**
app.py
ADDED
@@ -0,0 +1,396 @@
import streamlit as st
import pandas as pd
import json
import time
from datetime import datetime
import requests
from urllib.parse import urlparse
import io
import base64
from scraper import scraper
from youtube_scraper import youtube_scraper
from instagram_scraper import instagram_scraper
from instagram_scraper_v2 import instagram_scraper_v2

# Page configuration
st.set_page_config(
    page_title="Scrape Anythings",
    page_icon="🕷️",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS for better styling
st.markdown("""
<style>
.main-header {
    font-size: 2.5rem;
    font-weight: bold;
    color: #1f77b4;
    text-align: center;
    margin-bottom: 2rem;
}
.sub-header {
    font-size: 1.2rem;
    color: #666;
    text-align: center;
    margin-bottom: 2rem;
}
.metric-card {
    background-color: #f0f2f6;
    padding: 1rem;
    border-radius: 0.5rem;
    border-left: 4px solid #1f77b4;
}
.success-box {
    background-color: #d4edda;
    border: 1px solid #c3e6cb;
    border-radius: 0.5rem;
    padding: 1rem;
    margin: 1rem 0;
}
.error-box {
    background-color: #f8d7da;
    border: 1px solid #f5c6cb;
    border-radius: 0.5rem;
    padding: 1rem;
    margin: 1rem 0;
}
</style>
""", unsafe_allow_html=True)

def validate_url(url):
    """Validate that the URL has both a scheme and a host."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except (ValueError, AttributeError):
        return False

def perform_web_scraping(url, data_types, max_pages=1, rate_limit=2):
    """Perform actual web scraping using the WebScraper class."""
    st.info("🔍 Starting web scraping...")
    data_types_lower = [dt.lower() for dt in data_types]
    with st.spinner("Crawling website..."):
        scraped_data = scraper.scrape_website(url, data_types_lower, max_pages, rate_limit)
    return scraped_data

def display_results(scraped_data, is_youtube=False, is_instagram=False):
    """Display the scraped data in a user-friendly format."""
    if is_youtube:
        display_youtube_results(scraped_data)
    elif is_instagram:
        display_instagram_results(scraped_data)
    else:
        display_regular_results(scraped_data)

def display_text_results(text_data):
    st.write(f"**Title:** {text_data.get('title', 'N/A')}")
    with st.expander("Headings"):
        for heading in text_data.get("headings", []):
            st.write(f"- **{heading.get('level', 'h?')}**: {heading.get('text', '')}")
    with st.expander("Paragraphs"):
        for para in text_data.get("paragraphs", []):
            st.write(f"- {para}")

def display_image_results(images):
    cols = st.columns(min(4, len(images)))
    for i, img in enumerate(images):
        with cols[i % 4]:
            st.image(img.get("src", ""), caption=f"{img.get('alt', 'Image')[:50]}...", use_column_width=True)

def display_table_results(tables):
    for i, table in enumerate(tables):
        with st.expander(f"Table {i+1} (Header: {table.get('header', [])})"):
            df = pd.DataFrame(table.get('rows', []))
            st.dataframe(df)

def display_link_results(links):
    for link in links:
        st.write(f"- [{link.get('text', 'N/A')}]({link.get('href', '#')})")

def display_metadata_results(metadata):
    st.json(metadata)

def display_regular_results(scraped_data):
    """Display regular website scraping results in a structured format."""
    st.subheader("📝 Text Content")
    if scraped_data.get("text_content"):
        display_text_results(scraped_data["text_content"])
    else:
        st.info("No text content was extracted.")

    st.subheader("🖼️ Images")
    if scraped_data.get("images"):
        display_image_results(scraped_data["images"])
    else:
        st.info("No images were extracted.")

    st.subheader("🔢 Numbers")
    if scraped_data.get("numbers"):
        with st.expander("Extracted Numbers", expanded=False):
            st.write(scraped_data["numbers"])
    else:
        st.info("No numbers were extracted.")

    st.subheader("📊 Tables")
    if scraped_data.get("tables"):
        display_table_results(scraped_data["tables"])
    else:
        st.info("No tables were extracted.")

    st.subheader("🔗 Links")
    if scraped_data.get("links"):
        display_link_results(scraped_data["links"])
    else:
        st.info("No links were extracted.")

    st.subheader("📄 Metadata")
    if scraped_data.get("metadata"):
        display_metadata_results(scraped_data["metadata"])
    else:
        st.info("No metadata was extracted.")

def to_excel(data):
    """Converts a dictionary of scraped data to an Excel file in memory."""
    output = io.BytesIO()
    with pd.ExcelWriter(output, engine='openpyxl') as writer:
        # Handle simple lists (links, images, numbers)
        for key in ["links", "images", "numbers"]:
            if data.get(key):
                pd.DataFrame({key.capitalize(): data[key]}).to_excel(writer, sheet_name=key.capitalize(), index=False)

        # Handle text content
        if data.get("text_content"):
            pd.DataFrame({'Text': [data["text_content"]]}).to_excel(writer, sheet_name='Text', index=False)

        # Handle dictionaries (metadata, video_info, profile_info)
        for key in ["metadata", "video_info", "profile_info"]:
            if data.get(key):
                pd.DataFrame(data[key].items(), columns=['Property', 'Value']).to_excel(writer, sheet_name=key.replace('_', ' ').capitalize(), index=False)

        # Handle list of dictionaries (comments)
        if data.get("comments"):
            pd.DataFrame(data["comments"]).to_excel(writer, sheet_name='Comments', index=False)

        # Handle list of DataFrames (tables)
        if data.get("tables"):
            for i, table_df in enumerate(data["tables"]):
                table_df.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)

    processed_data = output.getvalue()
    return processed_data

def create_download_links(scraped_data):
    """Create download buttons for the supported export formats."""
    col1, col2, col3, col4 = st.columns(4)

    # JSON download
    with col1:
        json_str = json.dumps(scraped_data or {}, indent=2, default=str)
        st.download_button(
            label="Download JSON",
            data=json_str,
            file_name="scraped_data.json",
            mime="application/json",
            use_container_width=True
        )

    # CSV download
    with col2:
        if scraped_data.get("tables"):
            # For simplicity, we'll offer the first table as a CSV download
            csv = scraped_data["tables"][0].to_csv(index=False)
            st.download_button(
                label="Download CSV",
                data=csv,
                file_name="scraped_table.csv",
                mime="text/csv",
                use_container_width=True
            )
        else:
            st.button("Download CSV", disabled=True, help="No tables found to download.", use_container_width=True)

    # TXT download
    with col3:
        text_content = scraped_data.get("text_content", "")
        if not isinstance(text_content, str):
            # text_content may be a dict of title/headings/paragraphs; serialize it
            text_content = json.dumps(text_content, indent=2, default=str)
        st.download_button(
            label="Download TXT",
            data=text_content,
            file_name="scraped_text.txt",
            mime="text/plain",
            use_container_width=True
        )

    # Excel download
    with col4:
        try:
            excel_data = to_excel(scraped_data)
            st.download_button(
                label="Download Excel",
                data=excel_data,
                file_name="scraped_data.xlsx",
                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                use_container_width=True
            )
        except Exception as e:
            st.button("Download Excel", disabled=True, help=f"Excel export failed: {e}", use_container_width=True)

def display_youtube_results(scraped_data):
    """Display YouTube scraping results."""
    if not scraped_data.get("video_info"):
        st.error("Could not extract YouTube video information.")
        return

    video_info = scraped_data["video_info"]
    st.subheader(f'{video_info.get("title", "Untitled")}')
    st.write(f'**Channel:** {video_info.get("channel", "N/A")}')
    st.write(f'**Views:** {video_info.get("views", "N/A")}')

    with st.expander("Video Description"):
        st.write(video_info.get("description", "No description."))

    if "comments" in scraped_data and scraped_data["comments"]:
        with st.expander(f'Comments ({len(scraped_data["comments"])})'):
            for comment in scraped_data["comments"]:
                st.markdown(f"**{comment.get('author', 'Unknown')}** - {comment.get('timestamp', 'Unknown')}")
                st.write(comment.get('text', ''))
                if comment.get('likes', '0') != '0':
                    st.caption(f"👍 {comment.get('likes', '0')} likes")
                st.divider()

def display_instagram_results(scraped_data):
    """Display Instagram scraping results."""
    if not scraped_data.get("profile_info"):
        st.error("Could not extract Instagram profile information.")
        return

    profile_info = scraped_data["profile_info"]
    with st.expander("Profile Information", expanded=True):
        st.write(f'**Username:** {profile_info.get("username", "N/A")}')
        st.write(f'**Display Name:** {profile_info.get("display_name", "N/A")}')
        st.write(f'**Bio:** {profile_info.get("bio", "N/A")}')
        st.write(f'**Followers:** {profile_info.get("followers", "N/A")}')

def main():
    # Header
    st.markdown('<h1 class="main-header">✨ Scrape Anythings</h1>', unsafe_allow_html=True)
    st.markdown('<p class="sub-header">Extract data from any website with ease</p>', unsafe_allow_html=True)

    # Sidebar for configuration
    with st.sidebar:
        st.header("Configuration")

        url = st.text_input("Enter Website URL", placeholder="https://example.com")

        is_youtube = ("youtube.com" in url.lower() or "youtu.be" in url.lower()) if url else False
        is_instagram = "instagram.com" in url.lower() if url else False

        data_types, youtube_data_types, instagram_data_types, max_comments = [], [], [], 50

        if is_youtube:
            st.info("YouTube URL detected!")
            youtube_data_types = st.multiselect("YouTube Data Types", ["video_info", "comments"], default=["video_info", "comments"])
            if "comments" in youtube_data_types:
                max_comments = st.slider("Max Comments", 10, 200, 50)
        elif is_instagram:
            st.info("Instagram URL detected!")
            instagram_data_types = st.multiselect("Instagram Data Types", ["profile_info", "images", "posts"], default=["profile_info", "images"])
        else:
            data_types = st.multiselect("Data Types", ["Text", "Images", "Links", "Tables", "Metadata", "Numbers"], default=["Text", "Links"])

        st.subheader("Advanced Options")
        max_pages = st.slider("Max Pages", 1, 10, 1)
        rate_limit = st.slider("Rate Limit (s)", 1, 10, 2)

        scrape_button = st.button("Start Scraping", type="primary", use_container_width=True)

    # Main content area
    if scrape_button:
        if not url or not validate_url(url):
            st.error("Please enter a valid URL.")
            return

        # Validate that at least one data type is selected for the given URL type
        if is_youtube and not youtube_data_types:
            st.error("Please select at least one YouTube data type to extract.")
            return
        elif is_instagram and not instagram_data_types:
            st.error("Please select at least one Instagram data type to extract.")
            return
        elif not is_youtube and not is_instagram and not data_types:
            st.error("Please select at least one data type to extract.")
            return

        with st.spinner("Scraping in progress... Please wait."):
            try:
                scraped_data = {}
                if is_youtube:
                    scraped_data = youtube_scraper.scrape_youtube_video(url, "comments" in youtube_data_types, max_comments)
                elif is_instagram:
                    try:
                        scraped_data = instagram_scraper_v2.extract_instagram_data(url)
                    except Exception:
                        st.warning("Improved scraper failed, trying fallback...")
                        scraped_data = instagram_scraper.extract_instagram_data(url)
                else:
                    scraped_data = perform_web_scraping(url, data_types, max_pages, rate_limit)

                if scraped_data.get("errors"):
                    st.error(f'Errors: {scraped_data["errors"]}')

                # Check if any data was actually scraped before showing success
                has_data = any(scraped_data.get(key) for key in ["text_content", "images", "numbers", "tables", "links", "metadata", "video_info", "profile_info"])

                if has_data:
                    st.success("Scraping completed successfully!")
                    st.header("Scraping Results")
                    display_results(scraped_data, is_youtube, is_instagram)
                    st.header("Download Data")
                    create_download_links(scraped_data)
                else:
                    st.warning("No data was extracted. The website might be blocking scrapers or the content is not available.")

            except Exception as e:
                st.error(f"An unexpected error occurred: {e}")

    else:
        st.markdown("""
        ### How to Use
        1. **Enter URL** and **select data types** in the sidebar.
        2. Click **Start Scraping** to begin.
        3. View and **download the results** below.
        """)

if __name__ == "__main__":
    main()
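One detail in `create_download_links` worth noting: `json.dumps(..., default=str)` is what lets non-JSON types, such as the `datetime` timestamps the scrapers attach, survive export instead of raising `TypeError`. A small self-contained illustration:

```python
import json
from datetime import datetime

# default=str makes json.dumps fall back to str() for unsupported types.
record = {"url": "https://example.com", "timestamp": datetime(2024, 1, 1, 12, 0)}
json_str = json.dumps(record, indent=2, default=str)
```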
instagram_scraper.py
ADDED
@@ -0,0 +1,378 @@
import streamlit as st
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from datetime import datetime
from urllib.parse import urljoin, urlparse

class InstagramScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        })

    def extract_instagram_data(self, url):
        """Extract data from Instagram profile or post"""
        scraped_data = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "platform": "instagram",
            "images": [],
            "posts": [],
            "profile_info": {},
            "errors": []
        }

        try:
            # Determine if it's a profile or post URL
            if "/p/" in url or "/reel/" in url:
                # Single post
                scraped_data.update(self.extract_post_data(url))
            else:
                # Profile
                scraped_data.update(self.extract_profile_data(url))

        except Exception as e:
            scraped_data["errors"].append(f"Instagram scraping error: {str(e)}")

        # Check if we found any data
        if not scraped_data.get("images") and not scraped_data.get("posts") and not scraped_data.get("profile_info", {}).get("username"):
            scraped_data["errors"].append("No Instagram data found. This might be due to:")
            scraped_data["errors"].append("- Private or protected account")
            scraped_data["errors"].append("- Instagram's anti-scraping measures")
            scraped_data["errors"].append("- Network connectivity issues")
            scraped_data["errors"].append("- URL format issues")

        return scraped_data

    def extract_post_data(self, url):
        """Extract data from a single Instagram post"""
        post_data = {
            "post_type": "single_post",
            "images": [],
            "post_info": {}
        }

        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Look for image URLs in the page
            # Instagram loads images dynamically, so we need to look for patterns
            page_text = response.text

            # Find image URLs in the page source
            image_patterns = [
                # Instagram post images (high quality)
                r'"display_url":"([^"]+)"',
                r'"display_src":"([^"]+)"',
                r'"src":"([^"]*\.jpg[^"]*)"',
                r'"src":"([^"]*\.jpeg[^"]*)"',
                r'"src":"([^"]*\.png[^"]*)"',
                # Direct image URLs
                r'https://[^"]*\.jpg[^"]*',
                r'https://[^"]*\.jpeg[^"]*',
                r'https://[^"]*\.png[^"]*',
                # Instagram CDN URLs (high quality)
                r'https://scontent[^"]*\.jpg[^"]*',
# Instagram CDN URLs (high quality)
|
| 87 |
+
r'https://scontent[^"]*\.jpg[^"]*',
|
| 88 |
+
r'https://scontent[^"]*\.jpeg[^"]*',
|
| 89 |
+
r'https://scontent[^"]*\.png[^"]*',
|
| 90 |
+
# Additional Instagram patterns
|
| 91 |
+
r'"url":"([^"]*\.jpg[^"]*)"',
|
| 92 |
+
r'"url":"([^"]*\.jpeg[^"]*)"',
|
| 93 |
+
r'"url":"([^"]*\.png[^"]*)"'
|
| 94 |
+
]
|
| 95 |
+
|
| 96 |
+
found_images = set()
|
| 97 |
+
for pattern in image_patterns:
|
| 98 |
+
matches = re.findall(pattern, page_text)
|
| 99 |
+
for match in matches:
|
| 100 |
+
if match and ('instagram' in match.lower() or 'scontent' in match.lower()):
|
| 101 |
+
# Clean up the URL
|
| 102 |
+
clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
|
| 103 |
+
found_images.add(clean_url)
|
| 104 |
+
|
| 105 |
+
# Convert to image objects
|
| 106 |
+
for i, img_url in enumerate(list(found_images)):
|
| 107 |
+
post_data["images"].append({
|
| 108 |
+
"src": img_url,
|
| 109 |
+
"alt": f"Instagram post image {i+1}",
|
| 110 |
+
"title": f"Instagram post image {i+1}",
|
| 111 |
+
"width": "",
|
| 112 |
+
"height": ""
|
| 113 |
+
})
|
| 114 |
+
|
| 115 |
+
# Extract post information
|
| 116 |
+
post_data["post_info"] = {
|
| 117 |
+
"url": url,
|
| 118 |
+
"images_count": len(post_data["images"]),
|
| 119 |
+
"scraped_at": datetime.now().isoformat()
|
| 120 |
+
}
|
| 121 |
+
|
| 122 |
+
except Exception as e:
|
| 123 |
+
post_data["errors"] = [f"Failed to extract post data: {str(e)}"]
|
| 124 |
+
|
| 125 |
+
return post_data
|
| 126 |
+
|
| 127 |
+
def extract_profile_data(self, url):
|
| 128 |
+
"""Extract data from Instagram profile"""
|
| 129 |
+
profile_data = {
|
| 130 |
+
"profile_type": "account",
|
| 131 |
+
"images": [],
|
| 132 |
+
"profile_info": {},
|
| 133 |
+
"posts": []
|
| 134 |
+
}
|
| 135 |
+
|
| 136 |
+
try:
|
| 137 |
+
response = self.session.get(url, timeout=10)
|
| 138 |
+
response.raise_for_status()
|
| 139 |
+
|
| 140 |
+
soup = BeautifulSoup(response.text, 'html.parser')
|
| 141 |
+
page_text = response.text
|
| 142 |
+
|
| 143 |
+
# Extract profile information
|
| 144 |
+
profile_data["profile_info"] = self.extract_profile_info(soup, page_text)
|
| 145 |
+
|
| 146 |
+
# Extract recent posts first
|
| 147 |
+
profile_data["posts"] = self.extract_recent_posts(page_text)
|
| 148 |
+
|
| 149 |
+
# Extract images from profile page
|
| 150 |
+
profile_data["images"] = self.extract_profile_images(page_text)
|
| 151 |
+
|
| 152 |
+
# Extract images from individual posts (higher quality)
|
| 153 |
+
if profile_data["posts"]:
|
| 154 |
+
post_images = self.extract_images_from_posts(profile_data["posts"], max_posts=3)
|
| 155 |
+
if post_images:
|
| 156 |
+
profile_data["images"].extend(post_images)
|
| 157 |
+
|
| 158 |
+
except Exception as e:
|
| 159 |
+
profile_data["errors"] = [f"Failed to extract profile data: {str(e)}"]
|
| 160 |
+
|
| 161 |
+
return profile_data
|
| 162 |
+
|
| 163 |
+
def extract_profile_info(self, soup, page_text):
|
| 164 |
+
"""Extract profile information"""
|
| 165 |
+
profile_info = {
|
| 166 |
+
"username": "",
|
| 167 |
+
"display_name": "",
|
| 168 |
+
"bio": "",
|
| 169 |
+
"followers": "",
|
| 170 |
+
"following": "",
|
| 171 |
+
"posts_count": ""
|
| 172 |
+
}
|
| 173 |
+
|
| 174 |
+
try:
|
| 175 |
+
# Look for profile information in the page source
|
| 176 |
+
# Instagram loads this data dynamically, so we need to parse JSON
|
| 177 |
+
|
| 178 |
+
# Find JSON data in the page
|
| 179 |
+
json_patterns = [
|
| 180 |
+
r'window\._sharedData\s*=\s*({[^}]+})',
|
| 181 |
+
r'"profile_page":\s*({[^}]+})',
|
| 182 |
+
r'"user":\s*({[^}]+})'
|
| 183 |
+
]
|
| 184 |
+
|
| 185 |
+
for pattern in json_patterns:
|
| 186 |
+
matches = re.findall(pattern, page_text)
|
| 187 |
+
if matches:
|
| 188 |
+
try:
|
| 189 |
+
data = json.loads(matches[0])
|
| 190 |
+
# Extract profile info from JSON
|
| 191 |
+
if "user" in data:
|
| 192 |
+
user_data = data["user"]
|
| 193 |
+
profile_info["username"] = user_data.get("username", "")
|
| 194 |
+
profile_info["display_name"] = user_data.get("full_name", "")
|
| 195 |
+
profile_info["bio"] = user_data.get("biography", "")
|
| 196 |
+
profile_info["followers"] = user_data.get("followed_by", {}).get("count", "")
|
| 197 |
+
profile_info["following"] = user_data.get("follows", {}).get("count", "")
|
| 198 |
+
profile_info["posts_count"] = user_data.get("media", {}).get("count", "")
|
| 199 |
+
except:
|
| 200 |
+
continue
|
| 201 |
+
|
| 202 |
+
# Fallback: try to extract from HTML
|
| 203 |
+
if not profile_info["username"]:
|
| 204 |
+
title_tag = soup.find('title')
|
| 205 |
+
if title_tag:
|
| 206 |
+
title_text = title_tag.get_text()
|
| 207 |
+
if '(' in title_text and ')' in title_text:
|
| 208 |
+
username = title_text.split('(')[1].split(')')[0]
|
| 209 |
+
profile_info["username"] = username
|
| 210 |
+
|
| 211 |
+
except Exception as e:
|
| 212 |
+
profile_info["error"] = f"Failed to extract profile info: {str(e)}"
|
| 213 |
+
|
| 214 |
+
return profile_info
|
| 215 |
+
|
| 216 |
+
def extract_profile_images(self, page_text):
|
| 217 |
+
"""Extract images from profile page"""
|
| 218 |
+
images = []
|
| 219 |
+
|
| 220 |
+
try:
|
| 221 |
+
# Look for Instagram post images in the page source
|
| 222 |
+
# Instagram stores post images in JSON data
|
| 223 |
+
image_patterns = [
|
| 224 |
+
# Instagram post images (high quality)
|
| 225 |
+
r'"display_url":"([^"]+)"',
|
| 226 |
+
r'"display_src":"([^"]+)"',
|
| 227 |
+
r'"src":"([^"]*\.jpg[^"]*)"',
|
| 228 |
+
r'"src":"([^"]*\.jpeg[^"]*)"',
|
| 229 |
+
r'"src":"([^"]*\.png[^"]*)"',
|
| 230 |
+
# Direct image URLs
|
| 231 |
+
r'https://[^"]*\.jpg[^"]*',
|
| 232 |
+
r'https://[^"]*\.jpeg[^"]*',
|
| 233 |
+
r'https://[^"]*\.png[^"]*',
|
| 234 |
+
# Instagram CDN URLs
|
| 235 |
+
r'https://scontent[^"]*\.jpg[^"]*',
|
| 236 |
+
r'https://scontent[^"]*\.jpeg[^"]*',
|
| 237 |
+
r'https://scontent[^"]*\.png[^"]*',
|
| 238 |
+
# Additional Instagram patterns
|
| 239 |
+
r'"url":"([^"]*\.jpg[^"]*)"',
|
| 240 |
+
r'"url":"([^"]*\.jpeg[^"]*)"',
|
| 241 |
+
r'"url":"([^"]*\.png[^"]*)"'
|
| 242 |
+
]
|
| 243 |
+
|
| 244 |
+
found_images = set()
|
| 245 |
+
for pattern in image_patterns:
|
| 246 |
+
matches = re.findall(pattern, page_text)
|
| 247 |
+
for match in matches:
|
| 248 |
+
if match and ('instagram' in match.lower() or 'scontent' in match.lower()):
|
| 249 |
+
# Clean up the URL
|
| 250 |
+
clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
|
| 251 |
+
found_images.add(clean_url)
|
| 252 |
+
|
| 253 |
+
# Convert to image objects
|
| 254 |
+
for i, img_url in enumerate(list(found_images)):
|
| 255 |
+
images.append({
|
| 256 |
+
"src": img_url,
|
| 257 |
+
"alt": f"Instagram post image {i+1}",
|
| 258 |
+
"title": f"Instagram post image {i+1}",
|
| 259 |
+
"width": "",
|
| 260 |
+
"height": ""
|
| 261 |
+
})
|
| 262 |
+
|
| 263 |
+
except Exception as e:
|
| 264 |
+
st.error(f"Failed to extract profile images: {str(e)}")
|
| 265 |
+
|
| 266 |
+
return images
|
| 267 |
+
|
| 268 |
+
def extract_recent_posts(self, page_text):
|
| 269 |
+
"""Extract recent posts from profile"""
|
| 270 |
+
posts = []
|
| 271 |
+
|
| 272 |
+
try:
|
| 273 |
+
# Look for post URLs in the page source
|
| 274 |
+
post_patterns = [
|
| 275 |
+
r'"shortcode":"([^"]+)"',
|
| 276 |
+
r'/p/([^/"]+)',
|
| 277 |
+
r'/reel/([^/"]+)'
|
| 278 |
+
]
|
| 279 |
+
|
| 280 |
+
found_posts = set()
|
| 281 |
+
for pattern in post_patterns:
|
| 282 |
+
matches = re.findall(pattern, page_text)
|
| 283 |
+
for match in matches:
|
| 284 |
+
if match:
|
| 285 |
+
found_posts.add(match)
|
| 286 |
+
|
| 287 |
+
# Convert to post objects
|
| 288 |
+
for i, post_code in enumerate(list(found_posts)[:10]): # Convert set to list and limit to 10 posts
|
| 289 |
+
posts.append({
|
| 290 |
+
"shortcode": post_code,
|
| 291 |
+
"url": f"https://www.instagram.com/p/{post_code}/",
|
| 292 |
+
"index": i + 1
|
| 293 |
+
})
|
| 294 |
+
|
| 295 |
+
except Exception as e:
|
| 296 |
+
st.error(f"Failed to extract recent posts: {str(e)}")
|
| 297 |
+
|
| 298 |
+
return posts
|
| 299 |
+
|
| 300 |
+
def extract_images_from_posts(self, posts, max_posts=5):
|
| 301 |
+
"""Extract images from individual posts"""
|
| 302 |
+
all_images = []
|
| 303 |
+
|
| 304 |
+
try:
|
| 305 |
+
for i, post in enumerate(posts[:max_posts]):
|
| 306 |
+
try:
|
| 307 |
+
# Get the post page
|
| 308 |
+
post_url = post["url"]
|
| 309 |
+
response = self.session.get(post_url, timeout=10)
|
| 310 |
+
response.raise_for_status()
|
| 311 |
+
|
| 312 |
+
# Extract images from this post
|
| 313 |
+
post_images = self.extract_post_images(response.text)
|
| 314 |
+
|
| 315 |
+
# Add post context to images
|
| 316 |
+
for img in post_images:
|
| 317 |
+
img["post_url"] = post_url
|
| 318 |
+
img["post_index"] = i + 1
|
| 319 |
+
all_images.append(img)
|
| 320 |
+
|
| 321 |
+
# Small delay to be respectful
|
| 322 |
+
time.sleep(1)
|
| 323 |
+
|
| 324 |
+
except Exception as e:
|
| 325 |
+
st.warning(f"Failed to extract images from post {post['shortcode']}: {str(e)}")
|
| 326 |
+
continue
|
| 327 |
+
|
| 328 |
+
except Exception as e:
|
| 329 |
+
st.error(f"Failed to extract images from posts: {str(e)}")
|
| 330 |
+
|
| 331 |
+
return all_images
|
| 332 |
+
|
| 333 |
+
def extract_post_images(self, page_text):
|
| 334 |
+
"""Extract images from a single post page"""
|
| 335 |
+
images = []
|
| 336 |
+
|
| 337 |
+
try:
|
| 338 |
+
# Look for high-quality Instagram post images
|
| 339 |
+
image_patterns = [
|
| 340 |
+
# Instagram post images (high quality)
|
| 341 |
+
r'"display_url":"([^"]+)"',
|
| 342 |
+
r'"display_src":"([^"]+)"',
|
| 343 |
+
# Instagram CDN URLs (highest quality)
|
| 344 |
+
r'https://scontent[^"]*\.jpg[^"]*',
|
| 345 |
+
r'https://scontent[^"]*\.jpeg[^"]*',
|
| 346 |
+
r'https://scontent[^"]*\.png[^"]*',
|
| 347 |
+
# Additional patterns
|
| 348 |
+
r'"src":"([^"]*\.jpg[^"]*)"',
|
| 349 |
+
r'"src":"([^"]*\.jpeg[^"]*)"',
|
| 350 |
+
r'"src":"([^"]*\.png[^"]*)"'
|
| 351 |
+
]
|
| 352 |
+
|
| 353 |
+
found_images = set()
|
| 354 |
+
for pattern in image_patterns:
|
| 355 |
+
matches = re.findall(pattern, page_text)
|
| 356 |
+
for match in matches:
|
| 357 |
+
if match and ('scontent' in match.lower() or 'instagram' in match.lower()):
|
| 358 |
+
# Clean up the URL
|
| 359 |
+
clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
|
| 360 |
+
found_images.add(clean_url)
|
| 361 |
+
|
| 362 |
+
# Convert to image objects
|
| 363 |
+
for i, img_url in enumerate(list(found_images)):
|
| 364 |
+
images.append({
|
| 365 |
+
"src": img_url,
|
| 366 |
+
"alt": f"Instagram post image {i+1}",
|
| 367 |
+
"title": f"Instagram post image {i+1}",
|
| 368 |
+
"width": "",
|
| 369 |
+
"height": ""
|
| 370 |
+
})
|
| 371 |
+
|
| 372 |
+
except Exception as e:
|
| 373 |
+
st.error(f"Failed to extract post images: {str(e)}")
|
| 374 |
+
|
| 375 |
+
return images
|
| 376 |
+
|
| 377 |
+
# Global Instagram scraper instance
|
| 378 |
+
instagram_scraper = InstagramScraper()
|
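The scraper above repeatedly applies the same two-step idea: regex-match image URLs out of the raw page source, then undo JSON string escaping (`\u0026` → `&`, `\/` → `/`). A minimal sketch of that step on a hypothetical snippet of page source (the URL below is made up, not real Instagram data):

```python
import re

# Hypothetical fragment of Instagram page source, JSON-escaped
page_text = '"display_url":"https:\\/\\/scontent.cdninstagram.com\\/photo.jpg?x=1\\u0026y=2"'

found = set()
for match in re.findall(r'"display_url":"([^"]+)"', page_text):
    # Undo JSON escaping: \u0026 -> & and \/ -> /
    clean = match.replace('\\u0026', '&').replace('\\/', '/')
    found.add(clean)

print(found)  # {'https://scontent.cdninstagram.com/photo.jpg?x=1&y=2'}
```

Note the deduplication via a `set` mirrors how the class collects `found_images` before converting them to image objects.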
instagram_scraper_v2.py
ADDED
```python
import streamlit as st
import requests
import json
import re
import time
import random
from datetime import datetime

class InstagramScraperV2:
    def __init__(self):
        self.session = requests.Session()
        self.setup_session()

    def setup_session(self):
        """Setup session with better anti-detection measures"""
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        ]

        self.session.headers.update({
            'User-Agent': random.choice(user_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        })

    def get_page_with_retry(self, url, max_retries=3):
        """Get page with retry mechanism"""
        for attempt in range(max_retries):
            try:
                time.sleep(random.uniform(2, 4))
                response = self.session.get(url, timeout=20)
                response.raise_for_status()
                return response.text
            except Exception as e:
                st.warning(f"Attempt {attempt + 1} failed: {str(e)}")
                if attempt == max_retries - 1:
                    raise
        return None

    def extract_instagram_data(self, url):
        """Extract data from Instagram with improved error handling"""
        scraped_data = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "platform": "instagram",
            "images": [],
            "posts": [],
            "profile_info": {},
            "errors": []
        }

        try:
            page_text = self.get_page_with_retry(url)
            if not page_text:
                scraped_data["errors"].append("Failed to load Instagram page")
                return scraped_data

            # Extract images
            scraped_data["images"] = self.extract_images_from_page(page_text)

            # Extract profile info
            scraped_data["profile_info"] = self.extract_profile_info(page_text)

            # Extract posts
            scraped_data["posts"] = self.extract_recent_posts(page_text)

        except Exception as e:
            scraped_data["errors"].append(f"Instagram scraping error: {str(e)}")

        return scraped_data

    def extract_images_from_page(self, page_text):
        """Extract images with improved patterns"""
        images = []

        try:
            # Enhanced patterns for Instagram images
            patterns = [
                r'https://scontent[^"]*\.jpg[^"]*',
                r'https://scontent[^"]*\.jpeg[^"]*',
                r'https://scontent[^"]*\.png[^"]*',
                r'"display_url":"([^"]+)"',
                r'"display_src":"([^"]+)"'
            ]

            found_images = set()
            for pattern in patterns:
                matches = re.findall(pattern, page_text)
                for match in matches:
                    if match and ('scontent' in match.lower() or 'instagram' in match.lower()):
                        clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
                        found_images.add(clean_url)

            for i, img_url in enumerate(list(found_images)):
                images.append({
                    "src": img_url,
                    "alt": f"Instagram image {i+1}",
                    "title": f"Instagram image {i+1}",
                    "width": "",
                    "height": ""
                })

        except Exception as e:
            st.error(f"Failed to extract images: {str(e)}")

        return images

    def extract_profile_info(self, page_text):
        """Extract profile information"""
        profile_info = {
            "username": "",
            "display_name": "",
            "bio": "",
            "followers": "",
            "following": "",
            "posts_count": ""
        }

        try:
            # Extract username from title
            title_match = re.search(r'<title>([^<]+)</title>', page_text)
            if title_match:
                title = title_match.group(1)
                if '(' in title and ')' in title:
                    username = title.split('(')[1].split(')')[0]
                    profile_info["username"] = username

            # Look for JSON data
            json_patterns = [
                r'"username":"([^"]+)"',
                r'"full_name":"([^"]+)"',
                r'"biography":"([^"]+)"'
            ]

            for pattern in json_patterns:
                matches = re.findall(pattern, page_text)
                if matches:
                    if "username" in pattern:
                        profile_info["username"] = matches[0]
                    elif "full_name" in pattern:
                        profile_info["display_name"] = matches[0]
                    elif "biography" in pattern:
                        profile_info["bio"] = matches[0]

        except Exception as e:
            profile_info["error"] = f"Failed to extract profile info: {str(e)}"

        return profile_info

    def extract_recent_posts(self, page_text):
        """Extract recent posts"""
        posts = []

        try:
            post_patterns = [
                r'"shortcode":"([^"]+)"',
                r'/p/([^/"]+)'
            ]

            found_posts = set()
            for pattern in post_patterns:
                matches = re.findall(pattern, page_text)
                for match in matches:
                    if match:
                        found_posts.add(match)

            for i, post_code in enumerate(list(found_posts)[:10]):
                posts.append({
                    "shortcode": post_code,
                    "url": f"https://www.instagram.com/p/{post_code}/",
                    "index": i + 1
                })

        except Exception as e:
            st.error(f"Failed to extract posts: {str(e)}")

        return posts

# Global instance
instagram_scraper_v2 = InstagramScraperV2()
```
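Both scraper versions recover post URLs the same way: match shortcodes with a couple of regexes, deduplicate them in a set, and rebuild canonical post URLs. A small sketch on a hypothetical page fragment (the shortcodes here are made up):

```python
import re

# Hypothetical fragment of a profile page's source
page_text = '<a href="/p/AbC123/">post</a> "shortcode":"XyZ789"'

found = set()
for pattern in (r'"shortcode":"([^"]+)"', r'/p/([^/"]+)'):
    found.update(re.findall(pattern, page_text))

urls = sorted(f"https://www.instagram.com/p/{code}/" for code in found)
print(urls)
```

Because both patterns can match the same post, the set is what prevents duplicate entries before the list is truncated to the 10 most recent.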
requirements.txt
CHANGED
```
streamlit
pandas
requests
beautifulsoup4
selenium
lxml
openpyxl
```
requirements_hf.txt
ADDED
```
streamlit>=1.28.0
pandas>=1.5.0
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```
scraper.py
ADDED
```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time
import re
from urllib.parse import urljoin, urlparse
import json
from datetime import datetime

class WebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })
        self.driver = None

    def setup_selenium(self):
        """Setup Selenium WebDriver for dynamic content"""
        try:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--window-size=1920,1080")

            self.driver = webdriver.Chrome(
                service=Service(ChromeDriverManager().install()),
                options=chrome_options
            )
            return True
        except Exception as e:
            print(f"Failed to setup Selenium: {e}")
            return False

    def close_selenium(self):
        """Close Selenium WebDriver"""
        if self.driver:
            self.driver.quit()
            self.driver = None

    def get_page_content(self, url, use_selenium=False):
        """Get page content using requests or Selenium"""
        try:
            if use_selenium and self.driver:
                self.driver.get(url)
                time.sleep(2)  # Wait for dynamic content
                return self.driver.page_source
            else:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return response.text
        except Exception as e:
            print(f"Error fetching page: {e}")
            return None

    def extract_text_content(self, soup):
        """Extract text content from BeautifulSoup object"""
        text_data = {
            "title": "",
            "headings": [],
            "paragraphs": [],
            "lists": []
        }

        # Extract title
        title_tag = soup.find('title')
        if title_tag:
            text_data["title"] = title_tag.get_text().strip()

        # Extract headings
        for tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            headings = soup.find_all(tag)
            for heading in headings:
                text = heading.get_text().strip()
                if text:
                    text_data["headings"].append({
                        "level": tag,
                        "text": text
                    })

        # Extract paragraphs
        paragraphs = soup.find_all('p')
        for p in paragraphs:
            text = p.get_text().strip()
            if text and len(text) > 20:  # Filter out short text
                text_data["paragraphs"].append(text)

        # Extract lists
        lists = soup.find_all(['ul', 'ol'])
        for lst in lists:
            items = []
            for item in lst.find_all('li'):
                text = item.get_text().strip()
                if text:
                    items.append(text)
            if items:
                text_data["lists"].append({
                    "type": lst.name,
                    "items": items
                })

        return text_data

    def extract_numbers(self, soup):
        """Extract all numbers (integers and floats) from the text content"""
        text = soup.get_text()
        # Regex to find integers and floats
        numbers = re.findall(r'\b\d+\.?\d*\b', text)
        # Convert to float for consistency, and remove duplicates
        return sorted(list(set([float(n) for n in numbers if n.strip()])))

    def extract_images(self, soup, base_url):
        """Extract images from BeautifulSoup object"""
        images = []
        img_tags = soup.find_all('img')

        for img in img_tags:
            src = img.get('src', '')
            alt = img.get('alt', '')
            title = img.get('title', '')

            if src:
                # Make relative URLs absolute
                if not src.startswith(('http://', 'https://')):
                    src = urljoin(base_url, src)

                images.append({
                    "src": src,
                    "alt": alt,
                    "title": title,
                    "width": img.get('width', ''),
                    "height": img.get('height', '')
                })

        return images

    def extract_links(self, soup, base_url):
        """Extract links from BeautifulSoup object"""
        links = []
        link_tags = soup.find_all('a', href=True)

        for link in link_tags:
            href = link.get('href')
            text = link.get_text().strip()

            if href and text:
                # Make relative URLs absolute
                if not href.startswith(('http://', 'https://')):
                    href = urljoin(base_url, href)

                # Only include external and internal links, skip anchors
                if not href.startswith('#'):
                    links.append({
                        "href": href,
                        "text": text,
                        "title": link.get('title', ''),
                        "is_external": not href.startswith(base_url)
                    })

        return links

    def extract_tables(self, soup):
        """Extract tables from BeautifulSoup object"""
        tables = []
        table_tags = soup.find_all('table')

        for table in table_tags:
            table_data = {
                "headers": [],
                "rows": [],
                "caption": ""
            }

            # Extract caption
            caption = table.find('caption')
            if caption:
                table_data["caption"] = caption.get_text().strip()

            # Extract headers
            thead = table.find('thead')
            if thead:
                header_row = thead.find('tr')
                if header_row:
                    headers = header_row.find_all(['th', 'td'])
                    table_data["headers"] = [h.get_text().strip() for h in headers]

            # Extract rows
            tbody = table.find('tbody') or table
            rows = tbody.find_all('tr')

            for row in rows:
                cells = row.find_all(['td', 'th'])
                if cells:
                    row_data = [cell.get_text().strip() for cell in cells]
                    table_data["rows"].append(row_data)

            if table_data["rows"]:
                tables.append(table_data)

        return tables

    def extract_metadata(self, soup):
        """Extract metadata from BeautifulSoup object"""
        metadata = {
            "title": "",
            "description": "",
            "keywords": [],
            "author": "",
            "language": "en",
            "robots": "",
            "viewport": "",
            "charset": ""
        }

        # Extract title
        title_tag = soup.find('title')
        if title_tag:
            metadata["title"] = title_tag.get_text().strip()

        # Extract meta tags
        meta_tags = soup.find_all('meta')
        for meta in meta_tags:
            name = meta.get('name', '').lower()
            content = meta.get('content', '')
            property_attr = meta.get('property', '').lower()

            if name == 'description' or property_attr == 'og:description':
                metadata["description"] = content
            elif name == 'keywords':
                metadata["keywords"] = [kw.strip() for kw in content.split(',')]
            elif name == 'author':
                metadata["author"] = content
            elif name == 'robots':
                metadata["robots"] = content
            elif name == 'viewport':
                metadata["viewport"] = content
            elif property_attr == 'og:title':
                metadata["title"] = content or metadata["title"]

        # Extract charset
        charset_meta = soup.find('meta', charset=True)
        if charset_meta:
            metadata["charset"] = charset_meta.get('charset')

        # Extract language
        html_tag = soup.find('html')
        if html_tag:
            lang = html_tag.get('lang', 'en')
            metadata["language"] = lang

        return metadata

    def scrape_website(self, url, data_types, max_pages=1, rate_limit=2):
        """Main scraping function"""
        scraped_data = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "data_types": data_types,
            "pages_crawled": 0,
            "errors": []
        }

        try:
            # Setup Selenium if needed for dynamic content
            use_selenium = "images" in data_types or "tables" in data_types
            if use_selenium:
                if not self.setup_selenium():
                    scraped_data["errors"].append("Failed to setup Selenium for dynamic content")

            # Get page content
            content = self.get_page_content(url, use_selenium)
            if not content:
                scraped_data["errors"].append("Failed to fetch page content")
                return scraped_data

            # Parse with BeautifulSoup
            soup = BeautifulSoup(content, 'html.parser')
            scraped_data["pages_crawled"] = 1

            # Extract data based on selected types
            if "text" in data_types:
                scraped_data["text_content"] = self.extract_text_content(soup)

            if "images" in data_types:
                scraped_data["images"] = self.extract_images(soup, url)

            if "links" in data_types:
                scraped_data["links"] = self.extract_links(soup, url)

            if "tables" in data_types:
                scraped_data["tables"] = self.extract_tables(soup)

            if "metadata" in data_types:
                scraped_data["metadata"] = self.extract_metadata(soup)
```
|
| 302 |
+
|
| 303 |
+
if "numbers" in data_types:
|
| 304 |
+
scraped_data["numbers"] = self.extract_numbers(soup)
|
| 305 |
+
|
| 306 |
+
# Rate limiting
|
| 307 |
+
time.sleep(rate_limit)
|
| 308 |
+
|
| 309 |
+
except Exception as e:
|
| 310 |
+
scraped_data["errors"].append(f"Scraping error: {str(e)}")
|
| 311 |
+
|
| 312 |
+
finally:
|
| 313 |
+
# Clean up Selenium
|
| 314 |
+
if use_selenium:
|
| 315 |
+
self.close_selenium()
|
| 316 |
+
|
| 317 |
+
return scraped_data
|
| 318 |
+
|
| 319 |
+
# Global scraper instance
|
| 320 |
+
scraper = WebScraper()
|
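The `extract_metadata` branch above relies only on BeautifulSoup, so its meta-tag logic can be exercised without Selenium or a live page. A minimal standalone sketch against an in-memory HTML string (the sample markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html lang="de">
  <head>
    <meta charset="utf-8">
    <title>Demo</title>
    <meta name="description" content="A test page">
    <meta name="keywords" content="scraping, bs4">
    <meta property="og:title" content="Demo (OG)">
  </head>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
metadata = {"title": "", "description": "", "keywords": [], "language": "en"}

# <title> first, then og:title may override it, as in extract_metadata
title_tag = soup.find("title")
if title_tag:
    metadata["title"] = title_tag.get_text().strip()

for meta in soup.find_all("meta"):
    name = meta.get("name", "").lower()
    prop = meta.get("property", "").lower()
    content = meta.get("content", "")
    if name == "description" or prop == "og:description":
        metadata["description"] = content
    elif name == "keywords":
        metadata["keywords"] = [kw.strip() for kw in content.split(",")]
    elif prop == "og:title":
        metadata["title"] = content or metadata["title"]

# Language comes from the lang attribute on <html>
html_tag = soup.find("html")
if html_tag:
    metadata["language"] = html_tag.get("lang", "en")

print(metadata)
```

Because the `og:title` meta appears after `<title>` is read, it overwrites the plain title, matching the precedence in the scraper.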
youtube_scraper.py
ADDED
@@ -0,0 +1,215 @@
```python
import streamlit as st
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time
import re
import json
from datetime import datetime

class YouTubeScraper:
    def __init__(self):
        self.driver = None
        self.setup_selenium()

    def setup_selenium(self):
        """Setup Selenium WebDriver for YouTube"""
        try:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--window-size=1920,1080")
            chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

            self.driver = webdriver.Chrome(
                service=Service(ChromeDriverManager().install()),
                options=chrome_options
            )
            return True
        except Exception as e:
            st.error(f"Failed to setup Selenium: {e}")
            return False

    def close_selenium(self):
        """Close Selenium WebDriver"""
        if self.driver:
            self.driver.quit()
            self.driver = None

    def extract_video_info(self, url):
        """Extract basic video information"""
        try:
            self.driver.get(url)
            time.sleep(3)  # Wait for page to load

            # Extract video title
            title = ""
            try:
                title_element = self.driver.find_element(By.CSS_SELECTOR, "h1.ytd-video-primary-info-renderer")
                title = title_element.text
            except Exception:
                try:
                    title_element = self.driver.find_element(By.CSS_SELECTOR, "h1")
                    title = title_element.text
                except Exception:
                    title = "Title not found"

            # Extract channel name
            channel = ""
            try:
                channel_element = self.driver.find_element(By.CSS_SELECTOR, "ytd-channel-name yt-formatted-string a")
                channel = channel_element.text
            except Exception:
                channel = "Channel not found"

            # Extract view count
            views = ""
            try:
                views_element = self.driver.find_element(By.CSS_SELECTOR, "span.view-count")
                views = views_element.text
            except Exception:
                views = "Views not found"

            # Extract description
            description = ""
            try:
                desc_element = self.driver.find_element(By.CSS_SELECTOR, "ytd-expandable-video-description-body-text")
                description = desc_element.text
            except Exception:
                description = "Description not found"

            return {
                "title": title,
                "channel": channel,
                "views": views,
                "description": description,
                "url": url
            }

        except Exception as e:
            return {"error": f"Failed to extract video info: {str(e)}"}

    def extract_comments(self, url, max_comments=50):
        """Extract comments from YouTube video"""
        try:
            self.driver.get(url)
            time.sleep(3)

            # Scroll down to load comments
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

            # Try to find comments section
            comments = []

            # Method 1: Try to find comment elements
            try:
                comment_elements = self.driver.find_elements(By.CSS_SELECTOR, "ytd-comment-thread-renderer")

                for i, comment in enumerate(comment_elements[:max_comments]):
                    try:
                        # Extract author
                        author_element = comment.find_element(By.CSS_SELECTOR, "a#author-text")
                        author = author_element.text.strip()

                        # Extract comment text
                        text_element = comment.find_element(By.CSS_SELECTOR, "#content-text")
                        text = text_element.text.strip()

                        # Extract timestamp
                        time_element = comment.find_element(By.CSS_SELECTOR, "a.yt-simple-endpoint")
                        timestamp = time_element.text.strip()

                        # Extract likes
                        likes = "0"
                        try:
                            likes_element = comment.find_element(By.CSS_SELECTOR, "span#vote-count-middle")
                            likes = likes_element.text.strip()
                        except Exception:
                            pass

                        if text:  # Only add if comment has text
                            comments.append({
                                "author": author,
                                "text": text,
                                "timestamp": timestamp,
                                "likes": likes,
                                "comment_id": i
                            })

                    except Exception:
                        continue

            except Exception as e:
                st.warning(f"Could not extract comments using primary method: {e}")

            # Method 2: Alternative approach if first method fails
            if not comments:
                try:
                    # Look for any text that might be comments
                    page_text = self.driver.page_source
                    soup = BeautifulSoup(page_text, 'html.parser')

                    # Look for comment-like patterns
                    comment_patterns = [
                        "ytd-comment-renderer",
                        "comment-text",
                        "ytd-comment-thread-renderer"
                    ]

                    for pattern in comment_patterns:
                        elements = soup.find_all(attrs={"class": re.compile(pattern)})
                        for element in elements[:max_comments]:
                            text = element.get_text().strip()
                            if text and len(text) > 10:
                                comments.append({
                                    "author": "Unknown",
                                    "text": text,
                                    "timestamp": "Unknown",
                                    "likes": "0",
                                    "comment_id": len(comments)
                                })

                except Exception as e:
                    st.error(f"Alternative comment extraction failed: {e}")

            return comments

        except Exception as e:
            return [{"error": f"Failed to extract comments: {str(e)}"}]

    def scrape_youtube_video(self, url, extract_comments=True, max_comments=50):
        """Main function to scrape YouTube video data"""
        result = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "video_info": {},
            "comments": [],
            "errors": []
        }

        try:
            # Extract video information
            result["video_info"] = self.extract_video_info(url)

            # Extract comments if requested
            if extract_comments:
                result["comments"] = self.extract_comments(url, max_comments)

        except Exception as e:
            result["errors"].append(f"Scraping error: {str(e)}")

        finally:
            self.close_selenium()

        return result

# Global YouTube scraper instance
youtube_scraper = YouTubeScraper()
```
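The Method 2 fallback in `extract_comments` is plain BeautifulSoup work on the page source, so the class-pattern matching can be sketched against a static HTML snippet. The markup below is an invented stand-in for YouTube's real DOM, but the regex-on-class lookup and the length filter are the same:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment: two comment-like nodes and one unrelated widget
html = """
<div class="ytd-comment-thread-renderer">Great explanation, thanks!</div>
<div class="ytd-comment-thread-renderer">ok</div>
<div class="unrelated-widget">Subscribe now</div>
"""

soup = BeautifulSoup(html, "html.parser")
comments = []

# find_all with a compiled regex matches any element whose class list
# contains a matching name, as in the scraper's fallback
for element in soup.find_all(attrs={"class": re.compile("ytd-comment-thread-renderer")}):
    text = element.get_text().strip()
    if text and len(text) > 10:  # drop very short matches, same heuristic as the scraper
        comments.append({
            "author": "Unknown",
            "text": text,
            "comment_id": len(comments)
        })

print(comments)
```

The `len(text) > 10` heuristic discards the two-character "ok" node, so only the first comment survives; the unrelated widget never matches the class pattern at all.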