Upload 9 files

Commit message: "For Testing use and explore the web data as you like"
- DEPLOYMENT.md +100 -0
- README.md +74 -14
- app.py +396 -0
- instagram_scraper.py +378 -0
- instagram_scraper_v2.py +184 -0
- requirements.txt +6 -2
- requirements_hf.txt +5 -0
- scraper.py +320 -0
- youtube_scraper.py +215 -0
DEPLOYMENT.md
ADDED
@@ -0,0 +1,100 @@
# 🚀 Deploy to Hugging Face Spaces

## Step-by-Step Guide

### 1. Create GitHub Repository

1. Go to [GitHub](https://github.com) and create a new repository
2. Name it something like `crawl4ai-streamlit-app`
3. Make it public (required for free Hugging Face Spaces)
4. Don't initialize with README (we already have one)

### 2. Push Code to GitHub

```bash
# Add your GitHub repository as remote
git remote add origin https://github.com/YOUR_USERNAME/crawl4ai-streamlit-app.git

# Push to GitHub
git branch -M main
git push -u origin main
```
### 3. Create Hugging Face Space

1. Go to [Hugging Face Spaces](https://huggingface.co/spaces)
2. Click "Create new Space"
3. Fill in the details:
   - **Owner**: Your username
   - **Space name**: `crawl4ai-web-scraper` (or any name you like)
   - **License**: MIT
   - **SDK**: Select "Streamlit"
   - **Space hardware**: CPU (free tier)
4. Click "Create Space"

### 4. Connect GitHub Repository

1. In your new Space, click "Settings"
2. Under "Repository", click "Connect to existing repository"
3. Select your GitHub repository
4. Set the path to `app.py`
5. Click "Connect"

### 5. Configure Environment

The Space will automatically:
- Install dependencies from `requirements.txt`
- Run `streamlit run app.py`
- Deploy your app

### 6. Access Your App

Your app will be available at:
`https://huggingface.co/spaces/YOUR_USERNAME/crawl4ai-web-scraper`
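The automatic install in step 5 depends on a `requirements.txt` at the repository root. The exact pins are up to you; a minimal sketch listing the packages this commit's modules import (`streamlit`, `pandas`, `openpyxl` for the Excel export, `requests`, `beautifulsoup4`) might look like:

```text
streamlit
pandas
openpyxl
requests
beautifulsoup4
```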
## 🛠️ Troubleshooting

### Common Issues:

1. **Dependencies not found**: Make sure `requirements.txt` is in the root directory
2. **App not loading**: Check the logs in the Space settings
3. **Selenium issues**: Hugging Face Spaces may have limitations with browser automation
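The Selenium caveat above usually comes down to Chrome needing container-friendly flags. A minimal sketch of how a headless driver is typically configured on constrained hosts — the `make_driver` helper and flag list are illustrative, not part of this repo, and the import is deferred so the snippet loads even where Selenium is not installed:

```python
# Flags commonly needed for Chrome inside containers (illustrative list).
HEADLESS_FLAGS = ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]

def make_driver():
    # Deferred import: requires `pip install selenium` plus a Chrome binary.
    from selenium import webdriver
    opts = webdriver.ChromeOptions()
    for flag in HEADLESS_FLAGS:
        opts.add_argument(flag)
    return webdriver.Chrome(options=opts)
```

Whether these flags suffice depends on the Space's base image; if browser automation still fails, the requests/BeautifulSoup path in `scraper.py` is the safer default.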
## 📋 Files Required for Deployment

- ✅ `app.py` - Main Streamlit app
- ✅ `requirements.txt` - Dependencies
- ✅ `scraper.py` - Web scraping module
- ✅ `youtube_scraper.py` - YouTube scraping module
- ✅ `README.md` - Documentation
- ✅ `.gitignore` - Git ignore rules

## 🎯 Your App URL

Once deployed, your app will be live at:
`https://huggingface.co/spaces/YOUR_USERNAME/crawl4ai-web-scraper`

## 🔧 Customization

You can customize your Space:
- **Title**: Change in Space settings
- **Description**: Add in README.md
- **Tags**: Add relevant tags for discoverability
- **Hardware**: Upgrade if needed (paid tier)

## 📊 Monitoring

- **Logs**: Check Space settings for runtime logs
- **Usage**: Monitor in Space analytics
- **Updates**: Push to GitHub to auto-deploy

---

**Need help?** Check the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces) or ask in the community!
README.md
CHANGED
@@ -1,20 +1,80 @@
Removed from the old version (front-matter fields the template had left empty, plus placeholder body lines):

- emoji:
- colorFrom:
- colorTo:
- sdk:
- pinned: false
- short_description: A powerful Streamlit app to scrape data from any website, in
- license: mit
- #
---
title: Scrape Anythings
emoji: ✨
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.35.0"
python_version: "3.9"
app_file: app.py
---

# ✨ Scrape Anythings

A user-friendly Streamlit web application for extracting data from any website, including special support for YouTube and Instagram.

## 🌟 Features

- **Scrape Any URL**: Paste any website, YouTube, or Instagram URL to start.
- **Multiple Data Types**: Extract text, images, links, tables, numbers, and metadata.
- **Social Media Support**: Scrape YouTube video info & comments, and Instagram profile details & posts.
- **Rich Data Export**: Download your data in JSON, CSV, TXT, and structured Excel (.xlsx) formats.
- **Modern UI**: A clean and simple interface for a smooth user experience.

## 🚀 How to Deploy on Hugging Face Spaces

1. **Create a Hugging Face Account**: If you don't have one, sign up at [huggingface.co](https://huggingface.co/).
2. **Create a New Space**:
   * Go to [huggingface.co/new-space](https://huggingface.co/new-space).
   * Enter a **Space name** (e.g., `scrape-anythings`).
   * Select **Streamlit** as the Space SDK.
   * Choose **Create a new repository for this Space**.
   * Click **Create Space**.
3. **Upload Your Files**:
   * In your new Space, go to the **Files** tab.
   * Click **Upload files**.
   * Drag and drop all the files from your project folder:
     * `app.py`
     * `scraper.py`
     * `youtube_scraper.py`
     * `instagram_scraper.py`
     * `instagram_scraper_v2.py`
     * `requirements.txt`
     * `README.md`
   * Commit the files directly to the `main` branch.
4. **Done!** Hugging Face will automatically build and launch your application. You can share the URL of your Space with anyone.
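The drag-and-drop upload in step 3 can also be scripted. A hedged sketch using `huggingface_hub` — the `repo_id` is a placeholder, and the snippet assumes `pip install huggingface_hub` and a prior `huggingface-cli login`:

```python
def deploy_to_space(folder=".", repo_id="YOUR_USERNAME/scrape-anythings"):
    """Upload a local project folder to a Hugging Face Space."""
    # Deferred import: requires `pip install huggingface_hub` and a login token.
    from huggingface_hub import HfApi
    HfApi().upload_folder(folder_path=folder, repo_id=repo_id, repo_type="space")
```

Re-running it after local changes pushes a new commit and triggers a rebuild, which is handy once the manual upload flow gets repetitive.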
## 📋 How to Use the App

1. **Enter a URL**: Paste the URL of the website, YouTube video, or Instagram profile you want to scrape.
2. **Select Data Types**: Choose the data you want to extract.
3. **Click Scrape!**: Let the app do the work.
4. **View & Download**: See the results directly in the app and download them in your preferred format.
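Step 1 works because the app routes each URL to the right scraper with a simple substring check. A minimal sketch of that detection logic, mirroring what `app.py` does in its sidebar:

```python
def detect_platform(url: str) -> str:
    """Classify a URL as 'youtube', 'instagram', or 'generic'."""
    u = url.lower()
    if "youtube.com" in u or "youtu.be" in u:
        return "youtube"
    if "instagram.com" in u:
        return "instagram"
    return "generic"
```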
## 🗺️ Roadmap

- [ ] Real-time scraping status
- [ ] Custom CSS selectors
- [ ] Proxy support
- [ ] Multi-language support

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- Streamlit team for the amazing web app framework
- BeautifulSoup and Selenium communities
- Hugging Face for hosting capabilities

---

**Made with ❤️ for the AI/ML community**
app.py
ADDED
@@ -0,0 +1,396 @@
import streamlit as st
import pandas as pd
import json
import time
from datetime import datetime
import requests
from urllib.parse import urlparse
import io
import base64
from scraper import scraper
from youtube_scraper import youtube_scraper
from instagram_scraper import instagram_scraper
from instagram_scraper_v2 import instagram_scraper_v2

# Page configuration
st.set_page_config(
    page_title="Scrape Anythings",
    page_icon="🕷️",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS for better styling
st.markdown("""
<style>
.main-header {
    font-size: 2.5rem;
    font-weight: bold;
    color: #1f77b4;
    text-align: center;
    margin-bottom: 2rem;
}
.sub-header {
    font-size: 1.2rem;
    color: #666;
    text-align: center;
    margin-bottom: 2rem;
}
.metric-card {
    background-color: #f0f2f6;
    padding: 1rem;
    border-radius: 0.5rem;
    border-left: 4px solid #1f77b4;
}
.success-box {
    background-color: #d4edda;
    border: 1px solid #c3e6cb;
    border-radius: 0.5rem;
    padding: 1rem;
    margin: 1rem 0;
}
.error-box {
    background-color: #f8d7da;
    border: 1px solid #f5c6cb;
    border-radius: 0.5rem;
    padding: 1rem;
    margin: 1rem 0;
}
</style>
""", unsafe_allow_html=True)

def validate_url(url):
    """Validate that the URL has both a scheme and a host."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except (ValueError, AttributeError):
        return False

def perform_web_scraping(url, data_types, max_pages=1, rate_limit=2):
    """Perform actual web scraping using the WebScraper class."""
    st.info("🔍 Starting web scraping...")
    data_types_lower = [dt.lower() for dt in data_types]
    with st.spinner("Crawling website..."):
        scraped_data = scraper.scrape_website(url, data_types_lower, max_pages, rate_limit)
    return scraped_data

def display_results(scraped_data, is_youtube=False, is_instagram=False):
    """Display the scraped data in a user-friendly format."""
    if is_youtube:
        display_youtube_results(scraped_data)
    elif is_instagram:
        display_instagram_results(scraped_data)
    else:
        display_regular_results(scraped_data)

def display_text_results(text_data):
    st.write(f"**Title:** {text_data.get('title', 'N/A')}")
    with st.expander("Headings"):
        for heading in text_data.get("headings", []):
            st.write(f"- **{heading.get('level', 'h?')}**: {heading.get('text', '')}")
    with st.expander("Paragraphs"):
        for para in text_data.get("paragraphs", []):
            st.write(f"- {para}")

def display_image_results(images):
    cols = st.columns(min(4, len(images)))
    for i, img in enumerate(images):
        with cols[i % 4]:
            st.image(img.get("src", ""), caption=f"{img.get('alt', 'Image')[:50]}...", use_column_width=True)

def display_table_results(tables):
    for i, table in enumerate(tables):
        with st.expander(f"Table {i+1} (Header: {table.get('header', [])})"):
            df = pd.DataFrame(table.get('rows', []))
            st.dataframe(df)

def display_link_results(links):
    for link in links:
        st.write(f"- [{link.get('text', 'N/A')}]({link.get('href', '#')})")

def display_metadata_results(metadata):
    st.json(metadata)

def display_regular_results(scraped_data):
    """Display regular website scraping results in a structured format."""
    st.subheader("📝 Text Content")
    if scraped_data.get("text_content"):
        display_text_results(scraped_data["text_content"])
    else:
        st.info("No text content was extracted.")

    st.subheader("🖼️ Images")
    if scraped_data.get("images"):
        display_image_results(scraped_data["images"])
    else:
        st.info("No images were extracted.")

    st.subheader("🔢 Numbers")
    if scraped_data.get("numbers"):
        with st.expander("Extracted Numbers", expanded=False):
            st.write(scraped_data["numbers"])
    else:
        st.info("No numbers were extracted.")

    st.subheader("📊 Tables")
    if scraped_data.get("tables"):
        display_table_results(scraped_data["tables"])
    else:
        st.info("No tables were extracted.")

    st.subheader("🔗 Links")
    if scraped_data.get("links"):
        display_link_results(scraped_data["links"])
    else:
        st.info("No links were extracted.")

    st.subheader("📄 Metadata")
    if scraped_data.get("metadata"):
        display_metadata_results(scraped_data["metadata"])
    else:
        st.info("No metadata was extracted.")

def to_excel(data):
    """Converts a dictionary of scraped data to an Excel file in memory."""
    output = io.BytesIO()
    with pd.ExcelWriter(output, engine='openpyxl') as writer:
        # Handle simple lists (links, images, numbers)
        for key in ["links", "images", "numbers"]:
            if data.get(key):
                pd.DataFrame({key.capitalize(): data[key]}).to_excel(writer, sheet_name=key.capitalize(), index=False)

        # Handle text content
        if data.get("text_content"):
            pd.DataFrame({'Text': [data["text_content"]]}).to_excel(writer, sheet_name='Text', index=False)

        # Handle dictionaries (metadata, video_info, profile_info)
        for key in ["metadata", "video_info", "profile_info"]:
            if data.get(key):
                pd.DataFrame(data[key].items(), columns=['Property', 'Value']).to_excel(writer, sheet_name=key.replace('_', ' ').capitalize(), index=False)

        # Handle list of dictionaries (comments)
        if data.get("comments"):
            pd.DataFrame(data["comments"]).to_excel(writer, sheet_name='Comments', index=False)

        # Handle list of DataFrames (tables)
        if data.get("tables"):
            for i, table_df in enumerate(data["tables"]):
                table_df.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)

    processed_data = output.getvalue()
    return processed_data

def create_download_links(scraped_data):
    """Create download buttons for the supported export formats."""
    col1, col2, col3, col4 = st.columns(4)

    # JSON download
    with col1:
        json_str = json.dumps(scraped_data or {}, indent=2, default=str)
        st.download_button(
            label="Download JSON",
            data=json_str,
            file_name="scraped_data.json",
            mime="application/json",
            use_container_width=True
        )

    # CSV download
    with col2:
        if scraped_data.get("tables"):
            # For simplicity, we'll offer the first table as a CSV download
            csv = scraped_data["tables"][0].to_csv(index=False)
            st.download_button(
                label="Download CSV",
                data=csv,
                file_name="scraped_table.csv",
                mime="text/csv",
                use_container_width=True
            )
        else:
            st.button("Download CSV", disabled=True, help="No tables found to download.", use_container_width=True)

    # TXT download
    with col3:
        text_content = scraped_data.get("text_content", "")
        if not isinstance(text_content, str):
            # text_content may be a dict of title/headings/paragraphs; serialize it
            text_content = json.dumps(text_content, indent=2, default=str)
        st.download_button(
            label="Download TXT",
            data=text_content,
            file_name="scraped_text.txt",
            mime="text/plain",
            use_container_width=True
        )

    # Excel download
    with col4:
        try:
            excel_data = to_excel(scraped_data)
            st.download_button(
                label="Download Excel",
                data=excel_data,
                file_name="scraped_data.xlsx",
                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                use_container_width=True
            )
        except Exception as e:
            st.button("Download Excel", disabled=True, help=f"Excel export failed: {e}", use_container_width=True)

def display_youtube_results(scraped_data):
    """Display YouTube scraping results."""
    if not scraped_data.get("video_info"):
        st.error("Could not extract YouTube video information.")
        return

    video_info = scraped_data["video_info"]
    st.subheader(f'{video_info.get("title", "Untitled")}')
    st.write(f'**Channel:** {video_info.get("channel", "N/A")}')
    st.write(f'**Views:** {video_info.get("views", "N/A")}')

    with st.expander("Video Description"):
        st.write(video_info.get("description", "No description."))

    if "comments" in scraped_data and scraped_data["comments"]:
        with st.expander(f'Comments ({len(scraped_data["comments"])})'):
            for comment in scraped_data["comments"]:
                st.markdown(f"**{comment.get('author', 'Unknown')}** - {comment.get('timestamp', 'Unknown')}")
                st.write(comment.get('text', ''))
                if comment.get('likes', '0') != '0':
                    st.caption(f"👍 {comment.get('likes', '0')} likes")
                st.divider()

def display_instagram_results(scraped_data):
    """Display Instagram scraping results."""
    if not scraped_data.get("profile_info"):
        st.error("Could not extract Instagram profile information.")
        return

    profile_info = scraped_data["profile_info"]
    with st.expander("Profile Information", expanded=True):
        st.write(f'**Username:** {profile_info.get("username", "N/A")}')
        st.write(f'**Display Name:** {profile_info.get("display_name", "N/A")}')
        st.write(f'**Bio:** {profile_info.get("bio", "N/A")}')
        st.write(f'**Followers:** {profile_info.get("followers", "N/A")}')

def main():
    # Header
    st.markdown('<h1 class="main-header">✨ Scrape Anythings</h1>', unsafe_allow_html=True)
    st.markdown('<p class="sub-header">Extract data from any website with ease</p>', unsafe_allow_html=True)

    # Sidebar for configuration
    with st.sidebar:
        st.header("Configuration")

        url = st.text_input("Enter Website URL", placeholder="https://example.com")

        is_youtube = ("youtube.com" in url.lower() or "youtu.be" in url.lower()) if url else False
        is_instagram = "instagram.com" in url.lower() if url else False

        data_types, youtube_data_types, instagram_data_types, max_comments = [], [], [], 50

        if is_youtube:
            st.info("YouTube URL detected!")
            youtube_data_types = st.multiselect("YouTube Data Types", ["video_info", "comments"], default=["video_info", "comments"])
            if "comments" in youtube_data_types:
                max_comments = st.slider("Max Comments", 10, 200, 50)
        elif is_instagram:
            st.info("Instagram URL detected!")
            instagram_data_types = st.multiselect("Instagram Data Types", ["profile_info", "images", "posts"], default=["profile_info", "images"])
        else:
            data_types = st.multiselect("Data Types", ["Text", "Images", "Links", "Tables", "Metadata", "Numbers"], default=["Text", "Links"])

        st.subheader("Advanced Options")
        max_pages = st.slider("Max Pages", 1, 10, 1)
        rate_limit = st.slider("Rate Limit (s)", 1, 10, 2)

        scrape_button = st.button("Start Scraping", type="primary", use_container_width=True)

    # Main content area
    if scrape_button:
        if not url or not validate_url(url):
            st.error("Please enter a valid URL.")
            return

        # Validate that at least one data type is selected for the given URL type
        if is_youtube and not youtube_data_types:
            st.error("Please select at least one YouTube data type to extract.")
            return
        elif is_instagram and not instagram_data_types:
            st.error("Please select at least one Instagram data type to extract.")
            return
        elif not is_youtube and not is_instagram and not data_types:
            st.error("Please select at least one data type to extract.")
            return

        with st.spinner("Scraping in progress... Please wait."):
            try:
                scraped_data = {}
                if is_youtube:
                    scraped_data = youtube_scraper.scrape_youtube_video(url, "comments" in youtube_data_types, max_comments)
                elif is_instagram:
                    try:
                        scraped_data = instagram_scraper_v2.extract_instagram_data(url)
                    except Exception:
                        st.warning("Improved scraper failed, trying fallback...")
                        scraped_data = instagram_scraper.extract_instagram_data(url)
                else:
                    scraped_data = perform_web_scraping(url, data_types, max_pages, rate_limit)

                if scraped_data.get("errors"):
                    st.error(f'Errors: {scraped_data["errors"]}')

                # Check if any data was actually scraped before showing success
                has_data = any(scraped_data.get(key) for key in ["text_content", "images", "numbers", "tables", "links", "metadata", "video_info", "profile_info"])

                if has_data:
                    st.success("Scraping completed successfully!")
                    st.header("Scraping Results")
                    display_results(scraped_data, is_youtube, is_instagram)
                    st.header("Download Data")
                    create_download_links(scraped_data)
                else:
                    st.warning("No data was extracted. The website might be blocking scrapers or the content is not available.")

            except Exception as e:
                st.error(f"An unexpected error occurred: {e}")

    else:
        st.markdown("""
        ### How to Use
        1. **Enter URL** and **select data types** in the sidebar.
        2. Click **Start Scraping** to begin.
        3. View and **download the results** below.
        """)

if __name__ == "__main__":
    main()
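One detail in `create_download_links` worth noting: `json.dumps(..., default=str)` is what lets non-JSON types, such as the `datetime` timestamps the scrapers attach, survive export instead of raising `TypeError`. A small self-contained illustration:

```python
import json
from datetime import datetime

# default=str makes json.dumps fall back to str() for unsupported types.
record = {"url": "https://example.com", "timestamp": datetime(2024, 1, 1, 12, 0)}
json_str = json.dumps(record, indent=2, default=str)
```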
instagram_scraper.py
ADDED
@@ -0,0 +1,378 @@
import streamlit as st
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from datetime import datetime
from urllib.parse import urljoin, urlparse

class InstagramScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        })

    def extract_instagram_data(self, url):
        """Extract data from Instagram profile or post"""
        scraped_data = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "platform": "instagram",
            "images": [],
            "posts": [],
            "profile_info": {},
            "errors": []
        }

        try:
            # Determine if it's a profile or post URL
            if "/p/" in url or "/reel/" in url:
                # Single post
                scraped_data.update(self.extract_post_data(url))
            else:
                # Profile
                scraped_data.update(self.extract_profile_data(url))

        except Exception as e:
            scraped_data["errors"].append(f"Instagram scraping error: {str(e)}")

        # Check if we found any data
        if not scraped_data.get("images") and not scraped_data.get("posts") and not scraped_data.get("profile_info", {}).get("username"):
            scraped_data["errors"].append("No Instagram data found. This might be due to:")
            scraped_data["errors"].append("- Private or protected account")
            scraped_data["errors"].append("- Instagram's anti-scraping measures")
            scraped_data["errors"].append("- Network connectivity issues")
            scraped_data["errors"].append("- URL format issues")

        return scraped_data

    def extract_post_data(self, url):
        """Extract data from a single Instagram post"""
        post_data = {
            "post_type": "single_post",
            "images": [],
            "post_info": {}
        }

        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Look for image URLs in the page
            # Instagram loads images dynamically, so we need to look for patterns
            page_text = response.text

            # Find image URLs in the page source
            image_patterns = [
                # Instagram post images (high quality)
                r'"display_url":"([^"]+)"',
                r'"display_src":"([^"]+)"',
                r'"src":"([^"]*\.jpg[^"]*)"',
                r'"src":"([^"]*\.jpeg[^"]*)"',
                r'"src":"([^"]*\.png[^"]*)"',
                # Direct image URLs
                r'https://[^"]*\.jpg[^"]*',
                r'https://[^"]*\.jpeg[^"]*',
                r'https://[^"]*\.png[^"]*',
                # Instagram CDN URLs (high quality)
                r'https://scontent[^"]*\.jpg[^"]*',
# Instagram CDN URLs (high quality)
|
| 87 |
+
r'https://scontent[^"]*\.jpg[^"]*',
|
| 88 |
+
r'https://scontent[^"]*\.jpeg[^"]*',
|
| 89 |
+
r'https://scontent[^"]*\.png[^"]*',
|
| 90 |
+
# Additional Instagram patterns
|
| 91 |
+
r'"url":"([^"]*\.jpg[^"]*)"',
|
| 92 |
+
r'"url":"([^"]*\.jpeg[^"]*)"',
|
| 93 |
+
r'"url":"([^"]*\.png[^"]*)"'
|
| 94 |
+
]
|
| 95 |
+
|
| 96 |
+
found_images = set()
|
| 97 |
+
for pattern in image_patterns:
|
| 98 |
+
matches = re.findall(pattern, page_text)
|
| 99 |
+
for match in matches:
|
| 100 |
+
if match and ('instagram' in match.lower() or 'scontent' in match.lower()):
|
| 101 |
+
# Clean up the URL
|
| 102 |
+
clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
|
| 103 |
+
found_images.add(clean_url)
|
| 104 |
+
|
| 105 |
+
# Convert to image objects
|
| 106 |
+
for i, img_url in enumerate(list(found_images)):
|
| 107 |
+
post_data["images"].append({
|
| 108 |
+
"src": img_url,
|
| 109 |
+
"alt": f"Instagram post image {i+1}",
|
| 110 |
+
"title": f"Instagram post image {i+1}",
|
| 111 |
+
"width": "",
|
| 112 |
+
"height": ""
|
| 113 |
+
})
|
| 114 |
+
|
| 115 |
+
# Extract post information
|
| 116 |
+
post_data["post_info"] = {
|
| 117 |
+
"url": url,
|
| 118 |
+
"images_count": len(post_data["images"]),
|
| 119 |
+
"scraped_at": datetime.now().isoformat()
|
| 120 |
+
}
|
| 121 |
+
|
| 122 |
+
except Exception as e:
|
| 123 |
+
post_data["errors"] = [f"Failed to extract post data: {str(e)}"]
|
| 124 |
+
|
| 125 |
+
return post_data
|
| 126 |
+
|
| 127 |
+
def extract_profile_data(self, url):
|
| 128 |
+
"""Extract data from Instagram profile"""
|
| 129 |
+
profile_data = {
|
| 130 |
+
"profile_type": "account",
|
| 131 |
+
"images": [],
|
| 132 |
+
"profile_info": {},
|
| 133 |
+
"posts": []
|
| 134 |
+
}
|
| 135 |
+
|
| 136 |
+
try:
|
| 137 |
+
response = self.session.get(url, timeout=10)
|
| 138 |
+
response.raise_for_status()
|
| 139 |
+
|
| 140 |
+
soup = BeautifulSoup(response.text, 'html.parser')
|
| 141 |
+
page_text = response.text
|
| 142 |
+
|
| 143 |
+
# Extract profile information
|
| 144 |
+
profile_data["profile_info"] = self.extract_profile_info(soup, page_text)
|
| 145 |
+
|
| 146 |
+
# Extract recent posts first
|
| 147 |
+
profile_data["posts"] = self.extract_recent_posts(page_text)
|
| 148 |
+
|
| 149 |
+
# Extract images from profile page
|
| 150 |
+
profile_data["images"] = self.extract_profile_images(page_text)
|
| 151 |
+
|
| 152 |
+
# Extract images from individual posts (higher quality)
|
| 153 |
+
if profile_data["posts"]:
|
| 154 |
+
post_images = self.extract_images_from_posts(profile_data["posts"], max_posts=3)
|
| 155 |
+
if post_images:
|
| 156 |
+
profile_data["images"].extend(post_images)
|
| 157 |
+
|
| 158 |
+
except Exception as e:
|
| 159 |
+
profile_data["errors"] = [f"Failed to extract profile data: {str(e)}"]
|
| 160 |
+
|
| 161 |
+
return profile_data
|
| 162 |
+
|
| 163 |
+
def extract_profile_info(self, soup, page_text):
|
| 164 |
+
"""Extract profile information"""
|
| 165 |
+
profile_info = {
|
| 166 |
+
"username": "",
|
| 167 |
+
"display_name": "",
|
| 168 |
+
"bio": "",
|
| 169 |
+
"followers": "",
|
| 170 |
+
"following": "",
|
| 171 |
+
"posts_count": ""
|
| 172 |
+
}
|
| 173 |
+
|
| 174 |
+
try:
|
| 175 |
+
# Look for profile information in the page source
|
| 176 |
+
# Instagram loads this data dynamically, so we need to parse JSON
|
| 177 |
+
|
| 178 |
+
# Find JSON data in the page
|
| 179 |
+
json_patterns = [
|
| 180 |
+
r'window\._sharedData\s*=\s*({[^}]+})',
|
| 181 |
+
r'"profile_page":\s*({[^}]+})',
|
| 182 |
+
r'"user":\s*({[^}]+})'
|
| 183 |
+
]
|
| 184 |
+
|
| 185 |
+
for pattern in json_patterns:
|
| 186 |
+
matches = re.findall(pattern, page_text)
|
| 187 |
+
if matches:
|
| 188 |
+
try:
|
| 189 |
+
data = json.loads(matches[0])
|
| 190 |
+
# Extract profile info from JSON
|
| 191 |
+
if "user" in data:
|
| 192 |
+
user_data = data["user"]
|
| 193 |
+
profile_info["username"] = user_data.get("username", "")
|
| 194 |
+
profile_info["display_name"] = user_data.get("full_name", "")
|
| 195 |
+
profile_info["bio"] = user_data.get("biography", "")
|
| 196 |
+
profile_info["followers"] = user_data.get("followed_by", {}).get("count", "")
|
| 197 |
+
profile_info["following"] = user_data.get("follows", {}).get("count", "")
|
| 198 |
+
profile_info["posts_count"] = user_data.get("media", {}).get("count", "")
|
| 199 |
+
except:
|
| 200 |
+
continue
|
| 201 |
+
|
| 202 |
+
# Fallback: try to extract from HTML
|
| 203 |
+
if not profile_info["username"]:
|
| 204 |
+
title_tag = soup.find('title')
|
| 205 |
+
if title_tag:
|
| 206 |
+
title_text = title_tag.get_text()
|
| 207 |
+
if '(' in title_text and ')' in title_text:
|
| 208 |
+
username = title_text.split('(')[1].split(')')[0]
|
| 209 |
+
profile_info["username"] = username
|
| 210 |
+
|
| 211 |
+
except Exception as e:
|
| 212 |
+
profile_info["error"] = f"Failed to extract profile info: {str(e)}"
|
| 213 |
+
|
| 214 |
+
return profile_info
|
| 215 |
+
|
| 216 |
+
def extract_profile_images(self, page_text):
|
| 217 |
+
"""Extract images from profile page"""
|
| 218 |
+
images = []
|
| 219 |
+
|
| 220 |
+
try:
|
| 221 |
+
# Look for Instagram post images in the page source
|
| 222 |
+
# Instagram stores post images in JSON data
|
| 223 |
+
image_patterns = [
|
| 224 |
+
# Instagram post images (high quality)
|
| 225 |
+
r'"display_url":"([^"]+)"',
|
| 226 |
+
r'"display_src":"([^"]+)"',
|
| 227 |
+
r'"src":"([^"]*\.jpg[^"]*)"',
|
| 228 |
+
r'"src":"([^"]*\.jpeg[^"]*)"',
|
| 229 |
+
r'"src":"([^"]*\.png[^"]*)"',
|
| 230 |
+
# Direct image URLs
|
| 231 |
+
r'https://[^"]*\.jpg[^"]*',
|
| 232 |
+
r'https://[^"]*\.jpeg[^"]*',
|
| 233 |
+
r'https://[^"]*\.png[^"]*',
|
| 234 |
+
# Instagram CDN URLs
|
| 235 |
+
r'https://scontent[^"]*\.jpg[^"]*',
|
| 236 |
+
r'https://scontent[^"]*\.jpeg[^"]*',
|
| 237 |
+
r'https://scontent[^"]*\.png[^"]*',
|
| 238 |
+
# Additional Instagram patterns
|
| 239 |
+
r'"url":"([^"]*\.jpg[^"]*)"',
|
| 240 |
+
r'"url":"([^"]*\.jpeg[^"]*)"',
|
| 241 |
+
r'"url":"([^"]*\.png[^"]*)"'
|
| 242 |
+
]
|
| 243 |
+
|
| 244 |
+
found_images = set()
|
| 245 |
+
for pattern in image_patterns:
|
| 246 |
+
matches = re.findall(pattern, page_text)
|
| 247 |
+
for match in matches:
|
| 248 |
+
if match and ('instagram' in match.lower() or 'scontent' in match.lower()):
|
| 249 |
+
# Clean up the URL
|
| 250 |
+
clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
|
| 251 |
+
found_images.add(clean_url)
|
| 252 |
+
|
| 253 |
+
# Convert to image objects
|
| 254 |
+
for i, img_url in enumerate(list(found_images)):
|
| 255 |
+
images.append({
|
| 256 |
+
"src": img_url,
|
| 257 |
+
"alt": f"Instagram post image {i+1}",
|
| 258 |
+
"title": f"Instagram post image {i+1}",
|
| 259 |
+
"width": "",
|
| 260 |
+
"height": ""
|
| 261 |
+
})
|
| 262 |
+
|
| 263 |
+
except Exception as e:
|
| 264 |
+
st.error(f"Failed to extract profile images: {str(e)}")
|
| 265 |
+
|
| 266 |
+
return images
|
| 267 |
+
|
| 268 |
+
def extract_recent_posts(self, page_text):
|
| 269 |
+
"""Extract recent posts from profile"""
|
| 270 |
+
posts = []
|
| 271 |
+
|
| 272 |
+
try:
|
| 273 |
+
# Look for post URLs in the page source
|
| 274 |
+
post_patterns = [
|
| 275 |
+
r'"shortcode":"([^"]+)"',
|
| 276 |
+
r'/p/([^/"]+)',
|
| 277 |
+
r'/reel/([^/"]+)'
|
| 278 |
+
]
|
| 279 |
+
|
| 280 |
+
found_posts = set()
|
| 281 |
+
for pattern in post_patterns:
|
| 282 |
+
matches = re.findall(pattern, page_text)
|
| 283 |
+
for match in matches:
|
| 284 |
+
if match:
|
| 285 |
+
found_posts.add(match)
|
| 286 |
+
|
| 287 |
+
# Convert to post objects
|
| 288 |
+
for i, post_code in enumerate(list(found_posts)[:10]): # Convert set to list and limit to 10 posts
|
| 289 |
+
posts.append({
|
| 290 |
+
"shortcode": post_code,
|
| 291 |
+
"url": f"https://www.instagram.com/p/{post_code}/",
|
| 292 |
+
"index": i + 1
|
| 293 |
+
})
|
| 294 |
+
|
| 295 |
+
except Exception as e:
|
| 296 |
+
st.error(f"Failed to extract recent posts: {str(e)}")
|
| 297 |
+
|
| 298 |
+
return posts
|
| 299 |
+
|
| 300 |
+
def extract_images_from_posts(self, posts, max_posts=5):
|
| 301 |
+
"""Extract images from individual posts"""
|
| 302 |
+
all_images = []
|
| 303 |
+
|
| 304 |
+
try:
|
| 305 |
+
for i, post in enumerate(posts[:max_posts]):
|
| 306 |
+
try:
|
| 307 |
+
# Get the post page
|
| 308 |
+
post_url = post["url"]
|
| 309 |
+
response = self.session.get(post_url, timeout=10)
|
| 310 |
+
response.raise_for_status()
|
| 311 |
+
|
| 312 |
+
# Extract images from this post
|
| 313 |
+
post_images = self.extract_post_images(response.text)
|
| 314 |
+
|
| 315 |
+
# Add post context to images
|
| 316 |
+
for img in post_images:
|
| 317 |
+
img["post_url"] = post_url
|
| 318 |
+
img["post_index"] = i + 1
|
| 319 |
+
all_images.append(img)
|
| 320 |
+
|
| 321 |
+
# Small delay to be respectful
|
| 322 |
+
time.sleep(1)
|
| 323 |
+
|
| 324 |
+
except Exception as e:
|
| 325 |
+
st.warning(f"Failed to extract images from post {post['shortcode']}: {str(e)}")
|
| 326 |
+
continue
|
| 327 |
+
|
| 328 |
+
except Exception as e:
|
| 329 |
+
st.error(f"Failed to extract images from posts: {str(e)}")
|
| 330 |
+
|
| 331 |
+
return all_images
|
| 332 |
+
|
| 333 |
+
def extract_post_images(self, page_text):
|
| 334 |
+
"""Extract images from a single post page"""
|
| 335 |
+
images = []
|
| 336 |
+
|
| 337 |
+
try:
|
| 338 |
+
# Look for high-quality Instagram post images
|
| 339 |
+
image_patterns = [
|
| 340 |
+
# Instagram post images (high quality)
|
| 341 |
+
r'"display_url":"([^"]+)"',
|
| 342 |
+
r'"display_src":"([^"]+)"',
|
| 343 |
+
# Instagram CDN URLs (highest quality)
|
| 344 |
+
r'https://scontent[^"]*\.jpg[^"]*',
|
| 345 |
+
r'https://scontent[^"]*\.jpeg[^"]*',
|
| 346 |
+
r'https://scontent[^"]*\.png[^"]*',
|
| 347 |
+
# Additional patterns
|
| 348 |
+
r'"src":"([^"]*\.jpg[^"]*)"',
|
| 349 |
+
r'"src":"([^"]*\.jpeg[^"]*)"',
|
| 350 |
+
r'"src":"([^"]*\.png[^"]*)"'
|
| 351 |
+
]
|
| 352 |
+
|
| 353 |
+
found_images = set()
|
| 354 |
+
for pattern in image_patterns:
|
| 355 |
+
matches = re.findall(pattern, page_text)
|
| 356 |
+
for match in matches:
|
| 357 |
+
if match and ('scontent' in match.lower() or 'instagram' in match.lower()):
|
| 358 |
+
# Clean up the URL
|
| 359 |
+
clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
|
| 360 |
+
found_images.add(clean_url)
|
| 361 |
+
|
| 362 |
+
# Convert to image objects
|
| 363 |
+
for i, img_url in enumerate(list(found_images)):
|
| 364 |
+
images.append({
|
| 365 |
+
"src": img_url,
|
| 366 |
+
"alt": f"Instagram post image {i+1}",
|
| 367 |
+
"title": f"Instagram post image {i+1}",
|
| 368 |
+
"width": "",
|
| 369 |
+
"height": ""
|
| 370 |
+
})
|
| 371 |
+
|
| 372 |
+
except Exception as e:
|
| 373 |
+
st.error(f"Failed to extract post images: {str(e)}")
|
| 374 |
+
|
| 375 |
+
return images
|
| 376 |
+
|
| 377 |
+
# Global Instagram scraper instance
|
| 378 |
+
instagram_scraper = InstagramScraper()
|
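The scraper above repeatedly applies the same two-step idea: regex-match image URLs out of the raw page source, then undo JSON string escaping (`\u0026` → `&`, `\/` → `/`). A minimal sketch of that step on a hypothetical snippet of page source (the URL below is made up, not real Instagram data):

```python
import re

# Hypothetical fragment of Instagram page source, JSON-escaped
page_text = '"display_url":"https:\\/\\/scontent.cdninstagram.com\\/photo.jpg?x=1\\u0026y=2"'

found = set()
for match in re.findall(r'"display_url":"([^"]+)"', page_text):
    # Undo JSON escaping: \u0026 -> & and \/ -> /
    clean = match.replace('\\u0026', '&').replace('\\/', '/')
    found.add(clean)

print(found)  # {'https://scontent.cdninstagram.com/photo.jpg?x=1&y=2'}
```

Note the deduplication via a `set` mirrors how the class collects `found_images` before converting them to image objects.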
instagram_scraper_v2.py
ADDED
```python
import streamlit as st
import requests
import json
import re
import time
import random
from datetime import datetime

class InstagramScraperV2:
    def __init__(self):
        self.session = requests.Session()
        self.setup_session()

    def setup_session(self):
        """Setup session with better anti-detection measures"""
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        ]

        self.session.headers.update({
            'User-Agent': random.choice(user_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        })

    def get_page_with_retry(self, url, max_retries=3):
        """Get page with retry mechanism"""
        for attempt in range(max_retries):
            try:
                time.sleep(random.uniform(2, 4))
                response = self.session.get(url, timeout=20)
                response.raise_for_status()
                return response.text
            except Exception as e:
                st.warning(f"Attempt {attempt + 1} failed: {str(e)}")
                if attempt == max_retries - 1:
                    raise
        return None

    def extract_instagram_data(self, url):
        """Extract data from Instagram with improved error handling"""
        scraped_data = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "platform": "instagram",
            "images": [],
            "posts": [],
            "profile_info": {},
            "errors": []
        }

        try:
            page_text = self.get_page_with_retry(url)
            if not page_text:
                scraped_data["errors"].append("Failed to load Instagram page")
                return scraped_data

            # Extract images
            scraped_data["images"] = self.extract_images_from_page(page_text)

            # Extract profile info
            scraped_data["profile_info"] = self.extract_profile_info(page_text)

            # Extract posts
            scraped_data["posts"] = self.extract_recent_posts(page_text)

        except Exception as e:
            scraped_data["errors"].append(f"Instagram scraping error: {str(e)}")

        return scraped_data

    def extract_images_from_page(self, page_text):
        """Extract images with improved patterns"""
        images = []

        try:
            # Enhanced patterns for Instagram images
            patterns = [
                r'https://scontent[^"]*\.jpg[^"]*',
                r'https://scontent[^"]*\.jpeg[^"]*',
                r'https://scontent[^"]*\.png[^"]*',
                r'"display_url":"([^"]+)"',
                r'"display_src":"([^"]+)"'
            ]

            found_images = set()
            for pattern in patterns:
                matches = re.findall(pattern, page_text)
                for match in matches:
                    if match and ('scontent' in match.lower() or 'instagram' in match.lower()):
                        clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
                        found_images.add(clean_url)

            for i, img_url in enumerate(list(found_images)):
                images.append({
                    "src": img_url,
                    "alt": f"Instagram image {i+1}",
                    "title": f"Instagram image {i+1}",
                    "width": "",
                    "height": ""
                })

        except Exception as e:
            st.error(f"Failed to extract images: {str(e)}")

        return images

    def extract_profile_info(self, page_text):
        """Extract profile information"""
        profile_info = {
            "username": "",
            "display_name": "",
            "bio": "",
            "followers": "",
            "following": "",
            "posts_count": ""
        }

        try:
            # Extract username from title
            title_match = re.search(r'<title>([^<]+)</title>', page_text)
            if title_match:
                title = title_match.group(1)
                if '(' in title and ')' in title:
                    username = title.split('(')[1].split(')')[0]
                    profile_info["username"] = username

            # Look for JSON data
            json_patterns = [
                r'"username":"([^"]+)"',
                r'"full_name":"([^"]+)"',
                r'"biography":"([^"]+)"'
            ]

            for pattern in json_patterns:
                matches = re.findall(pattern, page_text)
                if matches:
                    if "username" in pattern:
                        profile_info["username"] = matches[0]
                    elif "full_name" in pattern:
                        profile_info["display_name"] = matches[0]
                    elif "biography" in pattern:
                        profile_info["bio"] = matches[0]

        except Exception as e:
            profile_info["error"] = f"Failed to extract profile info: {str(e)}"

        return profile_info

    def extract_recent_posts(self, page_text):
        """Extract recent posts"""
        posts = []

        try:
            post_patterns = [
                r'"shortcode":"([^"]+)"',
                r'/p/([^/"]+)'
            ]

            found_posts = set()
            for pattern in post_patterns:
                matches = re.findall(pattern, page_text)
                for match in matches:
                    if match:
                        found_posts.add(match)

            for i, post_code in enumerate(list(found_posts)[:10]):
                posts.append({
                    "shortcode": post_code,
                    "url": f"https://www.instagram.com/p/{post_code}/",
                    "index": i + 1
                })

        except Exception as e:
            st.error(f"Failed to extract posts: {str(e)}")

        return posts

# Global instance
instagram_scraper_v2 = InstagramScraperV2()
```
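Both scraper versions recover post URLs the same way: match shortcodes with a couple of regexes, deduplicate them in a set, and rebuild canonical post URLs. A small sketch on a hypothetical page fragment (the shortcodes here are made up):

```python
import re

# Hypothetical fragment of a profile page's source
page_text = '<a href="/p/AbC123/">post</a> "shortcode":"XyZ789"'

found = set()
for pattern in (r'"shortcode":"([^"]+)"', r'/p/([^/"]+)'):
    found.update(re.findall(pattern, page_text))

urls = sorted(f"https://www.instagram.com/p/{code}/" for code in found)
print(urls)
```

Because both patterns can match the same post, the set is what prevents duplicate entries before the list is truncated to the 10 most recent.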
requirements.txt
CHANGED
```
streamlit
pandas
requests
beautifulsoup4
selenium
lxml
openpyxl
```
requirements_hf.txt
ADDED
```
streamlit>=1.28.0
pandas>=1.5.0
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```
scraper.py
ADDED
```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time
import re
from urllib.parse import urljoin, urlparse
import json
from datetime import datetime

class WebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })
        self.driver = None

    def setup_selenium(self):
        """Setup Selenium WebDriver for dynamic content"""
        try:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--window-size=1920,1080")

            self.driver = webdriver.Chrome(
                service=Service(ChromeDriverManager().install()),
                options=chrome_options
            )
            return True
        except Exception as e:
            print(f"Failed to setup Selenium: {e}")
            return False

    def close_selenium(self):
        """Close Selenium WebDriver"""
        if self.driver:
            self.driver.quit()
            self.driver = None

    def get_page_content(self, url, use_selenium=False):
        """Get page content using requests or Selenium"""
        try:
            if use_selenium and self.driver:
                self.driver.get(url)
                time.sleep(2)  # Wait for dynamic content
                return self.driver.page_source
            else:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return response.text
        except Exception as e:
            print(f"Error fetching page: {e}")
            return None

    def extract_text_content(self, soup):
        """Extract text content from BeautifulSoup object"""
        text_data = {
            "title": "",
            "headings": [],
            "paragraphs": [],
            "lists": []
        }

        # Extract title
        title_tag = soup.find('title')
        if title_tag:
            text_data["title"] = title_tag.get_text().strip()

        # Extract headings
        for tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
            headings = soup.find_all(tag)
            for heading in headings:
                text = heading.get_text().strip()
                if text:
                    text_data["headings"].append({
                        "level": tag,
                        "text": text
                    })

        # Extract paragraphs
        paragraphs = soup.find_all('p')
        for p in paragraphs:
            text = p.get_text().strip()
            if text and len(text) > 20:  # Filter out short text
                text_data["paragraphs"].append(text)

        # Extract lists
        lists = soup.find_all(['ul', 'ol'])
        for lst in lists:
            items = []
            for item in lst.find_all('li'):
                text = item.get_text().strip()
                if text:
                    items.append(text)
            if items:
                text_data["lists"].append({
                    "type": lst.name,
                    "items": items
                })

        return text_data

    def extract_numbers(self, soup):
        """Extract all numbers (integers and floats) from the text content"""
        text = soup.get_text()
        # Regex to find integers and floats
        numbers = re.findall(r'\b\d+\.?\d*\b', text)
        # Convert to float for consistency, and remove duplicates
        return sorted(list(set([float(n) for n in numbers if n.strip()])))

    def extract_images(self, soup, base_url):
        """Extract images from BeautifulSoup object"""
        images = []
        img_tags = soup.find_all('img')

        for img in img_tags:
            src = img.get('src', '')
            alt = img.get('alt', '')
            title = img.get('title', '')

            if src:
                # Make relative URLs absolute
                if not src.startswith(('http://', 'https://')):
                    src = urljoin(base_url, src)

                images.append({
                    "src": src,
                    "alt": alt,
                    "title": title,
                    "width": img.get('width', ''),
                    "height": img.get('height', '')
                })

        return images

    def extract_links(self, soup, base_url):
        """Extract links from BeautifulSoup object"""
        links = []
        link_tags = soup.find_all('a', href=True)

        for link in link_tags:
            href = link.get('href')
            text = link.get_text().strip()

            if href and text:
                # Make relative URLs absolute
                if not href.startswith(('http://', 'https://')):
                    href = urljoin(base_url, href)

                # Only include external and internal links, skip anchors
                if not href.startswith('#'):
                    links.append({
                        "href": href,
                        "text": text,
                        "title": link.get('title', ''),
                        "is_external": not href.startswith(base_url)
                    })

        return links

    def extract_tables(self, soup):
        """Extract tables from BeautifulSoup object"""
        tables = []
        table_tags = soup.find_all('table')

        for table in table_tags:
            table_data = {
                "headers": [],
                "rows": [],
                "caption": ""
            }

            # Extract caption
            caption = table.find('caption')
            if caption:
                table_data["caption"] = caption.get_text().strip()

            # Extract headers
            thead = table.find('thead')
            if thead:
                header_row = thead.find('tr')
                if header_row:
                    headers = header_row.find_all(['th', 'td'])
                    table_data["headers"] = [h.get_text().strip() for h in headers]

            # Extract rows
            tbody = table.find('tbody') or table
            rows = tbody.find_all('tr')

            for row in rows:
                cells = row.find_all(['td', 'th'])
                if cells:
                    row_data = [cell.get_text().strip() for cell in cells]
                    table_data["rows"].append(row_data)

            if table_data["rows"]:
                tables.append(table_data)

        return tables

    def extract_metadata(self, soup):
        """Extract metadata from BeautifulSoup object"""
        metadata = {
            "title": "",
            "description": "",
            "keywords": [],
            "author": "",
            "language": "en",
            "robots": "",
            "viewport": "",
            "charset": ""
        }

        # Extract title
        title_tag = soup.find('title')
        if title_tag:
            metadata["title"] = title_tag.get_text().strip()

        # Extract meta tags
        meta_tags = soup.find_all('meta')
        for meta in meta_tags:
            name = meta.get('name', '').lower()
            content = meta.get('content', '')
            property_attr = meta.get('property', '').lower()

            if name == 'description' or property_attr == 'og:description':
                metadata["description"] = content
            elif name == 'keywords':
                metadata["keywords"] = [kw.strip() for kw in content.split(',')]
            elif name == 'author':
                metadata["author"] = content
            elif name == 'robots':
                metadata["robots"] = content
            elif name == 'viewport':
                metadata["viewport"] = content
            elif property_attr == 'og:title':
                metadata["title"] = content or metadata["title"]

        # Extract charset
        charset_meta = soup.find('meta', charset=True)
        if charset_meta:
            metadata["charset"] = charset_meta.get('charset')

        # Extract language
        html_tag = soup.find('html')
        if html_tag:
            lang = html_tag.get('lang', 'en')
            metadata["language"] = lang

        return metadata

    def scrape_website(self, url, data_types, max_pages=1, rate_limit=2):
        """Main scraping function"""
        scraped_data = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "data_types": data_types,
            "pages_crawled": 0,
            "errors": []
        }

        try:
            # Setup Selenium if needed for dynamic content
            use_selenium = "images" in data_types or "tables" in data_types
            if use_selenium:
                if not self.setup_selenium():
                    scraped_data["errors"].append("Failed to setup Selenium for dynamic content")

            # Get page content
            content = self.get_page_content(url, use_selenium)
            if not content:
                scraped_data["errors"].append("Failed to fetch page content")
                return scraped_data

            # Parse with BeautifulSoup
            soup = BeautifulSoup(content, 'html.parser')
            scraped_data["pages_crawled"] = 1

            # Extract data based on selected types
            if "text" in data_types:
                scraped_data["text_content"] = self.extract_text_content(soup)

            if "images" in data_types:
                scraped_data["images"] = self.extract_images(soup, url)

            if "links" in data_types:
                scraped_data["links"] = self.extract_links(soup, url)

            if "tables" in data_types:
                scraped_data["tables"] = self.extract_tables(soup)

            if "metadata" in data_types:
                scraped_data["metadata"] = self.extract_metadata(soup)
```
|
| 302 |
+
|
| 303 |
+
if "numbers" in data_types:
|
| 304 |
+
scraped_data["numbers"] = self.extract_numbers(soup)
|
| 305 |
+
|
| 306 |
+
# Rate limiting
|
| 307 |
+
time.sleep(rate_limit)
|
| 308 |
+
|
| 309 |
+
except Exception as e:
|
| 310 |
+
scraped_data["errors"].append(f"Scraping error: {str(e)}")
|
| 311 |
+
|
| 312 |
+
finally:
|
| 313 |
+
# Clean up Selenium
|
| 314 |
+
if use_selenium:
|
| 315 |
+
self.close_selenium()
|
| 316 |
+
|
| 317 |
+
return scraped_data
|
| 318 |
+
|
| 319 |
+
# Global scraper instance
|
| 320 |
+
scraper = WebScraper()
|
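The `extract_metadata` branch above relies only on BeautifulSoup, so its meta-tag logic can be exercised without Selenium or a live page. A minimal standalone sketch against an in-memory HTML string (the sample markup here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html lang="de">
  <head>
    <meta charset="utf-8">
    <title>Demo</title>
    <meta name="description" content="A test page">
    <meta name="keywords" content="scraping, bs4">
    <meta property="og:title" content="Demo (OG)">
  </head>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
metadata = {"title": "", "description": "", "keywords": [], "language": "en"}

# <title> first, then og:title may override it, as in extract_metadata
title_tag = soup.find("title")
if title_tag:
    metadata["title"] = title_tag.get_text().strip()

for meta in soup.find_all("meta"):
    name = meta.get("name", "").lower()
    prop = meta.get("property", "").lower()
    content = meta.get("content", "")
    if name == "description" or prop == "og:description":
        metadata["description"] = content
    elif name == "keywords":
        metadata["keywords"] = [kw.strip() for kw in content.split(",")]
    elif prop == "og:title":
        metadata["title"] = content or metadata["title"]

# Language comes from the lang attribute on <html>
html_tag = soup.find("html")
if html_tag:
    metadata["language"] = html_tag.get("lang", "en")

print(metadata)
```

Because the `og:title` meta appears after `<title>` is read, it overwrites the plain title, matching the precedence in the scraper.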
youtube_scraper.py
ADDED
@@ -0,0 +1,215 @@
```python
import streamlit as st
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time
import re
import json
from datetime import datetime

class YouTubeScraper:
    def __init__(self):
        self.driver = None
        self.setup_selenium()

    def setup_selenium(self):
        """Setup Selenium WebDriver for YouTube"""
        try:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--window-size=1920,1080")
            chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

            self.driver = webdriver.Chrome(
                service=Service(ChromeDriverManager().install()),
                options=chrome_options
            )
            return True
        except Exception as e:
            st.error(f"Failed to setup Selenium: {e}")
            return False

    def close_selenium(self):
        """Close Selenium WebDriver"""
        if self.driver:
            self.driver.quit()
            self.driver = None

    def extract_video_info(self, url):
        """Extract basic video information"""
        try:
            self.driver.get(url)
            time.sleep(3)  # Wait for page to load

            # Extract video title
            title = ""
            try:
                title_element = self.driver.find_element(By.CSS_SELECTOR, "h1.ytd-video-primary-info-renderer")
                title = title_element.text
            except Exception:
                try:
                    title_element = self.driver.find_element(By.CSS_SELECTOR, "h1")
                    title = title_element.text
                except Exception:
                    title = "Title not found"

            # Extract channel name
            channel = ""
            try:
                channel_element = self.driver.find_element(By.CSS_SELECTOR, "ytd-channel-name yt-formatted-string a")
                channel = channel_element.text
            except Exception:
                channel = "Channel not found"

            # Extract view count
            views = ""
            try:
                views_element = self.driver.find_element(By.CSS_SELECTOR, "span.view-count")
                views = views_element.text
            except Exception:
                views = "Views not found"

            # Extract description
            description = ""
            try:
                desc_element = self.driver.find_element(By.CSS_SELECTOR, "ytd-expandable-video-description-body-text")
                description = desc_element.text
            except Exception:
                description = "Description not found"

            return {
                "title": title,
                "channel": channel,
                "views": views,
                "description": description,
                "url": url
            }

        except Exception as e:
            return {"error": f"Failed to extract video info: {str(e)}"}

    def extract_comments(self, url, max_comments=50):
        """Extract comments from YouTube video"""
        try:
            self.driver.get(url)
            time.sleep(3)

            # Scroll down to load comments
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

            # Try to find comments section
            comments = []

            # Method 1: Try to find comment elements
            try:
                comment_elements = self.driver.find_elements(By.CSS_SELECTOR, "ytd-comment-thread-renderer")

                for i, comment in enumerate(comment_elements[:max_comments]):
                    try:
                        # Extract author
                        author_element = comment.find_element(By.CSS_SELECTOR, "a#author-text")
                        author = author_element.text.strip()

                        # Extract comment text
                        text_element = comment.find_element(By.CSS_SELECTOR, "#content-text")
                        text = text_element.text.strip()

                        # Extract timestamp
                        time_element = comment.find_element(By.CSS_SELECTOR, "a.yt-simple-endpoint")
                        timestamp = time_element.text.strip()

                        # Extract likes
                        likes = "0"
                        try:
                            likes_element = comment.find_element(By.CSS_SELECTOR, "span#vote-count-middle")
                            likes = likes_element.text.strip()
                        except Exception:
                            pass

                        if text:  # Only add if comment has text
                            comments.append({
                                "author": author,
                                "text": text,
                                "timestamp": timestamp,
                                "likes": likes,
                                "comment_id": i
                            })

                    except Exception:
                        continue

            except Exception as e:
                st.warning(f"Could not extract comments using primary method: {e}")

            # Method 2: Alternative approach if first method fails
            if not comments:
                try:
                    # Look for any text that might be comments
                    page_text = self.driver.page_source
                    soup = BeautifulSoup(page_text, 'html.parser')

                    # Look for comment-like patterns
                    comment_patterns = [
                        "ytd-comment-renderer",
                        "comment-text",
                        "ytd-comment-thread-renderer"
                    ]

                    for pattern in comment_patterns:
                        elements = soup.find_all(attrs={"class": re.compile(pattern)})
                        for element in elements[:max_comments]:
                            text = element.get_text().strip()
                            if text and len(text) > 10:
                                comments.append({
                                    "author": "Unknown",
                                    "text": text,
                                    "timestamp": "Unknown",
                                    "likes": "0",
                                    "comment_id": len(comments)
                                })

                except Exception as e:
                    st.error(f"Alternative comment extraction failed: {e}")

            return comments

        except Exception as e:
            return [{"error": f"Failed to extract comments: {str(e)}"}]

    def scrape_youtube_video(self, url, extract_comments=True, max_comments=50):
        """Main function to scrape YouTube video data"""
        result = {
            "url": url,
            "timestamp": datetime.now().isoformat(),
            "video_info": {},
            "comments": [],
            "errors": []
        }

        try:
            # Extract video information
            result["video_info"] = self.extract_video_info(url)

            # Extract comments if requested
            if extract_comments:
                result["comments"] = self.extract_comments(url, max_comments)

        except Exception as e:
            result["errors"].append(f"Scraping error: {str(e)}")

        finally:
            self.close_selenium()

        return result

# Global YouTube scraper instance
youtube_scraper = YouTubeScraper()
```
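The Method 2 fallback in `extract_comments` is plain BeautifulSoup work on the page source, so the class-pattern matching can be sketched against a static HTML snippet. The markup below is an invented stand-in for YouTube's real DOM, but the regex-on-class lookup and the length filter are the same:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment: two comment-like nodes and one unrelated widget
html = """
<div class="ytd-comment-thread-renderer">Great explanation, thanks!</div>
<div class="ytd-comment-thread-renderer">ok</div>
<div class="unrelated-widget">Subscribe now</div>
"""

soup = BeautifulSoup(html, "html.parser")
comments = []

# find_all with a compiled regex matches any element whose class list
# contains a matching name, as in the scraper's fallback
for element in soup.find_all(attrs={"class": re.compile("ytd-comment-thread-renderer")}):
    text = element.get_text().strip()
    if text and len(text) > 10:  # drop very short matches, same heuristic as the scraper
        comments.append({
            "author": "Unknown",
            "text": text,
            "comment_id": len(comments)
        })

print(comments)
```

The `len(text) > 10` heuristic discards the two-character "ok" node, so only the first comment survives; the unrelated widget never matches the class pattern at all.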