PHOROTHA913 committed on
Commit
5c3dc0d
·
verified ·
1 Parent(s): a5c1e5d

Upload 9 files


For testing: use and explore the web data as you like

Files changed (9)
  1. DEPLOYMENT.md +100 -0
  2. README.md +74 -14
  3. app.py +396 -0
  4. instagram_scraper.py +378 -0
  5. instagram_scraper_v2.py +184 -0
  6. requirements.txt +6 -2
  7. requirements_hf.txt +5 -0
  8. scraper.py +320 -0
  9. youtube_scraper.py +215 -0
DEPLOYMENT.md ADDED
@@ -0,0 +1,100 @@
+ # 🚀 Deploy to Hugging Face Spaces
+
+ ## Step-by-Step Guide
+
+ ### 1. Create GitHub Repository
+
+ 1. Go to [GitHub](https://github.com) and create a new repository
+ 2. Name it something like `crawl4ai-streamlit-app`
+ 3. Make it public (required for free Hugging Face Spaces)
+ 4. Don't initialize it with a README (we already have one)
+
+ ### 2. Push Code to GitHub
+
+ ```bash
+ # Add your GitHub repository as remote
+ git remote add origin https://github.com/YOUR_USERNAME/crawl4ai-streamlit-app.git
+
+ # Push to GitHub
+ git branch -M main
+ git push -u origin main
+ ```
+
+ ### 3. Create Hugging Face Space
+
+ 1. Go to [Hugging Face Spaces](https://huggingface.co/spaces)
+ 2. Click "Create new Space"
+ 3. Fill in the details:
+    - **Owner**: Your username
+    - **Space name**: `crawl4ai-web-scraper` (or any name you like)
+    - **License**: MIT
+    - **SDK**: Select "Streamlit"
+    - **Space hardware**: CPU (free tier)
+ 4. Click "Create Space"
+
+ ### 4. Connect GitHub Repository
+
+ 1. In your new Space, click "Settings"
+ 2. Under "Repository", click "Connect to existing repository"
+ 3. Select your GitHub repository
+ 4. Set the app file path to `app.py`
+ 5. Click "Connect"
+
+ ### 5. Configure Environment
+
+ The Space will automatically:
+ - Install dependencies from `requirements.txt`
+ - Run `streamlit run app.py`
+ - Deploy your app
+
+ ### 6. Access Your App
+
+ Once deployed, your app will be live at:
+ `https://huggingface.co/spaces/YOUR_USERNAME/crawl4ai-web-scraper`
+
+ ## 🛠️ Troubleshooting
+
+ ### Common Issues
+
+ 1. **Dependencies not found**: Make sure `requirements.txt` is in the root directory
+ 2. **App not loading**: Check the logs in the Space settings
+ 3. **Selenium issues**: Hugging Face Spaces may have limitations with browser automation
+
+ ### Alternative: Manual Upload
+
+ If the GitHub connection doesn't work:
+ 1. Download your repository as a ZIP
+ 2. Upload the files manually to the Space
+ 3. Commit the changes in the Space interface
+
+ ## 📋 Files Required for Deployment
+
+ - ✅ `app.py` - Main Streamlit app
+ - ✅ `requirements.txt` - Dependencies
+ - ✅ `scraper.py` - Web scraping module
+ - ✅ `youtube_scraper.py` - YouTube scraping module
+ - ✅ `README.md` - Documentation
+ - ✅ `.gitignore` - Git ignore rules
+
+ ## 🔧 Customization
+
+ You can customize your Space:
+ - **Title**: Change in Space settings
+ - **Description**: Add in README.md
+ - **Tags**: Add relevant tags for discoverability
+ - **Hardware**: Upgrade if needed (paid tier)
+
+ ## 📊 Monitoring
+
+ - **Logs**: Check Space settings for runtime logs
+ - **Usage**: Monitor in Space analytics
+ - **Updates**: Push to GitHub to auto-deploy
+
+ ---
+
+ **Need help?** Check the [Hugging Face Spaces documentation](https://huggingface.co/docs/hub/spaces) or ask in the community!
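Once the Space reports "Running", you can sanity-check the deployment from a script instead of the browser. A minimal stdlib-only sketch; `space_url` and `is_space_live` are our own helper names, and the username/Space name are placeholders:

```python
from urllib.request import urlopen


def space_url(username: str, space_name: str) -> str:
    """Build the public URL for a Hugging Face Space."""
    return f"https://huggingface.co/spaces/{username}/{space_name}"


def is_space_live(url: str, timeout: int = 10) -> bool:
    """Return True if the Space URL answers with HTTP 200."""
    try:
        return urlopen(url, timeout=timeout).status == 200
    except (OSError, ValueError):
        # DNS failures, timeouts, and malformed URLs all count as "not live"
        return False


print(space_url("YOUR_USERNAME", "crawl4ai-web-scraper"))
# → https://huggingface.co/spaces/YOUR_USERNAME/crawl4ai-web-scraper
```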
README.md CHANGED
@@ -1,20 +1,80 @@
  ---
  title: Scrape Anythings
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: A powerful Streamlit app to scrape data from any website, in
- license: mit
  ---

- # Welcome to Streamlit!

- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:

- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
  ---
  title: Scrape Anythings
+ emoji:
+ colorFrom: blue
+ colorTo: green
+ sdk: streamlit
+ sdk_version: "1.35.0"
+ python_version: "3.9"
+ app_file: app.py
  ---

+ # Scrape Anythings

+ A user-friendly Streamlit web application for extracting data from any website, with special support for YouTube and Instagram.

+ ## 🌟 Features
+
+ - **Scrape Any URL**: Paste any website, YouTube, or Instagram URL to start.
+ - **Multiple Data Types**: Extract text, images, links, tables, numbers, and metadata.
+ - **Social Media Support**: Scrape YouTube video info & comments, and Instagram profile details & posts.
+ - **Rich Data Export**: Download your data in JSON, CSV, TXT, and structured Excel (.xlsx) formats.
+ - **Modern UI**: A clean and simple interface for a smooth user experience.
+
+ ## 🚀 How to Deploy on Hugging Face Spaces
+
+ 1. **Create a Hugging Face Account**: If you don't have one, sign up at [huggingface.co](https://huggingface.co/).
+ 2. **Create a New Space**:
+    * Go to [huggingface.co/new-space](https://huggingface.co/new-space).
+    * Enter a **Space name** (e.g., `scrape-anythings`).
+    * Select **Streamlit** as the Space SDK.
+    * Choose **Create a new repository for this Space**.
+    * Click **Create Space**.
+ 3. **Upload Your Files**:
+    * In your new Space, go to the **Files** tab.
+    * Click **Upload files**.
+    * Drag and drop all the files from your project folder:
+      * `app.py`
+      * `scraper.py`
+      * `youtube_scraper.py`
+      * `instagram_scraper.py`
+      * `instagram_scraper_v2.py`
+      * `requirements.txt`
+      * `README.md`
+    * Commit the files directly to the `main` branch.
+ 4. **Done!** Hugging Face will automatically build and launch your application. You can share the URL of your Space with anyone.
+
+ ## 📋 How to Use the App
+
+ 1. **Enter a URL**: Paste the URL of the website, YouTube video, or Instagram profile you want to scrape.
+ 2. **Select Data Types**: Choose the data you want to extract.
+ 3. **Click Scrape!**: Let the app do the work.
+ 4. **View & Download**: See the results directly in the app and download them in your preferred format.
+
+ ## 🚧 Planned Features
+
+ - [ ] Real-time scraping status
+ - [ ] Custom CSS selectors
+ - [ ] Proxy support
+ - [ ] Multi-language support
+
+ ## 🤝 Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Test thoroughly
+ 5. Submit a pull request
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## 🙏 Acknowledgments
+
+ - Streamlit team for the amazing web app framework
+ - BeautifulSoup and Selenium communities
+ - Hugging Face for hosting capabilities
+
+ ---
+
+ **Made with ❤️ for the AI/ML community**
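The app decides which scraper to run by inspecting the URL's hostname. A minimal sketch of that routing check — `detect_platform` is our own name for illustration, but the hostname tests mirror the ones in `app.py`:

```python
from urllib.parse import urlparse


def detect_platform(url: str) -> str:
    """Classify a URL the way the app routes it to a scraper."""
    host = urlparse(url).netloc.lower()
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    if "instagram.com" in host:
        return "instagram"
    return "generic"


print(detect_platform("https://youtu.be/dQw4w9WgXcQ"))     # → youtube
print(detect_platform("https://www.instagram.com/nasa/"))  # → instagram
print(detect_platform("https://example.com"))              # → generic
```

Matching on `netloc` rather than the raw string avoids false positives such as a page that merely links to youtube.com in its path or query string.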
app.py ADDED
@@ -0,0 +1,396 @@
+ import streamlit as st
+ import pandas as pd
+ import json
+ import io
+ from urllib.parse import urlparse
+ from scraper import scraper
+ from youtube_scraper import youtube_scraper
+ from instagram_scraper import instagram_scraper
+ from instagram_scraper_v2 import instagram_scraper_v2
+
+ # Page configuration
+ st.set_page_config(
+     page_title="Scrape Anythings",
+     page_icon="🕷️",
+     layout="wide",
+     initial_sidebar_state="expanded"
+ )
+
+ # Custom CSS for better styling
+ st.markdown("""
+ <style>
+     .main-header {
+         font-size: 2.5rem;
+         font-weight: bold;
+         color: #1f77b4;
+         text-align: center;
+         margin-bottom: 2rem;
+     }
+     .sub-header {
+         font-size: 1.2rem;
+         color: #666;
+         text-align: center;
+         margin-bottom: 2rem;
+     }
+     .metric-card {
+         background-color: #f0f2f6;
+         padding: 1rem;
+         border-radius: 0.5rem;
+         border-left: 4px solid #1f77b4;
+     }
+     .success-box {
+         background-color: #d4edda;
+         border: 1px solid #c3e6cb;
+         border-radius: 0.5rem;
+         padding: 1rem;
+         margin: 1rem 0;
+     }
+     .error-box {
+         background-color: #f8d7da;
+         border: 1px solid #f5c6cb;
+         border-radius: 0.5rem;
+         padding: 1rem;
+         margin: 1rem 0;
+     }
+ </style>
+ """, unsafe_allow_html=True)
+
+ def validate_url(url):
+     """Validate if the URL is properly formatted"""
+     try:
+         result = urlparse(url)
+         return all([result.scheme, result.netloc])
+     except Exception:
+         return False
+
+ def perform_web_scraping(url, data_types, max_pages=1, rate_limit=2):
+     """Perform actual web scraping using the WebScraper class"""
+     st.info("🔍 Starting web scraping...")
+
+     data_types_lower = [dt.lower() for dt in data_types]
+     with st.spinner("Crawling website..."):
+         scraped_data = scraper.scrape_website(url, data_types_lower, max_pages, rate_limit)
+
+     return scraped_data
+
+ def display_results(scraped_data, is_youtube=False, is_instagram=False):
+     """Display the scraped data in a user-friendly format"""
+     if is_youtube:
+         display_youtube_results(scraped_data)
+     elif is_instagram:
+         display_instagram_results(scraped_data)
+     else:
+         display_regular_results(scraped_data)
+
+ def display_text_results(text_data):
+     st.write(f"**Title:** {text_data.get('title', 'N/A')}")
+     with st.expander("Headings"):
+         for heading in text_data.get("headings", []):
+             st.write(f"- **{heading.get('level', 'h?')}**: {heading.get('text', '')}")
+     with st.expander("Paragraphs"):
+         for para in text_data.get("paragraphs", []):
+             st.write(f"- {para}")
+
+ def display_image_results(images):
+     cols = st.columns(min(4, len(images)))
+     for i, img in enumerate(images):
+         with cols[i % 4]:
+             st.image(img.get("src", ""), caption=f"{img.get('alt', 'Image')[:50]}...", use_column_width=True)
+
+ def display_table_results(tables):
+     for i, table in enumerate(tables):
+         with st.expander(f"Table {i+1} (Header: {table.get('header', [])})"):
+             df = pd.DataFrame(table.get('rows', []))
+             st.dataframe(df)
+
+ def display_link_results(links):
+     for link in links:
+         st.write(f"- [{link.get('text', 'N/A')}]({link.get('href', '#')})")
+
+ def display_metadata_results(metadata):
+     st.json(metadata)
+
+ def display_regular_results(scraped_data):
+     """Display regular website scraping results in a structured format."""
+     st.subheader("📝 Text Content")
+     if scraped_data.get("text_content"):
+         display_text_results(scraped_data["text_content"])
+     else:
+         st.info("No text content was extracted.")
+
+     st.subheader("🖼️ Images")
+     if scraped_data.get("images"):
+         display_image_results(scraped_data["images"])
+     else:
+         st.info("No images were extracted.")
+
+     st.subheader("🔢 Numbers")
+     if scraped_data.get("numbers"):
+         with st.expander("Extracted Numbers", expanded=False):
+             st.write(scraped_data["numbers"])
+     else:
+         st.info("No numbers were extracted.")
+
+     st.subheader("📊 Tables")
+     if scraped_data.get("tables"):
+         display_table_results(scraped_data["tables"])
+     else:
+         st.info("No tables were extracted.")
+
+     st.subheader("🔗 Links")
+     if scraped_data.get("links"):
+         display_link_results(scraped_data["links"])
+     else:
+         st.info("No links were extracted.")
+
+     st.subheader("📄 Metadata")
+     if scraped_data.get("metadata"):
+         display_metadata_results(scraped_data["metadata"])
+     else:
+         st.info("No metadata was extracted.")
+
+ def to_excel(data):
+     """Converts a dictionary of scraped data to an Excel file in memory."""
+     output = io.BytesIO()
+     with pd.ExcelWriter(output, engine='openpyxl') as writer:
+         # Handle simple lists (links, images, numbers)
+         for key in ["links", "images", "numbers"]:
+             if data.get(key):
+                 pd.DataFrame({key.capitalize(): data[key]}).to_excel(writer, sheet_name=key.capitalize(), index=False)
+
+         # Handle text content (stringify so openpyxl can write dict values)
+         if data.get("text_content"):
+             pd.DataFrame({'Text': [str(data["text_content"])]}).to_excel(writer, sheet_name='Text', index=False)
+
+         # Handle dictionaries (metadata, video_info, profile_info)
+         for key in ["metadata", "video_info", "profile_info"]:
+             if data.get(key):
+                 pd.DataFrame(data[key].items(), columns=['Property', 'Value']).to_excel(writer, sheet_name=key.replace('_', ' ').capitalize(), index=False)
+
+         # Handle list of dictionaries (comments)
+         if data.get("comments"):
+             pd.DataFrame(data["comments"]).to_excel(writer, sheet_name='Comments', index=False)
+
+         # Handle tables: either DataFrames or {"header": ..., "rows": ...} dicts
+         if data.get("tables"):
+             for i, table in enumerate(data["tables"]):
+                 df = table if isinstance(table, pd.DataFrame) else pd.DataFrame(table.get('rows', []))
+                 df.to_excel(writer, sheet_name=f'Table_{i+1}', index=False)
+
+     return output.getvalue()
+
+ def create_download_links(scraped_data):
+     """Create download links for different formats"""
+     st.header("Download Data")
+     col1, col2, col3, col4 = st.columns(4)
+
+     # JSON download
+     with col1:
+         json_str = json.dumps(scraped_data or {}, indent=2, default=str)
+         st.download_button(
+             label="Download JSON",
+             data=json_str,
+             file_name="scraped_data.json",
+             mime="application/json",
+             use_container_width=True
+         )
+
+     # CSV download
+     with col2:
+         if scraped_data.get("tables"):
+             # For simplicity, we offer the first table as a CSV download
+             first = scraped_data["tables"][0]
+             df = first if isinstance(first, pd.DataFrame) else pd.DataFrame(first.get('rows', []))
+             st.download_button(
+                 label="Download CSV",
+                 data=df.to_csv(index=False),
+                 file_name="scraped_table.csv",
+                 mime="text/csv",
+                 use_container_width=True
+             )
+         else:
+             st.button("Download CSV", disabled=True, help="No tables found to download.", use_container_width=True)
+
+     # TXT download
+     with col3:
+         text_content = scraped_data.get("text_content", "")
+         if not isinstance(text_content, str):
+             text_content = json.dumps(text_content, indent=2, default=str)
+         st.download_button(
+             label="Download TXT",
+             data=text_content,
+             file_name="scraped_text.txt",
+             mime="text/plain",
+             use_container_width=True
+         )
+
+     # Excel download
+     with col4:
+         try:
+             excel_data = to_excel(scraped_data)
+             st.download_button(
+                 label="Download Excel",
+                 data=excel_data,
+                 file_name="scraped_data.xlsx",
+                 mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+                 use_container_width=True
+             )
+         except Exception as e:
+             st.button("Download Excel", disabled=True, help=f"Excel export failed: {e}", use_container_width=True)
+
+ def display_youtube_results(scraped_data):
+     """Display YouTube scraping results"""
+     if not scraped_data.get("video_info"):
+         st.error("Could not extract YouTube video information.")
+         return
+
+     video_info = scraped_data["video_info"]
+     st.subheader(f'{video_info.get("title", "Untitled")}')
+     st.write(f'**Channel:** {video_info.get("channel", "N/A")}')
+     st.write(f'**Views:** {video_info.get("views", "N/A")}')
+
+     with st.expander("Video Description"):
+         st.write(video_info.get("description", "No description."))
+
+     if scraped_data.get("comments"):
+         with st.expander(f'Comments ({len(scraped_data["comments"])})'):
+             for comment in scraped_data["comments"]:
+                 st.markdown(f"**{comment.get('author', 'Unknown')}** - {comment.get('timestamp', 'Unknown')}")
+                 st.write(comment.get('text', ''))
+                 if comment.get('likes', '0') != '0':
+                     st.caption(f"👍 {comment.get('likes', '0')} likes")
+                 st.divider()
+
+ def display_instagram_results(scraped_data):
+     """Display Instagram scraping results"""
+     if not scraped_data.get("profile_info"):
+         st.error("Could not extract Instagram profile information.")
+         return
+
+     profile_info = scraped_data["profile_info"]
+     with st.expander("Profile Information", expanded=True):
+         st.write(f'**Username:** {profile_info.get("username", "N/A")}')
+         st.write(f'**Display Name:** {profile_info.get("display_name", "N/A")}')
+         st.write(f'**Bio:** {profile_info.get("bio", "N/A")}')
+         st.write(f'**Followers:** {profile_info.get("followers", "N/A")}')
+
+ def main():
+     # Header
+     st.markdown('<h1 class="main-header">✨ Scrape Anythings</h1>', unsafe_allow_html=True)
+     st.markdown('<p class="sub-header">Extract data from any website with ease</p>', unsafe_allow_html=True)
+
+     # Sidebar for configuration
+     with st.sidebar:
+         st.header("Configuration")
+
+         url = st.text_input("Enter Website URL", placeholder="https://example.com")
+
+         is_youtube = ("youtube.com" in url.lower() or "youtu.be" in url.lower()) if url else False
+         is_instagram = "instagram.com" in url.lower() if url else False
+
+         data_types, youtube_data_types, instagram_data_types, max_comments = [], [], [], 50
+
+         if is_youtube:
+             st.info("YouTube URL detected!")
+             youtube_data_types = st.multiselect("YouTube Data Types", ["video_info", "comments"], default=["video_info", "comments"])
+             if "comments" in youtube_data_types:
+                 max_comments = st.slider("Max Comments", 10, 200, 50)
+         elif is_instagram:
+             st.info("Instagram URL detected!")
+             instagram_data_types = st.multiselect("Instagram Data Types", ["profile_info", "images", "posts"], default=["profile_info", "images"])
+         else:
+             data_types = st.multiselect("Data Types", ["Text", "Images", "Links", "Tables", "Metadata", "Numbers"], default=["Text", "Links"])
+
+         st.subheader("Advanced Options")
+         max_pages = st.slider("Max Pages", 1, 10, 1)
+         rate_limit = st.slider("Rate Limit (s)", 1, 10, 2)
+
+         scrape_button = st.button("Start Scraping", type="primary", use_container_width=True)
+
+     # Main content area
+     if scrape_button:
+         if not url or not validate_url(url):
+             st.error("Please enter a valid URL.")
+             return
+
+         # Validate that at least one data type is selected for the given URL type
+         if is_youtube and not youtube_data_types:
+             st.error("Please select at least one YouTube data type to extract.")
+             return
+         elif is_instagram and not instagram_data_types:
+             st.error("Please select at least one Instagram data type to extract.")
+             return
+         elif not is_youtube and not is_instagram and not data_types:
+             st.error("Please select at least one data type to extract.")
+             return
+
+         with st.spinner("Scraping in progress... Please wait."):
+             try:
+                 scraped_data = {}
+                 if is_youtube:
+                     scraped_data = youtube_scraper.scrape_youtube_video(url, "comments" in youtube_data_types, max_comments)
+                 elif is_instagram:
+                     try:
+                         scraped_data = instagram_scraper_v2.extract_instagram_data(url)
+                     except Exception:
+                         st.warning("Improved scraper failed, trying fallback...")
+                         scraped_data = instagram_scraper.extract_instagram_data(url)
+                 else:
+                     scraped_data = perform_web_scraping(url, data_types, max_pages, rate_limit)
+
+                 if scraped_data.get("errors"):
+                     st.error(f'Errors: {scraped_data["errors"]}')
+
+                 # Check if any data was actually scraped before showing success
+                 has_data = any(scraped_data.get(key) for key in ["text_content", "images", "numbers", "tables", "links", "metadata", "video_info", "profile_info"])
+
+                 if has_data:
+                     st.success("Scraping completed successfully!")
+                     st.header("Scraping Results")
+                     display_results(scraped_data, is_youtube, is_instagram)
+                     create_download_links(scraped_data)
+                 else:
+                     st.warning("No data was extracted. The website might be blocking scrapers or the content is not available.")
+
+             except Exception as e:
+                 st.error(f"An unexpected error occurred: {e}")
+
+     else:
+         st.markdown("""
+         ### How to Use
+         1. **Enter URL** and **select data types** in the sidebar.
+         2. Click **Start Scraping** to begin.
+         3. View and **download the results** below.
+         """)
+
+ if __name__ == "__main__":
+     main()
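For reference, the `validate_url` helper in `app.py` above accepts any URL that carries both a scheme and a network location; a quick standalone check of that behavior (same logic, reproduced here so it can run outside Streamlit):

```python
from urllib.parse import urlparse


def validate_url(url):
    """Same check as app.py: require both a scheme and a host."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except Exception:
        return False


print(validate_url("https://example.com"))  # → True
print(validate_url("example.com"))          # → False (no scheme)
print(validate_url("ftp://host/path"))      # → True (any scheme passes)
```

Note that the check is deliberately permissive: it rejects bare hostnames but does not restrict the scheme to http/https.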
instagram_scraper.py ADDED
@@ -0,0 +1,378 @@
+ import streamlit as st
+ import requests
+ from bs4 import BeautifulSoup
+ import json
+ import re
+ import time
+ from datetime import datetime
+
+ class InstagramScraper:
+     def __init__(self):
+         self.session = requests.Session()
+         self.session.headers.update({
+             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
+             'Accept-Language': 'en-US,en;q=0.5',
+             'Accept-Encoding': 'gzip, deflate',
+             'Connection': 'keep-alive',
+             'Upgrade-Insecure-Requests': '1',
+         })
+
+     def extract_instagram_data(self, url):
+         """Extract data from an Instagram profile or post"""
+         scraped_data = {
+             "url": url,
+             "timestamp": datetime.now().isoformat(),
+             "platform": "instagram",
+             "images": [],
+             "posts": [],
+             "profile_info": {},
+             "errors": []
+         }
+
+         try:
+             # Determine if it's a profile or post URL
+             if "/p/" in url or "/reel/" in url:
+                 # Single post
+                 scraped_data.update(self.extract_post_data(url))
+             else:
+                 # Profile
+                 scraped_data.update(self.extract_profile_data(url))
+
+         except Exception as e:
+             scraped_data["errors"].append(f"Instagram scraping error: {str(e)}")
+
+         # Check if we found any data
+         if not scraped_data.get("images") and not scraped_data.get("posts") and not scraped_data.get("profile_info", {}).get("username"):
+             scraped_data["errors"].append("No Instagram data found. This might be due to:")
+             scraped_data["errors"].append("- Private or protected account")
+             scraped_data["errors"].append("- Instagram's anti-scraping measures")
+             scraped_data["errors"].append("- Network connectivity issues")
+             scraped_data["errors"].append("- URL format issues")
+
+         return scraped_data
+
+     def extract_post_data(self, url):
+         """Extract data from a single Instagram post"""
+         post_data = {
+             "post_type": "single_post",
+             "images": [],
+             "post_info": {}
+         }
+
+         try:
+             response = self.session.get(url, timeout=10)
+             response.raise_for_status()
+
+             # Instagram loads images dynamically, so look for URL patterns
+             # in the raw page source rather than in parsed tags
+             page_text = response.text
+
+             # Find image URLs in the page source
+             image_patterns = [
+                 # Instagram post images (high quality)
+                 r'"display_url":"([^"]+)"',
+                 r'"display_src":"([^"]+)"',
+                 r'"src":"([^"]*\.jpg[^"]*)"',
+                 r'"src":"([^"]*\.jpeg[^"]*)"',
+                 r'"src":"([^"]*\.png[^"]*)"',
+                 # Direct image URLs
+                 r'https://[^"]*\.jpg[^"]*',
+                 r'https://[^"]*\.jpeg[^"]*',
+                 r'https://[^"]*\.png[^"]*',
+                 # Instagram CDN URLs (high quality)
+                 r'https://scontent[^"]*\.jpg[^"]*',
+                 r'https://scontent[^"]*\.jpeg[^"]*',
+                 r'https://scontent[^"]*\.png[^"]*',
+                 # Additional Instagram patterns
+                 r'"url":"([^"]*\.jpg[^"]*)"',
+                 r'"url":"([^"]*\.jpeg[^"]*)"',
+                 r'"url":"([^"]*\.png[^"]*)"'
+             ]
+
+             found_images = set()
+             for pattern in image_patterns:
+                 matches = re.findall(pattern, page_text)
+                 for match in matches:
+                     if match and ('instagram' in match.lower() or 'scontent' in match.lower()):
+                         # Clean up the URL (unescape JSON-encoded characters)
+                         clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
+                         found_images.add(clean_url)
+
+             # Convert to image objects
+             for i, img_url in enumerate(list(found_images)):
+                 post_data["images"].append({
+                     "src": img_url,
+                     "alt": f"Instagram post image {i+1}",
+                     "title": f"Instagram post image {i+1}",
+                     "width": "",
+                     "height": ""
+                 })
+
+             # Extract post information
+             post_data["post_info"] = {
+                 "url": url,
+                 "images_count": len(post_data["images"]),
+                 "scraped_at": datetime.now().isoformat()
+             }
+
+         except Exception as e:
+             post_data["errors"] = [f"Failed to extract post data: {str(e)}"]
+
+         return post_data
+
+     def extract_profile_data(self, url):
+         """Extract data from an Instagram profile"""
+         profile_data = {
+             "profile_type": "account",
+             "images": [],
+             "profile_info": {},
+             "posts": []
+         }
+
+         try:
+             response = self.session.get(url, timeout=10)
+             response.raise_for_status()
+
+             soup = BeautifulSoup(response.text, 'html.parser')
+             page_text = response.text
+
+             # Extract profile information
+             profile_data["profile_info"] = self.extract_profile_info(soup, page_text)
+
+             # Extract recent posts first
+             profile_data["posts"] = self.extract_recent_posts(page_text)
+
+             # Extract images from the profile page
+             profile_data["images"] = self.extract_profile_images(page_text)
+
+             # Extract images from individual posts (higher quality)
+             if profile_data["posts"]:
+                 post_images = self.extract_images_from_posts(profile_data["posts"], max_posts=3)
+                 if post_images:
+                     profile_data["images"].extend(post_images)
+
+         except Exception as e:
+             profile_data["errors"] = [f"Failed to extract profile data: {str(e)}"]
+
+         return profile_data
+
+     def extract_profile_info(self, soup, page_text):
+         """Extract profile information"""
+         profile_info = {
+             "username": "",
+             "display_name": "",
+             "bio": "",
+             "followers": "",
+             "following": "",
+             "posts_count": ""
+         }
+
+         try:
+             # Instagram loads this data dynamically, so parse the JSON
+             # embedded in the page source
+             json_patterns = [
+                 r'window\._sharedData\s*=\s*({[^}]+})',
+                 r'"profile_page":\s*({[^}]+})',
+                 r'"user":\s*({[^}]+})'
+             ]
+
+             for pattern in json_patterns:
+                 matches = re.findall(pattern, page_text)
+                 if matches:
+                     try:
+                         data = json.loads(matches[0])
+                         # Extract profile info from JSON
+                         if "user" in data:
+                             user_data = data["user"]
+                             profile_info["username"] = user_data.get("username", "")
+                             profile_info["display_name"] = user_data.get("full_name", "")
+                             profile_info["bio"] = user_data.get("biography", "")
+                             profile_info["followers"] = user_data.get("followed_by", {}).get("count", "")
+                             profile_info["following"] = user_data.get("follows", {}).get("count", "")
+                             profile_info["posts_count"] = user_data.get("media", {}).get("count", "")
+                     except Exception:
+                         continue
+
+             # Fallback: try to extract the username from the HTML <title>
+             if not profile_info["username"]:
+                 title_tag = soup.find('title')
+                 if title_tag:
+                     title_text = title_tag.get_text()
+                     if '(' in title_text and ')' in title_text:
+                         username = title_text.split('(')[1].split(')')[0]
+                         profile_info["username"] = username
+
+         except Exception as e:
+             profile_info["error"] = f"Failed to extract profile info: {str(e)}"
+
+         return profile_info
+
+     def extract_profile_images(self, page_text):
+         """Extract images from the profile page"""
+         images = []
+
+         try:
+             # Instagram stores post images in embedded JSON data
+             image_patterns = [
+                 # Instagram post images (high quality)
+                 r'"display_url":"([^"]+)"',
+                 r'"display_src":"([^"]+)"',
+                 r'"src":"([^"]*\.jpg[^"]*)"',
+                 r'"src":"([^"]*\.jpeg[^"]*)"',
+                 r'"src":"([^"]*\.png[^"]*)"',
+                 # Direct image URLs
+                 r'https://[^"]*\.jpg[^"]*',
+                 r'https://[^"]*\.jpeg[^"]*',
+                 r'https://[^"]*\.png[^"]*',
+                 # Instagram CDN URLs
+                 r'https://scontent[^"]*\.jpg[^"]*',
+                 r'https://scontent[^"]*\.jpeg[^"]*',
+                 r'https://scontent[^"]*\.png[^"]*',
+                 # Additional Instagram patterns
+                 r'"url":"([^"]*\.jpg[^"]*)"',
+                 r'"url":"([^"]*\.jpeg[^"]*)"',
+                 r'"url":"([^"]*\.png[^"]*)"'
+             ]
+
+             found_images = set()
+             for pattern in image_patterns:
+                 matches = re.findall(pattern, page_text)
+                 for match in matches:
+                     if match and ('instagram' in match.lower() or 'scontent' in match.lower()):
+                         # Clean up the URL
+                         clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
+                         found_images.add(clean_url)
+
+             # Convert to image objects
+             for i, img_url in enumerate(list(found_images)):
+                 images.append({
+                     "src": img_url,
+                     "alt": f"Instagram post image {i+1}",
+                     "title": f"Instagram post image {i+1}",
+                     "width": "",
+                     "height": ""
+                 })
+
+         except Exception as e:
+             st.error(f"Failed to extract profile images: {str(e)}")
+
+         return images
+
+     def extract_recent_posts(self, page_text):
+         """Extract recent posts from the profile"""
+         posts = []
+
+         try:
+             # Look for post URLs in the page source
+             post_patterns = [
+                 r'"shortcode":"([^"]+)"',
+                 r'/p/([^/"]+)',
+                 r'/reel/([^/"]+)'
+             ]
+
+             found_posts = set()
+             for pattern in post_patterns:
+                 matches = re.findall(pattern, page_text)
+                 for match in matches:
+                     if match:
+                         found_posts.add(match)
+
+             # Convert to post objects (limit to 10 posts)
+             for i, post_code in enumerate(list(found_posts)[:10]):
+                 posts.append({
+                     "shortcode": post_code,
+                     "url": f"https://www.instagram.com/p/{post_code}/",
+                     "index": i + 1
+                 })
+
+         except Exception as e:
+             st.error(f"Failed to extract recent posts: {str(e)}")
+
+         return posts
+
+     def extract_images_from_posts(self, posts, max_posts=5):
+         """Extract images from individual posts"""
+         all_images = []
+
+         try:
+             for i, post in enumerate(posts[:max_posts]):
+                 try:
+                     # Get the post page
+                     post_url = post["url"]
+                     response = self.session.get(post_url, timeout=10)
+                     response.raise_for_status()
+
+                     # Extract images from this post
+                     post_images = self.extract_post_images(response.text)
+
+                     # Add post context to images
+                     for img in post_images:
+                         img["post_url"] = post_url
+                         img["post_index"] = i + 1
+                         all_images.append(img)
+
+                     # Small delay to be respectful
+                     time.sleep(1)
+
+                 except Exception as e:
+                     st.warning(f"Failed to extract images from post {post['shortcode']}: {str(e)}")
326
+ continue
327
+
328
+ except Exception as e:
329
+ st.error(f"Failed to extract images from posts: {str(e)}")
330
+
331
+ return all_images
332
+
333
+ def extract_post_images(self, page_text):
334
+ """Extract images from a single post page"""
335
+ images = []
336
+
337
+ try:
338
+ # Look for high-quality Instagram post images
339
+ image_patterns = [
340
+ # Instagram post images (high quality)
341
+ r'"display_url":"([^"]+)"',
342
+ r'"display_src":"([^"]+)"',
343
+ # Instagram CDN URLs (highest quality)
344
+ r'https://scontent[^"]*\.jpg[^"]*',
345
+ r'https://scontent[^"]*\.jpeg[^"]*',
346
+ r'https://scontent[^"]*\.png[^"]*',
347
+ # Additional patterns
348
+ r'"src":"([^"]*\.jpg[^"]*)"',
349
+ r'"src":"([^"]*\.jpeg[^"]*)"',
350
+ r'"src":"([^"]*\.png[^"]*)"'
351
+ ]
352
+
353
+ found_images = set()
354
+ for pattern in image_patterns:
355
+ matches = re.findall(pattern, page_text)
356
+ for match in matches:
357
+ if match and ('scontent' in match.lower() or 'instagram' in match.lower()):
358
+ # Clean up the URL
359
+ clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
360
+ found_images.add(clean_url)
361
+
362
+ # Convert to image objects
363
+ for i, img_url in enumerate(list(found_images)):
364
+ images.append({
365
+ "src": img_url,
366
+ "alt": f"Instagram post image {i+1}",
367
+ "title": f"Instagram post image {i+1}",
368
+ "width": "",
369
+ "height": ""
370
+ })
371
+
372
+ except Exception as e:
373
+ st.error(f"Failed to extract post images: {str(e)}")
374
+
375
+ return images
376
+
377
+ # Global Instagram scraper instance
378
+ instagram_scraper = InstagramScraper()
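The post-discovery logic above is regex-based rather than API-based: it scans the raw page source for shortcode patterns and builds post URLs from them. A minimal, self-contained sketch of that idea (the `extract_shortcodes` helper name and sample HTML are illustrative, not part of the repo; the patterns mirror `extract_recent_posts`):

```python
import re

def extract_shortcodes(page_text, limit=10):
    """Collect unique post shortcodes from raw Instagram page source."""
    patterns = [r'"shortcode":"([^"]+)"', r'/p/([^/"]+)']
    found = set()
    for pattern in patterns:
        found.update(m for m in re.findall(pattern, page_text) if m)
    # Sort for a deterministic order (the scraper itself iterates set order)
    return [f"https://www.instagram.com/p/{code}/" for code in sorted(found)[:limit]]

sample = '{"shortcode":"Cabc123"} <a href="/p/Xyz789/">post</a>'
print(extract_shortcodes(sample))
# → ['https://www.instagram.com/p/Cabc123/', 'https://www.instagram.com/p/Xyz789/']
```

Because the same shortcode often appears several times in the page JSON, collecting into a `set` before building URLs is what keeps the post list free of duplicates.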
instagram_scraper_v2.py ADDED
@@ -0,0 +1,184 @@
+import streamlit as st
+import requests
+import json
+import re
+import time
+import random
+from datetime import datetime
+
+class InstagramScraperV2:
+    def __init__(self):
+        self.session = requests.Session()
+        self.setup_session()
+
+    def setup_session(self):
+        """Set up the session with basic anti-detection measures."""
+        user_agents = [
+            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
+        ]
+
+        self.session.headers.update({
+            'User-Agent': random.choice(user_agents),
+            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
+            'Accept-Language': 'en-US,en;q=0.5',
+            'Connection': 'keep-alive',
+            'Upgrade-Insecure-Requests': '1'
+        })
+
+    def get_page_with_retry(self, url, max_retries=3):
+        """Fetch a page, retrying with a random delay between attempts."""
+        for attempt in range(max_retries):
+            try:
+                time.sleep(random.uniform(2, 4))
+                response = self.session.get(url, timeout=20)
+                response.raise_for_status()
+                return response.text
+            except Exception as e:
+                st.warning(f"Attempt {attempt + 1} failed: {str(e)}")
+                if attempt == max_retries - 1:
+                    raise
+        return None
+
+    def extract_instagram_data(self, url):
+        """Extract data from Instagram with improved error handling."""
+        scraped_data = {
+            "url": url,
+            "timestamp": datetime.now().isoformat(),
+            "platform": "instagram",
+            "images": [],
+            "posts": [],
+            "profile_info": {},
+            "errors": []
+        }
+
+        try:
+            page_text = self.get_page_with_retry(url)
+            if not page_text:
+                scraped_data["errors"].append("Failed to load Instagram page")
+                return scraped_data
+
+            # Extract images, profile info, and posts from the page source
+            scraped_data["images"] = self.extract_images_from_page(page_text)
+            scraped_data["profile_info"] = self.extract_profile_info(page_text)
+            scraped_data["posts"] = self.extract_recent_posts(page_text)
+
+        except Exception as e:
+            scraped_data["errors"].append(f"Instagram scraping error: {str(e)}")
+
+        return scraped_data
+
+    def extract_images_from_page(self, page_text):
+        """Extract images with improved patterns."""
+        images = []
+
+        try:
+            # Enhanced patterns for Instagram images
+            patterns = [
+                r'https://scontent[^"]*\.jpg[^"]*',
+                r'https://scontent[^"]*\.jpeg[^"]*',
+                r'https://scontent[^"]*\.png[^"]*',
+                r'"display_url":"([^"]+)"',
+                r'"display_src":"([^"]+)"'
+            ]
+
+            found_images = set()
+            for pattern in patterns:
+                for match in re.findall(pattern, page_text):
+                    if match and ('scontent' in match.lower() or 'instagram' in match.lower()):
+                        clean_url = match.replace('\\u0026', '&').replace('\\/', '/')
+                        found_images.add(clean_url)
+
+            for i, img_url in enumerate(found_images):
+                images.append({
+                    "src": img_url,
+                    "alt": f"Instagram image {i+1}",
+                    "title": f"Instagram image {i+1}",
+                    "width": "",
+                    "height": ""
+                })
+
+        except Exception as e:
+            st.error(f"Failed to extract images: {str(e)}")
+
+        return images
+
+    def extract_profile_info(self, page_text):
+        """Extract profile information."""
+        profile_info = {
+            "username": "",
+            "display_name": "",
+            "bio": "",
+            "followers": "",
+            "following": "",
+            "posts_count": ""
+        }
+
+        try:
+            # Extract the username from the page <title>
+            title_match = re.search(r'<title>([^<]+)</title>', page_text)
+            if title_match:
+                title = title_match.group(1)
+                if '(' in title and ')' in title:
+                    profile_info["username"] = title.split('(')[1].split(')')[0]
+
+            # Look for profile fields in embedded JSON
+            json_patterns = [
+                r'"username":"([^"]+)"',
+                r'"full_name":"([^"]+)"',
+                r'"biography":"([^"]+)"'
+            ]
+
+            for pattern in json_patterns:
+                matches = re.findall(pattern, page_text)
+                if matches:
+                    if "username" in pattern:
+                        profile_info["username"] = matches[0]
+                    elif "full_name" in pattern:
+                        profile_info["display_name"] = matches[0]
+                    elif "biography" in pattern:
+                        profile_info["bio"] = matches[0]
+
+        except Exception as e:
+            profile_info["error"] = f"Failed to extract profile info: {str(e)}"
+
+        return profile_info
+
+    def extract_recent_posts(self, page_text):
+        """Extract recent post shortcodes."""
+        posts = []
+
+        try:
+            post_patterns = [
+                r'"shortcode":"([^"]+)"',
+                r'/p/([^/"]+)'
+            ]
+
+            found_posts = set()
+            for pattern in post_patterns:
+                for match in re.findall(pattern, page_text):
+                    if match:
+                        found_posts.add(match)
+
+            for i, post_code in enumerate(list(found_posts)[:10]):
+                posts.append({
+                    "shortcode": post_code,
+                    "url": f"https://www.instagram.com/p/{post_code}/",
+                    "index": i + 1
+                })
+
+        except Exception as e:
+            st.error(f"Failed to extract posts: {str(e)}")
+
+        return posts
+
+# Global instance
+instagram_scraper_v2 = InstagramScraperV2()
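Both scraper versions rely on the same unescaping step before a matched URL is usable: Instagram's embedded JSON escapes `&` as `\u0026` and `/` as `\/`. Sketched in isolation (the `clean_instagram_url` name and the `scontent.example` host are illustrative):

```python
def clean_instagram_url(raw):
    """Undo the JSON escaping Instagram applies to embedded URLs."""
    return raw.replace('\\u0026', '&').replace('\\/', '/')

escaped = 'https:\\/\\/scontent.example\\/img.jpg?x=1\\u0026y=2'
print(clean_instagram_url(escaped))
# → https://scontent.example/img.jpg?x=1&y=2
```

Skipping this step leaves URLs that a browser or `requests` cannot fetch, which is why every pattern match goes through it before being added to the result set.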
requirements.txt CHANGED
@@ -1,3 +1,8 @@
- altair
+ streamlit
  pandas
- streamlit
+ requests
+ beautifulsoup4
+ selenium
+ webdriver-manager
+ lxml
+ openpyxl
requirements_hf.txt ADDED
@@ -0,0 +1,5 @@
+ streamlit>=1.28.0
+ pandas>=1.5.0
+ requests>=2.28.0
+ beautifulsoup4>=4.11.0
+ lxml>=4.9.0
scraper.py ADDED
@@ -0,0 +1,320 @@
+import requests
+from bs4 import BeautifulSoup
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
+from webdriver_manager.chrome import ChromeDriverManager
+import time
+import re
+from urllib.parse import urljoin, urlparse
+import json
+from datetime import datetime
+
+class WebScraper:
+    def __init__(self):
+        self.session = requests.Session()
+        self.session.headers.update({
+            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
+        })
+        self.driver = None
+
+    def setup_selenium(self):
+        """Set up a headless Selenium WebDriver for dynamic content."""
+        try:
+            chrome_options = Options()
+            chrome_options.add_argument("--headless")
+            chrome_options.add_argument("--no-sandbox")
+            chrome_options.add_argument("--disable-dev-shm-usage")
+            chrome_options.add_argument("--disable-gpu")
+            chrome_options.add_argument("--window-size=1920,1080")
+
+            self.driver = webdriver.Chrome(
+                service=Service(ChromeDriverManager().install()),
+                options=chrome_options
+            )
+            return True
+        except Exception as e:
+            print(f"Failed to setup Selenium: {e}")
+            return False
+
+    def close_selenium(self):
+        """Close the Selenium WebDriver."""
+        if self.driver:
+            self.driver.quit()
+            self.driver = None
+
+    def get_page_content(self, url, use_selenium=False):
+        """Fetch page content using requests or Selenium."""
+        try:
+            if use_selenium and self.driver:
+                self.driver.get(url)
+                time.sleep(2)  # Wait for dynamic content
+                return self.driver.page_source
+            else:
+                response = self.session.get(url, timeout=10)
+                response.raise_for_status()
+                return response.text
+        except Exception as e:
+            print(f"Error fetching page: {e}")
+            return None
+
+    def extract_text_content(self, soup):
+        """Extract text content from a BeautifulSoup object."""
+        text_data = {
+            "title": "",
+            "headings": [],
+            "paragraphs": [],
+            "lists": []
+        }
+
+        # Title
+        title_tag = soup.find('title')
+        if title_tag:
+            text_data["title"] = title_tag.get_text().strip()
+
+        # Headings
+        for tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
+            for heading in soup.find_all(tag):
+                text = heading.get_text().strip()
+                if text:
+                    text_data["headings"].append({
+                        "level": tag,
+                        "text": text
+                    })
+
+        # Paragraphs (filter out very short text)
+        for p in soup.find_all('p'):
+            text = p.get_text().strip()
+            if text and len(text) > 20:
+                text_data["paragraphs"].append(text)
+
+        # Lists
+        for lst in soup.find_all(['ul', 'ol']):
+            items = [item.get_text().strip() for item in lst.find_all('li') if item.get_text().strip()]
+            if items:
+                text_data["lists"].append({
+                    "type": lst.name,
+                    "items": items
+                })
+
+        return text_data
+
+    def extract_numbers(self, soup):
+        """Extract all numbers (integers and floats) from the text content."""
+        text = soup.get_text()
+        # Regex to find integers and floats
+        numbers = re.findall(r'\b\d+\.?\d*\b', text)
+        # Convert to float for consistency and de-duplicate
+        return sorted(set(float(n) for n in numbers if n.strip()))
+
+    def extract_images(self, soup, base_url):
+        """Extract images from a BeautifulSoup object."""
+        images = []
+
+        for img in soup.find_all('img'):
+            src = img.get('src', '')
+            if src:
+                # Make relative URLs absolute
+                if not src.startswith(('http://', 'https://')):
+                    src = urljoin(base_url, src)
+
+                images.append({
+                    "src": src,
+                    "alt": img.get('alt', ''),
+                    "title": img.get('title', ''),
+                    "width": img.get('width', ''),
+                    "height": img.get('height', '')
+                })
+
+        return images
+
+    def extract_links(self, soup, base_url):
+        """Extract links from a BeautifulSoup object."""
+        links = []
+
+        for link in soup.find_all('a', href=True):
+            href = link.get('href')
+            text = link.get_text().strip()
+
+            # Skip in-page anchors before absolutizing, then resolve relative URLs
+            if href and text and not href.startswith('#'):
+                if not href.startswith(('http://', 'https://')):
+                    href = urljoin(base_url, href)
+
+                links.append({
+                    "href": href,
+                    "text": text,
+                    "title": link.get('title', ''),
+                    "is_external": not href.startswith(base_url)
+                })
+
+        return links
+
+    def extract_tables(self, soup):
+        """Extract tables from a BeautifulSoup object."""
+        tables = []
+
+        for table in soup.find_all('table'):
+            table_data = {
+                "headers": [],
+                "rows": [],
+                "caption": ""
+            }
+
+            # Caption
+            caption = table.find('caption')
+            if caption:
+                table_data["caption"] = caption.get_text().strip()
+
+            # Headers
+            thead = table.find('thead')
+            if thead:
+                header_row = thead.find('tr')
+                if header_row:
+                    headers = header_row.find_all(['th', 'td'])
+                    table_data["headers"] = [h.get_text().strip() for h in headers]
+
+            # Rows
+            tbody = table.find('tbody') or table
+            for row in tbody.find_all('tr'):
+                cells = row.find_all(['td', 'th'])
+                if cells:
+                    table_data["rows"].append([cell.get_text().strip() for cell in cells])
+
+            if table_data["rows"]:
+                tables.append(table_data)
+
+        return tables
+
+    def extract_metadata(self, soup):
+        """Extract metadata from a BeautifulSoup object."""
+        metadata = {
+            "title": "",
+            "description": "",
+            "keywords": [],
+            "author": "",
+            "language": "en",
+            "robots": "",
+            "viewport": "",
+            "charset": ""
+        }
+
+        # Title
+        title_tag = soup.find('title')
+        if title_tag:
+            metadata["title"] = title_tag.get_text().strip()
+
+        # Meta tags
+        for meta in soup.find_all('meta'):
+            name = meta.get('name', '').lower()
+            content = meta.get('content', '')
+            property_attr = meta.get('property', '').lower()
+
+            if name == 'description' or property_attr == 'og:description':
+                metadata["description"] = content
+            elif name == 'keywords':
+                metadata["keywords"] = [kw.strip() for kw in content.split(',')]
+            elif name == 'author':
+                metadata["author"] = content
+            elif name == 'robots':
+                metadata["robots"] = content
+            elif name == 'viewport':
+                metadata["viewport"] = content
+            elif property_attr == 'og:title':
+                metadata["title"] = content or metadata["title"]
+
+        # Charset
+        charset_meta = soup.find('meta', charset=True)
+        if charset_meta:
+            metadata["charset"] = charset_meta.get('charset')
+
+        # Language from the <html> tag
+        html_tag = soup.find('html')
+        if html_tag:
+            metadata["language"] = html_tag.get('lang', 'en')
+
+        return metadata
+
+    def scrape_website(self, url, data_types, max_pages=1, rate_limit=2):
+        """Main scraping function."""
+        scraped_data = {
+            "url": url,
+            "timestamp": datetime.now().isoformat(),
+            "data_types": data_types,
+            "pages_crawled": 0,
+            "errors": []
+        }
+
+        use_selenium = False  # Initialized first so the finally block is always safe
+        try:
+            # Selenium is needed for dynamically rendered images and tables
+            use_selenium = "images" in data_types or "tables" in data_types
+            if use_selenium:
+                if not self.setup_selenium():
+                    scraped_data["errors"].append("Failed to setup Selenium for dynamic content")
+
+            # Get page content
+            content = self.get_page_content(url, use_selenium)
+            if not content:
+                scraped_data["errors"].append("Failed to fetch page content")
+                return scraped_data
+
+            # Parse with BeautifulSoup
+            soup = BeautifulSoup(content, 'html.parser')
+            scraped_data["pages_crawled"] = 1
+
+            # Extract data based on the selected types
+            if "text" in data_types:
+                scraped_data["text_content"] = self.extract_text_content(soup)
+            if "images" in data_types:
+                scraped_data["images"] = self.extract_images(soup, url)
+            if "links" in data_types:
+                scraped_data["links"] = self.extract_links(soup, url)
+            if "tables" in data_types:
+                scraped_data["tables"] = self.extract_tables(soup)
+            if "metadata" in data_types:
+                scraped_data["metadata"] = self.extract_metadata(soup)
+            if "numbers" in data_types:
+                scraped_data["numbers"] = self.extract_numbers(soup)
+
+            # Rate limiting
+            time.sleep(rate_limit)
+
+        except Exception as e:
+            scraped_data["errors"].append(f"Scraping error: {str(e)}")
+
+        finally:
+            # Clean up Selenium
+            if use_selenium:
+                self.close_selenium()
+
+        return scraped_data
+
+# Global scraper instance
+scraper = WebScraper()
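`extract_images` and `extract_links` absolutize relative URLs the same way: already-absolute URLs pass through unchanged, everything else goes through `urljoin`. The behavior worth noting is that `urljoin` resolves root-relative and page-relative paths differently (the `example.com` URLs below are illustrative):

```python
from urllib.parse import urljoin

base = "https://example.com/blog/post.html"
for src in ["/static/logo.png", "img/photo.jpg", "https://cdn.example.com/a.png"]:
    # Mirrors the scraper's check: absolute URLs are kept, relative ones resolved
    resolved = src if src.startswith(("http://", "https://")) else urljoin(base, src)
    print(resolved)
# → https://example.com/static/logo.png
# → https://example.com/blog/img/photo.jpg
# → https://cdn.example.com/a.png
```

Root-relative paths (`/static/...`) resolve against the site root, while bare paths (`img/...`) resolve against the directory of the current page, which is exactly what a browser would fetch for those `src` attributes.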
youtube_scraper.py ADDED
@@ -0,0 +1,215 @@
+import streamlit as st
+import requests
+from bs4 import BeautifulSoup
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options
+from selenium.webdriver.chrome.service import Service
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
+from webdriver_manager.chrome import ChromeDriverManager
+import time
+import re
+import json
+from datetime import datetime
+
+class YouTubeScraper:
+    def __init__(self):
+        self.driver = None
+        self.setup_selenium()
+
+    def setup_selenium(self):
+        """Set up a headless Selenium WebDriver for YouTube."""
+        try:
+            chrome_options = Options()
+            chrome_options.add_argument("--headless")
+            chrome_options.add_argument("--no-sandbox")
+            chrome_options.add_argument("--disable-dev-shm-usage")
+            chrome_options.add_argument("--disable-gpu")
+            chrome_options.add_argument("--window-size=1920,1080")
+            chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
+
+            self.driver = webdriver.Chrome(
+                service=Service(ChromeDriverManager().install()),
+                options=chrome_options
+            )
+            return True
+        except Exception as e:
+            st.error(f"Failed to setup Selenium: {e}")
+            return False
+
+    def close_selenium(self):
+        """Close the Selenium WebDriver."""
+        if self.driver:
+            self.driver.quit()
+            self.driver = None
+
+    def extract_video_info(self, url):
+        """Extract basic video information."""
+        try:
+            self.driver.get(url)
+            time.sleep(3)  # Wait for the page to load
+
+            # Video title (fall back to the first <h1> if the specific selector fails)
+            try:
+                title = self.driver.find_element(By.CSS_SELECTOR, "h1.ytd-video-primary-info-renderer").text
+            except Exception:
+                try:
+                    title = self.driver.find_element(By.CSS_SELECTOR, "h1").text
+                except Exception:
+                    title = "Title not found"
+
+            # Channel name
+            try:
+                channel = self.driver.find_element(By.CSS_SELECTOR, "ytd-channel-name yt-formatted-string a").text
+            except Exception:
+                channel = "Channel not found"
+
+            # View count
+            try:
+                views = self.driver.find_element(By.CSS_SELECTOR, "span.view-count").text
+            except Exception:
+                views = "Views not found"
+
+            # Description
+            try:
+                description = self.driver.find_element(By.CSS_SELECTOR, "ytd-expandable-video-description-body-text").text
+            except Exception:
+                description = "Description not found"
+
+            return {
+                "title": title,
+                "channel": channel,
+                "views": views,
+                "description": description,
+                "url": url
+            }
+
+        except Exception as e:
+            return {"error": f"Failed to extract video info: {str(e)}"}
+
+    def extract_comments(self, url, max_comments=50):
+        """Extract comments from a YouTube video."""
+        try:
+            self.driver.get(url)
+            time.sleep(3)
+
+            # Scroll down to trigger comment loading
+            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
+            time.sleep(2)
+
+            comments = []
+
+            # Method 1: structured comment elements
+            try:
+                comment_elements = self.driver.find_elements(By.CSS_SELECTOR, "ytd-comment-thread-renderer")
+
+                for i, comment in enumerate(comment_elements[:max_comments]):
+                    try:
+                        author = comment.find_element(By.CSS_SELECTOR, "a#author-text").text.strip()
+                        text = comment.find_element(By.CSS_SELECTOR, "#content-text").text.strip()
+                        timestamp = comment.find_element(By.CSS_SELECTOR, "a.yt-simple-endpoint").text.strip()
+
+                        likes = "0"
+                        try:
+                            likes = comment.find_element(By.CSS_SELECTOR, "span#vote-count-middle").text.strip()
+                        except Exception:
+                            pass
+
+                        if text:  # Only add comments that have text
+                            comments.append({
+                                "author": author,
+                                "text": text,
+                                "timestamp": timestamp,
+                                "likes": likes,
+                                "comment_id": i
+                            })
+
+                    except Exception:
+                        continue
+
+            except Exception as e:
+                st.warning(f"Could not extract comments using primary method: {e}")
+
+            # Method 2: fall back to scanning the page source
+            if not comments:
+                try:
+                    page_text = self.driver.page_source
+                    soup = BeautifulSoup(page_text, 'html.parser')
+
+                    # Look for comment-like elements by class name
+                    comment_patterns = [
+                        "ytd-comment-renderer",
+                        "comment-text",
+                        "ytd-comment-thread-renderer"
+                    ]
+
+                    for pattern in comment_patterns:
+                        elements = soup.find_all(attrs={"class": re.compile(pattern)})
+                        for element in elements[:max_comments]:
+                            text = element.get_text().strip()
+                            if text and len(text) > 10:
+                                comments.append({
+                                    "author": "Unknown",
+                                    "text": text,
+                                    "timestamp": "Unknown",
+                                    "likes": "0",
+                                    "comment_id": len(comments)
+                                })
+
+                except Exception as e:
+                    st.error(f"Alternative comment extraction failed: {e}")
+
+            return comments
+
+        except Exception as e:
+            return [{"error": f"Failed to extract comments: {str(e)}"}]
+
+    def scrape_youtube_video(self, url, extract_comments=True, max_comments=50):
+        """Main function to scrape YouTube video data."""
+        result = {
+            "url": url,
+            "timestamp": datetime.now().isoformat(),
+            "video_info": {},
+            "comments": [],
+            "errors": []
+        }
+
+        try:
+            # Re-create the driver if a previous scrape closed it
+            if not self.driver:
+                self.setup_selenium()
+
+            # Extract video information
+            result["video_info"] = self.extract_video_info(url)
+
+            # Extract comments if requested
+            if extract_comments:
+                result["comments"] = self.extract_comments(url, max_comments)
+
+        except Exception as e:
+            result["errors"].append(f"Scraping error: {str(e)}")
+
+        finally:
+            self.close_selenium()
+
+        return result
+
+# Global YouTube scraper instance
+youtube_scraper = YouTubeScraper()
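The cascading try/except lookups in `extract_video_info` all follow one pattern: try selectors in order, return the first that succeeds, else a default. That pattern can be factored into a small helper, shown here with plain callables standing in for Selenium lookups (the `first_match` name and the sample callables are illustrative, not part of the repo):

```python
def first_match(getters, default):
    """Return the result of the first getter that does not raise."""
    for get in getters:
        try:
            return get()
        except Exception:
            continue
    return default

def missing():
    # Stands in for a Selenium find_element call whose selector matches nothing
    raise LookupError("selector not found")

title = first_match([missing, lambda: "Video title"], "Title not found")
print(title)
# → Video title
```

With a helper like this, each field becomes one line, e.g. `first_match([specific_selector, generic_selector], "Title not found")`, which keeps the selector fallback order explicit without nesting try/except blocks.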