itsOwen committed
Commit
d25f246
·
1 Parent(s): a89b867

multi-page scrape beta released

README.md CHANGED
@@ -167,6 +167,73 @@ Note: Ensure that your firewall allows connections to port 11434 for Ollama.
 
167
 
168
  4. Watch as CyberScraper 2077 tears through the net, extracting your data faster than you can say "flatline"!
169
 
170
+ ## 🌐 Multi-Page Scraping (BETA)
171
+
172
+ > **Note**: The multi-page scraping feature is currently in beta. While functional, you may encounter occasional issues or unexpected behavior. We appreciate your feedback and patience as we continue to improve this feature.
173
+
174
+ CyberScraper 2077 now supports multi-page scraping, allowing you to extract data from multiple pages of a website in one go. This feature is perfect for scraping paginated content, search results, or any site with data spread across multiple pages.
175
+
176
+ ### How to Use Multi-Page Scraping
177
+
178
+ When scraping multiple pages, include the page range with the URL each time so the scraper can detect the URL structure reliably; it recognizes nearly all common URL formats.
179
+
180
+ 1. **Basic Usage**:
181
+ To scrape multiple pages, use the following format when entering the URL:
182
+ ```
183
+ https://example.com/page 1-5
184
+ https://example.com/p/ 1-6
185
+ https://example.com/xample/something-something-1279?p=1 1-3
186
+ ```
187
+ Each entry scrapes the range that follows the URL; for example, `1-5` scrapes pages 1 through 5.
188
+
189
+ 2. **Custom Page Ranges**:
190
+ You can specify custom page ranges:
191
+ ```
192
+ https://example.com/p/ 1-5,7,9-12
193
+ https://example.com/xample/something-something-1279?p=1 1,7,8,9
194
+ ```
195
+ The first example scrapes pages 1 to 5, page 7, and pages 9 to 12; the second scrapes pages 1, 7, 8, and 9.
196
+
197
+ 3. **URL Patterns**:
198
+ For websites with different URL structures, you can specify a pattern:
199
+ ```
200
+ https://example.com/search?q=cyberpunk&page={page} 1-5
201
+ ```
202
+ Put the `{page}` placeholder wherever the page number appears in the URL.
203
+
204
+ 4. **Automatic Pattern Detection**:
205
+ If you don't specify a pattern, CyberScraper 2077 will attempt to detect the URL pattern automatically. However, for best results, specifying the pattern is recommended.
206
+
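The range syntax described above can be sketched in plain Python; this mirrors the `parse_page_numbers` helper added in this commit:

```python
# Expand the "1-5,7,9-12" page syntax into a sorted list of page numbers.
def parse_page_numbers(pages):
    if not pages:
        return [1]  # default to the first page when no range is given
    page_numbers = []
    for part in pages.split(','):
        if '-' in part:
            start, end = map(int, part.split('-'))
            page_numbers.extend(range(start, end + 1))
        else:
            page_numbers.append(int(part))
    return sorted(set(page_numbers))  # deduplicate and sort

print(parse_page_numbers("1-5,7,9-12"))
# → [1, 2, 3, 4, 5, 7, 9, 10, 11, 12]
```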
207
+ ### Tips for Effective Multi-Page Scraping
208
+
209
+ - Start with a small range of pages to test before scraping a large number.
210
+ - Be mindful of the website's load and your scraping speed to avoid overloading servers.
211
+ - Use the `simulate_human` option for more natural scraping behavior on sites with anti-bot measures.
212
+ - Regularly check the website's `robots.txt` file and terms of service to ensure compliance.
213
+
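Checking `robots.txt` can be done with the standard library's `urllib.robotparser`; a minimal sketch (the rules and user agent below are illustrative, not CyberScraper's own):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    # Parse an already-fetched robots.txt body and test the URL against it.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/page/1"))     # → True
print(is_allowed(rules, "https://example.com/private/x"))  # → False
```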
214
+ ### Example
215
+
216
+ ```bash
217
+ URL example: "https://news.ycombinator.com/?p=1 1-3" (or "https://news.ycombinator.com/?p=1 1,2,3,4")
218
+ ```
219
+
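Internally, the chat input is split into URL, page range, and optional pattern; this sketch mirrors the `process_query` change in this commit:

```python
# Split "URL pages [pattern]" chat input into its three parts.
user_input = "https://news.ycombinator.com/?p=1 1-3"
parts = user_input.split(maxsplit=2)
url = parts[0]
pages = parts[1] if len(parts) > 1 else None
url_pattern = parts[2] if len(parts) > 2 else None
print(url, pages, url_pattern)
# → https://news.ycombinator.com/?p=1 1-3 None
```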
220
+ ### Handling Errors
221
+
222
+ If you encounter errors during multi-page scraping:
223
+ - Check your internet connection
224
+ - Verify the URL pattern is correct
225
+ - Ensure the website allows scraping
226
+ - Try reducing the number of pages or increasing the delay between requests
227
+
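One way to handle transient failures between attempts is a small retry wrapper with an increasing delay; `fetch_with_retries` and `flaky` below are illustrative names, not part of CyberScraper's API:

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    # Retry a per-page fetch, doubling the delay after each failure.
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = []
def flaky(url):
    # Fails twice, then succeeds -- simulates transient network errors.
    calls.append(url)
    if len(calls) < 3:
        raise RuntimeError("transient error")
    return "ok"

print(fetch_with_retries(flaky, "https://example.com", base_delay=0))
# → ok
```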
228
+ ### Beta Feedback
229
+
230
+ As this feature is in beta, we highly value your feedback. If you encounter any issues or have suggestions for improvement, please:
231
+ 1. Open an issue on our GitHub repository
232
+ 2. Provide detailed information about the problem, including the URL structure and number of pages you were attempting to scrape
233
+ 3. Share any error messages or unexpected behaviors you observed
234
+
235
+ Your input is crucial in helping us refine and stabilize this feature for future releases.
236
+
237
  ## Setup Google Sheets Authentication:
238
 
239
  1. Go to the Google Cloud Console (https://console.cloud.google.com/).
src/scrapers/playwright_scraper.py CHANGED
@@ -1,10 +1,12 @@
1
  from playwright.async_api import async_playwright, Browser, BrowserContext, Page
2
  from playwright_stealth import stealth_async
3
  from .base_scraper import BaseScraper
4
- from typing import Dict, Any, Optional
5
  import asyncio
6
  import random
7
  import logging
8
 
9
  class ScraperConfig:
10
  def __init__(self,
@@ -12,20 +14,28 @@ class ScraperConfig:
12
  simulate_human: bool = False,
13
  use_custom_headers: bool = True,
14
  hide_webdriver: bool = True,
15
- bypass_cloudflare: bool = True):
16
  self.use_stealth = use_stealth
17
  self.simulate_human = simulate_human
18
  self.use_custom_headers = use_custom_headers
19
  self.hide_webdriver = hide_webdriver
20
  self.bypass_cloudflare = bypass_cloudflare
21
 
22
  class PlaywrightScraper(BaseScraper):
23
  def __init__(self, config: ScraperConfig = ScraperConfig()):
24
  self.logger = logging.getLogger(__name__)
25
- self.logger.setLevel(logging.DEBUG)
26
  self.config = config
27
 
28
- async def fetch_content(self, url: str, proxy: Optional[str] = None) -> str:
29
  async with async_playwright() as p:
30
  browser = await self.launch_browser(p, proxy)
31
  context = await self.create_context(browser, proxy)
@@ -36,24 +46,49 @@ class PlaywrightScraper(BaseScraper):
36
  await self.set_browser_features(page)
37
 
38
  try:
39
- content = await self.navigate_and_get_content(page, url)
40
- if self.config.bypass_cloudflare and "Cloudflare" in content and "ray ID" in content.lower():
41
- self.logger.info("Cloudflare detected, attempting to bypass...")
42
- content = await self.bypass_cloudflare(page, url)
43
  except Exception as e:
44
  self.logger.error(f"Error during scraping: {str(e)}")
45
- content = f"Error: {str(e)}"
46
  finally:
47
  await browser.close()
48
 
49
- return content
50
 
51
  async def launch_browser(self, playwright, proxy: Optional[str] = None) -> Browser:
52
  return await playwright.chromium.launch(
53
- headless=True, # Set to False for GUI
54
  args=['--no-sandbox', '--disable-setuid-sandbox', '--disable-infobars',
55
  '--window-position=0,0', '--ignore-certifcate-errors',
56
- '--ignore-certifcate-errors-spki-list', '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'],
57
  proxy={'server': proxy} if proxy else None
58
  )
59
 
@@ -86,17 +121,17 @@ class PlaywrightScraper(BaseScraper):
86
  ''')
87
 
88
  async def navigate_and_get_content(self, page: Page, url: str) -> str:
89
- await page.goto(url, wait_until='domcontentloaded', timeout=30000)
90
  if self.config.simulate_human:
91
  await self.simulate_human_behavior(page)
92
  else:
93
- await asyncio.sleep(1)
94
  return await page.content()
95
 
96
  async def bypass_cloudflare(self, page: Page, url: str) -> str:
97
  max_retries = 3
98
  for _ in range(max_retries):
99
- await page.reload(wait_until='domcontentloaded', timeout=30000)
100
  if self.config.simulate_human:
101
  await self.simulate_human_behavior(page)
102
  else:
@@ -131,5 +166,48 @@ class PlaywrightScraper(BaseScraper):
131
  await random_element.hover()
132
  await asyncio.sleep(random.uniform(0.3, 0.7))
133
134
  async def extract(self, content: str) -> Dict[str, Any]:
135
  return {"raw_content": content}
 
1
  from playwright.async_api import async_playwright, Browser, BrowserContext, Page
2
  from playwright_stealth import stealth_async
3
  from .base_scraper import BaseScraper
4
+ from typing import Dict, Any, Optional, List, Tuple
5
  import asyncio
6
  import random
7
  import logging
8
+ import re
9
+ from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
10
 
11
  class ScraperConfig:
12
  def __init__(self,
 
14
  simulate_human: bool = False,
15
  use_custom_headers: bool = True,
16
  hide_webdriver: bool = True,
17
+ bypass_cloudflare: bool = True,
18
+ headless: bool = True,
19
+ debug: bool = False,
20
+ timeout: int = 60000,
21
+ wait_for: str = 'networkidle'):
22
  self.use_stealth = use_stealth
23
  self.simulate_human = simulate_human
24
  self.use_custom_headers = use_custom_headers
25
  self.hide_webdriver = hide_webdriver
26
  self.bypass_cloudflare = bypass_cloudflare
27
+ self.headless = headless
28
+ self.debug = debug
29
+ self.timeout = timeout
30
+ self.wait_for = wait_for
31
 
32
  class PlaywrightScraper(BaseScraper):
33
  def __init__(self, config: ScraperConfig = ScraperConfig()):
34
  self.logger = logging.getLogger(__name__)
35
+ self.logger.setLevel(logging.DEBUG if config.debug else logging.INFO)
36
  self.config = config
37
 
38
+ async def fetch_content(self, url: str, proxy: Optional[str] = None, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> List[str]:
39
  async with async_playwright() as p:
40
  browser = await self.launch_browser(p, proxy)
41
  context = await self.create_context(browser, proxy)
 
46
  await self.set_browser_features(page)
47
 
48
  try:
49
+ contents = await self.scrape_multiple_pages(page, url, pages, url_pattern)
50
  except Exception as e:
51
  self.logger.error(f"Error during scraping: {str(e)}")
52
+ contents = [f"Error: {str(e)}"]
53
  finally:
54
+ if self.config.debug:
55
+ self.logger.info("Scraping completed. Keeping browser open for debugging.")
56
+ await asyncio.sleep(30) # Keep the browser open for 30 seconds
57
  await browser.close()
58
 
59
+ return contents
60
+
61
+ async def scrape_multiple_pages(self, page: Page, base_url: str, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> List[str]:
62
+ if not url_pattern:
63
+ url_pattern = self.detect_url_pattern(base_url)
64
+ if not url_pattern:
65
+ return ["Error: Unable to detect URL pattern. Please provide a pattern."]
66
+
67
+ page_numbers = self.parse_page_numbers(pages)
68
+ contents = []
69
+
70
+ for page_num in page_numbers:
71
+ current_url = self.apply_url_pattern(base_url, url_pattern, page_num)
72
+ self.logger.info(f"Scraping page {page_num}: {current_url}")
73
+
74
+ content = await self.navigate_and_get_content(page, current_url)
75
+ if self.config.bypass_cloudflare and "Cloudflare" in content and "ray ID" in content.lower():
76
+ self.logger.info("Cloudflare detected, attempting to bypass...")
77
+ content = await self.bypass_cloudflare(page, current_url)
78
+
79
+ contents.append(content)
80
+
81
+ # Add a delay between page navigations
82
+ await asyncio.sleep(random.uniform(1, 3))
83
+
84
+ return contents
85
 
86
  async def launch_browser(self, playwright, proxy: Optional[str] = None) -> Browser:
87
  return await playwright.chromium.launch(
88
+ headless=self.config.headless,
89
  args=['--no-sandbox', '--disable-setuid-sandbox', '--disable-infobars',
90
 '--window-position=0,0', '--ignore-certificate-errors',
91
+ '--ignore-certificate-errors-spki-list'],
92
  proxy={'server': proxy} if proxy else None
93
  )
94
 
 
121
  ''')
122
 
123
  async def navigate_and_get_content(self, page: Page, url: str) -> str:
124
+ await page.goto(url, wait_until=self.config.wait_for, timeout=self.config.timeout)
125
  if self.config.simulate_human:
126
  await self.simulate_human_behavior(page)
127
  else:
128
+ await asyncio.sleep(2) # Wait for 2 seconds to ensure content is loaded
129
  return await page.content()
130
 
131
  async def bypass_cloudflare(self, page: Page, url: str) -> str:
132
  max_retries = 3
133
  for _ in range(max_retries):
134
+ await page.reload(wait_until=self.config.wait_for, timeout=self.config.timeout)
135
  if self.config.simulate_human:
136
  await self.simulate_human_behavior(page)
137
  else:
 
166
  await random_element.hover()
167
  await asyncio.sleep(random.uniform(0.3, 0.7))
168
 
169
+ def detect_url_pattern(self, url: str) -> Optional[str]:
170
+ parsed_url = urlparse(url)
171
+ query = parse_qs(parsed_url.query)
172
+
173
+ # Check query parameters for any numeric values
174
+ for param, value in query.items():
175
+ if value and value[0].isdigit():
176
+ return f"{param}={{{param}}}"
177
+
178
+ # Check path for numeric segments
179
+ path_parts = parsed_url.path.split('/')
180
+ for i, part in enumerate(path_parts):
181
+ if part.isdigit():
182
+ path_parts[i] = "{page}"
183
+ return '/'.join(path_parts)
184
+
185
+ # If no pattern is detected, return None
186
+ return None
187
+
188
+ def apply_url_pattern(self, base_url: str, pattern: str, page_num: int) -> str:
189
+ parsed_url = urlparse(base_url)
190
+ if '=' in pattern: # Query parameter pattern
191
+ query = parse_qs(parsed_url.query)
192
+ param, value = pattern.split('=')
193
+ query[param] = [value.format(**{param: page_num})]
194
+ return urlunparse(parsed_url._replace(query=urlencode(query, doseq=True)))
195
+ else: # Path pattern
196
+ return urlunparse(parsed_url._replace(path=pattern.format(page=page_num)))
197
+
198
+ def parse_page_numbers(self, pages: Optional[str]) -> List[int]:
199
+ if not pages:
200
+ return [1] # Default to first page if not specified
201
+
202
+ page_numbers = []
203
+ for part in pages.split(','):
204
+ if '-' in part:
205
+ start, end = map(int, part.split('-'))
206
+ page_numbers.extend(range(start, end + 1))
207
+ else:
208
+ page_numbers.append(int(part))
209
+
210
+ return sorted(set(page_numbers)) # Remove duplicates and sort
211
+
212
  async def extract(self, content: str) -> Dict[str, Any]:
213
  return {"raw_content": content}
src/web_extractor.py CHANGED
@@ -1,5 +1,5 @@
1
  import asyncio
2
- from typing import Dict, Any, Optional, List, Tuple
3
  import json
4
  import pandas as pd
5
  from io import StringIO, BytesIO
@@ -20,9 +20,10 @@ from langchain.text_splitter import RecursiveCharacterTextSplitter
20
  import tiktoken
21
  import csv
22
  from bs4 import BeautifulSoup, Comment
 
23
 
24
  class WebExtractor:
25
- def __init__(self, model_name: str = "gpt-4o-mini", model_kwargs: Dict[str, Any] = None, proxy: Optional[str] = None):
26
  model_kwargs = model_kwargs or {}
27
  if isinstance(model_name, str) and model_name.startswith("ollama:"):
28
  self.model = OllamaModelManager.get_model(model_name[7:])
@@ -31,7 +32,8 @@ class WebExtractor:
31
  else:
32
  self.model = Models.get_model(model_name, **model_kwargs)
33
 
34
- self.playwright_scraper = PlaywrightScraper()
 
35
  self.html_scraper = HTMLScraper()
36
  self.json_scraper = JSONScraper()
37
  self.proxy_manager = ProxyManager(proxy)
@@ -77,7 +79,7 @@ class WebExtractor:
77
  Based on the following preprocessed webpage content and the user's request, extract the relevant information.
78
  Always present the data as a JSON array of objects, regardless of the user's requested format.
79
  Each object in the array should represent one item or row of data.
80
- Use the following format without any unnecessary text, provide only the format and nothing else:
81
 
82
  [
83
  {{
@@ -107,7 +109,11 @@ class WebExtractor:
107
 
108
  async def process_query(self, user_input: str) -> str:
109
  if user_input.lower().startswith("http"):
110
- response = await self._fetch_url(user_input)
111
  elif not self.current_content:
112
  response = "Please provide a URL first before asking for information."
113
  else:
@@ -117,10 +123,12 @@ class WebExtractor:
117
  self.conversation_history.append(f"AI: {response}")
118
  return response
119
 
120
- async def _fetch_url(self, url: str) -> str:
121
  self.current_url = url
122
  proxy = await self.proxy_manager.get_proxy()
123
- self.current_content = await self.playwright_scraper.fetch_content(self.current_url, proxy)
124
  self.preprocessed_content = self._preprocess_content(self.current_content)
125
 
126
  new_hash = self._hash_content(self.preprocessed_content)
@@ -128,7 +136,9 @@ class WebExtractor:
128
  self.content_hash = new_hash
129
  self.query_cache.clear()
130
 
131
- return f"I've fetched and preprocessed the content from {self.current_url}. What would you like to know about it?"
132
 
133
  def _preprocess_content(self, content: str) -> str:
134
  soup = BeautifulSoup(content, 'html.parser')
@@ -185,19 +195,29 @@ class WebExtractor:
185
  self.query_cache[cache_key] = formatted_result
186
  return formatted_result
187
 
188
- def _format_result(self, extracted_data: str, query: str) -> str:
189
- if 'json' in query.lower():
190
- return self._format_as_json(extracted_data)
191
- elif 'csv' in query.lower():
192
- csv_string, df = self._format_as_csv(extracted_data)
193
- return f"```csv\n{csv_string}\n```", df
194
- elif 'excel' in query.lower():
195
- return self._format_as_excel(extracted_data)
196
- elif 'sql' in query.lower():
197
- return self._format_as_sql(extracted_data)
198
- elif 'html' in query.lower():
199
- return self._format_as_html(extracted_data)
200
- else:
201
  return self._format_as_text(extracted_data)
202
 
203
  def optimized_text_splitter(self, text: str) -> List[str]:
 
1
  import asyncio
2
+ from typing import Dict, Any, Optional, List, Tuple, Union
3
  import json
4
  import pandas as pd
5
  from io import StringIO, BytesIO
 
20
  import tiktoken
21
  import csv
22
  from bs4 import BeautifulSoup, Comment
23
+ from .scrapers.playwright_scraper import PlaywrightScraper, ScraperConfig
24
 
25
  class WebExtractor:
26
+ def __init__(self, model_name: str = "gpt-4o-mini", model_kwargs: Dict[str, Any] = None, proxy: Optional[str] = None, headless: bool = True, debug: bool = False):
27
  model_kwargs = model_kwargs or {}
28
  if isinstance(model_name, str) and model_name.startswith("ollama:"):
29
  self.model = OllamaModelManager.get_model(model_name[7:])
 
32
  else:
33
  self.model = Models.get_model(model_name, **model_kwargs)
34
 
35
+ scraper_config = ScraperConfig(headless=headless, debug=debug)
36
+ self.playwright_scraper = PlaywrightScraper(config=scraper_config)
37
  self.html_scraper = HTMLScraper()
38
  self.json_scraper = JSONScraper()
39
  self.proxy_manager = ProxyManager(proxy)
 
79
  Based on the following preprocessed webpage content and the user's request, extract the relevant information.
80
  Always present the data as a JSON array of objects, regardless of the user's requested format.
81
  Each object in the array should represent one item or row of data.
82
+ Use the following format without any commentary text, provide only the format and nothing else:
83
 
84
  [
85
  {{
 
109
 
110
  async def process_query(self, user_input: str) -> str:
111
  if user_input.lower().startswith("http"):
112
+ parts = user_input.split(maxsplit=2)
113
+ url = parts[0]
114
+ pages = parts[1] if len(parts) > 1 else None
115
+ url_pattern = parts[2] if len(parts) > 2 else None
116
+ response = await self._fetch_url(url, pages, url_pattern)
117
  elif not self.current_content:
118
  response = "Please provide a URL first before asking for information."
119
  else:
 
123
  self.conversation_history.append(f"AI: {response}")
124
  return response
125
 
126
+ async def _fetch_url(self, url: str, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> str:
127
  self.current_url = url
128
  proxy = await self.proxy_manager.get_proxy()
129
+
130
+ contents = await self.playwright_scraper.fetch_content(url, proxy, pages, url_pattern)
131
+ self.current_content = "\n".join(contents)
132
  self.preprocessed_content = self._preprocess_content(self.current_content)
133
 
134
  new_hash = self._hash_content(self.preprocessed_content)
 
136
  self.content_hash = new_hash
137
  self.query_cache.clear()
138
 
139
+ return f"I've fetched and preprocessed the content from {self.current_url}" + \
140
+ (f" (pages: {pages})" if pages else "") + \
141
+ ". What would you like to know about it?"
142
 
143
  def _preprocess_content(self, content: str) -> str:
144
  soup = BeautifulSoup(content, 'html.parser')
 
195
  self.query_cache[cache_key] = formatted_result
196
  return formatted_result
197
 
198
+ def _format_result(self, extracted_data: str, query: str) -> Union[str, Tuple[str, pd.DataFrame], BytesIO]:
199
+ try:
200
+ json_data = json.loads(extracted_data)
201
+
202
+ if 'json' in query.lower():
203
+ return self._format_as_json(json.dumps(json_data))
204
+ elif 'csv' in query.lower():
205
+ csv_string, df = self._format_as_csv(json.dumps(json_data))
206
+ return f"```csv\n{csv_string}\n```", df
207
+ elif 'excel' in query.lower():
208
+ return self._format_as_excel(json.dumps(json_data))
209
+ elif 'sql' in query.lower():
210
+ return self._format_as_sql(json.dumps(json_data))
211
+ elif 'html' in query.lower():
212
+ return self._format_as_html(json.dumps(json_data))
213
+ else:
214
+ if isinstance(json_data, list) and all(isinstance(item, dict) for item in json_data):
215
+ csv_string, df = self._format_as_csv(json.dumps(json_data))
216
+ return f"```csv\n{csv_string}\n```", df
217
+ else:
218
+ return self._format_as_json(json.dumps(json_data))
219
+
220
+ except json.JSONDecodeError:
221
  return self._format_as_text(extracted_data)
222
 
223
  def optimized_text_splitter(self, text: str) -> List[str]: