itsOwen commited on
Commit
dfc44b9
Β·
1 Parent(s): 74036ee

stealth mode fix + captcha bypass added

Browse files
README.md CHANGED
@@ -37,6 +37,7 @@ Whether you're a corpo data analyst, a street-smart netrunner, or just someone l
37
  - πŸ›‘οΈ **Ethical Scraping**: Respects robots.txt and site policies. We may be in 2077, but we still have standards.
38
  - πŸ“„ **Caching**: We implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.
39
  - βœ… **Upload to Google Sheets**: Now you can easily upload your extracted CSV data to Google Sheets with one click.
 
40
  - 🌐 **Proxy Mode (Coming Soon)**: Built-in proxy support to keep you ghosting through the net.
41
  - πŸ›‘οΈ **Navigate through the Pages (BETA)**: Navigate through the webpage and scrap the data from different pages.
42
 
@@ -78,12 +79,9 @@ Please follow the Docker Container Guide given below, As I won't be able to main
78
 
79
  Linux/Mac:
80
  ```bash
81
- export OPENAI_API_KEY='your-api-key-here'
82
- ```
83
- For Windows:
84
- ```bash
85
- set OPENAI_API_KEY=your-api-key-here
86
  ```
 
87
  6. If you want to use the Ollama:
88
 
89
  Note: I only recommend using the OpenAI API, as GPT4o-mini is really good at following instructions. If you are using open-source LLMs, make sure you have a good system, as the speed of the data generation/presentation depends on how well your system can run the LLM; you may also have to fine-tune the prompt and add some additional filters yourself.
@@ -118,7 +116,7 @@ If you prefer to use Docker, follow these steps to set up and run CyberScraper 2
118
  ```
119
  - With OpenAI API key:
120
  ```bash
121
- docker run -p 8501:8501 -e OPENAI_API_KEY='your-actual-api-key' cyberscraper-2077
122
  ```
123
 
124
  5. Open your browser and navigate to `http://localhost:8501`.
@@ -272,6 +270,8 @@ bypass_cloudflare: bool = True:
272
 
273
  Adjust these settings based on your target website and environment for optimal results.
274
 
 
 
275
  ## 🀝 Contributing
276
 
277
  We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077!
 
37
  - πŸ›‘οΈ **Ethical Scraping**: Respects robots.txt and site policies. We may be in 2077, but we still have standards.
38
  - πŸ“„ **Caching**: We implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.
39
  - βœ… **Upload to Google Sheets**: Now you can easily upload your extracted CSV data to Google Sheets with one click.
40
 + - πŸ€– **Bypass Captcha**: Bypass captchas by adding `-captcha` at the end of the URL.
41
  - 🌐 **Proxy Mode (Coming Soon)**: Built-in proxy support to keep you ghosting through the net.
42
  - πŸ›‘οΈ **Navigate through the Pages (BETA)**: Navigate through the webpage and scrap the data from different pages.
43
 
 
79
 
80
  Linux/Mac:
81
  ```bash
82
+ export OPENAI_API_KEY="your-api-key-here"
 
 
 
 
83
  ```
84
+
85
  6. If you want to use the Ollama:
86
 
87
  Note: I only recommend using the OpenAI API, as GPT4o-mini is really good at following instructions. If you are using open-source LLMs, make sure you have a good system, as the speed of the data generation/presentation depends on how well your system can run the LLM; you may also have to fine-tune the prompt and add some additional filters yourself.
 
116
  ```
117
  - With OpenAI API key:
118
  ```bash
119
+ docker run -p 8501:8501 -e OPENAI_API_KEY="your-actual-api-key" cyberscraper-2077
120
  ```
121
 
122
  5. Open your browser and navigate to `http://localhost:8501`.
 
270
 
271
  Adjust these settings based on your target website and environment for optimal results.
272
 
273
 + You can also bypass the captcha using the ```-captcha``` parameter at the end of the URL. A browser window will pop up; complete the captcha, go back to your terminal window, and press Enter — the bot will then complete its task.
274
+
275
  ## 🀝 Contributing
276
 
277
  We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077!
src/scrapers/playwright_scraper.py CHANGED
@@ -1,5 +1,4 @@
1
  from playwright.async_api import async_playwright, Browser, BrowserContext, Page
2
- from playwright_stealth import stealth_async
3
  from .base_scraper import BaseScraper
4
  from typing import Dict, Any, Optional, List, Tuple
5
  import asyncio
@@ -35,17 +34,19 @@ class PlaywrightScraper(BaseScraper):
35
  self.logger.setLevel(logging.DEBUG if config.debug else logging.INFO)
36
  self.config = config
37
 
38
- async def fetch_content(self, url: str, proxy: Optional[str] = None, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> List[str]:
39
  async with async_playwright() as p:
40
- browser = await self.launch_browser(p, proxy)
41
  context = await self.create_context(browser, proxy)
42
  page = await context.new_page()
43
 
44
  if self.config.use_stealth:
45
- await stealth_async(page)
46
  await self.set_browser_features(page)
47
 
48
  try:
 
 
49
  contents = await self.scrape_multiple_pages(page, url, pages, url_pattern)
50
  except Exception as e:
51
  self.logger.error(f"Error during scraping: {str(e)}")
@@ -58,6 +59,72 @@ class PlaywrightScraper(BaseScraper):
58
 
59
  return contents
60
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
61
  async def scrape_multiple_pages(self, page: Page, base_url: str, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> List[str]:
62
  contents = []
63
 
@@ -65,6 +132,7 @@ class PlaywrightScraper(BaseScraper):
65
  url_pattern = self.detect_url_pattern(base_url)
66
 
67
  if not url_pattern and not pages:
 
68
  self.logger.info(f"Scraping single page: {base_url}")
69
  content = await self.navigate_and_get_content(page, base_url)
70
  if self.config.bypass_cloudflare and "Cloudflare" in content and "ray ID" in content.lower():
@@ -72,6 +140,7 @@ class PlaywrightScraper(BaseScraper):
72
  content = await self.bypass_cloudflare(page, base_url)
73
  contents.append(content)
74
  else:
 
75
  page_numbers = self.parse_page_numbers(pages) if pages else [1]
76
  for page_num in page_numbers:
77
  current_url = self.apply_url_pattern(base_url, url_pattern, page_num) if url_pattern else base_url
@@ -88,43 +157,6 @@ class PlaywrightScraper(BaseScraper):
88
 
89
  return contents
90
 
91
- async def launch_browser(self, playwright, proxy: Optional[str] = None) -> Browser:
92
- return await playwright.chromium.launch(
93
- headless=self.config.headless,
94
- args=['--no-sandbox', '--disable-setuid-sandbox', '--disable-infobars',
95
- '--window-position=0,0', '--ignore-certifcate-errors',
96
- '--ignore-certifcate-errors-spki-list'],
97
- proxy={'server': proxy} if proxy else None
98
- )
99
-
100
- async def create_context(self, browser: Browser, proxy: Optional[str] = None) -> BrowserContext:
101
- return await browser.new_context(
102
- viewport={'width': 1920, 'height': 1080},
103
- user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
104
- proxy={'server': proxy} if proxy else None,
105
- java_script_enabled=True,
106
- ignore_https_errors=True
107
- )
108
-
109
- async def set_browser_features(self, page: Page):
110
- if self.config.use_custom_headers:
111
- await page.set_extra_http_headers({
112
- 'Accept-Language': 'en-US,en;q=0.9',
113
- 'Accept-Encoding': 'gzip, deflate, br',
114
- 'Referer': 'https://www.google.com/',
115
- 'Sec-Fetch-Dest': 'document',
116
- 'Sec-Fetch-Mode': 'navigate',
117
- 'Sec-Fetch-Site': 'none',
118
- 'Sec-Fetch-User': '?1',
119
- 'Upgrade-Insecure-Requests': '1'
120
- })
121
- if self.config.hide_webdriver:
122
- await page.evaluate('''
123
- Object.defineProperty(navigator, 'webdriver', {
124
- get: () => undefined
125
- })
126
- ''')
127
-
128
  async def navigate_and_get_content(self, page: Page, url: str) -> str:
129
  await page.goto(url, wait_until=self.config.wait_for, timeout=self.config.timeout)
130
  if self.config.simulate_human:
@@ -189,12 +221,12 @@ class PlaywrightScraper(BaseScraper):
189
 
190
  def apply_url_pattern(self, base_url: str, pattern: str, page_num: int) -> str:
191
  parsed_url = urlparse(base_url)
192
- if '=' in pattern:
193
  query = parse_qs(parsed_url.query)
194
  param, value = pattern.split('=')
195
  query[param] = [value.format(**{param: page_num})]
196
  return urlunparse(parsed_url._replace(query=urlencode(query, doseq=True)))
197
- elif '{page}' in pattern:
198
  return urlunparse(parsed_url._replace(path=pattern.format(page=page_num)))
199
  else:
200
  return base_url
 
1
  from playwright.async_api import async_playwright, Browser, BrowserContext, Page
 
2
  from .base_scraper import BaseScraper
3
  from typing import Dict, Any, Optional, List, Tuple
4
  import asyncio
 
34
  self.logger.setLevel(logging.DEBUG if config.debug else logging.INFO)
35
  self.config = config
36
 
37
+ async def fetch_content(self, url: str, proxy: Optional[str] = None, pages: Optional[str] = None, url_pattern: Optional[str] = None, handle_captcha: bool = False) -> List[str]:
38
  async with async_playwright() as p:
39
+ browser = await self.launch_browser(p, proxy, handle_captcha)
40
  context = await self.create_context(browser, proxy)
41
  page = await context.new_page()
42
 
43
  if self.config.use_stealth:
44
+ await self.apply_stealth_settings(page)
45
  await self.set_browser_features(page)
46
 
47
  try:
48
+ if handle_captcha:
49
+ await self.handle_captcha(page, url)
50
  contents = await self.scrape_multiple_pages(page, url, pages, url_pattern)
51
  except Exception as e:
52
  self.logger.error(f"Error during scraping: {str(e)}")
 
59
 
60
  return contents
61
 
62
+ async def launch_browser(self, playwright, proxy: Optional[str] = None, handle_captcha: bool = False) -> Browser:
63
+ return await playwright.chromium.launch(
64
+ headless=self.config.headless and not handle_captcha,
65
+ args=['--no-sandbox', '--disable-setuid-sandbox', '--disable-infobars',
66
+ '--window-position=0,0', '--ignore-certifcate-errors',
67
+ '--ignore-certifcate-errors-spki-list'],
68
+ proxy={'server': proxy} if proxy else None
69
+ )
70
+
71
+ async def create_context(self, browser: Browser, proxy: Optional[str] = None) -> BrowserContext:
72
+ return await browser.new_context(
73
+ viewport={'width': 1920, 'height': 1080},
74
+ user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
75
+ proxy={'server': proxy} if proxy else None,
76
+ java_script_enabled=True,
77
+ ignore_https_errors=True
78
+ )
79
+
80
+ async def apply_stealth_settings(self, page: Page):
81
+ await page.evaluate('''
82
+ () => {
83
+ Object.defineProperty(navigator, 'webdriver', {
84
+ get: () => undefined
85
+ });
86
+
87
+ Object.defineProperty(navigator, 'languages', {
88
+ get: () => ['en-US', 'en']
89
+ });
90
+
91
+ Object.defineProperty(navigator, 'plugins', {
92
+ get: () => [1, 2, 3, 4, 5]
93
+ });
94
+
95
+ const originalQuery = window.navigator.permissions.query;
96
+ window.navigator.permissions.query = (parameters) => (
97
+ parameters.name === 'notifications' ?
98
+ Promise.resolve({ state: Notification.permission }) :
99
+ originalQuery(parameters)
100
+ );
101
+ }
102
+ ''')
103
+
104
+ async def set_browser_features(self, page: Page):
105
+ if self.config.use_custom_headers:
106
+ await page.set_extra_http_headers({
107
+ 'Accept-Language': 'en-US,en;q=0.9',
108
+ 'Accept-Encoding': 'gzip, deflate, br',
109
+ 'Referer': 'https://www.google.com/',
110
+ 'Sec-Fetch-Dest': 'document',
111
+ 'Sec-Fetch-Mode': 'navigate',
112
+ 'Sec-Fetch-Site': 'none',
113
+ 'Sec-Fetch-User': '?1',
114
+ 'Upgrade-Insecure-Requests': '1'
115
+ })
116
+
117
+ async def handle_captcha(self, page: Page, url: str):
118
+ self.logger.info("Waiting for user to solve CAPTCHA...")
119
+ await page.goto(url, wait_until=self.config.wait_for, timeout=self.config.timeout)
120
+
121
+ print("Please solve the CAPTCHA in the browser window.")
122
+ print("Once solved, press Enter in this console to continue...")
123
+ input()
124
+
125
+ await page.wait_for_load_state('networkidle')
126
+ self.logger.info("CAPTCHA handling completed.")
127
+
128
  async def scrape_multiple_pages(self, page: Page, base_url: str, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> List[str]:
129
  contents = []
130
 
 
132
  url_pattern = self.detect_url_pattern(base_url)
133
 
134
  if not url_pattern and not pages:
135
+ # Single page scraping
136
  self.logger.info(f"Scraping single page: {base_url}")
137
  content = await self.navigate_and_get_content(page, base_url)
138
  if self.config.bypass_cloudflare and "Cloudflare" in content and "ray ID" in content.lower():
 
140
  content = await self.bypass_cloudflare(page, base_url)
141
  contents.append(content)
142
  else:
143
+ # Multiple page scraping
144
  page_numbers = self.parse_page_numbers(pages) if pages else [1]
145
  for page_num in page_numbers:
146
  current_url = self.apply_url_pattern(base_url, url_pattern, page_num) if url_pattern else base_url
 
157
 
158
  return contents
159
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
160
  async def navigate_and_get_content(self, page: Page, url: str) -> str:
161
  await page.goto(url, wait_until=self.config.wait_for, timeout=self.config.timeout)
162
  if self.config.simulate_human:
 
221
 
222
  def apply_url_pattern(self, base_url: str, pattern: str, page_num: int) -> str:
223
  parsed_url = urlparse(base_url)
224
+ if '=' in pattern:
225
  query = parse_qs(parsed_url.query)
226
  param, value = pattern.split('=')
227
  query[param] = [value.format(**{param: page_num})]
228
  return urlunparse(parsed_url._replace(query=urlencode(query, doseq=True)))
229
+ elif '{page}' in pattern:
230
  return urlunparse(parsed_url._replace(path=pattern.format(page=page_num)))
231
  else:
232
  return base_url
src/web_extractor.py CHANGED
@@ -109,36 +109,38 @@ class WebExtractor:
109
 
110
  async def process_query(self, user_input: str) -> str:
111
  if user_input.lower().startswith("http"):
112
- parts = user_input.split(maxsplit=2)
113
  url = parts[0]
114
- pages = parts[1] if len(parts) > 1 else None
115
- url_pattern = parts[2] if len(parts) > 2 else None
116
- response = await self._fetch_url(url, pages, url_pattern)
 
 
117
  elif not self.current_content:
118
  response = "Please provide a URL first before asking for information."
119
  else:
120
  response = await self._extract_info(user_input)
121
-
122
  self.conversation_history.append(f"Human: {user_input}")
123
  self.conversation_history.append(f"AI: {response}")
124
  return response
125
 
126
- async def _fetch_url(self, url: str, pages: Optional[str] = None, url_pattern: Optional[str] = None) -> str:
127
  self.current_url = url
128
  proxy = await self.proxy_manager.get_proxy()
129
-
130
- contents = await self.playwright_scraper.fetch_content(url, proxy, pages, url_pattern)
131
  self.current_content = "\n".join(contents)
132
  self.preprocessed_content = self._preprocess_content(self.current_content)
133
-
134
  new_hash = self._hash_content(self.preprocessed_content)
135
  if self.content_hash != new_hash:
136
  self.content_hash = new_hash
137
  self.query_cache.clear()
138
-
139
  return f"I've fetched and preprocessed the content from {self.current_url}" + \
140
- (f" (pages: {pages})" if pages else "") + \
141
- ". What would you like to know about it?"
142
 
143
  def _preprocess_content(self, content: str) -> str:
144
  soup = BeautifulSoup(content, 'html.parser')
 
109
 
110
  async def process_query(self, user_input: str) -> str:
111
  if user_input.lower().startswith("http"):
112
+ parts = user_input.split(maxsplit=3)
113
  url = parts[0]
114
+ pages = parts[1] if len(parts) > 1 and not parts[1].startswith('-') else None
115
+ url_pattern = parts[2] if len(parts) > 2 and not parts[2].startswith('-') else None
116
+ handle_captcha = '-captcha' in user_input.lower()
117
+
118
+ response = await self._fetch_url(url, pages, url_pattern, handle_captcha)
119
  elif not self.current_content:
120
  response = "Please provide a URL first before asking for information."
121
  else:
122
  response = await self._extract_info(user_input)
123
+
124
  self.conversation_history.append(f"Human: {user_input}")
125
  self.conversation_history.append(f"AI: {response}")
126
  return response
127
 
128
+ async def _fetch_url(self, url: str, pages: Optional[str] = None, url_pattern: Optional[str] = None, handle_captcha: bool = False) -> str:
129
  self.current_url = url
130
  proxy = await self.proxy_manager.get_proxy()
131
+
132
+ contents = await self.playwright_scraper.fetch_content(url, proxy, pages, url_pattern, handle_captcha)
133
  self.current_content = "\n".join(contents)
134
  self.preprocessed_content = self._preprocess_content(self.current_content)
135
+
136
  new_hash = self._hash_content(self.preprocessed_content)
137
  if self.content_hash != new_hash:
138
  self.content_hash = new_hash
139
  self.query_cache.clear()
140
+
141
  return f"I've fetched and preprocessed the content from {self.current_url}" + \
142
+ (f" (pages: {pages})" if pages else "") + \
143
+ ". What would you like to know about it?"
144
 
145
  def _preprocess_content(self, content: str) -> str:
146
  soup = BeautifulSoup(content, 'html.parser')