# CyberScraper-2077 Project Scan & Integration Analysis

## 1. Project Overview

CyberScraper-2077 is an advanced, AI-powered web scraping tool with a futuristic cyberpunk theme. It leverages Large Language Models (LLMs) like OpenAI, Gemini, and Ollama to extract structured data from websites. It features a Streamlit-based UI, supports Tor for .onion sites, and currently uses a custom `PlaywrightScraper` (based on `patchright`) for stealthy scraping.

## 2. Architecture Analysis

### Core Components

* **Frontend (`main.py`, `app/`):** Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
* **Orchestrator (`src/web_extractor.py`):** The central logic (`WebExtractor` class). It coordinates:
    * **Fetching:** Calls `PlaywrightScraper` or `TorScraper` to get raw HTML.
    * **Preprocessing:** Cleans HTML using BeautifulSoup.
    * **Extraction:** Uses LangChain and LLMs to parse content and answer user queries.
    * **Agentic Loop:** Has an experimental agentic loop for iterative investigation using `browser_tools`.
* **Scraping Engine (`src/scrapers/`):**
    * **`PlaywrightScraper`:** A custom implementation using `patchright` (undetected Playwright). Features:
        * **Stealth:** Uses persistent contexts and `patchright`'s built-in stealth.
        * **Concurrency:** `scrape_multiple_pages` handles basic pagination.
        * **CAPTCHA:** Manual `handle_captcha` method.
    * **`TorScraper`:** Uses `requests` with SOCKS proxy for .onion sites.
* **Utilities:**
    * `src/models.py` / `src/ollama_models.py`: Model management.
    * `src/utils/`: Error handling, Google Sheets integration.

### Data Flow

1. User inputs URL/Query in Streamlit.
2. `StreamlitWebScraperChat` initializes `WebExtractor`.
3. `WebExtractor` determines if it's a new URL or chat.
4. If new URL:
    * `PlaywrightScraper.fetch_content` is called.
    * Content is preprocessed and cached.
    * LLM is called with context to answer query or extract data.

## 3. Scrapling Integration Analysis

The `Scrapling` folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:

* **Adaptive Parsing:** Robustness against layout changes.
* **Advanced Fetching:** `StealthyFetcher` and `DynamicFetcher` (wrapping Playwright) for better evasion.
* **Spider Framework:** A Scrapy-like `Spider` class for efficient, concurrent, and persistent crawling of multiple pages.

### Key Integration Points

* **Replacement/Augmentation of `PlaywrightScraper`:**
    * `Scrapling`'s `StealthyFetcher` can replace the manual `patchright` setup in `PlaywrightScraper`.
    * `Scrapling`'s `Spider` can replace the manual `scrape_multiple_pages` loop, offering better concurrency and state management.
* **Dependencies:**
    * `Scrapling` requires `playwright` (implied by imports in `Scrapling/scrapling/engines/_browsers/_controllers.py`).
    * Current repo uses `patchright`. Integration must ensure `Scrapling` works with the installed environment. `Scrapling`'s `pyproject.toml` lists both `playwright` and `patchright` in optional dependencies, but the code imports `playwright`. I will need to ensure `playwright` is installed or alias it if `patchright` is to be used exclusively (though `patchright` module name is different).
* **Async Compatibility:**
    * `Scrapling` supports async fetchers and spiders, compatible with `WebExtractor`'s async nature.

## 4. Proposed Integration Plan

1. **Dependency Management:**
    * Install `Scrapling` in editable mode (`pip install -e Scrapling[fetchers]`).
    * Ensure compatible versions of `playwright`/`patchright`.
2. **Adapter Implementation (`src/scrapers/scrapling_adapter.py`):**
    * Create a `ScraplingAdapter` class implementing the interface expected by `WebExtractor` (matching `PlaywrightScraper`'s `fetch_content` signature).
    * Implement `fetch_content` to use:
        * `StealthyFetcher` (or `DynamicFetcher`) for single URLs.
        * A dynamic `Spider` subclass for multi-page/crawl tasks.
3. **WebExtractor Update:**
    * Modify `src/web_extractor.py` to instantiate `ScraplingAdapter` instead of `PlaywrightScraper`.
    * Map existing config (headless, proxy) to `Scrapling` configuration.
4. **Verification:**
    * Test single page extraction.
    * Test multi-page crawling.

## 5. Potential Issues & Risks

* **Dependency Conflict:** `Scrapling` might pull a different version of `playwright` or `patchright`.
* **Browser Management:** `Scrapling` manages its own browser instances. I need to ensure it plays nicely with Streamlit's lifecycle (though `WebExtractor` creates a new scraper instance).
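
The adapter and config mapping from steps 2 and 3 of the integration plan could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the final implementation: `ScraperConfig`, the `fetch_content(url, proxy=None)` signature, the `headless`/`proxy` keyword names, and the `html_content` attribute on the response are all hypothetical and should be checked against `PlaywrightScraper` and the vendored `Scrapling` source. The fetcher is injected as a plain async callable so the sketch runs without `Scrapling` installed; the stand-in `fake_fetch` takes the place of a real stealth fetcher.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List, Optional


@dataclass
class ScraperConfig:
    """Hypothetical holder for the existing scraper settings."""
    headless: bool = True
    proxy: Optional[str] = None

    def to_fetch_kwargs(self) -> Dict[str, Any]:
        # ASSUMPTION: the fetcher accepts `headless` and `proxy` keywords.
        kwargs: Dict[str, Any] = {"headless": self.headless}
        if self.proxy:
            kwargs["proxy"] = self.proxy
        return kwargs


# Any async callable taking (url, **kwargs) and returning a response-like object.
FetchFn = Callable[..., Awaitable[Any]]


class ScraplingAdapter:
    """Exposes a `fetch_content` method shaped like the one WebExtractor calls."""

    def __init__(self, fetch_fn: FetchFn, config: Optional[ScraperConfig] = None):
        self._fetch = fetch_fn
        self.config = config or ScraperConfig()

    async def fetch_content(self, url: str, proxy: Optional[str] = None) -> List[str]:
        kwargs = self.config.to_fetch_kwargs()
        if proxy:
            kwargs["proxy"] = proxy  # a per-call proxy overrides the configured one
        response = await self._fetch(url, **kwargs)
        # ASSUMPTION: the response exposes its HTML as `html_content`;
        # fall back to str(response) if that attribute is missing.
        html = getattr(response, "html_content", None)
        if html is None:
            html = str(response)
        return [html]  # one entry per fetched page


@dataclass
class FakeResponse:
    html_content: str


async def fake_fetch(url: str, **kwargs: Any) -> FakeResponse:
    # Stand-in for a real fetcher so the example runs offline.
    return FakeResponse(f"<html><body>{url} headless={kwargs['headless']}</body></html>")


def demo() -> List[str]:
    adapter = ScraplingAdapter(fake_fetch, ScraperConfig(headless=True))
    return asyncio.run(adapter.fetch_content("https://example.com"))


if __name__ == "__main__":
    print(demo())
```

Injecting the fetch callable keeps `WebExtractor` unaware of which backend is in use, so the adapter can be unit-tested without launching a browser, and switching between `StealthyFetcher` and `DynamicFetcher` would only change what is passed to the constructor.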