v0.4.2 (#183)
Browse files- .github/workflows/tests.yml +1 -1
- README.md +0 -1
- agent-skill/Scrapling-Skill.zip +0 -0
- agent-skill/Scrapling-Skill/SKILL.md +2 -2
- agent-skill/Scrapling-Skill/examples/README.md +1 -1
- agent-skill/Scrapling-Skill/references/fetching/dynamic.md +3 -3
- agent-skill/Scrapling-Skill/references/fetching/static.md +1 -1
- agent-skill/Scrapling-Skill/references/fetching/stealthy.md +4 -5
- agent-skill/Scrapling-Skill/references/mcp-server.md +2 -2
- docs/README_AR.md +0 -1
- docs/README_CN.md +0 -1
- docs/README_DE.md +0 -1
- docs/README_ES.md +0 -1
- docs/README_FR.md +0 -1
- docs/README_JP.md +0 -1
- docs/README_RU.md +0 -1
- docs/fetching/dynamic.md +3 -3
- docs/fetching/static.md +1 -1
- docs/fetching/stealthy.md +4 -5
- docs/overview.md +2 -3
- pyproject.toml +5 -4
- scrapling/__init__.py +1 -1
- scrapling/core/_types.py +1 -2
- scrapling/core/ai.py +10 -10
- scrapling/engines/_browsers/_base.py +2 -2
- scrapling/engines/_browsers/_controllers.py +10 -15
- scrapling/engines/_browsers/_stealth.py +10 -15
- scrapling/engines/static.py +4 -4
- scrapling/engines/toolbelt/fingerprints.py +3 -32
- scrapling/fetchers/chrome.py +2 -2
- scrapling/fetchers/stealth_chrome.py +4 -4
- scrapling/parser.py +24 -10
- server.json +2 -2
- setup.cfg +1 -1
- tests/fetchers/test_utils.py +0 -17
- tests/parser/test_parser_advanced.py +27 -0
- tests/requirements.txt +1 -1
- tox.ini +2 -2
.github/workflows/tests.yml
CHANGED
|
@@ -73,7 +73,7 @@ jobs:
|
|
| 73 |
- name: Install all browsers dependencies
|
| 74 |
run: |
|
| 75 |
python3 -m pip install --upgrade pip
|
| 76 |
-
python3 -m pip install playwright==1.
|
| 77 |
|
| 78 |
- name: Get Playwright version
|
| 79 |
id: playwright-version
|
|
|
|
| 73 |
- name: Install all browsers dependencies
|
| 74 |
run: |
|
| 75 |
python3 -m pip install --upgrade pip
|
| 76 |
+
python3 -m pip install playwright==1.58.0 patchright==1.58.2
|
| 77 |
|
| 78 |
- name: Get Playwright version
|
| 79 |
id: playwright-version
|
README.md
CHANGED
|
@@ -144,7 +144,6 @@ MySpider().start()
|
|
| 144 |
|
| 145 |
<!-- sponsors -->
|
| 146 |
|
| 147 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 148 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 149 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 150 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 144 |
|
| 145 |
<!-- sponsors -->
|
| 146 |
|
|
|
|
| 147 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 148 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 149 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
agent-skill/Scrapling-Skill.zip
CHANGED
|
Binary files a/agent-skill/Scrapling-Skill.zip and b/agent-skill/Scrapling-Skill.zip differ
|
|
|
agent-skill/Scrapling-Skill/SKILL.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
---
|
| 2 |
name: scrapling-official
|
| 3 |
description: Scrape web pages using Scrapling with anti-bot bypass (like Cloudflare Turnstile), stealth headless browsing, spiders framework, adaptive scraping, and JavaScript rendering. Use when asked to scrape, crawl, or extract data from websites; web_fetch fails; the site has anti-bot protections; write Python code to scrape/crawl; or write spiders.
|
| 4 |
-
version: 0.4.1
|
| 5 |
license: Complete terms in LICENSE.txt
|
| 6 |
---
|
| 7 |
|
|
@@ -22,7 +22,7 @@ Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers fo
|
|
| 22 |
|
| 23 |
Create a virtual Python environment through any way available, like `venv`, then inside the environment do:
|
| 24 |
|
| 25 |
-
`pip install "scrapling[all]>=0.4.1"`
|
| 26 |
|
| 27 |
Then do this to download all the browsers' dependencies:
|
| 28 |
|
|
|
|
| 1 |
---
|
| 2 |
name: scrapling-official
|
| 3 |
description: Scrape web pages using Scrapling with anti-bot bypass (like Cloudflare Turnstile), stealth headless browsing, spiders framework, adaptive scraping, and JavaScript rendering. Use when asked to scrape, crawl, or extract data from websites; web_fetch fails; the site has anti-bot protections; write Python code to scrape/crawl; or write spiders.
|
| 4 |
+
version: 0.4.2
|
| 5 |
license: Complete terms in LICENSE.txt
|
| 6 |
---
|
| 7 |
|
|
|
|
| 22 |
|
| 23 |
Create a virtual Python environment through any way available, like `venv`, then inside the environment do:
|
| 24 |
|
| 25 |
+
`pip install "scrapling[all]>=0.4.2"`
|
| 26 |
|
| 27 |
Then do this to download all the browsers' dependencies:
|
| 28 |
|
agent-skill/Scrapling-Skill/examples/README.md
CHANGED
|
@@ -9,7 +9,7 @@ All examples collect **all 100 quotes across 10 pages**.
|
|
| 9 |
Make sure Scrapling is installed:
|
| 10 |
|
| 11 |
```bash
|
| 12 |
-
pip install "scrapling[all]>=0.4.1"
|
| 13 |
scrapling install --force
|
| 14 |
```
|
| 15 |
|
|
|
|
| 9 |
Make sure Scrapling is installed:
|
| 10 |
|
| 11 |
```bash
|
| 12 |
+
pip install "scrapling[all]>=0.4.2"
|
| 13 |
scrapling install --force
|
| 14 |
```
|
| 15 |
|
agent-skill/Scrapling-Skill/references/fetching/dynamic.md
CHANGED
|
@@ -64,8 +64,8 @@ All arguments for `DynamicFetcher` and its session classes:
|
|
| 64 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 65 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 66 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 67 |
-
| google_search | Enabled by default, Scrapling will set
|
| 68 |
-
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by
|
| 69 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 70 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 71 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
@@ -84,7 +84,7 @@ In session classes, all these arguments can be set globally for the session. Sti
|
|
| 84 |
|
| 85 |
**Notes:**
|
| 86 |
1. The `disable_resources` option made requests ~25% faster in tests for some websites and can help save proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 87 |
-
2. The `google_search` argument is enabled by default for all requests,
|
| 88 |
3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
|
| 89 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 90 |
|
|
|
|
| 64 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 65 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 66 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 67 |
+
| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
|
| 68 |
+
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
|
| 69 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 70 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 71 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
|
|
| 84 |
|
| 85 |
**Notes:**
|
| 86 |
1. The `disable_resources` option made requests ~25% faster in tests for some websites and can help save proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 87 |
+
2. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
|
| 88 |
3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
|
| 89 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 90 |
|
agent-skill/Scrapling-Skill/references/fetching/static.md
CHANGED
|
@@ -14,7 +14,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
|
|
| 14 |
All methods for making requests here share some arguments, so let's discuss them first.
|
| 15 |
|
| 16 |
- **url**: The targeted URL
|
| 17 |
-
- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets
|
| 18 |
- **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**
|
| 19 |
- **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**.
|
| 20 |
- **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
|
|
|
|
| 14 |
All methods for making requests here share some arguments, so let's discuss them first.
|
| 15 |
|
| 16 |
- **url**: The targeted URL
|
| 17 |
+
- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
|
| 18 |
- **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**
|
| 19 |
- **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**.
|
| 20 |
- **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
|
agent-skill/Scrapling-Skill/references/fetching/stealthy.md
CHANGED
|
@@ -21,8 +21,7 @@ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynami
|
|
| 21 |
3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
|
| 22 |
4. It generates canvas noise to prevent fingerprinting through canvas.
|
| 23 |
5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
|
| 24 |
-
6.
|
| 25 |
-
7. and other anti-protection options...
|
| 26 |
|
| 27 |
## Full list of arguments
|
| 28 |
Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
|
|
@@ -43,8 +42,8 @@ Scrapling provides many options with this fetcher and its session classes. Befor
|
|
| 43 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 44 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 45 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 46 |
-
| google_search | Enabled by default, Scrapling will set
|
| 47 |
-
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by
|
| 48 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 49 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 50 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
@@ -69,7 +68,7 @@ In session classes, all these arguments can be set globally for the session. Sti
|
|
| 69 |
|
| 70 |
1. It's basically the same arguments as [DynamicFetcher](dynamic.md) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
|
| 71 |
2. The `disable_resources` option made requests ~25% faster in tests for some websites and can help save proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 72 |
-
3. The `google_search` argument is enabled by default for all requests,
|
| 73 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 74 |
|
| 75 |
## Examples
|
|
|
|
| 21 |
3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
|
| 22 |
4. It generates canvas noise to prevent fingerprinting through canvas.
|
| 23 |
5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
|
| 24 |
+
6. and other anti-protection options...
|
|
|
|
| 25 |
|
| 26 |
## Full list of arguments
|
| 27 |
Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
|
|
|
|
| 42 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 43 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 44 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 45 |
+
| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
|
| 46 |
+
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
|
| 47 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 48 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 49 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
|
|
| 68 |
|
| 69 |
1. It's basically the same arguments as [DynamicFetcher](dynamic.md) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
|
| 70 |
2. The `disable_resources` option made requests ~25% faster in tests for some websites and can help save proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 71 |
+
3. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
|
| 72 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 73 |
|
| 74 |
## Examples
|
agent-skill/Scrapling-Skill/references/mcp-server.md
CHANGED
|
@@ -25,7 +25,7 @@ Fast HTTP GET with browser fingerprint impersonation (TLS, headers). Suitable fo
|
|
| 25 |
| `timeout` | number | 30 | Seconds before timeout |
|
| 26 |
| `retries` | int | 3 | Retry attempts on failure |
|
| 27 |
| `retry_delay` | int | 1 | Seconds between retries |
|
| 28 |
-
| `stealthy_headers` | bool | true | Generate realistic browser headers and Google
|
| 29 |
| `http3` | bool | false | Use HTTP/3 (may conflict with `impersonate`) |
|
| 30 |
| `follow_redirects` | bool | true | Follow HTTP redirects |
|
| 31 |
| `max_redirects` | int | 30 | Max redirects (-1 for unlimited) |
|
|
@@ -58,7 +58,7 @@ Opens a Chromium browser via Playwright to render JavaScript. Suitable for dynam
|
|
| 58 |
| `wait_selector_state` | str | `"attached"` | State for wait_selector: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
|
| 59 |
| `network_idle` | bool | false | Wait until no network activity for 500ms |
|
| 60 |
| `disable_resources` | bool | false | Block fonts, images, media, stylesheets, etc. for speed |
|
| 61 |
-
| `google_search` | bool | true | Set
|
| 62 |
| `real_chrome` | bool | false | Use locally installed Chrome instead of bundled Chromium |
|
| 63 |
| `cdp_url` | str or null | null | Connect to existing browser via CDP URL |
|
| 64 |
| `extra_headers` | dict or null | null | Additional request headers |
|
|
|
|
| 25 |
| `timeout` | number | 30 | Seconds before timeout |
|
| 26 |
| `retries` | int | 3 | Retry attempts on failure |
|
| 27 |
| `retry_delay` | int | 1 | Seconds between retries |
|
| 28 |
+
| `stealthy_headers` | bool | true | Generate realistic browser headers and Google referer |
|
| 29 |
| `http3` | bool | false | Use HTTP/3 (may conflict with `impersonate`) |
|
| 30 |
| `follow_redirects` | bool | true | Follow HTTP redirects |
|
| 31 |
| `max_redirects` | int | 30 | Max redirects (-1 for unlimited) |
|
|
|
|
| 58 |
| `wait_selector_state` | str | `"attached"` | State for wait_selector: `"attached"` / `"visible"` / `"hidden"` / `"detached"` |
|
| 59 |
| `network_idle` | bool | false | Wait until no network activity for 500ms |
|
| 60 |
| `disable_resources` | bool | false | Block fonts, images, media, stylesheets, etc. for speed |
|
| 61 |
+
| `google_search` | bool | true | Set a Google referer header |
|
| 62 |
| `real_chrome` | bool | false | Use locally installed Chrome instead of bundled Chromium |
|
| 63 |
| `cdp_url` | str or null | null | Connect to existing browser via CDP URL |
|
| 64 |
| `extra_headers` | dict or null | null | Additional request headers |
|
docs/README_AR.md
CHANGED
|
@@ -140,7 +140,6 @@ MySpider().start()
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
| 143 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 144 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 145 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 146 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
|
|
|
| 143 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 144 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 145 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/README_CN.md
CHANGED
|
@@ -140,7 +140,6 @@ MySpider().start()
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
| 143 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 144 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 145 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 146 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
|
|
|
| 143 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 144 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 145 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/README_DE.md
CHANGED
|
@@ -140,7 +140,6 @@ MySpider().start()
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
| 143 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 144 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 145 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 146 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
|
|
|
| 143 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 144 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 145 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/README_ES.md
CHANGED
|
@@ -140,7 +140,6 @@ MySpider().start()
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
| 143 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 144 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 145 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 146 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
|
|
|
| 143 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 144 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 145 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/README_FR.md
CHANGED
|
@@ -140,7 +140,6 @@ MySpider().start()
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
| 143 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 144 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 145 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 146 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
|
|
|
| 143 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 144 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 145 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/README_JP.md
CHANGED
|
@@ -140,7 +140,6 @@ MySpider().start()
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
| 143 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 144 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 145 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 146 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 140 |
|
| 141 |
<!-- sponsors -->
|
| 142 |
|
|
|
|
| 143 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 144 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 145 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/README_RU.md
CHANGED
|
@@ -143,7 +143,6 @@ MySpider().start()
|
|
| 143 |
|
| 144 |
<!-- sponsors -->
|
| 145 |
|
| 146 |
-
<a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
|
| 147 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 148 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 149 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
|
|
|
| 143 |
|
| 144 |
<!-- sponsors -->
|
| 145 |
|
|
|
|
| 146 |
<a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
|
| 147 |
<a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
|
| 148 |
<a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
|
docs/fetching/dynamic.md
CHANGED
|
@@ -76,8 +76,8 @@ Scrapling provides many options with this fetcher and its session classes. To ma
|
|
| 76 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 77 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 78 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 79 |
-
| google_search | Enabled by default, Scrapling will set
|
| 80 |
-
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by
|
| 81 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 82 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 83 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
@@ -97,7 +97,7 @@ In session classes, all these arguments can be set globally for the session. Sti
|
|
| 97 |
!!! note "Notes:"
|
| 98 |
|
| 99 |
1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 100 |
-
2. The `google_search` argument is enabled by default for all requests,
|
| 101 |
3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
|
| 102 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 103 |
|
|
|
|
| 76 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 77 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 78 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 79 |
+
| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
|
| 80 |
+
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
|
| 81 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 82 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 83 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
|
|
| 97 |
!!! note "Notes:"
|
| 98 |
|
| 99 |
1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 100 |
+
2. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, the referer set by `google_search` takes priority over the referer set there.
|
| 101 |
3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
|
| 102 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 103 |
|
docs/fetching/static.md
CHANGED
|
@@ -20,7 +20,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
|
|
| 20 |
All methods for making requests here share some arguments, so let's discuss them first.
|
| 21 |
|
| 22 |
- **url**: The targeted URL
|
| 23 |
-
- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets
|
| 24 |
- **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**
|
| 25 |
- **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**.
|
| 26 |
- **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
|
|
|
|
| 20 |
All methods for making requests here share some arguments, so let's discuss them first.
|
| 21 |
|
| 22 |
- **url**: The targeted URL
|
| 23 |
+
- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
|
| 24 |
- **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**
|
| 25 |
- **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**.
|
| 26 |
- **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
|
docs/fetching/stealthy.md
CHANGED
|
@@ -32,8 +32,7 @@ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynami
|
|
| 32 |
3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
|
| 33 |
4. It generates canvas noise to prevent fingerprinting through canvas.
|
| 34 |
5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
|
| 35 |
-
6.
|
| 36 |
-
7. and other anti-protection options...
|
| 37 |
|
| 38 |
## Full list of arguments
|
| 39 |
Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
|
|
@@ -54,8 +53,8 @@ Scrapling provides many options with this fetcher and its session classes. Befor
|
|
| 54 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 55 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 56 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 57 |
-
| google_search | Enabled by default, Scrapling will set
|
| 58 |
-
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by
|
| 59 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 60 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 61 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
@@ -80,7 +79,7 @@ In session classes, all these arguments can be set globally for the session. Sti
|
|
| 80 |
|
| 81 |
1. It's basically the same arguments as [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
|
| 82 |
2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 83 |
-
3. The `google_search` argument is enabled by default for all requests,
|
| 84 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 85 |
|
| 86 |
## Examples
|
|
|
|
| 32 |
3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
|
| 33 |
4. It generates canvas noise to prevent fingerprinting through canvas.
|
| 34 |
5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
|
| 35 |
+
6. and other anti-protection options...
|
|
|
|
| 36 |
|
| 37 |
## Full list of arguments
|
| 38 |
Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
|
|
|
|
| 53 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 54 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 55 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 56 |
+
| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
|
| 57 |
+
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
|
| 58 |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 59 |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
|
| 60 |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
|
|
|
|
| 79 |
|
| 80 |
1. It's basically the same arguments as [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
|
| 81 |
2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
|
| 82 |
+
3. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, the referer set by `google_search` takes priority over the referer set there.
|
| 83 |
4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
|
| 84 |
|
| 85 |
## Examples
|
docs/overview.md
CHANGED
|
@@ -280,7 +280,7 @@ For Async requests, you will replace the import like below:
|
|
| 280 |
|
| 281 |
!!! note "Notes:"
|
| 282 |
|
| 283 |
-
1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a
|
| 284 |
2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
|
| 285 |
3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic
|
| 286 |
|
|
@@ -320,8 +320,7 @@ Some of the things it does:
|
|
| 320 |
3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
|
| 321 |
4. It generates canvas noise to prevent fingerprinting through canvas.
|
| 322 |
5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
|
| 323 |
-
6.
|
| 324 |
-
7. and other anti-protection options...
|
| 325 |
|
| 326 |
```python
|
| 327 |
>>> from scrapling.fetchers import StealthyFetcher
|
|
|
|
| 280 |
|
| 281 |
!!! note "Notes:"
|
| 282 |
|
| 283 |
+
1. You have the `stealthy_headers` argument, which, when enabled, generates real browser headers and uses them for requests, including a Google referer header. It's enabled by default.
|
| 284 |
2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
|
| 285 |
3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic
|
| 286 |
|
|
|
|
| 320 |
3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
|
| 321 |
4. It generates canvas noise to prevent fingerprinting through canvas.
|
| 322 |
5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
|
| 323 |
+
6. and other anti-protection options...
|
|
|
|
| 324 |
|
| 325 |
```python
|
| 326 |
>>> from scrapling.fetchers import StealthyFetcher
|
pyproject.toml
CHANGED
|
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
|
|
| 5 |
[project]
|
| 6 |
name = "scrapling"
|
| 7 |
# Static version instead of a dynamic version so we can get better layer caching while building docker, check the docker file to understand
|
| 8 |
-
version = "0.4.1"
|
| 9 |
description = "Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!"
|
| 10 |
readme = {file = "README.md", content-type = "text/markdown"}
|
| 11 |
license = {file = "LICENSE"}
|
|
@@ -64,7 +64,7 @@ dependencies = [
|
|
| 64 |
"lxml>=6.0.2",
|
| 65 |
"cssselect>=1.4.0",
|
| 66 |
"orjson>=3.11.7",
|
| 67 |
-
"tld>=0.13.
|
| 68 |
"w3lib>=2.4.0",
|
| 69 |
"typing_extensions",
|
| 70 |
]
|
|
@@ -73,9 +73,10 @@ dependencies = [
|
|
| 73 |
fetchers = [
|
| 74 |
"click>=8.3.0",
|
| 75 |
"curl_cffi>=0.14.0",
|
| 76 |
-
"playwright==1.
|
| 77 |
-
"patchright==1.
|
| 78 |
"browserforge>=1.2.4",
|
|
|
|
| 79 |
"msgspec>=0.20.0",
|
| 80 |
"anyio>=4.12.1"
|
| 81 |
]
|
|
|
|
| 5 |
[project]
|
| 6 |
name = "scrapling"
|
| 7 |
# Static version instead of a dynamic version so we can get better layer caching while building docker, check the docker file to understand
|
| 8 |
+
version = "0.4.2"
|
| 9 |
description = "Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!"
|
| 10 |
readme = {file = "README.md", content-type = "text/markdown"}
|
| 11 |
license = {file = "LICENSE"}
|
|
|
|
| 64 |
"lxml>=6.0.2",
|
| 65 |
"cssselect>=1.4.0",
|
| 66 |
"orjson>=3.11.7",
|
| 67 |
+
"tld>=0.13.2",
|
| 68 |
"w3lib>=2.4.0",
|
| 69 |
"typing_extensions",
|
| 70 |
]
|
|
|
|
| 73 |
fetchers = [
|
| 74 |
"click>=8.3.0",
|
| 75 |
"curl_cffi>=0.14.0",
|
| 76 |
+
"playwright==1.58.0",
|
| 77 |
+
"patchright==1.58.2",
|
| 78 |
"browserforge>=1.2.4",
|
| 79 |
+
"apify-fingerprint-datapoints>=0.11.0",
|
| 80 |
"msgspec>=0.20.0",
|
| 81 |
"anyio>=4.12.1"
|
| 82 |
]
|
scrapling/__init__.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
__author__ = "Karim Shoair (karim.shoair@pm.me)"
|
| 2 |
-
__version__ = "0.4.1"
|
| 3 |
__copyright__ = "Copyright (c) 2024 Karim Shoair"
|
| 4 |
|
| 5 |
from typing import Any, TYPE_CHECKING
|
|
|
|
| 1 |
__author__ = "Karim Shoair (karim.shoair@pm.me)"
|
| 2 |
+
__version__ = "0.4.2"
|
| 3 |
__copyright__ = "Copyright (c) 2024 Karim Shoair"
|
| 4 |
|
| 5 |
from typing import Any, TYPE_CHECKING
|
scrapling/core/_types.py
CHANGED
|
@@ -4,7 +4,6 @@ Type definitions for type checking purposes.
|
|
| 4 |
|
| 5 |
from typing import (
|
| 6 |
TYPE_CHECKING,
|
| 7 |
-
TypedDict,
|
| 8 |
TypeAlias,
|
| 9 |
cast,
|
| 10 |
overload,
|
|
@@ -32,7 +31,7 @@ from typing import (
|
|
| 32 |
Coroutine,
|
| 33 |
SupportsIndex,
|
| 34 |
)
|
| 35 |
-
from typing_extensions import Self, Unpack
|
| 36 |
|
| 37 |
# Proxy can be a string URL or a dict (Playwright format: {"server": "...", "username": "...", "password": "..."})
|
| 38 |
ProxyType = Union[str, Dict[str, str]]
|
|
|
|
| 4 |
|
| 5 |
from typing import (
|
| 6 |
TYPE_CHECKING,
|
|
|
|
| 7 |
TypeAlias,
|
| 8 |
cast,
|
| 9 |
overload,
|
|
|
|
| 31 |
Coroutine,
|
| 32 |
SupportsIndex,
|
| 33 |
)
|
| 34 |
+
from typing_extensions import Self, Unpack, TypedDict
|
| 35 |
|
| 36 |
# Proxy can be a string URL or a dict (Playwright format: {"server": "...", "username": "...", "password": "..."})
|
| 37 |
ProxyType = Union[str, Dict[str, str]]
|
scrapling/core/ai.py
CHANGED
|
@@ -105,7 +105,7 @@ class ScraplingMCPServer:
|
|
| 105 |
:param auth: HTTP basic auth in dictionary format with `username` and `password` keys.
|
| 106 |
:param verify: Whether to verify HTTPS certificates.
|
| 107 |
:param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used with `impersonate`.
|
| 108 |
-
:param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets
|
| 109 |
"""
|
| 110 |
normalized_proxy_auth = _normalize_credentials(proxy_auth)
|
| 111 |
normalized_auth = _normalize_credentials(auth)
|
|
@@ -186,7 +186,7 @@ class ScraplingMCPServer:
|
|
| 186 |
:param auth: HTTP basic auth in dictionary format with `username` and `password` keys.
|
| 187 |
:param verify: Whether to verify HTTPS certificates.
|
| 188 |
:param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used with `impersonate`.
|
| 189 |
-
:param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets
|
| 190 |
"""
|
| 191 |
normalized_proxy_auth = _normalize_credentials(proxy_auth)
|
| 192 |
normalized_auth = _normalize_credentials(auth)
|
|
@@ -276,8 +276,8 @@ class ScraplingMCPServer:
|
|
| 276 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 277 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 278 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 279 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 280 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 281 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 282 |
"""
|
| 283 |
page = await DynamicFetcher.async_fetch(
|
|
@@ -358,8 +358,8 @@ class ScraplingMCPServer:
|
|
| 358 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 359 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 360 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 361 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 362 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 363 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 364 |
"""
|
| 365 |
async with AsyncDynamicSession(
|
|
@@ -454,8 +454,8 @@ class ScraplingMCPServer:
|
|
| 454 |
:param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
|
| 455 |
:param block_webrtc: Forces WebRTC to respect proxy settings to prevent local IP address leak.
|
| 456 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 457 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 458 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 459 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 460 |
:param additional_args: Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.
|
| 461 |
"""
|
|
@@ -551,8 +551,8 @@ class ScraplingMCPServer:
|
|
| 551 |
:param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
|
| 552 |
:param block_webrtc: Forces WebRTC to respect proxy settings to prevent local IP address leak.
|
| 553 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 554 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 555 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 556 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 557 |
:param additional_args: Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.
|
| 558 |
"""
|
|
|
|
| 105 |
:param auth: HTTP basic auth in dictionary format with `username` and `password` keys.
|
| 106 |
:param verify: Whether to verify HTTPS certificates.
|
| 107 |
:param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used with `impersonate`.
|
| 108 |
+
:param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
|
| 109 |
"""
|
| 110 |
normalized_proxy_auth = _normalize_credentials(proxy_auth)
|
| 111 |
normalized_auth = _normalize_credentials(auth)
|
|
|
|
| 186 |
:param auth: HTTP basic auth in dictionary format with `username` and `password` keys.
|
| 187 |
:param verify: Whether to verify HTTPS certificates.
|
| 188 |
:param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used with `impersonate`.
|
| 189 |
+
:param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
|
| 190 |
"""
|
| 191 |
normalized_proxy_auth = _normalize_credentials(proxy_auth)
|
| 192 |
normalized_auth = _normalize_credentials(auth)
|
|
|
|
| 276 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 277 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 278 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 279 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 280 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 281 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 282 |
"""
|
| 283 |
page = await DynamicFetcher.async_fetch(
|
|
|
|
| 358 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 359 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 360 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 361 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 362 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 363 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 364 |
"""
|
| 365 |
async with AsyncDynamicSession(
|
|
|
|
| 454 |
:param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
|
| 455 |
:param block_webrtc: Forces WebRTC to respect proxy settings to prevent local IP address leak.
|
| 456 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 457 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 458 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 459 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 460 |
:param additional_args: Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.
|
| 461 |
"""
|
|
|
|
| 551 |
:param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
|
| 552 |
:param block_webrtc: Forces WebRTC to respect proxy settings to prevent local IP address leak.
|
| 553 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 554 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 555 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 556 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 557 |
:param additional_args: Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.
|
| 558 |
"""
|
scrapling/engines/_browsers/_base.py
CHANGED
|
@@ -419,7 +419,7 @@ class BaseSessionMixin:
|
|
| 419 |
if not config.cdp_url:
|
| 420 |
flags = self._browser_options["args"]
|
| 421 |
if config.extra_flags or extra_flags:
|
| 422 |
-
flags = list(set(flags + (config.extra_flags or extra_flags)))
|
| 423 |
|
| 424 |
self._browser_options.update(
|
| 425 |
{
|
|
@@ -480,7 +480,7 @@ class StealthySessionMixin(BaseSessionMixin):
|
|
| 480 |
config = cast(StealthConfig, self._config)
|
| 481 |
flags: Tuple[str, ...] = tuple()
|
| 482 |
if not config.cdp_url:
|
| 483 |
-
flags = DEFAULT_ARGS + STEALTH_ARGS
|
| 484 |
|
| 485 |
if config.block_webrtc:
|
| 486 |
flags += (
|
|
|
|
| 419 |
if not config.cdp_url:
|
| 420 |
flags = self._browser_options["args"]
|
| 421 |
if config.extra_flags or extra_flags:
|
| 422 |
+
flags = list(set(tuple(flags) + tuple(config.extra_flags or extra_flags or ())))
|
| 423 |
|
| 424 |
self._browser_options.update(
|
| 425 |
{
|
|
|
|
| 480 |
config = cast(StealthConfig, self._config)
|
| 481 |
flags: Tuple[str, ...] = tuple()
|
| 482 |
if not config.cdp_url:
|
| 483 |
+
flags = tuple(DEFAULT_ARGS) + tuple(STEALTH_ARGS)
|
| 484 |
|
| 485 |
if config.block_webrtc:
|
| 486 |
flags += (
|
scrapling/engines/_browsers/_controllers.py
CHANGED
|
@@ -14,7 +14,6 @@ from scrapling.core.utils import log
|
|
| 14 |
from scrapling.core._types import Optional, ProxyType, Unpack
|
| 15 |
from scrapling.engines.toolbelt.proxy_rotation import is_proxy_error
|
| 16 |
from scrapling.engines.toolbelt.convertor import Response, ResponseFactory
|
| 17 |
-
from scrapling.engines.toolbelt.fingerprints import generate_convincing_referer
|
| 18 |
from scrapling.engines._browsers._types import PlaywrightSession, PlaywrightFetchParams
|
| 19 |
from scrapling.engines._browsers._base import SyncSession, AsyncSession, DynamicSessionMixin
|
| 20 |
from scrapling.engines._browsers._validators import validate_fetch as _validate, PlaywrightConfig
|
|
@@ -58,8 +57,8 @@ class DynamicSession(SyncSession, DynamicSessionMixin):
|
|
| 58 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 59 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 60 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 61 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 62 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 63 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 64 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 65 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
@@ -103,11 +102,11 @@ class DynamicSession(SyncSession, DynamicSessionMixin):
|
|
| 103 |
"""Opens up the browser and do your request based on your chosen options.
|
| 104 |
|
| 105 |
:param url: The Target url.
|
| 106 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 107 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 108 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 109 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 110 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 111 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 112 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 113 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
@@ -127,9 +126,7 @@ class DynamicSession(SyncSession, DynamicSessionMixin):
|
|
| 127 |
|
| 128 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 129 |
referer = (
|
| 130 |
-
|
| 131 |
-
if (params.google_search and "referer" not in request_headers_keys)
|
| 132 |
-
else None
|
| 133 |
)
|
| 134 |
|
| 135 |
for attempt in range(self._config.retries):
|
|
@@ -226,8 +223,8 @@ class AsyncDynamicSession(AsyncSession, DynamicSessionMixin):
|
|
| 226 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 227 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 228 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 229 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 230 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 231 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 232 |
:param max_pages: The maximum number of tabs to be opened at the same time. It will be used in rotation through a PagePool.
|
| 233 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
|
@@ -271,11 +268,11 @@ class AsyncDynamicSession(AsyncSession, DynamicSessionMixin):
|
|
| 271 |
"""Opens up the browser and do your request based on your chosen options.
|
| 272 |
|
| 273 |
:param url: The Target url.
|
| 274 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 275 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 276 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 277 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 278 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 279 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 280 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 281 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
@@ -296,9 +293,7 @@ class AsyncDynamicSession(AsyncSession, DynamicSessionMixin):
|
|
| 296 |
|
| 297 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 298 |
referer = (
|
| 299 |
-
|
| 300 |
-
if (params.google_search and "referer" not in request_headers_keys)
|
| 301 |
-
else None
|
| 302 |
)
|
| 303 |
|
| 304 |
for attempt in range(self._config.retries):
|
|
|
|
| 14 |
from scrapling.core._types import Optional, ProxyType, Unpack
|
| 15 |
from scrapling.engines.toolbelt.proxy_rotation import is_proxy_error
|
| 16 |
from scrapling.engines.toolbelt.convertor import Response, ResponseFactory
|
|
|
|
| 17 |
from scrapling.engines._browsers._types import PlaywrightSession, PlaywrightFetchParams
|
| 18 |
from scrapling.engines._browsers._base import SyncSession, AsyncSession, DynamicSessionMixin
|
| 19 |
from scrapling.engines._browsers._validators import validate_fetch as _validate, PlaywrightConfig
|
|
|
|
| 57 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 58 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 59 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 60 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 61 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 62 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 63 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 64 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 102 |
"""Opens up the browser and do your request based on your chosen options.
|
| 103 |
|
| 104 |
:param url: The Target url.
|
| 105 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 106 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 107 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 108 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 109 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 110 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 111 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 112 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
|
|
| 126 |
|
| 127 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 128 |
referer = (
|
| 129 |
+
"https://www.google.com/" if (params.google_search and "referer" not in request_headers_keys) else None
|
|
|
|
|
|
|
| 130 |
)
|
| 131 |
|
| 132 |
for attempt in range(self._config.retries):
|
|
|
|
| 223 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 224 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 225 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 226 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 227 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 228 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 229 |
:param max_pages: The maximum number of tabs to be opened at the same time. It will be used in rotation through a PagePool.
|
| 230 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
|
|
|
| 268 |
"""Opens up the browser and do your request based on your chosen options.
|
| 269 |
|
| 270 |
:param url: The Target url.
|
| 271 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 272 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 273 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 274 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 275 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 276 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 277 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 278 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
|
|
| 293 |
|
| 294 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 295 |
referer = (
|
| 296 |
+
"https://www.google.com/" if (params.google_search and "referer" not in request_headers_keys) else None
|
|
|
|
|
|
|
| 297 |
)
|
| 298 |
|
| 299 |
for attempt in range(self._config.retries):
|
scrapling/engines/_browsers/_stealth.py
CHANGED
|
@@ -16,7 +16,6 @@ from scrapling.core.utils import log
|
|
| 16 |
from scrapling.core._types import Any, Optional, ProxyType, Unpack
|
| 17 |
from scrapling.engines.toolbelt.proxy_rotation import is_proxy_error
|
| 18 |
from scrapling.engines.toolbelt.convertor import Response, ResponseFactory
|
| 19 |
-
from scrapling.engines.toolbelt.fingerprints import generate_convincing_referer
|
| 20 |
from scrapling.engines._browsers._types import StealthSession, StealthFetchParams
|
| 21 |
from scrapling.engines._browsers._base import SyncSession, AsyncSession, StealthySessionMixin
|
| 22 |
from scrapling.engines._browsers._validators import validate_fetch as _validate, StealthConfig
|
|
@@ -66,8 +65,8 @@ class StealthySession(SyncSession, StealthySessionMixin):
|
|
| 66 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 67 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 68 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 69 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 70 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 71 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 72 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 73 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
@@ -189,11 +188,11 @@ class StealthySession(SyncSession, StealthySessionMixin):
|
|
| 189 |
"""Opens up the browser and do your request based on your chosen options.
|
| 190 |
|
| 191 |
:param url: The Target url.
|
| 192 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 193 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 194 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 195 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 196 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 197 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 198 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 199 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
@@ -214,9 +213,7 @@ class StealthySession(SyncSession, StealthySessionMixin):
|
|
| 214 |
|
| 215 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 216 |
referer = (
|
| 217 |
-
|
| 218 |
-
if (params.google_search and "referer" not in request_headers_keys)
|
| 219 |
-
else None
|
| 220 |
)
|
| 221 |
|
| 222 |
for attempt in range(self._config.retries):
|
|
@@ -322,8 +319,8 @@ class AsyncStealthySession(AsyncSession, StealthySessionMixin):
|
|
| 322 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 323 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 324 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 325 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 326 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 327 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 328 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 329 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
@@ -444,11 +441,11 @@ class AsyncStealthySession(AsyncSession, StealthySessionMixin):
|
|
| 444 |
"""Opens up the browser and do your request based on your chosen options.
|
| 445 |
|
| 446 |
:param url: The Target url.
|
| 447 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 448 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 449 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 450 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 451 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 452 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 453 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 454 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
@@ -470,9 +467,7 @@ class AsyncStealthySession(AsyncSession, StealthySessionMixin):
|
|
| 470 |
|
| 471 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 472 |
referer = (
|
| 473 |
-
|
| 474 |
-
if (params.google_search and "referer" not in request_headers_keys)
|
| 475 |
-
else None
|
| 476 |
)
|
| 477 |
|
| 478 |
for attempt in range(self._config.retries):
|
|
|
|
| 16 |
from scrapling.core._types import Any, Optional, ProxyType, Unpack
|
| 17 |
from scrapling.engines.toolbelt.proxy_rotation import is_proxy_error
|
| 18 |
from scrapling.engines.toolbelt.convertor import Response, ResponseFactory
|
|
|
|
| 19 |
from scrapling.engines._browsers._types import StealthSession, StealthFetchParams
|
| 20 |
from scrapling.engines._browsers._base import SyncSession, AsyncSession, StealthySessionMixin
|
| 21 |
from scrapling.engines._browsers._validators import validate_fetch as _validate, StealthConfig
|
|
|
|
| 65 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 66 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 67 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 68 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 69 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 70 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 71 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 72 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 188 |
"""Opens up the browser and do your request based on your chosen options.
|
| 189 |
|
| 190 |
:param url: The Target url.
|
| 191 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 192 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 193 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 194 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 195 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 196 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 197 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 198 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
|
|
| 213 |
|
| 214 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 215 |
referer = (
|
| 216 |
+
"https://www.google.com/" if (params.google_search and "referer" not in request_headers_keys) else None
|
|
|
|
|
|
|
| 217 |
)
|
| 218 |
|
| 219 |
for attempt in range(self._config.retries):
|
|
|
|
| 319 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 320 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 321 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 322 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 323 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 324 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 325 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 326 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 441 |
"""Opens up the browser and do your request based on your chosen options.
|
| 442 |
|
| 443 |
:param url: The Target url.
|
| 444 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 445 |
:param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
|
| 446 |
:param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
|
| 447 |
:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
|
| 448 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 449 |
:param disable_resources: Drop requests for unnecessary resources for a speed boost.
|
| 450 |
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
|
| 451 |
:param blocked_domains: A set of domain names to block requests to. Subdomains are also matched (e.g., ``"example.com"`` blocks ``"sub.example.com"`` too).
|
|
|
|
| 467 |
|
| 468 |
request_headers_keys = {h.lower() for h in params.extra_headers.keys()} if params.extra_headers else set()
|
| 469 |
referer = (
|
| 470 |
+
"https://www.google.com/" if (params.google_search and "referer" not in request_headers_keys) else None
|
|
|
|
|
|
|
| 471 |
)
|
| 472 |
|
| 473 |
for attempt in range(self._config.retries):
|
scrapling/engines/static.py
CHANGED
|
@@ -26,7 +26,7 @@ from .toolbelt.custom import Response
|
|
| 26 |
from .toolbelt.convertor import ResponseFactory
|
| 27 |
from .toolbelt.proxy_rotation import ProxyRotator, is_proxy_error
|
| 28 |
from ._browsers._types import RequestsSession, GetRequestParams, DataRequestParams, ImpersonateType
|
| 29 |
-
from .toolbelt.fingerprints import
|
| 30 |
|
| 31 |
_NO_SESSION: Any = object()
|
| 32 |
|
|
@@ -166,14 +166,14 @@ class _ConfigurationLogic(ABC):
|
|
| 166 |
"""
|
| 167 |
1. Adds a useragent to the headers if it doesn't have one
|
| 168 |
2. Generates real headers and append them to current headers
|
| 169 |
-
3.
|
| 170 |
"""
|
| 171 |
# Merge session headers with request headers, request takes precedence (if it was set)
|
| 172 |
final_headers = {**self._default_headers, **(headers if headers else {})}
|
| 173 |
headers_keys = {k.lower() for k in final_headers}
|
| 174 |
if stealth:
|
| 175 |
if "referer" not in headers_keys:
|
| 176 |
-
final_headers["referer"] =
|
| 177 |
|
| 178 |
if not impersonate_enabled: # Curl will generate the suitable headers
|
| 179 |
extra_headers = generate_headers(browser_mode=False)
|
|
@@ -672,7 +672,7 @@ class FetcherSession:
|
|
| 672 |
"""
|
| 673 |
:param impersonate: Browser version to impersonate. Can be a single browser string or a list of browser strings for random selection. (Default: latest available Chrome version)
|
| 674 |
:param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used it with `impersonate`.
|
| 675 |
-
:param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets
|
| 676 |
:param proxies: Dict of proxies to use. Format: {"http": proxy_url, "https": proxy_url}.
|
| 677 |
:param proxy: Proxy URL to use. Format: "http://username:password@localhost:8030".
|
| 678 |
Cannot be used together with the `proxies` parameter.
|
|
|
|
| 26 |
from .toolbelt.convertor import ResponseFactory
|
| 27 |
from .toolbelt.proxy_rotation import ProxyRotator, is_proxy_error
|
| 28 |
from ._browsers._types import RequestsSession, GetRequestParams, DataRequestParams, ImpersonateType
|
| 29 |
+
from .toolbelt.fingerprints import generate_headers, __default_useragent__
|
| 30 |
|
| 31 |
_NO_SESSION: Any = object()
|
| 32 |
|
|
|
|
| 166 |
"""
|
| 167 |
1. Adds a useragent to the headers if it doesn't have one
|
| 168 |
2. Generates real headers and append them to current headers
|
| 169 |
+
3. Sets a Google referer header.
|
| 170 |
"""
|
| 171 |
# Merge session headers with request headers, request takes precedence (if it was set)
|
| 172 |
final_headers = {**self._default_headers, **(headers if headers else {})}
|
| 173 |
headers_keys = {k.lower() for k in final_headers}
|
| 174 |
if stealth:
|
| 175 |
if "referer" not in headers_keys:
|
| 176 |
+
final_headers["referer"] = "https://www.google.com/"
|
| 177 |
|
| 178 |
if not impersonate_enabled: # Curl will generate the suitable headers
|
| 179 |
extra_headers = generate_headers(browser_mode=False)
|
|
|
|
| 672 |
"""
|
| 673 |
:param impersonate: Browser version to impersonate. Can be a single browser string or a list of browser strings for random selection. (Default: latest available Chrome version)
|
| 674 |
:param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used it with `impersonate`.
|
| 675 |
+
:param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
|
| 676 |
:param proxies: Dict of proxies to use. Format: {"http": proxy_url, "https": proxy_url}.
|
| 677 |
:param proxy: Proxy URL to use. Format: "http://username:password@localhost:8030".
|
| 678 |
Cannot be used together with the `proxies` parameter.
|
scrapling/engines/toolbelt/fingerprints.py
CHANGED
|
@@ -5,45 +5,16 @@ Functions related to generating headers and fingerprints generally
|
|
| 5 |
from functools import lru_cache
|
| 6 |
from platform import system as platform_system
|
| 7 |
|
| 8 |
-
from tld import get_tld, Result
|
| 9 |
from browserforge.headers import Browser, HeaderGenerator
|
| 10 |
from browserforge.headers.generator import SUPPORTED_OPERATING_SYSTEMS
|
| 11 |
|
| 12 |
-
from scrapling.core._types import Dict, Literal, Tuple
|
| 13 |
|
| 14 |
__OS_NAME__ = platform_system()
|
| 15 |
OSName = Literal["linux", "macos", "windows"]
|
| 16 |
# Current versions hardcoded for now (Playwright doesn't allow to know the version of a browser without launching it)
|
| 17 |
-
chromium_version =
|
| 18 |
-
chrome_version =
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
@lru_cache(10, typed=True)
|
| 22 |
-
def generate_convincing_referer(url: str) -> str | None:
|
| 23 |
-
"""Takes the domain from the URL without the subdomain/suffix and make it look like you were searching Google for this website
|
| 24 |
-
|
| 25 |
-
>>> generate_convincing_referer('https://www.somewebsite.com/blah')
|
| 26 |
-
'https://www.google.com/search?q=somewebsite'
|
| 27 |
-
|
| 28 |
-
:param url: The URL you are about to fetch.
|
| 29 |
-
:return: Google's search URL of the domain name, or None for localhost/IP addresses
|
| 30 |
-
"""
|
| 31 |
-
# Fixing the inaccurate return type hint in `get_tld`
|
| 32 |
-
extracted: Result | None = cast(Result, get_tld(url, as_object=True, fail_silently=True))
|
| 33 |
-
if not extracted:
|
| 34 |
-
return None
|
| 35 |
-
|
| 36 |
-
website_name = extracted.domain
|
| 37 |
-
|
| 38 |
-
# Skip generating referer for localhost, IP addresses, or when there's no valid domain
|
| 39 |
-
if not website_name or not extracted.tld or website_name in ("localhost", "127.0.0.1", "::1"):
|
| 40 |
-
return None
|
| 41 |
-
|
| 42 |
-
# Check if it's an IP address (simple check for IPv4)
|
| 43 |
-
if all(part.isdigit() for part in website_name.split(".") if part):
|
| 44 |
-
return None
|
| 45 |
-
|
| 46 |
-
return f"https://www.google.com/search?q={website_name}"
|
| 47 |
|
| 48 |
|
| 49 |
@lru_cache(1, typed=True)
|
|
|
|
| 5 |
from functools import lru_cache
|
| 6 |
from platform import system as platform_system
|
| 7 |
|
|
|
|
| 8 |
from browserforge.headers import Browser, HeaderGenerator
|
| 9 |
from browserforge.headers.generator import SUPPORTED_OPERATING_SYSTEMS
|
| 10 |
|
| 11 |
+
from scrapling.core._types import Dict, Literal, Tuple
|
| 12 |
|
| 13 |
__OS_NAME__ = platform_system()
|
| 14 |
OSName = Literal["linux", "macos", "windows"]
|
| 15 |
# Current versions hardcoded for now (Playwright doesn't allow to know the version of a browser without launching it)
|
| 16 |
+
chromium_version = 145
|
| 17 |
+
chrome_version = 145
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
|
| 19 |
|
| 20 |
@lru_cache(1, typed=True)
|
scrapling/fetchers/chrome.py
CHANGED
|
@@ -28,7 +28,7 @@ class DynamicFetcher(BaseFetcher):
|
|
| 28 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 29 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 30 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 31 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 32 |
:param extra_headers: A dictionary of extra headers to add to the request.
|
| 33 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 34 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
@@ -68,7 +68,7 @@ class DynamicFetcher(BaseFetcher):
|
|
| 68 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 69 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 70 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 71 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 72 |
:param extra_headers: A dictionary of extra headers to add to the request.
|
| 73 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 74 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 28 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 29 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 30 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 31 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 32 |
:param extra_headers: A dictionary of extra headers to add to the request.
|
| 33 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 34 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 68 |
:param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
|
| 69 |
:param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
|
| 70 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 71 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 72 |
:param extra_headers: A dictionary of extra headers to add to the request.
|
| 73 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 74 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
scrapling/fetchers/stealth_chrome.py
CHANGED
|
@@ -39,8 +39,8 @@ class StealthyFetcher(BaseFetcher):
|
|
| 39 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 40 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 41 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 42 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 43 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 44 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 45 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 46 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
@@ -88,8 +88,8 @@ class StealthyFetcher(BaseFetcher):
|
|
| 88 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 89 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 90 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 91 |
-
:param google_search: Enabled by default, Scrapling will set
|
| 92 |
-
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by
|
| 93 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 94 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 95 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 39 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 40 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 41 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 42 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 43 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 44 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 45 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 46 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
|
|
|
| 88 |
:param allow_webgl: Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
|
| 89 |
:param load_dom: Enabled by default, wait for all JavaScript on page(s) to fully load and execute.
|
| 90 |
:param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
|
| 91 |
+
:param google_search: Enabled by default, Scrapling will set a Google referer header.
|
| 92 |
+
:param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
|
| 93 |
:param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
|
| 94 |
:param user_data_dir: Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory.
|
| 95 |
:param extra_flags: A list of additional browser flags to pass to the browser on launch.
|
scrapling/parser.py
CHANGED
|
@@ -58,6 +58,7 @@ _find_all_elements = XPath(".//*")
|
|
| 58 |
_find_all_elements_with_spaces = XPath(
|
| 59 |
".//*[normalize-space(text())]"
|
| 60 |
) # This selector gets all elements with text content
|
|
|
|
| 61 |
|
| 62 |
|
| 63 |
class Selector(SelectorsGeneration):
|
|
@@ -299,18 +300,31 @@ class Selector(SelectorsGeneration):
|
|
| 299 |
|
| 300 |
ignored_elements: set[Any] = set()
|
| 301 |
if ignore_tags:
|
| 302 |
-
|
| 303 |
-
ignored_elements.add(element)
|
| 304 |
-
ignored_elements.update(cast(list, _find_all_elements(element)))
|
| 305 |
|
| 306 |
_all_strings = []
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
|
| 313 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 314 |
|
| 315 |
return cast(TextHandler, TextHandler(separator).join(_all_strings))
|
| 316 |
|
|
|
|
| 58 |
_find_all_elements_with_spaces = XPath(
|
| 59 |
".//*[normalize-space(text())]"
|
| 60 |
) # This selector gets all elements with text content
|
| 61 |
+
_find_all_text_nodes = XPath(".//text()")
|
| 62 |
|
| 63 |
|
| 64 |
class Selector(SelectorsGeneration):
|
|
|
|
| 300 |
|
| 301 |
ignored_elements: set[Any] = set()
|
| 302 |
if ignore_tags:
|
| 303 |
+
ignored_elements.update(self._root.iter(*ignore_tags))
|
|
|
|
|
|
|
| 304 |
|
| 305 |
_all_strings = []
|
| 306 |
+
|
| 307 |
+
def append_text(text: str) -> None:
|
| 308 |
+
processed_text = text.strip() if strip else text
|
| 309 |
+
if not valid_values or processed_text.strip():
|
| 310 |
+
_all_strings.append(processed_text)
|
| 311 |
+
|
| 312 |
+
def is_visible_text_node(text_node: _ElementUnicodeResult) -> bool:
|
| 313 |
+
parent = text_node.getparent()
|
| 314 |
+
if parent is None:
|
| 315 |
+
return False
|
| 316 |
+
|
| 317 |
+
owner = parent.getparent() if text_node.is_tail else parent
|
| 318 |
+
while owner is not None:
|
| 319 |
+
if owner in ignored_elements:
|
| 320 |
+
return False
|
| 321 |
+
owner = owner.getparent()
|
| 322 |
+
return True
|
| 323 |
+
|
| 324 |
+
for text_node in cast(list[_ElementUnicodeResult], _find_all_text_nodes(self._root)):
|
| 325 |
+
text = str(text_node)
|
| 326 |
+
if text and is_visible_text_node(text_node):
|
| 327 |
+
append_text(text)
|
| 328 |
|
| 329 |
return cast(TextHandler, TextHandler(separator).join(_all_strings))
|
| 330 |
|
server.json
CHANGED
|
@@ -14,12 +14,12 @@
|
|
| 14 |
"mimeType": "image/png"
|
| 15 |
}
|
| 16 |
],
|
| 17 |
-
"version": "0.4.1",
|
| 18 |
"packages": [
|
| 19 |
{
|
| 20 |
"registryType": "pypi",
|
| 21 |
"identifier": "scrapling",
|
| 22 |
-
"version": "0.4.1",
|
| 23 |
"runtimeHint": "uvx",
|
| 24 |
"packageArguments": [
|
| 25 |
{
|
|
|
|
| 14 |
"mimeType": "image/png"
|
| 15 |
}
|
| 16 |
],
|
| 17 |
+
"version": "0.4.2",
|
| 18 |
"packages": [
|
| 19 |
{
|
| 20 |
"registryType": "pypi",
|
| 21 |
"identifier": "scrapling",
|
| 22 |
+
"version": "0.4.2",
|
| 23 |
"runtimeHint": "uvx",
|
| 24 |
"packageArguments": [
|
| 25 |
{
|
setup.cfg
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
[metadata]
|
| 2 |
name = scrapling
|
| 3 |
-
version = 0.4.1
|
| 4 |
author = Karim Shoair
|
| 5 |
author_email = karim.shoair@pm.me
|
| 6 |
description = Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!
|
|
|
|
| 1 |
[metadata]
|
| 2 |
name = scrapling
|
| 3 |
+
version = 0.4.2
|
| 4 |
author = Karim Shoair
|
| 5 |
author_email = karim.shoair@pm.me
|
| 6 |
description = Scrapling is an undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!
|
tests/fetchers/test_utils.py
CHANGED
|
@@ -7,7 +7,6 @@ from scrapling.engines.toolbelt.navigation import (
|
|
| 7 |
create_async_intercept_handler,
|
| 8 |
)
|
| 9 |
from scrapling.engines.toolbelt.fingerprints import (
|
| 10 |
-
generate_convincing_referer,
|
| 11 |
get_os_name,
|
| 12 |
generate_headers
|
| 13 |
)
|
|
@@ -204,22 +203,6 @@ class TestConstructProxyDict:
|
|
| 204 |
class TestFingerprintFunctions:
|
| 205 |
"""Test fingerprint generation functions"""
|
| 206 |
|
| 207 |
-
def test_generate_convincing_referer(self):
|
| 208 |
-
"""Test referer generation"""
|
| 209 |
-
url = "https://sub.example.com/page.html"
|
| 210 |
-
result = generate_convincing_referer(url)
|
| 211 |
-
|
| 212 |
-
assert result.startswith("https://www.google.com/search?q=")
|
| 213 |
-
assert "example" in result
|
| 214 |
-
|
| 215 |
-
def test_generate_convincing_referer_caching(self):
|
| 216 |
-
"""Test referer generation caching"""
|
| 217 |
-
url = "https://example.com"
|
| 218 |
-
result1 = generate_convincing_referer(url)
|
| 219 |
-
result2 = generate_convincing_referer(url)
|
| 220 |
-
|
| 221 |
-
assert result1 == result2
|
| 222 |
-
|
| 223 |
def test_get_os_name(self):
|
| 224 |
"""Test OS name detection"""
|
| 225 |
result = get_os_name()
|
|
|
|
| 7 |
create_async_intercept_handler,
|
| 8 |
)
|
| 9 |
from scrapling.engines.toolbelt.fingerprints import (
|
|
|
|
| 10 |
get_os_name,
|
| 11 |
generate_headers
|
| 12 |
)
|
|
|
|
| 203 |
class TestFingerprintFunctions:
|
| 204 |
"""Test fingerprint generation functions"""
|
| 205 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 206 |
def test_get_os_name(self):
|
| 207 |
"""Test OS name detection"""
|
| 208 |
result = get_os_name()
|
tests/parser/test_parser_advanced.py
CHANGED
|
@@ -183,6 +183,33 @@ class TestAdvancedSelectors:
|
|
| 183 |
text = page.get_all_text(valid_values=False)
|
| 184 |
assert text != ""
|
| 185 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 186 |
|
| 187 |
class TestTextHandlerAdvanced:
|
| 188 |
"""Test advanced TextHandler functionality"""
|
|
|
|
| 183 |
text = page.get_all_text(valid_values=False)
|
| 184 |
assert text != ""
|
| 185 |
|
| 186 |
+
def test_get_all_text_preserves_interleaved_text_nodes(self):
|
| 187 |
+
"""Test get_all_text preserves interleaved text nodes"""
|
| 188 |
+
html = """
|
| 189 |
+
<html>
|
| 190 |
+
<body>
|
| 191 |
+
<main>
|
| 192 |
+
string1
|
| 193 |
+
<b>string2</b>
|
| 194 |
+
string3
|
| 195 |
+
<div>
|
| 196 |
+
<span>string4</span>
|
| 197 |
+
</div>
|
| 198 |
+
string5
|
| 199 |
+
<script>ignored</script>
|
| 200 |
+
string6
|
| 201 |
+
<style>ignored</style>
|
| 202 |
+
string7
|
| 203 |
+
</main>
|
| 204 |
+
</body>
|
| 205 |
+
</html>
|
| 206 |
+
"""
|
| 207 |
+
|
| 208 |
+
page = Selector(html, adaptive=False)
|
| 209 |
+
node = page.css("main")[0]
|
| 210 |
+
|
| 211 |
+
assert node.get_all_text("\n", strip=True) == "string1\nstring2\nstring3\nstring4\nstring5\nstring6\nstring7"
|
| 212 |
+
|
| 213 |
|
| 214 |
class TestTextHandlerAdvanced:
|
| 215 |
"""Test advanced TextHandler functionality"""
|
tests/requirements.txt
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
pytest>=2.8.0,<9
|
| 2 |
pytest-cov
|
| 3 |
-
playwright==1.
|
| 4 |
werkzeug<3.0.0
|
| 5 |
pytest-httpbin==2.1.0
|
| 6 |
pytest-asyncio
|
|
|
|
| 1 |
pytest>=2.8.0,<9
|
| 2 |
pytest-cov
|
| 3 |
+
playwright==1.58.0
|
| 4 |
werkzeug<3.0.0
|
| 5 |
pytest-httpbin==2.1.0
|
| 6 |
pytest-asyncio
|
tox.ini
CHANGED
|
@@ -10,8 +10,8 @@ envlist = pre-commit,py{310,311,312,313}
|
|
| 10 |
usedevelop = True
|
| 11 |
changedir = tests
|
| 12 |
deps =
|
| 13 |
-
playwright==1.
|
| 14 |
-
patchright==1.
|
| 15 |
-r{toxinidir}/tests/requirements.txt
|
| 16 |
extras = ai,shell
|
| 17 |
commands =
|
|
|
|
| 10 |
usedevelop = True
|
| 11 |
changedir = tests
|
| 12 |
deps =
|
| 13 |
+
playwright==1.58.0
|
| 14 |
+
patchright==1.58.2
|
| 15 |
-r{toxinidir}/tests/requirements.txt
|
| 16 |
extras = ai,shell
|
| 17 |
commands =
|