Karim shoair committed on
Commit 61dec8a · Parent(s): ccd54cb
docs: update the website to reflect the google referer logic
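The referer precedence that the updated docs describe can be sketched roughly as follows. This is an illustrative sketch only, not Scrapling's actual implementation: the helper name `resolve_headers` and its signature are hypothetical, and only the `google_search` and `extra_headers` argument names and the `https://www.google.com/` value come from the documentation text changed in this commit.

```python
from typing import Dict, Optional

# Referer value quoted in the updated documentation notes.
GOOGLE_REFERER = "https://www.google.com/"


def resolve_headers(
    extra_headers: Optional[Dict[str, str]] = None,
    google_search: bool = True,
) -> Dict[str, str]:
    """Hypothetical helper: merge user headers with the Google referer.

    Mirrors the documented rule that the referer set by `google_search`
    takes priority over a referer passed through `extra_headers`.
    """
    headers = dict(extra_headers or {})
    if google_search:
        # Drop any user-supplied referer, whatever its key casing,
        # then set the Google referer in its place.
        headers = {k: v for k, v in headers.items() if k.lower() != "referer"}
        headers["referer"] = GOOGLE_REFERER
    return headers


# The user-supplied referer is overridden while google_search is enabled.
print(resolve_headers({"Referer": "https://example.com/"}))
# With google_search disabled, the user-supplied referer is kept.
print(resolve_headers({"Referer": "https://example.com/"}, google_search=False))
```

Under this reading, keeping a custom referer would mean disabling `google_search` explicitly.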
agent-skill/Scrapling-Skill/references/fetching/stealthy.md
CHANGED
@@ -21,8 +21,7 @@ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynami
 3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
 4. It generates canvas noise to prevent fingerprinting through canvas.
 5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
-6.
-7. and other anti-protection options...
+6. and other anti-protection options...
 
 ## Full list of arguments
 Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
docs/fetching/dynamic.md
CHANGED
@@ -76,8 +76,8 @@ Scrapling provides many options with this fetcher and its session classes. To ma
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
 | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
 | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
-| google_search | Enabled by default, Scrapling will set
-| extra_headers | A dictionary of extra headers to add to the request. _The referer set by
+| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
+| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
 | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
 | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
 | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
@@ -97,7 +97,7 @@ In session classes, all these arguments can be set globally for the session. Sti
 !!! note "Notes:"
 
 1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
-2. The `google_search` argument is enabled by default for all requests,
+2. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
 3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
 
docs/fetching/static.md
CHANGED
@@ -20,7 +20,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
 All methods for making requests here share some arguments, so let's discuss them first.
 
 - **url**: The targeted URL
-- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets
+- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
 - **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**
 - **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**.
 - **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
docs/fetching/stealthy.md
CHANGED
@@ -32,8 +32,7 @@ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynami
 3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
 4. It generates canvas noise to prevent fingerprinting through canvas.
 5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
-6.
-7. and other anti-protection options...
+6. and other anti-protection options...
 
 ## Full list of arguments
 Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
@@ -54,8 +53,8 @@ Scrapling provides many options with this fetcher and its session classes. Befor
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
 | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
 | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
-| google_search | Enabled by default, Scrapling will set
-| extra_headers | A dictionary of extra headers to add to the request. _The referer set by
+| google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
+| extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
 | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
 | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
 | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
@@ -80,7 +79,7 @@ In session classes, all these arguments can be set globally for the session. Sti
 
 1. It's basically the same arguments as [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
 2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
-3. The `google_search` argument is enabled by default for all requests,
+3. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
 
 ## Examples
docs/overview.md
CHANGED
@@ -280,7 +280,7 @@ For Async requests, you will replace the import like below:
 
 !!! note "Notes:"
 
-1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a
+1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a Google referer header. It's enabled by default.
 2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
 3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic
 
@@ -320,8 +320,7 @@ Some of the things it does:
 3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
 4. It generates canvas noise to prevent fingerprinting through canvas.
 5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
-6.
-7. and other anti-protection options...
+6. and other anti-protection options...
 
 ```python
 >>> from scrapling.fetchers import StealthyFetcher