Karim shoair committed
Commit 61dec8a · 1 Parent(s): ccd54cb

docs: update the website to reflect the google referer logic

agent-skill/Scrapling-Skill/references/fetching/stealthy.md CHANGED
@@ -21,8 +21,7 @@ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynami
  3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
  4. It generates canvas noise to prevent fingerprinting through canvas.
  5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
- 6. It makes requests look as if they came from Google's search page of the requested website.
- 7. and other anti-protection options...
+ 6. and other anti-protection options...
 
  ## Full list of arguments
  Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
docs/fetching/dynamic.md CHANGED
@@ -76,8 +76,8 @@ Scrapling provides many options with this fetcher and its session classes. To ma
  | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
  | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
  | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+ | google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
  | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
  | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
  | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
@@ -97,7 +97,7 @@ In session classes, all these arguments can be set globally for the session. Sti
  !!! note "Notes:"
 
  1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
- 2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+ 2. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
  3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
  4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
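The referer change this commit documents can be contrasted with a small sketch of the two schemes. These helpers are hypothetical, written only to illustrate the old versus new behavior described in the docs, and are not Scrapling's actual code:

```python
from urllib.parse import urlparse

def old_google_referer(url: str) -> str:
    # Old documented behavior: pretend the visit came from a Google search
    # for the site's domain name (e.g. "example" for example.com).
    domain = urlparse(url).netloc.removeprefix("www.").split(".")[0]
    return f"https://www.google.com/search?q={domain}"

def new_google_referer(url: str) -> str:
    # New documented behavior: a plain Google referer, regardless of the URL.
    return "https://www.google.com/"

print(old_google_referer("https://example.com"))  # https://www.google.com/search?q=example
print(new_google_referer("https://example.com"))  # https://www.google.com/
```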
docs/fetching/static.md CHANGED
@@ -20,7 +20,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
  All methods for making requests here share some arguments, so let's discuss them first.
 
  - **url**: The targeted URL
- - **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets the referer header as if this request came from a Google search of the URL's domain.
+ - **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
  - **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**
  - **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**.
  - **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
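The `timeout` and `retries` defaults above can be pictured as a minimal retry loop. This is a hedged sketch of the described behavior, not Scrapling's implementation; `fetch` stands in for any callable that performs the request:

```python
def fetch_with_retries(fetch, url, retries=3, timeout=30):
    # Sketch of the documented defaults: up to three attempts, each given
    # 30 seconds, re-raising the last error if every attempt fails.
    last_error = None
    for _ in range(retries):
        try:
            return fetch(url, timeout=timeout)
        except Exception as exc:
            last_error = exc
    raise last_error

# A fake fetcher that fails twice, then succeeds on the third attempt:
attempts = []
def flaky_fetch(url, timeout):
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("transient failure")
    return f"200 OK from {url}"

print(fetch_with_retries(flaky_fetch, "https://example.com"))  # 200 OK from https://example.com
```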
docs/fetching/stealthy.md CHANGED
@@ -32,8 +32,7 @@ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynami
  3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
  4. It generates canvas noise to prevent fingerprinting through canvas.
  5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
- 6. It makes requests look as if they came from Google's search page of the requested website.
- 7. and other anti-protection options...
+ 6. and other anti-protection options...
 
  ## Full list of arguments
  Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
@@ -54,8 +53,8 @@ Scrapling provides many options with this fetcher and its session classes. Befor
  | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
  | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
  | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+ | google_search | Enabled by default, Scrapling will set a Google referer header. | ✔️ |
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._ | ✔️ |
  | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
  | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
  | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
@@ -80,7 +79,7 @@ In session classes, all these arguments can be set globally for the session. Sti
 
  1. It's basically the same arguments as [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
  2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
- 3. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+ 3. The `google_search` argument is enabled by default for all requests, setting the referer to `https://www.google.com/`. If used together with `extra_headers`, it takes priority over the referer set there.
  4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
 
  ## Examples
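The documented precedence rule (the `google_search` referer wins over one supplied via `extra_headers`) can be sketched as a simple header merge. This helper is hypothetical and only illustrates the documented ordering, not Scrapling's internals:

```python
def merge_request_headers(extra_headers=None, google_search=True):
    # Start from the caller's extra headers, then let the google_search
    # referer overwrite any Referer they provided, per the documented rule.
    headers = dict(extra_headers or {})
    if google_search:
        headers["Referer"] = "https://www.google.com/"
    return headers

print(merge_request_headers({"Referer": "https://mysite.com/"}))
# {'Referer': 'https://www.google.com/'}
print(merge_request_headers({"Referer": "https://mysite.com/"}, google_search=False))
# {'Referer': 'https://mysite.com/'}
```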
docs/overview.md CHANGED
@@ -280,7 +280,7 @@ For Async requests, you will replace the import like below:
 
  !!! note "Notes:"
 
- 1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a referer header, as if this request came from a Google search of this domain. It's enabled by default.
+ 1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a Google referer header. It's enabled by default.
  2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
  3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic
 
@@ -320,8 +320,7 @@ Some of the things it does:
  3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do.
  4. It generates canvas noise to prevent fingerprinting through canvas.
  5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks.
- 6. It makes requests look as if they came from Google's search page of the requested website.
- 7. and other anti-protection options...
+ 6. and other anti-protection options...
 
  ```python
  >>> from scrapling.fetchers import StealthyFetcher