Karim shoair committed on
Commit 75573dc · 1 Parent(s): 6bfe2e1
Corrections and rephrasing
scrapling/fetchers.py CHANGED +14 -14
@@ -13,7 +13,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP GET request for you but with some added flavors.
  :param url: Target url.
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
- :param timeout: The time to wait for the request to finish in seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request had came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details.
@@ -26,7 +26,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP POST request for you but with some added flavors.
  :param url: Target url.
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
- :param timeout: The time to wait for the request to finish in seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details.
@@ -39,7 +39,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP PUT request for you but with some added flavors.
  :param url: Target url
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
- :param timeout: The time to wait for the request to finish in seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details.
@@ -52,7 +52,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP DELETE request for you but with some added flavors.
  :param url: Target url
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
- :param timeout: The time to wait for the request to finish in seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details.
@@ -85,15 +85,15 @@ class StealthyFetcher(BaseFetcher):
  This can help save your proxy usage but be careful with this option as it makes some websites never finish loading.
  :param block_webrtc: Blocks WebRTC entirely.
  :param addons: List of Firefox addons to use. Must be paths to extracted addons.
- :param humanize: Humanize the cursor movement. Takes either True
  :param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases.
  :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
- :param timeout: The timeout in milliseconds that
- :param page_action: Added for automation. A function that takes the `page` object,
  :param wait_selector: Wait for a specific css selector to be in a specific state.
  :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
  :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name.
- :param extra_headers: A dictionary of extra headers to add to
  :return: A Response object with `url`, `text`, `content`, `status`, `reason`, `encoding`, `cookies`, `headers`, `request_headers`, and the `adaptor` class for parsing, of course.
  """
  engine = CamoufoxEngine(
@@ -122,10 +122,10 @@ class PlayWrightFetcher(BaseFetcher):
  Using this Fetcher class, you can do requests with:
  - Vanilla Playwright without any modifications other than the ones you chose.
  - Stealthy Playwright with the stealth mode I wrote for it. It's still a work in progress but it bypasses many online tests like bot.sannysoft.com
- Some of the things stealth mode
  1) Patches the CDP runtime fingerprint.
- 2) Mimics some of real browsers' properties by
- 3) Using custom flags on launch to hide
  4) Generates real browser's headers of the same type and same user OS then append it to the request.
  - Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
  - NSTBrowser's docker browserless option by passing the CDP URL and enabling `nstbrowser_mode` option.
@@ -148,15 +148,15 @@ class PlayWrightFetcher(BaseFetcher):
  This can help save your proxy usage but be careful with this option as it makes some websites never finish loading.
  :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
  :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
- :param timeout: The timeout in milliseconds that
- :param page_action: Added for automation. A function that takes the `page` object,
  :param wait_selector: Wait for a specific css selector to be in a specific state.
  :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
  :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
  :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
  :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
  :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name.
- :param extra_headers: A dictionary of extra headers to add to
  :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.
  :param nstbrowser_mode: Enables NSTBrowser mode, it have to be used with `cdp_url` argument or it will get completely ignored.
  :param nstbrowser_config: The config you want to send with requests to the NSTBrowser. If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config.
@@ -13,7 +13,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP GET request for you but with some added flavors.
  :param url: Target url.
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
+ :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request had came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.get()` function so check httpx documentation for details.
@@ -26,7 +26,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP POST request for you but with some added flavors.
  :param url: Target url.
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
+ :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.post()` function so check httpx documentation for details.
@@ -39,7 +39,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP PUT request for you but with some added flavors.
  :param url: Target url
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
+ :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.put()` function so check httpx documentation for details.
@@ -52,7 +52,7 @@ class Fetcher(BaseFetcher):
  """Make basic HTTP DELETE request for you but with some added flavors.
  :param url: Target url
  :param follow_redirects: As the name says -- if enabled (default), redirects will be followed.
+ :param timeout: The time to wait for the request to finish in seconds. The default is 10 seconds.
  :param stealthy_headers: If enabled (default), Fetcher will create and add real browser's headers and
  create a referer header as if this request came from Google's search of this URL's domain.
  :param kwargs: Any additional keyword arguments are passed directly to `httpx.delete()` function so check httpx documentation for details.
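The `Fetcher` docstrings above describe a `timeout` in seconds (default 10), a `follow_redirects` flag that is enabled by default, and extra keyword arguments that are forwarded to the matching `httpx` function. A minimal sketch of how such defaults might be combined with caller-supplied arguments follows; `build_request_kwargs` is a hypothetical helper written for illustration and is not part of Scrapling.

```python
def build_request_kwargs(timeout=10, follow_redirects=True, **kwargs):
    """Hypothetical helper: merge the documented defaults with extra kwargs.

    `timeout` is in seconds, matching the Fetcher docstrings (default 10).
    The returned dict is shaped so it could be unpacked into httpx.get(),
    which accepts `timeout` and `follow_redirects` keyword arguments.
    """
    if timeout is not None and timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    merged = {"timeout": timeout, "follow_redirects": follow_redirects}
    # Everything else (headers, params, cookies, ...) is passed through untouched.
    merged.update(kwargs)
    return merged


# Documented defaults only.
defaults = build_request_kwargs()

# A shorter timeout plus one pass-through argument.
custom = build_request_kwargs(timeout=2.5, headers={"X-Test": "1"})
```

With `httpx` installed, the result could be used as `httpx.get(url, **custom)`; that call is left out here so the sketch runs without network access.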
@@ -85,15 +85,15 @@ class StealthyFetcher(BaseFetcher):
  This can help save your proxy usage but be careful with this option as it makes some websites never finish loading.
  :param block_webrtc: Blocks WebRTC entirely.
  :param addons: List of Firefox addons to use. Must be paths to extracted addons.
+ :param humanize: Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window.
  :param allow_webgl: Whether to allow WebGL. To prevent leaks, only use this for special cases.
  :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
+ :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
+ :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
  :param wait_selector: Wait for a specific css selector to be in a specific state.
  :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
  :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name.
+ :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
  :return: A Response object with `url`, `text`, `content`, `status`, `reason`, `encoding`, `cookies`, `headers`, `request_headers`, and the `adaptor` class for parsing, of course.
  """
  engine = CamoufoxEngine(
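The `page_action` contract in the docstring above (take the `page` object, do the automation, return `page` again) can be sketched as follows. `scroll_to_bottom` and `RecordingPage` are hypothetical names introduced for this example; `RecordingPage` is only a stand-in so the sketch runs without a browser, while a real call would receive a Playwright page, whose `evaluate()` method takes a JavaScript expression.

```python
class RecordingPage:
    """Stand-in for a Playwright page; it only records evaluated scripts."""

    def __init__(self):
        self.scripts = []

    def evaluate(self, expression):
        # A real page would run the expression in the browser.
        self.scripts.append(expression)


def scroll_to_bottom(page):
    """Example page_action: scroll down, then hand the same page back."""
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    # The docstring requires the action to return the page object again.
    return page


page = RecordingPage()
returned = scroll_to_bottom(page)
```

Returning the same object lets the fetcher continue with its own waits (for example `wait_selector`) after the custom automation has run.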
@@ -122,10 +122,10 @@ class PlayWrightFetcher(BaseFetcher):
  Using this Fetcher class, you can do requests with:
  - Vanilla Playwright without any modifications other than the ones you chose.
  - Stealthy Playwright with the stealth mode I wrote for it. It's still a work in progress but it bypasses many online tests like bot.sannysoft.com
+ Some of the things stealth mode does include:
  1) Patches the CDP runtime fingerprint.
+ 2) Mimics some of the real browsers' properties by injecting several JS files and using custom options.
+ 3) Using custom flags on launch to hide Playwright even more and make it faster.
  4) Generates real browser's headers of the same type and same user OS then append it to the request.
  - Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
  - NSTBrowser's docker browserless option by passing the CDP URL and enabling `nstbrowser_mode` option.
@@ -148,15 +148,15 @@ class PlayWrightFetcher(BaseFetcher):
  This can help save your proxy usage but be careful with this option as it makes some websites never finish loading.
  :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
  :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
+ :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000
+ :param page_action: Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again.
  :param wait_selector: Wait for a specific css selector to be in a specific state.
  :param wait_selector_state: The state to wait for the selector given with `wait_selector`. Default state is `attached`.
  :param stealth: Enables stealth mode, check the documentation to see what stealth mode does currently.
  :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
  :param disable_webgl: Disables WebGL and WebGL 2.0 support entirely.
  :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name.
+ :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
  :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP.
  :param nstbrowser_mode: Enables NSTBrowser mode, it have to be used with `cdp_url` argument or it will get completely ignored.
  :param nstbrowser_config: The config you want to send with requests to the NSTBrowser. If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config.
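The note on `extra_headers` says that the referer set by `google_search` takes priority over a referer supplied in `extra_headers`. One way that rule could look in code is sketched below. `resolve_headers` is a hypothetical function, and the exact referer URL format is an assumption made for illustration; the diff does not show how Scrapling builds it.

```python
from urllib.parse import urlparse


def resolve_headers(url, extra_headers=None, google_search=True):
    """Hypothetical merge of extra headers with a Google-search referer."""
    headers = dict(extra_headers or {})
    if google_search:
        # Header names are case-insensitive, so drop any caller-supplied
        # referer regardless of how it was capitalised.
        headers = {k: v for k, v in headers.items() if k.lower() != "referer"}
        domain = urlparse(url).hostname or ""
        # Assumed format: a Google search for the target's domain name.
        headers["referer"] = f"https://www.google.com/search?q={domain}"
    return headers


custom = {"Referer": "https://other.test/", "X-Token": "abc"}

# google_search enabled (the documented default): the custom referer is replaced.
with_google = resolve_headers("https://example.com/page", custom)

# google_search disabled: the caller's headers are kept unchanged.
without_google = resolve_headers("https://example.com/page", custom, google_search=False)
```

Under these assumptions, a caller who needs a specific referer would pass `google_search=False` alongside `extra_headers`.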