Karim Shoair committed
Commit 1c41516 · Parent(s): be68867
docs: Update overview page
docs/overview.md (+37 −42)
We will start by quickly reviewing the parsing capabilities. Then, we will fetch websites with custom browsers, make requests, and parse the response.

Here's an HTML document generated by ChatGPT that we will be using as an example throughout this page:

```html
<html>
<head>
...
```

Starting with loading the raw HTML above like this:

```python
from scrapling.parser import Selector

page = Selector(html_doc)
page  # <data='<html><head><title>Complex Web Page</tit...'>
```

Get all the text content on the page recursively:
...

```python
...
section_elements = page.find_all('section', {'id':"products"})
section_elements = page.find_all('section', id="products")
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
```

Find all `section` elements whose `id` attribute value contains `product`:

```python
section_elements = page.find_all('section', {'id*':"product"})
```
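For intuition, CSS-style attribute operators such as `*` boil down to simple string predicates. Here's a minimal stdlib sketch of the idea; the operator table below is illustrative, not Scrapling's actual implementation:

```python
# Illustrative mapping of CSS-style attribute operators to string checks;
# Scrapling's real matching logic may differ.
ATTR_OPERATORS = {
    "":  lambda value, target: value == target,           # exact match:  {'id': ...}
    "*": lambda value, target: target in value,           # contains:     {'id*': ...}
    "^": lambda value, target: value.startswith(target),  # starts with:  {'id^': ...}
    "$": lambda value, target: value.endswith(target),    # ends with:    {'id$': ...}
}

def attr_matches(key: str, target: str, attributes: dict) -> bool:
    """Check one {'id*': 'product'}-style filter against an element's attributes."""
    op = key[-1] if key[-1] in "*^$" else ""
    name = key[:-1] if op else key
    value = attributes.get(name)
    return value is not None and ATTR_OPERATORS[op](value, target)

print(attr_matches("id*", "product", {"id": "products"}))  # True
print(attr_matches("id", "product", {"id": "products"}))   # False
```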
Find all `h3` elements whose text content matches this regex `Product \d`:

```python
page.find_all('h3', re.compile(r'Product \d'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```

Find all `h3` and `h2` elements whose text content matches the regex `Product`:

```python
page.find_all(['h3', 'h2'], re.compile(r'Product'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
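The difference between the two queries above comes down to plain `re` search semantics: `Product \d` requires a digit after the word, while `Product` alone also matches inside `Products`. A quick stdlib check:

```python
import re

texts = ["Product 1", "Product 2", "Product 3", "Products"]

# `Product \d` requires a space and a digit after the word.
with_digit = [t for t in texts if re.search(r"Product \d", t)]
# Bare `Product` matches anywhere in the text, including inside "Products".
any_product = [t for t in texts if re.search(r"Product", t)]

print(with_digit)   # ['Product 1', 'Product 2', 'Product 3']
print(any_product)  # ['Product 1', 'Product 2', 'Product 3', 'Products']
```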
Find all elements whose text content matches exactly `Products` (whitespace is not taken into consideration):

```python
page.find_by_text('Products', first_match=False)
# [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
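One way to picture a whitespace-insensitive exact match: normalize whitespace on both sides before comparing. This is just a sketch of the behavior, not Scrapling's code:

```python
def text_equals(element_text: str, target: str) -> bool:
    # Collapse runs of whitespace and strip the ends before the exact
    # comparison, similar in spirit to how find_by_text treats whitespace.
    return " ".join(element_text.split()) == " ".join(target.split())

print(text_equals("  Products \n", "Products"))  # True
print(text_equals("Product 1", "Products"))      # False
```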
...

Using the elements we found above:

```python
>>> page.css_first('[data-id="1"]').has_class('product')
True
```

If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element, like below:

```python
for ancestor in quote.iterancestors():
    # do something with it...
```

You can search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes a `Selector` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:

```python
>>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
<data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
```
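The predicate pattern generalizes to any tree: walk the parent chain upward and return the first node the function accepts. A self-contained sketch with a toy node class standing in for Scrapling's `Selector`:

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def find_ancestor(self, predicate):
        # Walk up the parent chain, returning the first ancestor that
        # satisfies the predicate, or None if no ancestor matches.
        current = self.parent
        while current is not None:
            if predicate(current):
                return current
            current = current.parent
        return None

html = Node("html")
body = Node("body", parent=html)
section = Node("section", parent=body)

found = section.find_ancestor(lambda node: node.name == "body")
print(found.name)  # body
```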
...

A fetcher is made for every use case.

### HTTP Requests

For simple HTTP requests, there's a `Fetcher` class that can be imported and used as below:

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://httpbin.org/get', impersonate="chrome")
```
With that out of the way, here's how to do all HTTP methods:

```python
...
>>> page = Fetcher.put('https://httpbin.org/put', data={'key': 'value'})
>>> page = Fetcher.delete('https://httpbin.org/delete')
```

For async requests, you just replace the import, like below:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
...
>>> page = await AsyncFetcher.delete('https://httpbin.org/delete')
```
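Because these methods are coroutines, they compose with the standard `asyncio` tooling; for example, several pages could be fetched concurrently with `asyncio.gather`. A runnable sketch with a stand-in coroutine where an `AsyncFetcher.get` call would go:

```python
import asyncio

async def fake_get(url: str) -> str:
    # Stand-in for `await AsyncFetcher.get(url)`; sleeps instead of doing I/O.
    await asyncio.sleep(0.01)
    return f"page for {url}"

async def main():
    urls = ["https://httpbin.org/get", "https://httpbin.org/ip"]
    # gather() awaits the requests concurrently instead of one after another,
    # returning results in the same order as the input coroutines.
    return await asyncio.gather(*(fake_get(u) for u in urls))

pages = asyncio.run(main())
print(pages)  # ['page for https://httpbin.org/get', 'page for https://httpbin.org/ip']
```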
> Notes:
>
> 1. The `stealthy_headers` argument, when enabled, generates real browser headers and uses them for the request, including a referer header set as if the request came from a Google search for this domain. It's enabled by default.
> 2. The `impersonate` argument lets you fake the TLS fingerprint of a specific browser version.
> 3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, making them look more authentic.

This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md).
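To make note 1 concrete, a referer that points at a Google search for the target's domain can be derived from the URL itself. A rough stdlib sketch; Scrapling's real header generation is more elaborate:

```python
from urllib.parse import urlsplit, quote

def google_search_referer(url: str) -> str:
    # Build a referer as if the visit came from a Google search for the domain.
    domain = urlsplit(url).hostname or ""
    return f"https://www.google.com/search?q={quote(domain)}"

headers = {
    "User-Agent": "Mozilla/5.0 ...",  # a real browser UA string would go here
    "Referer": google_search_referer("https://httpbin.org/get"),
}
print(headers["Referer"])  # https://www.google.com/search?q=httpbin.org
```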
### Dynamic loading
We have you covered if you deal with dynamic websites, like most websites today!

The `DynamicFetcher` class (previously known as `PlayWrightFetcher`) provides many options to fetch/load websites' pages through browsers.

```python
>>> from scrapling.fetchers import DynamicFetcher
>>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>>> # The async version of fetch
>>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
>>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
```
It's built on top of [Playwright](https://playwright.dev/python/) and currently provides three main run options that can be mixed as you want:

- Vanilla Playwright without any modifications other than the ones you chose. It uses the Chromium browser.
- Stealthy Playwright with a custom stealth mode explicitly written for it. It's not a top-tier stealth mode, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for a more advanced stealth mode. It uses the Chromium browser.
- Real browsers, like your Chrome browser, by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the fetcher; most of the options can be enabled with it.

> Note: All requests done by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.
### Dynamic anti-protection loading
We also have you covered if you deal with dynamic websites behind annoying anti-bot protections!

The `StealthyFetcher` class uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), which bypasses most bot detections by default. Scrapling ships a faster custom build of it that includes extra tools and easy configuration options to further increase undetectability.

```python
>>> from scrapling.fetchers import StealthyFetcher
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection')  # Runs headless by default
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)  # Solves the Cloudflare captcha automatically if one is presented
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True)  # and the rest of the arguments...
>>> # The async version of fetch
>>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection')
>>> page.status == 200
True
```
> Note: All requests done by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.

---