Karim shoair committed
Commit 1c41516 · 1 Parent(s): be68867

docs: Update overview page

Files changed (1)
  1. docs/overview.md +37 -42
docs/overview.md CHANGED
@@ -1,6 +1,6 @@
1
  We will start by quickly reviewing the parsing capabilities. Then, we will fetch websites with custom browsers, make requests, and parse the response.
2
 
3
- Here's an HTML document generated by ChatGPT we will be using as an example throughout this page:
4
  ```html
5
  <html>
6
  <head>
@@ -71,8 +71,8 @@ Here's an HTML document generated by ChatGPT we will be using as an example thro
71
  ```
72
  Start by loading the raw HTML above like this
73
  ```python
74
- from scrapling.parser import Adaptor
75
- page = Adaptor(html_doc)
76
  page # <data='<html><head><title>Complex Web Page</tit...'>
77
  ```
78
  Get all text content on the page recursively
@@ -101,7 +101,7 @@ section_elements = page.find_all('section', {'id':"products"})
101
  section_elements = page.find_all('section', id="products")
102
  # [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
103
  ```
104
- Find all `section` elements that its `id` attribute value contains `product`
105
  ```python
106
  section_elements = page.find_all('section', {'id*':"product"})
107
  ```
@@ -110,12 +110,12 @@ Find all `h3` elements whose text content matches this regex `Product \d`
110
  page.find_all('h3', re.compile(r'Product \d'))
111
  # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
112
  ```
113
- Find all `h3` and `h2` elements whose text content matches regex `Product` only
114
  ```python
115
  page.find_all(['h3', 'h2'], re.compile(r'Product'))
116
  # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
117
  ```
118
- Find all elements that its text content matches exactly `Products` (Whitespaces are not taken into consideration)
119
  ```python
120
  page.find_by_text('Products', first_match=False)
121
  # [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
@@ -225,12 +225,12 @@ Using the elements we found above
225
  >>> page.css_first('[data-id="1"]').has_class('product')
226
  True
227
  ```
228
- If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element like the one below
229
  ```python
230
  for ancestor in quote.iterancestors():
231
  # do something with it...
232
  ```
233
- You can search for a specific ancestor of an element that satisfies a function; all you need to do is to pass a function that takes an `Adaptor` object as an argument and return `True` if the condition satisfies or `False` otherwise like below:
234
  ```python
235
  >>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
236
  <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
@@ -242,22 +242,10 @@ Instead of passing the raw HTML to Scrapling, you can get a website's response d
242
  A fetcher is made for every use case.
243
 
244
  ### HTTP Requests
245
- For simple HTTP requests, there's a `Fetcher` class that can be imported as below:
246
  ```python
247
  from scrapling.fetchers import Fetcher
248
- ```
249
- But that's class, so you will need to create an instance of the Fetcher first like this:
250
- ```python
251
- from scrapling.fetchers import Fetcher
252
- fetcher = Fetcher()
253
- page = fetcher.get('https://httpbin.org/get')
254
- ```
255
- This is intended, and you will find it with all fetchers because there are settings you can pass to `Fetcher()` initialization, but more on this later.
256
-
257
- If you are going to use the default settings anyway, you can do this instead for a cleaner approach:
258
- ```python
259
- from scrapling.fetchers import Fetcher
260
- page = Fetcher.get('https://httpbin.org/get')
261
  ```
262
  With that out of the way, here's how to do all HTTP methods:
263
  ```python
@@ -267,7 +255,7 @@ With that out of the way, here's how to do all HTTP methods:
267
  >>> page = Fetcher.put('https://httpbin.org/put', data={'key': 'value'})
268
  >>> page = Fetcher.delete('https://httpbin.org/delete')
269
  ```
270
- For Async requests, you will just replace the import like below:
271
  ```python
272
  >>> from scrapling.fetchers import AsyncFetcher
273
  >>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
@@ -276,52 +264,59 @@ For Async requests, you will just replace the import like below:
276
  >>> page = await AsyncFetcher.delete('https://httpbin.org/delete')
277
  ```
278
 
279
- > Note: You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a referer header, as if this request came from Google's search of this URL's domain. It's enabled by default.
280
 
281
- This is just the tip of this fetcher; check the full page from [here](fetching/static.md)
282
 
283
  ### Dynamic loading
284
  We have you covered if you deal with dynamic websites, like most websites today!
285
 
286
- The `PlayWrightFetcher` class provides many options to fetch/load websites' pages through browsers.
287
  ```python
288
- >>> from scrapling.fetchers import PlayWrightFetcher
289
- >>> page = PlayWrightFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
290
  >>> page.css_first("#search a::attr(href)")
291
  'https://github.com/D4Vinci/Scrapling'
292
  >>> # The async version of fetch
293
- >>> page = await PlayWrightFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
294
  >>> page.css_first("#search a::attr(href)")
295
  'https://github.com/D4Vinci/Scrapling'
296
  ```
297
- It's named like that because it's built on top of [Playwright](https://playwright.dev/python/), and it currently provides 4 main run options that can be mixed as you want:
298
 
299
- - Vanilla Playwright without any modifications other than the ones you chose.
300
- - Stealthy Playwright with custom stealth mode explicitly written for it. It's not top-tier stealth mode but bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for more advanced stealth mode.
301
- - Real browsers by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher, and most of the options can be enabled on it.
302
- - [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
303
 
304
- > Note: All requests done by this fetcher are waited by default for all javascript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing 'network_idle=True', as you will see later.
305
 
306
- Again, this is just the tip of this fetcher. Check out the full page from [here](fetching/dynamic.md) for all details and the complete list of arguments.
307
 
308
  ### Dynamic anti-protection loading
309
  We also have you covered if you deal with dynamic websites with annoying anti-protections!
310
 
311
- The `StealthyFetcher` class uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations to further increase performance and undetectability.
312
  ```python
313
- >>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
314
  >>> page.status == 200
315
  True
316
- >>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True) # and the rest of arguments...
317
  >>> # The async version of fetch
318
- >>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')
319
  >>> page.status == 200
320
  True
321
  ```
322
- > Note: All requests done by this fetcher are waited by default for all javascript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing 'network_idle=True', as you will see later.
323
 
324
- Again, this is just the tip of this fetcher. Check out the full page from [here](fetching/dynamic.md) for all details and the complete list of arguments.
325
 
326
  ---
327
 
 
1
  We will start by quickly reviewing the parsing capabilities. Then, we will fetch websites with custom browsers, make requests, and parse the response.
2
 
3
+ Here's an HTML document generated by ChatGPT that we will be using as an example throughout this page:
4
  ```html
5
  <html>
6
  <head>
 
71
  ```
72
  Start by loading the raw HTML above like this
73
  ```python
74
+ from scrapling.parser import Selector
75
+ page = Selector(html_doc)
76
  page # <data='<html><head><title>Complex Web Page</tit...'>
77
  ```
78
  Get all text content on the page recursively
 
101
  section_elements = page.find_all('section', id="products")
102
  # [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
103
  ```
104
+ Find all `section` elements whose `id` attribute value contains `product`
105
  ```python
106
  section_elements = page.find_all('section', {'id*':"product"})
107
  ```
 
110
  page.find_all('h3', re.compile(r'Product \d'))
111
  # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
112
  ```
113
+ Find all `h3` and `h2` elements whose text content matches the regex `Product` only
114
  ```python
115
  page.find_all(['h3', 'h2'], re.compile(r'Product'))
116
  # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
117
  ```
118
+ Find all elements whose text content matches exactly `Products` (whitespace is not taken into consideration)
119
  ```python
120
  page.find_by_text('Products', first_match=False)
121
  # [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
 
225
  >>> page.css_first('[data-id="1"]').has_class('product')
226
  True
227
  ```
228
+ If you need more than the element's parent, you can iterate over the element's entire ancestor tree, like below
229
  ```python
230
  for ancestor in quote.iterancestors():
231
  # do something with it...
232
  ```
233
+ You can search for a specific ancestor of an element that satisfies a function; just pass a function that takes a `Selector` object and returns `True` if the condition is satisfied, or `False` otherwise:
234
  ```python
235
  >>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
236
  <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
 
242
  A fetcher is made for every use case.
243
 
244
  ### HTTP Requests
245
+ For simple HTTP requests, there's a `Fetcher` class that can be imported and used as below:
246
  ```python
247
  from scrapling.fetchers import Fetcher
248
+ page = Fetcher.get('https://httpbin.org/get', impersonate="chrome")
249
  ```
250
  With that out of the way, here's how to do all HTTP methods:
251
  ```python
 
255
  >>> page = Fetcher.put('https://httpbin.org/put', data={'key': 'value'})
256
  >>> page = Fetcher.delete('https://httpbin.org/delete')
257
  ```
258
+ For async requests, just replace the import like below:
259
  ```python
260
  >>> from scrapling.fetchers import AsyncFetcher
261
  >>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
 
264
  >>> page = await AsyncFetcher.delete('https://httpbin.org/delete')
265
  ```
266
 
267
+ > Notes:
268
+ >
269
+ > 1. You have the `stealthy_headers` argument which, when enabled, generates real browser headers and uses them, including a `Referer` header, as if the request came from a Google search for this domain. It's enabled by default.
270
+ > 2. The `impersonate` argument allows you to fake the TLS fingerprint of a specific browser version.
271
+ > 3. There's also the `http3` argument; when enabled, the fetcher uses HTTP/3 for its requests, which makes them look more authentic.
272
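+ Combining those notes, a single request might look like this (a minimal sketch; it assumes scrapling is installed, httpbin.org is reachable, and that these arguments can be combined freely):
+ ```python
+ from scrapling.fetchers import Fetcher
+ 
+ # `stealthy_headers` is enabled by default; `impersonate` fakes the TLS
+ # fingerprint of a recent Chrome; `http3` switches to HTTP/3 where the
+ # server supports it.
+ page = Fetcher.get(
+     'https://httpbin.org/get',
+     stealthy_headers=True,
+     impersonate='chrome',
+     http3=True,
+ )
+ print(page.status)
+ ```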
 
273
+ This is just the tip of the iceberg for this fetcher; check out the full page [here](fetching/static.md)
274
 
275
  ### Dynamic loading
276
  We have you covered if you deal with dynamic websites, like most websites today!
277
 
278
+ The `DynamicFetcher` class (previously known as `PlayWrightFetcher`) provides many options to fetch/load websites' pages through browsers.
279
  ```python
280
+ >>> from scrapling.fetchers import DynamicFetcher
281
+ >>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
282
  >>> page.css_first("#search a::attr(href)")
283
  'https://github.com/D4Vinci/Scrapling'
284
  >>> # The async version of fetch
285
+ >>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
286
  >>> page.css_first("#search a::attr(href)")
287
  'https://github.com/D4Vinci/Scrapling'
288
  ```
289
+ It's built on top of [Playwright](https://playwright.dev/python/) and currently provides three main run options that can be mixed as you like:
290
 
291
+ - Vanilla Playwright without any modifications other than the ones you chose. It uses the Chromium browser.
292
+ - Stealthy Playwright with a custom stealth mode written explicitly for it. It's not top-tier stealth, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for a more advanced stealth mode. It uses the Chromium browser.
293
+ - A real browser, like your installed Chrome, by passing the `real_chrome` argument or your browser's CDP URL for the fetcher to control; most of the options can still be enabled with it.
 
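+ For example, the real-browser option from the list above could be used like this (a sketch; the CDP URL is a hypothetical placeholder for your own browser's endpoint):
+ ```python
+ from scrapling.fetchers import DynamicFetcher
+ 
+ # Drive your locally installed Chrome instead of the bundled Chromium
+ page = DynamicFetcher.fetch('https://example.com', real_chrome=True)
+ print(page.status)
+ 
+ # Or attach to an already-running browser through its CDP endpoint
+ # (replace this placeholder URL with your browser's real endpoint)
+ page = DynamicFetcher.fetch('https://example.com', cdp_url='ws://127.0.0.1:9222')
+ print(page.status)
+ ```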
294
 
295
+ > Note: All requests made by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.
296
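+ As a quick sketch of that note (assuming Playwright's browser binaries are installed), `network_idle` is just another keyword argument to `fetch`:
+ ```python
+ from scrapling.fetchers import DynamicFetcher
+ 
+ # In addition to the default `load`/`domcontentloaded` waits, also wait
+ # until the network has gone idle before returning the page.
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/', network_idle=True)
+ print(page.status)
+ ```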
 
297
+ Again, this is just the tip of the iceberg for this fetcher. Check out the full page [here](fetching/dynamic.md) for all details and the complete list of arguments.
298
 
299
  ### Dynamic anti-protection loading
300
  We also have you covered if you deal with dynamic websites with annoying anti-protections!
301
 
302
+ The `StealthyFetcher` class uses a custom build of [Camoufox](https://github.com/daijro/camoufox), a modified Firefox browser that bypasses most bot detections by default. Scrapling's build is faster, includes extra tools, and adds easy configuration to further increase undetectability.
303
  ```python
304
+ >>> from scrapling.fetchers import StealthyFetcher
305
+ >>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection') # Running headless by default
306
+ >>> page.status == 200
307
+ True
308
+ >>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True) # Solve Cloudflare captcha automatically if presented
309
  >>> page.status == 200
310
  True
311
+ >>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True) # and the rest of arguments...
312
  >>> # The async version of fetch
313
+ >>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection')
314
  >>> page.status == 200
315
  True
316
  ```
317
+ > Note: All requests made by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.
318
 
319
+ Again, this is just the tip of the iceberg for this fetcher. Check out the full page [here](fetching/dynamic.md) for all details and the complete list of arguments.
320
 
321
  ---
322