Karim Shoair committed
Commit 1c41516 · Parent(s): be68867
docs: Update overview page
docs/overview.md (+37 −42)
We will start by quickly reviewing the parsing capabilities. Then, we will fetch websites with custom browsers, make requests, and parse the response.

Here's an HTML document generated by ChatGPT that we will be using as an example throughout this page:

```html
<html>
<head>
...
```

Starting with loading the raw HTML above like this:

```python
from scrapling.parser import Selector

page = Selector(html_doc)
page  # <data='<html><head><title>Complex Web Page</tit...'>
```

Get all the text content on the page recursively:
...

```python
...
section_elements = page.find_all('section', {'id':"products"})
section_elements = page.find_all('section', id="products")
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
```

Find all `section` elements whose `id` attribute value contains `product`:

```python
section_elements = page.find_all('section', {'id*':"product"})
```
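For intuition, CSS-style attribute operators such as `*` boil down to simple string predicates. Here's a minimal stdlib sketch of the idea; the operator table below is illustrative, not Scrapling's actual implementation:

```python
# Illustrative mapping of CSS-style attribute operators to string checks;
# Scrapling's real matching logic may differ.
ATTR_OPERATORS = {
    "":  lambda value, target: value == target,           # exact match:  {'id': ...}
    "*": lambda value, target: target in value,           # contains:     {'id*': ...}
    "^": lambda value, target: value.startswith(target),  # starts with:  {'id^': ...}
    "$": lambda value, target: value.endswith(target),    # ends with:    {'id$': ...}
}

def attr_matches(key: str, target: str, attributes: dict) -> bool:
    """Check one {'id*': 'product'}-style filter against an element's attributes."""
    op = key[-1] if key[-1] in "*^$" else ""
    name = key[:-1] if op else key
    value = attributes.get(name)
    return value is not None and ATTR_OPERATORS[op](value, target)

print(attr_matches("id*", "product", {"id": "products"}))  # True
print(attr_matches("id", "product", {"id": "products"}))   # False
```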
Find all `h3` elements whose text content matches this regex `Product \d`:

```python
page.find_all('h3', re.compile(r'Product \d'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```

Find all `h3` and `h2` elements whose text content matches the regex `Product`:

```python
page.find_all(['h3', 'h2'], re.compile(r'Product'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
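The difference between the two queries above comes down to plain `re` search semantics: `Product \d` requires a digit after the word, while `Product` alone also matches inside `Products`. A quick stdlib check:

```python
import re

texts = ["Product 1", "Product 2", "Product 3", "Products"]

# `Product \d` requires a space and a digit after the word.
with_digit = [t for t in texts if re.search(r"Product \d", t)]
# Bare `Product` matches anywhere in the text, including inside "Products".
any_product = [t for t in texts if re.search(r"Product", t)]

print(with_digit)   # ['Product 1', 'Product 2', 'Product 3']
print(any_product)  # ['Product 1', 'Product 2', 'Product 3', 'Products']
```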
Find all elements whose text content matches exactly `Products` (whitespace is not taken into consideration):

```python
page.find_by_text('Products', first_match=False)
# [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
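One way to picture a whitespace-insensitive exact match: normalize whitespace on both sides before comparing. This is just a sketch of the behavior, not Scrapling's code:

```python
def text_equals(element_text: str, target: str) -> bool:
    # Collapse runs of whitespace and strip the ends before the exact
    # comparison, similar in spirit to how find_by_text treats whitespace.
    return " ".join(element_text.split()) == " ".join(target.split())

print(text_equals("  Products \n", "Products"))  # True
print(text_equals("Product 1", "Products"))      # False
```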
...

Using the elements we found above:

```python
>>> page.css_first('[data-id="1"]').has_class('product')
True
```

If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element, like below:

```python
for ancestor in quote.iterancestors():
    # do something with it...
```

You can search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes a `Selector` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:

```python
>>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
<data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
```
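The predicate pattern generalizes to any tree: walk the parent chain upward and return the first node the function accepts. A self-contained sketch with a toy node class standing in for Scrapling's `Selector`:

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent

    def find_ancestor(self, predicate):
        # Walk up the parent chain, returning the first ancestor that
        # satisfies the predicate, or None if no ancestor matches.
        current = self.parent
        while current is not None:
            if predicate(current):
                return current
            current = current.parent
        return None

html = Node("html")
body = Node("body", parent=html)
section = Node("section", parent=body)

found = section.find_ancestor(lambda node: node.name == "body")
print(found.name)  # body
```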
...

A fetcher is made for every use case.

### HTTP Requests

For simple HTTP requests, there's a `Fetcher` class that can be imported and used as below:

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://httpbin.org/get', impersonate="chrome")
```
With that out of the way, here's how to do all HTTP methods:

```python
...
>>> page = Fetcher.put('https://httpbin.org/put', data={'key': 'value'})
>>> page = Fetcher.delete('https://httpbin.org/delete')
```

For async requests, you just replace the import, like below:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
...
>>> page = await AsyncFetcher.delete('https://httpbin.org/delete')
```
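Because these methods are coroutines, they compose with the standard `asyncio` tooling; for example, several pages could be fetched concurrently with `asyncio.gather`. A runnable sketch with a stand-in coroutine where an `AsyncFetcher.get` call would go:

```python
import asyncio

async def fake_get(url: str) -> str:
    # Stand-in for `await AsyncFetcher.get(url)`; sleeps instead of doing I/O.
    await asyncio.sleep(0.01)
    return f"page for {url}"

async def main():
    urls = ["https://httpbin.org/get", "https://httpbin.org/ip"]
    # gather() awaits the requests concurrently instead of one after another,
    # returning results in the same order as the input coroutines.
    return await asyncio.gather(*(fake_get(u) for u in urls))

pages = asyncio.run(main())
print(pages)  # ['page for https://httpbin.org/get', 'page for https://httpbin.org/ip']
```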
> Notes:
>
> 1. The `stealthy_headers` argument, when enabled, generates real browser headers and uses them for the request, including a referer header set as if the request came from a Google search for this domain. It's enabled by default.
> 2. The `impersonate` argument lets you fake the TLS fingerprint of a specific browser version.
> 3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, making them look more authentic.

This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md).
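To make note 1 concrete, a referer that points at a Google search for the target's domain can be derived from the URL itself. A rough stdlib sketch; Scrapling's real header generation is more elaborate:

```python
from urllib.parse import urlsplit, quote

def google_search_referer(url: str) -> str:
    # Build a referer as if the visit came from a Google search for the domain.
    domain = urlsplit(url).hostname or ""
    return f"https://www.google.com/search?q={quote(domain)}"

headers = {
    "User-Agent": "Mozilla/5.0 ...",  # a real browser UA string would go here
    "Referer": google_search_referer("https://httpbin.org/get"),
}
print(headers["Referer"])  # https://www.google.com/search?q=httpbin.org
```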
### Dynamic loading
We have you covered if you deal with dynamic websites, like most websites today!

The `DynamicFetcher` class (previously known as `PlayWrightFetcher`) provides many options to fetch/load websites' pages through browsers.

```python
>>> from scrapling.fetchers import DynamicFetcher
>>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>>> # The async version of fetch
>>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
>>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
```
It's built on top of [Playwright](https://playwright.dev/python/) and currently provides three main run options that can be mixed as you want:

- Vanilla Playwright without any modifications other than the ones you chose. It uses the Chromium browser.
- Stealthy Playwright with a custom stealth mode explicitly written for it. It's not a top-tier stealth mode, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for a more advanced stealth mode. It uses the Chromium browser.
- Real browsers, like your Chrome browser, by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the fetcher; most of the options can be enabled with it.

> Note: All requests done by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.
### Dynamic anti-protection loading
We also have you covered if you deal with dynamic websites behind annoying anti-bot protections!

The `StealthyFetcher` class uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), which bypasses most bot detections by default. Scrapling ships a faster custom build of it that includes extra tools and easy configuration options to further increase undetectability.

```python
>>> from scrapling.fetchers import StealthyFetcher
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection')  # Runs headless by default
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)  # Solves the Cloudflare captcha automatically if one is presented
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True)  # and the rest of the arguments...
>>> # The async version of fetch
>>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection')
>>> page.status == 200
True
```
> Note: All requests done by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.

---