Karim Shoair committed on
Commit
baf9852
·
1 Parent(s): 75573dc

Corrections and rephrasing

Files changed (1)
  1. README.md +50 -50
README.md CHANGED
@@ -145,7 +145,7 @@ As you see, Scrapling is on par with Scrapy and slightly faster than Lxml which
145
  | Scrapling | 2.51 | 1.0x |
146
  | AutoScraper | 11.41 | 4.546x |
147
 
148
- Scrapling can find elements with more methods, and it returns full element `Adaptor` objects, not only the text like AutoScraper. So, to make this test fair, both libraries will extract an element by its text, find similar elements, and then extract the text content of all of them. As you can see, Scrapling is still 4.5 times faster at the same task.
149
 
150
  > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
151
 
@@ -154,7 +154,7 @@ Scrapling is a breeze to get started with - Starting from version 0.2, we requir
154
  ```bash
155
  pip3 install scrapling
156
  ```
157
- - For using the `StealthyFetcher`, go to the command line and download the browser with
158
  <details><summary>Windows OS</summary>
159
 
160
  ```bash
@@ -185,21 +185,21 @@ On a fresh installation of Linux, you may also need the following Firefox depend
185
 
186
  <small> See the official <a href="https://camoufox.com/python/installation/#download-the-browser">Camoufox documentation</a> for more info on installation</small>
187
 
188
- - If you are going to use the `PlayWrightFetcher` options, then install Playwright's Chromium browser with:
189
  ```commandline
190
  playwright install chromium
191
  ```
192
- - If you are going to use normal requests only with the `Fetcher` class, then update the fingerprint files with:
193
  ```commandline
194
  python -m browserforge update
195
  ```
196
 
197
  ## Fetching Websites Features
198
- All fetcher-type classes are imported in the same way
199
  ```python
200
  from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
201
  ```
202
- And all of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug`, which are the same ones you give to the `Adaptor` class.
203
  > [!NOTE]
204
  > The `auto_match` argument is enabled by default; it is the one you should care about the most, as you will see later.
205
  ### Fetcher
@@ -213,7 +213,7 @@ For all methods, you have `stealth_headers` which makes `Fetcher` create and use
213
  >> page = Fetcher().delete('https://httpbin.org/delete')
214
  ```
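To make the `stealth_headers` idea concrete, here is a minimal stdlib-only sketch of what generating browser-like headers can look like. This is illustrative only, not Scrapling's implementation; the version list and header set are assumptions.

```python
# Hypothetical sketch of assembling browser-like request headers, similar in
# spirit to what stealth_headers=True does (NOT Scrapling's actual code).
import random

CHROME_VERSIONS = [120, 121, 122]  # example version pool, an assumption

def fake_browser_headers(host: str) -> dict:
    version = random.choice(CHROME_VERSIONS)
    return {
        "Host": host,
        "User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      f"AppleWebKit/537.36 (KHTML, like Gecko) "
                      f"Chrome/{version}.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
    }

headers = fake_browser_headers("httpbin.org")
```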
215
  ### StealthyFetcher
216
- This class is built on top of [Camoufox](https://github.com/daijro/camoufox), which by default bypasses most anti-bot protections. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
217
  ```python
218
  >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
219
  >> page.status == 200
@@ -228,14 +228,14 @@ True
228
  | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
229
  | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
230
  | google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
231
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
232
  | block_webrtc | Blocks WebRTC entirely. | ✔️ |
233
- | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
234
  | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
235
- | humanize | Humanize the cursor movement. Takes either `True` or the maximum duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
236
  | allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
237
  | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
238
- | timeout | The timeout in milliseconds that is used in all operations and waits throughout the page. The default is 30000. | ✔️ |
239
  | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
240
  | wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
241
 
@@ -244,18 +244,18 @@ True
244
  This list isn't final, so expect a lot more additions and flexibility to be added in the next versions!
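To make the `page_action` contract from the table above concrete, here is a tiny sketch. The `FakePage` class and the `#accept-cookies` selector are stand-ins for illustration, not Scrapling's API.

```python
# Sketch of a page_action callback: it receives the page object, performs the
# automation, and must return the page object again.
class FakePage:
    # Stand-in for the real browser page object, for illustration only.
    def __init__(self):
        self.clicked = []

    def click(self, selector):
        self.clicked.append(selector)

def accept_cookies(page):
    page.click('#accept-cookies')  # hypothetical consent-banner selector
    return page                    # page_action must hand the page back

# In real use it would be passed as, e.g.:
#   StealthyFetcher().fetch(url, page_action=accept_cookies)
demo = accept_cookies(FakePage())
```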
245
 
246
  ### PlayWrightFetcher
247
- This class is built on top of [Playwright](https://playwright.dev/python/), which currently provides four main run options, but they can be mixed as you want.
248
  ```python
249
  >> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
250
  >> page.adaptor.css_first("#search a::attr(href)")
251
  'https://github.com/D4Vinci/Scrapling'
252
  ```
253
- Using this Fetcher class, you can make requests with:
254
  1) Vanilla Playwright without any modifications other than the ones you chose.
255
- 2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/).</br> Some of the things this fetcher's stealth mode does include:
256
  * Patching the CDP runtime fingerprint.
257
- * Mimicking some real browsers' properties by injecting several JS files and using custom options.
258
- * Using custom flags on launch to hide Playwright even more and make it faster.
259
  * Generating real browser headers of the same type and for the same user OS, then appending them to the request's headers.
260
  3) Real browsers, by passing the CDP URL of your browser for the Fetcher to control; most of the options can be enabled on it.
261
  4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option, by passing the CDP URL and enabling the `nstbrowser_mode` option.
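The four run options above can be sketched as a simple mode-selection function. This is illustrative only: `stealth` and `nstbrowser_mode` are argument names documented here, while `cdp_url` is an assumed name for the CDP URL argument.

```python
# Sketch of how the four run options relate (not Scrapling's actual logic).
def pick_mode(stealth=False, cdp_url=None, nstbrowser_mode=False):
    if cdp_url and nstbrowser_mode:
        return "nstbrowser"           # option 4: browserless docker via CDP
    if cdp_url:
        return "real-browser"         # option 3: control your own browser
    if stealth:
        return "stealthy-playwright"  # option 2: stealth mode
    return "vanilla"                  # option 1: plain Playwright
```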
@@ -271,12 +271,12 @@ Add that to a lot of controlling/hiding options as you will see in the arguments
271
  | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
272
  | useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
273
  | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
274
- | timeout | The timeout in milliseconds that is used in all operations and waits throughout the page. The default is 30000. | ✔️ |
275
- | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
276
  | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
277
  | wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
278
  | google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
279
- | extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
280
  | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
281
  | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
282
  | stealth | Enables stealth mode; always check the documentation to see what stealth mode currently does. | ✔️ |
@@ -343,7 +343,7 @@ You can select elements by their text content in multiple ways, here's a full ex
343
  ```python
344
  >>> page = Fetcher().get('https://books.toscrape.com/index.html').adaptor
345
 
346
- >>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
347
  <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
348
 
349
  >>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
@@ -410,7 +410,7 @@ Let's say you are scraping a page with a structure like this:
410
  </section>
411
  </div>
412
  ```
413
- And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this
414
  ```python
415
  page.css('#p1')
416
  ```
@@ -435,7 +435,7 @@ When website owners implement structural changes like
435
  </div>
436
  </div>
437
  ```
438
- The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
439
 
440
  ```python
441
  from scrapling import Adaptor
@@ -448,12 +448,12 @@ if not element: # One day website changes?
448
  ```
449
  > How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
450
 
451
- #### Real-World Scenario
452
- Let's use a real website as an example and use one of the fetchers to fetch its source. To do this, we need to find a website that will change its design/structure soon, take a copy of its source, and then wait for the website to make the change. Of course, that's nearly impossible to know unless I know the website's owner, but that would make it a staged test haha.
453
 
454
  To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverFlow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/), pretty old huh?</br>Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector :)
455
 
456
- If I want to extract the Questions button from the old design, I can use a selector like `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is too specific because it was generated by Google Chrome.
457
  Now let's test the same selector in both versions
458
  ```python
459
  >> from scrapling import Fetcher
@@ -472,9 +472,9 @@ Now let's test the same selector in both versions
472
  ... print('Scrapling found the same element in the old design and the new design!')
473
  'Scrapling found the same element in the old design and the new design!'
474
  ```
475
- Note that I used a new argument called `automatch_domain`; this is because, for Scrapling, these are two different URLs, not the same website, so it isolates their data. To tell Scrapling they are the same website, we pass the domain we want to use for saving the auto-match data for both of them, so Scrapling doesn't isolate them.
476
 
477
  In a real-world scenario, the code will be the same, except it will use the same URL for both requests, so you won't need to use the `automatch_domain` argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)
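As a toy sketch of the isolation logic described above (not Scrapling's actual storage code): saved element data can be thought of as keyed by URL, with `default` used when no URL is given, so a shared `automatch_domain` puts both pages in one bucket.

```python
# Toy model of why automatch_domain matters (illustrative only).
storage = {}

def save(identifier, properties, url=None, automatch_domain=None):
    # Data is keyed by domain if given, else by URL, else by "default".
    key = automatch_domain or url or "default"
    storage.setdefault(key, {})[identifier] = properties

# Two different URLs, same shared domain key -> one shared bucket.
save("qbtn", {"tag": "a"}, url="web.archive.org", automatch_domain="stackoverflow.com")
save("qbtn", {"tag": "a"}, url="stackoverflow.com", automatch_domain="stackoverflow.com")
```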
478
 
479
  **Notes:**
480
  1. For the two examples above, I used the `Adaptor` class once and the `Fetcher` class the second time, just to show that you can create the `Adaptor` object yourself if you have the source, or fetch the source using any `Fetcher` class and it will create the `Adaptor` object for you on the `.adaptor` property.
@@ -494,25 +494,25 @@ In real world scenario, the code will be the same expect it will use the same UR
494
  ```
495
 
496
  ### Find elements by filters
497
- Inspired by BeautifulSoup's `find_all` function, you can find elements by using the `find_all`/`find` methods. Both methods can take multiple types of filters and return all elements in the page that all these filters apply to.
498
 
499
  * To be more specific:
500
  * Any string passed is considered a tag name
501
  * Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
502
- * Any dictionary is considered a mapping of HTML element attribute names and values.
503
  * Any regex patterns passed are used as filters
504
  * Any functions passed are used as filters
505
- * Any keyword argument passed is considered an HTML element attribute with its value.
506
 
507
  So the way it works is after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
508
  <br/>It filters all elements in the current page/element in the following order:
509
 
510
  1. All elements with the passed tag name(s).
511
- 2. All elements that match all passed attribute(s).
512
- 3. All elements that match all passed regex patterns.
513
- 4. All elements that fulfill all passed function(s).
514
 
515
- Note: The filtering process always starts from the first filter it finds in the filtering order above, so if no tag name(s) are passed but attributes are, the process starts from that layer, and so on. **But the order in which you pass the arguments doesn't matter.**
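The waterfall order above can be sketched in plain Python (illustrative only, using toy dictionaries instead of real `Adaptor` elements):

```python
# Stdlib-only sketch of the waterfall filtering order:
# tag names -> attributes -> regexes -> functions (NOT Scrapling's code).
import re

elements = [
    {"tag": "div", "attrs": {"class": "quote"}, "text": "Hello world"},
    {"tag": "span", "attrs": {"class": "text"}, "text": "Hello world"},
    {"tag": "div", "attrs": {"class": "author"}, "text": "Jane"},
]

def find_all(elements, tags=None, attrs=None, patterns=None, funcs=None):
    results = elements
    if tags:      # 1. tag name(s)
        results = [e for e in results if e["tag"] in tags]
    if attrs:     # 2. attributes
        results = [e for e in results
                   if all(e["attrs"].get(k) == v for k, v in attrs.items())]
    if patterns:  # 3. regex patterns
        results = [e for e in results
                   if all(re.search(p, e["text"]) for p in patterns)]
    if funcs:     # 4. functions
        results = [e for e in results if all(f(e) for f in funcs)]
    return results

matches = find_all(elements, tags={"div"}, attrs={"class": "quote"})
```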
516
 
517
  Examples to clear any confusion :)
518
 
@@ -525,7 +525,7 @@ Examples to clear any confusion :)
525
  <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
526
  ...]
527
 
528
- # Find all div elements whose class equals `quote`.
529
  >> page.find_all('div', class_='quote')
530
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
531
  <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
@@ -537,29 +537,29 @@ Examples to clear any confusion :)
537
  <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
538
  ...]
539
 
540
- # Find all elements whose class equals `quote`.
541
  >> page.find_all({'class': 'quote'})
542
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
543
  <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
544
  ...]
545
 
546
- # Find all div elements whose class equals `quote` and that contain an element matching `.text` whose content contains the word 'world'.
547
  >> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
548
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
549
 
550
- # Find all elements that have children.
551
  >> page.find_all(lambda element: len(element.children) > 0)
552
  [<data='<html lang="en"><head><meta charset="UTF...'>,
553
  <data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
554
  <data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
555
  ...]
556
 
557
- # Find all elements that contain the word 'world' in their text content.
558
  >> page.find_all(lambda element: "world" in element.text)
559
  [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
560
  <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
561
 
562
- # Find all span elements that match the given regex
563
  >> page.find_all('span', re.compile(r'world'))
564
  [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
565
 
@@ -586,12 +586,12 @@ Here's what else you can do with Scrapling:
586
  ```
587
  - Saving and retrieving elements manually to auto-match them outside the `css` and the `xpath` methods, but you have to set the identifier yourself.
588
 
589
- - To save an element to the database:
590
  ```python
591
  >>> element = page.find_by_text('Tipping the Velvet', first_match=True)
592
  >>> page.save(element, 'my_special_element')
593
  ```
594
- - Later, when you want to retrieve it and relocate it in the page with auto-matching, it would be like this
595
  ```python
596
  >>> element_dict = page.retrieve('my_special_element')
597
  >>> page.relocate(element_dict, adaptor_type=True)
@@ -615,7 +615,7 @@ expensive_products = page.css('.product_pod').filter(
615
 
616
  - Searching results for the first one that matches a function
617
  ```python
618
- # Find the first product whose price is '54.23'
619
  page.css('.product_pod').search(
620
  lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
621
  )
@@ -636,7 +636,7 @@ page.css('.product_pod').search(
636
  case_sensitive= False, # Set the regex to ignore letters case while compiling it
637
  )
638
  ```
639
- In fact, all of these methods are methods of the underlying `TextHandler` that holds the text content, so the same can be done directly if you call the `.text` property or the equivalent selector function.
640
 
641
 
642
  - Doing operations on the text content itself includes
@@ -648,13 +648,13 @@ page.css('.product_pod').search(
648
  ```python
649
  page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
650
  ```
651
- - Sort all characters in the string as if it were a list and return the new string
652
  ```python
653
  quote.sort(reverse=False)
654
  ```
655
  > To be clear, `TextHandler` is a sub-class of Python's `str` so all normal operations/methods that work with Python strings will work with it.
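The idea is easy to sketch: a `str` subclass keeps every normal string operation while gaining extra helpers. A minimal illustration, not Scrapling's implementation:

```python
# Toy version of the TextHandler idea: a str subclass with chainable helpers.
import json
import re

class MiniTextHandler(str):
    def re_first(self, pattern):
        # Return the first match (first group if any) wrapped in the same
        # type, so calls like .re_first(...).json() can chain.
        match = re.search(pattern, self)
        if not match:
            return None
        return MiniTextHandler(match.group(1) if match.groups() else match.group(0))

    def json(self):
        return json.loads(self)

    def sort(self, reverse=False):
        return MiniTextHandler("".join(sorted(self, reverse=reverse)))

text = MiniTextHandler('var dataLayer = {"page": 1};')
payload = text.re_first(r'var dataLayer = (.+);').json()
```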
656
 
657
- - Any element's attributes are not exactly a dictionary but a read-only sub-class of [mapping](https://docs.python.org/3/glossary.html#term-mapping) called `AttributesHandler`, so it's faster; the string values returned are actually `TextHandler` objects, so all the operations above can be done on them, plus standard dictionary operations that don't modify the data, and more :)
658
  - Unlike standard dictionaries, here you can search by values too and can do partial searches. It might be handy in some cases (returns a generator of matches)
659
  ```python
660
  >>> for item in element.attrib.search_values('catalogue', partial=True):
@@ -681,7 +681,7 @@ There are a lot of deep details skipped here to make this as short as possible s
681
 
682
  Note that implementing your storage system can be complex as there are some strict rules such as inheriting from the same abstract class, following the singleton design pattern used in other classes, and more. So make sure to read the docs first.
683
 
684
- Detailed documentation of the library will need a website. I'm trying to rush creating the website, researching new ideas, and adding more features/tests/benchmarks, but time is tight with too many spinning plates between work, personal life, and working on Scrapling. You can help by using the [sponsor button](https://github.com/sponsors/D4Vinci) above :)
685
 
686
  ## ⚡ Enlightening Questions and FAQs
687
  This section addresses common questions about Scrapling; please read it before opening an issue.
@@ -696,7 +696,7 @@ This section addresses common questions about Scrapling, please read this sectio
696
  Together both are used to retrieve the element's unique properties from the database later.
697
  4. Later, when you enable the `auto_match` parameter for both the Adaptor instance and the method call, the element properties are retrieved, and Scrapling loops over all elements in the page, comparing each one's unique properties to the unique properties we already have for this element and calculating a score for each one.
698
  5. The comparison between elements is not exact but more about finding how similar these values are, so everything is taken into consideration even the values' order like the order in which the element class names were written before and the order in which the same element class names are written now.
699
- 6. The score for each element is stored in a table, and in the end, the element(s) with the highest combined similarity scores are returned.
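Steps 4-6 above can be sketched with a toy scorer. This is illustrative only; the real algorithm is more involved, but the shape is the same: score every candidate's properties against the saved ones and return the highest scorer.

```python
# Toy similarity-based matching sketch (NOT Scrapling's actual algorithm).
from difflib import SequenceMatcher

saved = {"tag": "a", "class": "btn primary", "text": "Questions"}

candidates = [
    {"tag": "a", "class": "primary btn", "text": "Questions"},
    {"tag": "a", "class": "nav-link", "text": "Tags"},
]

def score(candidate):
    # Fuzzy-compare each saved property; note the order of class names
    # affects the ratio, echoing the point about value order above.
    return sum(
        SequenceMatcher(None, str(saved[k]), str(candidate.get(k, ""))).ratio()
        for k in saved
    ) / len(saved)

best = max(candidates, key=score)
```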
700
 
701
  ### How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?
702
  Not a big problem, as it depends on your usage. The word `default` will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website for which you also didn't pass the URL parameter while initializing. The save process will overwrite the previous data, and auto-matching uses the latest saved properties only.
@@ -725,7 +725,7 @@ Pretty much yeah, almost all features you get from BeautifulSoup can be found or
725
  Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later as the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.
726
 
727
  ### Is Scrapling thread-safe?
728
- Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.
729
 
730
  ## Sponsors
731
  [![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
@@ -737,7 +737,7 @@ Please read the [contributing file](https://github.com/D4Vinci/Scrapling/blob/ma
737
 
738
  ## Disclaimer for Scrapling Project
739
  > [!CAUTION]
740
- > This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to handle data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or you stay within their allowed rules, like the `robots.txt` file, for example.
741
 
742
  ## License
743
  This work is licensed under BSD-3
 
145
  | Scrapling | 2.51 | 1.0x |
146
  | AutoScraper | 11.41 | 4.546x |
147
 
148
+ Scrapling can find elements with more methods and it returns full element `Adaptor` objects not only the text like AutoScraper. So, to make this test fair, both libraries will extract an element with text, find similar elements, and then extract the text content for all of them. As you see, Scrapling is still 4.5 times faster at the same task.
149
 
150
  > All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
151
 
 
154
  ```bash
155
  pip3 install scrapling
156
  ```
157
+ - For using the `StealthyFetcher`, go to the command line and download the browser with
158
  <details><summary>Windows OS</summary>
159
 
160
  ```bash
 
185
 
186
  <small> See the official <a href="https://camoufox.com/python/installation/#download-the-browser">Camoufox documentation</a> for more info on installation</small>
187
 
188
+ - If you are going to use the `PlayWrightFetcher` options, then install Playwright's Chromium browser with:
189
  ```commandline
190
  playwright install chromium
191
  ```
192
+ - If you are going to use normal requests only with the `Fetcher` class then update the fingerprints files with:
193
  ```commandline
194
  python -m browserforge update
195
  ```
196
 
197
  ## Fetching Websites Features
198
+ All fetcher-type classes are imported in the same way
199
  ```python
200
  from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
201
  ```
202
+ And all of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug` which are the same ones you give to the `Adaptor` class.
203
  > [!NOTE]
204
  > The `auto_match` argument is enabled by default which is the one you should care about the most as you will see later.
205
  ### Fetcher
 
213
  >> page = Fetcher().delete('https://httpbin.org/delete')
214
  ```
215
  ### StealthyFetcher
216
+ This class is built on top of [Camoufox](https://github.com/daijro/camoufox) which by default bypasses most of the anti-bot protections. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
217
  ```python
218
  >> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
219
  >> page.status == 200
 
228
  | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
229
  | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
230
  | google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
231
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
232
  | block_webrtc | Blocks WebRTC entirely. | ✔️ |
233
+ | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
234
  | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
235
+ | humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
236
  | allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | ✔️ |
237
  | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
238
+ | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
239
  | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
240
  | wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
241
 
 
244
  This list isn't final so expect a lot more additions and flexibility to be added in the next versions!
245
 
246
  ### PlayWrightFetcher
247
+ This class is built on top of [Playwright](https://playwright.dev/python/) which currently provides 4 main run options but they can be mixed as you want.
248
  ```python
249
  >> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
250
  >> page.adaptor.css_first("#search a::attr(href)")
251
  'https://github.com/D4Vinci/Scrapling'
252
  ```
253
+ Using this Fetcher class, you can make requests with:
254
  1) Vanilla Playwright without any modifications other than the ones you chose.
255
+ 2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/).</br> Some of the things this fetcher's stealth mode does include:
256
  * Patching the CDP runtime fingerprint.
257
+ * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
258
+ * Using custom flags on launch to hide Playwright even more and make it faster.
259
  * Generates real browser's headers of the same type and same user OS then append it to the request's headers.
260
  3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
261
  4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.
 
271
  | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | ✔️ |
272
  | useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | ✔️ |
273
  | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
274
+ | timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | ✔️ |
275
+ | page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | ✔️ |
276
  | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
277
  | wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
278
  | google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | ✔️ |
279
+ | extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
280
  | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
281
  | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
282
  | stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | ✔️ |
 
343
  ```python
344
  >>> page = Fetcher().get('https://books.toscrape.com/index.html').adaptor
345
 
346
+ >>> page.find_by_text('Tipping the Velvet') # Find the first element whose text fully matches this text
347
  <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
348
 
349
  >>> page.find_by_text('Tipping the Velvet', first_match=False) # Get all matches if there are more
 
  </section>
  </div>
  ```
+ And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this
  ```python
  page.css('#p1')
  ```
 
  </div>
  </div>
  ```
+ The selector will no longer function and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.
 
  ```python
  from scrapling import Adaptor
 
  ```
  > How does the auto-matching work? Check the [FAQs](#-enlightening-questions-and-faqs) section for that and other possible issues while auto-matching.
 
+ #### Real-World Scenario
+ Let's use a real website as an example and use one of the fetchers to fetch its source. To do that, we need to find a website that will change its design/structure soon, take a copy of its source, then wait for the website to make the change. Of course, that's nearly impossible to know in advance unless I know the website's owner, but that would make it a staged test haha.
 
  To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverFlow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/), pretty old huh?</br>Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector :)

+ If I want to extract the Questions button from the old design, I can use a selector like this: `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is too specific because it was generated by Google Chrome.
  Now let's test the same selector in both versions
  ```python
  >> from scrapling import Fetcher
 
  ... print('Scrapling found the same element in the old design and the new design!')
  'Scrapling found the same element in the old design and the new design!'
  ```
+ Note that I used a new argument called `automatch_domain`. This is because, to Scrapling, these are two different URLs and therefore two different websites, so it isolates their data. To tell Scrapling they are the same website, we pass the domain we want to use for saving the auto-match data for both, so Scrapling doesn't isolate them.
 
+ In a real-world scenario, the code will be the same except it will use the same URL for both requests so you won't need to use the `automatch_domain` argument. This is the closest example I can give to real-world cases so I hope it didn't confuse you :)
 
  **Notes:**
  1. For the two examples above, I used the `Adaptor` class once and the `Fetcher` class the other time, just to show that you can create the `Adaptor` object yourself if you have the source, or fetch the source using any `Fetcher` class, which will then create the `Adaptor` object for you on the `.adaptor` property.
 
  ```

  ### Find elements by filters
+ Inspired by BeautifulSoup's `find_all` function, you can find elements by using the `find_all`/`find` methods. Both methods can take multiple types of filters and return all elements in the page that match all of these filters.
 
  * To be more specific:
  * Any string passed is considered a tag name
  * Any iterable passed like List/Tuple/Set is considered an iterable of tag names.
+ * Any dictionary is considered a mapping of HTML element(s) attribute names and attribute values.
  * Any regex patterns passed are used as filters
  * Any functions passed are used as filters
+ * Any keyword argument passed is treated as an HTML element attribute with its value.
 
  So the way it works is after collecting all passed arguments and keywords, each filter passes its results to the following filter in a waterfall-like filtering system.
  <br/>It filters all elements in the current page/element in the following order:

  1. All elements with the passed tag name(s).
+ 2. All elements that match all passed attribute(s).
+ 3. All elements that match all passed regex patterns.
+ 4. All elements that fulfill all passed function(s).
 
+ Note: The filtering process always starts from the first filter it finds in the filtering order above so if no tag name(s) are passed but attributes are passed, the process starts from that layer and so on. **But the order in which you pass the arguments doesn't matter.**
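To make the waterfall idea concrete, here is a small self-contained sketch in plain Python. This is not Scrapling's actual implementation — the `Element` class and `waterfall_find_all` function are hypothetical stand-ins that only illustrate how each filter layer receives the survivors of the previous layer:

```python
import re

# A toy stand-in for a parsed element; Scrapling's real Adaptor is much richer.
class Element:
    def __init__(self, tag, attrs, text):
        self.tag, self.attrs, self.text = tag, attrs, text

def waterfall_find_all(elements, tags=None, attrs=None, patterns=None, functions=None):
    """Each filter layer only sees what survived the previous layer."""
    results = elements
    if tags:       # 1. tag name(s)
        results = [e for e in results if e.tag in tags]
    if attrs:      # 2. attribute name/value pairs
        results = [e for e in results if all(e.attrs.get(k) == v for k, v in attrs.items())]
    if patterns:   # 3. regex patterns against the text content
        results = [e for e in results if all(re.search(p, e.text) for p in patterns)]
    if functions:  # 4. arbitrary predicate functions
        results = [e for e in results if all(f(e) for f in functions)]
    return results

page = [
    Element('div', {'class': 'quote'}, 'Hello world'),
    Element('div', {'class': 'header'}, 'Welcome'),
    Element('span', {'class': 'quote'}, 'world peace'),
]
matches = waterfall_find_all(page, tags={'div'}, attrs={'class': 'quote'}, patterns=[r'world'])
print([e.text for e in matches])  # ['Hello world']
```

Note how skipping a layer (e.g. passing no tag names) simply means the process starts from the next layer that received arguments.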
 
  Examples to clear any confusion :)
 
 
  <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
  ...]

+ # Find all div elements with a class that equals `quote`.
  >> page.find_all('div', class_='quote')
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
  <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 
  <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
  ...]

+ # Find all elements with a class that equals `quote`.
  >> page.find_all({'class': 'quote'})
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
  <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
  ...]

+ # Find all div elements with a class that equals `quote` and that contain an element matching `.text` whose content includes the word 'world'.
  >> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]

+ # Find all elements that have children.
  >> page.find_all(lambda element: len(element.children) > 0)
  [<data='<html lang="en"><head><meta charset="UTF...'>,
  <data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
  <data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
  ...]

+ # Find all elements that contain the word 'world' in their content.
  >> page.find_all(lambda element: "world" in element.text)
  [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
  <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]

+ # Find all span elements that match the given regex
  >> page.find_all('span', re.compile(r'world'))
  [<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
 
 
  ```
  - Saving and retrieving elements manually to auto-match them outside the `css` and the `xpath` methods, but you have to set the identifier yourself.

+ - To save an element to the database:
  ```python
  >>> element = page.find_by_text('Tipping the Velvet', first_match=True)
  >>> page.save(element, 'my_special_element')
  ```
+ - Later, when you want to retrieve it and relocate it inside the page with auto-matching, it would be like this
  ```python
  >>> element_dict = page.retrieve('my_special_element')
  >>> page.relocate(element_dict, adaptor_type=True)
 

  - Searching results for the first one that matches a function
  ```python
+ # Find the first product with price '54.23'
  page.css('.product_pod').search(
  lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
  )
 
  case_sensitive= False, # Set the regex to ignore letters case while compiling it
  )
  ```
+ All of these methods belong to the `TextHandler` class that holds the text content, so the same can be done directly on the result of the `.text` property or the equivalent selector functions.
 

  - Doing operations on the text content itself includes
 
  ```python
  page.xpath('//script/text()').re_first(r'var dataLayer = (.+);').json()
  ```
+ - Sort all characters in the string as if it were a list and return the new string
  ```python
  quote.sort(reverse=False)
  ```
  > To be clear, `TextHandler` is a sub-class of Python's `str` so all normal operations/methods that work with Python strings will work with it.
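As a quick illustration of that note — a hypothetical sketch, not Scrapling's actual `TextHandler` code — a `str` sub-class can carry extra helpers while still behaving like a normal string:

```python
import json
import re

class MiniTextHandler(str):
    """Toy str sub-class: every normal str method still works, plus a few helpers."""
    def re_first(self, pattern):
        # Return the first capture group if the pattern has one, else the whole match.
        match = re.search(pattern, self)
        if not match:
            return None
        return match.group(1) if match.groups() else match.group(0)

    def json(self):
        # Parse the string content as JSON.
        return json.loads(self)

    def sort(self, reverse=False):
        # Sort all characters as if the string were a list, returning a new string.
        return MiniTextHandler(''.join(sorted(self, reverse=reverse)))

text = MiniTextHandler('var dataLayer = {"page": 1};')
print(text.upper())                              # normal str method still works
print(text.re_first(r'var dataLayer = (.+);'))   # prints: {"page": 1}
print(MiniTextHandler('{"page": 1}').json())     # prints: {'page': 1}
```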
 
+ - Any element's attributes are not exactly a dictionary but a read-only sub-class of [mapping](https://docs.python.org/3/glossary.html#term-mapping) called `AttributesHandler`, which makes it faster. The string values it returns are actually `TextHandler` objects, so all the operations above can be done on them, plus the standard dictionary operations that don't modify the data, and more :)
  - Unlike standard dictionaries, here you can search by values too and can do partial searches. It might be handy in some cases (returns a generator of matches)
  ```python
  >>> for item in element.attrib.search_values('catalogue', partial=True):
 

  Note that implementing your own storage system can be complex, as there are some strict rules, such as inheriting from the same abstract class, following the singleton design pattern used in the other classes, and more. So make sure to read the docs first.
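For illustration of the singleton requirement mentioned above — the class and method names here are hypothetical, not Scrapling's real storage API — a singleton storage class in Python typically looks something like this:

```python
from abc import ABC, abstractmethod
import threading

class StorageSystemMixin(ABC):
    """Hypothetical abstract base: one shared instance per class (singleton)."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls, *args, **kwargs):
        # Create the instance only once, under a lock for thread safety.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
        return cls._instance

    @abstractmethod
    def save(self, element, identifier): ...

    @abstractmethod
    def retrieve(self, identifier): ...

class InMemoryStorage(StorageSystemMixin):
    def __init__(self):
        # Guard so repeated "construction" doesn't wipe the shared dict.
        if not hasattr(self, '_data'):
            self._data = {}

    def save(self, element, identifier):
        self._data[identifier] = element

    def retrieve(self, identifier):
        return self._data.get(identifier)

a, b = InMemoryStorage(), InMemoryStorage()
a.save({'tag': 'a'}, 'my_special_element')
print(b.retrieve('my_special_element'), a is b)  # prints: {'tag': 'a'} True
```

Because every "new" instance is the same object, all parts of the scraper read and write the same saved-element data.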
 
+ To provide detailed documentation of the library, it will need a website. I'm trying to rush creating that website, researching new ideas, and adding more features/tests/benchmarks, but time is tight with too many spinning plates between work, personal life, and working on Scrapling. You can help by using the [sponsor button](https://github.com/sponsors/D4Vinci) above :)
 
  ## ⚡ Enlightening Questions and FAQs
  This section addresses common questions about Scrapling; please read it before opening an issue.
 
  Together both are used to retrieve the element's unique properties from the database later.
  4. Later, when you enable the `auto_match` parameter for both the Adaptor instance and the method call, the element's properties are retrieved and Scrapling loops over all elements in the page, comparing each one's unique properties to the unique properties we already have for this element, and a score is calculated for each one.
  5. The comparison between elements is not exact but more about finding how similar these values are, so everything is taken into consideration, even the values' order, like the order in which the element's class names were written before and the order in which the same class names are written now.
+ 6. The score for each element is stored in the table, and in the end, the element(s) with the highest combined similarity scores are returned.
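To make the scoring idea concrete, here is a rough stdlib sketch of fuzzy property comparison — not Scrapling's actual algorithm, and the property names are made up for the example. `difflib.SequenceMatcher` naturally accounts for ordering, which mirrors the point above about class-name order:

```python
from difflib import SequenceMatcher

def property_score(saved, candidate):
    """Average the fuzzy similarity of each saved property against the candidate's."""
    scores = [
        SequenceMatcher(None, str(saved[key]), str(candidate.get(key, ''))).ratio()
        for key in saved
    ]
    return sum(scores) / len(scores)

saved = {'tag': 'a', 'classes': 'btn btn-primary', 'text': 'Questions'}
candidates = [
    {'tag': 'a', 'classes': 'btn-primary btn', 'text': 'Questions'},  # classes reordered
    {'tag': 'div', 'classes': 'nav-item', 'text': 'Tags'},
]
# The candidate with the highest combined similarity wins.
best = max(candidates, key=lambda c: property_score(saved, c))
print(best['text'])  # prints: Questions
```

Note that the reordered class list still scores high but below a perfect match — similarity, not equality, decides the winner.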
 
  ### How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?
  Not a big problem as it depends on your usage. The word `default` will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website that you also initialized without the URL parameter. The save process will overwrite the previous data, and auto-matching uses only the latest saved properties.
 
  Of course, you can find elements by text/regex, find similar elements in a more reliable way than AutoScraper, and finally save/retrieve elements manually to use later like the model feature in AutoScraper. I have pulled all top articles about AutoScraper from Google and tested Scrapling against the examples in them. In all examples, Scrapling got the same results as AutoScraper in much less time.

  ### Is Scrapling thread-safe?
+ Yes, Scrapling instances are thread-safe. Each Adaptor instance maintains its own state.

  ## Sponsors
  [![Capsolver Banner](https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/CapSolver.png)](https://www.capsolver.com/?utm_source=github&utm_medium=repo&utm_campaign=scraping&utm_term=Scrapling)
 

  ## Disclaimer for Scrapling Project
  > [!CAUTION]
+ > This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or within their allowed rules like the `robots.txt` file, for example.

  ## License
  This work is licensed under BSD-3