Karim shoair committed on
Commit
7f474cb
·
1 Parent(s): 3df2133

docs: Update the 'main classes' page

Files changed (1)
  1. docs/parsing/main_classes.md +92 -72
docs/parsing/main_classes.md CHANGED
@@ -1,41 +1,41 @@
1
  ## Introduction
2
- After exploring the various ways to select elements with Scrapling and related features, Let's take a step back and examine the [Adaptor](#adaptor) class generally and other objects to better understand the parsing engine.
3
 
4
- The [Adaptor](#adaptor) class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities. You can always import it with any of the following imports
5
  ```python
6
- from scrapling import Adaptor
7
- from scrapling.parser import Adaptor
8
  ```
9
- then use it directly as you already learned in the [overview](../overview.md) page
10
  ```python
11
- adaptor = Adaptor(
12
- text='<html>...</html>',
13
  url='https://example.com'
14
  )
15
 
16
  # Then select elements as you like
17
- elements = adaptor.css('.product')
18
  ```
19
- In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, an [Adaptor](#adaptor) object. Any operation you do, like selection, navigation, etc., will return either an [Adaptor](#adaptor) object or an [Adaptors](#adaptors) object, given that the result is element/elements from the page, not text or similar.
20
 
21
- In other words, the main page is a [Adaptor](#adaptor) object, and the elements within are [Adaptor](#adaptor) objects, and so on. Any text, such as the text content inside elements or the text inside element attributes, is a [TextHandler](#texthandler) object, and the attributes of each element are stored as [AttributesHandler](#attributeshandler). We will return to both objects later, so let's focus on the [Adaptor](#adaptor) object.
22
 
23
- ## Adaptor
24
  ### Arguments explained
25
- The most important ones are `text` and `body`. Both are used to pass the HTML code you want to parse, but the first one accepts `str`, and the latter accepts `bytes` like how you used to do with `parsel` :)
26
 
27
- Otherwise, you have the arguments `url`, `auto_match`, `storage`, and `storage_args`. All these arguments are settings used with the `auto_match` feature, and they don't make a difference if you are not going to use that feature, so just ignore them for now, and we will explain them in the [automatch](automatch.md) feature page.
28
 
29
- Then you have the arguments for adjustments for parsing or adjusting/manipulating the HTML while the library parsing it:
30
 
31
  - **encoding**: This is the encoding that will be used while parsing the HTML. The default is `UTF-8`.
32
- - **keep_comments**: This tells the library whether to keep HTML comments while parsing the page. It's disabled by default, as it can mess up your scraping in many ways.
33
- - **keep_cdata**: Same logic as the HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML. This also means when you check for the raw html content, you will find it doesn't have the cdata.
34
 
35
  I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
36
- You may notice that I'm doing that a lot, and that's because it's something you don't need to know to use the library. The development section will cover these missing parts if you are that interested.
37
 
38
- After that, for the main page and elements within, most properties don't get initialized until you use it like the text content of a page/element, and this is one of the reasons for Scrapling speed :)
39
 
40
  ### Properties
41
  You have already seen much of this on the [overview](../overview.md) page, but don't worry if you didn't. We will review it more thoroughly using more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.
@@ -81,15 +81,15 @@ Let's say we are parsing this HTML page for simplicity:
81
  ```
82
  Load the page directly as shown before:
83
  ```python
84
- from scrapling import Adaptor
85
- page = Adaptor(html_doc)
86
  ```
87
  Get all text content on the page recursively
88
  ```python
89
  >>> page.get_all_text()
90
  'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
91
  ```
92
- Get the first article as explained before; we will use it as an example
93
  ```python
94
  article = page.find('article')
95
  ```
@@ -98,7 +98,7 @@ With the same logic, get all text content on the element recursively
98
  >>> article.get_all_text()
99
  'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
100
  ```
101
- But if you try to get the direct text content, it will be empty; notice the logic difference
102
  ```python
103
  >>> article.text
104
  ''
@@ -107,10 +107,10 @@ The `get_all_text` method has the following optional arguments:
107
 
108
  1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'
109
  2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
110
- 3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results. The default is `('script', 'style',)`.
111
  4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty text content or only whitespaces will be ignored. It's enabled by default
112
 
113
- By the way, the text returned here is not a standard string but a [TextHandler](#texthandler); we will get to this in detail later, so if the text content can be serialized to JSON, then use `.json()` on it
114
  ```python
115
  >>> script = page.find('script')
116
  >>> script.json()
@@ -121,7 +121,7 @@ Let's continue to get the element tag
121
  >>> article.tag
122
  'article'
123
  ```
124
- If you used it on the page directly, you will find you are operating on the root `html` element
125
  ```python
126
  >>> page.tag
127
  'html'
@@ -133,6 +133,17 @@ Getting the attributes of the element
133
  >>> print(article.attrib)
134
  {'class': 'product', 'data-id': '1'}
135
  ```
136
  Get the HTML content of the element
137
  ```python
138
  >>> article.html_content
@@ -143,7 +154,7 @@ It's the same if you used the `.body` property
143
  >>> article.body
144
  '<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
145
  ```
146
- Get the prettified version of the HTML content of the element
147
  ```python
148
  >>> print(article.prettify())
149
  <article class="product" data-id="1"><h3>Product 1</h3>
@@ -175,12 +186,12 @@ Same case with XPath
175
  ```
176
 
177
  ### Traversal
178
- Using the elements we found above, we will go over the properties/methods for moving in the page in detail.
179
 
180
  If you are unfamiliar with the DOM tree or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online for a better understanding.
181
 
182
  If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
183
- Simply put, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
184
  This element will be directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The element `body` is a "sibling" of the element `head` and vice versa.
185
 
186
  Accessing the parent of an element
@@ -238,7 +249,7 @@ Get the siblings of an element
238
  ```
239
  Get the next element of the current element
240
  ```python
241
- >>> article.next # gets the next element, the same logic applies to `quote.previous`
242
  <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
243
  ```
244
  The same logic applies to the `previous` property
@@ -258,7 +269,7 @@ If your case needs more than the element's parent, you can iterate over the whol
258
  for ancestor in article.iterancestors():
259
  # do something with it...
260
  ```
261
- You can search for a specific ancestor of an element that satisfies a function; all you need to do is to pass a function that takes an [Adaptor](#adaptor) object as an argument and return `True` if the condition satisfies or `False` otherwise like below:
262
  ```python
263
  >>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
264
  <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
@@ -266,10 +277,10 @@ You can search for a specific ancestor of an element that satisfies a function;
266
  >>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list')) # Same result, different approach
267
  <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
268
  ```
269
- ## Adaptors
270
- The class `Adaptors` is the "List" version of the [Adaptor](#adaptor) class. It inherits from the Python standard `List` type, so it shares all `List` properties and methods while adding more methods to make the operations you want to execute on the [Adaptor](#adaptor) instances within more straightforward.
271
 
272
- In the [Adaptor](#adaptor) class, all methods/properties that should return a group of elements return them as an [Adaptors](#adaptors) class instance. The only exceptions are when you use the CSS/XPath methods as follows:
273
 
274
  - If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
275
  ```python
@@ -284,18 +295,18 @@ In the [Adaptor](#adaptor) class, all methods/properties that should return a gr
284
  ```
285
  - If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
286
  ```python
287
- >>> page.css('.price_color') # -> Adaptors
288
  >>> page.css('.product_pod a::attr(href)') # -> TextHandlers
289
  >>> page.css('.price_color, .product_pod a::attr(href)') # -> List
290
  ```
291
 
292
- Let's see what [Adaptors](#adaptors) class adds to the table with that out of the way.
293
  ### Properties
294
  Apart from the normal operations on Python lists like iteration, slicing, etc...
295
 
296
  You can do the following:
297
 
298
- Execute CSS and XPath selectors directly on the [Adaptor](#adaptor) instances it has while the arguments and the return types are the same as [Adaptor](#adaptor)'s `css` and `xpath` methods. This, of course, makes chaining methods very straightforward.
299
  ```python
300
  >>> page.css('.product_pod a')
301
  [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
@@ -315,9 +326,9 @@ Execute CSS and XPath selectors directly on the [Adaptor](#adaptor) instances it
315
  <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
316
  ...]
317
  ```
318
- Run the `re` and `re_first` methods directly. They take the same arguments passed as the [Adaptor](#adaptor) class. I'm still leaving these methods to be explained in the [TextHandler](#texthandler) section below.
319
 
320
- However, in this class, the `re_first` behaves differently as it runs `re` on each [Adaptor](#adaptor) within and returns the first one with a result. The `re` method will return a [TextHandlers](#texthandlers) object as normal that has all the results combined in one [TextHandlers](#texthandlers) instance.
321
  ```python
322
  >>> page.css('.price_color').re(r'[\d\.]+')
323
  ['51.77',
@@ -334,14 +345,14 @@ However, in this class, the `re_first` behaves differently as it runs `re` on ea
334
  'sharp-objects_997',
335
  ...]
336
  ```
337
- With the `search` method, you can search quickly in the available [Adaptor](#adaptor) classes. The function you pass must accept an [Adaptor](#adaptor) instance as the first argument and return True/False. The method will return the first [Adaptor](#adaptor) instance that satisfies the function; otherwise, it will return `None`.
338
  ```python
339
  # Find the first product with price '54.23'
340
  >>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
341
  >>> page.css('.product_pod').search(search_function)
342
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
343
  ```
344
- You can use the `filter` method, too, which takes a function like the `search` method but returns an `Adaptors` instance of all the [Adaptor](#adaptor) classes that satisfy the function
345
  ```python
346
  # Find all products with prices over $50
347
  >>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
@@ -351,26 +362,35 @@ You can use the `filter` method, too, which takes a function like the `search` m
351
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
352
  ...]
353
  ```
354
 
355
  ## TextHandler
356
  This class is mandatory to understand, as all methods/properties that should return a string for you will return `TextHandler`, and the ones that should return a list of strings will return [TextHandlers](#texthandlers) instead.
357
 
358
- TextHandler is a subclass of the standard Python string, so you can do anything with it. So, what is the difference that requires a different naming?
359
 
360
- Of course, TextHandler provides extra methods and properties that the standard Python strings can't do. We will review them now, but remember that all methods and properties in all classes that return string(s) are returning TextHandler, which opens the door for creativity and makes the code shorter and cleaner, as you will see. Also, you can import it directly and use it on any string, which we will explain later.
361
  ### Usage
362
- First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a TextHandler again, so you can chain them as you want. If you find a method or property that returns a standard string instead of TextHandler, please open an issue, and we will override it as well.
363
 
364
- First, we start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([Adaptor](#adaptor), [Adaptors](#adaptors), and [TextHandlers](#texthandlers)), so they will take the same arguments as well.
365
 
366
- The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the naming, it returns the first result only as a `TextHandler` instance.
367
 
368
  Also, it takes other helpful arguments, which are:
369
 
370
  - **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
371
- - **clean_match**: It's disabled by default. This makes the method ignore all whitespaces and consecutive spaces while matching.
372
- - **case_sensitive**: It's enabled by default. As the name implies, disabling it will make the regex ignore letters case while compiling it.
373
-
374
  You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
375
  ```python
376
  >>> page.css('.price_color').re(r'[\d\.]+')
@@ -405,25 +425,25 @@ First, we start with the `re` and `re_first` methods. These are the same methods
405
  >>> test_string.re('hi there', clean_match=True, case_sensitive=False)
406
  ['hi There']
407
  ```
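The idea behind a string subclass whose operations keep returning the subclass can be sketched in a few lines. This is a minimal illustration, not Scrapling's `TextHandler`: the class name `Text` is hypothetical, only `strip` and indexing are overridden, and the `re` method supports just the `case_sensitive` switch described above.

```python
import re as _re  # aliased so the method named `re` below is not confused with the module
from typing import List


class Text(str):
    """Hypothetical str subclass; a sketch, not Scrapling's TextHandler."""

    def __getitem__(self, key) -> "Text":
        # Slicing and indexing return the subclass instead of a plain str.
        return Text(super().__getitem__(key))

    def strip(self, chars=None) -> "Text":
        # Re-wrap the result so further chained calls still see a Text.
        return Text(super().strip(chars))

    def re(self, pattern: str, case_sensitive: bool = True) -> List["Text"]:
        # Assumes the pattern has at most one capture group, so findall yields strings.
        flags = 0 if case_sensitive else _re.IGNORECASE
        return [Text(match) for match in _re.findall(pattern, self, flags)]


padded = Text("  Hi There  ")
matches = padded.strip().re("hi there", case_sensitive=False)
```

Because every overridden method wraps its result again, calls such as `padded.strip().re(...)` can be chained without falling back to a plain `str`.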
408
- Another use of the idea of replacing strings with `TextHandler` everywhere is a property like `html_content` returns `TextHandler` so you can do regex on the HTML content if you want:
409
  ```python
410
  >>> page.html_content.re('div class=".*">(.*)</div')
411
  ['In stock: 5', 'In stock: 3', 'Out of stock']
412
  ```
413
 
414
- - You also have the `.json()` method, which tries to convert the content to a json object quickly if possible; otherwise, it throws an error
415
  ```python
416
  >>> page.css_first('#page-data::text')
417
  '\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
418
  >>> page.css_first('#page-data::text').json()
419
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
420
  ```
421
- Hence, if you didn't specify a text node while selecting an element (like the text content or an attribute text content), the text content will be selected automatically like this
422
  ```python
423
  >>> page.css_first('#page-data').json()
424
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
425
  ```
426
- The [Adaptor](#adaptor) class adds one thing here, too; let's say this is the page we are working with:
427
  ```html
428
  <html>
429
  <body>
@@ -438,42 +458,42 @@ First, we start with the `re` and `re_first` methods. These are the same methods
438
  </body>
439
  </html>
440
  ```
441
- The [Adaptor](#adaptor) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
442
  So, as you know here, if you did something like this
443
  ```python
444
  >>> page.css_first('div::text').json()
445
  ```
446
- You will get an error because the `div` tag doesn't have direct text content that can be serialized to JSON; it actually doesn't have text content at all.<br/><br/>
447
  In this case, the `get_all_text` method comes to the rescue, so you can do something like that
448
  ```python
449
  >>> page.css_first('div').get_all_text(ignore_tags=[]).json()
450
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
451
  ```
452
  I used the `ignore_tags` argument here because the default value of it is `('script', 'style',)`, as you are aware.<br/><br/>
453
- Another related behavior you should be aware of is the case while using any of the fetchers, which we will explain later. If you have a JSON response like this example:
454
  ```python
455
- >>> page = Adaptor("""{"some_key": "some_value"}""")
456
  ```
457
- Because the [Adaptor](#adaptor) class is optimized to deal with HTML pages, it will deal with it as a broken HTML response and fix it, so if you used the `html_content` property, you get this
458
  ```python
459
  >>> page.html_content
460
  '<html><body><p>{"some_key": "some_value"}</p></body></html>'
461
  ```
462
- Here, you can use `json` method directly, and it will work
463
  ```python
464
  >>> page.json()
465
  {'some_key': 'some_value'}
466
  ```
467
- You might wonder how this happened while the `html` tag lacks direct text?<br/>
468
- Well, for these cases like JSON responses, I made the `.json()` method inside the [Adaptor](#adaptor) class to check if the current element doesn't have text content; it will use the `get_all_text` method directly.<br/><br/>It might sound hacky a bit but remember, Scrapling is currently optimized to work with HTML pages only so that's the best way till now to handle JSON responses currently without sacrificing speed. This will be changed in the upcoming versions.
469
 
470
- - Another handy method is `.clean()`, this will remove all white spaces and consecutive spaces for you and return a new `TextHandler`, wonderful
471
  ```python
472
  >>> TextHandler('\n wonderful idea, \reh?').clean()
473
  'wonderful idea, eh?'
474
  ```
475
 
476
- - Another method that might be helpful in some cases is the `.sort()` method to sort the string for you as you do with lists
477
  ```python
478
  >>> TextHandler('acb').sort()
479
  'abc'
@@ -487,19 +507,19 @@ Or do it in reverse:
487
  Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.
488
 
489
  ## TextHandlers
490
- You probably guessed it: This class is similar to [Adaptors](#adaptors) and [Adaptor](#adaptor), but here it inherits the same logic and method as standard lists, with only `re` and `re_first` as new methods.
491
 
492
- The only difference is that the `re_first` method logic here does `re` on each [TextHandler](#texthandler) within and returns the first result it has or `None`. Nothing is new to explain here, but new methods will be added here with time.
493
 
494
  ## AttributesHandler
495
- This is a read-only version of Python's standard dictionary or `dict` that's only used to store the attributes of each element or each [Adaptor](#adaptor) instance, in other words.
496
  ```python
497
  >>> print(page.find('script').attrib)
498
  {'id': 'page-data', 'type': 'application/json'}
499
  >>> type(page.find('script').attrib).__name__
500
  'AttributesHandler'
501
  ```
502
- Because it's read-only, it will use fewer resources than the standard dictionary. Still, it has the same dictionary method/properties other than those allowing you to modify/override the data.
503
 
504
  It currently adds two extra simple methods:
505
 
@@ -530,10 +550,10 @@ It currently adds two extra simple methods:
530
 
531
  Hence, I used the `list` function here because `search_values` returns a generator, so it would be `True` for all elements.
532
 
533
- - The `json_string` property
534
 
535
- This property converts current attributes to JSON string if the attributes are JSON serializable; otherwise, it throws an error
536
- ```python
537
- >>> page.find('script').attrib.json_string
538
- b'{"id":"page-data","type":"application/json"}'
539
- ```
 
1
  ## Introduction
2
+ After exploring the various ways to select elements with Scrapling and related features, let's take a step back and examine the [Selector](#selector) class generally and other objects to better understand the parsing engine.
3
 
4
+ The [Selector](#selector) class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities. You can import it with either of the following imports:
5
  ```python
6
+ from scrapling import Selector
7
+ from scrapling.parser import Selector
8
  ```
9
+ Then use it directly, as you already learned on the [overview](../overview.md) page:
10
  ```python
11
+ page = Selector(
12
+ '<html>...</html>',
13
  url='https://example.com'
14
  )
15
 
16
  # Then select elements as you like
17
+ elements = page.css('.product')
18
  ```
19
+ In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, a [Selector](#selector) object. Any operation you do, like selection, navigation, etc., will return either a [Selector](#selector) object or a [Selectors](#selectors) object, given that the result is element/elements from the page, not text or similar.
20
 
21
+ In other words, the main page is a [Selector](#selector) object, and the elements within are [Selector](#selector) objects, and so on. Any text, such as the text content inside elements or the text inside element attributes, is a [TextHandler](#texthandler) object, and the attributes of each element are stored as [AttributesHandler](#attributeshandler). We will return to both objects later, so let's focus on the [Selector](#selector) object.
22
 
23
+ ## Selector
24
  ### Arguments explained
25
+ The most important one is `content`, which is used to pass the HTML code you want to parse; it accepts the HTML content as `str` or `bytes`.
26
 
27
+ Beyond that, you have the arguments `url`, `adaptive`, `storage`, and `storage_args`. All these arguments are settings used with the `adaptive` feature, and they make no difference if you are not going to use that feature, so just ignore them for now; we will explain them on the [adaptive](adaptive.md) feature page.
28
 
29
+ Then there are the arguments that adjust parsing or manipulate the HTML content while the library is parsing it:
30
 
31
  - **encoding**: This is the encoding that will be used while parsing the HTML. The default is `UTF-8`.
32
+ - **keep_comments**: This tells the library whether to keep HTML comments while parsing the page. It's disabled by default because it can cause issues with your scraping in various ways.
33
+ - **keep_cdata**: Same logic as the HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML.
34
 
35
  I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
36
+ You may notice that I do this a lot; that's because these arguments involve advanced features that you don't need to know to use the library. The development section will cover these missing parts if you are interested.
37
 
38
+ After that, for the main page and elements within, most properties are lazily loaded. This means they don't get initialized until you use them, such as the text content of a page/element, and this is one of the reasons for Scrapling's speed :)
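To make the lazy-loading idea concrete, here is a minimal sketch using the standard library's `functools.cached_property`. The class `LazyElement` and its `computations` counter are hypothetical and exist only for illustration; this is not how Scrapling itself is implemented.

```python
from functools import cached_property


class LazyElement:
    """Hypothetical stand-in for a parsed element; not Scrapling's class."""

    def __init__(self, raw_text: str):
        self._raw_text = raw_text
        self.computations = 0  # counts how often the expensive step runs

    @cached_property
    def text(self) -> str:
        # Runs only on first access; the result is then stored on the instance.
        self.computations += 1
        return " ".join(self._raw_text.split())


element = LazyElement("  Product   1 \n")
computed_before_access = element.computations  # still 0: nothing was computed yet
first_read = element.text
second_read = element.text  # served from the cached value
```

Nothing is computed when the object is created, and repeated reads reuse the stored value, which is the general trade-off lazy properties offer.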
39
 
40
  ### Properties
41
  You have already seen much of this on the [overview](../overview.md) page, but don't worry if you didn't. We will review it more thoroughly using more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.
 
81
  ```
82
  Load the page directly as shown before:
83
  ```python
84
+ from scrapling import Selector
85
+ page = Selector(html_doc)
86
  ```
87
  Get all text content on the page recursively
88
  ```python
89
  >>> page.get_all_text()
90
  'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
91
  ```
92
+ Get the first article, as explained before; we will use it as an example
93
  ```python
94
  article = page.find('article')
95
  ```
 
98
  >>> article.get_all_text()
99
  'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
100
  ```
101
+ But if you try to get the direct text content, it will be empty because it doesn't have direct text in the HTML code above
102
  ```python
103
  >>> article.text
104
  ''
 
107
 
108
  1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'
109
  2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
110
+ 3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results; any elements nested within them are ignored as well. The default is `('script', 'style',)`.
111
  4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty text content or only whitespaces will be ignored. It's enabled by default
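The four arguments above can be illustrated with a small standard-library sketch built on `html.parser`. The names `TextCollector` and `collect_text` are hypothetical, and the logic only approximates the described behavior; it is not Scrapling's implementation of `get_all_text`.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping ignored tags and everything nested in them."""

    def __init__(self, ignore_tags=("script", "style")):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self._skip_tag = None   # name of the ignored tag we are currently inside
        self._skip_depth = 0    # nesting depth of that same tag name
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is None and tag in self.ignore_tags:
            self._skip_tag, self._skip_depth = tag, 1
        elif tag == self._skip_tag:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self._skip_tag:
            self._skip_depth -= 1
            if self._skip_depth == 0:
                self._skip_tag = None

    def handle_data(self, data):
        if self._skip_tag is None:
            self.chunks.append(data)


def collect_text(html, separator="\n", strip=False,
                 ignore_tags=("script", "style"), valid_values=True):
    parser = TextCollector(ignore_tags)
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    if valid_values:
        # Drop chunks that are empty or whitespace only.
        chunks = [chunk for chunk in chunks if chunk.strip()]
    if strip:
        chunks = [chunk.strip() for chunk in chunks]
    return separator.join(chunks)


sample = "<article><h3>Product 1</h3><p> In stock: 5 </p><script>var x = 1;</script></article>"
stripped_text = collect_text(sample, strip=True)
with_scripts = collect_text(sample, ignore_tags=())
```

Passing an empty `ignore_tags` keeps the script content in the result, which mirrors why the page later passes `ignore_tags=[]` when it needs the JSON inside a `script` tag.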
112
 
113
+ By the way, the text returned here is not a standard string but a [TextHandler](#texthandler); we will get to this in detail later, so if the text content can be serialized to JSON, use `.json()` on it
114
  ```python
115
  >>> script = page.find('script')
116
  >>> script.json()
 
121
  >>> article.tag
122
  'article'
123
  ```
124
+ If you use it on the page directly, you will find that you are operating on the root `html` element
125
  ```python
126
  >>> page.tag
127
  'html'
 
133
  >>> print(article.attrib)
134
  {'class': 'product', 'data-id': '1'}
135
  ```
136
+ Access a specific attribute with any of the following methods
137
+ ```python
138
+ >>> article.attrib['class']
139
+ >>> article.attrib.get('class')
140
+ >>> article['class'] # new in v0.3
141
+ ```
142
+ Check if the attributes contain a specific attribute with any of the methods below
143
+ ```python
144
+ >>> 'class' in article.attrib
145
+ >>> 'class' in article # new in v0.3
146
+ ```
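The shorthand forms `article['class']` and `'class' in article` suggest that the element forwards item access and membership checks to its attributes. A minimal sketch of that pattern, assuming a hypothetical `Element` class with a plain `dict` for `attrib` (not Scrapling's actual code), could look like this:

```python
class Element:
    """Hypothetical element that delegates item access to its attributes."""

    def __init__(self, tag, attrib):
        self.tag = tag
        self.attrib = dict(attrib)

    def __getitem__(self, name):
        # element['class'] behaves like element.attrib['class']
        return self.attrib[name]

    def __contains__(self, name):
        # 'class' in element behaves like 'class' in element.attrib
        return name in self.attrib


article = Element("article", {"class": "product", "data-id": "1"})
class_value = article["class"]
has_class = "class" in article
has_href = "href" in article
```

As with a plain dictionary, item access on a missing attribute raises `KeyError` in this sketch, so a membership check or `attrib.get(...)` is the safer route for optional attributes.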
147
  Get the HTML content of the element
148
  ```python
149
  >>> article.html_content
 
154
  >>> article.body
155
  '<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
156
  ```
157
+ Get the prettified version of the element's HTML content
158
  ```python
159
  >>> print(article.prettify())
160
  <article class="product" data-id="1"><h3>Product 1</h3>
 
186
  ```
187
 
188
  ### Traversal
189
+ Using the elements we found above, we will go over the properties/methods for moving around the page in detail.
190
 
191
  If you are unfamiliar with the DOM tree or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online for a better understanding.
192
 
193
  If you are too lazy to search about it, here's a quick explanation to give you a good idea.<br/>
194
+ In simple words, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
195
  This element will be directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The element `body` is a "sibling" of the element `head` and vice versa.
196
 
197
  Accessing the parent of an element
 
249
  ```
250
  Get the next element of the current element
251
  ```python
252
+ >>> article.next
253
  <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
254
  ```
255
  The same logic applies to the `previous` property
 
269
  for ancestor in article.iterancestors():
270
  # do something with it...
271
  ```
272
+ You can search for a specific ancestor of an element that satisfies a search function; all you need to do is pass a function that takes a [Selector](#selector) object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
273
  ```python
274
  >>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
275
  <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
 
277
  >>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list')) # Same result, different approach
278
  <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
279
  ```
280
+ ## Selectors
281
+ The class `Selectors` is the "list" version of the [Selector](#selector) class. It inherits from Python's standard `list` type, so it shares all `list` properties and methods while adding more methods that make the operations you want to run on the [Selector](#selector) instances within more straightforward.
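The pattern of a list subclass that adds helper methods can be sketched as follows. `ElementList` is a hypothetical name, and its `search` and `filter` helpers are modeled on the methods of the same names that this page documents; the sketch uses plain floats instead of elements and is not Scrapling's implementation.

```python
from typing import Callable, Optional


class ElementList(list):
    """Hypothetical list subclass with two convenience helpers."""

    def search(self, func: Callable[[object], bool]) -> Optional[object]:
        # Return the first item that satisfies func, or None if none does.
        for item in self:
            if func(item):
                return item
        return None

    def filter(self, func: Callable[[object], bool]) -> "ElementList":
        # Return a new ElementList with every item that satisfies func.
        return ElementList(item for item in self if func(item))


prices = ElementList([51.77, 53.74, 50.10, 47.82, 54.23])
first_over_52 = prices.search(lambda price: price > 52)
all_over_52 = prices.filter(lambda price: price > 52)
```

Since `filter` returns the subclass again, its result can be searched or filtered further, while all ordinary list operations such as iteration and `len` keep working.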
282
 
283
+ In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance. The only exceptions are when you use the CSS/XPath methods as follows:
284
 
285
  - If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
286
  ```python
 
295
  ```
296
  - If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
297
  ```python
298
+ >>> page.css('.price_color') # -> Selectors
299
  >>> page.css('.product_pod a::attr(href)') # -> TextHandlers
300
  >>> page.css('.price_color, .product_pod a::attr(href)') # -> List
301
  ```
302
 
303
+ With that out of the way, let's see what the [Selectors](#selectors) class brings to the table.
304
  ### Properties
305
  Apart from the normal operations on Python lists like iteration, slicing, etc...
306
 
307
  You can do the following:
308
 
309
+ Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the arguments and the return types are the same as [Selector](#selector)'s `css` and `xpath` methods. This, of course, makes chaining methods very straightforward.
310
  ```python
311
  >>> page.css('.product_pod a')
312
  [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
 
326
  <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
327
  ...]
328
  ```
329
+ Run the `re` and `re_first` methods directly. They take the same arguments as the ones in the [Selector](#selector) class. These methods are explained in detail in the [TextHandler](#texthandler) section below.
330
 
331
+ However, in this class, `re_first` behaves differently: it runs `re` on each [Selector](#selector) within and returns the first match found. The `re` method returns a [TextHandlers](#texthandlers) object as usual, with the [TextHandler](#texthandler) results from all instances combined into one [TextHandlers](#texthandlers) instance.
332
  ```python
333
  >>> page.css('.price_color').re(r'[\d\.]+')
334
  ['51.77',
 
345
  'sharp-objects_997',
346
  ...]
347
  ```
348
+ With the `search` method, you can search quickly in the available [Selector](#selector) instances. The function you pass must accept a [Selector](#selector) instance as the first argument and return True/False. The method will return the first [Selector](#selector) instance that satisfies the function; otherwise, it will return `None`.
349
  ```python
350
  # Find the first product with price '54.23'
351
  >>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
352
  >>> page.css('.product_pod').search(search_function)
353
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
354
  ```
355
+ You can use the `filter` method, too, which takes a function like the `search` method but returns a `Selectors` instance of all the [Selector](#selector) instances that satisfy the function:
356
  ```python
357
  # Find all products with prices over $50
358
  >>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
 
362
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
363
  ...]
364
  ```
365
+ If you are lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
366
+ ```python
367
+ page.css('.product_pod').length
368
+ ```
369
+ instead of this
370
+ ```python
371
+ len(page.css('.product_pod'))
372
+ ```
373
+ Yup, like JavaScript :)
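To make the difference between `search`, `filter`, and `length` concrete, here is a minimal sketch built on a plain `list` subclass. The `MiniList` class is hypothetical and only mirrors the behavior described above; the prices are sample values rather than scraped data.

```python
from typing import Callable, Optional


class MiniList(list):
    """Hypothetical list subclass; only mirrors the behavior described above."""

    def search(self, func: Callable) -> Optional[object]:
        # Return the first item accepted by func, or None if nothing matches
        for item in self:
            if func(item):
                return item
        return None

    def filter(self, func: Callable) -> "MiniList":
        # Return a new MiniList holding every item accepted by func
        return self.__class__(item for item in self if func(item))

    @property
    def length(self) -> int:
        # Convenience alias for len(self)
        return len(self)


prices = MiniList([51.77, 53.74, 50.10, 47.82, 54.23])
first_match = prices.search(lambda p: p == 54.23)
over_50 = prices.filter(lambda p: p > 50)
```

Because `filter` returns the same class it was called on, its result can be searched, filtered again, or measured with `length`, which is what makes chaining possible.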
374
 
375
  ## TextHandler
376
  This class is mandatory to understand, as all methods/properties that should return a string for you will return `TextHandler`, and the ones that should return a list of strings will return [TextHandlers](#texthandlers) instead.
377
 
378
+ TextHandler is a subclass of the standard Python string, so you can do anything with it that you can do with a Python string. So why does it need a separate name?
379
 
380
+ Of course, TextHandler provides extra methods and properties that standard Python strings don't have. We will review them now, but remember that all methods and properties in all classes that return string(s) return TextHandler, which opens the door for creativity and makes the code shorter and cleaner, as you will see. Also, you can import it directly and use it on any string, which we will explain [later](../development/scrapling_custom_types.md).
381
  ### Usage
382
+ First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a `TextHandler` again, so you can chain them as you want. If you find a method or property that returns a standard string instead of `TextHandler`, please open an issue, and we will override it as well.
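Why do these methods need overriding at all? A bare `str` subclass gets plain `str` objects back from built-in operations, so the extra methods would be lost after the first call. The sketch below shows the general idea with two hypothetical classes, `PlainText` and `MiniText`; it is not how Scrapling itself is written.

```python
class PlainText(str):
    """A bare str subclass: built-in methods hand back plain str."""


class MiniText(str):
    """Hypothetical sketch: re-wrap results so the subclass type survives."""

    def __getitem__(self, key):
        return self.__class__(super().__getitem__(key))

    def strip(self, chars=None):
        return self.__class__(super().strip(chars))

    def replace(self, old, new, count=-1):
        return self.__class__(super().replace(old, new, count))


plain_result = PlainText("  $51.77  ").strip()
mini_result = MiniText("  $51.77  ").strip().replace("$", "")[:2]
```

Here `plain_result` is an ordinary `str`, while `mini_result` is still a `MiniText` after three chained operations.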
383
 
384
+ We start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers)), so they take the same arguments as well.
385
 
386
+ - The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the naming, it returns the first result only as a `TextHandler` instance.
387
 
388
  Both methods also take other helpful arguments:
389
 
390
  - **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
391
+ - **clean_match**: It's disabled by default. When enabled, the method ignores all whitespace characters and consecutive spaces while matching.
392
+ - **case_sensitive**: It's enabled by default. As the name implies, disabling it compiles the regex so that it ignores letter case.
393
+
394
  You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
395
  ```python
396
  >>> page.css('.price_color').re(r'[\d\.]+')
 
425
  >>> test_string.re('hi there', clean_match=True, case_sensitive=False)
426
  ['hi There']
427
  ```
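For readers who think in terms of the standard `re` module, the two switches can be approximated as follows. This is a rough sketch under stated assumptions: `re_all` and `normalize` are hypothetical helpers, whitespace normalization is assumed to happen on the text before matching, and `replace_entities` is left out.

```python
import re
from typing import List


def normalize(text: str) -> str:
    # Assumed normalization: collapse every run of whitespace into one space
    return " ".join(text.split())


def re_all(text: str, pattern: str, clean_match: bool = False,
           case_sensitive: bool = True) -> List[str]:
    """Hypothetical helper: return the full text of every match."""
    if clean_match:
        text = normalize(text)
    flags = 0 if case_sensitive else re.IGNORECASE
    return [m.group(0) for m in re.finditer(pattern, text, flags)]


sample = "hi  \n There"
strict = re_all(sample, "hi there")
relaxed = re_all(sample, "hi there", clean_match=True, case_sensitive=False)
```

With the defaults nothing matches, because both the extra whitespace and the capital letter get in the way. Turning both options on yields `['hi There']`.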
428
+ Another benefit of returning `TextHandler` everywhere is that a property like `html_content` returns `TextHandler`, so you can run regex on the HTML content if you want:
429
  ```python
430
  >>> page.html_content.re('div class=".*">(.*)</div')
431
  ['In stock: 5', 'In stock: 3', 'Out of stock']
432
  ```
433
 
434
+ - You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it raises an error:
435
  ```python
436
  >>> page.css_first('#page-data::text')
437
  '\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
438
  >>> page.css_first('#page-data::text').json()
439
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
440
  ```
441
+ Note that if you don't specify a text node while selecting an element (like the text content or an attribute's text content), the element's text content is selected automatically, like this:
442
  ```python
443
  >>> page.css_first('#page-data').json()
444
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
445
  ```
446
+ The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
447
  ```html
448
  <html>
449
  <body>
 
458
  </body>
459
  </html>
460
  ```
461
+ The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
462
  So, as you know, if you did something like this:
463
  ```python
464
  >>> page.css_first('div::text').json()
465
  ```
466
+ You will get an error because the `div` tag doesn't have direct text content that can be serialized to JSON; it actually doesn't have direct text content at all.<br/><br/>
467
  In this case, the `get_all_text` method comes to the rescue, so you can do something like this:
468
  ```python
469
  >>> page.css_first('div').get_all_text(ignore_tags=[]).json()
470
  {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
471
  ```
472
  I used the `ignore_tags` argument here because the default value of it is `('script', 'style',)`, as you are aware.<br/><br/>
473
+ Another related behavior to be aware of occurs when using any of the fetchers, which we will explain later. If you have a JSON response like this example:
474
  ```python
475
+ >>> page = Selector("""{"some_key": "some_value"}""")
476
  ```
477
+ Because the [Selector](#selector) class is optimized to deal with HTML pages, it treats the input as a broken HTML response and fixes it, so the `html_content` property gives you this:
478
  ```python
479
  >>> page.html_content
480
  '<html><body><p>{"some_key": "some_value"}</p></body></html>'
481
  ```
482
+ Here, you can use the `json` method directly, and it will work
483
  ```python
484
  >>> page.json()
485
  {'some_key': 'some_value'}
486
  ```
487
+ You might wonder how this works when the `html` tag doesn't have direct text.<br/>
488
+ Well, for cases like JSON responses, I made the [Selector](#selector) class maintain a raw copy of the content passed to it. This way, when you use the `.json()` method, it checks for that raw copy first and converts it to JSON. If the raw copy is not available, as is the case with elements, it checks the current element's text content; otherwise, it uses the `get_all_text` method directly.<br/><br/>This might sound a bit hacky, but remember that Scrapling is currently optimized to work with HTML pages only, so this is the best way so far to handle JSON responses without sacrificing speed. This will change in upcoming versions.
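The fallback order described above can be written down as a small function. This is a sketch of the decision order only: `to_json` and its three parameters are hypothetical names standing in for the raw copy, the element's direct text, and the result of `get_all_text`.

```python
import json
from typing import Any, Optional


def to_json(raw_body: Optional[str], own_text: str, all_text: str) -> Any:
    """Hypothetical helper mirroring the fallback order described above."""
    if raw_body:                  # 1. raw copy kept for the whole page
        return json.loads(raw_body)
    if own_text.strip():          # 2. the element's direct text content
        return json.loads(own_text)
    return json.loads(all_text)   # 3. text gathered from nested elements


page_result = to_json('{"some_key": "some_value"}', "", "")
element_result = to_json(None, '\n {"totalProducts": 3}\n ', "")
nested_result = to_json(None, "  ", '{"totalProducts": 3}')
```

Each call exercises one branch: a whole-page JSON response, an element with its own JSON text, and an element whose JSON lives only in nested tags.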
489
 
490
+ - Another handy method is `.clean()`, which removes whitespace characters like newlines and carriage returns, collapses consecutive spaces, and returns a new `TextHandler` instance:
491
  ```python
492
  >>> TextHandler('\n wonderful idea, \reh?').clean()
493
  'wonderful idea, eh?'
494
  ```
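If you want the same kind of cleanup on a plain string, a rough equivalent with the standard `re` module could look like this. The `clean` function here is an assumption about the steps involved, chosen so that it reproduces the output of the example above.

```python
import re


def clean(text: str) -> str:
    text = re.sub(r"[\t\r\n]", "", text)  # drop tabs, carriage returns, newlines
    text = re.sub(r" +", " ", text)       # collapse runs of spaces
    return text.strip()                   # trim both ends


cleaned = clean("\n wonderful idea, \reh?")
```

Note that the control characters are removed rather than replaced with spaces, which is why `\reh?` becomes `eh?` without gaining an extra space.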
495
 
496
+ - Another method that might be helpful in some cases is the `.sort()` method to sort the string for you, as you do with lists
497
  ```python
498
  >>> TextHandler('acb').sort()
499
  'abc'
 
507
  Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.
508
 
509
  ## TextHandlers
510
+ You probably guessed it: this class is to [TextHandler](#texthandler) what [Selectors](#selectors) is to [Selector](#selector). It inherits the same logic and methods as standard lists, with only `re` and `re_first` as new methods.
511
 
512
+ The only difference is that the `re_first` method here runs `re` on each [TextHandler](#texthandler) within and returns the first result it finds or `None`. Nothing is new to explain here, but new methods will be added over time.
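The list-level `re` and `re_first` behavior can be illustrated with a short sketch. `MiniTexts` is a hypothetical list of plain strings, not the real class, and the strings are sample values.

```python
import re as regex
from typing import List, Optional


class MiniTexts(list):
    """Hypothetical list of strings; not Scrapling's TextHandlers."""

    def re(self, pattern: str) -> List[str]:
        # Collect the matches from every string into one flat list
        return [m for text in self for m in regex.findall(pattern, text)]

    def re_first(self, pattern: str) -> Optional[str]:
        # Return the first match from the first string that has one
        for text in self:
            found = regex.findall(pattern, text)
            if found:
                return found[0]
        return None


texts = MiniTexts(["In stock", "$51.77", "$53.74"])
all_prices = texts.re(r"[\d.]+")
first_price = texts.re_first(r"[\d.]+")
```

The first string has no digits, so `re_first` skips it and returns the match from the second one.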
513
 
514
  ## AttributesHandler
515
+ This is a read-only version of Python's standard dictionary (`dict`) that's only used to store the attributes of each element, in other words, of each [Selector](#selector) instance.
516
  ```python
517
  >>> print(page.find('script').attrib)
518
  {'id': 'page-data', 'type': 'application/json'}
519
  >>> type(page.find('script').attrib).__name__
520
  'AttributesHandler'
521
  ```
522
+ Because it's read-only, it will use fewer resources than the standard dictionary. Still, it has the same dictionary methods and properties, except those that allow you to modify/override the data.
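If read-only mappings are new to you, the standard library has a comparable construct, `types.MappingProxyType`. The sketch below uses it purely as an analogy for the behavior described here; it says nothing about how `AttributesHandler` is implemented.

```python
from types import MappingProxyType

attrs = MappingProxyType({"id": "page-data", "type": "application/json"})

# Read access works like a normal dict
script_type = attrs["type"]
keys = list(attrs.keys())

# Write access is rejected
try:
    attrs["id"] = "other"
    write_allowed = True
except TypeError:
    write_allowed = False
```

Lookups, iteration, and methods like `keys()` behave as usual, while item assignment fails.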
523
 
524
  It currently adds two extra simple methods:
525
 
 
550
 
551
  Note that I used the `list` function here because `search_values` returns a generator, and a generator object is always truthy, so without `list` the condition would be `True` for all elements.
552
 
553
+ - The `json_string` property
554
 
555
+ This property converts the current attributes to a JSON string if the attributes are JSON serializable; otherwise, it raises an error:
556
+ ```python
557
+ >>> page.find('script').attrib.json_string
558
+ b'{"id":"page-data","type":"application/json"}'
559
+ ```
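Notice that the result above is a compact `bytes` object rather than a `str`. A comparable result with the standard `json` module, shown only as an approximation of the output shape and not as the library's own code path, would be:

```python
import json

attrs = {"id": "page-data", "type": "application/json"}

# Compact separators and UTF-8 bytes reproduce the shape of the output above
json_bytes = json.dumps(attrs, separators=(",", ":")).encode("utf-8")

# Values that JSON cannot represent raise an error instead
try:
    json.dumps({"callback": print})
    serializable = True
except TypeError:
    serializable = False
```

If you need a regular string, decoding the bytes with `.decode()` is enough.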