,
...]
```
You can safely access the first or last element without worrying about index errors:
```python
>>> page.css('.product').first # First Selector or None
>>> page.css('.product').last # Last Selector or None
>>> page.css('.nonexistent').first # Returns None instead of raising IndexError
```
If you are as lazy as I am and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance, you can do this:
```python
page.css('.product_pod').length
```
which is equivalent to
```python
len(page.css('.product_pod'))
```
Yup, like JavaScript :)
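For intuition, these conveniences can be sketched on top of a plain `list`. The `SafeList` class below is a hypothetical stand-in written for illustration only; it is not how the [Selectors](#selectors) class is actually implemented.

```python
class SafeList(list):
    """Hypothetical list subclass mimicking `first`, `last`, and `length`."""

    @property
    def first(self):
        # Return the first item, or None when the list is empty
        return self[0] if self else None

    @property
    def last(self):
        # Return the last item, or None when the list is empty
        return self[-1] if self else None

    @property
    def length(self):
        # Same value as len(self)
        return len(self)


products = SafeList(["a", "b", "c"])
empty = SafeList()
print(products.first, products.last, products.length)  # a c 3
print(empty.first, empty.last, empty.length)  # None None 0
```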
## TextHandler
You need to understand this class: every method or property that would return a string returns a `TextHandler` instead, and every one that would return a list of strings returns [TextHandlers](#texthandlers).
TextHandler is a subclass of the standard Python string, so you can do anything with it that you can do with a Python string. So why does it need a different name?
Because TextHandler provides extra methods and properties that standard Python strings don't have. We will review them now, but remember that every method and property in every class that returns string(s) returns TextHandler, which opens the door for creativity and makes the code shorter and cleaner, as you will see. You can also import it directly and use it on any string, which we will explain [later](../development/scrapling_custom_types.md).
### Usage
First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a `TextHandler` again, so you can chain them as you want. If you find a method or property that returns a standard string instead of `TextHandler`, please open an issue, and we will override it as well.
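To see why overriding is needed at all, here is a minimal sketch of a `str` subclass that keeps its own type across common operations. `ChainableStr` and its `shout` helper are made up for this illustration and are not part of Scrapling; without the overrides, `strip`, `replace`, and slicing would hand back a plain `str`.

```python
class ChainableStr(str):
    """Hypothetical str subclass whose common operations keep the subclass type."""

    def __getitem__(self, key):
        # Slicing and indexing return ChainableStr instead of plain str
        return ChainableStr(super().__getitem__(key))

    def strip(self, chars=None):
        return ChainableStr(super().strip(chars))

    def replace(self, old, new, count=-1):
        return ChainableStr(super().replace(old, new, count))

    def split(self, sep=None, maxsplit=-1):
        return [ChainableStr(part) for part in super().split(sep, maxsplit)]

    def shout(self):
        # Extra helper (invented for this sketch) to show that chaining still works
        return ChainableStr(self.upper() + "!")


text = ChainableStr("  price: 51.77  ")
result = text.strip().replace("price: ", "")[:2].shout()
print(type(result).__name__, result)  # ChainableStr 51!
```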
We start with the `re` and `re_first` methods. These are the same methods that exist in the other classes ([Selector](#selector), [Selectors](#selectors), and [TextHandlers](#texthandlers)), so they accept the same arguments.
- The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but, as you probably figured out from the name, it returns only the first result as a `TextHandler` instance.
Both methods also take other helpful arguments:
- **replace_entities**: This is enabled by default. It replaces character entity references with their corresponding characters.
- **clean_match**: Disabled by default. When enabled, the text is cleaned of extra whitespace, including consecutive spaces, before the regex is matched.
- **case_sensitive**: Enabled by default. As the name implies, disabling it compiles the regex to ignore letter case.
You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
```python
>>> page.css('.price_color').re(r'[\d\.]+')
['51.77',
'53.74',
'50.10',
'47.82',
'54.23',
...]
>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000',
'tipping-the-velvet_999',
'soumission_998',
'sharp-objects_997',
...]
```
To explain the other arguments better, we will use a custom string for each example below:
```python
>>> from scrapling import TextHandler
>>> test_string = TextHandler('hi  there')  # Note the two spaces
>>> test_string.re('hi there')
[]
>>> test_string.re('hi there', clean_match=True)  # `clean_match` cleans the string before matching the regex
['hi there']
>>> test_string2 = TextHandler('Oh, Hi Mark')
>>> test_string2.re_first('oh, hi Mark')
>>> test_string2.re_first('oh, hi Mark', case_sensitive=False)  # Note that `case_sensitive` is disabled
'Oh, Hi Mark'
# Mixing arguments
>>> test_string.re('Hi There', clean_match=True, case_sensitive=False)
['hi there']
```
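The effect of the three arguments can be approximated with the standard library alone. The `re_sketch` function below is an assumption-laden illustration, not the library's code: it always returns the full match, it cleans whitespace with a simple regex, and it unescapes entities with `html.unescape`.

```python
import re
from html import unescape


def re_sketch(text, pattern, replace_entities=True, clean_match=False, case_sensitive=True):
    """Rough stdlib approximation of the three arguments described above."""
    if clean_match:
        # Collapse every run of whitespace to one space before matching
        text = re.sub(r"\s+", " ", text).strip()
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = [m.group(0) for m in re.finditer(pattern, text, flags)]
    if replace_entities:
        # Turn references such as &amp; back into their characters
        matches = [unescape(m) for m in matches]
    return matches


print(re_sketch("hi  there", "hi there"))  # []
print(re_sketch("hi  there", "hi there", clean_match=True))  # ['hi there']
print(re_sketch("Oh, Hi Mark", "oh, hi mark", case_sensitive=False))  # ['Oh, Hi Mark']
print(re_sketch("Tom &amp; Jerry", r"Tom.*"))  # ['Tom & Jerry']
```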
Another benefit of returning `TextHandler` everywhere is that a property like `html_content` is a `TextHandler` too, so you can run a regex on the HTML content directly if you want:
```python
>>> page.html_content.re('div class=".*">(.*)
# The `.json()` method converts JSON text into a Python object
>>> page.css('#page-data::text').get()
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
>>> page.css('#page-data::text').get().json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
Note that if you don't specify a text node while selecting an element (like the text content or an attribute's text content), the text content is selected automatically, like this:
```python
>>> page.css('#page-data')[0].json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
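```html
<html>
  <body>
    <div>
      <script id="page-data" type="application/json">
        {
          "lastUpdated": "2024-09-22T10:30:00Z",
          "totalProducts": 3
        }
      </script>
    </div>
  </body>
</html>
```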
The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.
So, as you would expect, if you do something like this:
```python
>>> page.css('div::text').get().json()
```
You will get an error because the `div` tag has no direct text content at all, so there is nothing to serialize to JSON.
In this case, the `get_all_text` method comes to the rescue, so you can do something like this:
```python
>>> page.css('div')[0].get_all_text(ignore_tags=[]).json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware, which would otherwise skip the text inside the `script` tag.
Another related behavior to be aware of occurs when using any fetcher, which we will explain later. If you have a JSON response like this example:
```python
>>> page = Selector("""{"some_key": "some_value"}""")
```
Because the [Selector](#selector) class is optimized to deal with HTML pages, it treats this as a broken HTML response and repairs it, so the `html_content` property gives you this:
```python
>>> page.html_content
'<html><body><p>{"some_key": "some_value"}</p></body></html>'
```
Here, you can use the `json` method directly, and it will work:
```python
>>> page.json()
{'some_key': 'some_value'}
```
You might wonder how this happened, given that the `html` tag doesn't contain direct text.
Well, for cases like JSON responses, I made the [Selector](#selector) class keep a raw copy of the content it receives. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is unavailable, as with the elements, it checks the current element's text content; otherwise, it uses the `get_all_text` method directly.
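The fallback order just described can be sketched as a plain function. The function name and the `raw`, `own_text`, and `all_text` arguments are assumptions made for this illustration; they are not the names used inside the library.

```python
import json


def to_json(raw=None, own_text="", all_text=""):
    """Sketch of the fallback order: raw copy, then direct text, then all text."""
    # 1. Prefer the raw copy of the received content, when one was kept
    if raw:
        return json.loads(raw)
    # 2. Otherwise use the element's own direct text, if there is any
    if own_text.strip():
        return json.loads(own_text)
    # 3. Finally fall back to the text gathered from all descendants
    return json.loads(all_text)


print(to_json(raw='{"some_key": "some_value"}'))  # {'some_key': 'some_value'}
print(to_json(own_text='\n {"totalProducts": 3}\n '))  # {'totalProducts': 3}
print(to_json(all_text='{"totalProducts": 3}'))  # {'totalProducts': 3}
```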
- Another handy method is `.clean()`, which removes whitespace characters such as newlines and carriage returns, collapses consecutive spaces, and returns a new `TextHandler` instance:
```python
>>> TextHandler('\n wonderful idea, \reh?').clean()
'wonderful idea, eh?'
```
Also, you can pass the `remove_entities` argument to make `clean` replace HTML entities with their corresponding characters.
- Another method that might be helpful in some cases is `.sort()`, which sorts the characters of the string the way you would sort a list:
```python
>>> TextHandler('acb').sort()
'abc'
```
Or do it in reverse:
```python
>>> TextHandler('acb').sort(reverse=True)
'cba'
```
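Both helpers are easy to approximate with the standard library. The sketch below assumes that cleaning means treating tabs, carriage returns, and newlines as spaces, collapsing runs of spaces, and trimming the ends; the real `.clean()` may differ in details, and `sort_chars` is a name invented here.

```python
import re
from html import unescape


def clean(text, remove_entities=False):
    # Treat tabs, carriage returns, and newlines as spaces (an assumption of this sketch)
    text = text.translate(str.maketrans("\t\r\n", "   "))
    if remove_entities:
        # Replace references such as &amp; with their characters
        text = unescape(text)
    # Collapse runs of spaces and trim both ends
    return re.sub(r" +", " ", text).strip()


def sort_chars(text, reverse=False):
    # Sort the characters, then join them back into a string
    return "".join(sorted(text, reverse=reverse))


print(clean("\n wonderful idea, \reh?"))  # wonderful idea, eh?
print(sort_chars("acb"))  # abc
print(sort_chars("acb", reverse=True))  # cba
```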
Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.
## TextHandlers
You probably guessed it: this class relates to [TextHandler](#texthandler) the way [Selectors](#selectors) relates to [Selector](#selector). It inherits the same logic and methods as standard lists, with only `re` and `re_first` as new methods.
The only difference is that the `re_first` method logic here runs `re` on each [TextHandler](#texthandler) and returns the first result, or `None`. Nothing new needs to be explained here, but new methods will be added over time.
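As a rough picture of that behavior, the sketch below walks a plain list of strings and returns the first regex match it finds. It uses `re.findall` from the standard library and a made-up function name, so treat it as an illustration of the idea rather than the library's implementation.

```python
import re


def re_first_in_list(strings, pattern):
    """Sketch: try each string in order and return the first match, or None."""
    for text in strings:
        found = re.findall(pattern, text)
        if found:
            return found[0]
    return None


prices = ["n/a", "£51.77", "£53.74"]
print(re_first_in_list(prices, r"[\d.]+"))  # 51.77
print(re_first_in_list(prices, r"[a-z]{5}"))  # None
```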
## AttributesHandler
This is a read-only version of Python's standard dictionary, or `dict`, used solely to store the attributes of each element/[Selector](#selector) instance.
```python
>>> print(page.find('script').attrib)
{'id': 'page-data', 'type': 'application/json'}
>>> type(page.find('script').attrib).__name__
'AttributesHandler'
```
Because it's read-only, it will use fewer resources than the standard dictionary. Still, it has the same methods and properties as a dictionary, except those that allow you to modify/override the data.
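One common way to get a read-only dictionary in Python is to implement `collections.abc.Mapping`, which supplies `get`, `keys`, `items`, and friends but no mutating methods. The `ReadOnlyAttrs` class below is a hypothetical sketch of that approach; the real class may be built differently.

```python
from collections.abc import Mapping


class ReadOnlyAttrs(Mapping):
    """Hypothetical read-only mapping built on the Mapping abstract base class."""

    __slots__ = ("_data",)

    def __init__(self, data):
        # Keep a private copy so later changes to `data` don't leak in
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


attrs = ReadOnlyAttrs({"id": "page-data", "type": "application/json"})
print(attrs["id"], attrs.get("missing"), list(attrs.keys()))

try:
    attrs["id"] = "other"  # No __setitem__ is defined, so this fails
except TypeError as error:
    print("read-only:", error)
```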
It currently adds two extra simple methods:
- The `search_values` method
In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values rather than keys, you will need some additional lines of code. This method does that for you. It searches the current attributes by value and yields a dictionary for each matching item.
A simple example would be:
```python
>>> for i in page.find('script').attrib.search_values('page-data'):
...     print(i)
{'id': 'page-data'}
```
But this method provides the `partial` argument as well, which allows you to search by part of the value:
```python
>>> for i in page.find('script').attrib.search_values('page', partial=True):
...     print(i)
{'id': 'page-data'}
```
You are unlikely to need these exact examples in the real world; a more realistic use is to combine it with the `find_all` method to find all elements that have a specific value in their attributes:
```python
>>> page.find_all(lambda element: list(element.attrib.search_values('product')))
[,
,
]
```
All these elements have 'product' as the value for the `class` attribute.
Note that I used the `list` function here because `search_values` returns a generator, and a generator object is always truthy, so without `list` the condition would be `True` for all elements.
- The `json_string` property
This property converts current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error.
```python
>>> page.find('script').attrib.json_string
b'{"id":"page-data","type":"application/json"}'
```
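The two helpers above can be imitated over a plain `dict`, which also makes the truthiness point about generators concrete. The standalone `search_values` function and the use of `json.dumps` here are assumptions for illustration; the library may serialize with a different JSON encoder.

```python
import json


def search_values(attrs, keyword, partial=False):
    """Yield a one-item dict for every attribute whose value matches."""
    for key, value in attrs.items():
        matched = keyword in value if partial else keyword == value
        if matched:
            yield {key: value}


attrs = {"id": "page-data", "type": "application/json"}
print(list(search_values(attrs, "page-data")))  # [{'id': 'page-data'}]
print(list(search_values(attrs, "page", partial=True)))  # [{'id': 'page-data'}]

# A generator object is truthy even when it would yield nothing
print(bool(search_values(attrs, "missing")))  # True
print(bool(list(search_values(attrs, "missing"))))  # False

# Compact JSON encoded to bytes, similar in shape to the output shown above
json_bytes = json.dumps(attrs, separators=(",", ":")).encode()
print(json_bytes)  # b'{"id":"page-data","type":"application/json"}'
```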