Spaces:

lenson78
/

Scrapling

Paused

App Files Files Community

Karim shoair commited on Feb 10

Commit

960e783

1 Parent(s): e6492de

docs: updating `main classes` page and some corrections

Browse files

Files changed (1) hide show

docs/parsing/main_classes.md +67 -28

docs/parsing/main_classes.md CHANGED Viewed

@@ -165,10 +165,10 @@ print(article.prettify())
     <div class="hidden stock">In stock: 5</div>
 </article>
 ```
-Use the `.body` property to get the raw content of the page
 ```python
 >>> page.body
-'<html>\n  <head>\n    <title>Some page</title>\n  </head>\n  <body>\n    <div class="product-list">\n      <article class="product" data-id="1">\n        <h3>Product 1</h3>\n        <p class="description">This is product 1</p>\n        <span class="price">$10.99</span>\n        <div class="hidden stock">In stock: 5</div>\n      </article>\n\n      <article class="product" data-id="2">\n        <h3>Product 2</h3>\n        <p class="description">This is product 2</p>\n        <span class="price">$20.99</span>\n        <div class="hidden stock">In stock: 3</div>\n      </article>\n\n      <article class="product" data-id="3">\n        <h3>Product 3</h3>\n        <p class="description">This is product 3</p>\n        <span class="price">$15.99</span>\n        <div class="hidden stock">Out of stock</div>\n      </article>\n    </div>\n\n    <script id="page-data" type="application/json">\n      {\n        "lastUpdated": "2024-09-22T10:30:00Z",\n        "totalProducts": 3\n      }\n    </script>\n  </body>\n</html>'
 ```
 To get all the ancestors in the DOM tree of this element
 ```python
@@ -233,7 +233,7 @@ This element returns the same result as the `children` property because its chil
 Another example of using the element with the `product-list` class will clear the difference between the `children` property and the `below_elements` property
 ```python
->>> products_list = page.css_first('.product-list')
 >>> products_list.children
 [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
  <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
@@ -262,7 +262,7 @@ Get the next element of the current element
 The same logic applies to the `previous` property
 ```python
 >>> article.previous  # It's the first child, so it doesn't have a previous element
->>> second_article = page.css_first('.product[data-id="2"]')
 >>> second_article.previous
 <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
 ```
@@ -287,27 +287,57 @@ You can search for a specific ancestor of an element that satisfies a search fun
 ## Selectors
 The class `Selectors` is the "List" version of the [Selector](#selector) class. It inherits from the Python standard `List` type, so it shares all `List` properties and methods while adding more methods to make the operations you want to execute on the [Selector](#selector) instances within more straightforward.
-In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance. The only exceptions are when you use the CSS/XPath methods as follows:
-- If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
-    ```python
-    >>> page.css('a::text')              # -> TextHandlers
-    >>> page.xpath('//a/text()')         # -> TextHandlers
-    >>> page.css_first('a::text')        # -> TextHandler
-    >>> page.xpath_first('//a/text()')   # -> TextHandler
-    >>> page.css('a::attr(href)')        # -> TextHandlers
-    >>> page.xpath('//a/@href')          # -> TextHandlers
-    >>> page.css_first('a::attr(href)')  # -> TextHandler
-    >>> page.xpath_first('//a/@href')    # -> TextHandler
-    ```
-- If you used a combined selector that returns mixed types, the result will be a Python standard `List`. <br/>Examples:
-  ```python
-  >>> page.css('.price_color')                               # -> Selectors
-  >>> page.css('.product_pod a::attr(href)')                # -> TextHandlers
-  >>> page.css('.price_color, .product_pod a::attr(href)')  # -> List
-  ```
-Let's see what [Selectors](#selectors) class adds to the table with that out of the way.
 ### Properties
 Apart from the standard operations on Python lists, such as iteration and slicing.
@@ -369,6 +399,15 @@ You can use the `filter` method, too, which takes a function like the `search` m
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 ...]
 ```
 If you are too lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance. You can do this:
 ```python
 page.css('.product_pod').length
@@ -440,14 +479,14 @@ First, we start with the `re` and `re_first` methods. These are the same methods
 - You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
   ```python
-  >>> page.css_first('#page-data::text')
     '\n      {\n        "lastUpdated": "2024-09-22T10:30:00Z",\n        "totalProducts": 3\n      }\n    '
-  >>> page.css_first('#page-data::text').json()
     {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
   ```
   Hence, if you didn't specify a text node while selecting an element (like the text content or an attribute text content), the text content will be selected automatically, like this
   ```python
-  >>> page.css_first('#page-data').json()
   {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
   ```
   The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
@@ -468,12 +507,12 @@ First, we start with the `re` and `re_first` methods. These are the same methods
   The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
   So, as you know here, if you did something like this
   ```python
-  >>> page.css_first('div::text').json()
   ```
   You will get an error because the `div` tag doesn't have any direct text content that can be serialized to JSON; it doesn't have any direct text content at all.<br/><br/>
   In this case, the `get_all_text` method comes to the rescue, so you can do something like that
   ```python
-  >>> page.css_first('div').get_all_text(ignore_tags=[]).json()
     {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
   ```
   I used the `ignore_tags` argument here because the default value of it is `('script', 'style',)`, as you are aware.<br/><br/>

     <div class="hidden stock">In stock: 5</div>
 </article>
 ```
+Use the `.body` property to get the raw content of the page. Starting from v0.4, when used on a `Response` object from fetchers, `.body` always returns `bytes`.
 ```python
 >>> page.body
+'<html>\n  <head>\n    <title>Some page</title>\n  </head>\n  ...'
 ```
 To get all the ancestors in the DOM tree of this element
 ```python
 Another example of using the element with the `product-list` class will clear the difference between the `children` property and the `below_elements` property
 ```python
+>>> products_list = page.css('.product-list')[0]
 >>> products_list.children
 [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
  <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 The same logic applies to the `previous` property
 ```python
 >>> article.previous  # It's the first child, so it doesn't have a previous element
+>>> second_article = page.css('.product[data-id="2"]')[0]
 >>> second_article.previous
 <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
 ```
 ## Selectors
 The class `Selectors` is the "List" version of the [Selector](#selector) class. It inherits from the Python standard `List` type, so it shares all `List` properties and methods while adding more methods to make the operations you want to execute on the [Selector](#selector) instances within more straightforward.
+In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
+Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties gracefully return empty/default values.
+```python
+>>> page.css('a::text')              # -> Selectors (of text node Selectors)
+>>> page.xpath('//a/text()')         # -> Selectors
+>>> page.css('a::text').get()        # -> TextHandler (the first text value)
+>>> page.css('a::text').getall()     # -> TextHandlers (all text values)
+>>> page.css('a::attr(href)')        # -> Selectors
+>>> page.xpath('//a/@href')          # -> Selectors
+>>> page.css('.price_color')         # -> Selectors
+```
+### Data extraction methods
+Starting with v0.4, [Selector](#selector) and [Selectors](#selectors) both provide `get()`, `getall()`, and their aliases `extract_first` and `extract` (following Scrapy conventions). The old `get_all()` method has been removed.
+**On a [Selector](#selector) object:**
+- `get()` returns a `TextHandler` — for text node selectors, it returns the text value; for HTML element selectors, it returns the serialized outer HTML.
+- `getall()` returns a `TextHandlers` list containing the single serialized string.
+- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
+```python
+>>> page.css('h3')[0].get()        # Outer HTML of the element
+'<h3>Product 1</h3>'
+>>> page.css('h3::text')[0].get()  # Text value of the text node
+'Product 1'
+```
+**On a [Selectors](#selectors) object:**
+- `get(default=None)` returns the serialized string of the **first** element, or `default` if the list is empty.
+- `getall()` serializes **all** elements and returns a `TextHandlers` list.
+- `extract_first` is an alias for `get()`, and `extract` is an alias for `getall()`.
+```python
+>>> page.css('.price::text').get()      # First price text
+'$10.99'
+>>> page.css('.price::text').getall()   # All price texts
+['$10.99', '$20.99', '$15.99']
+>>> page.css('.price::text').get('')    # With default value
+'$10.99'
+```
+These methods work seamlessly with all selection types (CSS, XPath, `find`, etc.) and are the recommended way to extract text and attribute values in a Scrapy-compatible style.
+Now, let's see what [Selectors](#selectors) class adds to the table with that out of the way.
 ### Properties
 Apart from the standard operations on Python lists, such as iteration and slicing.
  <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 ...]
 ```
+You can safely access the first or last element without worrying about index errors:
+```python
+>>> page.css('.product').first   # First Selector or None
+<data='<article class="product" data-id="1"><h3...'>
+>>> page.css('.product').last    # Last Selector or None
+<data='<article class="product" data-id="3"><h3...'>
+>>> page.css('.nonexistent').first  # Returns None instead of raising IndexError
+```
 If you are too lazy like me and want to know the number of [Selector](#selector) instances in a [Selectors](#selectors) instance. You can do this:
 ```python
 page.css('.product_pod').length
 - You also have the `.json()` method, which tries to convert the content to a JSON object quickly if possible; otherwise, it throws an error
   ```python
+  >>> page.css('#page-data::text').get()
     '\n      {\n        "lastUpdated": "2024-09-22T10:30:00Z",\n        "totalProducts": 3\n      }\n    '
+  >>> page.css('#page-data::text').get().json()
     {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
   ```
   Hence, if you didn't specify a text node while selecting an element (like the text content or an attribute text content), the text content will be selected automatically, like this
   ```python
+  >>> page.css('#page-data')[0].json()
   {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
   ```
   The [Selector](#selector) class adds one thing here, too; let's say this is the page we are working with:
   The [Selector](#selector) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
   So, as you know here, if you did something like this
   ```python
+  >>> page.css('div::text').get().json()
   ```
   You will get an error because the `div` tag doesn't have any direct text content that can be serialized to JSON; it doesn't have any direct text content at all.<br/><br/>
   In this case, the `get_all_text` method comes to the rescue, so you can do something like that
   ```python
+  >>> page.css('div')[0].get_all_text(ignore_tags=[]).json()
     {'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
   ```
   I used the `ignore_tags` argument here because the default value of it is `('script', 'style',)`, as you are aware.<br/><br/>