| # Writing your retrieval system |
|
|
| Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system to store element properties for the `adaptive` feature. |
|
|
| You might want to use Firebase, for example, and share the database between multiple spiders on different machines. It's a great idea to use an online database like that because spiders can share adaptive data with each other. |
|
|
So first, to make your storage class work, it must meet three requirements:
|
|
1. Inherit from the abstract class `scrapling.core.storage.StorageSystemMixin` and accept a string argument on initialization, which will be the `url` argument, to maintain the library's logic.
2. Use the decorator `functools.lru_cache` on top of the class to follow the Singleton design pattern, as the other storage classes do.
3. Implement the methods `save` and `retrieve` with the following signatures:
   - The method `save` returns nothing and receives two arguments from the library:
     * The first is of type `lxml.html.HtmlElement`: the element itself. It must be converted to a dictionary with the `element_to_dict` function from `scrapling.core.utils._StorageTools` to maintain the same format, then saved to your database however you wish.
     * The second is a string: the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the `adaptive` data will get mixed up.
   - The method `retrieve` takes a string (the identifier); together with the `url` passed on initialization, it looks up the element's dictionary in the database and returns it if it exists; otherwise, it returns `None`.
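To make the contract concrete, here's a minimal sketch of that interface backed by a plain in-memory dictionary. This is illustrative only: a real implementation must inherit from `StorageSystemMixin`, and its `save` receives an `lxml.html.HtmlElement` that you convert with `_StorageTools.element_to_dict` — here the sketch takes the already-converted dictionary to stay self-contained:

```python
from functools import lru_cache


@lru_cache(None)  # one shared instance per unique `url`, Singleton-style
class DictStorage:
    """Illustrative stand-in; a real class inherits StorageSystemMixin."""

    def __init__(self, url: str):
        self.url = url      # kept from initialization, as the library expects
        self._store = {}    # maps (url, identifier) -> element dictionary

    def save(self, element_dict: dict, identifier: str) -> None:
        # The (url, identifier) pair must be the unique key for each row
        self._store[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        # Return the saved dictionary, or None if nothing was stored
        return self._store.get((self.url, identifier))
```

Note how `lru_cache` makes repeated construction with the same `url` return the same instance, which is exactly the Singleton behavior requirement 2 asks for.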
|
|
> If the instructions aren't clear enough, you can check my SQLite3 implementation in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py) file
|
|
If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, make sure your class is thread-safe as well; the default class is.
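If your backend's client isn't thread-safe on its own, one simple approach is to guard `save` and `retrieve` with a lock. A sketch, with an in-memory dict standing in for your database:

```python
import threading


class ThreadSafeStorage:
    """Sketch: serializes access to a non-thread-safe backing store."""

    def __init__(self, url: str):
        self.url = url
        self._store = {}
        self._lock = threading.Lock()  # guards every read and write

    def save(self, element_dict: dict, identifier: str) -> None:
        with self._lock:
            self._store[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        with self._lock:
            return self._store.get((self.url, identifier))
```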
|
|
Some helper functions are available on the abstract class if you want to use them. It's easier to see them for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py); it's heavily commented :)
|
|
|
|
| ## Real-World Example: Redis Storage |
|
|
Here's a more practical example, generated with AI assistance, that uses Redis:
|
|
```python
import redis
import orjson
from functools import lru_cache

from lxml.html import HtmlElement
from scrapling.core.storage import StorageSystemMixin
from scrapling.core.utils import _StorageTools


@lru_cache(None)  # Singleton per unique set of arguments, like the default class
class RedisStorage(StorageSystemMixin):
    def __init__(self, url: str = None, host: str = 'localhost', port: int = 6379, db: int = 0):
        # `url` comes first so the class matches the library's initialization logic
        super().__init__(url)
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=False  # keep raw bytes; orjson handles them directly
        )

    def save(self, element: HtmlElement, identifier: str) -> None:
        # Convert the element to a dictionary in the library's standard format
        element_dict = _StorageTools.element_to_dict(element)

        # The (url, identifier) combination must be unique, so both go into the key
        key = f"scrapling:{self._get_base_url()}:{identifier}"

        # Store as JSON bytes
        self.redis.set(key, orjson.dumps(element_dict))

    def retrieve(self, identifier: str) -> dict | None:
        # Rebuild the same key used in `save`
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        data = self.redis.get(key)

        # Parse the JSON if the key exists; otherwise return None
        if data:
            return orjson.loads(data)
        return None
```

Note that `redis-py` clients share a thread-safe connection pool, so this class also satisfies the thread-safety recommendation above.