# Writing your retrieval system
Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system for the element properties that power the `adaptive` feature.

You might want to use Firebase, for example, and share the database between multiple spiders running on different machines. An online database like that is a great choice because the spiders can share adaptive data with each other.
So first, for your storage class to work, it must meet three requirements:
1. Inherit from the abstract class `scrapling.core.storage.StorageSystemMixin` and accept a string argument, which will be the `url` argument, to maintain the library logic.
2. Use the `functools.lru_cache` decorator on top of the class to follow the Singleton design pattern, as the other storage classes do.
3. Implement the methods `save` and `retrieve`, as the type hints indicate:
    - The method `save` returns nothing and receives two arguments from the library:
        * The first is of type `lxml.html.HtmlElement`, the element itself. It must be converted to a dictionary using the `element_to_dict` function of `_StorageTools` in `scrapling.core.utils` to maintain the same format, and then saved to your database as you wish.
        * The second is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the `adaptive` data will get mixed up.
    - The method `retrieve` takes a string, the identifier; together with the `url` passed at initialization, it looks up the element's dictionary in the database and returns it if it exists; otherwise, it returns `None`.
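The contract above can be sketched with a minimal in-memory class. Note that `InMemoryStorage` is a hypothetical stand-in so the snippet runs standalone: a real implementation would inherit from Scrapling's `StorageSystemMixin` and would receive an `lxml.html.HtmlElement` in `save` (converted via `_StorageTools.element_to_dict`), not an already-built dictionary as here.

```python
from functools import lru_cache


@lru_cache(None)  # one instance per unique `url`, mimicking the Singleton rule
class InMemoryStorage:
    """Hypothetical stand-in illustrating the save/retrieve contract."""

    def __init__(self, url: str):
        self.url = url
        self._table = {}  # maps (url, identifier) -> element dictionary

    def save(self, element_dict: dict, identifier: str) -> None:
        # The (url, identifier) pair is the unique key, as the rules require
        self._table[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        # Return the stored dictionary, or None when nothing was saved
        return self._table.get((self.url, identifier))


storage = InMemoryStorage("https://example.com")
storage.save({"tag": "div", "attributes": {"id": "price"}}, "price-element")
print(storage.retrieve("price-element"))  # the saved dictionary
print(storage.retrieve("missing"))        # None
```

Because of `lru_cache`, constructing `InMemoryStorage("https://example.com")` again returns the same cached instance, which is exactly the Singleton behavior requirement 2 asks for.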
> If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py) file.
If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, make sure your class supports that; the default class is thread-safe.

Some helper functions are included in the abstract class in case you want to use them. It's easiest to see for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py); it's heavily commented :)
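If your backend client is not thread-safe on its own, one simple option is to guard `save` and `retrieve` with a lock. This is a hedged sketch, reusing the hypothetical in-memory dictionary idea rather than a real database client:

```python
import threading
from functools import lru_cache


@lru_cache(None)
class ThreadSafeMemoryStorage:
    """Hypothetical in-memory store that is safe to share across threads."""

    def __init__(self, url: str):
        self.url = url
        self._table = {}
        self._lock = threading.Lock()  # serializes all reads and writes

    def save(self, element_dict: dict, identifier: str) -> None:
        with self._lock:
            self._table[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        with self._lock:
            return self._table.get((self.url, identifier))
```

A coarse lock like this is the simplest correct choice; clients such as `redis-py` already handle per-command thread safety, in which case no extra locking is needed.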
## Real-World Example: Redis Storage

Here's a more practical example, generated by AI, using Redis:
```python
import redis
import orjson
from functools import lru_cache

from scrapling.core.storage import StorageSystemMixin
from scrapling.core.utils import _StorageTools


@lru_cache(None)
class RedisStorage(StorageSystemMixin):
    # `url` comes first so the library can pass it positionally
    def __init__(self, url: str = None, host: str = 'localhost', port: int = 6379, db: int = 0):
        super().__init__(url)
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=False,
        )

    def save(self, element, identifier: str) -> None:
        # Convert the element to a dictionary in the library's format
        element_dict = _StorageTools.element_to_dict(element)
        # Build a key that is unique per (url, identifier) pair
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        # Store the dictionary as JSON bytes
        self.redis.set(key, orjson.dumps(element_dict))

    def retrieve(self, identifier: str) -> dict | None:
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        data = self.redis.get(key)
        # Parse the JSON if the key exists; otherwise return None
        if data:
            return orjson.loads(data)
        return None
```