# Writing your retrieval system
Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system for the element properties that power the `adaptive` feature.

You might want to use Firebase, for example, and share the database between multiple spiders running on different machines. An online database like that is a great choice because the spiders can share adaptive data with each other.
So first, for your storage class to work, it must meet three requirements:
1. Inherit from the abstract class `scrapling.core.storage.StorageSystemMixin` and accept a string argument, which will be the `url` argument, to maintain the library logic.
2. Use the `functools.lru_cache` decorator on top of the class to follow the Singleton design pattern, as the other storage classes do.
3. Implement the methods `save` and `retrieve`, as the type hints indicate:
    - The method `save` returns nothing and receives two arguments from the library:
        * The first is of type `lxml.html.HtmlElement`, the element itself. It must be converted to a dictionary using the `element_to_dict` function of `_StorageTools` in `scrapling.core.utils` to maintain the same format, and then saved to your database as you wish.
        * The second is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the `adaptive` data will get mixed up.
    - The method `retrieve` takes a string, the identifier; together with the `url` passed at initialization, it looks up the element's dictionary in the database and returns it if it exists; otherwise, it returns `None`.
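The contract above can be sketched with a minimal in-memory class. Note that `InMemoryStorage` is a hypothetical stand-in so the snippet runs standalone: a real implementation would inherit from Scrapling's `StorageSystemMixin` and would receive an `lxml.html.HtmlElement` in `save` (converted via `_StorageTools.element_to_dict`), not an already-built dictionary as here.

```python
from functools import lru_cache


@lru_cache(None)  # one instance per unique `url`, mimicking the Singleton rule
class InMemoryStorage:
    """Hypothetical stand-in illustrating the save/retrieve contract."""

    def __init__(self, url: str):
        self.url = url
        self._table = {}  # maps (url, identifier) -> element dictionary

    def save(self, element_dict: dict, identifier: str) -> None:
        # The (url, identifier) pair is the unique key, as the rules require
        self._table[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        # Return the stored dictionary, or None when nothing was saved
        return self._table.get((self.url, identifier))


storage = InMemoryStorage("https://example.com")
storage.save({"tag": "div", "attributes": {"id": "price"}}, "price-element")
print(storage.retrieve("price-element"))  # the saved dictionary
print(storage.retrieve("missing"))        # None
```

Because of `lru_cache`, constructing `InMemoryStorage("https://example.com")` again returns the same cached instance, which is exactly the Singleton behavior requirement 2 asks for.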
> If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py) file.
If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, make sure your class supports that; the default class is thread-safe.

Some helper functions are included in the abstract class in case you want to use them. It's easiest to see for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py); it's heavily commented :)
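If your backend client is not thread-safe on its own, one simple option is to guard `save` and `retrieve` with a lock. This is a hedged sketch, reusing the hypothetical in-memory dictionary idea rather than a real database client:

```python
import threading
from functools import lru_cache


@lru_cache(None)
class ThreadSafeMemoryStorage:
    """Hypothetical in-memory store that is safe to share across threads."""

    def __init__(self, url: str):
        self.url = url
        self._table = {}
        self._lock = threading.Lock()  # serializes all reads and writes

    def save(self, element_dict: dict, identifier: str) -> None:
        with self._lock:
            self._table[(self.url, identifier)] = element_dict

    def retrieve(self, identifier: str):
        with self._lock:
            return self._table.get((self.url, identifier))
```

A coarse lock like this is the simplest correct choice; clients such as `redis-py` already handle per-command thread safety, in which case no extra locking is needed.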
## Real-World Example: Redis Storage

Here's a more practical example, generated by AI, using Redis:
```python
import redis
import orjson
from functools import lru_cache

from scrapling.core.storage import StorageSystemMixin
from scrapling.core.utils import _StorageTools


@lru_cache(None)
class RedisStorage(StorageSystemMixin):
    # `url` comes first so the library can pass it positionally
    def __init__(self, url: str = None, host: str = 'localhost', port: int = 6379, db: int = 0):
        super().__init__(url)
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=False,
        )

    def save(self, element, identifier: str) -> None:
        # Convert the element to a dictionary in the library's format
        element_dict = _StorageTools.element_to_dict(element)
        # Build a key that is unique per (url, identifier) pair
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        # Store the dictionary as JSON bytes
        self.redis.set(key, orjson.dumps(element_dict))

    def retrieve(self, identifier: str) -> dict | None:
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        data = self.redis.get(key)
        # Parse the JSON if the key exists; otherwise return None
        if data:
            return orjson.loads(data)
        return None
```