Karim shoair committed on
Commit ·
7b427e7
1
Parent(s): 4893321
docs: update storage dev tutorial
docs/development/{automatch_storage_system.md → adaptive_storage_system.md}
RENAMED
|
@@ -1,22 +1,22 @@
|
|
| 1 |
-
Scrapling uses SQLite by default, but this tutorial covers writing your storage system to store element properties there for
|
| 2 |
|
| 3 |
You might want to use Firebase, for example, and share the database between multiple spiders on different machines. An online database like that is a good fit because the spiders can share the saved element data with each other.
|
| 4 |
|
| 5 |
So first, to make your storage class work, it must do the big 3:
|
| 6 |
|
| 7 |
-
1. Inherit from the abstract class `scrapling.core.
|
| 8 |
2. Decorate the class with `functools.lru_cache` so it follows the Singleton design pattern, like the other classes.
|
| 9 |
3. Implement methods `save` and `retrieve`, as you see from the type hints:
|
| 10 |
- The method `save` returns nothing and receives two arguments from the library:
|
| 11 |
* The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the function `element_to_dict` in submodule `scrapling.core.utils._StorageTools` to keep the same format and save it to your database as you wish.
|
| 12 |
-
* The second one is a string, the identifier used for retrieval. The combination result of this identifier and the `url` argument from initialization must be unique for each row, or the
|
| 13 |
- The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise, it returns `None`.
|
| 14 |
|
| 15 |
-
> If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/
|
| 16 |
|
| 17 |
-
If your class meets these criteria, the rest is
|
| 18 |
|
| 19 |
-
Some helper functions are added to the abstract class if you want to use them. It's easier to see it for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/
|
| 20 |
|
| 21 |
|
| 22 |
## Real-World Example: Redis Storage
|
|
@@ -27,7 +27,7 @@ Here's a more practical example generated by AI using Redis:
|
|
| 27 |
import redis
|
| 28 |
import orjson
|
| 29 |
from functools import lru_cache
|
| 30 |
-
from scrapling.core.
|
| 31 |
from scrapling.core.utils import _StorageTools
|
| 32 |
|
| 33 |
@lru_cache(None)
|
|
|
|
| 1 |
+
Scrapling uses SQLite by default, but this tutorial covers writing your own storage system to store element properties for the `adaptive` feature.
|
| 2 |
|
| 3 |
You might want to use Firebase, for example, and share the database between multiple spiders on different machines. An online database like that is a good fit because the spiders can share the saved element data with each other.
|
| 4 |
|
| 5 |
So first, to make your storage class work, it must do the big 3:
|
| 6 |
|
| 7 |
+
1. Inherit from the abstract class `scrapling.core.storage.StorageSystemMixin` and accept a string argument on initialization, which will be the `url` argument; this keeps your class consistent with the library's logic.
|
| 8 |
2. Decorate the class with `functools.lru_cache` so it follows the Singleton design pattern, like the other classes.
|
| 9 |
3. Implement methods `save` and `retrieve`, as you see from the type hints:
|
| 10 |
- The method `save` returns nothing and receives two arguments from the library:
|
| 11 |
* The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the function `element_to_dict` in submodule `scrapling.core.utils._StorageTools` to keep the same format and save it to your database as you wish.
|
| 12 |
+
* The second one is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the `adaptive` data will get mixed up.
|
| 13 |
- The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise, it returns `None`.
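The three requirements above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `StorageBase` is a hypothetical stand-in for `scrapling.core.storage.StorageSystemMixin` so the snippet runs without Scrapling installed, `InMemoryStorage` is an invented name, and the element is assumed to be already converted to a dictionary (in a real class, that is where `element_to_dict` would be called).

```python
from abc import ABC, abstractmethod
from functools import lru_cache
from typing import Dict, Optional, Tuple


class StorageBase(ABC):
    """Hypothetical stand-in for scrapling.core.storage.StorageSystemMixin."""

    def __init__(self, url: Optional[str] = None):
        self.url = url

    @abstractmethod
    def save(self, element: Dict, identifier: str) -> None: ...

    @abstractmethod
    def retrieve(self, identifier: str) -> Optional[Dict]: ...


@lru_cache(None)  # the same `url` always yields the same cached instance
class InMemoryStorage(StorageBase):
    """Illustrative backend that keeps rows in a plain dictionary."""

    def __init__(self, url: Optional[str] = None):
        super().__init__(url)
        # Rows are keyed by (url, identifier), so the pair is unique per row.
        self._rows: Dict[Tuple[Optional[str], str], Dict] = {}

    def save(self, element: Dict, identifier: str) -> None:
        # Assumption: `element` is already a dictionary. A real class would
        # receive an lxml element and convert it with `element_to_dict` first.
        self._rows[(self.url, identifier)] = dict(element)

    def retrieve(self, identifier: str) -> Optional[Dict]:
        # Returns None when nothing was saved under this (url, identifier).
        return self._rows.get((self.url, identifier))
```

Because of the `lru_cache` decorator, calling `InMemoryStorage("https://example.com")` twice returns the same object, while a different `url` produces a separate instance with its own rows.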
|
| 14 |
|
| 15 |
+
> If the instructions aren't clear enough, you can check my SQLite3 implementation in the [storage](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py) file.
|
| 16 |
|
| 17 |
+
If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, ensure your class is thread-safe. The default storage class is thread-safe.
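One common way to make a custom storage class safe for threads is to guard every read and write with a lock. The sketch below only illustrates that idea: `LockedStore` and `write_many` are hypothetical names, the class does not inherit from the library's abstract class, and a dictionary stands in for the real database.

```python
import threading
from typing import Dict, Optional, Tuple


class LockedStore:
    """Hypothetical store that serializes access with a re-entrant lock."""

    def __init__(self, url: Optional[str] = None):
        self.url = url
        self._rows: Dict[Tuple[Optional[str], str], Dict] = {}
        self._lock = threading.RLock()

    def save(self, element: Dict, identifier: str) -> None:
        with self._lock:  # only one thread writes at a time
            self._rows[(self.url, identifier)] = dict(element)

    def retrieve(self, identifier: str) -> Optional[Dict]:
        with self._lock:  # reads wait for any write in progress
            return self._rows.get((self.url, identifier))


def write_many(store: LockedStore, worker_id: int, count: int = 100) -> None:
    """Save `count` rows under identifiers unique to this worker."""
    for i in range(count):
        store.save({"worker": worker_id, "index": i}, f"w{worker_id}-{i}")


store = LockedStore("https://example.com")
threads = [threading.Thread(target=write_many, args=(store, n)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock matters most when `save` or `retrieve` performs several steps on a shared connection or cursor; with a real database client, it is worth checking first whether the client is already documented as thread-safe.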
|
| 18 |
|
| 19 |
+
The abstract class also provides some helper functions if you want to use them. It's easiest to see them for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py); it's heavily commented :)
|
| 20 |
|
| 21 |
|
| 22 |
## Real-World Example: Redis Storage
|
|
|
|
| 27 |
import redis
|
| 28 |
import orjson
|
| 29 |
from functools import lru_cache
|
| 30 |
+
from scrapling.core.storage import StorageSystemMixin
|
| 31 |
from scrapling.core.utils import _StorageTools
|
| 32 |
|
| 33 |
@lru_cache(None)
|