Karim shoair committed on
Commit
4d1a1ad
·
1 Parent(s): ccf41bc

docs: Add a page about the interactive shell

Files changed (1)
  1. docs/cli/interactive-shell.md +235 -0
docs/cli/interactive-shell.md ADDED
@@ -0,0 +1,235 @@
# Scrapling Interactive Shell Guide

<script src="https://asciinema.org/a/736339.js" id="asciicast-736339" async data-autoplay="1" data-loop="1" data-cols="225" data-rows="40" data-start-at="00:06" data-speed="1.5"></script>

**Powerful Web Scraping REPL for Developers and Data Scientists**

The Scrapling interactive shell is an enhanced IPython-based environment designed specifically for web scraping tasks. It provides instant access to all Scrapling features, convenient shortcuts, automatic page management, and advanced tools such as curl command conversion.

## Why use the Interactive Shell?

The interactive shell turns web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for:

- **Rapid prototyping**: Test scraping strategies instantly
- **Data exploration**: Interactively navigate websites and extract data
- **Learning Scrapling**: Experiment with features in real time
- **Debugging scrapers**: Step through requests and inspect results
- **Converting workflows**: Turn curl commands copied from browser DevTools into `Fetcher` requests in one line

## Getting Started

### Launch the Shell

```bash
# Start the interactive shell
scrapling shell

# Execute code and exit (useful for scripting)
scrapling shell -c "get('https://quotes.toscrape.com'); print(len(page.css('.quote')))"

# Set the logging level
scrapling shell --loglevel info
```

Once launched, you'll see the Scrapling banner and can immediately start scraping, as the video above shows:

```python
# No imports needed - everything is ready!
>>> get('https://news.ycombinator.com')

>>> # Explore the page structure
>>> page.css('a')[:5]  # Look at the first 5 links

>>> # Refine your selectors
>>> stories = page.css('.titleline>a')
>>> len(stories)
30

>>> # Extract specific data
>>> for story in stories[:3]:
...     title = story.text
...     url = story['href']
...     print(f"{title}: {url}")

>>> # Try different approaches
>>> titles = page.css('.titleline>a::text')      # Direct text extraction
>>> urls = page.css('.titleline>a::attr(href)')  # Direct attribute extraction
```

## Built-in Shortcuts

The shell provides convenient shortcuts that eliminate boilerplate code:

- **`get(url, **kwargs)`** - HTTP GET request (instead of `Fetcher.get`)
- **`post(url, **kwargs)`** - HTTP POST request (instead of `Fetcher.post`)
- **`put(url, **kwargs)`** - HTTP PUT request (instead of `Fetcher.put`)
- **`delete(url, **kwargs)`** - HTTP DELETE request (instead of `Fetcher.delete`)
- **`fetch(url, **kwargs)`** - Browser-based fetch (instead of `DynamicFetcher.fetch`)
- **`stealthy_fetch(url, **kwargs)`** - Stealthy browser fetch (instead of `StealthyFetcher.fetch`)

The most commonly used classes are automatically available without any imports, including `Fetcher`, `AsyncFetcher`, `DynamicFetcher`, `StealthyFetcher`, and `Selector`.
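
Conceptually, shortcuts like these are just pre-bound references that a shell injects into your namespace. A minimal sketch of the idea, using a hypothetical stand-in class rather than Scrapling's actual `Fetcher`:

```python
# Sketch only: DummyFetcher is a stand-in, not Scrapling's real Fetcher.
class DummyFetcher:
    @classmethod
    def get(cls, url, **kwargs):
        return f"GET {url}"

    @classmethod
    def post(cls, url, **kwargs):
        return f"POST {url}"

# The "shortcuts" are simply references to the bound classmethods,
# placed in the interactive namespace so no class prefix is needed:
get = DummyFetcher.get
post = DummyFetcher.post

print(get('https://example.com'))  # GET https://example.com
```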

### Smart Page Management

The shell automatically tracks your requests and pages:

- **Current Page Access**

The `page` and `response` variables are automatically updated to point to the last fetched page:

```python
>>> get('https://quotes.toscrape.com')
>>> # 'page' and 'response' both refer to the last fetched page
>>> page.url
'https://quotes.toscrape.com'
>>> response.status  # Same as page.status
200
```

- **Page History**

The `pages` variable keeps track of the last five fetched pages (it's a `Selectors` object):

```python
>>> get('https://site1.com')
>>> get('https://site2.com')
>>> get('https://site3.com')

>>> # Access the page history
>>> len(pages)  # `Selectors` object holding the `page` history
3
>>> pages[0].url   # Oldest page in history
'https://site1.com'
>>> pages[-1].url  # Most recent page
'https://site3.com'

>>> # Work with historical pages
>>> for i, old_page in enumerate(pages):
...     print(f"Page {i}: {old_page.url} - {old_page.status}")
```
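
The five-page cap behaves like a bounded queue. A small illustration of the same behavior using the standard library's `collections.deque` (plain URL strings stand in for page objects here):

```python
from collections import deque

# A deque with maxlen=5 silently evicts the oldest entry when full,
# which is the behavior described for `pages` above.
pages = deque(maxlen=5)
for i in range(7):  # simulate fetching seven pages
    pages.append(f"https://site{i}.com")

print(len(pages))   # 5
print(pages[0])     # https://site2.com  (oldest page still kept)
print(pages[-1])    # https://site6.com  (most recent page)
```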

## Additional Helpful Commands

### Page Visualization

View scraped pages in your browser:

```python
>>> get('https://quotes.toscrape.com')
>>> view(page)  # Opens the page's HTML in your default browser
```
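
Under the hood, a helper like `view` essentially amounts to writing the page's HTML to a temporary file and handing that file to the default browser. A hypothetical sketch with the standard library (the actual implementation may differ):

```python
import tempfile
import webbrowser
from pathlib import Path

def view_html(html: str) -> str:
    """Write `html` to a temp file and open it in the default browser."""
    path = Path(tempfile.mkdtemp()) / "page.html"
    path.write_text(html, encoding="utf-8")
    webbrowser.open(path.as_uri())  # returns False on headless machines
    return str(path)

saved = view_html("<h1>Hello</h1>")
```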

### Curl Command Integration

The shell provides two functions, `uncurl` and `curl2fetcher`, to help you convert curl commands copied from the browser DevTools into `Fetcher` requests. First, copy a request as a curl command, like the following:

<img src="../../assets/scrapling_shell_curl.png" title="Copying a request as a curl command from Chrome" alt="Copying a request as a curl command from Chrome" style="width: 70%;"/>

- **Convert a Curl Command to a Request Object**

```python
>>> curl_cmd = '''curl 'https://httpbin.org/post' \
...     -X POST \
...     -H 'Content-Type: application/json' \
...     -d '{"name": "test", "value": 123}' '''

>>> request = uncurl(curl_cmd)
>>> request.method
'post'
>>> request.url
'https://httpbin.org/post'
>>> request.headers
{'Content-Type': 'application/json'}
```

- **Execute a Curl Command Directly**

```python
>>> # Convert and execute in one step
>>> curl2fetcher(curl_cmd)
>>> page.status
200
>>> page.json()['json']
{'name': 'test', 'value': 123}
```
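
To see what such a converter does conceptually, here's a hypothetical, stripped-down parser built on the standard library's `shlex`. It handles only `-X`, `-H`, and `-d`; real converters like `uncurl` cover many more flags plus line continuations:

```python
import shlex

# Hypothetical sketch: tokenize a curl command and walk the flags.
def parse_curl(cmd: str) -> dict:
    tokens = shlex.split(cmd)
    assert tokens and tokens[0] == "curl", "not a curl command"
    request = {"method": "get", "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-X":                   # explicit HTTP method
            request["method"] = tokens[i + 1].lower()
            i += 2
        elif tok == "-H":                 # "Name: value" header
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-d", "--data"):     # a request body implies POST
            request["data"] = tokens[i + 1]
            if request["method"] == "get":
                request["method"] = "post"
            i += 2
        elif tok.startswith("-"):         # skip flags we don't model
            i += 1
        else:                             # the only bare token is the URL
            request["url"] = tok
            i += 1
    return request

req = parse_curl("curl 'https://httpbin.org/post' -X POST "
                 "-H 'Content-Type: application/json' -d '{\"name\": \"test\"}'")
```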

### IPython Features

The shell inherits all of IPython's capabilities:

```python
>>> # Magic commands
>>> %time page = get('https://example.com')  # Time execution
>>> %history                                 # Show command history
>>> %save filename.py 1-10                   # Save commands 1-10 to a file

>>> # Tab completion works everywhere
>>> page.c<TAB>    # Shows: css, css_first, cookies, etc.
>>> Fetcher.<TAB>  # Shows all Fetcher methods

>>> # Object inspection
>>> get?  # Show the documentation of get
```

## Examples

Here are a few AI-generated examples:

### E-commerce Data Collection

```python
>>> # Start with the product listing page
>>> catalog = get('https://shop.example.com/products')

>>> # Find product links
>>> product_links = catalog.css('.product-link::attr(href)')
>>> print(f"Found {len(product_links)} products")

>>> # Sample a few products first
>>> for link in product_links[:3]:
...     product = get(f"https://shop.example.com{link}")
...     name = product.css('.product-name::text').get('')
...     price = product.css('.price::text').get('')
...     print(f"{name}: {price}")

>>> # Scale up with sessions for efficiency
>>> from scrapling.fetchers import FetcherSession
>>> with FetcherSession() as session:
...     products = []
...     for link in product_links:
...         product = session.get(f"https://shop.example.com{link}")
...         products.append({
...             'name': product.css('.product-name::text').get(''),
...             'price': product.css('.price::text').get(''),
...             'url': link,
...         })
```

### API Integration and Testing

```python
>>> # Test API endpoints interactively
>>> response = get('https://jsonplaceholder.typicode.com/posts/1')
>>> response.json()
{'userId': 1, 'id': 1, 'title': 'sunt aut...', 'body': 'quia et...'}

>>> # Test POST requests
>>> new_post = post('https://jsonplaceholder.typicode.com/posts',
...                 json={'title': 'Test Post', 'body': 'Test content', 'userId': 1})
>>> new_post.json()['id']
101

>>> # Test with different data
>>> updated = put(f'https://jsonplaceholder.typicode.com/posts/{new_post.json()["id"]}',
...               json={'title': 'Updated Title'})
```

## Getting Help

If you need help beyond what's available in the terminal, check out:

- [Scrapling Documentation](https://scrapling.readthedocs.io/)
- [Discord Community](https://discord.gg/EMgGbDceNQ)
- [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)

And that's it! Happy scraping! The shell makes web scraping as easy as a conversation.