hellorahulk committed
Commit 43aa272 · verified · Parent: 7dbdab0

Upload 4 files

Files changed (4)
  1. Dockerfile +46 -0
  2. README.md +75 -10
  3. app.py +457 -0
  4. requirements.txt +9 -0
Dockerfile ADDED
@@ -0,0 +1,46 @@
+ FROM python:3.10-slim
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     wget \
+     gnupg \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install latest Chrome and its dependencies
+ RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
+     && echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
+     && apt-get update \
+     && apt-get install -y \
+     google-chrome-stable \
+     fonts-ipafont-gothic \
+     fonts-wqy-zenhei \
+     fonts-thai-tlwg \
+     fonts-kacst \
+     fonts-freefont-ttf \
+     libxss1 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set up working directory
+ WORKDIR /app
+
+ # Copy requirements and install Python dependencies
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Install Playwright browsers
+ RUN playwright install chromium
+ RUN playwright install-deps
+
+ # Copy application code
+ COPY . .
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV GRADIO_SERVER_NAME=0.0.0.0
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Expose port
+ EXPOSE 7860
+
+ # Start the application
+ CMD ["python", "app.py"]
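The ENV lines above export `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT`, while `app.py` in this commit launches Gradio with hard-coded values. A minimal sketch (not part of the commit, with a placeholder handler) of a launch block that honors those variables instead:

```python
import os

import gradio as gr


def crawl_placeholder(url: str) -> str:
    # Stand-in for the real gradio_crawl handler defined in app.py
    return f"Would crawl: {url}"


demo = gr.Interface(fn=crawl_placeholder, inputs="text", outputs="text")

if __name__ == "__main__":
    # Read the values exported by the Dockerfile, falling back to its defaults
    demo.launch(
        server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
        server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
    )
```

This keeps the port configurable from the Dockerfile alone, without touching the application code.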
README.md CHANGED
@@ -1,10 +1,75 @@
- ---
- title: Crawlitall
- emoji: 🏢
- colorFrom: indigo
- colorTo: gray
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Crawl4AI Demo - Docker Deployment
+
+ This is a Docker-ready version of the Crawl4AI demo application, specifically designed for deployment on Hugging Face Spaces.
+
+ ## Features
+
+ - Web interface built with Gradio
+ - Support for multiple crawler types (Basic, LLM, Cosine, JSON/CSS)
+ - Configurable word count threshold
+ - Markdown output with metadata
+ - Sub-page crawling capabilities
+ - Lazy loading support
+ - Docker-optimized configuration
+
+ ## Deployment Instructions
+
+ 1. Create a new Space on Hugging Face:
+    - Go to huggingface.co/spaces
+    - Click "Create new Space"
+    - Choose "Docker" as the SDK
+    - Set the hardware requirements (recommended: CPU with 16GB RAM)
+
+ 2. Upload the files:
+    - Upload all files from this directory to your Space
+    - Make sure to include:
+      - `Dockerfile`
+      - `app.py`
+      - `requirements.txt`
+      - `README.md`
+
+ 3. The Space will automatically build and deploy the application.
+
+ ## Environment Variables
+
+ No environment variables are required for basic functionality. The application is configured to run out of the box.
+
+ ## Hardware Requirements
+
+ - CPU: 2+ cores recommended
+ - RAM: 16GB recommended
+ - Disk: 5GB minimum
+
+ ## Browser Support
+
+ The application uses Chrome in headless mode for web crawling. The Dockerfile includes all necessary dependencies.
+
+ ## Limitations
+
+ - Memory usage increases with the number of pages crawled
+ - Some websites may block automated crawling
+ - JavaScript-heavy sites may require additional configuration
+
+ ## Troubleshooting
+
+ If you encounter issues:
+
+ 1. Check the Space logs for error messages
+ 2. Ensure Chrome can launch correctly inside the container
+ 3. Verify network connectivity
+ 4. Check memory usage
+
+ ## Development
+
+ To run locally with Docker:
+
+ ```bash
+ docker build -t crawl4ai-demo .
+ docker run -p 7860:7860 crawl4ai-demo
+ ```
+
+ Visit http://localhost:7860 to access the application.
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
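Step 2 of the deployment instructions (uploading the files) can also be scripted. A hedged sketch using `huggingface_hub`, which is not listed in requirements.txt; the `repo_id` is a placeholder:

```python
from huggingface_hub import HfApi  # requires a prior `huggingface-cli login` or an HF_TOKEN

api = HfApi()
api.upload_folder(
    folder_path=".",                     # directory holding Dockerfile, app.py, requirements.txt, README.md
    repo_id="your-username/your-space",  # placeholder Space id; replace with your own
    repo_type="space",
)
```

After the upload finishes, the Space rebuilds the Docker image automatically, as described in step 3.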
app.py ADDED
@@ -0,0 +1,457 @@
+ """
+ Crawl4AI Demo Application (Docker Version)
+ ==========================================
+
+ This is a modified version of the Crawl4AI demo application specifically designed
+ for deployment in a Docker container on Hugging Face Spaces.
+
+ Features:
+ ---------
+ - Web interface built with Gradio for interactive use
+ - Support for multiple crawler types (Basic, LLM, Cosine, JSON/CSS)
+ - Configurable word count threshold
+ - Markdown output with metadata
+ - Sub-page crawling capabilities
+ - Lazy loading support
+ - Docker-optimized configuration
+ """
+
+ import gradio as gr
+ import asyncio
+ from typing import Optional, Dict, Any, List, Set
+ from enum import Enum
+ from pydantic import BaseModel
+ from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, BrowserConfig
+ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+ import urllib.parse
+ import os
+
+ # Configure browser settings for Docker environment
+ CHROME_PATH = "/usr/bin/google-chrome-stable"
+ os.environ["CHROME_PATH"] = CHROME_PATH
+
+ class CrawlerType(str, Enum):
+     """Enumeration of supported crawler types."""
+     BASIC = "basic"
+     LLM = "llm"
+     COSINE = "cosine"
+     JSON_CSS = "json_css"
+
+ class ExtractionType(str, Enum):
+     """Enumeration of supported extraction strategies."""
+     DEFAULT = "default"
+     CSS = "css"
+     XPATH = "xpath"
+     LLM = "llm"
+     COMBINED = "combined"
+
+ class CrawlRequest(BaseModel):
+     """Request model for crawling operations."""
+     url: str
+     crawler_type: CrawlerType = CrawlerType.BASIC
+     extraction_type: ExtractionType = ExtractionType.DEFAULT
+     word_count_threshold: int = 100
+     css_selector: Optional[str] = None
+     xpath_query: Optional[str] = None
+     excluded_tags: Optional[list] = None
+     scan_full_page: bool = False
+     scroll_delay: float = 0.5
+     crawl_subpages: bool = False
+     max_depth: int = 1
+     exclude_external_links: bool = True
+     max_pages: int = 10
+
+ def create_extraction_strategy(extraction_type: ExtractionType, css_selector: Optional[str] = None, xpath_query: Optional[str] = None) -> Any:
+     """Create an extraction strategy based on the specified type."""
+     if extraction_type == ExtractionType.CSS and css_selector:
+         schema = {
+             "name": "Content",
+             "baseSelector": css_selector,
+             "fields": [
+                 {"name": "title", "selector": "h1,h2", "type": "text"},
+                 {"name": "text", "selector": "p", "type": "text"},
+                 {"name": "links", "selector": "a", "type": "attribute", "attribute": "href"}
+             ]
+         }
+         return JsonCssExtractionStrategy(schema)
+     return None
+
+ async def crawl_with_subpages(request: CrawlRequest, base_url: str, current_depth: int = 1, visited: Optional[Set[str]] = None) -> Optional[Dict]:
+     """Recursively crawl pages including sub-pages up to the specified depth."""
+     if visited is None:
+         visited = set()
+
+     if current_depth > request.max_depth or len(visited) >= request.max_pages:
+         return None
+
+     # Normalize the URL (drop any fragment) before checking the visited set
+     normalized_url = urllib.parse.urldefrag(request.url)[0]
+     if normalized_url in visited:
+         return None
+
+     run_config = CrawlerRunConfig(
+         cache_mode=CacheMode.BYPASS,
+         verbose=True,
+         word_count_threshold=request.word_count_threshold,
+         css_selector=request.css_selector,
+         excluded_tags=request.excluded_tags or ["nav", "footer", "header"],
+         exclude_external_links=request.exclude_external_links,
+         wait_for=f"css:{request.css_selector}" if request.css_selector else None,
+         wait_for_images=True,
+         page_timeout=30000,
+         scan_full_page=request.scan_full_page,
+         scroll_delay=request.scroll_delay,
+         extraction_strategy=create_extraction_strategy(
+             request.extraction_type,
+             request.css_selector,
+             request.xpath_query
+         )
+     )
+
+     # Docker-optimized browser configuration
+     browser_config = BrowserConfig(
+         headless=True,
+         viewport_width=1920,
+         viewport_height=1080,
+         chrome_path=CHROME_PATH,
+         args=[
+             "--no-sandbox",
+             "--disable-dev-shm-usage",
+             "--disable-gpu"
+         ]
+     )
+
+     results = {
+         "pages": [],
+         "total_links": 0,
+         "visited_pages": len(visited)
+     }
+
+     try:
+         async with AsyncWebCrawler(config=browser_config) as crawler:
+             result = await crawler.arun(url=request.url, config=run_config)
+
+             if not result.success:
+                 print(f"Failed to crawl {request.url}: {result.error_message}")
+                 return None
+
+             page_result = {
+                 "url": request.url,
+                 "markdown": result.markdown_v2 if hasattr(result, 'markdown_v2') else "",
+                 "extracted_content": result.extracted_content if hasattr(result, 'extracted_content') else None,
+                 "depth": current_depth
+             }
+             results["pages"].append(page_result)
+             visited.add(normalized_url)
+
+             if request.crawl_subpages and hasattr(result, 'links'):
+                 internal_links = result.links.get("internal", [])
+                 if internal_links:
+                     results["total_links"] += len(internal_links)
+
+                 for link in internal_links:
+                     if len(visited) >= request.max_pages:
+                         break
+
+                     try:
+                         normalized_link = urllib.parse.urljoin(request.url, link)
+                         link_domain = urllib.parse.urlparse(normalized_link).netloc
+
+                         if normalized_link in visited or (request.exclude_external_links and link_domain != base_url):
+                             continue
+
+                         sub_request = CrawlRequest(
+                             **{**request.dict(), "url": normalized_link}
+                         )
+
+                         sub_result = await crawl_with_subpages(
+                             sub_request,
+                             base_url,
+                             current_depth + 1,
+                             visited
+                         )
+
+                         if sub_result:
+                             results["pages"].extend(sub_result["pages"])
+                             results["total_links"] += sub_result["total_links"]
+                             results["visited_pages"] = len(visited)
+                     except Exception as e:
+                         print(f"Error processing link {link}: {str(e)}")
+                         continue
+
+         return results
+     except Exception as e:
+         print(f"Error crawling {request.url}: {str(e)}")
+         return None
+
+ async def crawl_url(request: CrawlRequest) -> Dict:
+     """Crawl a URL and return the extracted content."""
+     try:
+         base_url = urllib.parse.urlparse(request.url).netloc
+
+         if request.crawl_subpages:
+             results = await crawl_with_subpages(request, base_url)
+             if not results or not results["pages"]:
+                 raise Exception(f"Failed to crawl pages starting from {request.url}")
+
+             combined_markdown = "\n\n---\n\n".join(
+                 f"## Page: {page['url']}\n{page['markdown']}"
+                 for page in results["pages"]
+             )
+
+             return {
+                 "markdown": combined_markdown,
+                 "metadata": {
+                     "url": request.url,
+                     "crawler_type": request.crawler_type.value,
+                     "extraction_type": request.extraction_type.value,
+                     "word_count_threshold": request.word_count_threshold,
+                     "css_selector": request.css_selector,
+                     "xpath_query": request.xpath_query,
+                     "scan_full_page": request.scan_full_page,
+                     "scroll_delay": request.scroll_delay,
+                     "total_pages_crawled": results["visited_pages"],
+                     "total_links_found": results["total_links"],
+                     "max_depth_reached": min(request.max_depth, max(page["depth"] for page in results["pages"]))
+                 },
+                 "pages": results["pages"]
+             }
+         else:
+             wait_condition = f"css:{request.css_selector}" if request.css_selector else None
+
+             run_config = CrawlerRunConfig(
+                 cache_mode=CacheMode.BYPASS,
+                 word_count_threshold=request.word_count_threshold,
+                 css_selector=request.css_selector,
+                 excluded_tags=request.excluded_tags or ["nav", "footer", "header"],
+                 wait_for=wait_condition,
+                 wait_for_images=True,
+                 page_timeout=30000,
+                 scan_full_page=request.scan_full_page,
+                 scroll_delay=request.scroll_delay,
+                 extraction_strategy=create_extraction_strategy(
+                     request.extraction_type,
+                     request.css_selector,
+                     request.xpath_query
+                 )
+             )
+
+             # Docker-optimized browser configuration
+             browser_config = BrowserConfig(
+                 headless=True,
+                 viewport_width=1920,
+                 viewport_height=1080,
+                 chrome_path=CHROME_PATH,
+                 args=[
+                     "--no-sandbox",
+                     "--disable-dev-shm-usage",
+                     "--disable-gpu"
+                 ]
+             )
+
+             async with AsyncWebCrawler(config=browser_config) as crawler:
+                 result = await crawler.arun(url=request.url, config=run_config)
+
+                 if not result.success:
+                     raise Exception(result.error_message)
+
+                 images = result.media.get("images", []) if hasattr(result, 'media') else []
+                 image_info = "\n### Images Found\n" if images else ""
+                 for i, img in enumerate(images[:5]):
+                     image_info += f"- Image {i+1}: {img.get('src', 'N/A')}\n"
+                     if img.get('alt'):
+                         image_info += f"  Alt: {img['alt']}\n"
+                     if img.get('score'):
+                         image_info += f"  Score: {img['score']}\n"
+
+                 return {
+                     "markdown": result.markdown_v2 if hasattr(result, 'markdown_v2') else "",
+                     "metadata": {
+                         "url": request.url,
+                         "crawler_type": request.crawler_type.value,
+                         "extraction_type": request.extraction_type.value,
+                         "word_count_threshold": request.word_count_threshold,
+                         "css_selector": request.css_selector,
+                         "xpath_query": request.xpath_query,
+                         "scan_full_page": request.scan_full_page,
+                         "scroll_delay": request.scroll_delay,
+                         "wait_condition": wait_condition
+                     },
+                     "extracted_content": result.extracted_content if hasattr(result, 'extracted_content') else None,
+                     "image_info": image_info
+                 }
+     except Exception as e:
+         raise Exception(str(e))
+
+ async def gradio_crawl(
+     url: str,
+     crawler_type: str,
+     extraction_type: str,
+     word_count_threshold: int,
+     css_selector: str,
+     xpath_query: str,
+     scan_full_page: bool,
+     scroll_delay: float,
+     crawl_subpages: bool,
+     max_depth: int,
+     max_pages: int,
+     exclude_external_links: bool
+ ) -> tuple[str, str]:
+     """Handle crawling requests from the Gradio interface."""
+     try:
+         request = CrawlRequest(
+             url=url,
+             # Map the UI labels onto enum values (e.g. "JSON/CSS" -> "json_css")
+             crawler_type=CrawlerType(crawler_type.lower().replace("/", "_")),
+             extraction_type=ExtractionType(extraction_type.lower()),
+             word_count_threshold=word_count_threshold,
+             css_selector=css_selector if css_selector else None,
+             xpath_query=xpath_query if xpath_query else None,
+             scan_full_page=scan_full_page,
+             scroll_delay=scroll_delay,
+             crawl_subpages=crawl_subpages,
+             max_depth=max_depth,
+             max_pages=max_pages,
+             exclude_external_links=exclude_external_links
+         )
+
+         result = await crawl_url(request)
+
+         markdown_content = str(result["markdown"]) if result.get("markdown") else ""
+
+         metadata_str = f"""### Metadata
+ - URL: {result['metadata']['url']}
+ - Crawler Type: {result['metadata']['crawler_type']}
+ - Extraction Type: {result['metadata']['extraction_type']}
+ - Word Count Threshold: {result['metadata']['word_count_threshold']}
+ - CSS Selector: {result['metadata']['css_selector'] or 'None'}
+ - XPath Query: {result['metadata']['xpath_query'] or 'None'}
+ - Full Page Scan: {result['metadata']['scan_full_page']}
+ - Scroll Delay: {result['metadata']['scroll_delay']}s"""
+
+         if crawl_subpages:
+             metadata_str += f"""
+ - Total Pages Crawled: {result['metadata'].get('total_pages_crawled', 0)}
+ - Total Links Found: {result['metadata'].get('total_links_found', 0)}
+ - Max Depth Reached: {result['metadata'].get('max_depth_reached', 1)}"""
+
+         if result.get('image_info'):
+             metadata_str += f"\n\n{result['image_info']}"
+
+         if result.get("extracted_content"):
+             metadata_str += f"\n\n### Extracted Content\n```json\n{result['extracted_content']}\n```"
+
+         return markdown_content, metadata_str
+     except Exception as e:
+         error_msg = f"Error: {str(e)}"
+         return error_msg, "Error occurred while crawling"
+
+ # Create Gradio interface with Docker-optimized settings
+ demo = gr.Interface(
+     fn=gradio_crawl,
+     inputs=[
+         gr.Textbox(
+             label="URL",
+             placeholder="Enter URL to crawl",
+             info="The webpage URL to extract content from"
+         ),
+         gr.Dropdown(
+             choices=["Basic", "LLM", "Cosine", "JSON/CSS"],
+             label="Crawler Type",
+             value="Basic",
+             info="Select the content extraction strategy"
+         ),
+         gr.Dropdown(
+             choices=["Default", "CSS", "XPath", "LLM", "Combined"],
+             label="Extraction Type",
+             value="Default",
+             info="Choose how to extract content from the page"
+         ),
+         gr.Slider(
+             minimum=50,
+             maximum=500,
+             value=100,
+             step=50,
+             label="Word Count Threshold",
+             info="Minimum number of words required for content extraction"
+         ),
+         gr.Textbox(
+             label="CSS Selector",
+             placeholder="e.g., article.content, main.post",
+             info="CSS selector to target specific content (used with CSS extraction type)"
+         ),
+         gr.Textbox(
+             label="XPath Query",
+             placeholder="e.g., //article[@class='content']",
+             info="XPath query to target specific content (used with XPath extraction type)"
+         ),
+         gr.Checkbox(
+             label="Scan Full Page",
+             value=False,
+             info="Enable to scroll through the entire page to load lazy content"
+         ),
+         gr.Slider(
+             minimum=0.1,
+             maximum=2.0,
+             value=0.5,
+             step=0.1,
+             label="Scroll Delay",
+             info="Delay between scroll steps in seconds when scanning full page"
+         ),
+         gr.Checkbox(
+             label="Crawl Sub-pages",
+             value=False,
+             info="Enable to crawl links found on the page"
+         ),
+         gr.Slider(
+             minimum=1,
+             maximum=5,
+             value=1,
+             step=1,
+             label="Max Crawl Depth",
+             info="Maximum depth for recursive crawling (1 = only direct links)"
+         ),
+         gr.Slider(
+             minimum=1,
+             maximum=50,
+             value=10,
+             step=5,
+             label="Max Pages",
+             info="Maximum number of pages to crawl"
+         ),
+         gr.Checkbox(
+             label="Exclude External Links",
+             value=True,
+             info="Only crawl links within the same domain"
+         )
+     ],
+     outputs=[
+         gr.Markdown(label="Generated Markdown"),
+         gr.Markdown(label="Metadata & Extraction Results")
+     ],
+     title="Crawl4AI Demo",
+     description="""
+     This demo allows you to extract content from web pages using different crawling and extraction strategies.
+
+     1. Enter a URL to crawl
+     2. Select a crawler type (Basic, LLM, Cosine, JSON/CSS)
+     3. Choose an extraction strategy (Default, CSS, XPath, LLM, Combined)
+     4. Configure additional options:
+        - Word count threshold for content filtering
+        - CSS selectors for targeting specific content
+        - XPath queries for precise extraction
+        - Full page scanning for lazy-loaded content
+        - Scroll delay for controlling page scanning speed
+        - Sub-page crawling with depth control
+        - Maximum number of pages to crawl
+        - External link filtering
+
+     The extracted content will be displayed in markdown format along with metadata and extraction results.
+     When sub-page crawling is enabled, content from all crawled pages will be combined in the output.
+     """,
+     examples=[
+         ["https://example.com", "Basic", "Default", 100, "", "", False, 0.5, False, 1, 10, True],
+         ["https://example.com/blog", "Basic", "CSS", 100, "article.post", "", True, 0.5, True, 2, 5, True],
+     ]
+ )
+
+ if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=7860)
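Outside the Gradio UI, the `crawl_url` coroutine above can be driven directly. A minimal sketch assuming the file is saved as `app.py` and importable (importing it builds the interface but does not launch the server, since the launch is guarded by `__main__`):

```python
import asyncio

from app import CrawlRequest, CrawlerType, ExtractionType, crawl_url


async def main() -> None:
    # Build a request with the same defaults the Gradio form uses
    request = CrawlRequest(
        url="https://example.com",
        crawler_type=CrawlerType.BASIC,
        extraction_type=ExtractionType.DEFAULT,
        word_count_threshold=100,
    )
    result = await crawl_url(request)
    print(result["metadata"])
    print(str(result["markdown"])[:500])  # first 500 characters of the markdown output


if __name__ == "__main__":
    asyncio.run(main())
```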
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ crawl4ai
+ gradio
+ python-dotenv
+ pydantic
+ playwright
+ aiofiles
+ python-multipart
+ typing-extensions
+ uvicorn
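The dependencies above are unpinned, so the image picks up whatever versions are current at build time. A small illustrative script (not part of the upload; the file name in the comment is hypothetical) for printing the versions that actually resolved inside the container:

```python
# Run inside the built image, e.g. `docker run --rm crawl4ai-demo python check_versions.py`
import importlib

# Only packages whose import name matches the requirements entry are checked here
for name in ["crawl4ai", "gradio", "pydantic", "playwright", "aiofiles", "uvicorn"]:
    module = importlib.import_module(name)
    print(f"{name}: {getattr(module, '__version__', 'no __version__ attribute')}")
```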