## Chunking Strategies 📚

Crawl4AI provides several powerful chunking strategies to divide text into manageable parts for further processing. Each strategy has unique characteristics and is suitable for different scenarios. Let's explore them one by one.

### RegexChunking

`RegexChunking` splits text using regular expressions. This is ideal for creating chunks based on specific patterns like paragraphs or sentences.

#### When to Use
- Great for structured text with consistent delimiters.
- Suitable for documents where specific patterns (e.g., double newlines, periods) indicate logical chunks.

#### Parameters
- `patterns` (list, optional): Regular expressions used to split the text. Default is to split by double newlines (`['\n\n']`).

#### Example
```python
from crawl4ai.chunking_strategy import RegexChunking

# Define patterns for splitting text
patterns = [r'\n\n', r'\. ']
chunker = RegexChunking(patterns=patterns)

# Sample text
text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."

# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
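
To make the splitting behaviour concrete, here is a minimal standard-library sketch of pattern-based splitting. It assumes the patterns are applied one after another, each to the pieces produced by the previous one. `regex_chunk` is a hypothetical helper written for illustration, not Crawl4AI's own code, and the library's exact output (for example, how it treats empty pieces) may differ.

```python
import re

def regex_chunk(text, patterns):
    """Split text by each pattern in turn (illustrative helper, not part of Crawl4AI)."""
    pieces = [text]
    for pattern in patterns:
        next_pieces = []
        for piece in pieces:
            next_pieces.extend(re.split(pattern, piece))
        pieces = next_pieces
    # Drop empty strings left over by adjacent delimiters
    return [p for p in pieces if p]

text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
chunks = regex_chunk(text, [r'\n\n', r'\. '])
print(chunks)
```

Note that `re.split` consumes the delimiter, so a piece that was cut on `'\. '` loses its trailing period.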

### NlpSentenceChunking

`NlpSentenceChunking` uses NLP models to split text into sentences, ensuring accurate sentence boundaries.

#### When to Use
- Ideal for texts where sentence boundaries are crucial.
- Useful for creating chunks that preserve grammatical structures.

#### Parameters
- None.

#### Example
```python
from crawl4ai.chunking_strategy import NlpSentenceChunking

chunker = NlpSentenceChunking()

# Sample text
text = "This is a sample text. It will be split into sentences. Here's another sentence."

# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```

### TopicSegmentationChunking

`TopicSegmentationChunking` employs the TextTiling algorithm to segment text into topic-based chunks. This method identifies thematic boundaries.

#### When to Use
- Perfect for long documents with distinct topics.
- Useful when preserving topic continuity is more important than maintaining text order.

#### Parameters
- `num_keywords` (int, optional): Number of keywords for each topic segment. Default is `3`.

#### Example
```python
from crawl4ai.chunking_strategy import TopicSegmentationChunking

chunker = TopicSegmentationChunking(num_keywords=3)

# Sample text
text = "This document contains several topics. Topic one discusses AI. Topic two covers machine learning."

# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```

### FixedLengthWordChunking

`FixedLengthWordChunking` splits text into chunks based on a fixed number of words. This ensures each chunk has approximately the same length.

#### When to Use
- Suitable for processing large texts where uniform chunk size is important.
- Useful when the number of words per chunk needs to be controlled.

#### Parameters
- `chunk_size` (int, optional): Number of words per chunk. Default is `100`.

#### Example
```python
from crawl4ai.chunking_strategy import FixedLengthWordChunking

chunker = FixedLengthWordChunking(chunk_size=10)

# Sample text
text = "This is a sample text. It will be split into chunks of fixed length."

# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
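
The idea behind fixed-length word chunking can be sketched in a few lines of plain Python. This assumes words are separated by whitespace and that the last chunk simply holds whatever words remain; `fixed_length_chunk` is a hypothetical name used only for this illustration.

```python
def fixed_length_chunk(text, chunk_size=100):
    """Group whitespace-separated words into runs of chunk_size (illustrative helper)."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

text = "This is a sample text. It will be split into chunks of fixed length."
chunks = fixed_length_chunk(text, chunk_size=10)
print(chunks)
```

With 14 words and `chunk_size=10`, the sketch yields one full chunk and one shorter trailing chunk, which is why chunk lengths are only approximately equal.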

### SlidingWindowChunking

`SlidingWindowChunking` uses a sliding window approach to create overlapping chunks. Each chunk has a fixed length, and the window slides by a specified step size.

#### When to Use
- Ideal for creating overlapping chunks to preserve context.
- Useful for tasks where context from adjacent chunks is needed.

#### Parameters
- `window_size` (int, optional): Number of words in each chunk. Default is `100`.
- `step` (int, optional): Number of words to slide the window. Default is `50`.

#### Example
```python
from crawl4ai.chunking_strategy import SlidingWindowChunking

chunker = SlidingWindowChunking(window_size=10, step=5)

# Sample text
text = "This is a sample text. It will be split using a sliding window approach to preserve context."

# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
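
The relationship between `window_size` and `step` is easier to see in a small standalone sketch. The helper below is hypothetical and makes one assumption worth flagging: it keeps a shorter final window so that no trailing words are lost, which may not match the library's handling of the tail.

```python
def sliding_window_chunk(text, window_size=100, step=50):
    """Create overlapping word windows (illustrative helper, not Crawl4AI's code)."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once a window has reached the end of the text
        if start + window_size >= len(words):
            break
    return chunks

text = "This is a sample text. It will be split using a sliding window approach to preserve context."
chunks = sliding_window_chunk(text, window_size=10, step=5)
for chunk in chunks:
    print(chunk)
```

Because `step` is half of `window_size` here, each window repeats the last five words of the previous one. Setting `step` equal to `window_size` would remove the overlap entirely.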

With these chunking strategies, you can choose the best method to divide your text based on your specific needs. Whether you need precise sentence boundaries, topic-based segmentation, or uniform chunk sizes, Crawl4AI has you covered. Happy chunking! 📝✨
# Cosine Strategy

The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.

## How It Works

The Cosine Strategy:
1. Breaks down page content into meaningful chunks
2. Converts text into vector representations
3. Calculates similarity between chunks
4. Clusters similar content together
5. Ranks and filters content based on relevance
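
The vectorize, compare, and filter steps can be illustrated with a toy example that needs only the standard library. This is a sketch under loud assumptions: it uses simple word counts in place of the sentence-embedding model the strategy is configured with, it skips the clustering step entirely, and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_chunks(chunks, query, sim_threshold=0.3):
    """Score each chunk against the query, drop low scores, sort best first."""
    query_vec = vectorize(query)
    scored = [(cosine_similarity(vectorize(c), query_vec), c) for c in chunks]
    kept = [pair for pair in scored if pair[0] >= sim_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

chunks = [
    "great product reviews from customers",
    "shipping and returns policy",
    "customer reviews praise the product",
]
ranked = rank_chunks(chunks, "product reviews")
for score, chunk in ranked:
    print(f"{score:.2f}  {chunk}")
```

The chunk about shipping shares no words with the query, so its similarity is zero and it falls below the threshold. A real embedding model would also match paraphrases that share no exact words, which is the main reason to use one.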

## Basic Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",    # Target content type
    word_count_threshold=10,             # Minimum words per cluster
    sim_threshold=0.3                    # Similarity threshold
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/reviews",
            extraction_strategy=strategy
        )

        # extracted_content is a JSON string; parse it with json.loads if needed
        content = result.extracted_content
        print(content)

asyncio.run(main())
```

## Configuration Options

### Core Parameters

```python
CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,       # Keywords/topic for content filtering
    word_count_threshold: int = 10,    # Minimum words per cluster
    sim_threshold: float = 0.3,        # Similarity threshold (0.0 to 1.0)
    
    # Clustering Parameters
    max_dist: float = 0.2,            # Maximum distance for clustering
    linkage_method: str = 'ward',      # Clustering linkage method
    top_k: int = 3,                   # Number of top categories to extract
    
    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
    
    verbose: bool = False             # Enable logging
)
```

### Parameter Details

1. **semantic_filter**
   - Sets the target topic or content type
   - Use keywords relevant to your desired content
   - Example: "technical specifications", "user reviews", "pricing information"

2. **sim_threshold**
   - Controls how similar content must be to be grouped together
   - Higher values (e.g., 0.8) mean stricter matching
   - Lower values (e.g., 0.3) allow more variation
   ```python
   # Strict matching
   strategy = CosineStrategy(sim_threshold=0.8)
   
   # Loose matching
   strategy = CosineStrategy(sim_threshold=0.3)
   ```

3. **word_count_threshold**
   - Filters out short content blocks
   - Helps eliminate noise and irrelevant content
   ```python
   # Only consider substantial paragraphs
   strategy = CosineStrategy(word_count_threshold=50)
   ```

4. **top_k**
   - Number of top content clusters to return
   - Higher values return more diverse content
   ```python
   # Get top 5 most relevant content clusters
   strategy = CosineStrategy(top_k=5)
   ```
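
How `word_count_threshold` and `top_k` interact can be pictured with a small standalone sketch. It assumes, purely for illustration, that each candidate block already has a relevance score, that short blocks are dropped first, and that the best `top_k` of the remainder are kept. `select_blocks` and the sample data are hypothetical and do not reflect the library's internals.

```python
def select_blocks(scored_blocks, word_count_threshold=10, top_k=3):
    """Filter (score, text) pairs by length, then keep the top_k highest scores."""
    long_enough = [
        (score, text)
        for score, text in scored_blocks
        if len(text.split()) >= word_count_threshold
    ]
    long_enough.sort(key=lambda pair: pair[0], reverse=True)
    return long_enough[:top_k]

# Hypothetical scored blocks
blocks = [
    (0.9, "Buy now"),
    (0.7, "The battery easily lasts two full days of use"),
    (0.4, "Shipping takes three to five business days"),
    (0.8, "The camera produces sharp photos even in low light"),
]
selected = select_blocks(blocks, word_count_threshold=5, top_k=2)
for score, text in selected:
    print(score, text)
```

The highest-scoring block ("Buy now") is discarded because it is too short, which shows why a length filter can matter even when relevance scores look good.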

## Use Cases

### 1. Article Content Extraction
```python
strategy = CosineStrategy(
    semantic_filter="main article content",
    word_count_threshold=100,  # Longer blocks for articles
    top_k=1                   # Usually want single main content
)

result = await crawler.arun(
    url="https://example.com/blog/post",
    extraction_strategy=strategy
)
```

### 2. Product Review Analysis
```python
strategy = CosineStrategy(
    semantic_filter="customer reviews and ratings",
    word_count_threshold=20,   # Reviews can be shorter
    top_k=10,                 # Get multiple reviews
    sim_threshold=0.4         # Allow variety in review content
)
```

### 3. Technical Documentation
```python
strategy = CosineStrategy(
    semantic_filter="technical specifications documentation",
    word_count_threshold=30,
    sim_threshold=0.6,        # Stricter matching for technical content
    max_dist=0.3             # Allow related technical sections
)
```

## Advanced Features

### Custom Clustering
```python
strategy = CosineStrategy(
    linkage_method='complete',  # Alternative clustering method
    max_dist=0.4,              # Larger clusters
    model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'  # Multilingual support
)
```

### Content Filtering Pipeline
```python
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="pricing plans features",
    word_count_threshold=15,
    sim_threshold=0.5,
    top_k=3
)

async def extract_pricing_features(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=strategy
        )

        if not result.success:
            return None  # Crawl or extraction failed

        content = json.loads(result.extracted_content)
        return {
            'pricing_features': content,
            'clusters': len(content),
            # Assumes each item carries a 'score' key; .get() yields None otherwise
            'similarity_scores': [item.get('score') for item in content]
        }
```

## Best Practices

1. **Adjust Thresholds Iteratively**
   - Start with default values
   - Adjust based on results
   - Monitor clustering quality

2. **Choose Appropriate Word Count Thresholds**
   - Higher for articles (100+)
   - Lower for reviews/comments (20+)
   - Medium for product descriptions (50+)

3. **Optimize Performance**
   ```python
   strategy = CosineStrategy(
       word_count_threshold=10,  # Filter early
       top_k=5,                 # Limit results
       verbose=True             # Monitor performance
   )
   ```

4. **Handle Different Content Types**
   ```python
   # For mixed content pages
   strategy = CosineStrategy(
       semantic_filter="product features",
       sim_threshold=0.4,      # More flexible matching
       max_dist=0.3,          # Larger clusters
       top_k=3                # Multiple relevant sections
   )
   ```

## Error Handling

```python
import json

async def extract_safely(crawler, url, strategy):
    """Run the extraction and return the parsed content, or None on failure."""
    try:
        result = await crawler.arun(url=url, extraction_strategy=strategy)

        if result.success:
            content = json.loads(result.extracted_content)
            if not content:
                print("No relevant content found")
            return content
        print(f"Extraction failed: {result.error_message}")
    except Exception as e:
        print(f"Error during extraction: {e}")
    return None
```

The Cosine Strategy is particularly effective when:
- Content structure is inconsistent
- You need semantic understanding
- You want to find similar content blocks
- Structure-based extraction (CSS/XPath) isn't reliable

It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.

# Advanced Usage of JsonCssExtractionStrategy

While the basic usage of JsonCssExtractionStrategy is powerful for simple structures, its true potential shines when dealing with complex, nested HTML structures. This section will explore advanced usage scenarios, demonstrating how to extract nested objects, lists, and nested lists.

## Hypothetical Website Example

Let's consider a hypothetical e-commerce website that displays product categories, each containing multiple products. Each product has details, reviews, and related items. This complex structure will allow us to demonstrate various advanced features of JsonCssExtractionStrategy.

Assume the HTML structure looks something like this:

```html
<div class="category">
  <h2 class="category-name">Electronics</h2>
  <div class="product">
    <h3 class="product-name">Smartphone X</h3>
    <p class="product-price">$999</p>
    <div class="product-details">
      <span class="brand">TechCorp</span>
      <span class="model">X-2000</span>
    </div>
    <ul class="product-features">
      <li>5G capable</li>
      <li>6.5" OLED screen</li>
      <li>128GB storage</li>
    </ul>
    <div class="product-reviews">
      <div class="review">
        <span class="reviewer">John D.</span>
        <span class="rating">4.5</span>
        <p class="review-text">Great phone, love the camera!</p>
      </div>
      <div class="review">
        <span class="reviewer">Jane S.</span>
        <span class="rating">5</span>
        <p class="review-text">Best smartphone I've ever owned.</p>
      </div>
    </div>
    <ul class="related-products">
      <li>
        <span class="related-name">Phone Case</span>
        <span class="related-price">$29.99</span>
      </li>
      <li>
        <span class="related-name">Screen Protector</span>
        <span class="related-price">$9.99</span>
      </li>
    </ul>
  </div>
  <!-- More products... -->
</div>
```

Now, let's create a schema to extract this complex structure:

```python
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "feature",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}
```

This schema demonstrates several advanced features:

1. **Nested Objects**: The `details` field is a nested object within each product.
2. **Simple Lists**: The `features` field is a simple list of text items.
3. **Nested Lists**: The `products` field is a nested list, where each item is a complex object.
4. **Lists of Objects**: The `reviews` and `related_products` fields are lists of objects. In the schema above, `reviews` is declared as `nested_list` and `related_products` as `list`.

Let's break down the key concepts:

### Nested Objects

To create a nested object, use `"type": "nested"` and provide a `fields` array for the nested structure:

```python
{
    "name": "details",
    "selector": "div.product-details",
    "type": "nested",
    "fields": [
        {
            "name": "brand",
            "selector": "span.brand",
            "type": "text"
        },
        {
            "name": "model",
            "selector": "span.model",
            "type": "text"
        }
    ]
}
```

### Simple Lists

For a simple list of identical items, use `"type": "list"`:

```python
{
    "name": "features",
    "selector": "ul.product-features li",
    "type": "list",
    "fields": [
        {
            "name": "feature",
            "type": "text"
        }
    ]
}
```

### Nested Lists

For a list of complex objects, use `"type": "nested_list"`:

```python
{
    "name": "products",
    "selector": "div.product",
    "type": "nested_list",
    "fields": [
        # ... fields for each product
    ]
}
```

### Lists of Objects

Similar to nested lists, but typically used for simpler objects within the list:

```python
{
    "name": "related_products",
    "selector": "ul.related-products li",
    "type": "list",
    "fields": [
        {
            "name": "name",
            "selector": "span.related-name",
            "type": "text"
        },
        {
            "name": "price",
            "selector": "span.related-price",
            "type": "text"
        }
    ]
}
```
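
When a schema nests several levels deep, it can help to list every field path before running a crawl. The helper below is a hypothetical convenience written for this guide, not part of Crawl4AI. It relies only on the schema being a plain dictionary in which any field may carry its own `fields` array.

```python
def walk_schema(fields, prefix=""):
    """Yield (path, type, selector) for every field, descending into nested ones."""
    for field in fields:
        path = f"{prefix}{field['name']}"
        yield path, field.get("type"), field.get("selector")
        # "nested", "list" and "nested_list" fields carry their own "fields" array
        if "fields" in field:
            yield from walk_schema(field["fields"], prefix=path + ".")

# A trimmed-down version of the catalog schema, for illustration
example_schema = {
    "name": "Example",
    "baseSelector": "div.category",
    "fields": [
        {"name": "category_name", "selector": "h2.category-name", "type": "text"},
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",
            "fields": [
                {"name": "name", "selector": "h3.product-name", "type": "text"},
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",
                    "fields": [
                        {"name": "brand", "selector": "span.brand", "type": "text"}
                    ],
                },
            ],
        },
    ],
}

paths = [path for path, _, _ in walk_schema(example_schema["fields"])]
for path, field_type, selector in walk_schema(example_schema["fields"]):
    print(f"{path:<28} {field_type:<12} {selector}")
```

Each dotted path shows where a value is expected to appear in the extracted JSON, which makes a misplaced or misnamed field easier to spot.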

## Using the Advanced Schema

To use this advanced schema with AsyncWebCrawler:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_complex_product_data():
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        product_data = json.loads(result.extracted_content)
        print(json.dumps(product_data, indent=2))

asyncio.run(extract_complex_product_data())
```

This will produce a structured JSON output that captures the complex hierarchy of the product catalog, including nested objects, lists, and nested lists.

## Tips for Advanced Usage

1. **Start Simple**: Begin with a basic schema and gradually add complexity.
2. **Test Incrementally**: Test each part of your schema separately before combining them.
3. **Use Chrome DevTools**: The Element Inspector is invaluable for identifying the correct selectors.
4. **Handle Missing Data**: Use the `default` key in your field definitions to handle cases where data might be missing.
5. **Leverage Transforms**: Use the `transform` key to clean or format extracted data (e.g., converting prices to numbers).
6. **Consider Performance**: Very complex schemas might slow down extraction. Balance complexity with performance needs.
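
If you prefer to clean values after extraction rather than inside the schema, a small post-processing step in plain Python is one option. The sketch below converts price strings such as those in the sample HTML into numbers. `parse_price` is a hypothetical helper, and the sample record is invented for illustration.

```python
import re

def parse_price(raw):
    """Convert a price string such as "$1,299.99" to a float; None if no number is found."""
    if raw is None:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if not match:
        return None
    return float(match.group().replace(",", ""))

# Hypothetical extracted record, shaped like the schema's product fields
product = {"name": "Smartphone X", "price": "$999"}
product["price_value"] = parse_price(product.get("price"))
print(product)
```

Returning `None` for missing or unparseable values keeps the cleaning step from raising on incomplete pages, so downstream code can decide how to treat gaps.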

By mastering these advanced techniques, you can use JsonCssExtractionStrategy to extract highly structured data from even the most complex web pages, making it a powerful tool for web scraping and data analysis tasks.

# JSON CSS Extraction Strategy with AsyncWebCrawler

The `JsonCssExtractionStrategy` is a powerful feature of Crawl4AI that allows you to extract structured data from web pages using CSS selectors. This method is particularly useful when you need to extract specific data points from a consistent HTML structure, such as tables or repeated elements. Here's how to use it with the AsyncWebCrawler.

## Overview

The `JsonCssExtractionStrategy` works by defining a schema that specifies:
1. A base CSS selector for the repeating elements
2. Fields to extract from each element, each with its own CSS selector

This strategy is fast and efficient, as it doesn't rely on external services like LLMs for extraction.

## Example: Extracting Cryptocurrency Prices from Coinbase

Let's look at an example that extracts cryptocurrency prices from the Coinbase explore page.

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    
    # Define the extraction schema
    schema = {
        "name": "Coinbase Crypto Prices",
        "baseSelector": ".cds-tableRow-t45thuk",
        "fields": [
            {
                "name": "crypto",
                "selector": "td:nth-child(1) h2",
                "type": "text",
            },
            {
                "name": "symbol",
                "selector": "td:nth-child(1) p",
                "type": "text",
            },
            {
                "name": "price",
                "selector": "td:nth-child(2)",
                "type": "text",
            }
        ],
    }

    # Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # Use the AsyncWebCrawler with the extraction strategy
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.coinbase.com/explore",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        # Parse the extracted content
        crypto_prices = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(crypto_prices)} cryptocurrency prices")
        print(json.dumps(crypto_prices[0], indent=2))

    return crypto_prices

# Run the async function
asyncio.run(extract_structured_data_using_css_extractor())
```

## Explanation of the Schema

The schema defines how to extract the data:

- `name`: A descriptive name for the extraction task.
- `baseSelector`: The CSS selector for the repeating elements (in this case, table rows).
- `fields`: An array of fields to extract from each element:
  - `name`: The name to give the extracted data.
  - `selector`: The CSS selector to find the specific data within the base element.
  - `type`: The type of data to extract (usually "text" for textual content).
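
Since a schema is an ordinary dictionary, a quick structural check before crawling can catch typos early. The following is a minimal sketch, not a Crawl4AI feature: `validate_schema` is a hypothetical helper, and the set of keys it treats as required is an assumption based on the description above (`selector` is left optional because some list item fields omit it).

```python
def validate_schema(schema):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for index, field in enumerate(schema.get("fields", [])):
        for key in ("name", "type"):
            if key not in field:
                problems.append(f"field #{index} is missing '{key}'")
    return problems

good_schema = {
    "name": "Coinbase Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {"name": "crypto", "selector": "td:nth-child(1) h2", "type": "text"},
        {"name": "price", "selector": "td:nth-child(2)", "type": "text"},
    ],
}
# Deliberately broken: no baseSelector, and the field has no type
bad_schema = {"name": "Broken", "fields": [{"name": "price"}]}

print(validate_schema(good_schema))
print(validate_schema(bad_schema))
```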

## Advantages of JsonCssExtractionStrategy

1. **Speed**: CSS selectors are fast to execute, making this method efficient for large datasets.
2. **Precision**: You can target exactly the elements you need.
3. **Structured Output**: The result is already structured as JSON, ready for further processing.
4. **No External Dependencies**: Unlike LLM-based strategies, this doesn't require any API calls to external services.

## Tips for Using JsonCssExtractionStrategy

1. **Inspect the Page**: Use browser developer tools to identify the correct CSS selectors.
2. **Test Selectors**: Verify your selectors in the browser console before using them in the script.
3. **Handle Dynamic Content**: If the page uses JavaScript to load content, you may need to combine this with JS execution (see the Advanced Usage section).
4. **Error Handling**: Always check the `result.success` flag and handle potential failures.

## Advanced Usage: Combining with JavaScript Execution

For pages that load data dynamically, you can combine the `JsonCssExtractionStrategy` with JavaScript execution:

```python
async def extract_dynamic_structured_data():
    schema = {
        "name": "Dynamic Crypto Prices",
        "baseSelector": ".crypto-row",
        "fields": [
            {"name": "name", "selector": ".crypto-name", "type": "text"},
            {"name": "price", "selector": ".crypto-price", "type": "text"},
        ]
    }

    js_code = """
    window.scrollTo(0, document.body.scrollHeight);
    await new Promise(resolve => setTimeout(resolve, 2000));  // Wait for 2 seconds
    """

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com/crypto-prices",
            extraction_strategy=extraction_strategy,
            js_code=js_code,
            wait_for=".crypto-row:nth-child(20)",  # Wait for 20 rows to load
            bypass_cache=True,
        )

        crypto_data = json.loads(result.extracted_content)
        print(f"Extracted {len(crypto_data)} cryptocurrency entries")

asyncio.run(extract_dynamic_structured_data())
```

This advanced example demonstrates how to:
1. Execute JavaScript to trigger dynamic content loading.
2. Wait for a specific condition (20 rows loaded) before extraction.
3. Extract data from the dynamically loaded content.

By mastering the `JsonCssExtractionStrategy`, you can efficiently extract structured data from a wide variety of web pages, making it a valuable tool in your web scraping toolkit.

For more details on schema definitions and advanced extraction strategies, check out [Advanced JsonCssExtraction](./css-advanced.md).

# LLM Extraction with AsyncWebCrawler

Crawl4AI's AsyncWebCrawler allows you to use Language Models (LLMs) to extract structured data or relevant content from web pages asynchronously. Below are two examples demonstrating how to use `LLMExtractionStrategy` for different purposes with the AsyncWebCrawler.

## Example 1: Extract Structured Data

In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.

```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def extract_openai_fees():
    url = 'https://openai.com/api/pricing/'

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", # Or use ollama like provider="ollama/nemotron"
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
                instruction="From the crawled content, extract all mentioned model names along with their "
                            "fees for input and output tokens. Make sure not to miss anything in the entire content. "
                            'One extracted model JSON format should look like this: '
                            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
            ),
            bypass_cache=True,
        )

    model_fees = json.loads(result.extracted_content)
    print(f"Number of models extracted: {len(model_fees)}")

    os.makedirs(".data", exist_ok=True)  # Make sure the output directory exists
    with open(".data/openai_fees.json", "w", encoding="utf-8") as f:
        json.dump(model_fees, f, indent=2)

asyncio.run(extract_openai_fees())
```
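
LLM output does not always follow the requested schema, so it can be worth checking the parsed items before saving them. The sketch below is a standard-library approach that separates complete items from incomplete ones. `split_valid_items` is a hypothetical helper, and the sample list is invented, reusing the example values from the instruction string above.

```python
REQUIRED_KEYS = {"model_name", "input_fee", "output_fee"}

def split_valid_items(items):
    """Separate items that contain every required key from those that do not."""
    valid, invalid = [], []
    for item in items:
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item):
            valid.append(item)
        else:
            invalid.append(item)
    return valid, invalid

# Hypothetical parsed output, e.g. the result of json.loads(result.extracted_content)
sample = [
    {
        "model_name": "GPT-4",
        "input_fee": "US$10.00 / 1M tokens",
        "output_fee": "US$30.00 / 1M tokens",
    },
    {"model_name": "incomplete entry"},
]

valid, invalid = split_valid_items(sample)
print(f"{len(valid)} valid, {len(invalid)} invalid")
```

If `pydantic` is already in use, validating each item against the `OpenAIModelFee` model would be a stricter alternative to this key check.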

## Example 2: Extract Relevant Content

In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.

```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_tech_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                instruction="Extract only content related to technology"
            ),
            bypass_cache=True,
        )

    tech_content = json.loads(result.extracted_content)
    print(f"Number of tech-related items extracted: {len(tech_content)}")

    os.makedirs(".data", exist_ok=True)  # Make sure the output directory exists
    with open(".data/tech_content.json", "w", encoding="utf-8") as f:
        json.dump(tech_content, f, indent=2)

asyncio.run(extract_tech_content())
```

## Advanced Usage: Combining JS Execution with LLM Extraction

This example demonstrates how to combine JavaScript execution with LLM extraction to handle dynamic content:

```python
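import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_dynamic_content():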
    js_code = """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    if (loadMoreButton) {
        loadMoreButton.click();
        await new Promise(resolve => setTimeout(resolve, 2000));
    }
    """

    wait_for = """
    () => {
        const articles = document.querySelectorAll('article.tease-card');
        return articles.length > 10;
    }
    """

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            wait_for=wait_for,
            css_selector="article.tease-card",
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                instruction="Summarize each article, focusing on technology-related content"
            ),
            bypass_cache=True,
        )

    summaries = json.loads(result.extracted_content)
    print(f"Number of summarized articles: {len(summaries)}")

    os.makedirs(".data", exist_ok=True)  # Make sure the output directory exists
    with open(".data/tech_summaries.json", "w", encoding="utf-8") as f:
        json.dump(summaries, f, indent=2)

asyncio.run(extract_dynamic_content())
```

## Customizing LLM Provider

Crawl4AI uses the `litellm` library under the hood, which allows you to use any LLM provider you want. Just pass the correct model name and API token:

```python
extraction_strategy=LLMExtractionStrategy(
    provider="your_llm_provider/model_name",
    api_token="your_api_token",
    instruction="Your extraction instruction"
)
```

This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.

## Error Handling and Retries

When working with external LLM APIs, it's important to handle potential errors and implement retry logic. Here's an example of how you might do this:

```python
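import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from tenacity import retry, stop_after_attempt, wait_exponential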

class LLMExtractionError(Exception):
    pass

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def extract_with_retry(crawler, url, extraction_strategy):
    try:
        result = await crawler.arun(url=url, extraction_strategy=extraction_strategy, bypass_cache=True)
        return json.loads(result.extracted_content)
    except Exception as e:
        raise LLMExtractionError(f"Failed to extract content: {e}") from e

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        try:
            content = await extract_with_retry(
                crawler,
                "https://www.example.com",
                LLMExtractionStrategy(
                    provider="openai/gpt-4o",
                    api_token=os.getenv('OPENAI_API_KEY'),
                    instruction="Extract and summarize main points"
                )
            )
            print("Extracted content:", content)
        except LLMExtractionError as e:
            print(f"Extraction failed after retries: {e}")

asyncio.run(main())
```

This example uses the `tenacity` library to implement a retry mechanism with exponential backoff, which can help handle temporary failures or rate limiting from the LLM API.

# Extraction Strategies Overview

Crawl4AI provides powerful extraction strategies to help you get structured data from web pages. Each strategy is designed for specific use cases and offers different approaches to data extraction.

## Available Strategies

### [LLM-Based Extraction](llm.md)

`LLMExtractionStrategy` uses large language models (LLMs) to extract structured data from web content. This approach is highly flexible and can interpret content semantically.

```python
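import asyncio

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: float
    description: str

strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    schema=Product.schema(),
    instruction="Extract product details from the page"
)

async def main():
    # `await` needs an async context, so the crawl runs inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/product",
            extraction_strategy=strategy
        )
        print(result.extracted_content)

asyncio.run(main())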
```

**Best for:**
- Complex data structures
- Content requiring interpretation
- Flexible content formats
- Natural language processing

### [CSS-Based Extraction](css.md)

`JsonCssExtractionStrategy` extracts data using CSS selectors. This is fast, reliable, and perfect for consistently structured pages.

```python
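import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",  # one match per extracted item
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
    ]
}

strategy = JsonCssExtractionStrategy(schema)

async def main():
    # `await` needs an async context, so the crawl runs inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            extraction_strategy=strategy
        )
        print(result.extracted_content)

asyncio.run(main())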
```

**Best for:**
- E-commerce product listings
- News article collections
- Structured content pages
- High-performance needs
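
A mistake in the schema may only show up at crawl time, so it can help to check the dictionary first. The sketch below is a hypothetical helper, not part of Crawl4AI. It assumes only the keys used in the example above: `name`, `baseSelector`, and `fields` at the top level, and `name`, `selector`, and `type` per field, plus `attribute` when `type` is `"attribute"`.

```python
def schema_problems(schema):
    """Return a list of human-readable problems found in a CSS extraction schema."""
    problems = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"field {i}: missing key: {key}")
        # Attribute fields need to say which attribute to read.
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {i}: type 'attribute' requires an 'attribute' key")
    return problems

good_schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}
bad_schema = {
    "name": "Product Listing",
    "fields": [{"name": "image", "selector": "img", "type": "attribute"}],
}

print(schema_problems(good_schema))
print(schema_problems(bad_schema))
```

An empty list means the schema has the shape this sketch expects; it says nothing about whether the selectors match the page.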

### [Cosine Strategy](cosine.md)

`CosineStrategy` uses similarity-based clustering to identify and extract relevant content sections.

```python
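import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",    # Content focus
    word_count_threshold=10,             # Minimum words per cluster
    sim_threshold=0.3,                   # Similarity threshold
    max_dist=0.2,                        # Maximum cluster distance
    top_k=3                             # Number of top clusters to extract
)

async def main():
    # `await` needs an async context, so the crawl runs inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/reviews",
            extraction_strategy=strategy
        )
        print(result.extracted_content)

asyncio.run(main())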
```

**Best for:**
- Content similarity analysis
- Topic clustering
- Relevant content extraction
- Pattern recognition in text
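
To make "similarity-based" concrete, here is a toy ranking of text sections by cosine similarity to a query. It is only a sketch under simplifying assumptions: it uses bag-of-words counts and no clustering, and `rank_sections` is a hypothetical function, not the algorithm `CosineStrategy` runs. The `top_k` and `word_count_threshold` names are borrowed from the example above purely for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_sections(sections, query, top_k=3, word_count_threshold=0):
    """Return up to top_k (score, section) pairs, most similar to the query first."""
    query_vec = bag_of_words(query)
    scored = []
    for section in sections:
        # Skip sections that are too short to be meaningful.
        if len(section.split()) < word_count_threshold:
            continue
        scored.append((cosine_similarity(bag_of_words(section), query_vec), section))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

sections = [
    "Product reviews from verified buyers praise the battery life",
    "Shipping and returns policy for international orders",
    "More product reviews from customers",
]
for score, section in rank_sections(sections, "product reviews", top_k=2):
    print(f"{score:.2f}  {section}")
```

Sections that share no terms with the query score zero, which is the intuition behind filtering on a similarity threshold.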

## Strategy Selection Guide

Choose your strategy based on these factors:

1. **Content Structure**
   - Well-structured HTML → Use CSS Strategy
   - Natural language text → Use LLM Strategy
   - Mixed/Complex content → Use Cosine Strategy

2. **Performance Requirements**
   - Fastest: CSS Strategy
   - Moderate: Cosine Strategy
   - Variable: LLM Strategy (depends on provider)

3. **Accuracy Needs**
   - Highest structure accuracy: CSS Strategy
   - Best semantic understanding: LLM Strategy
   - Best content relevance: Cosine Strategy
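
The content-structure rule above can be written down as a simple lookup. This is a hypothetical helper for illustration only; the category labels (`structured_html`, `natural_language`, `mixed`) are assumptions made for this sketch and are not Crawl4AI identifiers.

```python
def suggest_strategy(content_structure):
    """Map a content description to the strategy class name suggested in the guide."""
    suggestions = {
        "structured_html": "JsonCssExtractionStrategy",
        "natural_language": "LLMExtractionStrategy",
        "mixed": "CosineStrategy",
    }
    try:
        return suggestions[content_structure]
    except KeyError:
        valid = ", ".join(sorted(suggestions))
        raise ValueError(f"unknown content structure; expected one of: {valid}") from None

print(suggest_strategy("structured_html"))
```

Treat the result as a starting point: performance and accuracy needs, listed above, can override it.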

## Combining Strategies

You can run more than one strategy against the same page, for example a CSS pass for structure followed by an LLM pass for semantic analysis:

```python
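# Assumes an open `crawler`, plus `css_strategy` and `llm_strategy` instances
# built as in the sections above.
# First use CSS strategy for initial structure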
css_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=css_strategy
)

# Then use LLM for semantic analysis
llm_result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=llm_strategy
)
```
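
How the two results are brought together is up to the application. As one possibility, the sketch below keeps both payloads side by side in a single structure. `combine_extractions` and the sample payloads are hypothetical; the sketch assumes only that `extracted_content` is a JSON string, as in the other examples in this document.

```python
import json

def combine_extractions(css_content, llm_content):
    """Parse both JSON payloads and keep them side by side under named keys."""
    return {
        # An empty or missing payload becomes an empty list.
        "structured": json.loads(css_content) if css_content else [],
        "semantic": json.loads(llm_content) if llm_content else [],
    }

# Sample strings standing in for css_result.extracted_content and
# llm_result.extracted_content.
css_content = '[{"title": "Widget", "price": "$9.99"}]'
llm_content = '[{"summary": "A small widget sold at a low price."}]'

combined = combine_extractions(css_content, llm_content)
print(json.dumps(combined, indent=2))
```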

## Common Use Cases

1. **E-commerce Scraping**
   ```python
   # CSS Strategy for product listings
   schema = {
       "name": "Products",
       "baseSelector": ".product",
       "fields": [
           {"name": "name", "selector": ".title", "type": "text"},
           {"name": "price", "selector": ".price", "type": "text"}
       ]
   }
   ```

2. **News Article Extraction**
   ```python
   # LLM Strategy for article content
   class Article(BaseModel):
       title: str
       content: str
       author: str
       date: str

   strategy = LLMExtractionStrategy(
       provider="ollama/llama2",
       schema=Article.schema()
   )
   ```

3. **Content Analysis**
   ```python
   # Cosine Strategy for topic analysis
   strategy = CosineStrategy(
       semantic_filter="technology trends",
       top_k=5
   )
   ```

## Best Practices

1. **Choose the Right Strategy**
   - Start with CSS for structured data
   - Use LLM for complex interpretation
   - Try Cosine for content relevance

2. **Optimize Performance**
   - Cache LLM results
   - Keep CSS selectors specific
   - Tune similarity thresholds

3. **Handle Errors**
   ```python
   result = await crawler.arun(
       url="https://example.com",
       extraction_strategy=strategy
   )
   
   if not result.success:
       print(f"Extraction failed: {result.error_message}")
   else:
       data = json.loads(result.extracted_content)
   ```
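
The check above can be wrapped in a small helper that also guards against content that is missing or not valid JSON. This is a sketch: `parse_extraction` is a hypothetical helper, and the `SimpleNamespace` objects stand in for crawl results, assuming only the `success`, `error_message`, and `extracted_content` attributes used above.

```python
import json
from types import SimpleNamespace

def parse_extraction(result):
    """Return (data, error): parsed JSON and None on success, or None and a message."""
    if not result.success:
        return None, f"Extraction failed: {result.error_message}"
    if not result.extracted_content:
        return None, "Extraction returned no content"
    try:
        return json.loads(result.extracted_content), None
    except json.JSONDecodeError as exc:
        return None, f"Extracted content is not valid JSON: {exc}"

# Stand-in results for illustration; a real one would come from crawler.arun().
ok = SimpleNamespace(success=True, error_message=None, extracted_content='[{"title": "A"}]')
failed = SimpleNamespace(success=False, error_message="timeout", extracted_content=None)

for result in (ok, failed):
    data, error = parse_extraction(result)
    print(data if error is None else error)
```

Returning an error message instead of raising keeps the calling code flat; raising a custom exception would work equally well if that fits the codebase better.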

Each strategy has its strengths and optimal use cases. Explore the detailed documentation for each strategy to learn more about their specific features and configurations.