# Conversation Summarization

DeerFlow includes automatic conversation summarization to handle long conversations that approach model token limits. When enabled, the system automatically condenses older messages while preserving recent context.

## Overview

The summarization feature uses LangChain's `SummarizationMiddleware` to monitor conversation history and trigger summarization based on configurable thresholds. When activated, it:

1. Monitors message token counts in real-time
2. Triggers summarization when thresholds are met
3. Keeps recent messages intact while summarizing older exchanges
4. Maintains AI/Tool message pairs together for context continuity
5. Injects the summary back into the conversation

## Configuration

Summarization is configured in `config.yaml` under the `summarization` key:

```yaml
summarization:
  enabled: true
  model_name: null  # Use default model or specify a lightweight model

  # Trigger conditions (OR logic - any condition triggers summarization)
  trigger:
    - type: tokens
      value: 4000
    # Additional triggers (optional)
    # - type: messages
    #   value: 50
    # - type: fraction
    #   value: 0.8  # 80% of model's max input tokens

  # Context retention policy
  keep:
    type: messages
    value: 20

  # Token trimming for summarization call
  trim_tokens_to_summarize: 4000

  # Custom summary prompt (optional)
  summary_prompt: null
```
### Configuration Options

#### `enabled`
- **Type**: Boolean
- **Default**: `false`
- **Description**: Enable or disable automatic summarization

#### `model_name`
- **Type**: String or null
- **Default**: `null` (uses the default model)
- **Description**: Model to use for generating summaries. A lightweight, cost-effective model such as `gpt-4o-mini` or equivalent is recommended.

#### `trigger`
- **Type**: Single `ContextSize` object or list of `ContextSize` objects
- **Required**: At least one trigger must be specified when summarization is enabled
- **Description**: Thresholds that trigger summarization. Uses OR logic: summarization runs when ANY threshold is met.

**ContextSize Types:**

1. **Token-based trigger**: Activates when the token count reaches the specified value
   ```yaml
   trigger:
     type: tokens
     value: 4000
   ```

2. **Message-based trigger**: Activates when the message count reaches the specified value
   ```yaml
   trigger:
     type: messages
     value: 50
   ```

3. **Fraction-based trigger**: Activates when token usage reaches a percentage of the model's maximum input tokens
   ```yaml
   trigger:
     type: fraction
     value: 0.8  # 80% of max input tokens
   ```

**Multiple Triggers:**
```yaml
trigger:
  - type: tokens
    value: 4000
  - type: messages
    value: 50
```
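The OR logic across triggers can be sketched as follows. The `ContextSize` dataclass and `should_summarize` helper here are illustrative stand-ins, not DeerFlow's or LangChain's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch of the OR logic described above; the names below
# (ContextSize, should_summarize) are hypothetical, not the actual API.
@dataclass
class ContextSize:
    type: str    # "tokens", "messages", or "fraction"
    value: float

def should_summarize(triggers, token_count, message_count, max_input_tokens):
    """Return True if ANY configured threshold is met (OR logic)."""
    for t in triggers:
        if t.type == "tokens" and token_count >= t.value:
            return True
        if t.type == "messages" and message_count >= t.value:
            return True
        if t.type == "fraction" and token_count >= t.value * max_input_tokens:
            return True
    return False

triggers = [ContextSize("tokens", 4000), ContextSize("messages", 50)]
# Token threshold not met, but message threshold is -> summarize.
print(should_summarize(triggers, token_count=3500, message_count=60,
                       max_input_tokens=8000))  # True
```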
#### `keep`
- **Type**: `ContextSize` object
- **Default**: `{type: messages, value: 20}`
- **Description**: Specifies how much recent conversation history to preserve after summarization.

**Examples:**
```yaml
# Keep the most recent 20 messages
keep:
  type: messages
  value: 20

# Keep the most recent 3000 tokens
keep:
  type: tokens
  value: 3000

# Keep the most recent 30% of the model's max input tokens
keep:
  type: fraction
  value: 0.3
```

#### `trim_tokens_to_summarize`
- **Type**: Integer or null
- **Default**: `4000`
- **Description**: Maximum number of tokens to include when preparing messages for the summarization call itself. Set to `null` to skip trimming (not recommended for very long conversations).

#### `summary_prompt`
- **Type**: String or null
- **Default**: `null` (uses LangChain's default prompt)
- **Description**: Custom prompt template for generating summaries. The prompt should guide the model to extract the most important context.

**Default Prompt Behavior:**
The default LangChain prompt instructs the model to:
- Extract the highest-quality, most relevant context
- Focus on information critical to the overall goal
- Avoid repeating completed actions
- Return only the extracted context

## How It Works

### Summarization Flow

1. **Monitoring**: Before each model call, the middleware counts the tokens in the message history
2. **Trigger Check**: If any configured threshold is met, summarization is triggered
3. **Message Partitioning**: Messages are split into:
   - Messages to summarize (older messages beyond the `keep` threshold)
   - Messages to preserve (recent messages within the `keep` threshold)
4. **Summary Generation**: The model generates a concise summary of the older messages
5. **Context Replacement**: The message history is updated:
   - All old messages are removed
   - A single summary message is added
   - Recent messages are preserved
6. **AI/Tool Pair Protection**: The system ensures AI messages and their corresponding tool messages stay together

### Token Counting

- Uses approximate token counting based on character count
- For Anthropic models: ~3.3 characters per token
- For other models: uses LangChain's default estimation
- Can be customized with a custom `token_counter` function
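The character-based approximation amounts to dividing the total character count by a per-model ratio. A minimal sketch (the helper name is illustrative, not DeerFlow's actual code):

```python
# Illustrative sketch of character-based token estimation; the helper name
# is an assumption, not DeerFlow's actual implementation.
CHARS_PER_TOKEN_ANTHROPIC = 3.3  # ~3.3 characters per token, per the docs above

def approx_token_count(messages, chars_per_token=CHARS_PER_TOKEN_ANTHROPIC):
    """Estimate total tokens from the total character count."""
    total_chars = sum(len(m) for m in messages)
    return int(total_chars / chars_per_token)

messages = ["What is the capital of France?", "The capital of France is Paris."]
print(approx_token_count(messages))  # 61 chars / 3.3 -> 18
```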
### Message Preservation

The middleware intelligently preserves message context:

- **Recent Messages**: Always kept intact based on `keep` configuration
- **AI/Tool Pairs**: Never split - if a cutoff point falls within tool messages, the system adjusts to keep the entire AI + Tool message sequence together
- **Summary Format**: Summary is injected as a HumanMessage with the format:
  ```
  Here is a summary of the conversation to date:

  [Generated summary text]
  ```
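The pair-protection rule can be sketched as a cutoff adjustment: if the boundary would land on a tool message, walk it back to the AI message that issued the tool call. The message representation below is a simplified stand-in, not LangChain's actual message types:

```python
# Illustrative sketch of AI/Tool pair protection during partitioning; the
# dict-based message format and helper name are assumptions, not actual code.
def safe_cutoff(messages, keep_last):
    """Return an index splitting messages into [to summarize][to keep],
    adjusted so an AI message and its tool results are never separated."""
    cutoff = max(len(messages) - keep_last, 0)
    # If the cutoff lands on a tool result, walk back to the AI message
    # that issued the tool call so the pair stays together.
    while cutoff > 0 and messages[cutoff]["role"] == "tool":
        cutoff -= 1
    return cutoff

history = [
    {"role": "human", "content": "Find recent AI papers"},
    {"role": "ai", "content": "", "tool_calls": ["search"]},
    {"role": "tool", "content": "...results..."},
    {"role": "ai", "content": "Here are three papers."},
]
# Naively keeping the last 2 messages would cut at index 2 (a tool message);
# the cutoff moves back to index 1 so the AI + Tool pair stays together.
print(safe_cutoff(history, keep_last=2))  # 1
```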
## Best Practices

### Choosing Trigger Thresholds

1. **Token-based triggers**: Recommended for most use cases
   - Set to 60-80% of your model's context window
   - Example: for an 8K context, use 4000-6000 tokens

2. **Message-based triggers**: Useful for controlling conversation length
   - Good for applications with many short messages
   - Example: 50-100 messages, depending on average message length

3. **Fraction-based triggers**: Ideal when using multiple models
   - Automatically adapts to each model's capacity
   - Example: 0.8 (80% of the model's max input tokens)

### Choosing Retention Policy (`keep`)

1. **Message-based retention**: Best for most scenarios
   - Preserves natural conversation flow
   - Recommended: 15-25 messages

2. **Token-based retention**: Use when precise control is needed
   - Good for managing exact token budgets
   - Recommended: 2000-4000 tokens

3. **Fraction-based retention**: For multi-model setups
   - Automatically scales with model capacity
   - Recommended: 0.2-0.4 (20-40% of max input)

### Model Selection

- **Recommended**: Use a lightweight, cost-effective model for summaries
  - Examples: `gpt-4o-mini`, `claude-haiku`, or equivalent
  - Summaries don't require the most powerful models
  - Significant cost savings on high-volume applications

- **Default**: If `model_name` is `null`, the default model is used
  - May be more expensive, but ensures consistency
  - Good for simple setups

### Optimization Tips

1. **Balance triggers**: Combine token and message triggers for robust handling
   ```yaml
   trigger:
     - type: tokens
       value: 4000
     - type: messages
       value: 50
   ```

2. **Conservative retention**: Keep more messages initially, then adjust based on performance
   ```yaml
   keep:
     type: messages
     value: 25  # Start higher, reduce if needed
   ```

3. **Trim strategically**: Limit the tokens sent to the summarization model
   ```yaml
   trim_tokens_to_summarize: 4000  # Prevents expensive summarization calls
   ```

4. **Monitor and iterate**: Track summary quality and adjust the configuration

## Troubleshooting

### Summary Quality Issues

**Problem**: Summaries lose important context

**Solutions**:
1. Increase the `keep` value to preserve more messages
2. Decrease trigger thresholds to summarize earlier
3. Customize `summary_prompt` to emphasize key information
4. Use a more capable model for summarization

### Performance Issues

**Problem**: Summarization calls take too long

**Solutions**:
1. Use a faster model for summaries (e.g., `gpt-4o-mini`)
2. Reduce `trim_tokens_to_summarize` to send less context
3. Increase trigger thresholds to summarize less frequently

### Token Limit Errors

**Problem**: Still hitting token limits despite summarization

**Solutions**:
1. Lower trigger thresholds to summarize earlier
2. Reduce the `keep` value to preserve fewer messages
3. Check whether individual messages are very large
4. Consider using fraction-based triggers

## Implementation Details

### Code Structure

- **Configuration**: `src/config/summarization_config.py`
- **Integration**: `src/agents/lead_agent/agent.py`
- **Middleware**: Uses `langchain.agents.middleware.SummarizationMiddleware`

### Middleware Order

Summarization runs after ThreadData and Sandbox initialization, but before Title and Clarification:

1. ThreadDataMiddleware
2. SandboxMiddleware
3. **SummarizationMiddleware** ← runs here
4. TitleMiddleware
5. ClarificationMiddleware
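One plausible reading of this ordering is that later middleware see the history produced by earlier middleware, so Title and Clarification would operate on the already-condensed conversation. A toy pipeline illustrating the order (these classes are placeholders, not DeerFlow's actual middleware implementations):

```python
# Toy pipeline illustrating the documented middleware order; every class here
# is a placeholder, not DeerFlow's actual middleware.
class Middleware:
    def before_model(self, state):
        # Record the order in which each middleware runs before a model call.
        state.setdefault("trace", []).append(type(self).__name__)
        return state

class ThreadDataMiddleware(Middleware): pass
class SandboxMiddleware(Middleware): pass
class SummarizationMiddleware(Middleware): pass
class TitleMiddleware(Middleware): pass
class ClarificationMiddleware(Middleware): pass

# The documented order: summarization runs third.
stack = [
    ThreadDataMiddleware(),
    SandboxMiddleware(),
    SummarizationMiddleware(),
    TitleMiddleware(),
    ClarificationMiddleware(),
]

state = {}
for mw in stack:
    state = mw.before_model(state)
print(state["trace"][2])  # SummarizationMiddleware
```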
### State Management

- Summarization is stateless - the configuration is loaded once at startup
- Summaries are added as regular messages in the conversation history
- The checkpointer persists the summarized history automatically

## Example Configurations

### Minimal Configuration
```yaml
summarization:
  enabled: true
  trigger:
    type: tokens
    value: 4000
  keep:
    type: messages
    value: 20
```

### Production Configuration
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini  # Lightweight model for cost efficiency
  trigger:
    - type: tokens
      value: 6000
    - type: messages
      value: 75
  keep:
    type: messages
    value: 25
  trim_tokens_to_summarize: 5000
```

### Multi-Model Configuration
```yaml
summarization:
  enabled: true
  model_name: gpt-4o-mini
  trigger:
    type: fraction
    value: 0.7  # 70% of the model's max input
  keep:
    type: fraction
    value: 0.3  # Keep 30% of max input
  trim_tokens_to_summarize: 4000
```

### Conservative Configuration (High Quality)
```yaml
summarization:
  enabled: true
  model_name: gpt-4  # Use a full model for high-quality summaries
  trigger:
    type: tokens
    value: 8000
  keep:
    type: messages
    value: 40  # Keep more context
  trim_tokens_to_summarize: null  # No trimming
```

## References

- [LangChain Summarization Middleware Documentation](https://docs.langchain.com/oss/python/langchain/middleware/built-in#summarization)
- [LangChain Source Code](https://github.com/langchain-ai/langchain)