# n8n Workflow Training Datasets
This directory contains training datasets for fine-tuning Large Language Models (LLMs) to generate n8n workflows from natural language descriptions.
## Dataset Format
Each dataset file is a JSON array containing training examples in a conversational format:
```json
[
{
"messages": [
{
"role": "user",
"content": "When a new email arrives in Gmail, save the attachment to Google Drive."
},
{
"role": "assistant",
"content": "{\"name\": \"Email to Drive\", \"nodes\": [...], \"connections\": {...}, \"active\": false}"
}
]
}
]
```
### Structure
- **`role: "user"`** - Natural language description of the workflow to create
- **`role: "assistant"`** - JSON representation of the complete n8n workflow
The assistant's `content` is a JSON-encoded string (note the escaped quotes in the example above) that holds a valid n8n workflow definition with:
- `name`: Workflow name
- `nodes`: Array of node definitions (triggers, actions, transformations)
- `connections`: Object defining how nodes are connected
- `active`: Boolean indicating whether the workflow is active (usually `false` for templates)
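Because the workflow is stored as a string inside the message, it has to be decoded a second time before its fields can be inspected. The following is a minimal sketch of that step; the `example` record and the `extract_workflow` helper are illustrative and not part of this repository, and the empty `nodes` and `connections` values are placeholders.

```python
import json

# Hypothetical record shaped like the examples in this dataset.
# The workflow body is a placeholder, not real n8n output.
example = {
    "messages": [
        {
            "role": "user",
            "content": "When a new email arrives in Gmail, save the attachment to Google Drive.",
        },
        {
            "role": "assistant",
            "content": json.dumps(
                {"name": "Email to Drive", "nodes": [], "connections": {}, "active": False}
            ),
        },
    ]
}


def extract_workflow(record):
    """Return (prompt, workflow_dict) for one training example."""
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    # The assistant content is itself JSON, so decode it a second time.
    workflow = json.loads(by_role["assistant"])
    return by_role["user"], workflow


prompt, workflow = extract_workflow(example)
print(workflow["name"], sorted(workflow))
```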
## Dataset Files
### dataset_001.json
- **Size**: 2.5 MB
- **Examples**: 3,061 workflow examples
- **Status**: ✅ Valid JSON
- **Focus**: Common workflow patterns (Gmail, Slack, Google Sheets, Trello, Airtable, Notion, etc.)
### dataset_002.json
- **Size**: 4.9 MB
- **Status**: ⚠️ JSON parsing errors detected
- **Note**: May require cleaning before use
### dataset_003.json
- **Size**: 14.0 MB
- **Status**: ⚠️ JSON parsing errors detected
- **Note**: May require cleaning before use
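To see where a file stops parsing, a check along these lines may help. It is a sketch only: `check_json_file` is a hypothetical helper, and it assumes the three files sit in the current working directory.

```python
import json
from pathlib import Path


def check_json_file(path):
    """Return (ok, detail): example count on success, error position on failure."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except json.JSONDecodeError as err:
        # lineno/colno point at the first character the parser rejected.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, f"{len(data)} examples"


if __name__ == "__main__":
    for name in ("dataset_001.json", "dataset_002.json", "dataset_003.json"):
        path = Path(name)
        if path.exists():
            ok, detail = check_json_file(path)
            print(f"{name}: {'valid' if ok else 'invalid'} ({detail})")
```

The reported position is only the first failure; a file may contain further errors after it.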
## Common Workflow Patterns
Based on dataset_001.json analysis, the most common patterns include:
1. **Email Automation**
- Gmail → Google Drive (save attachments)
- Gmail → Slack (notifications)
- Gmail → Airtable (create records)
2. **Spreadsheet Integration**
- Google Sheets → Slack (new row notifications)
- Google Sheets → Gmail (alerts)
- Airtable → Google Sheets (sync data)
3. **Project Management**
- Trello → Slack (card updates)
- Trello → Google Calendar (deadline tracking)
- GitHub → Trello (issue tracking)
4. **Notification Workflows**
- Slack reactions → Airtable (logging)
- Calendar events → Email reminders
- Notion updates → Slack posts
## Usage for LLM Training
### Fine-tuning Format
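Each example uses the `messages` structure of OpenAI's chat fine-tuning format, so the datasets fit that and similar training pipelines once the JSON array is written out as one example per line (JSONL). Each example teaches the model to: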
1. Parse natural language workflow requests
2. Identify required n8n nodes
3. Configure node parameters
4. Establish proper connections between nodes
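The dataset files are JSON arrays, while OpenAI's fine-tuning upload expects JSONL with one `{"messages": [...]}` object per line. A conversion could look like the sketch below; `to_jsonl` and the output filename are assumptions made for illustration.

```python
import json


def to_jsonl(examples, out_path):
    """Write each example as one JSON object per line and return the count."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for ex in examples:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            fh.write(json.dumps({"messages": ex["messages"]}, ensure_ascii=False))
            fh.write("\n")
    return len(examples)


# Usage, assuming dataset_001.json is in the working directory:
#   with open("dataset_001.json", encoding="utf-8") as fh:
#       to_jsonl(json.load(fh), "dataset_001.jsonl")
```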
### Recommended Preprocessing
Before using these datasets:
1. **Validate JSON**: Verify all files parse correctly
2. **Deduplicate**: Remove duplicate examples (some duplicates exist)
3. **Filter**: Optionally filter by specific integrations or complexity
4. **Balance**: Ensure diverse node types are represented
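The deduplicate, filter, and balance steps can be sketched as follows. All three helpers are hypothetical. `node_type_counts` additionally assumes that each node object carries a `type` field, which should be verified against the actual data before relying on the counts.

```python
import json
from collections import Counter


def deduplicate(examples):
    """Drop examples whose (role, content) sequence was already seen."""
    seen = set()
    unique = []
    for ex in examples:
        key = tuple((m["role"], m["content"]) for m in ex["messages"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def mentions(example, keyword):
    """True if any user message mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return any(
        m["role"] == "user" and needle in m["content"].lower()
        for m in example["messages"]
    )


def node_type_counts(examples):
    """Count node types across workflows; assumes nodes have a 'type' key."""
    counts = Counter()
    for ex in examples:
        for m in ex["messages"]:
            if m["role"] != "assistant":
                continue
            try:
                workflow = json.loads(m["content"])
            except json.JSONDecodeError:
                continue  # skip workflows that do not parse
            if not isinstance(workflow, dict):
                continue
            for node in workflow.get("nodes", []):
                if isinstance(node, dict):
                    counts[node.get("type", "unknown")] += 1
    return counts
```

For example, `[ex for ex in deduplicate(examples) if mentions(ex, "slack")]` keeps one copy of each Slack-related example, and `node_type_counts(examples).most_common(10)` gives a rough view of how balanced the node types are.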
### Example Use Cases
- Fine-tune GPT models to generate n8n workflows
- Train models to suggest workflow improvements
- Create workflow completion assistants
- Build n8n-specific code generation tools
## Integration with n8n-mcp
This repository complements the [n8n-mcp](https://github.com/yourusername/n8n-mcp) server by providing:
- **Static training data** for model fine-tuning
- **Example workflows** for reference
- **Pattern library** for common automations
While n8n-mcp provides real-time workflow execution and API access, these datasets enable LLMs to learn n8n's workflow generation patterns.
## Contributing
When adding new examples:
1. Follow the existing JSON structure
2. Ensure workflow JSON is valid n8n format
3. Use descriptive, natural language in user messages
4. Test workflows before adding to datasets
5. Avoid duplicates
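A structural check such as the one below can catch malformed examples before they are added. It is a sketch under the assumption that every example has exactly one user and one assistant message and that the workflow carries the four fields listed under Structure. `validate_example` is a hypothetical helper, and it does not replace testing the workflow in n8n.

```python
import json

# Expected top-level workflow fields and their Python types after decoding.
REQUIRED_FIELDS = {"name": str, "nodes": list, "connections": dict, "active": bool}


def validate_example(example):
    """Return a list of problems; an empty list means the shape looks right."""
    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or len(messages) != 2:
        return ["expected a 'messages' list with exactly two entries"]
    if not all(isinstance(m, dict) for m in messages):
        return ["every message must be an object"]

    problems = []
    user, assistant = messages
    if [user.get("role"), assistant.get("role")] != ["user", "assistant"]:
        problems.append("roles must be 'user' then 'assistant'")
    if not isinstance(user.get("content"), str) or not user["content"].strip():
        problems.append("user content is empty")

    try:
        workflow = json.loads(assistant.get("content"))
    except (TypeError, json.JSONDecodeError):
        problems.append("assistant content is not valid JSON")
        return problems
    if not isinstance(workflow, dict):
        problems.append("workflow must be a JSON object")
        return problems

    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(workflow.get(field), expected):
            problems.append(f"workflow field '{field}' is missing or not {expected.__name__}")
    return problems
```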
## Known Issues
- Duplicate entries exist in dataset_001.json (expected to have minimal impact on training)
- dataset_002.json and dataset_003.json have JSON formatting errors
- Some placeholder values (e.g., `{{SHEET_ID}}`, `{{API_KEY}}`) are included; these are intentional for template-style workflows
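To list which placeholders a workflow string contains, a pattern like the following may be enough. It assumes placeholders are written in upper snake case between double braces, as in the two examples above; the regular expression and `find_placeholders` are illustrative, not part of this repository.

```python
import re

# Assumption: placeholders look like {{SHEET_ID}}, upper-case with underscores.
PLACEHOLDER = re.compile(r"\{\{([A-Z][A-Z0-9_]*)\}\}")


def find_placeholders(text):
    """Return the sorted, de-duplicated placeholder names found in text."""
    return sorted(set(PLACEHOLDER.findall(text)))


sample = '{"sheetId": "{{SHEET_ID}}", "key": "{{API_KEY}}", "expr": "={{ $json.id }}"}'
print(find_placeholders(sample))
```

Restricting the match to upper-case names without surrounding spaces keeps expressions such as `{{ $json.id }}` out of the result.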
## Tools
See `/scripts/analyze_datasets.py` for dataset analysis and statistics tools.