n8n Workflow Training Datasets

This directory contains training datasets for fine-tuning Large Language Models (LLMs) to generate n8n workflows from natural language descriptions.

Dataset Format

Each dataset file is a JSON array containing training examples in a conversational format:

[
  {
    "messages": [
      {
        "role": "user",
        "content": "When a new email arrives in Gmail, save the attachment to Google Drive."
      },
      {
        "role": "assistant",
        "content": "{\"name\": \"Email to Drive\", \"nodes\": [...], \"connections\": {...}, \"active\": false}"
      }
    ]
  }
]

Structure

role: "user" - Natural language description of the workflow to create
role: "assistant" - JSON representation of the complete n8n workflow

The assistant's response contains a valid n8n workflow definition with:

name: Workflow name
nodes: Array of node definitions (triggers, actions, transformations)
connections: Object defining how nodes are connected
active: Boolean indicating whether the workflow is active (usually false for templates)
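
The structure above can be checked mechanically. The following is a minimal sketch, assuming every example holds exactly one user message followed by one assistant message, as in the sample shown earlier. The helper name check_example and the REQUIRED_KEYS tuple are illustrative and are not part of this repository.

```python
import json

# Top-level keys the README lists for the assistant's workflow definition.
REQUIRED_KEYS = ("name", "nodes", "connections", "active")


def check_example(example):
    """Return a list of problems found in one training example (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list):
        return ["'messages' is missing or not a list"]

    # Assumption: one user turn followed by one assistant turn.
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != ["user", "assistant"]:
        return [f"unexpected roles: {roles}"]

    # The assistant content is a JSON document stored as a string.
    try:
        workflow = json.loads(messages[1]["content"])
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        return [f"assistant content is not valid JSON: {exc}"]
    if not isinstance(workflow, dict):
        return ["assistant content is not a JSON object"]

    return [f"workflow is missing '{key}'" for key in REQUIRED_KEYS if key not in workflow]
```

This only checks the outer shape. It does not verify that the nodes or connections are meaningful to n8n itself.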

Dataset Files

dataset_001.json

Size: 2.5 MB
Examples: 3,061
Status: ✅ Valid JSON
Focus: Common workflow patterns (Gmail, Slack, Google Sheets, Trello, Airtable, Notion, etc.)

dataset_002.json

Size: 4.9 MB
Status: ⚠️ JSON parsing errors detected
Note: Does not parse as-is; needs cleaning before use

dataset_003.json

Size: 14.0 MB
Status: ⚠️ JSON parsing errors detected
Note: Does not parse as-is; needs cleaning before use
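
When a file fails to parse, the first step in cleaning it is finding where the parser gave up. The sketch below uses only Python's standard json module. The function name locate_json_error is illustrative. It reports only the first error, so a file with several problems needs repeated passes.

```python
import json


def locate_json_error(text):
    """Return None if text parses as JSON, else (line, column, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno and colno are 1-based positions of the first failure.
        return exc.lineno, exc.colno, exc.msg
    return None
```

To check a file, pass its contents, for example locate_json_error(open("dataset_002.json", encoding="utf-8").read()).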

Common Workflow Patterns

Based on dataset_001.json analysis, the most common patterns include:

Email Automation
- Gmail → Google Drive (save attachments)
- Gmail → Slack (notifications)
- Gmail → Airtable (create records)

Spreadsheet Integration
- Google Sheets → Slack (new row notifications)
- Google Sheets → Gmail (alerts)
- Airtable → Google Sheets (sync data)

Project Management
- Trello → Slack (card updates)
- Trello → Google Calendar (deadline tracking)
- GitHub → Trello (issue tracking)

Notification Workflows
- Slack reactions → Airtable (logging)
- Calendar events → Email reminders
- Notion updates → Slack posts

Usage for LLM Training

Fine-tuning Format

Each example uses the chat-style messages structure found in OpenAI's fine-tuning format and similar training pipelines, although the files themselves are JSON arrays rather than JSON Lines. Each example teaches the model to:

Parse natural language workflow requests
Identify required n8n nodes
Configure node parameters
Establish proper connections between nodes
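
Pipelines that take JSON Lines input, such as OpenAI's fine-tuning API, expect one example per line rather than a single array. A minimal sketch of that conversion follows. The function name to_jsonl is illustrative, and the input is assumed to be the already-parsed list from a dataset file.

```python
import json


def to_jsonl(examples):
    """Serialize a list of examples as JSON Lines: one JSON object per line."""
    # json.dumps escapes non-ASCII characters and embedded newlines by default,
    # so each example is guaranteed to stay on a single physical line.
    return "".join(json.dumps(example) + "\n" for example in examples)
```

The returned string can be written straight to a .jsonl file. Each line remains a complete {"messages": [...]} object.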

Recommended Preprocessing

Before using these datasets:

Validate JSON: Verify all files parse correctly
Deduplicate: Remove duplicate examples (some duplicates exist)
Filter: Optionally filter by specific integrations or complexity
Balance: Ensure diverse node types are represented
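
The deduplicate, filter, and balance steps can be sketched as below. This assumes each node in a workflow carries a "type" string that identifies its integration, which is how n8n workflow exports usually look but is not confirmed here for every example. All function names are illustrative.

```python
import json
from collections import Counter


def dedupe(examples):
    """Drop exact duplicates, keeping the first occurrence and original order."""
    seen, unique = set(), []
    for example in examples:
        key = json.dumps(example, sort_keys=True)  # canonical form for comparison
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def node_types(example):
    """Node 'type' strings in the assistant's workflow (empty list if unparsable)."""
    try:
        workflow = json.loads(example["messages"][-1]["content"])
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        return []
    nodes = workflow.get("nodes") if isinstance(workflow, dict) else None
    if not isinstance(nodes, list):
        return []
    return [n["type"] for n in nodes if isinstance(n, dict) and isinstance(n.get("type"), str)]


def filter_by_integration(examples, needle):
    """Keep examples with at least one node type containing needle (case-insensitive)."""
    needle = needle.lower()
    return [ex for ex in examples if any(needle in t.lower() for t in node_types(ex))]


def node_type_counts(examples):
    """Count how often each node type appears, as a rough check on balance."""
    return Counter(t for ex in examples for t in node_types(ex))
```

Only byte-for-byte identical examples are removed. Near-duplicates, such as the same workflow with a reworded prompt, would need a looser comparison.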

Example Use Cases

Fine-tune GPT models to generate n8n workflows
Train models to suggest workflow improvements
Create workflow completion assistants
Build n8n-specific code generation tools

Integration with n8n-mcp

This repository complements the n8n-mcp server by providing:

Static training data for model fine-tuning
Example workflows for reference
Pattern library for common automations

While n8n-mcp provides real-time workflow execution and API access, these datasets enable LLMs to learn n8n's workflow generation patterns.

Contributing

When adding new examples:

Follow the existing JSON structure
Ensure the workflow JSON is a valid n8n workflow definition
Use descriptive, natural language in user messages
Test workflows before adding to datasets
Avoid duplicates

Known Issues

Duplicate entries exist in dataset_001.json; the impact on training is expected to be minimal, but deduplicating first is still recommended
dataset_002.json and dataset_003.json have JSON formatting errors
Some placeholder values (e.g., {{SHEET_ID}}, {{API_KEY}}) are included; these are intentional for template-style workflows
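
To see which placeholders a workflow relies on, a simple pattern match is usually enough. The sketch below assumes placeholders are written in upper snake case between double braces, as in the two examples above. That convention also keeps them apart from n8n expressions such as {{ $json.name }}, which contain spaces and a dollar sign. The names PLACEHOLDER and find_placeholders are illustrative.

```python
import re

# Assumption: placeholders are UPPER_SNAKE_CASE with no inner whitespace.
PLACEHOLDER = re.compile(r"\{\{([A-Z][A-Z0-9_]*)\}\}")


def find_placeholders(text):
    """Return the sorted, de-duplicated placeholder names found in text."""
    return sorted(set(PLACEHOLDER.findall(text)))
```

Running this over the assistant content of each example gives the list of values a user must supply before a generated workflow can run.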

Tools

See /scripts/analyze_datasets.py for dataset analysis and statistics tools.