DavidrPatton's picture
Add datasets dataset
e65ef8e verified

N8N Training Dataset Collection

High-quality curated subset of n8n workflow examples optimized for LLM training.

Overview

Total Examples: 28,337
Purpose: Train LLMs for autonomous n8n workflow creation and error troubleshooting

Dataset Files

01_conversational_sft.jsonl

Examples: 9,979
Source: eclaude/n8n-workflows-sft (HuggingFace)
Format: Supervised Fine-Tuning (SFT)

Specialty: Conversational workflow generation

Structure:

{
  "instruction": "Create a workflow that sends Slack notifications when...",
  "response": "{n8n workflow JSON}"
}

Use For:

  • Training conversational AI assistants
  • Natural language → workflow conversion
  • Chat-based workflow generation

02_reasoning_with_thinking.jsonl

Examples: 5,361
Source: ruh-ai/n8n-workflow-dataset (HuggingFace)
Format: JSONL with thinking chains

Specialty: Reasoning and debugging (UNIQUE - only dataset with thinking chains!)

Structure:

{
  "prompt": "Build a workflow to process CSV files...",
  "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...",
  "json": "{n8n workflow JSON}"
}

Use For:

  • Teaching logical reasoning
  • Error troubleshooting
  • Workflow debugging
  • Explaining design decisions

03_latest_features.jsonl

Examples: 2,737
Source: mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace)
Updated: December 26, 2024

Specialty: Most recent n8n features and nodes

Use For:

  • Current n8n API patterns
  • Latest node versions
  • Modern integrations
  • Avoiding deprecated patterns

04_advanced_workflows.json

Examples: 10,260
Source: Original validated collection
Format: JSONL (despite .json extension)

Specialty: Complex multi-node workflows

Use For:

  • Advanced integration patterns
  • Sophisticated business logic
  • Production workflow examples
  • Complex data transformations

Training Recommendations

Quick Start

import json
from pathlib import Path

def load_training_data():
    """Load all training datasets."""
    training_files = [
        '01_conversational_sft.jsonl',
        '02_reasoning_with_thinking.jsonl',
        '03_latest_features.jsonl',
        '04_advanced_workflows.json',
    ]
    
    examples = []
    for filename in training_files:
        with open(filename, 'r', encoding='utf-8') as f:
            first_char = f.read(1)
            f.seek(0)
            
            if first_char == '[':
                examples.extend(json.load(f))  # JSON array
            else:
                examples.extend(json.loads(line) for line in f if line.strip())
    
    return examples

data = load_training_data()
print(f"Loaded {len(data):,} training examples")

Train/Val/Test Split

from sklearn.model_selection import train_test_split

train, temp = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

print(f"Train: {len(train):,}")  # 22,670
print(f"Val:   {len(val):,}")    # 2,834
print(f"Test:  {len(test):,}")   # 2,833

Why This Subset?

Quality over Quantity:

  • ✅ Curated best examples from each category
  • ✅ Complementary strengths (conversation, reasoning, latest, advanced)
  • ✅ 3x faster training than full 81K dataset
  • ✅ 3x lower compute cost
  • ✅ No duplicates

Unique Capabilities:

  • Conversational: Natural language understanding (01)
  • Reasoning: Step-by-step logic and debugging (02)
  • Current: Latest n8n features (03)
  • Advanced: Complex patterns (04)

Alternative: Full Dataset

For maximum coverage, see ../n8n_master.jsonl (81,649 unique workflows).

Use the full dataset if:

  • Quality subset shows coverage gaps
  • Production deployment needs
  • Comprehensive service knowledge required

File Formats

All files use JSONL (JSON Lines) format:

  • One JSON object per line
  • Easy to stream
  • Memory efficient
  • Industry standard

Note: 04_advanced_workflows.json is JSONL format despite the .json extension.

Dataset Statistics

Dataset Examples Size Specialty
01_conversational_sft 9,979 ~125 MB Conversations
02_reasoning_with_thinking 5,361 ~91 MB Debugging
03_latest_features 2,737 ~47 MB Current
04_advanced_workflows 10,260 ~13 MB Advanced
Total 28,337 ~276 MB Complete

Related Documentation

License

These datasets are aggregated from various sources. Please check individual source licenses:

  • eclaude datasets: Check HuggingFace repository
  • ruh-ai dataset: Check HuggingFace repository
  • mbakgun dataset: Check HuggingFace repository
  • Original datasets: Part of n8n-docs repository