# n8n Training Dataset Collection

A high-quality, curated subset of n8n workflow examples optimized for LLM training.

## Overview

- **Total Examples:** 28,337
- **Purpose:** Train LLMs for autonomous n8n workflow creation and error troubleshooting
## Dataset Files

### 01_conversational_sft.jsonl

- **Examples:** 9,979
- **Source:** eclaude/n8n-workflows-sft (HuggingFace)
- **Format:** Supervised Fine-Tuning (SFT)
- **Specialty:** Conversational workflow generation

**Structure:**

```json
{
  "instruction": "Create a workflow that sends Slack notifications when...",
  "response": "{n8n workflow JSON}"
}
```

**Use For:**
- Training conversational AI assistants
- Natural language → workflow conversion
- Chat-based workflow generation
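Records like the ones above are typically flattened into single training strings before tokenization. A minimal sketch; the `### Instruction:`/`### Response:` markers are a hypothetical template (match whatever format your base model expects), not part of the dataset:

```python
import json

# Hypothetical prompt template -- adjust to your model's chat format.
SFT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_sft_example(example: dict) -> str:
    """Render one instruction/response record as a single training string."""
    return SFT_TEMPLATE.format(
        instruction=example["instruction"],
        response=example["response"],
    )

sample = {
    "instruction": "Create a workflow that sends Slack notifications when...",
    "response": json.dumps({"nodes": [], "connections": {}}),
}
print(format_sft_example(sample))
```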
### 02_reasoning_with_thinking.jsonl

- **Examples:** 5,361
- **Source:** ruh-ai/n8n-workflow-dataset (HuggingFace)
- **Format:** JSONL with thinking chains
- **Specialty:** Reasoning and debugging (unique: the only dataset here with explicit thinking chains)

**Structure:**

```json
{
  "prompt": "Build a workflow to process CSV files...",
  "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...",
  "json": "{n8n workflow JSON}"
}
```

**Use For:**
- Teaching logical reasoning
- Error troubleshooting
- Workflow debugging
- Explaining design decisions
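At training time, the `thinking` field can be interleaved between the prompt and the final workflow, and simply omitted when the chain-of-thought is not wanted. A minimal sketch; the `###` section markers are a hypothetical template, not something the dataset prescribes:

```python
def format_reasoning_example(example: dict, include_thinking: bool = True) -> str:
    """Render a prompt/thinking/json record as one training string."""
    parts = [f"### Prompt:\n{example['prompt']}"]
    if include_thinking:
        # Expose the thinking chain only when training for reasoning.
        parts.append(f"### Thinking:\n{example['thinking']}")
    parts.append(f"### Workflow:\n{example['json']}")
    return "\n\n".join(parts)

sample = {
    "prompt": "Build a workflow to process CSV files...",
    "thinking": "Step 1: We need a trigger... Step 2: Parse CSV...",
    "json": '{"nodes": [], "connections": {}}',
}
print(format_reasoning_example(sample))
```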
### 03_latest_features.jsonl

- **Examples:** 2,737
- **Source:** mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace)
- **Updated:** December 26, 2024
- **Specialty:** Most recent n8n features and nodes

**Use For:**
- Current n8n API patterns
- Latest node versions
- Modern integrations
- Avoiding deprecated patterns
### 04_advanced_workflows.json

- **Examples:** 10,260
- **Source:** Original validated collection
- **Format:** JSONL (despite the `.json` extension)
- **Specialty:** Complex multi-node workflows

**Use For:**
- Advanced integration patterns
- Sophisticated business logic
- Production workflow examples
- Complex data transformations
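A quick way to gauge complexity in this file is to count nodes per workflow. The sketch below assumes each record is an n8n workflow export with a top-level `nodes` array; adapt the accessor if your records wrap the workflow in another field:

```python
def node_count(workflow: dict) -> int:
    """Number of nodes in an n8n workflow export (top-level 'nodes' array)."""
    return len(workflow.get("nodes", []))

def filter_complex(workflows, min_nodes=5):
    """Keep only workflows with at least min_nodes nodes."""
    return [wf for wf in workflows if node_count(wf) >= min_nodes]

sample = [
    {"nodes": [{"type": "n8n-nodes-base.webhook"}], "connections": {}},
    {"nodes": [{} for _ in range(7)], "connections": {}},
]
print(len(filter_complex(sample)))  # 1 (only the 7-node workflow passes)
```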
## Training Recommendations

### Quick Start

```python
import json

def load_training_data():
    """Load all training datasets, handling both JSON arrays and JSONL."""
    training_files = [
        '01_conversational_sft.jsonl',
        '02_reasoning_with_thinking.jsonl',
        '03_latest_features.jsonl',
        '04_advanced_workflows.json',
    ]
    examples = []
    for filename in training_files:
        with open(filename, 'r', encoding='utf-8') as f:
            first_char = f.read(1)
            f.seek(0)
            if first_char == '[':
                examples.extend(json.load(f))  # JSON array
            else:
                examples.extend(json.loads(line) for line in f if line.strip())
    return examples

data = load_training_data()
print(f"Loaded {len(data):,} training examples")
```
### Train/Val/Test Split

```python
from sklearn.model_selection import train_test_split

train, temp = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

print(f"Train: {len(train):,}")  # 22,669
print(f"Val: {len(val):,}")      # 2,834
print(f"Test: {len(test):,}")    # 2,834
```
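Once split, each subset can be persisted back to JSONL, one file per split. A minimal sketch: the dummy `splits` dict stands in for the `train`/`val`/`test` lists produced above, and the temp directory is only for illustration:

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, examples):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# In practice, pass the real train/val/test lists from the split above.
splits = {"train": [{"id": 1}, {"id": 2}], "val": [{"id": 3}], "test": [{"id": 4}]}
out_dir = Path(tempfile.mkdtemp())
for name, examples in splits.items():
    write_jsonl(out_dir / f"{name}.jsonl", examples)
print(sorted(p.name for p in out_dir.glob("*.jsonl")))  # ['test.jsonl', 'train.jsonl', 'val.jsonl']
```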
## Why This Subset?

**Quality over Quantity:**
- ✅ Curated best examples from each category
- ✅ Complementary strengths (conversation, reasoning, latest, advanced)
- ✅ ~3x faster training than the full 81K dataset
- ✅ ~3x lower compute cost
- ✅ No duplicates
**Unique Capabilities:**
- Conversational: Natural language understanding (01)
- Reasoning: Step-by-step logic and debugging (02)
- Current: Latest n8n features (03)
- Advanced: Complex patterns (04)
## Alternative: Full Dataset

For maximum coverage, see `../n8n_master.jsonl` (81,649 unique workflows).
Use the full dataset if:
- The quality subset shows coverage gaps
- You are preparing a production deployment
- Comprehensive service knowledge is required
## File Formats

All files use the JSONL (JSON Lines) format:
- One JSON object per line
- Easy to stream
- Memory efficient
- Industry standard
**Note:** `04_advanced_workflows.json` is in JSONL format despite the `.json` extension.
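Because every record is a single line, files can be streamed lazily instead of loaded whole; a minimal generator sketch:

```python
import json
import tempfile
from pathlib import Path

def iter_jsonl(path):
    """Stream a JSONL file one parsed object at a time, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny demo: write two records (with a stray blank line), then stream them back.
demo = Path(tempfile.mkdtemp()) / "demo.jsonl"
demo.write_text('{"a": 1}\n\n{"a": 2}\n', encoding="utf-8")
print([obj["a"] for obj in iter_jsonl(demo)])  # [1, 2]
```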
## Dataset Statistics
| Dataset | Examples | Size | Specialty |
|---|---|---|---|
| 01_conversational_sft | 9,979 | ~125 MB | Conversations |
| 02_reasoning_with_thinking | 5,361 | ~91 MB | Debugging |
| 03_latest_features | 2,737 | ~47 MB | Current |
| 04_advanced_workflows | 10,260 | ~13 MB | Advanced |
| Total | 28,337 | ~276 MB | Complete |
## Related Documentation
- Main README - Repository overview
- Datasets README - Full dataset collection info
- Dataset Analysis - Detailed analysis
## License
These datasets are aggregated from various sources. Please check individual source licenses:
- eclaude datasets: Check HuggingFace repository
- ruh-ai dataset: Check HuggingFace repository
- mbakgun dataset: Check HuggingFace repository
- Original datasets: Part of n8n-docs repository