| # N8N Training Dataset Collection | |
| High-quality curated subset of n8n workflow examples optimized for LLM training. | |
| ## Overview | |
| **Total Examples:** 28,337 | |
| **Purpose:** Train LLMs for autonomous n8n workflow creation and error troubleshooting | |
| ## Dataset Files | |
| ### 01_conversational_sft.jsonl | |
| **Examples:** 9,979 | |
| **Source:** eclaude/n8n-workflows-sft (HuggingFace) | |
| **Format:** Supervised Fine-Tuning (SFT) | |
| **Specialty:** Conversational workflow generation | |
| **Structure:** | |
| ```json | |
| { | |
| "instruction": "Create a workflow that sends Slack notifications when...", | |
| "response": "{n8n workflow JSON}" | |
| } | |
| ``` | |
| **Use For:** | |
| - Training conversational AI assistants | |
| - Natural language β workflow conversion | |
| - Chat-based workflow generation | |
| --- | |
| ### 02_reasoning_with_thinking.jsonl | |
| **Examples:** 5,361 | |
| **Source:** ruh-ai/n8n-workflow-dataset (HuggingFace) | |
| **Format:** JSONL with thinking chains | |
| **Specialty:** Reasoning and debugging (UNIQUE - only dataset with thinking chains!) | |
| **Structure:** | |
| ```json | |
| { | |
| "prompt": "Build a workflow to process CSV files...", | |
| "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...", | |
| "json": "{n8n workflow JSON}" | |
| } | |
| ``` | |
| **Use For:** | |
| - Teaching logical reasoning | |
| - Error troubleshooting | |
| - Workflow debugging | |
| - Explaining design decisions | |
| --- | |
| ### 03_latest_features.jsonl | |
| **Examples:** 2,737 | |
| **Source:** mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace) | |
| **Updated:** December 26, 2024 | |
| **Specialty:** Most recent n8n features and nodes | |
| **Use For:** | |
| - Current n8n API patterns | |
| - Latest node versions | |
| - Modern integrations | |
| - Avoiding deprecated patterns | |
| --- | |
| ### 04_advanced_workflows.json | |
| **Examples:** 10,260 | |
| **Source:** Original validated collection | |
| **Format:** JSONL (despite .json extension) | |
| **Specialty:** Complex multi-node workflows | |
| **Use For:** | |
| - Advanced integration patterns | |
| - Sophisticated business logic | |
| - Production workflow examples | |
| - Complex data transformations | |
| --- | |
| ## Training Recommendations | |
| ### Quick Start | |
| ```python | |
| import json | |
| from pathlib import Path | |
| def load_training_data(): | |
| """Load all training datasets.""" | |
| training_files = [ | |
| '01_conversational_sft.jsonl', | |
| '02_reasoning_with_thinking.jsonl', | |
| '03_latest_features.jsonl', | |
| '04_advanced_workflows.json', | |
| ] | |
| examples = [] | |
| for filename in training_files: | |
| with open(filename, 'r', encoding='utf-8') as f: | |
| first_char = f.read(1) | |
| f.seek(0) | |
| if first_char == '[': | |
| examples.extend(json.load(f)) # JSON array | |
| else: | |
| examples.extend(json.loads(line) for line in f if line.strip()) | |
| return examples | |
| data = load_training_data() | |
| print(f"Loaded {len(data):,} training examples") | |
| ``` | |
| ### Train/Val/Test Split | |
| ```python | |
| from sklearn.model_selection import train_test_split | |
| train, temp = train_test_split(data, test_size=0.2, random_state=42) | |
| val, test = train_test_split(temp, test_size=0.5, random_state=42) | |
| print(f"Train: {len(train):,}") # 22,670 | |
| print(f"Val: {len(val):,}") # 2,834 | |
| print(f"Test: {len(test):,}") # 2,833 | |
| ``` | |
| ## Why This Subset? | |
| **Quality over Quantity:** | |
| - β Curated best examples from each category | |
| - β Complementary strengths (conversation, reasoning, latest, advanced) | |
| - β 3x faster training than full 81K dataset | |
| - β 3x lower compute cost | |
| - β No duplicates | |
| **Unique Capabilities:** | |
| - **Conversational:** Natural language understanding (01) | |
| - **Reasoning:** Step-by-step logic and debugging (02) | |
| - **Current:** Latest n8n features (03) | |
| - **Advanced:** Complex patterns (04) | |
| ## Alternative: Full Dataset | |
| For maximum coverage, see `../n8n_master.jsonl` (81,649 unique workflows). | |
| Use the full dataset if: | |
| - Quality subset shows coverage gaps | |
| - Production deployment needs | |
| - Comprehensive service knowledge required | |
| ## File Formats | |
| All files use JSONL (JSON Lines) format: | |
| - One JSON object per line | |
| - Easy to stream | |
| - Memory efficient | |
| - Industry standard | |
| **Note:** `04_advanced_workflows.json` is JSONL format despite the `.json` extension. | |
| ## Dataset Statistics | |
| | Dataset | Examples | Size | Specialty | | |
| |---------|----------|------|-----------| | |
| | 01_conversational_sft | 9,979 | ~125 MB | Conversations | | |
| | 02_reasoning_with_thinking | 5,361 | ~91 MB | Debugging | | |
| | 03_latest_features | 2,737 | ~47 MB | Current | | |
| | 04_advanced_workflows | 10,260 | ~13 MB | Advanced | | |
| | **Total** | **28,337** | **~276 MB** | **Complete** | | |
| ## Related Documentation | |
| - [Main README](../../README.md) - Repository overview | |
| - [Datasets README](../README.md) - Full dataset collection info | |
| - [Dataset Analysis](../../../.gemini/antigravity/brain/afbe61e0-35d6-4500-8f3e-e9431fc1db24/complete_dataset_analysis.md) - Detailed analysis | |
| ## License | |
| These datasets are aggregated from various sources. Please check individual source licenses: | |
| - eclaude datasets: Check HuggingFace repository | |
| - ruh-ai dataset: Check HuggingFace repository | |
| - mbakgun dataset: Check HuggingFace repository | |
| - Original datasets: Part of n8n-docs repository | |