# N8N Training Dataset Collection

High-quality curated subset of n8n workflow examples optimized for LLM training.

## Overview

**Total Examples:** 28,337
**Purpose:** Train LLMs for autonomous n8n workflow creation and error troubleshooting

## Dataset Files

### 01_conversational_sft.jsonl

**Examples:** 9,979
**Source:** eclaude/n8n-workflows-sft (HuggingFace)
**Format:** Supervised Fine-Tuning (SFT)
**Specialty:** Conversational workflow generation

**Structure:**

```json
{
  "instruction": "Create a workflow that sends Slack notifications when...",
  "response": "{n8n workflow JSON}"
}
```

**Use For:**

- Training conversational AI assistants
- Natural language → workflow conversion
- Chat-based workflow generation

---

### 02_reasoning_with_thinking.jsonl

**Examples:** 5,361
**Source:** ruh-ai/n8n-workflow-dataset (HuggingFace)
**Format:** JSONL with thinking chains
**Specialty:** Reasoning and debugging (unique: the only dataset here with thinking chains)

**Structure:**

```json
{
  "prompt": "Build a workflow to process CSV files...",
  "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...",
  "json": "{n8n workflow JSON}"
}
```

**Use For:**

- Teaching logical reasoning
- Error troubleshooting
- Workflow debugging
- Explaining design decisions

---

### 03_latest_features.jsonl

**Examples:** 2,737
**Source:** mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace)
**Updated:** December 26, 2024
**Specialty:** Most recent n8n features and nodes

**Use For:**

- Current n8n API patterns
- Latest node versions
- Modern integrations
- Avoiding deprecated patterns

---

### 04_advanced_workflows.json

**Examples:** 10,260
**Source:** Original validated collection
**Format:** JSONL (despite .json extension)
**Specialty:** Complex multi-node workflows

**Use For:**

- Advanced integration patterns
- Sophisticated business logic
- Production workflow examples
- Complex data transformations

---

## Training Recommendations

### Quick Start

```python
import json
from pathlib import Path


def load_training_data(data_dir="."):
    """Load all training datasets from data_dir into one list of dicts."""
    training_files = [
        '01_conversational_sft.jsonl',
        '02_reasoning_with_thinking.jsonl',
        '03_latest_features.jsonl',
        '04_advanced_workflows.json',
    ]

    examples = []
    for filename in training_files:
        with open(Path(data_dir) / filename, 'r', encoding='utf-8') as f:
            # Peek at the first character to tell a JSON array from JSONL
            first_char = f.read(1)
            f.seek(0)
            if first_char == '[':
                examples.extend(json.load(f))  # JSON array
            else:
                # JSONL: one JSON object per non-empty line
                examples.extend(json.loads(line) for line in f if line.strip())
    return examples


data = load_training_data()
print(f"Loaded {len(data):,} training examples")
```

### Train/Val/Test Split

```python
from sklearn.model_selection import train_test_split

# 80/10/10 split; `data` is the list returned by load_training_data()
train, temp = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

# Expected counts for 28,337 examples (test sizes are rounded up)
print(f"Train: {len(train):,}")  # 22,669
print(f"Val: {len(val):,}")      # 2,834
print(f"Test: {len(test):,}")    # 2,834
```

## Why This Subset?

**Quality over Quantity:**

- ✅ Curated best examples from each category
- ✅ Complementary strengths (conversation, reasoning, latest, advanced)
- ✅ ~3x faster training than the full 81K dataset
- ✅ ~3x lower compute cost
- ✅ No duplicates

**Unique Capabilities:**

- **Conversational:** Natural language understanding (01)
- **Reasoning:** Step-by-step logic and debugging (02)
- **Current:** Latest n8n features (03)
- **Advanced:** Complex patterns (04)

## Alternative: Full Dataset

For maximum coverage, see `../n8n_master.jsonl` (81,649 unique workflows).

Use the full dataset if:

- The quality subset shows coverage gaps
- Your production deployment needs it
- Comprehensive service knowledge is required

## File Formats

All files use JSONL (JSON Lines) format:

- One JSON object per line
- Easy to stream
- Memory efficient
- Industry standard

**Note:** `04_advanced_workflows.json` is JSONL format despite the `.json` extension.

## Dataset Statistics

| Dataset | Examples | Size | Specialty |
|---------|----------|------|-----------|
| 01_conversational_sft | 9,979 | ~125 MB | Conversations |
| 02_reasoning_with_thinking | 5,361 | ~91 MB | Debugging |
| 03_latest_features | 2,737 | ~47 MB | Current |
| 04_advanced_workflows | 10,260 | ~13 MB | Advanced |
| **Total** | **28,337** | **~276 MB** | **Complete** |

## Related Documentation

- [Main README](../../README.md) - Repository overview
- [Datasets README](../README.md) - Full dataset collection info
- [Dataset Analysis](../../../.gemini/antigravity/brain/afbe61e0-35d6-4500-8f3e-e9431fc1db24/complete_dataset_analysis.md) - Detailed analysis

## License

These datasets are aggregated from various sources. Please check individual source licenses:

- eclaude datasets: Check HuggingFace repository
- ruh-ai dataset: Check HuggingFace repository
- mbakgun dataset: Check HuggingFace repository
- Original datasets: Part of n8n-docs repository
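
## Normalizing Record Schemas (Sketch)

The Structure examples under Dataset Files show two different record shapes: `instruction`/`response` for file 01 and `prompt`/`thinking`/`json` for file 02. Merging the files therefore yields records with mixed keys. The following is a minimal sketch of one way to map both documented shapes onto a common form before training. The helper name `normalize_example`, the target keys `prompt` and `completion`, and the choice to prepend the thinking text to the workflow are all assumptions made for illustration, not part of the datasets. The schemas of files 03 and 04 are not documented here, so unrecognized records are returned as `None` rather than guessed at.

```python
def normalize_example(example):
    """Map a raw record to {"prompt", "completion"}, or None if unrecognized.

    Only the two schemas documented in this README are handled. The target
    field names and the thinking/workflow separator are assumptions.
    """
    if "instruction" in example and "response" in example:
        # Record shaped like 01_conversational_sft.jsonl
        return {
            "prompt": example["instruction"],
            "completion": example["response"],
        }
    if "prompt" in example and "json" in example:
        # Record shaped like 02_reasoning_with_thinking.jsonl
        thinking = example.get("thinking", "")
        completion = example["json"]
        if thinking:
            # Assumed layout: reasoning first, then the workflow JSON
            completion = f"{thinking}\n\n{completion}"
        return {"prompt": example["prompt"], "completion": completion}
    # Schema not documented in this README; inspect before training
    return None


# Tiny hand-written samples (not taken from the real files)
samples = [
    {"instruction": "Send a Slack message", "response": "{}"},
    {"prompt": "Process CSV", "thinking": "Step 1: trigger", "json": "{}"},
    {"unknown": 1},
]
normalized = [normalize_example(s) for s in samples]
```

Counting how many records come back as `None` is a quick way to find schemas this sketch does not cover.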