# N8N Training Dataset Collection
A high-quality, curated subset of n8n workflow examples optimized for LLM training.
## Overview
**Total Examples:** 28,337
**Purpose:** Train LLMs for autonomous n8n workflow creation and error troubleshooting
## Dataset Files
### 01_conversational_sft.jsonl
**Examples:** 9,979
**Source:** eclaude/n8n-workflows-sft (HuggingFace)
**Format:** Supervised Fine-Tuning (SFT)
**Specialty:** Conversational workflow generation
**Structure:**
```json
{
  "instruction": "Create a workflow that sends Slack notifications when...",
  "response": "{n8n workflow JSON}"
}
```
**Use For:**
- Training conversational AI assistants
- Natural language → workflow conversion
- Chat-based workflow generation
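Given the structure above, one common way to prepare these pairs for fine-tuning is to map each one onto a chat-message list. This is only a sketch: the role names and system prompt are assumptions, not part of the dataset, and the exact template depends on your training framework.

```python
import json

# Hypothetical record mirroring the structure shown above.
record = {
    "instruction": "Create a workflow that sends Slack notifications when...",
    "response": json.dumps({"nodes": [], "connections": {}}),
}

def to_chat_messages(example):
    """Map an instruction/response pair onto a generic chat-message list.

    The role names and the system prompt are assumptions of this sketch;
    use whatever template your fine-tuning framework expects.
    """
    return [
        {"role": "system", "content": "You are an n8n workflow assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]

messages = to_chat_messages(record)
```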
---
### 02_reasoning_with_thinking.jsonl
**Examples:** 5,361
**Source:** ruh-ai/n8n-workflow-dataset (HuggingFace)
**Format:** JSONL with thinking chains
**Specialty:** Reasoning and debugging (the only dataset in this collection with thinking chains)
**Structure:**
```json
{
  "prompt": "Build a workflow to process CSV files...",
  "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...",
  "json": "{n8n workflow JSON}"
}
```
**Use For:**
- Teaching logical reasoning
- Error troubleshooting
- Workflow debugging
- Explaining design decisions
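One hedged sketch of how the three fields above might be joined into a single training target. The `<think>...</think>` delimiters are an assumption of this sketch, not something the dataset prescribes:

```python
def format_reasoning_example(example):
    """Join the thinking chain and final workflow into one training target.

    The <think>...</think> delimiters are an assumption of this sketch,
    not something the dataset prescribes.
    """
    target = f"<think>{example['thinking']}</think>\n{example['json']}"
    return {"prompt": example["prompt"], "target": target}

# Hypothetical record mirroring the structure shown above.
sample = {
    "prompt": "Build a workflow to process CSV files...",
    "thinking": "Step 1: We need a trigger... Step 2: Parse CSV...",
    "json": '{"nodes": [], "connections": {}}',
}
formatted = format_reasoning_example(sample)
```

Keeping the thinking text inside explicit markers makes it easy to strip the chain back out at inference time if you only want the final workflow JSON.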
---
### 03_latest_features.jsonl
**Examples:** 2,737
**Source:** mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace)
**Updated:** December 26, 2024
**Specialty:** Most recent n8n features and nodes
**Use For:**
- Current n8n API patterns
- Latest node versions
- Modern integrations
- Avoiding deprecated patterns
---
### 04_advanced_workflows.json
**Examples:** 10,260
**Source:** Original validated collection
**Format:** JSONL (despite .json extension)
**Specialty:** Complex multi-node workflows
**Use For:**
- Advanced integration patterns
- Sophisticated business logic
- Production workflow examples
- Complex data transformations
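A small sketch for gauging the complexity of one of these workflows, assuming the standard n8n export shape: a top-level `"nodes"` list whose entries carry a `"type"` field. The sample workflow below is illustrative, not taken from the dataset:

```python
import json

def workflow_complexity(workflow_json: str):
    """Return (node count, set of node types) for one workflow.

    Assumes the standard n8n export shape: a top-level "nodes" list
    whose entries carry a "type" field.
    """
    workflow = json.loads(workflow_json)
    nodes = workflow.get("nodes", [])
    return len(nodes), {n.get("type") for n in nodes}

# Hypothetical two-node workflow for illustration.
sample = json.dumps({
    "nodes": [
        {"type": "n8n-nodes-base.webhook"},
        {"type": "n8n-nodes-base.slack"},
    ],
    "connections": {},
})
count, types = workflow_complexity(sample)
```

A pass like this over the file is a quick way to confirm that the examples really are multi-node, and to see which integrations dominate.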
---
## Training Recommendations
### Quick Start
```python
import json
from pathlib import Path
def load_training_data():
    """Load all training datasets."""
    training_files = [
        '01_conversational_sft.jsonl',
        '02_reasoning_with_thinking.jsonl',
        '03_latest_features.jsonl',
        '04_advanced_workflows.json',
    ]
    examples = []
    for filename in training_files:
        with open(filename, 'r', encoding='utf-8') as f:
            first_char = f.read(1)
            f.seek(0)
            if first_char == '[':
                examples.extend(json.load(f))  # JSON array
            else:
                examples.extend(json.loads(line) for line in f if line.strip())
    return examples
data = load_training_data()
print(f"Loaded {len(data):,} training examples")
```
### Train/Val/Test Split
```python
from sklearn.model_selection import train_test_split
train, temp = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
print(f"Train: {len(train):,}")  # ~80% (~22,670)
print(f"Val: {len(val):,}")      # ~10% (~2,834)
print(f"Test: {len(test):,}")    # ~10% (~2,834)
```
## Why This Subset?
**Quality over Quantity:**
- ✅ Curated best examples from each category
- ✅ Complementary strengths (conversation, reasoning, latest, advanced)
- ✅ 3x faster training than the full 81K dataset
- ✅ 3x lower compute cost
- ✅ No duplicates
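The no-duplicates claim can be spot-checked with a simple exact-match pass that hashes each example's canonical JSON (near-duplicates would need fuzzier matching; this sketch only catches byte-identical records):

```python
import hashlib
import json

def duplicate_count(examples):
    """Count exact duplicates by hashing each example's canonical JSON."""
    seen = set()
    dupes = 0
    for ex in examples:
        digest = hashlib.sha256(
            json.dumps(ex, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            dupes += 1
        seen.add(digest)
    return dupes

# Toy input: one exact duplicate among three records.
dupes = duplicate_count([{"a": 1}, {"a": 1}, {"b": 2}])
```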
**Unique Capabilities:**
- **Conversational:** Natural language understanding (01)
- **Reasoning:** Step-by-step logic and debugging (02)
- **Current:** Latest n8n features (03)
- **Advanced:** Complex patterns (04)
## Alternative: Full Dataset
For maximum coverage, see `../n8n_master.jsonl` (81,649 unique workflows).
Use the full dataset if:
- The quality subset shows coverage gaps
- You are preparing a production deployment
- Comprehensive service coverage is required
## File Formats
All files use JSONL (JSON Lines) format:
- One JSON object per line
- Easy to stream
- Memory efficient
- Industry standard
**Note:** `04_advanced_workflows.json` is JSONL format despite the `.json` extension.
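In line with the points above, a minimal streaming reader can be sketched as a generator; the temporary file and its two demo rows are illustrative only:

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed object per line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Tiny demo against a temporary two-line JSONL file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write('{"id": 1}\n{"id": 2}\n')
rows = list(iter_jsonl(tmp.name))
os.unlink(tmp.name)
```

Because the generator yields one object at a time, memory usage stays flat no matter how large the dataset file is.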
## Dataset Statistics
| Dataset | Examples | Size | Specialty |
|---------|----------|------|-----------|
| 01_conversational_sft | 9,979 | ~125 MB | Conversations |
| 02_reasoning_with_thinking | 5,361 | ~91 MB | Debugging |
| 03_latest_features | 2,737 | ~47 MB | Current |
| 04_advanced_workflows | 10,260 | ~13 MB | Advanced |
| **Total** | **28,337** | **~276 MB** | **Complete** |
## Related Documentation
- [Main README](../../README.md) - Repository overview
- [Datasets README](../README.md) - Full dataset collection info
- [Dataset Analysis](../../../.gemini/antigravity/brain/afbe61e0-35d6-4500-8f3e-e9431fc1db24/complete_dataset_analysis.md) - Detailed analysis
## License
These datasets are aggregated from various sources. Please check individual source licenses:
- eclaude datasets: Check HuggingFace repository
- ruh-ai dataset: Check HuggingFace repository
- mbakgun dataset: Check HuggingFace repository
- Original datasets: Part of n8n-docs repository