# N8N Training Dataset Collection
A high-quality, curated subset of n8n workflow examples optimized for LLM training.
## Overview
**Total Examples:** 28,337
**Purpose:** Train LLMs for autonomous n8n workflow creation and error troubleshooting
## Dataset Files
### 01_conversational_sft.jsonl
**Examples:** 9,979
**Source:** eclaude/n8n-workflows-sft (HuggingFace)
**Format:** Supervised Fine-Tuning (SFT)
**Specialty:** Conversational workflow generation
**Structure:**
```json
{
  "instruction": "Create a workflow that sends Slack notifications when...",
  "response": "{n8n workflow JSON}"
}
```
**Use For:**
- Training conversational AI assistants
- Natural language → workflow conversion
- Chat-based workflow generation
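Given the structure above, one common way to prepare these pairs for fine-tuning is to map each one onto a chat-message list. This is only a sketch: the role names and system prompt are assumptions, not part of the dataset, and the exact template depends on your training framework.

```python
import json

# Hypothetical record mirroring the structure shown above.
record = {
    "instruction": "Create a workflow that sends Slack notifications when...",
    "response": json.dumps({"nodes": [], "connections": {}}),
}

def to_chat_messages(example):
    """Map an instruction/response pair onto a generic chat-message list.

    The role names and the system prompt are assumptions of this sketch;
    use whatever template your fine-tuning framework expects.
    """
    return [
        {"role": "system", "content": "You are an n8n workflow assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]

messages = to_chat_messages(record)
```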
---
### 02_reasoning_with_thinking.jsonl
**Examples:** 5,361
**Source:** ruh-ai/n8n-workflow-dataset (HuggingFace)
**Format:** JSONL with thinking chains
**Specialty:** Reasoning and debugging (the only dataset in this collection with thinking chains)
**Structure:**
```json
{
  "prompt": "Build a workflow to process CSV files...",
  "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...",
  "json": "{n8n workflow JSON}"
}
```
**Use For:**
- Teaching logical reasoning
- Error troubleshooting
- Workflow debugging
- Explaining design decisions
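One hedged sketch of how the three fields above might be joined into a single training target. The `<think>...</think>` delimiters are an assumption of this sketch, not something the dataset prescribes:

```python
def format_reasoning_example(example):
    """Join the thinking chain and final workflow into one training target.

    The <think>...</think> delimiters are an assumption of this sketch,
    not something the dataset prescribes.
    """
    target = f"<think>{example['thinking']}</think>\n{example['json']}"
    return {"prompt": example["prompt"], "target": target}

# Hypothetical record mirroring the structure shown above.
sample = {
    "prompt": "Build a workflow to process CSV files...",
    "thinking": "Step 1: We need a trigger... Step 2: Parse CSV...",
    "json": '{"nodes": [], "connections": {}}',
}
formatted = format_reasoning_example(sample)
```

Keeping the thinking text inside explicit markers makes it easy to strip the chain back out at inference time if you only want the final workflow JSON.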
---
### 03_latest_features.jsonl
**Examples:** 2,737
**Source:** mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace)
**Updated:** December 26, 2024
**Specialty:** Most recent n8n features and nodes
**Use For:**
- Current n8n API patterns
- Latest node versions
- Modern integrations
- Avoiding deprecated patterns
---
### 04_advanced_workflows.json
**Examples:** 10,260
**Source:** Original validated collection
**Format:** JSONL (despite .json extension)
**Specialty:** Complex multi-node workflows
**Use For:**
- Advanced integration patterns
- Sophisticated business logic
- Production workflow examples
- Complex data transformations
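A small sketch for gauging the complexity of one of these workflows, assuming the standard n8n export shape: a top-level `"nodes"` list whose entries carry a `"type"` field. The sample workflow below is illustrative, not taken from the dataset:

```python
import json

def workflow_complexity(workflow_json: str):
    """Return (node count, set of node types) for one workflow.

    Assumes the standard n8n export shape: a top-level "nodes" list
    whose entries carry a "type" field.
    """
    workflow = json.loads(workflow_json)
    nodes = workflow.get("nodes", [])
    return len(nodes), {n.get("type") for n in nodes}

# Hypothetical two-node workflow for illustration.
sample = json.dumps({
    "nodes": [
        {"type": "n8n-nodes-base.webhook"},
        {"type": "n8n-nodes-base.slack"},
    ],
    "connections": {},
})
count, types = workflow_complexity(sample)
```

A pass like this over the file is a quick way to confirm that the examples really are multi-node, and to see which integrations dominate.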
---
## Training Recommendations
### Quick Start
```python
import json
from pathlib import Path
def load_training_data():
    """Load all training datasets."""
    training_files = [
        '01_conversational_sft.jsonl',
        '02_reasoning_with_thinking.jsonl',
        '03_latest_features.jsonl',
        '04_advanced_workflows.json',
    ]
    examples = []
    for filename in training_files:
        with open(filename, 'r', encoding='utf-8') as f:
            first_char = f.read(1)
            f.seek(0)
            if first_char == '[':
                examples.extend(json.load(f))  # JSON array
            else:
                examples.extend(json.loads(line) for line in f if line.strip())
    return examples
data = load_training_data()
print(f"Loaded {len(data):,} training examples")
```
### Train/Val/Test Split
```python
from sklearn.model_selection import train_test_split
train, temp = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
print(f"Train: {len(train):,}")  # ~80% (~22,670)
print(f"Val: {len(val):,}")      # ~10% (~2,834)
print(f"Test: {len(test):,}")    # ~10% (~2,834)
```
## Why This Subset?
**Quality over Quantity:**
- ✅ Curated best examples from each category
- ✅ Complementary strengths (conversation, reasoning, latest, advanced)
- ✅ 3x faster training than the full 81K dataset
- ✅ 3x lower compute cost
- ✅ No duplicates
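The no-duplicates claim can be spot-checked with a simple exact-match pass that hashes each example's canonical JSON (near-duplicates would need fuzzier matching; this sketch only catches byte-identical records):

```python
import hashlib
import json

def duplicate_count(examples):
    """Count exact duplicates by hashing each example's canonical JSON."""
    seen = set()
    dupes = 0
    for ex in examples:
        digest = hashlib.sha256(
            json.dumps(ex, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            dupes += 1
        seen.add(digest)
    return dupes

# Toy input: one exact duplicate among three records.
dupes = duplicate_count([{"a": 1}, {"a": 1}, {"b": 2}])
```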
**Unique Capabilities:**
- **Conversational:** Natural language understanding (01)
- **Reasoning:** Step-by-step logic and debugging (02)
- **Current:** Latest n8n features (03)
- **Advanced:** Complex patterns (04)
## Alternative: Full Dataset
For maximum coverage, see `../n8n_master.jsonl` (81,649 unique workflows).
Use the full dataset if:
- The quality subset shows coverage gaps
- You are preparing a production deployment
- Comprehensive service coverage is required
## File Formats
All files use JSONL (JSON Lines) format:
- One JSON object per line
- Easy to stream
- Memory efficient
- Industry standard
**Note:** `04_advanced_workflows.json` is JSONL format despite the `.json` extension.
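In line with the points above, a minimal streaming reader can be sketched as a generator; the temporary file and its two demo rows are illustrative only:

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed object per line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Tiny demo against a temporary two-line JSONL file.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write('{"id": 1}\n{"id": 2}\n')
rows = list(iter_jsonl(tmp.name))
os.unlink(tmp.name)
```

Because the generator yields one object at a time, memory usage stays flat no matter how large the dataset file is.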
## Dataset Statistics
| Dataset | Examples | Size | Specialty |
|---------|----------|------|-----------|
| 01_conversational_sft | 9,979 | ~125 MB | Conversations |
| 02_reasoning_with_thinking | 5,361 | ~91 MB | Debugging |
| 03_latest_features | 2,737 | ~47 MB | Current |
| 04_advanced_workflows | 10,260 | ~13 MB | Advanced |
| **Total** | **28,337** | **~276 MB** | **Complete** |
## Related Documentation
- [Main README](../../README.md) - Repository overview
- [Datasets README](../README.md) - Full dataset collection info
- [Dataset Analysis](../../../.gemini/antigravity/brain/afbe61e0-35d6-4500-8f3e-e9431fc1db24/complete_dataset_analysis.md) - Detailed analysis
## License
These datasets are aggregated from various sources. Please check individual source licenses:
- eclaude datasets: Check HuggingFace repository
- ruh-ai dataset: Check HuggingFace repository
- mbakgun dataset: Check HuggingFace repository
- Original datasets: Part of n8n-docs repository