# N8N Training Dataset Collection

High-quality curated subset of n8n workflow examples optimized for LLM training.

## Overview

**Total Examples:** 28,337  
**Purpose:** Train LLMs for autonomous n8n workflow creation and error troubleshooting

## Dataset Files

### 01_conversational_sft.jsonl
**Examples:** 9,979  
**Source:** eclaude/n8n-workflows-sft (HuggingFace)  
**Format:** Supervised Fine-Tuning (SFT)

**Specialty:** Conversational workflow generation

**Structure:**
```json
{
  "instruction": "Create a workflow that sends Slack notifications when...",
  "response": "{n8n workflow JSON}"
}
```

**Use For:**
- Training conversational AI assistants
- Natural language → workflow conversion
- Chat-based workflow generation

---

### 02_reasoning_with_thinking.jsonl
**Examples:** 5,361  
**Source:** ruh-ai/n8n-workflow-dataset (HuggingFace)  
**Format:** JSONL with thinking chains

**Specialty:** Reasoning and debugging (UNIQUE - only dataset with thinking chains!)

**Structure:**
```json
{
  "prompt": "Build a workflow to process CSV files...",
  "thinking": "Step 1: We need a trigger... Step 2: Parse CSV... Step 3: Loop through rows...",
  "json": "{n8n workflow JSON}"
}
```

**Use For:**
- Teaching logical reasoning
- Error troubleshooting
- Workflow debugging
- Explaining design decisions

---

### 03_latest_features.jsonl
**Examples:** 2,737  
**Source:** mbakgun/n8nbuilder-n8n-workflows-dataset (HuggingFace)  
**Updated:** December 26, 2024

**Specialty:** Most recent n8n features and nodes

**Use For:**
- Current n8n API patterns
- Latest node versions
- Modern integrations
- Avoiding deprecated patterns

---

### 04_advanced_workflows.json
**Examples:** 10,260  
**Source:** Original validated collection  
**Format:** JSONL (despite .json extension)

**Specialty:** Complex multi-node workflows

**Use For:**
- Advanced integration patterns
- Sophisticated business logic
- Production workflow examples
- Complex data transformations

---

## Training Recommendations

### Quick Start
```python
import json
from pathlib import Path

def load_training_data():
    """Load all training datasets."""
    training_files = [
        '01_conversational_sft.jsonl',
        '02_reasoning_with_thinking.jsonl',
        '03_latest_features.jsonl',
        '04_advanced_workflows.json',
    ]
    
    examples = []
    for filename in training_files:
        with open(filename, 'r', encoding='utf-8') as f:
            first_char = f.read(1)
            f.seek(0)
            
            if first_char == '[':
                examples.extend(json.load(f))  # JSON array
            else:
                examples.extend(json.loads(line) for line in f if line.strip())
    
    return examples

data = load_training_data()
print(f"Loaded {len(data):,} training examples")
```

### Train/Val/Test Split
```python
from sklearn.model_selection import train_test_split

train, temp = train_test_split(data, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)

print(f"Train: {len(train):,}")  # 22,670
print(f"Val:   {len(val):,}")    # 2,834
print(f"Test:  {len(test):,}")   # 2,833
```

## Why This Subset?

**Quality over Quantity:**
- ✅ Curated best examples from each category
- ✅ Complementary strengths (conversation, reasoning, latest, advanced)
- ✅ 3x faster training than full 81K dataset
- ✅ 3x lower compute cost
- ✅ No duplicates

**Unique Capabilities:**
- **Conversational:** Natural language understanding (01)
- **Reasoning:** Step-by-step logic and debugging (02)
- **Current:** Latest n8n features (03)
- **Advanced:** Complex patterns (04)

## Alternative: Full Dataset

For maximum coverage, see `../n8n_master.jsonl` (81,649 unique workflows).

Use the full dataset if:
- Quality subset shows coverage gaps
- Production deployment needs
- Comprehensive service knowledge required

## File Formats

All files use JSONL (JSON Lines) format:
- One JSON object per line
- Easy to stream
- Memory efficient
- Industry standard

**Note:** `04_advanced_workflows.json` is JSONL format despite the `.json` extension.

## Dataset Statistics

| Dataset | Examples | Size | Specialty |
|---------|----------|------|-----------|
| 01_conversational_sft | 9,979 | ~125 MB | Conversations |
| 02_reasoning_with_thinking | 5,361 | ~91 MB | Debugging |
| 03_latest_features | 2,737 | ~47 MB | Current |
| 04_advanced_workflows | 10,260 | ~13 MB | Advanced |
| **Total** | **28,337** | **~276 MB** | **Complete** |

## Related Documentation

- [Main README](../../README.md) - Repository overview
- [Datasets README](../README.md) - Full dataset collection info
- [Dataset Analysis](../../../.gemini/antigravity/brain/afbe61e0-35d6-4500-8f3e-e9431fc1db24/complete_dataset_analysis.md) - Detailed analysis

## License

These datasets are aggregated from various sources. Please check individual source licenses:
- eclaude datasets: Check HuggingFace repository
- ruh-ai dataset: Check HuggingFace repository  
- mbakgun dataset: Check HuggingFace repository
- Original datasets: Part of n8n-docs repository