+# Helion 1.5 Usage Guide
+A complete guide to using the Helion 1.5 dataset series to train and fine-tune language models.
+## Table of Contents
+1. [Quick Start](#quick-start)
+2. [Dataset Overview](#dataset-overview)
+3. [Loading Data](#loading-data)
+4. [Training Examples](#training-examples)
+5. [Fine-Tuning Strategies](#fine-tuning-strategies)
+6. [Best Practices](#best-practices)
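+7. [Troubleshooting](#troubleshooting)
+8. [Evaluation](#evaluation)
+9. [Advanced Topics](#advanced-topics)
+10. [Citation](#citation)
+11. [License](#license)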
+---
+## Quick Start
+### Installation
+```bash
+pip install datasets transformers torch accelerate
+# Optional: peft and bitsandbytes for the LoRA and 8-bit examples
+pip install peft bitsandbytes
+```
+### Load Dataset
+```python
+from datasets import load_dataset
+# Load full dataset
+dataset = load_dataset("your-username/helion-1.5")
+# Load specific subset
+conversations = load_dataset(
+    "your-username/helion-1.5",
+    data_files="helion-1.5-conversations.jsonl"
+)
+```
+### Basic Training
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
+# Initialize model and tokenizer
+model_name = "meta-llama/Llama-2-7b-hf"  # or your preferred base model
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+# Quick training setup
+training_args = TrainingArguments(
+    output_dir="./helion-1.5-finetuned",
+    num_train_epochs=3,
+    per_device_train_batch_size=4,
+    learning_rate=2e-5,
+    logging_steps=100,
+)
+```
+---
+## Dataset Overview
+### File Structure
+```
+helion-1.5/
+├── helion-1.5-conversations.jsonl    # 800K multi-turn conversations
+├── helion-1.5-instructions.jsonl     # 600K instruction pairs
+├── helion-1.5-code.jsonl             # 250K code examples
+├── helion-1.5-reasoning.jsonl        # 180K reasoning tasks
+├── helion-1.5-creative.jsonl         # 120K creative writing
+└── helion-1.5-multilingual.jsonl     # 50K multilingual data
+```
+### Data Formats
+#### Conversations Format
+```json
+{
+  "id": "conv_abc123",
+  "conversations": [
+    {"role": "user", "content": "How does photosynthesis work?"},
+    {"role": "assistant", "content": "Photosynthesis is..."}
+  ],
+  "metadata": {
+    "domain": "science",
+    "difficulty": "intermediate",
+    "quality_score": 0.95
+  }
+}
+```
+#### Instructions Format
+```json
+{
+  "id": "inst_xyz789",
+  "instruction": "Summarize the following text:",
+  "input": "Long text here...",
+  "output": "Summary here...",
+  "metadata": {
+    "task_type": "summarization",
+    "complexity": "medium"
+  }
+}
+```
+#### Code Format
+```json
+{
+  "id": "code_def456",
+  "language": "python",
+  "problem": "Implement binary search",
+  "solution": "def binary_search(arr, target): ...",
+  "explanation": "This algorithm...",
+  "test_cases": [...]
+}
+```
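The three record layouts above can be told apart by their keys. The following is a minimal sketch, assuming the fields shown in the samples are the required ones (the guide does not say which fields are mandatory). `REQUIRED_KEYS` and `detect_format` are hypothetical helper names, not part of any dataset tooling.

```python
# Hypothetical helper: guess which Helion 1.5 layout a record follows.
# Assumption: the keys listed here are required for each layout.
REQUIRED_KEYS = {
    "conversations": {"id", "conversations"},
    "instructions": {"id", "instruction", "output"},
    "code": {"id", "language", "problem", "solution"},
}


def detect_format(record):
    """Return the first layout whose required keys are all present, else None."""
    for name, keys in REQUIRED_KEYS.items():
        if keys.issubset(record):
            return name
    return None


sample = {
    "id": "inst_xyz789",
    "instruction": "Summarize the following text:",
    "input": "Long text here...",
    "output": "Summary here...",
}
print(detect_format(sample))
```

A check like this can run on the first few lines of each file before training, so that a mislabeled or truncated file is caught early.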
+---
+## Loading Data
+### Load Specific Subsets
+```python
+from datasets import load_dataset
+# Load only conversations
+conversations = load_dataset(
+    "your-username/helion-1.5",
+    data_files="helion-1.5-conversations.jsonl",
+    split="train"
+)
+# Load multiple files
+multi_data = load_dataset(
+    "your-username/helion-1.5",
+    data_files=[
+        "helion-1.5-conversations.jsonl",
+        "helion-1.5-instructions.jsonl"
+    ]
+)
+```
+### Filter by Domain
+```python
+# Filter science domain
+science_data = conversations.filter(
+    lambda x: x['metadata']['domain'] == 'science'
+)
+# Filter high quality
+high_quality = conversations.filter(
+    lambda x: x['metadata'].get('quality_score', 0) > 0.9
+)
+```
+### Combine Multiple Sources
+```python
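+from datasets import concatenate_datasets, load_dataset
+# Load different subsets. split="train" returns a Dataset rather than a DatasetDict
+conv = load_dataset("...", data_files="conversations.jsonl", split="train")
+inst = load_dataset("...", data_files="instructions.jsonl", split="train")
+# Combine. concatenate_datasets requires identical columns, and the two subsets
+# use different schemas, so map each to a shared "text" column first
+# (see Training Examples)
+combined = concatenate_datasets([conv, inst])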
+```
+---
+## Training Examples
+### 1. Instruction Fine-Tuning
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
+from datasets import load_dataset
+# Load instruction data
+dataset = load_dataset(
+    "your-username/helion-1.5",
+    data_files="helion-1.5-instructions.jsonl",
+    split="train"  # return a Dataset; without split, load_dataset returns a DatasetDict
+)
+# Initialize
+model_name = "meta-llama/Llama-2-7b-hf"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer.pad_token = tokenizer.eos_token
+model = AutoModelForCausalLM.from_pretrained(model_name)
+# Format function
+def format_instruction(example):
+    text = f"### Instruction:\n{example['instruction']}\n\n"
+    if example.get('input'):
+        text += f"### Input:\n{example['input']}\n\n"
+    text += f"### Response:\n{example['output']}"
+    return {"text": text}
+# Apply formatting
+dataset = dataset.map(format_instruction)
+# Tokenize
+def tokenize_function(examples):
+    return tokenizer(
+        examples["text"],
+        padding="max_length",
+        truncation=True,
+        max_length=512
+    )
+tokenized_dataset = dataset.map(tokenize_function, batched=True)
+# Training arguments
+training_args = TrainingArguments(
+    output_dir="./instruction-model",
+    num_train_epochs=3,
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=8,
+    learning_rate=2e-5,
+    warmup_steps=500,
+    logging_steps=100,
+    save_steps=1000,
+    fp16=True,
+    optim="adamw_torch",
+)
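+# Train. With mlm=False the collator copies input_ids into labels, which the
+# model needs in order to return a loss
+from transformers import DataCollatorForLanguageModeling
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_dataset,
+    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+)
+trainer.train()
+model.save_pretrained("./instruction-model-final")
+tokenizer.save_pretrained("./instruction-model-final")  # needed to reload with pipeline()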
+```
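The batch settings above give an effective batch of 4 × 8 = 32 sequences per optimizer step on each device. The sketch below estimates how many optimizer steps that implies, which helps judge whether `warmup_steps=500` is a sensible share of the run. It assumes a single GPU and that all 600K instruction pairs listed under File Structure are used; `optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate total optimizer updates for a run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer update happens every `grad_accum` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


# Assumed inputs: 600K instruction pairs, one GPU, settings from the example above
total = optimizer_steps(600_000, per_device_batch=4, grad_accum=8, epochs=3)
warmup_share = 500 / total
print(total, f"{warmup_share:.2%}")
```

Under these assumptions the warmup covers under 1% of training. On a filtered or much smaller subset the same 500 steps could be a large fraction of the run, in which case `warmup_ratio` may be the safer setting.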
+### 2. Conversational Model Training
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
+from datasets import load_dataset
+# Load conversation data (tokenizer and model are reused from example 1)
+dataset = load_dataset(
+    "your-username/helion-1.5",
+    data_files="helion-1.5-conversations.jsonl",
+    split="train"
+)
+# Format conversations
+def format_conversation(example):
+    formatted = ""
+    for turn in example['conversations']:
+        role = turn['role'].capitalize()
+        content = turn['content']
+        formatted += f"{role}: {content}\n\n"
+    return {"text": formatted.strip()}
+dataset = dataset.map(format_conversation)
+# Tokenize
+def tokenize(examples):
+    return tokenizer(
+        examples["text"],
+        padding="max_length",
+        truncation=True,
+        max_length=2048  # Longer for conversations
+    )
+tokenized = dataset.map(tokenize, batched=True)
+# Training setup
+training_args = TrainingArguments(
+    output_dir="./conversation-model",
+    num_train_epochs=3,
+    per_device_train_batch_size=2,
+    gradient_accumulation_steps=16,
+    learning_rate=1e-5,
+    warmup_ratio=0.1,
+    logging_steps=50,
+    save_strategy="epoch",
+    fp16=True,
+)
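+# The collator adds labels (a copy of input_ids) so the model returns a loss
+from transformers import DataCollatorForLanguageModeling
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized,
+    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+)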
+trainer.train()
+```
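Before tokenizing, it can help to look at the exact text the formatter produces. The sketch below restates `format_conversation` from the example above as a standalone function and applies it to the sample record from the Conversations Format section. Nothing here depends on `datasets` or `transformers`.

```python
def format_conversation(example):
    """Join turns as "Role: content" blocks separated by blank lines."""
    formatted = ""
    for turn in example["conversations"]:
        role = turn["role"].capitalize()
        content = turn["content"]
        formatted += f"{role}: {content}\n\n"
    return {"text": formatted.strip()}


# Sample record copied from the Conversations Format section
record = {
    "id": "conv_abc123",
    "conversations": [
        {"role": "user", "content": "How does photosynthesis work?"},
        {"role": "assistant", "content": "Photosynthesis is..."},
    ],
}
formatted = format_conversation(record)["text"]
print(formatted)
```

Note that this plain-text layout is specific to this guide. If the base model ships its own chat template, formatting with that template instead is likely to match its pretraining more closely.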
+### 3. Code Generation Training
+```python
+# Load code data
+code_data = load_dataset(
+    "your-username/helion-1.5",
+    data_files="helion-1.5-code.jsonl",
+    split="train"
+)
+# Format code examples
+def format_code(example):
+    text = f"# Problem: {example['problem']}\n\n"
+    text += f"# Solution ({example['language']}):\n{example['solution']}\n\n"
+    if example.get('explanation'):
+        text += f"# Explanation: {example['explanation']}"
+    return {"text": text}
+code_data = code_data.map(format_code)
+# Filter by language (optional)
+python_code = code_data.filter(
+    lambda x: x['language'] == 'python'
+)
+# Training with code-specific settings
+training_args = TrainingArguments(
+    output_dir="./code-model",
+    num_train_epochs=5,  # More epochs for code
+    per_device_train_batch_size=4,
+    learning_rate=3e-5,
+    warmup_steps=1000,
+    save_steps=2000,
+)
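+# Tokenize the filtered data (tokenizer is reused from example 1), then train
+from transformers import DataCollatorForLanguageModeling
+tokenized_code = python_code.map(
+    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
+    batched=True,
+)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_code,
+    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+)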
+trainer.train()
+```
+### 4. LoRA Fine-Tuning (Memory Efficient)
+```python
+from peft import LoraConfig, get_peft_model, TaskType
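+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+# Load base model in 8-bit (requires the bitsandbytes package). Passing
+# load_in_8bit directly to from_pretrained is deprecated
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+    device_map="auto",
+)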
+# LoRA configuration
+lora_config = LoraConfig(
+    r=16,  # LoRA rank
+    lora_alpha=32,
+    target_modules=["q_proj", "v_proj"],
+    lora_dropout=0.05,
+    bias="none",
+    task_type=TaskType.CAUSAL_LM
+)
+# Add LoRA adapters
+model = get_peft_model(model, lora_config)
+model.print_trainable_parameters()
+# Training with LoRA
+training_args = TrainingArguments(
+    output_dir="./lora-model",
+    num_train_epochs=3,
+    per_device_train_batch_size=8,  # Can use larger batch
+    gradient_accumulation_steps=4,
+    learning_rate=3e-4,  # Higher LR for LoRA
+    fp16=True,
+    logging_steps=100,
+)
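+# The collator adds labels (a copy of input_ids) so the model returns a loss
+from transformers import DataCollatorForLanguageModeling
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_dataset,
+    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+)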
+trainer.train()
+```
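The LoRA configuration above trains only the low-rank adapter matrices. The arithmetic below is a sketch of how many parameters that adds, assuming the Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` weights). `lora_trainable_params` is a hypothetical helper, and with `bias="none"` its result should be close to what `print_trainable_parameters()` reports.

```python
def lora_trainable_params(rank, d_in, d_out, modules_per_layer, num_layers):
    """Each adapted weight gains two matrices: A (rank x d_in) and B (d_out x rank)."""
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


# Assumed shapes for Llama-2-7B: q_proj and v_proj are both 4096 x 4096, 32 layers
added = lora_trainable_params(rank=16, d_in=4096, d_out=4096, modules_per_layer=2, num_layers=32)
print(f"{added:,} trainable LoRA parameters")
```

Doubling `r` or adding more entries to `target_modules` scales this count linearly, which is a quick way to compare adapter sizes before launching a run.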
+---
+## Fine-Tuning Strategies
+### Strategy 1: Domain-Specific Fine-Tuning
+```python
+# Fine-tune on specific domain
+science_data = dataset.filter(
+    lambda x: x['metadata']['domain'] == 'science'
+)
+# Train with domain focus
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=science_data,
+)
+```
+### Strategy 2: Progressive Fine-Tuning
+```python
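+# Trainer.train() takes no dataset argument, so build one Trainer per stage
+# Stage 1: General knowledge
+general_data = dataset.filter(
+    lambda x: x['metadata']['domain'] == 'general'
+)
+Trainer(model=model, args=training_args, train_dataset=general_data).train()
+# Stage 2: Specialized knowledge (continues from the stage 1 weights)
+specialized_data = dataset.filter(
+    lambda x: x['metadata']['difficulty'] == 'advanced'
+)
+Trainer(model=model, args=training_args, train_dataset=specialized_data).train()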
+```
+### Strategy 3: Multi-Task Learning
+```python
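+# Mix different data types
+conv_weight = 0.4
+inst_weight = 0.3
+code_weight = 0.3
+# Sample proportionally. conversations, instructions, and code_data must each be
+# a Dataset (loaded with split="train") with identical columns, for example only
+# "text" after formatting, or concatenation fails
+from datasets import concatenate_datasets
+mixed_dataset = concatenate_datasets([
+    conversations.shuffle(seed=42).select(range(int(10000 * conv_weight))),
+    instructions.shuffle(seed=42).select(range(int(10000 * inst_weight))),
+    code_data.shuffle(seed=42).select(range(int(10000 * code_weight))),
+])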
+```
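Truncating `int(10000 * weight)` can drop an example or two when the weights do not multiply out to whole numbers. The sketch below shows one way to turn weights into per-source counts that always add up to the requested total, using largest-remainder rounding. `mix_counts` is a hypothetical helper and is not part of the `datasets` library.

```python
def mix_counts(weights, total):
    """Split `total` examples across sources in proportion to `weights`."""
    weight_sum = sum(weights.values())
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining examples to the sources with the largest fractional parts
    by_fraction = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


counts = mix_counts({"conversations": 0.4, "instructions": 0.3, "code": 0.3}, total=10_000)
print(counts)
```

Each count can then be passed to `select(range(count))` on the corresponding shuffled subset, provided the subset holds at least that many rows.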
+### Strategy 4: Curriculum Learning
+```python
+# Start with easy examples
+easy_data = dataset.filter(
+    lambda x: x['metadata']['difficulty'] == 'easy'
+)
+# Progress to harder examples
+medium_data = dataset.filter(
+    lambda x: x['metadata']['difficulty'] == 'intermediate'
+)
+hard_data = dataset.filter(
+    lambda x: x['metadata']['difficulty'] == 'advanced'
+)
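+# Train progressively. Trainer.train() takes no dataset argument,
+# so build a Trainer for each stage; the model keeps its weights between stages
+for stage, data in enumerate([easy_data, medium_data, hard_data]):
+    Trainer(model=model, args=training_args, train_dataset=data).train()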
+```
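The three filters above each scan the full dataset. As an illustration of the same idea in one pass, the sketch below groups plain dictionaries into ordered stages and skips records whose difficulty label is missing or unknown. The label order in `DIFFICULTY_RANK` is an assumption based on the values used in the filters, and `curriculum_stages` is a hypothetical helper.

```python
# Assumed ordering of the difficulty labels used in the filters above
DIFFICULTY_RANK = {"easy": 0, "intermediate": 1, "advanced": 2}


def curriculum_stages(examples):
    """Group examples into stages ordered from easiest to hardest."""
    stages = [[] for _ in DIFFICULTY_RANK]
    for example in examples:
        label = example.get("metadata", {}).get("difficulty")
        rank = DIFFICULTY_RANK.get(label)
        if rank is not None:  # skip missing or unknown labels
            stages[rank].append(example)
    return stages


examples = [
    {"id": "a", "metadata": {"difficulty": "advanced"}},
    {"id": "b", "metadata": {"difficulty": "easy"}},
    {"id": "c", "metadata": {"difficulty": "intermediate"}},
    {"id": "d", "metadata": {}},
]
for stage in curriculum_stages(examples):
    print([example["id"] for example in stage])
```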
+---
+## Best Practices
+### 1. Data Preparation
+```python
+# Clean and validate data
+def validate_example(example):
+    """Ensure data quality"""
+    if 'metadata' not in example:
+        return False
+    if example['metadata'].get('quality_score', 0) < 0.8:
+        return False
+    return True
+cleaned_dataset = dataset.filter(validate_example)
+```
+### 2. Handling Long Sequences
+```python
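+# Dynamic padding: pad each batch to its longest sequence instead of max_length.
+# The collator pads but does not truncate, so tokenize with truncation=True and
+# max_length=2048 and without a padding argument. mlm=False also adds the labels
+from transformers import DataCollatorForLanguageModeling
+data_collator = DataCollatorForLanguageModeling(
+    tokenizer=tokenizer,
+    mlm=False,
+)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    data_collator=data_collator,
+    train_dataset=tokenized_dataset,
+)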
+```
+### 3. Monitoring Training
+```python
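+# Add callbacks
+from transformers import TrainerCallback
+class QualityMonitorCallback(TrainerCallback):
+    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
+        # Evaluation metrics are prefixed with "eval_"
+        loss = (metrics or {}).get("eval_loss", float("nan"))
+        print(f"Step {state.global_step}: eval loss = {loss:.4f}")
+# Evaluate every 500 steps. Newer transformers releases name this setting
+# eval_strategy; prefer passing it to the TrainingArguments constructor
+training_args.evaluation_strategy = "steps"
+training_args.eval_steps = 500
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_dataset,
+    eval_dataset=eval_dataset,  # assumed: a held-out tokenized split
+    callbacks=[QualityMonitorCallback()],
+)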
+```
+### 4. Saving Checkpoints
+```python
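+training_args = TrainingArguments(
+    output_dir="./checkpoints",
+    save_strategy="steps",
+    save_steps=1000,
+    save_total_limit=3,  # Keep at most 3 checkpoints on disk
+    # load_best_model_at_end requires evaluation on the same schedule as saving
+    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
+    eval_steps=1000,
+    load_best_model_at_end=True,
+)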
+```
+### 5. Distributed Training
+```bash
+# Launch with multiple GPUs
+accelerate launch --multi_gpu train.py
+# Or with DeepSpeed
+deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
+```
+---
+## Troubleshooting
+### Out of Memory
+```python
+# Solutions:
+# 1. Reduce batch size
+training_args.per_device_train_batch_size = 1
+# 2. Increase gradient accumulation
+training_args.gradient_accumulation_steps = 32
+# 3. Use gradient checkpointing
+model.gradient_checkpointing_enable()
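+# 4. Use 8-bit loading (requires bitsandbytes; pair with LoRA for training)
+from transformers import BitsAndBytesConfig
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+    device_map="auto"
+)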
+```
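A rough estimate shows why full fine-tuning of a 7B model runs out of memory on a single GPU. The sketch below counts only weights, gradients, and the two AdamW moment buffers, all in fp32. The 7-billion parameter count is a round-number assumption, activations are excluded, and `full_finetune_bytes` is a hypothetical helper.

```python
def full_finetune_bytes(num_params, bytes_weights=4, bytes_grads=4, bytes_adam_states=8):
    """Rough memory for weights, gradients, and AdamW moments; activations excluded."""
    return num_params * (bytes_weights + bytes_grads + bytes_adam_states)


# Assumption: a 7-billion-parameter model trained fully in fp32 with AdamW
estimate_gb = full_finetune_bytes(7_000_000_000) / 1e9
print(f"~{estimate_gb:.0f} GB before activations")
```

Reducing the batch size and enabling gradient checkpointing shrink activation memory only, so they do not change this figure. Training a small set of adapter weights, as in the LoRA example, removes most of the gradient and optimizer-state terms.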
+### Slow Training
+```python
+# Solutions:
+# 1. Enable mixed precision
+training_args.fp16 = True
+# 2. Optimize data loading
+dataset.set_format("torch")
+# 3. Increase workers
+training_args.dataloader_num_workers = 4
+# 4. Pin memory
+training_args.dataloader_pin_memory = True
+```
+### Poor Model Performance
+```python
+# Solutions:
+# 1. Increase training epochs
+training_args.num_train_epochs = 5
+# 2. Adjust learning rate
+training_args.learning_rate = 1e-5
+# 3. Add warmup
+training_args.warmup_ratio = 0.1
+# 4. Filter low-quality data
+high_quality = dataset.filter(
+    lambda x: x['metadata'].get('quality_score', 0) > 0.9
+)
+```
+### Data Loading Issues
+```python
+# Solutions:
+# 1. Check file format
+from datasets import load_dataset
+try:
+    dataset = load_dataset("...", split="train")
+except Exception as e:
+    print(f"Error: {e}")
+# 2. Manually load JSONL
+import json
+data = []
+with open("file.jsonl", "r") as f:
+    for line in f:
+        data.append(json.loads(line))
+# 3. Verify data structure
+print(dataset[0])
+```
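The manual loader above stops at the first malformed line. The variant below is a minimal sketch that keeps going and reports which line numbers failed to parse, which makes a damaged file easier to repair. `read_jsonl` is a hypothetical helper, and the temporary file stands in for a real Helion 1.5 file.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSONL file; return (records, errors) with errors as (line_number, message)."""
    records, errors = [], []
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                errors.append((line_number, exc.msg))
    return records, errors


# Demo on a throwaway file: one valid record, one blank line, one broken record
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"id": "ok_1"}\n\n{"id": broken}\n')
records, errors = read_jsonl(tmp.name)
os.remove(tmp.name)
print(len(records), [line_number for line_number, _ in errors])
```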
+---
+## Evaluation
+### Evaluate on Benchmarks
+```python
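+import numpy as np
+import evaluate  # pip install evaluate; datasets.load_metric has been removed
+# Load metrics
+accuracy = evaluate.load("accuracy")
+bleu = evaluate.load("bleu")  # for generated text; not used in compute_metrics
+# Evaluate
+def compute_metrics(eval_pred):
+    logits, labels = eval_pred
+    # Token-level accuracy: position i predicts token i + 1; -100 marks ignored labels
+    predictions = np.argmax(logits, axis=-1)[:, :-1]
+    labels = labels[:, 1:]
+    mask = labels != -100
+    return accuracy.compute(predictions=predictions[mask], references=labels[mask])
+# For large evaluation sets, pass preprocess_logits_for_metrics to Trainer so that
+# full logits are not accumulated in memory
+trainer = Trainer(
+    model=model,
+    eval_dataset=eval_dataset,  # assumed: a held-out tokenized split with labels
+    compute_metrics=compute_metrics,
+)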
+results = trainer.evaluate()
+print(results)
+```
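For a causal language model, the evaluation loss is often reported as perplexity, which is the exponential of the mean token-level cross-entropy. The sketch below shows the conversion. The `eval_loss` value is a hypothetical stand-in for `results["eval_loss"]`, not a measured result.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Hypothetical value standing in for results["eval_loss"] from trainer.evaluate()
eval_loss = 2.0
print(f"perplexity = {perplexity_from_loss(eval_loss):.2f}")
```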
+### Generate Samples
+```python
+# Generate text
+from transformers import pipeline
+generator = pipeline("text-generation", model="./trained-model")
+prompt = "Explain quantum computing in simple terms:"
+output = generator(prompt, max_length=200)
+print(output[0]['generated_text'])
+```
+---
+## Advanced Topics
+### Custom Data Mixing
+```python
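+from datasets import concatenate_datasets, load_dataset
+def create_mixed_dataset(ratios, total=10000, seed=42):
+    """Mix different datasets with specified ratios"""
+    # split="train" returns Dataset objects, which support select()
+    datasets_dict = {
+        'conversations': load_dataset(..., data_files="conversations.jsonl", split="train"),
+        'instructions': load_dataset(..., data_files="instructions.jsonl", split="train"),
+        'code': load_dataset(..., data_files="code.jsonl", split="train"),
+    }
+    mixed = []
+    for name, ratio in ratios.items():
+        size = int(total * ratio)
+        mixed.append(datasets_dict[name].shuffle(seed=seed).select(range(size)))
+    # The subsets must share the same columns before they can be concatenated,
+    # so format each one to a common schema (for example "text") first
+    return concatenate_datasets(mixed)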
+# Use it
+dataset = create_mixed_dataset({
+    'conversations': 0.4,
+    'instructions': 0.4,
+    'code': 0.2
+})
+```
+### Hyperparameter Tuning
+```python
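+from ray import tune
+from transformers import DataCollatorForLanguageModeling
+def train_model(config):
+    # Load a fresh model for each trial so trials do not share weights
+    model = AutoModelForCausalLM.from_pretrained(model_name)
+    training_args = TrainingArguments(
+        output_dir="./tune-trial",
+        learning_rate=config["lr"],
+        per_device_train_batch_size=config["batch_size"],
+        num_train_epochs=3,
+    )
+    # tokenized_dataset: the tokenized training split from Training Examples
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=tokenized_dataset,
+        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+    )
+    result = trainer.train()
+    # log_history[-1] is the end-of-run summary and has no "loss" key
+    return {"loss": result.training_loss}
+# Run hyperparameter search
+analysis = tune.run(
+    train_model,
+    metric="loss",
+    mode="min",
+    num_samples=8,  # number of trials; the default of 1 samples one configuration
+    config={
+        "lr": tune.loguniform(1e-6, 1e-4),
+        "batch_size": tune.choice([2, 4, 8]),
+    }
+)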
+```
+---
+## Citation
+```bibtex
+@dataset{helion_1_5_2024,
+  title={Helion 1.5: An Enhanced Large-Scale Dataset for Language Model Training},
+  author={DeepXR/Organization},
+  year={2025},
+  publisher={Hugging Face},
+}
+```
+---
+## License
+This dataset is released under the CC BY 4.0 license. See the LICENSE file for details.