Trouter-Library committed
Commit a313c4b · verified · 1 Parent(s): 0e83bb5

Delete USAGE_GUIDE.md

Files changed (1):
  1. USAGE_GUIDE.md +0 -740
USAGE_GUIDE.md DELETED
@@ -1,740 +0,0 @@

# Helion 1.5 Usage Guide

A complete guide to using the Helion 1.5 dataset series for training and fine-tuning language models.

## Table of Contents

1. [Quick Start](#quick-start)
2. [Dataset Overview](#dataset-overview)
3. [Loading Data](#loading-data)
4. [Training Examples](#training-examples)
5. [Fine-Tuning Strategies](#fine-tuning-strategies)
6. [Best Practices](#best-practices)
7. [Troubleshooting](#troubleshooting)
8. [Evaluation](#evaluation)
9. [Advanced Topics](#advanced-topics)

---

## Quick Start

### Installation

```bash
pip install datasets transformers torch accelerate
```

### Load Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("your-username/helion-1.5")

# Load a specific subset
conversations = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl"
)
```

### Basic Training

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Initialize model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"  # or your preferred base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Quick training setup
training_args = TrainingArguments(
    output_dir="./helion-1.5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=100,
)
```
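
To complete the quick-start loop, tokenize the loaded split and hand everything to `Trainer`. This is a minimal sketch that reuses the `dataset`, `tokenizer`, `model`, and `training_args` objects above and assumes a `"text"` column (the formatting examples later in this guide show how to build one from the raw records):

```python
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

def tokenize(batch):
    # Assumes a "text" column; see the formatting functions below
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True)

# mlm=False pads dynamically and copies input_ids into labels for causal LM
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```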
59
-
60
- ---
61
-
62
- ## Dataset Overview
63
-
64
- ### File Structure
65
-
66
- ```
67
- helion-1.5/
68
- ├── helion-1.5-conversations.jsonl # 800K multi-turn conversations
69
- ├── helion-1.5-instructions.jsonl # 600K instruction pairs
70
- ├── helion-1.5-code.jsonl # 250K code examples
71
- ├── helion-1.5-reasoning.jsonl # 180K reasoning tasks
72
- ├── helion-1.5-creative.jsonl # 120K creative writing
73
- └── helion-1.5-multilingual.jsonl # 50K multilingual data
74
- ```
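
To get all six subsets in a single object, one option is to pass a `data_files` mapping; the `datasets` library turns each key into a named split. A sketch:

```python
from datasets import load_dataset

# Each data_files key becomes a split in the returned DatasetDict
subsets = load_dataset(
    "your-username/helion-1.5",
    data_files={
        "conversations": "helion-1.5-conversations.jsonl",
        "instructions": "helion-1.5-instructions.jsonl",
        "code": "helion-1.5-code.jsonl",
        "reasoning": "helion-1.5-reasoning.jsonl",
        "creative": "helion-1.5-creative.jsonl",
        "multilingual": "helion-1.5-multilingual.jsonl",
    },
)
print(subsets)  # DatasetDict with one split per subset
```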

### Data Formats

#### Conversations Format

```json
{
  "id": "conv_abc123",
  "conversations": [
    {"role": "user", "content": "How does photosynthesis work?"},
    {"role": "assistant", "content": "Photosynthesis is..."}
  ],
  "metadata": {
    "domain": "science",
    "difficulty": "intermediate",
    "quality_score": 0.95
  }
}
```
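
Because each turn uses the standard role/content keys, a record's `conversations` list can be passed straight to a chat model's tokenizer template. A sketch, assuming a tokenizer that ships a chat template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

record = {
    "conversations": [
        {"role": "user", "content": "How does photosynthesis work?"},
        {"role": "assistant", "content": "Photosynthesis is..."},
    ]
}

# Renders the turns in the model's own prompt format
text = tokenizer.apply_chat_template(record["conversations"], tokenize=False)
print(text)
```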

#### Instructions Format

```json
{
  "id": "inst_xyz789",
  "instruction": "Summarize the following text:",
  "input": "Long text here...",
  "output": "Summary here...",
  "metadata": {
    "task_type": "summarization",
    "complexity": "medium"
  }
}
```

#### Code Format

```json
{
  "id": "code_def456",
  "language": "python",
  "problem": "Implement binary search",
  "solution": "def binary_search(arr, target): ...",
  "explanation": "This algorithm...",
  "test_cases": [...]
}
```

---

## Loading Data

### Load Specific Subsets

```python
from datasets import load_dataset

# Load only the conversations subset
conversations = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train"
)

# Load multiple files
multi_data = load_dataset(
    "your-username/helion-1.5",
    data_files=[
        "helion-1.5-conversations.jsonl",
        "helion-1.5-instructions.jsonl"
    ]
)
```

### Filter by Domain

```python
# Keep only the science domain
science_data = conversations.filter(
    lambda x: x['metadata']['domain'] == 'science'
)

# Keep only high-quality examples
high_quality = conversations.filter(
    lambda x: x['metadata'].get('quality_score', 0) > 0.9
)
```

### Combine Multiple Sources

```python
from datasets import concatenate_datasets, load_dataset

# Load the individual subsets; split="train" returns Dataset objects,
# which is what concatenate_datasets expects
conv = load_dataset("your-username/helion-1.5",
                    data_files="helion-1.5-conversations.jsonl", split="train")
inst = load_dataset("your-username/helion-1.5",
                    data_files="helion-1.5-instructions.jsonl", split="train")

# Combine
combined = concatenate_datasets([conv, inst])
```

---

## Training Examples

### 1. Instruction Fine-Tuning

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

# Load instruction data
dataset = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-instructions.jsonl",
    split="train"
)

# Initialize model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build a single prompt string per example
def format_instruction(example):
    text = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get('input'):
        text += f"### Input:\n{example['input']}\n\n"
    text += f"### Response:\n{example['output']}"
    return {"text": text}

# Apply formatting
dataset = dataset.map(format_instruction)

# Tokenize; labels are a copy of input_ids so Trainer can compute the causal-LM loss
def tokenize_function(examples):
    tokens = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./instruction-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_steps=500,
    logging_steps=100,
    save_steps=1000,
    fp16=True,
    optim="adamw_torch",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
model.save_pretrained("./instruction-model-final")
```

### 2. Conversational Model Training

```python
# Reuses the tokenizer and model initialized in example 1
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load conversation data
dataset = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train"
)

# Flatten each multi-turn conversation into a single training string
def format_conversation(example):
    formatted = ""
    for turn in example['conversations']:
        role = turn['role'].capitalize()
        content = turn['content']
        formatted += f"{role}: {content}\n\n"
    return {"text": formatted.strip()}

dataset = dataset.map(format_conversation)

# Tokenize, again copying input_ids into labels
def tokenize(examples):
    tokens = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=2048  # longer context for conversations
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(tokenize, batched=True)

# Training setup
training_args = TrainingArguments(
    output_dir="./conversation-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    logging_steps=50,
    save_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
)

trainer.train()
```

### 3. Code Generation Training

```python
# Load code data
code_data = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-code.jsonl",
    split="train"
)

# Format code examples
def format_code(example):
    text = f"# Problem: {example['problem']}\n\n"
    text += f"# Solution ({example['language']}):\n{example['solution']}\n\n"
    if example.get('explanation'):
        text += f"# Explanation: {example['explanation']}"
    return {"text": text}

code_data = code_data.map(format_code)

# Filter by language (optional)
python_code = code_data.filter(
    lambda x: x['language'] == 'python'
)

# Tokenize (reusing the tokenize function from example 2)
tokenized_code = python_code.map(tokenize, batched=True)

# Training with code-specific settings
training_args = TrainingArguments(
    output_dir="./code-model",
    num_train_epochs=5,  # more epochs for code
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_steps=1000,
    save_steps=2000,
)

# Train model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_code,
)

trainer.train()
```

### 4. LoRA Fine-Tuning (Memory Efficient)

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Load base model in 8-bit to cut memory use
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # 8-bit quantization
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # recommended prep for k-bit training

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training with LoRA
training_args = TrainingArguments(
    output_dir="./lora-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # larger batches fit with adapters
    gradient_accumulation_steps=4,
    learning_rate=3e-4,  # higher LR is typical for LoRA
    fp16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
```
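
After training, only the small adapter weights need to be saved; they can later be reattached to the base model, or merged into it for adapter-free inference. A short sketch using peft's standard APIs (the paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save just the LoRA adapter (a few MB, not the full model)
model.save_pretrained("./lora-model-adapter")

# Reload: attach the adapter to a fresh full-precision copy of the base model
base = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base, "./lora-model-adapter")

# Optionally fold the adapter into the base weights for plain inference
merged = model.merge_and_unload()
merged.save_pretrained("./lora-model-merged")
```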

---

## Fine-Tuning Strategies

### Strategy 1: Domain-Specific Fine-Tuning

```python
# Fine-tune on a specific domain (format and tokenize it
# as in the training examples first)
science_data = dataset.filter(
    lambda x: x['metadata']['domain'] == 'science'
)

# Train with domain focus
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=science_data,
)
```

### Strategy 2: Progressive Fine-Tuning

```python
# Stage 1: general knowledge
general_data = dataset.filter(
    lambda x: x['metadata']['domain'] == 'general'
)
trainer.train_dataset = general_data  # Trainer.train() takes no dataset argument
trainer.train()

# Stage 2: specialized knowledge
specialized_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'advanced'
)
trainer.train_dataset = specialized_data
trainer.train()
```

### Strategy 3: Multi-Task Learning

```python
from datasets import concatenate_datasets

# Mixing weights for the different data types
conv_weight = 0.4
inst_weight = 0.3
code_weight = 0.3

# Sample each subset proportionally (10K examples total)
mixed_dataset = concatenate_datasets([
    conversations.shuffle().select(range(int(10000 * conv_weight))),
    instructions.shuffle().select(range(int(10000 * inst_weight))),
    code_data.shuffle().select(range(int(10000 * code_weight))),
])
```

### Strategy 4: Curriculum Learning

```python
# Start with easy examples
easy_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'easy'
)

# Progress to harder examples
medium_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'intermediate'
)

hard_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'advanced'
)

# Train progressively, from easy to hard
for stage_data in [easy_data, medium_data, hard_data]:
    trainer.train_dataset = stage_data
    trainer.train()
```

---

## Best Practices

### 1. Data Preparation

```python
# Clean and validate data before training
def validate_example(example):
    """Ensure data quality"""
    if 'metadata' not in example:
        return False
    if example['metadata'].get('quality_score', 0) < 0.8:
        return False
    return True

cleaned_dataset = dataset.filter(validate_example)
```

### 2. Handling Long Sequences

```python
# Pad dynamically per batch instead of to a fixed max length.
# For causal LM training, DataCollatorForLanguageModeling(mlm=False)
# pads to the longest sequence in the batch and also builds the labels.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Tokenize without padding and let the collator pad each batch
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
```

### 3. Monitoring Training

```python
# Add callbacks to watch evaluation metrics
from transformers import TrainerCallback

class QualityMonitorCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"Step {state.global_step}: eval loss = {metrics.get('eval_loss', 0):.4f}")

training_args.evaluation_strategy = "steps"
training_args.eval_steps = 500

# Periodic evaluation requires an eval_dataset;
# train_split / eval_split: your tokenized train and validation Datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_split,
    eval_dataset=eval_split,
    callbacks=[QualityMonitorCallback()],
)
```

### 4. Saving Checkpoints

```python
training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=3,  # keep only the last 3 checkpoints
    # load_best_model_at_end requires evaluation on the same schedule as saving
    evaluation_strategy="steps",
    eval_steps=1000,
    load_best_model_at_end=True,
)
```

### 5. Distributed Training

```bash
# Launch with multiple GPUs
accelerate launch --multi_gpu train.py

# Or with DeepSpeed
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
```
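
The `ds_config.json` above is not shipped with the dataset; a minimal ZeRO stage-2 sketch that defers batch sizes and mixed precision to the Trainer's `"auto"` values might look like this:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2
  }
}
```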

---

## Troubleshooting

### Out of Memory

```python
# Solutions:
# 1. Reduce batch size
training_args.per_device_train_batch_size = 1

# 2. Increase gradient accumulation
training_args.gradient_accumulation_steps = 32

# 3. Use gradient checkpointing
model.gradient_checkpointing_enable()

# 4. Use 8-bit training
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
```

### Slow Training

```python
# Solutions:
# 1. Enable mixed precision
training_args.fp16 = True

# 2. Optimize data loading
dataset.set_format("torch")

# 3. Increase dataloader workers
training_args.dataloader_num_workers = 4

# 4. Pin memory
training_args.dataloader_pin_memory = True
```

### Poor Model Performance

```python
# Solutions:
# 1. Increase training epochs
training_args.num_train_epochs = 5

# 2. Adjust the learning rate
training_args.learning_rate = 1e-5

# 3. Add warmup
training_args.warmup_ratio = 0.1

# 4. Filter out low-quality data
high_quality = dataset.filter(
    lambda x: x['metadata'].get('quality_score', 0) > 0.9
)
```

### Data Loading Issues

```python
# Solutions:
# 1. Check that the files load at all
from datasets import load_dataset
try:
    dataset = load_dataset("your-username/helion-1.5", split="train")
except Exception as e:
    print(f"Error: {e}")

# 2. Fall back to loading the JSONL manually
import json
data = []
with open("helion-1.5-conversations.jsonl", "r") as f:
    for line in f:
        data.append(json.loads(line))

# 3. Verify the record structure
print(dataset[0])
```

---

## Evaluation

### Evaluate on Benchmarks

```python
# datasets.load_metric is deprecated; metrics now live in the evaluate library
import evaluate

accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Your metric computation (argmax logits, decode, etc.)
    return accuracy.compute(predictions=predictions, references=labels)

# eval_split: your tokenized validation Dataset
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_split,
    compute_metrics=compute_metrics,
)

results = trainer.evaluate()
print(results)
```

### Generate Samples

```python
from transformers import pipeline

# Generate text with the fine-tuned model
generator = pipeline("text-generation", model="./trained-model")

prompt = "Explain quantum computing in simple terms:"
output = generator(prompt, max_length=200)
print(output[0]['generated_text'])
```

---

## Advanced Topics

### Custom Data Mixing

```python
from datasets import concatenate_datasets, load_dataset

def create_mixed_dataset(ratios):
    """Mix the subsets with specified ratios (10K examples total)."""
    datasets_dict = {
        'conversations': load_dataset("your-username/helion-1.5",
                                      data_files="helion-1.5-conversations.jsonl", split="train"),
        'instructions': load_dataset("your-username/helion-1.5",
                                     data_files="helion-1.5-instructions.jsonl", split="train"),
        'code': load_dataset("your-username/helion-1.5",
                             data_files="helion-1.5-code.jsonl", split="train"),
    }

    mixed = []
    for name, ratio in ratios.items():
        size = int(10000 * ratio)
        mixed.append(datasets_dict[name].shuffle().select(range(size)))

    return concatenate_datasets(mixed)

# Use it
dataset = create_mixed_dataset({
    'conversations': 0.4,
    'instructions': 0.4,
    'code': 0.2
})
```

### Hyperparameter Tuning

```python
from ray import tune

def train_model(config):
    # Re-initialize the model so each trial starts from the same weights
    model = AutoModelForCausalLM.from_pretrained(model_name)
    training_args = TrainingArguments(
        output_dir="./tune-output",
        learning_rate=config["lr"],
        per_device_train_batch_size=config["batch_size"],
        num_train_epochs=3,
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset)
    trainer.train()
    # The final log entry holds the aggregate training loss
    return {"loss": trainer.state.log_history[-1]["train_loss"]}

# Run hyperparameter search
analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-6, 1e-4),
        "batch_size": tune.choice([2, 4, 8]),
    }
)
```
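
transformers also ships a built-in search wrapper, `Trainer.hyperparameter_search`, which can drive Ray Tune (or Optuna) for you. A sketch, assuming the tokenized dataset from the examples above:

```python
def model_init():
    # Called once per trial so every run starts from the same checkpoint
    return AutoModelForCausalLM.from_pretrained(model_name)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_dataset,
)

best_run = trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=8,
)
print(best_run.hyperparameters)
```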

---

## Citation

```bibtex
@dataset{helion_1_5_2025,
  title={Helion 1.5: An Enhanced Large-Scale Dataset for Language Model Training},
  author={DeepXR/Organization},
  year={2025},
  publisher={Hugging Face},
}
```

---

## License

This dataset is released under the CC BY 4.0 license. See the LICENSE file for details.