# Helion 1.5 Usage Guide

Complete guide for using the Helion 1.5 dataset series for training and fine-tuning language models.

## Table of Contents

1. [Quick Start](#quick-start)
2. [Dataset Overview](#dataset-overview)
3. [Loading Data](#loading-data)
4. [Training Examples](#training-examples)
5. [Fine-Tuning Strategies](#fine-tuning-strategies)
6. [Best Practices](#best-practices)
7. [Troubleshooting](#troubleshooting)
8. [Evaluation](#evaluation)
9. [Advanced Topics](#advanced-topics)

---

## Quick Start

### Installation

```bash
pip install datasets transformers torch accelerate
```

### Load Dataset

```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("your-username/helion-1.5")

# Load specific subset
conversations = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl"
)
```

### Basic Training

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Initialize model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"  # or your preferred base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Quick training setup
training_args = TrainingArguments(
    output_dir="./helion-1.5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=100,
)
```

---

## Dataset Overview

### File Structure

```
helion-1.5/
├── helion-1.5-conversations.jsonl   # 800K multi-turn conversations
├── helion-1.5-instructions.jsonl    # 600K instruction pairs
├── helion-1.5-code.jsonl            # 250K code examples
├── helion-1.5-reasoning.jsonl       # 180K reasoning tasks
├── helion-1.5-creative.jsonl        # 120K creative writing
└── helion-1.5-multilingual.jsonl    # 50K multilingual data
```

### Data Formats

#### Conversations Format
```json
{
  "id": "conv_abc123",
  "conversations": [
    {"role": "user", "content": "How does photosynthesis work?"},
    {"role": "assistant", "content": "Photosynthesis is..."}
  ],
  "metadata": {
    "domain": "science",
    "difficulty": "intermediate",
    "quality_score": 0.95
  }
}
```

#### Instructions Format
```json
{
  "id": "inst_xyz789",
  "instruction": "Summarize the following text:",
  "input": "Long text here...",
  "output": "Summary here...",
  "metadata": {
    "task_type": "summarization",
    "complexity": "medium"
  }
}
```

#### Code Format
```json
{
  "id": "code_def456",
  "language": "python",
  "problem": "Implement binary search",
  "solution": "def binary_search(arr, target): ...",
  "explanation": "This algorithm...",
  "test_cases": [...]
}
```

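If you want to sanity-check a local copy of one of these files before training, a minimal sketch like the following can catch malformed lines early. It assumes the conversations schema shown above; `check_jsonl` is an illustrative helper, not part of the dataset tooling:

```python
import json

def check_jsonl(path, required_keys=("id", "conversations", "metadata")):
    """Scan a Helion-style JSONL file and report malformed records."""
    bad = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad += 1
                print(f"line {lineno}: not valid JSON")
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                bad += 1
                print(f"line {lineno}: missing keys {missing}")
    print(f"done: {bad} problem record(s)")

# Example (path is illustrative):
# check_jsonl("helion-1.5-conversations.jsonl")
```
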
---

## Loading Data

### Load Specific Subsets

```python
from datasets import load_dataset

# Load only conversations
conversations = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train"
)

# Load multiple files
multi_data = load_dataset(
    "your-username/helion-1.5",
    data_files=[
        "helion-1.5-conversations.jsonl",
        "helion-1.5-instructions.jsonl"
    ]
)
```
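
With roughly two million examples across the subsets, you may not want to download everything up front. The `datasets` library supports streaming mode, which iterates over records without materializing the full file locally. A sketch, using the same placeholder repository ID as above:

```python
from datasets import load_dataset

# Stream records instead of downloading the whole file
streamed = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train",
    streaming=True,
)

# Inspect the first few records lazily
for i, example in enumerate(streamed):
    print(example["id"])
    if i >= 4:
        break
```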

### Filter by Domain

```python
# Filter science domain
science_data = conversations.filter(
    lambda x: x['metadata']['domain'] == 'science'
)

# Filter high quality
high_quality = conversations.filter(
    lambda x: x['metadata'].get('quality_score', 0) > 0.9
)
```

### Combine Multiple Sources

```python
from datasets import concatenate_datasets

# Load different subsets (split="train" returns a Dataset,
# which is what concatenate_datasets expects)
conv = load_dataset("...", data_files="conversations.jsonl", split="train")
inst = load_dataset("...", data_files="instructions.jsonl", split="train")

# Combine
combined = concatenate_datasets([conv, inst])
```
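
Note that `concatenate_datasets` requires both datasets to share the same column schema, and the conversations and instructions subsets do not (see the formats above). One practical workaround, sketched here, is to render each subset down to a common `text` column first; `to_text` is an illustrative helper:

```python
def to_text(example):
    """Flatten either record format into a single 'text' field."""
    if "conversations" in example:
        text = "\n\n".join(
            f"{turn['role'].capitalize()}: {turn['content']}"
            for turn in example["conversations"]
        )
    else:
        text = f"{example['instruction']}\n{example.get('input', '')}\n{example['output']}"
    return {"text": text}

# Drop the incompatible source columns so only 'text' remains
conv_text = conv.map(to_text, remove_columns=conv.column_names)
inst_text = inst.map(to_text, remove_columns=inst.column_names)
combined = concatenate_datasets([conv_text, inst_text])
```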

---

## Training Examples

### 1. Instruction Fine-Tuning

```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load instruction data
dataset = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-instructions.jsonl",
    split="train"
)

# Initialize
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Format function
def format_instruction(example):
    text = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get('input'):
        text += f"### Input:\n{example['input']}\n\n"
    text += f"### Response:\n{example['output']}"
    return {"text": text}

# Apply formatting
dataset = dataset.map(format_instruction)

# Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Collator that copies input_ids into labels, so the Trainer can
# compute a causal LM loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./instruction-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_steps=500,
    logging_steps=100,
    save_steps=1000,
    fp16=True,
    optim="adamw_torch",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()
model.save_pretrained("./instruction-model-final")
```

### 2. Conversational Model Training

```python
from datasets import load_dataset

# Load conversation data (model, tokenizer, and data_collator are
# reused from Example 1)
dataset = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train"
)

# Format conversations
def format_conversation(example):
    formatted = ""
    for turn in example['conversations']:
        role = turn['role'].capitalize()
        content = turn['content']
        formatted += f"{role}: {content}\n\n"
    return {"text": formatted.strip()}

dataset = dataset.map(format_conversation)

# Tokenize
def tokenize(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=2048  # Longer for conversations
    )

tokenized = dataset.map(tokenize, batched=True)

# Training setup
training_args = TrainingArguments(
    output_dir="./conversation-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    logging_steps=50,
    save_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

trainer.train()
```
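
If your base model ships a chat template (most recent instruction-tuned checkpoints do), `tokenizer.apply_chat_template` will usually produce formatting better aligned with the model than the plain `Role: content` rendering above. A sketch of an alternative formatter, relying on the fact that the dataset's `conversations` turns already use the `role`/`content` keys the template API expects:

```python
def format_with_chat_template(example):
    # Render the turns with the tokenizer's own chat template
    text = tokenizer.apply_chat_template(
        example["conversations"],
        tokenize=False,
    )
    return {"text": text}

# Swap this in for format_conversation when a template is available:
# dataset = dataset.map(format_with_chat_template)
```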

### 3. Code Generation Training

```python
# Load code data
code_data = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-code.jsonl",
    split="train"
)

# Format code examples
def format_code(example):
    text = f"# Problem: {example['problem']}\n\n"
    text += f"# Solution ({example['language']}):\n{example['solution']}\n\n"
    if example.get('explanation'):
        text += f"# Explanation: {example['explanation']}"
    return {"text": text}

code_data = code_data.map(format_code)

# Filter by language (optional)
python_code = code_data.filter(
    lambda x: x['language'] == 'python'
)

# Tokenize (reusing the tokenize function from Example 2)
tokenized_code = python_code.map(tokenize, batched=True)

# Training with code-specific settings
training_args = TrainingArguments(
    output_dir="./code-model",
    num_train_epochs=5,  # More epochs for code
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_steps=1000,
    save_steps=2000,
)

# Train model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_code,
    data_collator=data_collator,
)

trainer.train()
```

### 4. LoRA Fine-Tuning (Memory Efficient)

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Load base model in 8-bit (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # 8-bit quantization
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training with LoRA
training_args = TrainingArguments(
    output_dir="./lora-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Can use larger batch
    gradient_accumulation_steps=4,
    learning_rate=3e-4,  # Higher LR for LoRA
    fp16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
```
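
After LoRA training, `save_pretrained` on a PEFT model writes only the small adapter weights, not a full checkpoint. To ship a single standalone model you can merge the adapters back into the base weights; a sketch using peft's standard APIs (paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save just the adapters (small compared to the full model)
model.save_pretrained("./lora-model-adapters")

# Later: reload the base model and fold the adapters in
base = AutoModelForCausalLM.from_pretrained(model_name)
merged = PeftModel.from_pretrained(base, "./lora-model-adapters")
merged = merged.merge_and_unload()  # applies LoRA deltas to the base weights
merged.save_pretrained("./lora-model-merged")
```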

---

## Fine-Tuning Strategies

### Strategy 1: Domain-Specific Fine-Tuning

```python
# Fine-tune on a specific domain
# (format and tokenize the filtered data as in the examples above)
science_data = dataset.filter(
    lambda x: x['metadata']['domain'] == 'science'
)

# Train with domain focus
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=science_data,
)
```

### Strategy 2: Progressive Fine-Tuning

```python
# Stage 1: General knowledge
general_data = dataset.filter(
    lambda x: x['metadata']['domain'] == 'general'
)
trainer.train_dataset = general_data  # Trainer.train() takes no dataset argument
trainer.train()

# Stage 2: Specialized knowledge
specialized_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'advanced'
)
trainer.train_dataset = specialized_data
trainer.train()
```

### Strategy 3: Multi-Task Learning

```python
# Mix different data types
conv_weight = 0.4
inst_weight = 0.3
code_weight = 0.3

# Sample proportionally
from datasets import concatenate_datasets

mixed_dataset = concatenate_datasets([
    conversations.shuffle().select(range(int(10000 * conv_weight))),
    instructions.shuffle().select(range(int(10000 * inst_weight))),
    code_data.shuffle().select(range(int(10000 * code_weight))),
])
```
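
The `datasets` library also offers `interleave_datasets`, which samples from the sources with given probabilities instead of concatenating fixed slices, keeping task types mixed within each pass over the data. A sketch with the same 0.4/0.3/0.3 split (this assumes the three subsets have been normalized to a shared schema, as discussed under Combine Multiple Sources):

```python
from datasets import interleave_datasets

# Draw examples from each source with the given probabilities
mixed_dataset = interleave_datasets(
    [conversations, instructions, code_data],
    probabilities=[0.4, 0.3, 0.3],
    seed=42,
)
```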

### Strategy 4: Curriculum Learning

```python
# Start with easy examples
easy_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'easy'
)

# Progress to harder examples
medium_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'intermediate'
)

hard_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'advanced'
)

# Train progressively, one stage at a time
for stage_data in [easy_data, medium_data, hard_data]:
    trainer.train_dataset = stage_data
    trainer.train()
```

---

## Best Practices

### 1. Data Preparation

```python
# Clean and validate data
def validate_example(example):
    """Ensure data quality"""
    if 'metadata' not in example:
        return False
    if example['metadata'].get('quality_score', 0) < 0.8:
        return False
    return True

cleaned_dataset = dataset.filter(validate_example)
```

### 2. Handling Long Sequences

```python
# Dynamic padding for efficiency: pad each batch to its longest
# sequence rather than a fixed max_length. For causal LM training,
# DataCollatorForLanguageModeling(mlm=False) pads dynamically and
# also fills in the labels (DataCollatorWithPadding does not).
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
```

### 3. Monitoring Training

```python
# Add callbacks
from transformers import TrainerCallback

class QualityMonitorCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics, **kwargs):
        print(f"Step {state.global_step}: Eval loss = {metrics.get('eval_loss', 0):.4f}")

# Periodic evaluation also requires an eval_dataset on the Trainer
training_args.evaluation_strategy = "steps"
training_args.eval_steps = 500

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[QualityMonitorCallback()],
)
```

### 4. Saving Checkpoints

```python
training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=3,  # Keep only last 3 checkpoints
    evaluation_strategy="steps",  # must match save_strategy when
    eval_steps=1000,              # load_best_model_at_end is set
    load_best_model_at_end=True,
)
```

### 5. Distributed Training

```bash
# Launch with multiple GPUs
accelerate launch --multi_gpu train.py

# Or with DeepSpeed
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
```
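
The `ds_config.json` referenced above is not included here. As a starting point, a minimal ZeRO stage-2 configuration like the following is commonly used; the `"auto"` values are filled in from `TrainingArguments` by the transformers integration. A sketch that writes one out:

```python
import json

# Minimal DeepSpeed ZeRO stage-2 config (assumed starting point,
# tune for your hardware)
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```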

---

## Troubleshooting

### Out of Memory

```python
# Solutions:
# 1. Reduce batch size
training_args.per_device_train_batch_size = 1

# 2. Increase gradient accumulation
training_args.gradient_accumulation_steps = 32

# 3. Use gradient checkpointing
model.gradient_checkpointing_enable()

# 4. Use 8-bit training
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
```

### Slow Training

```python
# Solutions:
# 1. Enable mixed precision
training_args.fp16 = True

# 2. Optimize data loading
dataset.set_format("torch")

# 3. Increase workers
training_args.dataloader_num_workers = 4

# 4. Pin memory
training_args.dataloader_pin_memory = True
```

### Poor Model Performance

```python
# Solutions:
# 1. Increase training epochs
training_args.num_train_epochs = 5

# 2. Adjust learning rate
training_args.learning_rate = 1e-5

# 3. Add warmup
training_args.warmup_ratio = 0.1

# 4. Filter low-quality data
high_quality = dataset.filter(
    lambda x: x['metadata'].get('quality_score', 0) > 0.9
)
```

### Data Loading Issues

```python
# Solutions:
# 1. Check file format
from datasets import load_dataset
try:
    dataset = load_dataset("...", split="train")
except Exception as e:
    print(f"Error: {e}")

# 2. Manually load JSONL
import json
data = []
with open("file.jsonl", "r") as f:
    for line in f:
        data.append(json.loads(line))

# 3. Verify data structure
print(dataset[0])
```

---

## Evaluation

### Evaluate on Benchmarks

```python
# datasets.load_metric has been removed; metrics now live in the
# standalone `evaluate` library (pip install evaluate)
import evaluate
import numpy as np

# Load metrics
accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")

# Evaluate
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)  # logits -> token ids
    # Your metric computation
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
)

results = trainer.evaluate()
print(results)
```

### Generate Samples

```python
# Generate text (save the tokenizer alongside the model with
# tokenizer.save_pretrained so the pipeline can load both)
from transformers import pipeline

generator = pipeline("text-generation", model="./trained-model")

prompt = "Explain quantum computing in simple terms:"
output = generator(prompt, max_new_tokens=200)
print(output[0]['generated_text'])
```

---

## Advanced Topics

### Custom Data Mixing

```python
def create_mixed_dataset(ratios):
    """Mix different datasets with specified ratios"""
    datasets_dict = {
        'conversations': load_dataset(..., data_files="conversations.jsonl"),
        'instructions': load_dataset(..., data_files="instructions.jsonl"),
        'code': load_dataset(..., data_files="code.jsonl"),
    }

    mixed = []
    for name, ratio in ratios.items():
        size = int(10000 * ratio)
        mixed.append(datasets_dict[name].shuffle().select(range(size)))

    return concatenate_datasets(mixed)

# Use it
dataset = create_mixed_dataset({
    'conversations': 0.4,
    'instructions': 0.4,
    'code': 0.2
})
```

### Hyperparameter Tuning

```python
from ray import tune

def train_model(config):
    training_args = TrainingArguments(
        output_dir="./tune-output",
        learning_rate=config["lr"],
        per_device_train_batch_size=config["batch_size"],
        num_train_epochs=3,
    )
    trainer = Trainer(model=model, args=training_args)
    trainer.train()
    return {"loss": trainer.state.log_history[-1]["loss"]}

# Run hyperparameter search
analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-6, 1e-4),
        "batch_size": tune.choice([2, 4, 8]),
    }
)
```
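
Alternatively, `Trainer` ships a built-in `hyperparameter_search` method that wires up Ray Tune (or Optuna) for you. It needs a `model_init` callable so each trial starts from fresh weights; a sketch under the same names used above:

```python
def model_init():
    # Fresh weights for every trial
    return AutoModelForCausalLM.from_pretrained(model_name)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_dataset,
)

best_run = trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",  # or "optuna"
    n_trials=10,
)
print(best_run.hyperparameters)
```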

---

## Citation

```bibtex
@dataset{helion_1_5_2025,
  title={Helion 1.5: An Enhanced Large-Scale Dataset for Language Model Training},
  author={DeepXR/Organization},
  year={2025},
  publisher={Hugging Face},
}
```

---

## License

This dataset is released under the CC BY 4.0 license. See the LICENSE file for details.