Trouter-Library committed
Commit a313c4b · verified · 1 Parent(s): 0e83bb5

Delete USAGE_GUIDE.md

Files changed (1):
  1. USAGE_GUIDE.md +0 -740
USAGE_GUIDE.md DELETED
@@ -1,740 +0,0 @@

# Helion 1.5 Usage Guide

A complete guide to using the Helion 1.5 dataset series for training and fine-tuning language models.

## Table of Contents

1. [Quick Start](#quick-start)
2. [Dataset Overview](#dataset-overview)
3. [Loading Data](#loading-data)
4. [Training Examples](#training-examples)
5. [Fine-Tuning Strategies](#fine-tuning-strategies)
6. [Best Practices](#best-practices)
7. [Troubleshooting](#troubleshooting)
8. [Evaluation](#evaluation)
9. [Advanced Topics](#advanced-topics)

---

## Quick Start

### Installation

```bash
pip install datasets transformers torch accelerate
```

### Load Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("your-username/helion-1.5")

# Load a specific subset
conversations = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl"
)
```

### Basic Training

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

# Initialize model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"  # or your preferred base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Quick training setup
training_args = TrainingArguments(
    output_dir="./helion-1.5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=100,
)
```
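
To complete the quick-start loop, tokenize the loaded split and hand everything to `Trainer`. This is a minimal sketch that reuses the `dataset`, `tokenizer`, `model`, and `training_args` objects above and assumes a `"text"` column (the formatting examples later in this guide show how to build one from the raw records):

```python
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

def tokenize(batch):
    # Assumes a "text" column; see the formatting functions below
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True)

# mlm=False pads dynamically and copies input_ids into labels for causal LM
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```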
59
-
60
- ---
61
-
62
- ## Dataset Overview
63
-
64
- ### File Structure
65
-
66
- ```
67
- helion-1.5/
68
- ├── helion-1.5-conversations.jsonl # 800K multi-turn conversations
69
- ├── helion-1.5-instructions.jsonl # 600K instruction pairs
70
- ├── helion-1.5-code.jsonl # 250K code examples
71
- ├── helion-1.5-reasoning.jsonl # 180K reasoning tasks
72
- ├── helion-1.5-creative.jsonl # 120K creative writing
73
- └── helion-1.5-multilingual.jsonl # 50K multilingual data
74
- ```
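
To get all six subsets in a single object, one option is to pass a `data_files` mapping; the `datasets` library turns each key into a named split. A sketch:

```python
from datasets import load_dataset

# Each data_files key becomes a split in the returned DatasetDict
subsets = load_dataset(
    "your-username/helion-1.5",
    data_files={
        "conversations": "helion-1.5-conversations.jsonl",
        "instructions": "helion-1.5-instructions.jsonl",
        "code": "helion-1.5-code.jsonl",
        "reasoning": "helion-1.5-reasoning.jsonl",
        "creative": "helion-1.5-creative.jsonl",
        "multilingual": "helion-1.5-multilingual.jsonl",
    },
)
print(subsets)  # DatasetDict with one split per subset
```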

### Data Formats

#### Conversations Format

```json
{
  "id": "conv_abc123",
  "conversations": [
    {"role": "user", "content": "How does photosynthesis work?"},
    {"role": "assistant", "content": "Photosynthesis is..."}
  ],
  "metadata": {
    "domain": "science",
    "difficulty": "intermediate",
    "quality_score": 0.95
  }
}
```
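
Because each turn uses the standard role/content keys, a record's `conversations` list can be passed straight to a chat model's tokenizer template. A sketch, assuming a tokenizer that ships a chat template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

record = {
    "conversations": [
        {"role": "user", "content": "How does photosynthesis work?"},
        {"role": "assistant", "content": "Photosynthesis is..."},
    ]
}

# Renders the turns in the model's own prompt format
text = tokenizer.apply_chat_template(record["conversations"], tokenize=False)
print(text)
```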

#### Instructions Format

```json
{
  "id": "inst_xyz789",
  "instruction": "Summarize the following text:",
  "input": "Long text here...",
  "output": "Summary here...",
  "metadata": {
    "task_type": "summarization",
    "complexity": "medium"
  }
}
```

#### Code Format

```json
{
  "id": "code_def456",
  "language": "python",
  "problem": "Implement binary search",
  "solution": "def binary_search(arr, target): ...",
  "explanation": "This algorithm...",
  "test_cases": [...]
}
```

---

## Loading Data

### Load Specific Subsets

```python
from datasets import load_dataset

# Load only the conversations subset
conversations = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train"
)

# Load multiple files
multi_data = load_dataset(
    "your-username/helion-1.5",
    data_files=[
        "helion-1.5-conversations.jsonl",
        "helion-1.5-instructions.jsonl"
    ]
)
```

### Filter by Domain

```python
# Keep only the science domain
science_data = conversations.filter(
    lambda x: x['metadata']['domain'] == 'science'
)

# Keep only high-quality examples
high_quality = conversations.filter(
    lambda x: x['metadata'].get('quality_score', 0) > 0.9
)
```

### Combine Multiple Sources

```python
from datasets import concatenate_datasets, load_dataset

# Load the individual subsets; split="train" returns Dataset objects,
# which is what concatenate_datasets expects
conv = load_dataset("your-username/helion-1.5",
                    data_files="helion-1.5-conversations.jsonl", split="train")
inst = load_dataset("your-username/helion-1.5",
                    data_files="helion-1.5-instructions.jsonl", split="train")

# Combine
combined = concatenate_datasets([conv, inst])
```

---

## Training Examples

### 1. Instruction Fine-Tuning

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

# Load instruction data
dataset = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-instructions.jsonl",
    split="train"
)

# Initialize model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build a single prompt string per example
def format_instruction(example):
    text = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get('input'):
        text += f"### Input:\n{example['input']}\n\n"
    text += f"### Response:\n{example['output']}"
    return {"text": text}

# Apply formatting
dataset = dataset.map(format_instruction)

# Tokenize; labels are a copy of input_ids so Trainer can compute the causal-LM loss
def tokenize_function(examples):
    tokens = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./instruction-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_steps=500,
    logging_steps=100,
    save_steps=1000,
    fp16=True,
    optim="adamw_torch",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
model.save_pretrained("./instruction-model-final")
```

### 2. Conversational Model Training

```python
# Reuses the tokenizer and model initialized in example 1
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Load conversation data
dataset = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-conversations.jsonl",
    split="train"
)

# Flatten each multi-turn conversation into a single training string
def format_conversation(example):
    formatted = ""
    for turn in example['conversations']:
        role = turn['role'].capitalize()
        content = turn['content']
        formatted += f"{role}: {content}\n\n"
    return {"text": formatted.strip()}

dataset = dataset.map(format_conversation)

# Tokenize, again copying input_ids into labels
def tokenize(examples):
    tokens = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=2048  # longer context for conversations
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(tokenize, batched=True)

# Training setup
training_args = TrainingArguments(
    output_dir="./conversation-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    logging_steps=50,
    save_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
)

trainer.train()
```

### 3. Code Generation Training

```python
# Load code data
code_data = load_dataset(
    "your-username/helion-1.5",
    data_files="helion-1.5-code.jsonl",
    split="train"
)

# Format code examples
def format_code(example):
    text = f"# Problem: {example['problem']}\n\n"
    text += f"# Solution ({example['language']}):\n{example['solution']}\n\n"
    if example.get('explanation'):
        text += f"# Explanation: {example['explanation']}"
    return {"text": text}

code_data = code_data.map(format_code)

# Filter by language (optional)
python_code = code_data.filter(
    lambda x: x['language'] == 'python'
)

# Tokenize (reusing the tokenize function from example 2)
tokenized_code = python_code.map(tokenize, batched=True)

# Training with code-specific settings
training_args = TrainingArguments(
    output_dir="./code-model",
    num_train_epochs=5,  # more epochs for code
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_steps=1000,
    save_steps=2000,
)

# Train model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_code,
)

trainer.train()
```

### 4. LoRA Fine-Tuning (Memory Efficient)

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Load base model in 8-bit to cut memory use
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # 8-bit quantization
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # recommended prep for k-bit training

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training with LoRA
training_args = TrainingArguments(
    output_dir="./lora-model",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # larger batches fit with adapters
    gradient_accumulation_steps=4,
    learning_rate=3e-4,  # higher LR is typical for LoRA
    fp16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
```
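
After training, only the small adapter weights need to be saved; they can later be reattached to the base model, or merged into it for adapter-free inference. A short sketch using peft's standard APIs (the paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save just the LoRA adapter (a few MB, not the full model)
model.save_pretrained("./lora-model-adapter")

# Reload: attach the adapter to a fresh full-precision copy of the base model
base = AutoModelForCausalLM.from_pretrained(model_name)
model = PeftModel.from_pretrained(base, "./lora-model-adapter")

# Optionally fold the adapter into the base weights for plain inference
merged = model.merge_and_unload()
merged.save_pretrained("./lora-model-merged")
```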

---

## Fine-Tuning Strategies

### Strategy 1: Domain-Specific Fine-Tuning

```python
# Fine-tune on a specific domain (format and tokenize it
# as in the training examples first)
science_data = dataset.filter(
    lambda x: x['metadata']['domain'] == 'science'
)

# Train with domain focus
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=science_data,
)
```

### Strategy 2: Progressive Fine-Tuning

```python
# Stage 1: general knowledge
general_data = dataset.filter(
    lambda x: x['metadata']['domain'] == 'general'
)
trainer.train_dataset = general_data  # Trainer.train() takes no dataset argument
trainer.train()

# Stage 2: specialized knowledge
specialized_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'advanced'
)
trainer.train_dataset = specialized_data
trainer.train()
```

### Strategy 3: Multi-Task Learning

```python
from datasets import concatenate_datasets

# Mixing weights for the different data types
conv_weight = 0.4
inst_weight = 0.3
code_weight = 0.3

# Sample each subset proportionally (10K examples total)
mixed_dataset = concatenate_datasets([
    conversations.shuffle().select(range(int(10000 * conv_weight))),
    instructions.shuffle().select(range(int(10000 * inst_weight))),
    code_data.shuffle().select(range(int(10000 * code_weight))),
])
```

### Strategy 4: Curriculum Learning

```python
# Start with easy examples
easy_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'easy'
)

# Progress to harder examples
medium_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'intermediate'
)

hard_data = dataset.filter(
    lambda x: x['metadata']['difficulty'] == 'advanced'
)

# Train progressively, from easy to hard
for stage_data in [easy_data, medium_data, hard_data]:
    trainer.train_dataset = stage_data
    trainer.train()
```

---

## Best Practices

### 1. Data Preparation

```python
# Clean and validate data before training
def validate_example(example):
    """Ensure data quality"""
    if 'metadata' not in example:
        return False
    if example['metadata'].get('quality_score', 0) < 0.8:
        return False
    return True

cleaned_dataset = dataset.filter(validate_example)
```

### 2. Handling Long Sequences

```python
# Pad dynamically per batch instead of to a fixed max length.
# For causal LM training, DataCollatorForLanguageModeling(mlm=False)
# pads to the longest sequence in the batch and also builds the labels.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Tokenize without padding and let the collator pad each batch
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
```

### 3. Monitoring Training

```python
# Add callbacks to watch evaluation metrics
from transformers import TrainerCallback

class QualityMonitorCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"Step {state.global_step}: eval loss = {metrics.get('eval_loss', 0):.4f}")

training_args.evaluation_strategy = "steps"
training_args.eval_steps = 500

# Periodic evaluation requires an eval_dataset;
# train_split / eval_split: your tokenized train and validation Datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_split,
    eval_dataset=eval_split,
    callbacks=[QualityMonitorCallback()],
)
```

### 4. Saving Checkpoints

```python
training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=3,  # keep only the last 3 checkpoints
    # load_best_model_at_end requires evaluation on the same schedule as saving
    evaluation_strategy="steps",
    eval_steps=1000,
    load_best_model_at_end=True,
)
```

### 5. Distributed Training

```bash
# Launch with multiple GPUs
accelerate launch --multi_gpu train.py

# Or with DeepSpeed
deepspeed --num_gpus=4 train.py --deepspeed ds_config.json
```
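
The `ds_config.json` above is not shipped with the dataset; a minimal ZeRO stage-2 sketch that defers batch sizes and mixed precision to the Trainer's `"auto"` values might look like this:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2
  }
}
```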

---

## Troubleshooting

### Out of Memory

```python
# Solutions:
# 1. Reduce batch size
training_args.per_device_train_batch_size = 1

# 2. Increase gradient accumulation
training_args.gradient_accumulation_steps = 32

# 3. Use gradient checkpointing
model.gradient_checkpointing_enable()

# 4. Use 8-bit training
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
```

### Slow Training

```python
# Solutions:
# 1. Enable mixed precision
training_args.fp16 = True

# 2. Optimize data loading
dataset.set_format("torch")

# 3. Increase dataloader workers
training_args.dataloader_num_workers = 4

# 4. Pin memory
training_args.dataloader_pin_memory = True
```

### Poor Model Performance

```python
# Solutions:
# 1. Increase training epochs
training_args.num_train_epochs = 5

# 2. Adjust the learning rate
training_args.learning_rate = 1e-5

# 3. Add warmup
training_args.warmup_ratio = 0.1

# 4. Filter out low-quality data
high_quality = dataset.filter(
    lambda x: x['metadata'].get('quality_score', 0) > 0.9
)
```

### Data Loading Issues

```python
# Solutions:
# 1. Check that the files load at all
from datasets import load_dataset
try:
    dataset = load_dataset("your-username/helion-1.5", split="train")
except Exception as e:
    print(f"Error: {e}")

# 2. Fall back to loading the JSONL manually
import json
data = []
with open("helion-1.5-conversations.jsonl", "r") as f:
    for line in f:
        data.append(json.loads(line))

# 3. Verify the record structure
print(dataset[0])
```

---

## Evaluation

### Evaluate on Benchmarks

```python
# datasets.load_metric is deprecated; metrics now live in the evaluate library
import evaluate

accuracy = evaluate.load("accuracy")
bleu = evaluate.load("bleu")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Your metric computation (argmax logits, decode, etc.)
    return accuracy.compute(predictions=predictions, references=labels)

# eval_split: your tokenized validation Dataset
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=eval_split,
    compute_metrics=compute_metrics,
)

results = trainer.evaluate()
print(results)
```

### Generate Samples

```python
from transformers import pipeline

# Generate text with the fine-tuned model
generator = pipeline("text-generation", model="./trained-model")

prompt = "Explain quantum computing in simple terms:"
output = generator(prompt, max_length=200)
print(output[0]['generated_text'])
```

---

## Advanced Topics

### Custom Data Mixing

```python
from datasets import concatenate_datasets, load_dataset

def create_mixed_dataset(ratios):
    """Mix the subsets with specified ratios (10K examples total)."""
    datasets_dict = {
        'conversations': load_dataset("your-username/helion-1.5",
                                      data_files="helion-1.5-conversations.jsonl", split="train"),
        'instructions': load_dataset("your-username/helion-1.5",
                                     data_files="helion-1.5-instructions.jsonl", split="train"),
        'code': load_dataset("your-username/helion-1.5",
                             data_files="helion-1.5-code.jsonl", split="train"),
    }

    mixed = []
    for name, ratio in ratios.items():
        size = int(10000 * ratio)
        mixed.append(datasets_dict[name].shuffle().select(range(size)))

    return concatenate_datasets(mixed)

# Use it
dataset = create_mixed_dataset({
    'conversations': 0.4,
    'instructions': 0.4,
    'code': 0.2
})
```

### Hyperparameter Tuning

```python
from ray import tune

def train_model(config):
    # Re-initialize the model so each trial starts from the same weights
    model = AutoModelForCausalLM.from_pretrained(model_name)
    training_args = TrainingArguments(
        output_dir="./tune-output",
        learning_rate=config["lr"],
        per_device_train_batch_size=config["batch_size"],
        num_train_epochs=3,
    )
    trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset)
    trainer.train()
    # The final log entry holds the aggregate training loss
    return {"loss": trainer.state.log_history[-1]["train_loss"]}

# Run hyperparameter search
analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-6, 1e-4),
        "batch_size": tune.choice([2, 4, 8]),
    }
)
```
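
transformers also ships a built-in search wrapper, `Trainer.hyperparameter_search`, which can drive Ray Tune (or Optuna) for you. A sketch, assuming the tokenized dataset from the examples above:

```python
def model_init():
    # Called once per trial so every run starts from the same checkpoint
    return AutoModelForCausalLM.from_pretrained(model_name)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_dataset,
)

best_run = trainer.hyperparameter_search(
    direction="minimize",
    backend="ray",
    n_trials=8,
)
print(best_run.hyperparameters)
```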

---

## Citation

```bibtex
@dataset{helion_1_5_2025,
  title={Helion 1.5: An Enhanced Large-Scale Dataset for Language Model Training},
  author={DeepXR/Organization},
  year={2025},
  publisher={Hugging Face},
}
```

---

## License

This dataset is released under the CC BY 4.0 license. See the LICENSE file for details.