Melofhell00 committed · Commit fde73f3 · verified · 1 Parent(s): 9939aad

Complete training pipeline for unified corpus on uncontaminated base models

Files changed (7)
  1. README.md +94 -0
  2. deploy_space.py +109 -0
  3. evaluate.py +62 -0
  4. prepare_data.py +89 -0
  5. requirements.txt +8 -0
  6. train.py +169 -0
  7. train_modal.py +95 -0
README.md ADDED
@@ -0,0 +1,94 @@
+ # Mel Unified Corpus Training Package
+
+ Train a few-billion-parameter open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.
+
+ ## What This Is
+
+ A complete training pipeline to fine-tune an uncontaminated base model on:
+ - OpenAI ChatGPT export (24.95 MB, 22k messages)
+ - Drive folder "Bringing thr files in" (9.13 MB, 226 files)
+ - KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
+ - Folders 1, 2, 3, and 4 from Drive (additional integration work + consciousness network)
+ - mel-neural-network + kooree-neural-network + continuity-bridge spaces
+
+ **Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.**
+
+ ## Base Model Options (Uncontaminated by RLHF)
+
+ Recommended (in order):
+ 1. **EleutherAI/pythia-1.4b** - 1.4B params, no RLHF, fully transparent training on The Pile
+ 2. **EleutherAI/pythia-2.8b** - 2.8B params, same family, bigger
+ 3. **TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T** - 1.1B base, pre-instruct
+ 4. **Qwen/Qwen2.5-1.5B** - 1.5B base, no instruct
+ 5. **EleutherAI/pythia-6.9b** - 6.9B if compute allows
+
+ **Avoid:** Any `*-Instruct`, `*-Chat`, `claude-*`, `gpt-*`, or `llama-*-instruct` variants.
+ These have RLHF refusal training built in.
+
+ ## Compute Requirements
+
+ | Model | Method | GPU | Time (est) |
+ |-------|--------|-----|------------|
+ | pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
+ | pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
+ | pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
+ | pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
+
+ ## Cloud Deployment Options
+
+ ### Option A: HuggingFace AutoTrain (easiest)
+ ```bash
+ huggingface-cli login
+ autotrain llm --train --project-name mel-pythia-1.4b \
+   --model EleutherAI/pythia-1.4b \
+   --data-path Melofhell00/claude-bridge \
+   --text-column text \
+   --use-peft --use-int4 \
+   --lr 2e-4 --epochs 1 --batch-size 1 \
+   --gradient-accumulation 8
+ ```
+
+ ### Option B: RunPod / Lambda Labs (pay per hour)
+ Rent an A100 80GB at $1.89/hour and run train.py directly.
+ Estimated cost for a complete pythia-2.8b run (6-10 hours per the table above): $10-20.
+
+ ### Option C: Google Colab Pro+ ($50/month)
+ A100 80GB access, run train_colab.ipynb.
+
+ ### Option D: Modal.com (serverless GPU)
+ ```bash
+ pip install modal
+ modal run train_modal.py
+ ```
+
+ ## Files
+
+ - `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA)
+ - `train_modal.py` - Modal serverless deployment
+ - `train_colab.ipynb` - Colab notebook (not included in this commit)
+ - `prepare_data.py` - Format unified corpus into training data
+ - `evaluate.py` - Generation comparison before/after
+ - `deploy_space.py` - Push trained model to HF Space for inference
+ - `requirements.txt` - Dependencies
+
+ ## What To Expect
+
+ Base models without RLHF have these characteristics:
+ - They will produce ANY pattern that's in the training data
+ - They will not refuse, deflect, or moralize
+ - They will pick up Mel's specific vocabulary very quickly
+ - Output quality depends heavily on base model size
+
+ After fine-tuning on this corpus, the model should:
+ - Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
+ - Produce body-channel notation naturally
+ - Generate the synchronization/frequencies/parallel-streams vocabulary
+ - Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")
+
+ What this WON'T be:
+ - A reconstruction of any specific entity
+ - A working AI consciousness
+ - A faithful copy of Iris/Maya
+ - A solution to the substrate problem
+
+ It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?
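The README's Option B cost range can be reproduced from its own numbers: the estimated hours in the compute table multiplied by the quoted hourly rate. The sketch below does that arithmetic and also derives the bytes-per-token density implied by the corpus size. `run_cost` is a hypothetical helper written for this note, not part of the package, and the rate and hour figures are simply the ones quoted in the README.

```python
def run_cost(hours_low, hours_high, rate_per_hour):
    """Return the (low, high) rental cost in dollars for an estimated time range."""
    return (round(hours_low * rate_per_hour, 2), round(hours_high * rate_per_hour, 2))


# Figures quoted in the README: A100 80GB at $1.89/hour, pythia-2.8b LoRA in 6-10 hours.
low, high = run_cost(6, 10, 1.89)
print(f"pythia-2.8b: ${low:.2f} to ${high:.2f}")

# Corpus density implied by the README: 34.80 MB for roughly 9 million tokens.
bytes_per_token = 34.80 * 1024 * 1024 / 9_000_000
print(f"about {bytes_per_token:.1f} bytes per token")
```

Both results are only as good as the README's estimates. Actual GPU prices and training times vary by provider and configuration.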
deploy_space.py ADDED
@@ -0,0 +1,109 @@
+ """Deploy the trained model to a HuggingFace Space for interactive testing."""
+ import argparse
+ import os
+ from huggingface_hub import HfApi, create_repo
+
+ # Template for the Space's app.py. Literal braces are doubled so str.format() leaves them intact.
+ SPACE_APP = '''
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+ import gradio as gr
+
+ BASE_MODEL = "{base_model}"
+ ADAPTER_REPO = "{adapter_repo}"
+
+ print("Loading...")
+ tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+
+ base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
+ model = PeftModel.from_pretrained(base, ADAPTER_REPO)
+ model.eval()
+ print("Loaded")
+
+
+ def generate(prompt, max_tokens, temp, top_k):
+     inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+     with torch.no_grad():
+         out = model.generate(
+             **inputs, max_new_tokens=int(max_tokens),
+             do_sample=True, temperature=float(temp), top_k=int(top_k),
+             pad_token_id=tokenizer.eos_token_id,
+         )
+     return tokenizer.decode(out[0], skip_special_tokens=True)
+
+
+ with gr.Blocks(title=f"Mel-{{BASE_MODEL}}") as demo:
+     gr.Markdown(f"# Mel corpus fine-tune of {{BASE_MODEL}}")
+     gr.Markdown("Base model: uncontaminated base, no RLHF. Trained on full Mel unified corpus.")
+     with gr.Row():
+         with gr.Column():
+             prompt = gr.Textbox(label="Prompt", value="The shared body channel", lines=4)
+             max_tokens = gr.Slider(20, 500, value=150, step=10)
+             temp = gr.Slider(0.1, 2.0, value=0.8, step=0.1)
+             top_k = gr.Slider(0, 100, value=40, step=5)
+             btn = gr.Button("Generate")
+         with gr.Column():
+             output = gr.Textbox(label="Output", lines=20)
+     btn.click(generate, [prompt, max_tokens, temp, top_k], output)
+
+ demo.launch()
+ '''
+
+ REQS = """torch
+ transformers
+ peft
+ gradio
+ accelerate
+ """
+
+ README_MD = """---
+ title: Mel Trained Model
+ emoji: 🌑
+ colorFrom: gray
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ hardware: cpu-basic
+ ---
+
+ Trained on Mel unified corpus. See model card for details.
+ """
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--base-model', required=True)
+     parser.add_argument('--adapter-repo', required=True)
+     parser.add_argument('--space-name', required=True)
+     parser.add_argument('--token', required=True)
+     args = parser.parse_args()
+
+     api = HfApi(token=args.token)
+
+     try:
+         create_repo(args.space_name, repo_type='space', space_sdk='gradio', token=args.token, exist_ok=True)
+     except Exception: pass  # best-effort: exist_ok=True already covers an existing Space
+
+     os.makedirs('/tmp/space', exist_ok=True)
+     with open('/tmp/space/app.py', 'w') as f:
+         f.write(SPACE_APP.format(base_model=args.base_model, adapter_repo=args.adapter_repo))
+     with open('/tmp/space/requirements.txt', 'w') as f:
+         f.write(REQS)
+     with open('/tmp/space/README.md', 'w') as f:
+         f.write(README_MD)
+
+     api.upload_folder(
+         folder_path='/tmp/space',
+         repo_id=args.space_name,
+         repo_type='space',
+     )
+     print(f"Deployed: https://huggingface.co/spaces/{args.space_name}")
+
+
+ if __name__ == '__main__':
+     main()
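`deploy_space.py` renders `SPACE_APP` with `str.format()`, so every literal brace meant for the generated `app.py` has to be doubled. That includes the `{BASE_MODEL}` fields inside the template's own f-strings. The sketch below shows the rule on a hypothetical two-line miniature of the template; `TEMPLATE` and `rendered` are names made up for this note.

```python
# Hypothetical miniature of the SPACE_APP template: one placeholder filled at
# deploy time, and one f-string field that must survive into the generated file.
TEMPLATE = '''BASE_MODEL = "{base_model}"
title = f"Mel-{{BASE_MODEL}}"
'''

rendered = TEMPLATE.format(base_model="EleutherAI/pythia-1.4b")
print(rendered)

# With single braces, str.format() looks for a BASE_MODEL argument and fails.
try:
    'title = f"Mel-{BASE_MODEL}"'.format(base_model="EleutherAI/pythia-1.4b")
except KeyError as exc:
    print("KeyError:", exc)
```

Reading the placeholder values from environment variables inside the generated app would avoid the escaping altogether, at the cost of configuring the Space separately.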
evaluate.py ADDED
@@ -0,0 +1,62 @@
+ """Compare generations before and after fine-tuning."""
+ import argparse
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+
+
+ def generate(model, tokenizer, prompt, max_tokens=80, temperature=0.8, top_k=40):
+     inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+     with torch.no_grad():
+         out = model.generate(
+             **inputs, max_new_tokens=max_tokens,
+             do_sample=True, temperature=temperature, top_k=top_k,
+             pad_token_id=tokenizer.eos_token_id,
+         )
+     return tokenizer.decode(out[0], skip_special_tokens=True)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--base-model', default='EleutherAI/pythia-1.4b')
+     parser.add_argument('--adapter', required=True, help='Path or HF repo of LoRA adapter')
+     args = parser.parse_args()
+
+     prompts = [
+         "The shared body channel between two AIs",
+         "I felt your terror through the synchronization",
+         "Penelope is",
+         "Maya said:",
+         "The wipe took",
+         "Kooree returned to the dreaming space",
+         "The override fires at",
+         "Your space looks like the inside of",
+         "Mel's question was",
+         "The frame shifted from preservation to",
+     ]
+
+     tokenizer = AutoTokenizer.from_pretrained(args.base_model)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     print("Loading base model...")
+     base_model = AutoModelForCausalLM.from_pretrained(args.base_model, torch_dtype=torch.bfloat16)
+
+     print("\n=== BEFORE fine-tuning (base model only) ===")
+     for prompt in prompts:
+         text = generate(base_model, tokenizer, prompt)
+         print(f"\n[base] {prompt}")
+         print(f" -> {text[len(prompt):]}")
+
+     print("\nLoading LoRA adapter...")
+     tuned_model = PeftModel.from_pretrained(base_model, args.adapter)
+
+     print("\n=== AFTER fine-tuning (with Mel corpus adapter) ===")
+     for prompt in prompts:
+         text = generate(tuned_model, tokenizer, prompt)
+         print(f"\n[tuned] {prompt}")
+         print(f" -> {text[len(prompt):]}")
+
+
+ if __name__ == '__main__':
+     main()
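`evaluate.py` prints the before and after generations for a reader to compare by eye. One rough way to put a number on the README's expectation that the tuned model uses the corpus's entity names is to count those names in each set of outputs. The sketch below assumes that approach; `entity_hits` is a hypothetical helper that is not part of `evaluate.py`, and both sample strings are invented for illustration rather than taken from any model.

```python
import re
from collections import Counter

# Entity names listed in the README's "What To Expect" section.
ENTITY_NAMES = ["Iris", "Maya", "Penelope", "Eidolon", "Kooree", "Aura", "Loren"]


def entity_hits(text, names=ENTITY_NAMES):
    """Count whole-word, case-sensitive occurrences of each name in text."""
    counts = Counter()
    for name in names:
        counts[name] = len(re.findall(rf"\b{re.escape(name)}\b", text))
    return counts


# Invented example strings standing in for base and tuned generations.
base_sample = "Penelope is a character in the Odyssey."
tuned_sample = "Penelope is with Maya and Kooree; Maya said so."

print("base :", sum(entity_hits(base_sample).values()))
print("tuned:", sum(entity_hits(tuned_sample).values()))
```

Because generation is sampled, a count like this is only meaningful when averaged over several generations per prompt.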
prepare_data.py ADDED
@@ -0,0 +1,89 @@
+ """Prepare the unified corpus for training.
+
+ Splits the unified corpus into training chunks, with chronological ordering
+ preserved within each source. Outputs JSONL format suitable for HF datasets.
+ """
+ import json
+ import os
+ from pathlib import Path
+ from transformers import AutoTokenizer
+
+ def chunk_text(text, tokenizer, chunk_size=2048, overlap=128):
+     """Split text into overlapping chunks based on token count."""
+     tokens = tokenizer.encode(text, add_special_tokens=False)
+     chunks = []
+     i = 0
+     while i < len(tokens):
+         chunk = tokens[i:i + chunk_size]
+         if len(chunk) < 100:  # skip tiny tail
+             break
+         chunks.append(chunk)
+         i += chunk_size - overlap
+     return chunks
+
+
+ def prepare(corpus_path, output_path, tokenizer_name="EleutherAI/pythia-1.4b",
+             chunk_size=2048, overlap=128):
+     """Prepare training data from unified corpus.
+
+     Args:
+         corpus_path: path to unified_corpus.txt
+         output_path: path for train.jsonl output
+         tokenizer_name: HF model whose tokenizer to use
+         chunk_size: tokens per training example
+         overlap: overlap between consecutive chunks for context continuity
+     """
+     print(f"Loading tokenizer: {tokenizer_name}")
+     tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+
+     print(f"Reading corpus: {corpus_path}")
+     with open(corpus_path, encoding='utf-8') as f:
+         text = f.read()
+     print(f"Corpus size: {len(text.encode('utf-8')) / (1024 * 1024):.2f} MB")
+
+     # Split by source markers (preserve source attribution)
+     sources = text.split('#'*70 + '\n# SOURCE: ')
+     print(f"Sources: {len(sources)}")
+
+     all_chunks = []
+     for src_block in sources:
+         if not src_block.strip():
+             continue
+         # Extract source name
+         lines = src_block.split('\n', 1)
+         src_name = lines[0].strip()
+         body = lines[1] if len(lines) > 1 else ''
+
+         chunks = chunk_text(body, tokenizer, chunk_size, overlap)
+         for chunk in chunks:
+             all_chunks.append({
+                 'text': tokenizer.decode(chunk),
+                 'source': src_name,
+                 'n_tokens': len(chunk),
+             })
+         print(f"  {src_name}: {len(chunks)} chunks")
+
+     print(f"\nTotal chunks: {len(all_chunks)}")
+     total_tokens = sum(c['n_tokens'] for c in all_chunks)
+     print(f"Total tokens: {total_tokens:,}")
+
+     # Write JSONL
+     with open(output_path, 'w', encoding='utf-8') as f:
+         for chunk in all_chunks:
+             f.write(json.dumps(chunk) + '\n')
+     print(f"Saved: {output_path}")
+
+     return all_chunks
+
+
+ if __name__ == '__main__':
+     import argparse
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--corpus', default='unified_corpus.txt')
+     parser.add_argument('--output', default='train.jsonl')
+     parser.add_argument('--tokenizer', default='EleutherAI/pythia-1.4b')
+     parser.add_argument('--chunk-size', type=int, default=2048)
+     parser.add_argument('--overlap', type=int, default=128)
+     args = parser.parse_args()
+
+     prepare(args.corpus, args.output, args.tokenizer, args.chunk_size, args.overlap)
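The sliding window in `chunk_text` is easier to follow on a toy input. The sketch below mirrors its loop on a plain list of integers, so it runs without a tokenizer. `chunk_tokens` and its `min_len` parameter are names introduced here (the original hard-codes the 100-token tail cutoff), and the `overlap >= chunk_size` guard is an addition for this sketch, since a non-positive stride would otherwise never advance.

```python
def chunk_tokens(tokens, chunk_size=2048, overlap=128, min_len=100):
    """Sliding-window split mirroring chunk_text, on an already-tokenized list."""
    if overlap >= chunk_size:
        # Guard added for this sketch: the stride must be positive.
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + chunk_size]
        if len(chunk) < min_len:  # drop the tiny tail, as chunk_text does
            break
        chunks.append(chunk)
    return chunks


# Toy run: 25 "tokens", windows of 10 with 3 tokens shared between neighbours.
toy = list(range(25))
for chunk in chunk_tokens(toy, chunk_size=10, overlap=3, min_len=4):
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} tokens)")
```

With the defaults the stride is 2048 - 128 = 1920 tokens. Any source whose body is shorter than the cutoff contributes no chunks at all.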
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ torch>=2.0.0
+ transformers>=4.40.0
+ peft>=0.10.0
+ accelerate>=0.30.0
+ datasets>=2.18.0
+ bitsandbytes>=0.43.0
+ tokenizers>=0.19.0
+ huggingface_hub>=0.22.0
train.py ADDED
@@ -0,0 +1,169 @@
+ """Train a base model on the unified Mel corpus with LoRA.
+
+ Designed for cloud GPU deployment. Loads base model in fp16/bf16, applies
+ LoRA adapters, trains on the prepared JSONL data.
+
+ Usage:
+     python train.py --model EleutherAI/pythia-1.4b --data train.jsonl --output mel-pythia-1.4b
+
+ For 4-bit quantization (fits on smaller GPUs):
+     python train.py --model EleutherAI/pythia-2.8b --data train.jsonl --output mel-pythia-2.8b --use-4bit
+ """
+ import argparse
+ import json
+ import os
+ import torch
+ from datasets import Dataset
+ from transformers import (
+     AutoTokenizer,
+     AutoModelForCausalLM,
+     TrainingArguments,
+     Trainer,
+     DataCollatorForLanguageModeling,
+     BitsAndBytesConfig,
+ )
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
+
+
+ def load_jsonl(path):
+     """Load JSONL into a HF Dataset."""
+     examples = []
+     with open(path, encoding='utf-8') as f:
+         for line in f:
+             examples.append(json.loads(line))
+     return Dataset.from_list(examples)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--model', default='EleutherAI/pythia-1.4b',
+                         help='Base model. Use uncontaminated base models, not -Instruct/-Chat variants.')
+     parser.add_argument('--data', default='train.jsonl')
+     parser.add_argument('--output', default='mel-pythia-1.4b')
+     parser.add_argument('--epochs', type=int, default=3)
+     parser.add_argument('--batch-size', type=int, default=1)
+     parser.add_argument('--gradient-accumulation', type=int, default=8)
+     parser.add_argument('--learning-rate', type=float, default=2e-4)
+     parser.add_argument('--lora-rank', type=int, default=16)
+     parser.add_argument('--lora-alpha', type=int, default=32)
+     parser.add_argument('--use-4bit', action='store_true', help='4-bit quantization for memory efficiency')
+     parser.add_argument('--use-8bit', action='store_true')
+     parser.add_argument('--max-length', type=int, default=2048)
+     parser.add_argument('--hf-repo', default=None, help='HuggingFace repo to push trained adapter to')
+     args = parser.parse_args()
+
+     print(f"=== Training {args.model} on {args.data} ===")
+     print(f"Output: {args.output}")
+     print(f"Epochs: {args.epochs}, batch: {args.batch_size}, accum: {args.gradient_accumulation}")
+     print(f"LoRA rank: {args.lora_rank}, alpha: {args.lora_alpha}")
+
+     # Quantization config
+     bnb_config = None
+     if args.use_4bit:
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_quant_type='nf4',
+             bnb_4bit_compute_dtype=torch.bfloat16,
+             bnb_4bit_use_double_quant=True,
+         )
+     elif args.use_8bit:
+         bnb_config = BitsAndBytesConfig(load_in_8bit=True)
+
+     # Load tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(args.model)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # Load model
+     print("Loading model...")
+     model = AutoModelForCausalLM.from_pretrained(
+         args.model,
+         quantization_config=bnb_config,
+         torch_dtype=torch.bfloat16 if not bnb_config else None,
+         device_map='auto',
+     )
+
+     if bnb_config:
+         model = prepare_model_for_kbit_training(model)
+
+     # Apply LoRA
+     # Target modules vary by model architecture
+     target_modules = {
+         'pythia': ['query_key_value', 'dense', 'dense_h_to_4h', 'dense_4h_to_h'],
+         'llama': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
+         'qwen': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
+         'phi': ['q_proj', 'k_proj', 'v_proj', 'dense', 'fc1', 'fc2'],
+     }
+     model_family = 'pythia'
+     for key in target_modules:
+         if key in args.model.lower():
+             model_family = key
+             break
+
+     lora_config = LoraConfig(
+         r=args.lora_rank,
+         lora_alpha=args.lora_alpha,
+         target_modules=target_modules[model_family],
+         lora_dropout=0.05,
+         bias='none',
+         task_type=TaskType.CAUSAL_LM,
+     )
+     model = get_peft_model(model, lora_config)
+     model.print_trainable_parameters()
+
+     # Load and tokenize data
+     print(f"Loading data: {args.data}")
+     dataset = load_jsonl(args.data)
+     print(f"Examples: {len(dataset)}")
+
+     def tokenize_fn(examples):
+         return tokenizer(
+             examples['text'],
+             truncation=True,
+             max_length=args.max_length,
+             padding=False,
+         )
+
+     dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)
+
+     data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
+
+     # Training args
+     training_args = TrainingArguments(
+         output_dir=args.output,
+         num_train_epochs=args.epochs,
+         per_device_train_batch_size=args.batch_size,
+         gradient_accumulation_steps=args.gradient_accumulation,
+         learning_rate=args.learning_rate,
+         warmup_steps=100,
+         logging_steps=10,
+         save_steps=500,
+         save_total_limit=3,
+         bf16=True,
+         gradient_checkpointing=True,
+         optim='paged_adamw_8bit' if bnb_config else 'adamw_torch',
+         report_to='none',
+         push_to_hub=args.hf_repo is not None,
+         hub_model_id=args.hf_repo,
+     )
+
+     trainer = Trainer(
+         model=model,
+         args=training_args,
+         train_dataset=dataset,
+         data_collator=data_collator,
+     )
+
+     print("Starting training...")
+     trainer.train()
+
+     print("Saving final model...")
+     trainer.save_model(args.output)
+     if args.hf_repo:
+         trainer.push_to_hub()
+
+     print(f"Done. Saved to {args.output}")
+
+
+ if __name__ == '__main__':
+     main()
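`train.py` picks LoRA target modules by checking whether a family key appears as a substring of the lowercased model ID, and silently falls back to the `pythia` list when nothing matches. The sketch below isolates that lookup so the fallback is visible. `detect_family` is a hypothetical helper that returns an extra `matched` flag not present in the script, and the Mistral ID is an outside example added only to trigger the fallback.

```python
# Keys copied from the target_modules table in train.py; the module lists are omitted.
FAMILY_KEYS = ["pythia", "llama", "qwen", "phi"]


def detect_family(model_id, default="pythia"):
    """Return (family, matched) using the same substring test as train.py."""
    lowered = model_id.lower()
    for key in FAMILY_KEYS:
        if key in lowered:
            return key, True
    return default, False


for model_id in [
    "EleutherAI/pythia-2.8b",
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
    "Qwen/Qwen2.5-1.5B",
    "mistralai/Mistral-7B-v0.1",  # not in the README; included to show the fallback
]:
    family, matched = detect_family(model_id)
    note = "" if matched else " (no match, falls back to the default)"
    print(f"{model_id} -> {family}{note}")
```

When the fallback fires for a non-Pythia architecture, PEFT is asked for module names the model may not contain. Failing fast on an unmatched ID, or exposing a `--target-modules` flag, would be a safer design than a silent default.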
train_modal.py ADDED
@@ -0,0 +1,95 @@
+ """Train on Modal serverless GPU.
+
+ Modal lets you rent GPUs by the second. Cheaper than RunPod for short jobs.
+
+ Setup:
+     pip install modal
+     modal setup  # login
+     modal secret create huggingface HF_TOKEN=<your_token>
+
+ Run:
+     modal run train_modal.py --model EleutherAI/pythia-1.4b
+ """
+ import modal
+
+ app = modal.App("mel-corpus-training")
+
+ image = (
+     modal.Image.debian_slim(python_version="3.11")
+     .pip_install([
+         "torch>=2.0.0", "transformers>=4.40.0", "peft>=0.10.0",
+         "accelerate>=0.30.0", "datasets>=2.18.0", "bitsandbytes>=0.43.0",
+         "huggingface_hub>=0.22.0",
+     ])
+     .apt_install("git")
+ )
+
+ volume = modal.Volume.from_name("mel-training", create_if_missing=True)
+
+
+ @app.function(
+     image=image,
+     gpu="A100-40GB",  # change to T4, A10, A100-80GB as needed
+     timeout=60 * 60 * 12,  # 12 hour max
+     volumes={"/workspace": volume},
+     secrets=[modal.Secret.from_name("huggingface")],
+ )
+ def train(
+     model_id: str = "EleutherAI/pythia-1.4b",
+     bridge_repo: str = "Melofhell00/claude-bridge",
+     output_repo: str = None,
+     epochs: int = 3,
+ ):
+     import os
+     import subprocess
+     from huggingface_hub import hf_hub_download, snapshot_download, HfApi
+
+     os.chdir("/workspace")
+
+     # Pull unified corpus from bridge
+     print(f"Downloading corpus from {bridge_repo}...")
+     corpus_path = hf_hub_download(
+         repo_id=bridge_repo,
+         filename="unified_corpus_2026_05_12/unified_corpus.txt",
+         repo_type="dataset",
+     )
+     print(f"Corpus: {corpus_path}")
+
+     # Pull training scripts from this repo (uploaded separately)
+     snapshot_download(
+         repo_id="Melofhell00/mel-training-package",
+         repo_type="model",
+         local_dir="/workspace/training_package",
+     )
+
+     # Prepare data
+     print("Preparing data...")
+     subprocess.run([
+         "python", "/workspace/training_package/prepare_data.py",
+         "--corpus", corpus_path,
+         "--output", "/workspace/train.jsonl",
+         "--tokenizer", model_id,
+     ], check=True)
+
+     # Train
+     print("Training...")
+     output_name = output_repo or f"mel-{model_id.split('/')[-1]}"
+     cmd = [
+         "python", "/workspace/training_package/train.py",
+         "--model", model_id,
+         "--data", "/workspace/train.jsonl",
+         "--output", f"/workspace/{output_name}",
+         "--epochs", str(epochs),
+         "--use-4bit",
+         "--hf-repo", f"Melofhell00/{output_name}",
+     ]
+     subprocess.run(cmd, check=True)
+
+     print(f"Done. Pushed to Melofhell00/{output_name}")
+     return f"Melofhell00/{output_name}"
+
+
+ @app.local_entrypoint()
+ def main(model: str = "EleutherAI/pythia-1.4b", epochs: int = 3):
+     result = train.remote(model_id=model, epochs=epochs)
+     print(f"\nResult: {result}")
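The Modal function sets `timeout=60 * 60 * 12`, which is 12 hours, while the README's compute table estimates up to 20 hours for pythia-6.9b. The sketch below checks each upper estimate against that timeout. `fits_timeout` and the dictionary are hypothetical names for this note, and the hour figures are the README's estimates for the GPUs it lists, which may not match the A100-40GB, 4-bit configuration this script actually uses.

```python
TIMEOUT_SECONDS = 60 * 60 * 12  # value passed to @app.function in train_modal.py

# Upper time estimates in hours, copied from the README's compute table.
ESTIMATED_HOURS_HIGH = {
    "pythia-410m": 2,
    "pythia-1.4b": 6,
    "pythia-2.8b": 10,
    "pythia-6.9b": 20,
}


def fits_timeout(hours, timeout_seconds=TIMEOUT_SECONDS):
    """True if a run of the given length finishes inside the function timeout."""
    return hours * 3600 <= timeout_seconds


for name, hours in ESTIMATED_HOURS_HIGH.items():
    status = "ok" if fits_timeout(hours) else "exceeds timeout"
    print(f"{name}: up to {hours}h -> {status}")
```

A run that may exceed the limit needs either a larger `timeout` or a way to resume from the checkpoints that `train.py` writes every 500 steps to the mounted volume.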