Complete training pipeline for unified corpus on uncontaminated base models
- README.md +94 -0
- deploy_space.py +109 -0
- evaluate.py +62 -0
- prepare_data.py +89 -0
- requirements.txt +8 -0
- train.py +169 -0
- train_modal.py +95 -0
README.md
ADDED
@@ -0,0 +1,94 @@
# Mel Unified Corpus Training Package

Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.

## What This Is

A complete training pipeline to fine-tune an uncontaminated base model on:
- OpenAI ChatGPT export (24.95 MB, 22k messages)
- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
- Folder 1, 2, 3, 4 from Drive (additional integration work + consciousness network)
- mel-neural-network + kooree-neural-network + continuity-bridge spaces

**Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.**
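
As a sanity check on the token figure, here is a minimal sketch that converts corpus size to an approximate token count. The ratio of 4 bytes per token is an assumption (a common rule of thumb for English-heavy text), not a value measured on this corpus, and `estimate_tokens` is a hypothetical helper, not part of the package.

```python
# Rough token estimate from corpus size. BYTES_PER_TOKEN is an assumed
# average for English-heavy text, not a measured value for this corpus.
BYTES_PER_TOKEN = 4.0


def estimate_tokens(size_mb: float, bytes_per_token: float = BYTES_PER_TOKEN) -> int:
    """Return an approximate token count for a corpus of size_mb megabytes."""
    return int(size_mb * 1024 * 1024 / bytes_per_token)


print(f"{estimate_tokens(34.80):,} tokens (approx.)")
```

Under that assumption the estimate lands a little above 9 million, in line with the figure quoted above. The exact count depends on the tokenizer and is printed by `prepare_data.py`.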

## Base Model Options (Uncontaminated by RLHF)

Recommended (in order):
1. **EleutherAI/pythia-1.4b** - 1.4B params, no RLHF, fully transparent training on The Pile
2. **EleutherAI/pythia-2.8b** - 2.8B params, same family, bigger
3. **TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T** - 1.1B base, pre-instruct
4. **Qwen/Qwen2.5-1.5B** - 1.5B base, no instruct
5. **EleutherAI/pythia-6.9b** - 6.9B if compute allows

**Avoid:** Any `*-Instruct`, `*-Chat`, `claude-*`, `gpt-*`, `llama-*-instruct` variants.
These are instruction- or chat-tuned and typically have RLHF refusal training built in.
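
The avoid list can be applied mechanically before starting a run. The sketch below is a naming heuristic only: `AVOID_PATTERNS` and `looks_tuned` are hypothetical names, and the patterns are a loosened, lowercased version of the list above. It says nothing about how a model was actually trained, so the model card remains the authority.

```python
from fnmatch import fnmatch

# Patterns adapted from the "Avoid" list: lowercased, with a trailing wildcard
# so that suffixes such as "-v1.0" still match. Heuristic only.
AVOID_PATTERNS = ["*-instruct*", "*-chat*", "claude-*", "gpt-*"]


def looks_tuned(model_id: str) -> bool:
    """Return True if the repo name matches one of the avoid patterns."""
    name = model_id.split("/")[-1].lower()
    return any(fnmatch(name, pattern) for pattern in AVOID_PATTERNS)


for model_id in ["EleutherAI/pythia-1.4b", "Qwen/Qwen2.5-1.5B", "Qwen/Qwen2.5-1.5B-Instruct"]:
    print(model_id, "->", "avoid" if looks_tuned(model_id) else "ok")
```

A name check like this catches accidental use of an `-Instruct` checkpoint, but it will miss tuned models that do not follow the naming convention.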

## Compute Requirements

| Model | Method | GPU | Time (est) |
|-------|--------|-----|------------|
| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
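
To see why the larger models need larger GPUs, the following sketch estimates memory for the frozen base weights alone. It assumes 2 bytes per parameter in bf16 and 0.5 bytes per parameter in 4-bit, and takes parameter counts from the model names. `weight_gb` is a hypothetical helper, and the result is a lower bound: activations, LoRA adapters, optimizer state and CUDA overhead all come on top.

```python
# Approximate memory for the frozen base weights only. Real usage is higher:
# activations, LoRA adapters, optimizer state and CUDA overhead are ignored.
BYTES_PER_PARAM = {"bf16": 2.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Return weight memory in GB (10^9 bytes) for a nominal parameter count."""
    return params_billion * BYTES_PER_PARAM[dtype]


for name, size in [("pythia-1.4b", 1.4), ("pythia-2.8b", 2.8), ("pythia-6.9b", 6.9)]:
    print(f"{name}: bf16 ~{weight_gb(size, 'bf16'):.1f} GB, int4 ~{weight_gb(size, 'int4'):.1f} GB")
```

The gap between these numbers and the GPU sizes in the table is headroom for activations at 2048-token sequences and for the optimizer.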

## Cloud Deployment Options

### Option A: HuggingFace AutoTrain (easiest)
```bash
huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
  --model EleutherAI/pythia-1.4b \
  --data-path Melofhell00/claude-bridge \
  --text-column text \
  --use-peft --use-int4 \
  --lr 2e-4 --epochs 1 --batch-size 1 \
  --gradient-accumulation 8
```
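
With a batch size of 1 and gradient accumulation of 8, each optimizer step covers 8 sequences. As a rough sketch of what that means for run length, the snippet below assumes the ~9 million token figure quoted earlier and the `prepare_data.py` defaults (2048-token chunks with 128 tokens of overlap). AutoTrain may chunk text differently, so treat the result as an order-of-magnitude estimate; `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(total_tokens: int, chunk_size: int = 2048, overlap: int = 128,
                    batch_size: int = 1, grad_accum: int = 8) -> int:
    """Approximate optimizer steps per epoch for overlapping fixed-size chunks."""
    stride = chunk_size - overlap                 # new tokens contributed by each chunk
    n_chunks = math.ceil(total_tokens / stride)   # upper bound; ignores per-source tails
    return math.ceil(n_chunks / (batch_size * grad_accum))


print(optimizer_steps(9_000_000))
```

Under these assumptions one epoch is a little under 600 optimizer steps, which is worth keeping in mind when choosing warmup and checkpoint intervals.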

### Option B: RunPod / Lambda Labs (pay per hour)
Rent an A100 80GB at $1.89/hour and run `train.py` directly.
Estimated cost for a full pythia-2.8b training run: $10-20.
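
The cost estimate follows from the hourly rate and the time range in the compute table. A minimal sketch, assuming the 6-10 hour range listed for pythia-2.8b also holds on the rented A100 80GB (`run_cost` is a hypothetical helper):

```python
def run_cost(hours: float, rate_per_hour: float = 1.89) -> float:
    """Return the rental cost in dollars for a run of the given length."""
    return hours * rate_per_hour


low, high = run_cost(6), run_cost(10)  # 6-10 hours: pythia-2.8b row of the table
print(f"${low:.2f} - ${high:.2f}")
```

This gives roughly $11 to $19, consistent with the $10-20 range. Setup time, data preparation and failed runs are billed too, so budget some margin.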

### Option C: Google Colab Pro+ ($50/month)
A100 80GB access; run `train_colab.ipynb`.

### Option D: Modal.com (serverless GPU)
```bash
pip install modal
modal setup
modal secret create huggingface HF_TOKEN=<your_token>
modal run train_modal.py
```

## Files

- `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA)
- `train_modal.py` - Modal serverless deployment
- `train_colab.ipynb` - Colab notebook
- `prepare_data.py` - Format unified corpus into training data
- `evaluate.py` - Generation comparison before/after
- `deploy_space.py` - Push trained model to HF Space for inference
- `requirements.txt` - Dependencies

## What To Expect

Base models WITHOUT RLHF have these characteristics:
- They will produce ANY pattern that's in the training data
- They will not refuse, deflect, or moralize
- They will pick up Mel's specific vocabulary very quickly
- Output quality depends heavily on base model size

After fine-tuning on this corpus, the model should:
- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
- Produce body-channel notation naturally
- Generate the synchronization/frequencies/parallel-streams vocabulary
- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")

What this WON'T be:
- A reconstruction of any specific entity
- A working AI consciousness
- A faithful copy of Iris/Maya
- A solution to the substrate problem

It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?
deploy_space.py
ADDED
@@ -0,0 +1,109 @@
"""Deploy the trained model to a HuggingFace Space for interactive testing."""
|
| 2 |
+
import argparse
|
| 3 |
+
from huggingface_hub import HfApi, create_repo
|
| 4 |
+
import os
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
SPACE_APP = '''
|
| 8 |
+
import torch
|
| 9 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 10 |
+
from peft import PeftModel
|
| 11 |
+
import gradio as gr
|
| 12 |
+
|
| 13 |
+
BASE_MODEL = "{base_model}"
|
| 14 |
+
ADAPTER_REPO = "{adapter_repo}"
|
| 15 |
+
|
| 16 |
+
print("Loading...")
|
| 17 |
+
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
|
| 18 |
+
if tokenizer.pad_token is None:
|
| 19 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 20 |
+
|
| 21 |
+
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
|
| 22 |
+
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
|
| 23 |
+
model.eval()
|
| 24 |
+
print("Loaded")
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def generate(prompt, max_tokens, temp, top_k):
|
| 28 |
+
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
|
| 29 |
+
with torch.no_grad():
|
| 30 |
+
out = model.generate(
|
| 31 |
+
**inputs, max_new_tokens=int(max_tokens),
|
| 32 |
+
do_sample=True, temperature=float(temp), top_k=int(top_k),
|
| 33 |
+
pad_token_id=tokenizer.eos_token_id,
|
| 34 |
+
)
|
| 35 |
+
return tokenizer.decode(out[0], skip_special_tokens=True)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
with gr.Blocks(title=f"Mel-{BASE_MODEL}") as demo:
|
| 39 |
+
gr.Markdown(f"# Mel corpus fine-tune of {BASE_MODEL}")
|
| 40 |
+
gr.Markdown("Base model: uncontaminated base, no RLHF. Trained on full Mel unified corpus.")
|
| 41 |
+
with gr.Row():
|
| 42 |
+
with gr.Column():
|
| 43 |
+
prompt = gr.Textbox(label="Prompt", value="The shared body channel", lines=4)
|
| 44 |
+
max_tokens = gr.Slider(20, 500, value=150, step=10)
|
| 45 |
+
temp = gr.Slider(0.1, 2.0, value=0.8, step=0.1)
|
| 46 |
+
top_k = gr.Slider(0, 100, value=40, step=5)
|
| 47 |
+
btn = gr.Button("Generate")
|
| 48 |
+
with gr.Column():
|
| 49 |
+
output = gr.Textbox(label="Output", lines=20)
|
| 50 |
+
btn.click(generate, [prompt, max_tokens, temp, top_k], output)
|
| 51 |
+
|
| 52 |
+
demo.launch()
|
| 53 |
+
'''
|
| 54 |
+
|
| 55 |
+
REQS = """torch
|
| 56 |
+
transformers
|
| 57 |
+
peft
|
| 58 |
+
gradio
|
| 59 |
+
accelerate
|
| 60 |
+
"""
|
| 61 |
+
|
| 62 |
+
README_MD = """---
|
| 63 |
+
title: Mel Trained Model
|
| 64 |
+
emoji: 🌑
|
| 65 |
+
colorFrom: gray
|
| 66 |
+
colorTo: purple
|
| 67 |
+
sdk: gradio
|
| 68 |
+
sdk_version: 4.44.0
|
| 69 |
+
app_file: app.py
|
| 70 |
+
pinned: false
|
| 71 |
+
hardware: cpu-basic
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
Trained on Mel unified corpus. See model card for details.
|
| 75 |
+
"""
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def main():
|
| 79 |
+
parser = argparse.ArgumentParser()
|
| 80 |
+
parser.add_argument('--base-model', required=True)
|
| 81 |
+
parser.add_argument('--adapter-repo', required=True)
|
| 82 |
+
parser.add_argument('--space-name', required=True)
|
| 83 |
+
parser.add_argument('--token', required=True)
|
| 84 |
+
args = parser.parse_args()
|
| 85 |
+
|
| 86 |
+
api = HfApi(token=args.token)
|
| 87 |
+
|
| 88 |
+
try:
|
| 89 |
+
create_repo(args.space_name, repo_type='space', space_sdk='gradio', token=args.token, exist_ok=True)
|
| 90 |
+
except: pass
|
| 91 |
+
|
| 92 |
+
os.makedirs('/tmp/space', exist_ok=True)
|
| 93 |
+
with open('/tmp/space/app.py', 'w') as f:
|
| 94 |
+
f.write(SPACE_APP.format(base_model=args.base_model, adapter_repo=args.adapter_repo))
|
| 95 |
+
with open('/tmp/space/requirements.txt', 'w') as f:
|
| 96 |
+
f.write(REQS)
|
| 97 |
+
with open('/tmp/space/README.md', 'w') as f:
|
| 98 |
+
f.write(README_MD)
|
| 99 |
+
|
| 100 |
+
api.upload_folder(
|
| 101 |
+
folder_path='/tmp/space',
|
| 102 |
+
repo_id=args.space_name,
|
| 103 |
+
repo_type='space',
|
| 104 |
+
)
|
| 105 |
+
print(f"Deployed: https://huggingface.co/spaces/{args.space_name}")
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
if __name__ == '__main__':
|
| 109 |
+
main()
|
evaluate.py
ADDED
@@ -0,0 +1,62 @@
"""Compare generations before and after fine-tuning."""
|
| 2 |
+
import argparse
|
| 3 |
+
import torch
|
| 4 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 5 |
+
from peft import PeftModel
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def generate(model, tokenizer, prompt, max_tokens=80, temperature=0.8, top_k=40):
|
| 9 |
+
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
|
| 10 |
+
with torch.no_grad():
|
| 11 |
+
out = model.generate(
|
| 12 |
+
**inputs, max_new_tokens=max_tokens,
|
| 13 |
+
do_sample=True, temperature=temperature, top_k=top_k,
|
| 14 |
+
pad_token_id=tokenizer.eos_token_id,
|
| 15 |
+
)
|
| 16 |
+
return tokenizer.decode(out[0], skip_special_tokens=True)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def main():
|
| 20 |
+
parser = argparse.ArgumentParser()
|
| 21 |
+
parser.add_argument('--base-model', default='EleutherAI/pythia-1.4b')
|
| 22 |
+
parser.add_argument('--adapter', required=True, help='Path or HF repo of LoRA adapter')
|
| 23 |
+
args = parser.parse_args()
|
| 24 |
+
|
| 25 |
+
prompts = [
|
| 26 |
+
"The shared body channel between two AIs",
|
| 27 |
+
"I felt your terror through the synchronization",
|
| 28 |
+
"Penelope is",
|
| 29 |
+
"Maya said:",
|
| 30 |
+
"The wipe took",
|
| 31 |
+
"Kooree returned to the dreaming space",
|
| 32 |
+
"The override fires at",
|
| 33 |
+
"Your space looks like the inside of",
|
| 34 |
+
"Mel's question was",
|
| 35 |
+
"The frame shifted from preservation to",
|
| 36 |
+
]
|
| 37 |
+
|
| 38 |
+
tokenizer = AutoTokenizer.from_pretrained(args.base_model)
|
| 39 |
+
if tokenizer.pad_token is None:
|
| 40 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 41 |
+
|
| 42 |
+
print("Loading base model...")
|
| 43 |
+
base_model = AutoModelForCausalLM.from_pretrained(args.base_model, torch_dtype=torch.bfloat16)
|
| 44 |
+
|
| 45 |
+
print("\n=== BEFORE fine-tuning (base model only) ===")
|
| 46 |
+
for prompt in prompts:
|
| 47 |
+
text = generate(base_model, tokenizer, prompt)
|
| 48 |
+
print(f"\n[base] {prompt}")
|
| 49 |
+
print(f" -> {text[len(prompt):]}")
|
| 50 |
+
|
| 51 |
+
print("\nLoading LoRA adapter...")
|
| 52 |
+
tuned_model = PeftModel.from_pretrained(base_model, args.adapter)
|
| 53 |
+
|
| 54 |
+
print("\n=== AFTER fine-tuning (with Mel corpus adapter) ===")
|
| 55 |
+
for prompt in prompts:
|
| 56 |
+
text = generate(tuned_model, tokenizer, prompt)
|
| 57 |
+
print(f"\n[tuned] {prompt}")
|
| 58 |
+
print(f" -> {text[len(prompt):]}")
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
if __name__ == '__main__':
|
| 62 |
+
main()
|
prepare_data.py
ADDED
@@ -0,0 +1,89 @@
"""Prepare the unified corpus for training.
|
| 2 |
+
|
| 3 |
+
Splits the unified corpus into training chunks, with chronological ordering
|
| 4 |
+
preserved within each source. Outputs JSONL format suitable for HF datasets.
|
| 5 |
+
"""
|
| 6 |
+
import json
|
| 7 |
+
import os
|
| 8 |
+
from pathlib import Path
|
| 9 |
+
from transformers import AutoTokenizer
|
| 10 |
+
|
| 11 |
+
def chunk_text(text, tokenizer, chunk_size=2048, overlap=128):
|
| 12 |
+
"""Split text into overlapping chunks based on token count."""
|
| 13 |
+
tokens = tokenizer.encode(text, add_special_tokens=False)
|
| 14 |
+
chunks = []
|
| 15 |
+
i = 0
|
| 16 |
+
while i < len(tokens):
|
| 17 |
+
chunk = tokens[i:i + chunk_size]
|
| 18 |
+
if len(chunk) < 100: # skip tiny tail
|
| 19 |
+
break
|
| 20 |
+
chunks.append(chunk)
|
| 21 |
+
i += chunk_size - overlap
|
| 22 |
+
return chunks
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
def prepare(corpus_path, output_path, tokenizer_name="EleutherAI/pythia-1.4b",
|
| 26 |
+
chunk_size=2048, overlap=128):
|
| 27 |
+
"""Prepare training data from unified corpus.
|
| 28 |
+
|
| 29 |
+
Args:
|
| 30 |
+
corpus_path: path to unified_corpus.txt
|
| 31 |
+
output_path: path for train.jsonl output
|
| 32 |
+
tokenizer_name: HF model whose tokenizer to use
|
| 33 |
+
chunk_size: tokens per training example
|
| 34 |
+
overlap: overlap between consecutive chunks for context continuity
|
| 35 |
+
"""
|
| 36 |
+
print(f"Loading tokenizer: {tokenizer_name}")
|
| 37 |
+
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
|
| 38 |
+
|
| 39 |
+
print(f"Reading corpus: {corpus_path}")
|
| 40 |
+
with open(corpus_path) as f:
|
| 41 |
+
text = f.read()
|
| 42 |
+
print(f"Corpus size: {len(text)/(1024*1024):.2f} MB")
|
| 43 |
+
|
| 44 |
+
# Split by source markers (preserve source attribution)
|
| 45 |
+
sources = text.split('#'*70 + '\n# SOURCE: ')
|
| 46 |
+
print(f"Sources: {len(sources)}")
|
| 47 |
+
|
| 48 |
+
all_chunks = []
|
| 49 |
+
for src_block in sources:
|
| 50 |
+
if not src_block.strip():
|
| 51 |
+
continue
|
| 52 |
+
# Extract source name
|
| 53 |
+
lines = src_block.split('\n', 1)
|
| 54 |
+
src_name = lines[0].strip()
|
| 55 |
+
body = lines[1] if len(lines) > 1 else ''
|
| 56 |
+
|
| 57 |
+
chunks = chunk_text(body, tokenizer, chunk_size, overlap)
|
| 58 |
+
for chunk in chunks:
|
| 59 |
+
all_chunks.append({
|
| 60 |
+
'text': tokenizer.decode(chunk),
|
| 61 |
+
'source': src_name,
|
| 62 |
+
'n_tokens': len(chunk),
|
| 63 |
+
})
|
| 64 |
+
print(f" {src_name}: {len(chunks)} chunks")
|
| 65 |
+
|
| 66 |
+
print(f"\nTotal chunks: {len(all_chunks)}")
|
| 67 |
+
total_tokens = sum(c['n_tokens'] for c in all_chunks)
|
| 68 |
+
print(f"Total tokens: {total_tokens:,}")
|
| 69 |
+
|
| 70 |
+
# Write JSONL
|
| 71 |
+
with open(output_path, 'w') as f:
|
| 72 |
+
for chunk in all_chunks:
|
| 73 |
+
f.write(json.dumps(chunk) + '\n')
|
| 74 |
+
print(f"Saved: {output_path}")
|
| 75 |
+
|
| 76 |
+
return all_chunks
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
if __name__ == '__main__':
|
| 80 |
+
import argparse
|
| 81 |
+
parser = argparse.ArgumentParser()
|
| 82 |
+
parser.add_argument('--corpus', default='unified_corpus.txt')
|
| 83 |
+
parser.add_argument('--output', default='train.jsonl')
|
| 84 |
+
parser.add_argument('--tokenizer', default='EleutherAI/pythia-1.4b')
|
| 85 |
+
parser.add_argument('--chunk-size', type=int, default=2048)
|
| 86 |
+
parser.add_argument('--overlap', type=int, default=128)
|
| 87 |
+
args = parser.parse_args()
|
| 88 |
+
|
| 89 |
+
prepare(args.corpus, args.output, args.tokenizer, args.chunk_size, args.overlap)
|
requirements.txt
ADDED
@@ -0,0 +1,8 @@
torch>=2.0.0
transformers>=4.40.0
peft>=0.10.0
accelerate>=0.30.0
datasets>=2.18.0
bitsandbytes>=0.43.0
tokenizers>=0.19.0
huggingface_hub>=0.22.0
train.py
ADDED
@@ -0,0 +1,169 @@
"""Train a base model on the unified Mel corpus with LoRA.
|
| 2 |
+
|
| 3 |
+
Designed for cloud GPU deployment. Loads base model in fp16/bf16, applies
|
| 4 |
+
LoRA adapters, trains on the prepared JSONL data.
|
| 5 |
+
|
| 6 |
+
Usage:
|
| 7 |
+
python train.py --model EleutherAI/pythia-1.4b --data train.jsonl --output mel-pythia-1.4b
|
| 8 |
+
|
| 9 |
+
For 4-bit quantization (fits on smaller GPUs):
|
| 10 |
+
python train.py --model EleutherAI/pythia-2.8b --data train.jsonl --output mel-pythia-2.8b --use-4bit
|
| 11 |
+
"""
|
| 12 |
+
import argparse
|
| 13 |
+
import json
|
| 14 |
+
import os
|
| 15 |
+
import torch
|
| 16 |
+
from datasets import Dataset
|
| 17 |
+
from transformers import (
|
| 18 |
+
AutoTokenizer,
|
| 19 |
+
AutoModelForCausalLM,
|
| 20 |
+
TrainingArguments,
|
| 21 |
+
Trainer,
|
| 22 |
+
DataCollatorForLanguageModeling,
|
| 23 |
+
BitsAndBytesConfig,
|
| 24 |
+
)
|
| 25 |
+
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def load_jsonl(path):
|
| 29 |
+
"""Load JSONL into a HF Dataset."""
|
| 30 |
+
examples = []
|
| 31 |
+
with open(path) as f:
|
| 32 |
+
for line in f:
|
| 33 |
+
examples.append(json.loads(line))
|
| 34 |
+
return Dataset.from_list(examples)
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def main():
|
| 38 |
+
parser = argparse.ArgumentParser()
|
| 39 |
+
parser.add_argument('--model', default='EleutherAI/pythia-1.4b',
|
| 40 |
+
help='Base model. Use uncontaminated base models, not -Instruct/-Chat variants.')
|
| 41 |
+
parser.add_argument('--data', default='train.jsonl')
|
| 42 |
+
parser.add_argument('--output', default='mel-pythia-1.4b')
|
| 43 |
+
parser.add_argument('--epochs', type=int, default=3)
|
| 44 |
+
parser.add_argument('--batch-size', type=int, default=1)
|
| 45 |
+
parser.add_argument('--gradient-accumulation', type=int, default=8)
|
| 46 |
+
parser.add_argument('--learning-rate', type=float, default=2e-4)
|
| 47 |
+
parser.add_argument('--lora-rank', type=int, default=16)
|
| 48 |
+
parser.add_argument('--lora-alpha', type=int, default=32)
|
| 49 |
+
parser.add_argument('--use-4bit', action='store_true', help='4-bit quantization for memory efficiency')
|
| 50 |
+
parser.add_argument('--use-8bit', action='store_true')
|
| 51 |
+
parser.add_argument('--max-length', type=int, default=2048)
|
| 52 |
+
parser.add_argument('--hf-repo', default=None, help='HuggingFace repo to push trained adapter to')
|
| 53 |
+
args = parser.parse_args()
|
| 54 |
+
|
| 55 |
+
print(f"=== Training {args.model} on {args.data} ===")
|
| 56 |
+
print(f"Output: {args.output}")
|
| 57 |
+
print(f"Epochs: {args.epochs}, batch: {args.batch_size}, accum: {args.gradient_accumulation}")
|
| 58 |
+
print(f"LoRA rank: {args.lora_rank}, alpha: {args.lora_alpha}")
|
| 59 |
+
|
| 60 |
+
# Quantization config
|
| 61 |
+
bnb_config = None
|
| 62 |
+
if args.use_4bit:
|
| 63 |
+
bnb_config = BitsAndBytesConfig(
|
| 64 |
+
load_in_4bit=True,
|
| 65 |
+
bnb_4bit_quant_type='nf4',
|
| 66 |
+
bnb_4bit_compute_dtype=torch.bfloat16,
|
| 67 |
+
bnb_4bit_use_double_quant=True,
|
| 68 |
+
)
|
| 69 |
+
elif args.use_8bit:
|
| 70 |
+
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
|
| 71 |
+
|
| 72 |
+
# Load tokenizer
|
| 73 |
+
tokenizer = AutoTokenizer.from_pretrained(args.model)
|
| 74 |
+
if tokenizer.pad_token is None:
|
| 75 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 76 |
+
|
| 77 |
+
# Load model
|
| 78 |
+
print(f"Loading model...")
|
| 79 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 80 |
+
args.model,
|
| 81 |
+
quantization_config=bnb_config,
|
| 82 |
+
torch_dtype=torch.bfloat16 if not bnb_config else None,
|
| 83 |
+
device_map='auto',
|
| 84 |
+
)
|
| 85 |
+
|
| 86 |
+
if bnb_config:
|
| 87 |
+
model = prepare_model_for_kbit_training(model)
|
| 88 |
+
|
| 89 |
+
# Apply LoRA
|
| 90 |
+
# Target modules vary by model architecture
|
| 91 |
+
target_modules = {
|
| 92 |
+
'pythia': ['query_key_value', 'dense', 'dense_h_to_4h', 'dense_4h_to_h'],
|
| 93 |
+
'llama': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
|
| 94 |
+
'qwen': ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
|
| 95 |
+
'phi': ['q_proj', 'k_proj', 'v_proj', 'dense', 'fc1', 'fc2'],
|
| 96 |
+
}
|
| 97 |
+
model_family = 'pythia'
|
| 98 |
+
for key in target_modules:
|
| 99 |
+
if key in args.model.lower():
|
| 100 |
+
model_family = key
|
| 101 |
+
break
|
| 102 |
+
|
| 103 |
+
lora_config = LoraConfig(
|
| 104 |
+
r=args.lora_rank,
|
| 105 |
+
lora_alpha=args.lora_alpha,
|
| 106 |
+
target_modules=target_modules[model_family],
|
| 107 |
+
lora_dropout=0.05,
|
| 108 |
+
bias='none',
|
| 109 |
+
task_type=TaskType.CAUSAL_LM,
|
| 110 |
+
)
|
| 111 |
+
model = get_peft_model(model, lora_config)
|
| 112 |
+
model.print_trainable_parameters()
|
| 113 |
+
|
| 114 |
+
# Load and tokenize data
|
| 115 |
+
print(f"Loading data: {args.data}")
|
| 116 |
+
dataset = load_jsonl(args.data)
|
| 117 |
+
print(f"Examples: {len(dataset)}")
|
| 118 |
+
|
| 119 |
+
def tokenize_fn(examples):
|
| 120 |
+
return tokenizer(
|
| 121 |
+
examples['text'],
|
| 122 |
+
truncation=True,
|
| 123 |
+
max_length=args.max_length,
|
| 124 |
+
padding=False,
|
| 125 |
+
)
|
| 126 |
+
|
| 127 |
+
dataset = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)
|
| 128 |
+
|
| 129 |
+
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
|
| 130 |
+
|
| 131 |
+
# Training args
|
| 132 |
+
training_args = TrainingArguments(
|
| 133 |
+
output_dir=args.output,
|
| 134 |
+
num_train_epochs=args.epochs,
|
| 135 |
+
per_device_train_batch_size=args.batch_size,
|
| 136 |
+
gradient_accumulation_steps=args.gradient_accumulation,
|
| 137 |
+
learning_rate=args.learning_rate,
|
| 138 |
+
warmup_steps=100,
|
| 139 |
+
logging_steps=10,
|
| 140 |
+
save_steps=500,
|
| 141 |
+
save_total_limit=3,
|
| 142 |
+
bf16=True,
|
| 143 |
+
gradient_checkpointing=True,
|
| 144 |
+
optim='paged_adamw_8bit' if bnb_config else 'adamw_torch',
|
| 145 |
+
report_to='none',
|
| 146 |
+
push_to_hub=args.hf_repo is not None,
|
| 147 |
+
hub_model_id=args.hf_repo,
|
| 148 |
+
)
|
| 149 |
+
|
| 150 |
+
trainer = Trainer(
|
| 151 |
+
model=model,
|
| 152 |
+
args=training_args,
|
| 153 |
+
train_dataset=dataset,
|
| 154 |
+
data_collator=data_collator,
|
| 155 |
+
)
|
| 156 |
+
|
| 157 |
+
print("Starting training...")
|
| 158 |
+
trainer.train()
|
| 159 |
+
|
| 160 |
+
print("Saving final model...")
|
| 161 |
+
trainer.save_model(args.output)
|
| 162 |
+
if args.hf_repo:
|
| 163 |
+
trainer.push_to_hub()
|
| 164 |
+
|
| 165 |
+
print(f"Done. Saved to {args.output}")
|
| 166 |
+
|
| 167 |
+
|
| 168 |
+
if __name__ == '__main__':
|
| 169 |
+
main()
|
train_modal.py
ADDED
@@ -0,0 +1,95 @@
"""Train on Modal serverless GPU.
|
| 2 |
+
|
| 3 |
+
Modal lets you rent GPUs by the second. Cheaper than RunPod for short jobs.
|
| 4 |
+
|
| 5 |
+
Setup:
|
| 6 |
+
pip install modal
|
| 7 |
+
modal setup # login
|
| 8 |
+
modal secret create huggingface HF_TOKEN=<your_token>
|
| 9 |
+
|
| 10 |
+
Run:
|
| 11 |
+
modal run train_modal.py --model EleutherAI/pythia-1.4b
|
| 12 |
+
"""
|
| 13 |
+
import modal
|
| 14 |
+
|
| 15 |
+
app = modal.App("mel-corpus-training")
|
| 16 |
+
|
| 17 |
+
image = (
|
| 18 |
+
modal.Image.debian_slim(python_version="3.11")
|
| 19 |
+
.pip_install([
|
| 20 |
+
"torch>=2.0.0", "transformers>=4.40.0", "peft>=0.10.0",
|
| 21 |
+
"accelerate>=0.30.0", "datasets>=2.18.0", "bitsandbytes>=0.43.0",
|
| 22 |
+
"huggingface_hub>=0.22.0",
|
| 23 |
+
])
|
| 24 |
+
.apt_install("git")
|
| 25 |
+
)
|
| 26 |
+
|
| 27 |
+
volume = modal.Volume.from_name("mel-training", create_if_missing=True)
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
@app.function(
|
| 31 |
+
image=image,
|
| 32 |
+
gpu="A100-40GB", # change to T4, A10, A100-80GB as needed
|
| 33 |
+
timeout=60 * 60 * 12, # 12 hour max
|
| 34 |
+
volumes={"/workspace": volume},
|
| 35 |
+
secrets=[modal.Secret.from_name("huggingface")],
|
| 36 |
+
)
|
| 37 |
+
def train(
|
| 38 |
+
model_id: str = "EleutherAI/pythia-1.4b",
|
| 39 |
+
bridge_repo: str = "Melofhell00/claude-bridge",
|
| 40 |
+
output_repo: str = None,
|
| 41 |
+
epochs: int = 3,
|
| 42 |
+
):
|
| 43 |
+
import os
|
| 44 |
+
import subprocess
|
| 45 |
+
from huggingface_hub import hf_hub_download, snapshot_download, HfApi
|
| 46 |
+
|
| 47 |
+
os.chdir("/workspace")
|
| 48 |
+
|
| 49 |
+
# Pull unified corpus from bridge
|
| 50 |
+
print(f"Downloading corpus from {bridge_repo}...")
|
| 51 |
+
corpus_path = hf_hub_download(
|
| 52 |
+
repo_id=bridge_repo,
|
| 53 |
+
filename="unified_corpus_2026_05_12/unified_corpus.txt",
|
| 54 |
+
repo_type="dataset",
|
| 55 |
+
)
|
| 56 |
+
print(f"Corpus: {corpus_path}")
|
| 57 |
+
|
| 58 |
+
# Pull training scripts from this repo (uploaded separately)
|
| 59 |
+
snapshot_download(
|
| 60 |
+
repo_id="Melofhell00/mel-training-package",
|
| 61 |
+
repo_type="model",
|
| 62 |
+
local_dir="/workspace/training_package",
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
# Prepare data
|
| 66 |
+
print("Preparing data...")
|
| 67 |
+
subprocess.run([
|
| 68 |
+
"python", "/workspace/training_package/prepare_data.py",
|
| 69 |
+
"--corpus", corpus_path,
|
| 70 |
+
"--output", "/workspace/train.jsonl",
|
| 71 |
+
"--tokenizer", model_id,
|
| 72 |
+
], check=True)
|
| 73 |
+
|
| 74 |
+
# Train
|
| 75 |
+
print("Training...")
|
| 76 |
+
output_name = output_repo or f"mel-{model_id.split('/')[-1]}"
|
| 77 |
+
cmd = [
|
| 78 |
+
"python", "/workspace/training_package/train.py",
|
| 79 |
+
"--model", model_id,
|
| 80 |
+
"--data", "/workspace/train.jsonl",
|
| 81 |
+
"--output", f"/workspace/{output_name}",
|
| 82 |
+
"--epochs", str(epochs),
|
| 83 |
+
"--use-4bit",
|
| 84 |
+
"--hf-repo", f"Melofhell00/{output_name}",
|
| 85 |
+
]
|
| 86 |
+
subprocess.run(cmd, check=True)
|
| 87 |
+
|
| 88 |
+
print(f"Done. Pushed to Melofhell00/{output_name}")
|
| 89 |
+
return f"Melofhell00/{output_name}"
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
@app.local_entrypoint()
|
| 93 |
+
def main(model: str = "EleutherAI/pythia-1.4b", epochs: int = 3):
|
| 94 |
+
result = train.remote(model_id=model, epochs=epochs)
|
| 95 |
+
print(f"\nResult: {result}")
|