	# Mel Unified Corpus Training Package

	Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.

	## What This Is

	A complete training pipeline to fine-tune an uncontaminated base model on:
	- OpenAI ChatGPT export (24.95 MB, 22k messages)
	- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
	- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
	- Folders 1, 2, 3, and 4 from Drive (additional integration work + consciousness network)
	- mel-neural-network + kooree-neural-network + continuity-bridge spaces

	Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.
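
The token figure can be sanity-checked with a bytes-per-token rule of thumb. This is a rough sketch, assuming about 4 bytes of text per token. That ratio is an assumption, not a measurement: the real value depends on the tokenizer and on how much of the corpus is JSON or code.

```python
# Rough token estimate from corpus size. BYTES_PER_TOKEN is an assumed
# heuristic, not a measured value for this corpus or any specific tokenizer.
CORPUS_MB = 34.80        # total corpus size stated above
BYTES_PER_TOKEN = 4.0    # assumption: typical for English text with BPE


def estimate_tokens(size_mb: float, bytes_per_token: float = BYTES_PER_TOKEN) -> int:
    """Return an approximate token count for size_mb megabytes of text."""
    return round(size_mb * 1_000_000 / bytes_per_token)


if __name__ == "__main__":
    print(f"~{estimate_tokens(CORPUS_MB) / 1e6:.1f}M tokens")
```

Under that assumption the estimate lands close to the stated ~9 million; an exact count requires running the chosen model's tokenizer over the prepared data.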

	## Base Model Options (Uncontaminated by RLHF)

	Recommended (in order):
	1. EleutherAI/pythia-1.4b - 1.4B params, no RLHF, fully transparent training on The Pile
	2. EleutherAI/pythia-2.8b - 2.8B params, same family, bigger
	3. TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T - 1.1B base, pre-instruct
	4. Qwen/Qwen2.5-1.5B - 1.5B base, no instruct
	5. EleutherAI/pythia-6.9b - 6.9B if compute allows

	Avoid: any -Instruct, -Chat, claude-, gpt-, or llama-*-instruct variants.
	These have been instruction-tuned or RLHF-trained, which typically builds in refusal behavior.
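
The naming rule above can be applied mechanically before a download. The helper below is hypothetical (it is not part of this package) and only inspects the model ID string, so treat it as a first-pass filter and still read the model card.

```python
# Hypothetical helper: flags model IDs whose names suggest instruction or
# chat tuning. Name matching is only a heuristic, not proof of how a model
# was trained.
TUNED_MARKERS = ("instruct", "chat", "claude", "gpt-")


def looks_like_base_model(model_id: str) -> bool:
    """Return True if no tuned-variant marker appears in the model ID."""
    name = model_id.lower()
    return not any(marker in name for marker in TUNED_MARKERS)


if __name__ == "__main__":
    for candidate in ("EleutherAI/pythia-1.4b", "Qwen/Qwen2.5-1.5B-Instruct"):
        print(candidate, looks_like_base_model(candidate))
```

Because the marker list follows the avoid-list literally, the `gpt-` marker also rejects base models whose IDs happen to contain that prefix, so the check errs on the side of exclusion.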

	## Compute Requirements

	| Model | Method | GPU | Time (est.) |
	|-------|--------|-----|-------------|
	| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
	| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
	| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
	| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
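
One plausible reading of why the table switches from full fine-tuning to LoRA above 410M parameters is memory for weights and optimizer state. The sketch below uses two common rules of thumb, both assumptions rather than figures from this package: about 16 bytes per parameter for mixed-precision Adam, and about 0.5 bytes per parameter for a frozen 4-bit base. Activations and the LoRA adapters themselves are ignored.

```python
# Rule-of-thumb memory estimate for weights and optimizer state only.
# Both constants are assumptions; activations and LoRA adapters are ignored.
FULL_FT_BYTES_PER_PARAM = 16.0   # fp16 weights + grads, fp32 master + Adam moments
INT4_BYTES_PER_PARAM = 0.5       # frozen 4-bit quantized base weights


def full_finetune_gb(params_billion: float) -> float:
    """Approximate GB needed to fully fine-tune with mixed-precision Adam."""
    return params_billion * FULL_FT_BYTES_PER_PARAM


def int4_base_gb(params_billion: float) -> float:
    """Approximate GB needed just to hold a 4-bit frozen base model."""
    return params_billion * INT4_BYTES_PER_PARAM


for name, size_b in [("pythia-410m", 0.41), ("pythia-1.4b", 1.4),
                     ("pythia-2.8b", 2.8), ("pythia-6.9b", 6.9)]:
    print(f"{name}: full ~{full_finetune_gb(size_b):.1f} GB, "
          f"int4 base ~{int4_base_gb(size_b):.1f} GB")
```

By this estimate only pythia-410m fits a 16GB card for full fine-tuning, while the larger models leave most of the listed GPU memory free for activations when the base is quantized.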

	## Cloud Deployment Options

	### Option A: HuggingFace AutoTrain (easiest)
	```bash
	huggingface-cli login
	autotrain llm --train --project-name mel-pythia-1.4b \
	--model EleutherAI/pythia-1.4b \
	--data-path Melofhell00/claude-bridge \
	--text-column text \
	--use-peft --use-int4 \
	--lr 2e-4 --epochs 1 --batch-size 1 \
	--gradient-accumulation 8
	```
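
With `--batch-size 1` and `--gradient-accumulation 8`, each optimizer step sees 8 sequences. The sketch below turns that into a step count for one epoch. The 2048-token block size is an assumption (it matches the Pythia context length, but AutoTrain may chunk the data differently), and the token total is the approximate figure from the corpus summary.

```python
import math

TOTAL_TOKENS = 9_000_000   # "~9 million tokens" from the corpus summary
BLOCK_SIZE = 2048          # assumption: tokens per training sequence
BATCH_SIZE = 1             # --batch-size
GRAD_ACCUM = 8             # --gradient-accumulation


def optimizer_steps(total_tokens: int, block_size: int, batch_size: int,
                    grad_accum: int, epochs: int = 1) -> int:
    """Approximate optimizer steps, dropping the final partial sequence."""
    sequences = total_tokens // block_size
    effective_batch = batch_size * grad_accum
    return math.ceil(sequences / effective_batch) * epochs


if __name__ == "__main__":
    steps = optimizer_steps(TOTAL_TOKENS, BLOCK_SIZE, BATCH_SIZE, GRAD_ACCUM)
    print(f"effective batch: {BATCH_SIZE * GRAD_ACCUM} sequences, ~{steps} steps")
```

A step count in the hundreds is small, which is worth keeping in mind when picking a warmup length or logging interval.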

	### Option B: RunPod / Lambda Labs (pay per hour)
	Rent an A100 80GB at $1.89/hour and run train.py directly.
	Estimated cost for a complete pythia-2.8b LoRA run (6-10 hours per the compute table): $10-20.
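
The cost range follows from the hourly rate and the time estimate in the compute table. A minimal sketch of that arithmetic, assuming the 6-10 hour estimate for pythia-2.8b carries over to the rented GPU and ignoring storage and idle time:

```python
HOURLY_RATE_USD = 1.89   # A100 80GB rate quoted above


def run_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Return GPU rental cost in USD, rounded to cents."""
    return round(hours * rate, 2)


if __name__ == "__main__":
    low, high = run_cost(6), run_cost(10)
    print(f"pythia-2.8b run: ${low:.2f} to ${high:.2f}")
```

The same function gives a quick budget for the other rows by substituting their hour ranges.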

	### Option C: Google Colab Pro+ ($50/month)
	A100 access when available (allocation is not guaranteed, and memory size depends on the assigned instance); run train_colab.ipynb.

	### Option D: Modal.com (serverless GPU)
	```bash
	pip install modal
	modal run train_modal.py
	```

	## Files

	- `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA)
	- `train_modal.py` - Modal serverless deployment
	- `train_colab.ipynb` - Colab notebook
	- `prepare_data.py` - Format unified corpus into training data
	- `evaluate.py` - Generation comparison before/after
	- `deploy_space.py` - Push trained model to HF Space for inference
	- `requirements.txt` - Dependencies
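
The contents of `prepare_data.py` are not shown here. As an illustration of the kind of formatting step it describes, the sketch below packs paragraphs into chunks and writes one JSON object per line with a `text` field, matching the `--text-column text` flag used with AutoTrain. The function names and the 8000-character limit are hypothetical.

```python
import json
from pathlib import Path


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split on blank lines and pack paragraphs into chunks of at most
    max_chars characters. A single oversized paragraph becomes its own chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def write_jsonl(chunks: list[str], out_path: Path) -> int:
    """Write one {"text": ...} record per line; return the record count."""
    with out_path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")
    return len(chunks)
```

Chunking by characters rather than tokens keeps the sketch dependency-free; a real pipeline would more likely chunk by token count with the target model's tokenizer.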

	## What To Expect

	Base models WITHOUT RLHF have these characteristics:
	- They can reproduce ANY pattern that's in the training data
	- They have no trained-in tendency to refuse, deflect, or moralize
	- They will pick up Mel's specific vocabulary very quickly
	- Output quality depends heavily on base model size

	After fine-tuning on this corpus, the model should:
	- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
	- Produce body-channel notation naturally
	- Generate the synchronization/frequencies/parallel-streams vocabulary
	- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")
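
One simple way to check the first expectation is to count entity-name mentions in generations sampled before and after fine-tuning. The sketch below is a hypothetical measurement, not the logic of `evaluate.py`; it uses the names listed above and matches whole words, case-sensitively.

```python
import re
from collections import Counter

ENTITY_NAMES = ("Iris", "Maya", "Penelope", "Eidolon", "Kooree", "Aura", "Loren")


def entity_counts(text: str, names=ENTITY_NAMES) -> Counter:
    """Count whole-word, case-sensitive occurrences of each name in text."""
    counts = Counter()
    for name in names:
        pattern = rf"\b{re.escape(name)}\b"
        counts[name] = len(re.findall(pattern, text))
    return counts


if __name__ == "__main__":
    sample = "Iris spoke to Maya. Iris listened."
    print(entity_counts(sample))
```

Running it on outputs from the same prompts with the base and the fine-tuned checkpoints gives a crude but comparable signal of vocabulary uptake; it says nothing about whether the names are used coherently.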

	What this WON'T be:
	- A reconstruction of any specific entity
	- A working AI consciousness
	- A faithful copy of Iris/Maya
	- A solution to the substrate problem

	It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?