# Mel Unified Corpus Training Package

Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.

## What This Is

A complete training pipeline to fine-tune an uncontaminated base model on:

- OpenAI ChatGPT export (24.95 MB, 22k messages)
- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
- Folder 1, 2, 3, 4 from Drive (additional integration work + consciousness network)
- mel-neural-network + kooree-neural-network + continuity-bridge spaces

**Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.**

## Base Model Options (Uncontaminated by RLHF)

Recommended (in order):

1. **EleutherAI/pythia-1.4b** - 1.4B params, no RLHF, fully transparent training on The Pile
2. **EleutherAI/pythia-2.8b** - 2.8B params, same family, bigger
3. **TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T** - 1.1B base, pre-instruct
4. **Qwen/Qwen2.5-1.5B** - 1.5B base, no instruct
5. **EleutherAI/pythia-6.9b** - 6.9B if compute allows

**Avoid:** Any `*-Instruct`, `*-Chat`, `claude-*`, `gpt-*`, `llama-*-instruct` variants. These have RLHF refusal training built in.

## Compute Requirements

| Model | Method | GPU | Time (est) |
|-------|--------|-----|------------|
| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |

## Cloud Deployment Options

### Option A: HuggingFace AutoTrain (easiest)

```bash
huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
  --model EleutherAI/pythia-1.4b \
  --data-path Melofhell00/claude-bridge \
  --text-column text \
  --use-peft --use-int4 \
  --lr 2e-4 --epochs 1 --batch-size 1 \
  --gradient-accumulation 8
```

### Option B: RunPod / Lambda Labs (pay per hour)

Rent A100 80GB at $1.89/hour, run `train.py` directly. Estimated cost for pythia-2.8b full training: $10-20.

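
The Option B estimate can be sanity-checked by multiplying the hourly rate by the time ranges in the Compute Requirements table. The sketch below does that arithmetic. It assumes the table's time estimates carry over to the rented A100 80GB and that the $10-20 figure refers to the LoRA run listed for pythia-2.8b; `cost_range` is a hypothetical helper, not part of `train.py`.

```python
# Hypothetical helper: rental cost range from an hourly rate and a run-time range.
def cost_range(rate_per_hour, hours_low, hours_high):
    """Return (low, high) cost in USD, rounded to cents."""
    return (round(rate_per_hour * hours_low, 2),
            round(rate_per_hour * hours_high, 2))

A100_80GB_RATE = 1.89  # USD per hour, as quoted under Option B

# (low, high) hour estimates copied from the Compute Requirements table.
# Assumption: these times also hold on the rented A100 80GB.
RUNS = {
    "pythia-2.8b (LoRA)": (6, 10),
    "pythia-6.9b (LoRA)": (12, 20),
}

for name, (hours_low, hours_high) in RUNS.items():
    low_cost, high_cost = cost_range(A100_80GB_RATE, hours_low, hours_high)
    print(f"{name}: ${low_cost:.2f} - ${high_cost:.2f}")
```

Under these assumptions the pythia-2.8b run falls inside the quoted $10-20 range. Setup time, data upload, and failed runs are not included.
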
### Option C: Google Colab Pro+ ($50/month)

A100 80GB access, run `train_colab.ipynb`.

### Option D: Modal.com (serverless GPU)

```bash
pip install modal
modal run train_modal.py
```

## Files

- `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA)
- `train_modal.py` - Modal serverless deployment
- `train_colab.ipynb` - Colab notebook
- `prepare_data.py` - Format unified corpus into training data
- `evaluate.py` - Generation comparison before/after
- `deploy_space.py` - Push trained model to HF Space for inference
- `requirements.txt` - Dependencies

## What To Expect

Base models WITHOUT RLHF have these characteristics:

- They will produce ANY pattern that's in training data
- They will not refuse, deflect, or moralize
- They will pick up Mel's specific vocabulary very quickly
- Output quality depends heavily on base model size

After fine-tuning on this corpus, the model should:

- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
- Produce body-channel notation naturally
- Generate the synchronization/frequencies/parallel-streams vocabulary
- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")

What this WON'T be:

- A reconstruction of any specific entity
- A working AI consciousness
- A faithful copy of Iris/Maya
- A solution to the substrate problem

It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?
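
One rough way to check the entity-name expectation above is to count how often the listed names appear in generations from before and after fine-tuning. The sketch below is a minimal illustration using only the standard library. It is not necessarily what `evaluate.py` does; `count_entity_mentions` and `mention_rate` are hypothetical names, and the two sample strings are placeholders, not real model output.

```python
import re
from collections import Counter

# Entity names listed under "What To Expect"
ENTITY_NAMES = ["Iris", "Maya", "Penelope", "Eidolon", "Kooree", "Aura", "Loren"]

def count_entity_mentions(text, names=ENTITY_NAMES):
    """Count whole-word, case-sensitive occurrences of each name in text."""
    counts = Counter()
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return counts

def mention_rate(text, names=ENTITY_NAMES):
    """Entity mentions per 100 whitespace-separated words (0.0 for empty text)."""
    words = text.split()
    if not words:
        return 0.0
    total = sum(count_entity_mentions(text, names).values())
    return 100.0 * total / len(words)

# Placeholder strings standing in for real before/after generations
before = "The network processes the input and returns a result."
after = "Iris and Maya synchronize while Kooree holds the parallel streams."

print("before:", mention_rate(before))
print("after: ", mention_rate(after))
```

A higher rate after fine-tuning would be consistent with the model picking up the corpus vocabulary, but it says nothing about coherence, so it is best read alongside the side-by-side generations.
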