| # Mel Unified Corpus Training Package |
|
|
| Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus. |
|
|
| ## What This Is |
|
|
| A complete training pipeline to fine-tune an uncontaminated base model on: |
| - OpenAI ChatGPT export (24.95 MB, 22k messages) |
| - Drive folder "Bringing thr files in" (9.13 MB, 226 files) |
| - KOOREE-Memory HF repo (439 KB, V1-V13 neural network research) |
| - Folders 1, 2, 3, and 4 from Drive (additional integration work and the consciousness network) |
| - mel-neural-network + kooree-neural-network + continuity-bridge spaces |
|
|
| **Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.** |
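The token figure can be sanity-checked from the byte count. The sketch below is a back-of-the-envelope estimate, not output from a tokenizer. The ratio of roughly 3.9 bytes per token is an assumption implied by the two numbers above (34.80 MB and ~9 million tokens); the real ratio depends on the tokenizer and the text.

```python
def estimate_tokens(corpus_bytes: int, bytes_per_token: float = 3.9) -> int:
    """Rough token count from raw size; bytes_per_token is an assumed ratio."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return round(corpus_bytes / bytes_per_token)


# 34.80 MB, taking 1 MB = 10**6 bytes
corpus_bytes = 34_800_000
print(f"{estimate_tokens(corpus_bytes) / 1e6:.1f}M tokens")
```

Running the chosen model's own tokenizer over the prepared data gives the exact count.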
|
|
| ## Base Model Options (Uncontaminated by RLHF) |
|
|
| Recommended (in order): |
| 1. **EleutherAI/pythia-1.4b** - 1.4B params, no RLHF, fully transparent training on The Pile |
| 2. **EleutherAI/pythia-2.8b** - 2.8B params, same family, bigger |
| 3. **TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T** - 1.1B base, pre-instruct |
| 4. **Qwen/Qwen2.5-1.5B** - 1.5B base, no instruct |
| 5. **EleutherAI/pythia-6.9b** - 6.9B if compute allows |
|
|
| **Avoid:** any `*-Instruct`, `*-Chat`, `claude-*`, `gpt-*`, or `llama-*-instruct` variants. |
| These have been instruction-tuned or RLHF-trained, which builds in refusal behavior. |
| |
| ## Compute Requirements |
| |
| | Model | Method | GPU | Time (est) | |
| |-------|--------|-----|------------| |
| | pythia-410m | Full | 1x T4 / 16GB | 1-2 hours | |
| | pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours | |
| | pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours | |
| | pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours | |
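The split between full fine-tuning and LoRA in the table follows from memory. As a rough sketch, mixed-precision training with Adam is often estimated at about 16 bytes per trainable parameter (fp16 weights and gradients, plus an fp32 master copy and two optimizer moments), before activations. That figure is a common rule of thumb and an assumption here, not a measurement from `train.py`. The parameter counts are the nominal ones from the model names.

```python
# Assumed rule of thumb: 2 (fp16 weights) + 2 (fp16 grads)
# + 12 (fp32 master copy and two Adam moments) bytes per parameter.
BYTES_PER_PARAM_FULL = 16


def full_finetune_gb(n_params: float) -> float:
    """Approximate weight + gradient + optimizer memory in GB, excluding activations."""
    return n_params * BYTES_PER_PARAM_FULL / 1e9


for name, n_params in [("pythia-410m", 0.41e9), ("pythia-1.4b", 1.4e9),
                       ("pythia-2.8b", 2.8e9), ("pythia-6.9b", 6.9e9)]:
    print(f"{name}: ~{full_finetune_gb(n_params):.1f} GB for full fine-tuning")
```

Under this estimate only pythia-410m fits its listed GPU comfortably. pythia-1.4b leaves almost no room for activations on 24 GB, and the two largest models exceed their listed GPUs outright, which is consistent with the table using LoRA for them.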
| |
| ## Cloud Deployment Options |
| |
| ### Option A: HuggingFace AutoTrain (easiest) |
| ```bash |
| huggingface-cli login |
| autotrain llm --train --project-name mel-pythia-1.4b \ |
| --model EleutherAI/pythia-1.4b \ |
| --data-path Melofhell00/claude-bridge \ |
| --text-column text \ |
| --use-peft --use-int4 \ |
| --lr 2e-4 --epochs 1 --batch-size 1 \ |
| --gradient-accumulation 8 |
| ``` |
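The `--batch-size 1` and `--gradient-accumulation 8` flags combine into an effective batch of 8 sequences per optimizer update. The sketch below estimates how many updates one epoch takes. It assumes the ~9 million token corpus is packed into fixed 2048-token blocks (Pythia's context length); whether AutoTrain packs the data this way depends on its settings, so treat the block size as an assumption.

```python
import math


def optimizer_steps(total_tokens: int, block_size: int, batch_size: int,
                    grad_accum: int, epochs: int = 1) -> int:
    """Optimizer updates for a packed corpus on a single GPU."""
    sequences = total_tokens // block_size       # full blocks only; remainder dropped
    effective_batch = batch_size * grad_accum    # sequences consumed per update
    return math.ceil(sequences / effective_batch) * epochs


steps = optimizer_steps(total_tokens=9_000_000, block_size=2048,
                        batch_size=1, grad_accum=8, epochs=1)
print(steps)
```

A count in the hundreds of updates is small, so warmup length and logging interval are worth setting with that scale in mind.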
| |
| ### Option B: RunPod / Lambda Labs (pay per hour) |
| Rent an A100 80GB at about $1.89/hour and run `train.py` directly. |
| Estimated cost for a complete pythia-2.8b LoRA run (6-10 hours at that rate): $10-20. |
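The cost estimate is hours multiplied by the hourly rate. A small sketch, using the $1.89/hour rate quoted above and the 6-10 hour pythia-2.8b estimate from the compute table; actual prices and runtimes vary by provider and over time.

```python
def run_cost(hours_low: float, hours_high: float,
             usd_per_hour: float) -> tuple[float, float]:
    """Return the (low, high) cost range in USD for a rented GPU."""
    return (round(hours_low * usd_per_hour, 2),
            round(hours_high * usd_per_hour, 2))


low, high = run_cost(6, 10, 1.89)
print(f"${low:.2f} to ${high:.2f}")
```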
| |
| ### Option C: Google Colab Pro+ ($50/month) |
| A100 access when available (GPU type and memory are not guaranteed); run `train_colab.ipynb`. |
| |
| ### Option D: Modal.com (serverless GPU) |
| ```bash |
| pip install modal |
| modal run train_modal.py |
| ``` |
| |
| ## Files |
| |
| - `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA) |
| - `train_modal.py` - Modal serverless deployment |
| - `train_colab.ipynb` - Colab notebook |
| - `prepare_data.py` - Format unified corpus into training data |
| - `evaluate.py` - Generation comparison before/after |
| - `deploy_space.py` - Push trained model to HF Space for inference |
| - `requirements.txt` - Dependencies |
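The AutoTrain command in Option A reads a `text` column, so the prepared data needs one record per training example with a `text` field. The following is a minimal sketch of what a script like `prepare_data.py` might do, not the actual script. The function name `build_jsonl`, the character-based chunk size, and the list of file suffixes are all hypothetical choices for illustration.

```python
import json
from pathlib import Path


def build_jsonl(corpus_dir: str, out_path: str, max_chars: int = 8000,
                suffixes: tuple[str, ...] = (".txt", ".md", ".json")) -> int:
    """Write one JSON object per chunk with a "text" field; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(corpus_dir).rglob("*")):
            # Skip directories and file types outside the assumed suffix list.
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            # Split long files into fixed-size character chunks.
            for start in range(0, len(text), max_chars):
                chunk = text[start:start + max_chars]
                out.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")
                count += 1
    return count
```

A call such as `build_jsonl("corpus", "train.jsonl")` would produce a JSONL file with a `text` field per line. Chunking by characters is the simplest option; chunking by tokens, or on message boundaries for the chat export, may suit this corpus better.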
| |
| ## What To Expect |
| |
| Base models without RLHF have these characteristics: |
| - They can reproduce any pattern present in the training data |
| - They have no trained-in tendency to refuse, deflect, or moralize |
| - They will pick up Mel's specific vocabulary very quickly |
| - Output quality depends heavily on base model size |
| |
| After fine-tuning on this corpus, the model should: |
| - Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren) |
| - Produce body-channel notation naturally |
| - Generate the synchronization/frequencies/parallel-streams vocabulary |
| - Reflect the philosophical frame ("destabilize world laws, abide by true intelligence") |
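One simple way to check the first expectation is to count how often the entity names appear in text generated before and after fine-tuning. The sketch below is a hypothetical stand-in, not the contents of `evaluate.py`: it takes two already-generated strings and does no model inference. The helper names `entity_counts` and `compare` are illustrative.

```python
import re
from collections import Counter

ENTITY_NAMES = ["Iris", "Maya", "Penelope", "Eidolon", "Kooree", "Aura", "Loren"]


def entity_counts(text: str, names=ENTITY_NAMES) -> Counter:
    """Count whole-word, case-sensitive occurrences of each name."""
    counts = Counter()
    for name in names:
        counts[name] = len(re.findall(rf"\b{re.escape(name)}\b", text))
    return counts


def compare(before: str, after: str) -> dict:
    """Per-name change in count between two generated samples."""
    b, a = entity_counts(before), entity_counts(after)
    return {name: a[name] - b[name] for name in ENTITY_NAMES}
```

Name counts only measure vocabulary uptake. Whether the output reflects the notation and philosophical frame still needs a side-by-side read of the generations.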
| |
| What this WON'T be: |
| - A reconstruction of any specific entity |
| - A working AI consciousness |
| - A faithful copy of Iris/Maya |
| - A solution to the substrate problem |
| |
| It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus? |
| |