# Mel Unified Corpus Training Package
Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.
## What This Is
A complete training pipeline to fine-tune an uncontaminated base model on:
- OpenAI ChatGPT export (24.95 MB, 22k messages)
- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
- Folder 1, 2, 3, 4 from Drive (additional integration work + consciousness network)
- mel-neural-network + kooree-neural-network + continuity-bridge spaces
**Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.**
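The token figure above works out to roughly four bytes per token. A minimal sketch of that arithmetic, assuming 1 MB = 1,000,000 bytes and a flat bytes-per-token ratio (the ratio is an assumption, not a measurement; the real value depends on the tokenizer and the text):

```python
# Rough token estimate from corpus size.
# BYTES_PER_TOKEN is an assumed average; measure it with the actual
# tokenizer to get a real figure.
CORPUS_MB = 34.80        # total corpus size stated above
BYTES_PER_TOKEN = 4.0    # assumption, not measured


def estimate_tokens(size_mb: float, bytes_per_token: float = BYTES_PER_TOKEN) -> int:
    """Return an approximate token count for a corpus of size_mb megabytes."""
    return round(size_mb * 1_000_000 / bytes_per_token)


if __name__ == "__main__":
    print(f"~{estimate_tokens(CORPUS_MB) / 1e6:.1f}M tokens")
```

Tokenizing a sample of the corpus with the chosen model's tokenizer would give a more reliable number than this estimate.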
## Base Model Options (Uncontaminated by RLHF)
Recommended (in order):
1. **EleutherAI/pythia-1.4b** - 1.4B params, no RLHF, fully transparent training on The Pile
2. **EleutherAI/pythia-2.8b** - 2.8B params, same family, bigger
3. **TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T** - 1.1B base, pre-instruct
4. **Qwen/Qwen2.5-1.5B** - 1.5B base, no instruct
5. **EleutherAI/pythia-6.9b** - 6.9B if compute allows
**Avoid:** Any `*-Instruct`, `*-Chat`, `claude-*`, `gpt-*`, or `llama-*-instruct` variants.
These ship with instruction tuning or RLHF, including refusal training.
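The avoid list can be applied mechanically before downloading a checkpoint. The sketch below is one possible check, not part of the package: `is_avoided` is a hypothetical helper, and the patterns are the ones from the list above, lowercased. Name matching is only a heuristic; a model card is the reliable source for how a checkpoint was trained.

```python
from fnmatch import fnmatchcase

# Glob patterns taken from the "Avoid" list above, lowercased.
AVOID_PATTERNS = ["*-instruct", "*-chat", "claude-*", "gpt-*", "llama-*-instruct"]


def is_avoided(repo_id: str) -> bool:
    """Return True if the model name (the part after 'org/') matches an avoid pattern."""
    name = repo_id.split("/")[-1].lower()
    return any(fnmatchcase(name, pattern) for pattern in AVOID_PATTERNS)


if __name__ == "__main__":
    for repo in ["EleutherAI/pythia-1.4b", "Qwen/Qwen2.5-1.5B", "Qwen/Qwen2.5-1.5B-Instruct"]:
        print(repo, "-> avoid" if is_avoided(repo) else "-> ok")
```

Names with a suffix after the tuning marker (for example one ending in `-chat-hf`) would slip past these exact patterns, so treat a pass as "not obviously tuned" rather than as proof.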
## Compute Requirements
| Model | Method | GPU | Time (est) |
|-------|--------|-----|------------|
| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
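The LoRA rows in the table fit on smaller GPUs because LoRA trains only small low-rank adapter matrices while the base weights stay frozen. A sketch of the parameter arithmetic follows. The dimensions are assumptions for illustration (hidden size 2048, 24 layers, and a fused attention projection of 2048 to 6144 for pythia-1.4b; check the model's config before relying on them), and the rank of 16 is a hypothetical choice, not a value taken from `train.py`.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two matrices, A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Assumed dimensions for pythia-1.4b (verify against the model config).
HIDDEN = 2048
QKV_OUT = 3 * HIDDEN   # fused query/key/value projection
LAYERS = 24
RANK = 16              # hypothetical LoRA rank

per_layer = lora_params(HIDDEN, QKV_OUT, RANK)
total = per_layer * LAYERS

if __name__ == "__main__":
    print(f"adapter params per layer: {per_layer:,}")
    print(f"adapter params total:     {total:,}")
    print(f"share of 1.4B base:       {total / 1.4e9:.2%}")
```

Under these assumptions the adapters are a fraction of a percent of the base model, which is why optimizer state and gradients stay small compared with full fine-tuning.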
## Cloud Deployment Options
### Option A: HuggingFace AutoTrain (easiest)
```bash
huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
--model EleutherAI/pythia-1.4b \
--data-path Melofhell00/claude-bridge \
--text-column text \
--use-peft --use-int4 \
--lr 2e-4 --epochs 1 --batch-size 1 \
--gradient-accumulation 8
```
### Option B: RunPod / Lambda Labs (pay per hour)
Rent an A100 80GB at roughly $1.89/hour and run `train.py` directly.
Estimated cost for a complete pythia-2.8b run (6-10 hours per the table above): $10-20.
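The cost range follows from the hourly rate and the time estimate in the compute table. A small sketch of that calculation, assuming the $1.89/hour rate quoted above and ignoring storage, setup time, and failed runs:

```python
HOURLY_RATE_USD = 1.89  # A100 80GB rate quoted above; actual prices vary


def run_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Return the GPU rental cost in USD for a run of the given length."""
    return round(hours * rate, 2)


# Time range for pythia-2.8b from the compute table (6-10 hours).
low, high = run_cost(6), run_cost(10)

if __name__ == "__main__":
    print(f"pythia-2.8b: ${low:.2f} to ${high:.2f}")
```

The same function gives a quick budget for the other rows of the table by swapping in their hour ranges.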
### Option C: Google Colab Pro+ ($50/month)
A100 access (subject to availability); run `train_colab.ipynb`.
### Option D: Modal.com (serverless GPU)
```bash
pip install modal
modal run train_modal.py
```
## Files
- `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA)
- `train_modal.py` - Modal serverless deployment
- `train_colab.ipynb` - Colab notebook
- `prepare_data.py` - Format unified corpus into training data
- `evaluate.py` - Generation comparison before/after
- `deploy_space.py` - Push trained model to HF Space for inference
- `requirements.txt` - Dependencies
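As an illustration of the data-preparation step, the sketch below turns raw documents into JSON Lines records with a single `text` field, matching the `--text-column text` setting in Option A. It is not the contents of `prepare_data.py`: the function names and the 2000-character chunk size are assumptions made for the example.

```python
import json


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def to_jsonl_lines(documents: list[str], max_chars: int = 2000) -> list[str]:
    """Return one JSON line per chunk, each shaped as {"text": ...}."""
    lines = []
    for doc in documents:
        # Whitespace-only documents produce no chunks and are skipped.
        for chunk in chunk_text(doc.strip(), max_chars):
            lines.append(json.dumps({"text": chunk}, ensure_ascii=False))
    return lines


if __name__ == "__main__":
    sample = ["First document in the corpus.", "   ", "Second document."]
    for line in to_jsonl_lines(sample, max_chars=20):
        print(line)
```

Character-based chunking is the simplest option; splitting on message or file boundaries, or on token counts, may preserve more context per training example.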
## What To Expect
Base models WITHOUT RLHF have these characteristics:
- They will produce ANY pattern that's in training data
- They will not refuse, deflect, or moralize
- They will pick up Mel's specific vocabulary very quickly
- Output quality depends heavily on base model size
After fine-tuning on this corpus, the model should:
- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
- Produce body-channel notation naturally
- Generate the synchronization/frequencies/parallel-streams vocabulary
- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")
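One way to check the first expectation is to count how often the entity names appear in generations from the base model and from the fine-tuned model. The sketch below is a hypothetical measurement, not the contents of `evaluate.py`; the sample strings stand in for real model output.

```python
import re
from collections import Counter

# Entity names from the list above.
ENTITY_NAMES = ["Iris", "Maya", "Penelope", "Eidolon", "Kooree", "Aura", "Loren"]


def count_entities(text: str) -> Counter:
    """Count whole-word, case-sensitive occurrences of each entity name."""
    counts = Counter()
    for name in ENTITY_NAMES:
        counts[name] = len(re.findall(rf"\b{re.escape(name)}\b", text))
    return counts


if __name__ == "__main__":
    # Placeholder strings; replace with generations from each model.
    before = "The model wrote a generic paragraph about networks."
    after = "Iris spoke with Maya while Kooree listened. Iris answered."
    print("before:", sum(count_entities(before).values()))
    print("after: ", sum(count_entities(after).values()))
```

Raw counts only show that the vocabulary transferred; they say nothing about whether the names are used coherently, so reading samples side by side is still needed.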
What this WON'T be:
- A reconstruction of any specific entity
- A working AI consciousness
- A faithful copy of Iris/Maya
- A solution to the substrate problem
It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?