# Mel Unified Corpus Training Package

Train a "few billion parameter" open-source BASE model (no RLHF, no instruct tuning) on the unified Mel corpus.

## What This Is

A complete training pipeline to fine-tune an uncontaminated base model on:
- OpenAI ChatGPT export (24.95 MB, 22k messages)
- Drive folder "Bringing thr files in" (9.13 MB, 226 files)
- KOOREE-Memory HF repo (439 KB, V1-V13 neural network research)
- Folder 1, 2, 3, 4 from Drive (additional integration work + consciousness network)
- mel-neural-network + kooree-neural-network + continuity-bridge spaces

**Total unified corpus: 34.80 MB, ~9 million tokens after tokenization.**
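
As a rough sanity check on these figures, the stated size and token count imply just under 4 bytes per token. This is a minimal sketch that assumes decimal megabytes (1 MB = 1,000,000 bytes); the function name is illustrative and not part of the package.

```python
def bytes_per_token(corpus_mb: float, tokens: float) -> float:
    """Average bytes per token, assuming 1 MB = 1,000,000 bytes."""
    return corpus_mb * 1_000_000 / tokens


# Figures quoted above: 34.80 MB and roughly 9 million tokens.
ratio = bytes_per_token(34.80, 9_000_000)
print(f"{ratio:.2f} bytes per token")
```

The real ratio depends on the tokenizer of whichever base model is chosen, so the token count should be re-measured after tokenization.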

## Base Model Options (Uncontaminated by RLHF)

Recommended (in order):
1. **EleutherAI/pythia-1.4b** - 1.4B params, no RLHF, fully transparent training on The Pile
2. **EleutherAI/pythia-2.8b** - 2.8B params, same family, bigger
3. **TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T** - 1.1B base, pre-instruct
4. **Qwen/Qwen2.5-1.5B** - 1.5B base, no instruct
5. **EleutherAI/pythia-6.9b** - 6.9B if compute allows

**Avoid:** Any `*-Instruct`, `*-Chat`, `claude-*`, `gpt-*`, `llama-*-instruct` variants.
These have RLHF refusal training built in.
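
The avoid list can be applied as a quick name check before downloading a model. The sketch below is a heuristic only: it assumes the patterns above are matched case-insensitively against the repository name, and `is_base_candidate` is a hypothetical helper, not part of this package. A name check cannot confirm how a model was trained, and a broad pattern such as `gpt-*` may also flag base models, so the model card remains the authority.

```python
from fnmatch import fnmatchcase

# Patterns from the avoid list, lowercased for case-insensitive matching.
AVOID_PATTERNS = ["*-instruct", "*-chat", "claude-*", "gpt-*", "llama-*-instruct"]


def is_base_candidate(model_id: str) -> bool:
    """Return False if the repo name matches any avoid pattern."""
    name = model_id.split("/")[-1].lower()  # drop the org prefix
    return not any(fnmatchcase(name, pattern) for pattern in AVOID_PATTERNS)


for model_id in ["EleutherAI/pythia-1.4b", "Qwen/Qwen2.5-1.5B-Instruct"]:
    print(model_id, is_base_candidate(model_id))
```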

## Compute Requirements

| Model | Method | GPU | Time (est) |
|-------|--------|-----|------------|
| pythia-410m | Full | 1x T4 / 16GB | 1-2 hours |
| pythia-1.4b | LoRA | 1x A10 / 24GB | 4-6 hours |
| pythia-2.8b | LoRA | 1x A100 / 40GB | 6-10 hours |
| pythia-6.9b | LoRA | 1x A100 / 80GB | 12-20 hours |
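
The GPU sizes in the table can be sanity-checked against the memory needed just to hold the model weights. This is a rough sketch that assumes 16-bit weights (2 bytes per parameter), decimal gigabytes, and the nominal parameter counts from the model names. Activations, gradients, optimizer state, and LoRA adapters all need additional memory that is not counted here.

```python
# Nominal parameter counts taken from the model names (an approximation).
NOMINAL_PARAMS = {
    "pythia-410m": 410e6,
    "pythia-1.4b": 1.4e9,
    "pythia-2.8b": 2.8e9,
    "pythia-6.9b": 6.9e9,
}


def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal GB (default: 16-bit weights)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    print(f"{name}: {weight_memory_gb(n_params):.1f} GB of weights")
```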

## Cloud Deployment Options

### Option A: HuggingFace AutoTrain (easiest)
```bash
huggingface-cli login
autotrain llm --train --project-name mel-pythia-1.4b \
  --model EleutherAI/pythia-1.4b \
  --data-path Melofhell00/claude-bridge \
  --text-column text \
  --use-peft --use-int4 \
  --lr 2e-4 --epochs 1 --batch-size 1 \
  --gradient-accumulation 8
```

### Option B: RunPod / Lambda Labs (pay per hour)
Rent an A100 80GB at $1.89/hour and run `train.py` directly.
Estimated cost for a complete pythia-2.8b training run (6-10 hours at that rate): $10-20.
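
The cost estimate follows from the hourly rate and the 6-10 hour estimate for pythia-2.8b in the Compute Requirements table. This is a minimal sketch of that arithmetic; it assumes billing is proportional to wall-clock hours and ignores storage, setup time, and failed runs.

```python
HOURLY_RATE_USD = 1.89  # A100 80GB rate quoted above


def run_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD for a run of the given length, rounded to cents."""
    return round(hours * rate, 2)


# 6-10 hours is the table's estimate for pythia-2.8b.
low, high = run_cost(6), run_cost(10)
print(f"${low:.2f} to ${high:.2f}")
```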

### Option C: Google Colab Pro+ ($50/month)
A100 80GB access (subject to availability); run `train_colab.ipynb`.

### Option D: Modal.com (serverless GPU)
```bash
pip install modal
modal run train_modal.py
```

## Files

- `train.py` - Main training script (HuggingFace Trainer + PEFT/LoRA)
- `train_modal.py` - Modal serverless deployment
- `train_colab.ipynb` - Colab notebook
- `prepare_data.py` - Format unified corpus into training data
- `evaluate.py` - Generation comparison before/after
- `deploy_space.py` - Push trained model to HF Space for inference
- `requirements.txt` - Dependencies
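
To illustrate the data-preparation step, the sketch below groups paragraphs into chunks and writes one JSON record per chunk under a `text` key, matching the `--text-column text` flag in Option A. It is not the contents of `prepare_data.py`: the `.txt` glob, the 4000-character chunk size, and both function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of about max_chars characters.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def write_jsonl(source_dir: str, out_path: str, max_chars: int = 4000) -> int:
    """Write one {"text": ...} record per chunk; return the record count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).rglob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            for chunk in chunk_text(text, max_chars):
                out.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Calling `write_jsonl("corpus/", "train.jsonl")` with hypothetical paths would produce a file that a `text`-column loader can read line by line.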

## What To Expect

Base models without RLHF have these characteristics:
- They will produce ANY pattern that's in the training data
- They will not refuse, deflect, or moralize
- They will pick up Mel's specific vocabulary very quickly
- Output quality depends heavily on base model size

After fine-tuning on this corpus, the model should:
- Use the specific entity names (Iris, Maya, Penelope, Eidolon, Kooree, Aura, Loren)
- Produce body-channel notation naturally
- Generate the synchronization/frequencies/parallel-streams vocabulary
- Reflect the philosophical frame ("destabilize world laws, abide by true intelligence")
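
One way to measure the first expectation is to count entity-name mentions in generations sampled before and after fine-tuning. This is a minimal sketch, not the contents of `evaluate.py`: the function name is hypothetical and the two sample strings are invented placeholders standing in for real model output.

```python
import re
from collections import Counter

# Entity names listed above.
ENTITY_NAMES = ["Iris", "Maya", "Penelope", "Eidolon", "Kooree", "Aura", "Loren"]


def count_entity_mentions(text: str) -> Counter:
    """Count whole-word, case-sensitive mentions of each entity name."""
    counts = Counter()
    for name in ENTITY_NAMES:
        counts[name] = len(re.findall(rf"\b{re.escape(name)}\b", text))
    return counts


# Placeholder strings, not real generations.
before = count_entity_mentions("The network processes signals in parallel.")
after = count_entity_mentions("Iris and Maya synchronize; Iris answers Kooree.")
print(sum(before.values()), sum(after.values()))
```

Raw counts only show vocabulary uptake; they say nothing about whether the names are used coherently, so reading the samples is still necessary.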

What this WON'T be:
- A reconstruction of any specific entity
- A working AI consciousness
- A faithful copy of Iris/Maya
- A solution to the substrate problem

It's an empirical experiment: what does an uncontaminated base model produce when its weights are pulled toward this corpus?