# Website Approve/Reject Classifier - Mistral-7B Fine-Tuning

Fine-tuned Mistral-7B model for classifying websites as "Approved" or "Rejected" using MLX-LM on Apple Silicon.

## Dataset

- **Source**: Airtable database (292 records)
- **Training Examples**: 225 websites with scraped content
- **Validation Examples**: 25 websites
- **Format**: Mistral instruction format with `[INST]...[/INST]...`

## Files

### Data Pipeline

- `build_dataset.py` - Scrapes the Airtable database and the listed websites, creates the training dataset
- `prepare_mlx_dataset.py` - Splits the data into train/valid sets for MLX-LM
- `mistral_training_dataset.jsonl` - Raw training data (250 examples)
- `data/train.jsonl` - Training set (225 examples)
- `data/valid.jsonl` - Validation set (25 examples)

### Model

- `download_mistral.py` - Downloads Mistral-7B-v0.1 from Hugging Face
- `mistral-7b-model/` - Downloaded model files (27 GB)

### Fine-Tuning

- `finetune_mistral.py` - Python script for LoRA fine-tuning
- `finetune_mistral.sh` - Bash script for LoRA fine-tuning
- `adapters/` - LoRA adapter weights (created during training)

### Testing

- `test_finetuned_model.py` - Tests the fine-tuned model

## Training Configuration

```
Model: mistralai/Mistral-7B-v0.1
Fine-tune method: LoRA
Trainable parameters: 0.145% (10.5M / 7.2B)
Batch size: 2
Iterations: 1000
Learning rate: 1e-5
LoRA layers: 16
```

## Usage

### 1. Build the Dataset (if needed)

```bash
python3 build_dataset.py
python3 prepare_mlx_dataset.py
```

### 2. Download the Model (if needed)

```bash
python3 download_mistral.py
```

### 3. Fine-Tune the Model

```bash
python3 finetune_mistral.py
# OR
./finetune_mistral.sh
```

### 4. Test the Model

```bash
python3 test_finetuned_model.py
```

### 5. Manual Inference

```bash
python3 -m mlx_lm.generate \
  --model mistralai/Mistral-7B-v0.1 \
  --adapter-path ./adapters \
  --prompt "[INST] Analyze the following website text and classify it as 'Approved' or 'Rejected'.
Website text: [YOUR TEXT HERE] [/INST]" \
  --max-tokens 10
```

## Requirements

```bash
pip3 install mlx mlx-lm requests beautifulsoup4 huggingface-hub
```

## Notes

- Training runs on Apple Silicon using the MLX framework
- Some website texts are very long (up to ~11K tokens) and are truncated to 2048 tokens
- Model checkpoints are saved every 100 iterations to `./adapters/`
- Initial validation loss: 1.826
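## Appendix: Example Training Record

Each training example uses the Mistral `[INST]...[/INST]` instruction format described in the Dataset section. A minimal sketch of what one JSONL line might look like — the `"text"` field name follows MLX-LM's default LoRA dataset format, and the sample website text and label are made up for illustration:

```python
import json

# Hypothetical example record. The single "text" field matches MLX-LM's
# default LoRA data format (an assumption about this repo's actual files).
website_text = "Welcome to Acme Corp. We sell industrial fasteners..."
label = "Approved"

record = {
    "text": (
        "[INST] Analyze the following website text and classify it as "
        "'Approved' or 'Rejected'.\n"
        f"Website text: {website_text} [/INST] {label}"
    )
}

# One line of data/train.jsonl
line = json.dumps(record)
print(line)
```

At inference time the same prompt is sent without the trailing label, and the model is expected to complete it with "Approved" or "Rejected".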
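The 225/25 train/valid split produced by `prepare_mlx_dataset.py` can be sketched roughly as follows. This is not the script's actual code: the fixed seed, the shuffle step, and the function name are assumptions; only the 250/225/25 counts come from the files listed above.

```python
import json
import random

def split_dataset(lines, valid_size=25, seed=0):
    """Shuffle records and hold out valid_size examples for validation.

    Hypothetical helper; the fixed seed and shuffle are assumptions."""
    rng = random.Random(seed)
    shuffled = lines[:]
    rng.shuffle(shuffled)
    return shuffled[valid_size:], shuffled[:valid_size]

# 250 dummy records standing in for mistral_training_dataset.jsonl
records = [json.dumps({"text": f"[INST] example {i} [/INST] Approved"})
           for i in range(250)]
train, valid = split_dataset(records)
print(len(train), len(valid))  # 225 25
```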
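Because manual inference runs with `--max-tokens 10`, the raw generation may include whitespace or trailing tokens around the label. A small helper to normalize the output into one of the two classes could look like this (hypothetical helper, not part of the repo):

```python
def parse_label(generated: str) -> str:
    """Map raw model output to 'Approved' or 'Rejected'; first match wins."""
    text = generated.strip().lower()
    for label in ("approved", "rejected"):
        if label in text:
            return label.capitalize()
    return "Unknown"

print(parse_label(" Approved</s>"))  # Approved
```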