# Website Approve/Reject Classifier - Mistral-7B Fine-Tuning
A Mistral-7B model fine-tuned with LoRA via MLX-LM on Apple Silicon to classify website text as "Approved" or "Rejected".
## Dataset
- **Source**: Airtable database (292 records)
- **Training Examples**: 225 websites with scraped content
- **Validation Examples**: 25 websites
- **Format**: Mistral instruction format, `[INST] <prompt> [/INST] <label>` (example record below)
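Each line of the JSONL files is one record. A minimal sketch of how a record is assembled, assuming MLX-LM's plain `{"text": ...}` JSONL schema; the prompt wording mirrors the Manual Inference command later in this README, and the sample website text is invented:
```python
import json

def make_record(website_text: str, label: str) -> str:
    # Prompt wording mirrors the inference command in this README;
    # the completion after [/INST] is the label the model must learn.
    prompt = (
        "[INST] Analyze the following website text and classify it as "
        f"'Approved' or 'Rejected'. Website text: {website_text} [/INST]"
    )
    return json.dumps({"text": f"{prompt} {label}"})

print(make_record("Welcome to Acme Corp, a family-run hardware store...", "Approved"))
```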
## Files
### Data Pipeline
- `build_dataset.py` - Scrapes Airtable + websites, creates training dataset
- `prepare_mlx_dataset.py` - Splits the data into train/valid sets for MLX-LM (sketched after this list)
- `mistral_training_dataset.jsonl` - Raw training data (250 examples)
- `data/train.jsonl` - Training set (225 examples)
- `data/valid.jsonl` - Validation set (25 examples)
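A hypothetical reconstruction of what `prepare_mlx_dataset.py` does, assuming a seeded shuffle followed by a fixed 225/25 cut of the 250 raw examples (the actual script may differ):
```python
import json
import os
import random

with open("mistral_training_dataset.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)  # assumed seed; fixed so the split is reproducible
random.shuffle(examples)

os.makedirs("data", exist_ok=True)
splits = {"data/train.jsonl": examples[:225], "data/valid.jsonl": examples[225:]}
for path, rows in splits.items():
    with open(path, "w") as f:
        f.writelines(json.dumps(row) + "\n" for row in rows)
```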
### Model
- `download_mistral.py` - Downloads Mistral-7B-v0.1 from HuggingFace
- `mistral-7b-model/` - Downloaded model files (27GB)
### Fine-Tuning
- `finetune_mistral.py` - Python script for LoRA fine-tuning
- `finetune_mistral.sh` - Bash script for LoRA fine-tuning
- `adapters/` - LoRA adapter weights (created during training)
### Testing
- `test_finetuned_model.py` - Test the fine-tuned model
## Training Configuration
```
Model: mistralai/Mistral-7B-v0.1
Fine-tune method: LoRA
Trainable parameters: 0.145% (10.5M / 7.2B)
Batch size: 2
Iterations: 1000
Learning rate: 1e-5
LoRA layers: 16
```
## Usage
### 1. Build Dataset (if needed)
```bash
python3 build_dataset.py
python3 prepare_mlx_dataset.py
```
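A hypothetical sketch of the page-scraping step inside `build_dataset.py`, assuming it uses requests and BeautifulSoup (both in the requirements below); the Airtable pull and labeling are omitted:
```python
import requests
from bs4 import BeautifulSoup

def scrape_text(url: str, timeout: int = 10) -> str:
    # Assumed scraping step: fetch a page and flatten it to plain text.
    # The real build_dataset.py also pulls records/labels from Airtable.
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return " ".join(soup.get_text(separator=" ").split())
```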
### 2. Download Model (if needed)
```bash
python3 download_mistral.py
```
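A hypothetical sketch of what `download_mistral.py` does, assuming it uses `snapshot_download` from huggingface-hub (listed in the requirements below):
```python
from huggingface_hub import snapshot_download

# Assumed implementation: mirror the HF repo into ./mistral-7b-model/
# (the ~27GB of files listed under "Model" above).
snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    local_dir="mistral-7b-model",
)
```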
### 3. Fine-Tune Model
```bash
python3 finetune_mistral.py
# OR
./finetune_mistral.sh
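# OR call MLX-LM's LoRA trainer directly; this sketch uses flag names
# from older mlx-lm releases (newer ones rename --lora-layers to
# --num-layers), so check `python3 -m mlx_lm.lora --help` first
python3 -m mlx_lm.lora \
  --model mistralai/Mistral-7B-v0.1 \
  --train \
  --data ./data \
  --batch-size 2 \
  --iters 1000 \
  --learning-rate 1e-5 \
  --lora-layers 16 \
  --adapter-path ./adapters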
```
### 4. Test Model
```bash
python3 test_finetuned_model.py
```
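The fine-tuned adapter can also be exercised from Python through MLX-LM's API. A minimal sketch, assuming the `load`/`generate` helpers exported by `mlx_lm` (what `test_finetuned_model.py` actually does may differ, e.g. looping over `data/valid.jsonl`):
```python
from mlx_lm import load, generate

# Load the base weights and apply the LoRA adapter trained above.
model, tokenizer = load("mistralai/Mistral-7B-v0.1", adapter_path="./adapters")

prompt = (
    "[INST] Analyze the following website text and classify it as "
    "'Approved' or 'Rejected'. Website text: Welcome to Acme Corp... [/INST]"
)

# A handful of tokens suffices; the reply should be "Approved" or "Rejected".
print(generate(model, tokenizer, prompt=prompt, max_tokens=10))
```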
### 5. Manual Inference
```bash
python3 -m mlx_lm.generate \
  --model mistralai/Mistral-7B-v0.1 \
  --adapter-path ./adapters \
  --prompt "[INST] Analyze the following website text and classify it as 'Approved' or 'Rejected'. Website text: [YOUR TEXT HERE] [/INST]" \
  --max-tokens 10
```
## Requirements
```bash
pip3 install mlx mlx-lm requests beautifulsoup4 huggingface-hub
```
## Notes
- Training runs on Apple Silicon using the MLX framework
- Some website texts are very long (up to ~11K tokens) and are truncated to the 2048-token training context (a pre-truncation sketch follows this list)
- Model checkpoints are saved every 100 iterations to `./adapters/`
- Initial validation loss: 1.826
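Since the cut happens at the token level, over-long pages can be pre-truncated before the prompt is built. A sketch using the Hugging Face tokenizer for the same base model (an assumption; `build_dataset.py` may handle long pages differently, and this needs the `transformers` package, which is not in the pip line above):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def truncate_text(text: str, max_tokens: int = 1900) -> str:
    # Keep headroom below the 2048-token window for the [INST] wrapper
    # and the label; 1900 is an illustrative budget, not a measured one.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[:max_tokens])
```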