# Website Approve/Reject Classifier - Mistral-7B Fine-Tuning

Fine-tuned Mistral-7B model for classifying websites as "Approved" or "Rejected" using MLX-LM on Apple Silicon.

## Dataset

- **Source**: Airtable database (292 records)
- **Training Examples**: 225 websites with scraped content
- **Validation Examples**: 25 websites
- **Format**: Mistral instruction format with `[INST]...[/INST]...`

## Files

### Data Pipeline

- `build_dataset.py` - Scrapes the Airtable database and the listed websites, creates the training dataset
- `prepare_mlx_dataset.py` - Splits the data into train/valid sets for MLX-LM
- `mistral_training_dataset.jsonl` - Raw training data (250 examples)
- `data/train.jsonl` - Training set (225 examples)
- `data/valid.jsonl` - Validation set (25 examples)

### Model

- `download_mistral.py` - Downloads Mistral-7B-v0.1 from Hugging Face
- `mistral-7b-model/` - Downloaded model files (27 GB)

### Fine-Tuning

- `finetune_mistral.py` - Python script for LoRA fine-tuning
- `finetune_mistral.sh` - Bash script for LoRA fine-tuning
- `adapters/` - LoRA adapter weights (created during training)

### Testing

- `test_finetuned_model.py` - Tests the fine-tuned model

## Training Configuration

```
Model: mistralai/Mistral-7B-v0.1
Fine-tune method: LoRA
Trainable parameters: 0.145% (10.5M / 7.2B)
Batch size: 2
Iterations: 1000
Learning rate: 1e-5
LoRA layers: 16
```

## Usage

### 1. Build the Dataset (if needed)

```bash
python3 build_dataset.py
python3 prepare_mlx_dataset.py
```

### 2. Download the Model (if needed)

```bash
python3 download_mistral.py
```

### 3. Fine-Tune the Model

```bash
python3 finetune_mistral.py
# OR
./finetune_mistral.sh
```

### 4. Test the Model

```bash
python3 test_finetuned_model.py
```

### 5. Manual Inference

```bash
python3 -m mlx_lm.generate \
  --model mistralai/Mistral-7B-v0.1 \
  --adapter-path ./adapters \
  --prompt "[INST] Analyze the following website text and classify it as 'Approved' or 'Rejected'.
Website text: [YOUR TEXT HERE] [/INST]" \
  --max-tokens 10
```

## Requirements

```bash
pip3 install mlx mlx-lm requests beautifulsoup4 huggingface-hub
```

## Notes

- Training runs on Apple Silicon using the MLX framework
- Some website texts are very long (up to ~11K tokens) and are truncated to 2048 tokens
- Model checkpoints are saved every 100 iterations to `./adapters/`
- Initial validation loss: 1.826
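## Appendix: Example Training Record

Each training example uses the Mistral `[INST]...[/INST]` instruction format described in the Dataset section. A minimal sketch of what one JSONL line might look like — the `"text"` field name follows MLX-LM's default LoRA dataset format, and the sample website text and label are made up for illustration:

```python
import json

# Hypothetical example record. The single "text" field matches MLX-LM's
# default LoRA data format (an assumption about this repo's actual files).
website_text = "Welcome to Acme Corp. We sell industrial fasteners..."
label = "Approved"

record = {
    "text": (
        "[INST] Analyze the following website text and classify it as "
        "'Approved' or 'Rejected'.\n"
        f"Website text: {website_text} [/INST] {label}"
    )
}

# One line of data/train.jsonl
line = json.dumps(record)
print(line)
```

At inference time the same prompt is sent without the trailing label, and the model is expected to complete it with "Approved" or "Rejected".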
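The 225/25 train/valid split produced by `prepare_mlx_dataset.py` can be sketched roughly as follows. This is not the script's actual code: the fixed seed, the shuffle step, and the function name are assumptions; only the 250/225/25 counts come from the files listed above.

```python
import json
import random

def split_dataset(lines, valid_size=25, seed=0):
    """Shuffle records and hold out valid_size examples for validation.

    Hypothetical helper; the fixed seed and shuffle are assumptions."""
    rng = random.Random(seed)
    shuffled = lines[:]
    rng.shuffle(shuffled)
    return shuffled[valid_size:], shuffled[:valid_size]

# 250 dummy records standing in for mistral_training_dataset.jsonl
records = [json.dumps({"text": f"[INST] example {i} [/INST] Approved"})
           for i in range(250)]
train, valid = split_dataset(records)
print(len(train), len(valid))  # 225 25
```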
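Because manual inference runs with `--max-tokens 10`, the raw generation may include whitespace or trailing tokens around the label. A small helper to normalize the output into one of the two classes could look like this (hypothetical helper, not part of the repo):

```python
def parse_label(generated: str) -> str:
    """Map raw model output to 'Approved' or 'Rejected'; first match wins."""
    text = generated.strip().lower()
    for label in ("approved", "rejected"):
        if label in text:
            return label.capitalize()
    return "Unknown"

print(parse_label(" Approved</s>"))  # Approved
```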