# ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

## Overview

The project includes two ML models:

1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
2. **Package Embeddings**: Pre-computed semantic embeddings of common Python packages, used for similarity matching

## Training the Models

### Step 1: Install Training Dependencies

```bash
pip install scikit-learn sentence-transformers numpy
```

### Step 2: Train Conflict Prediction Model

```bash
cd "code to upload"
python train_conflict_model.py
```

This will:
- Load the synthetic dataset (`synthetic_requirements_dataset.json`)
- Extract features from requirements
- Train a Random Forest classifier
- Save the model to `models/conflict_predictor.pkl`
- Display accuracy and feature importance

**Expected Output:**
- Model size: ~2-5 MB
- Test accuracy: ~85-95% (depending on dataset)
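
The feature-extraction step listed above can be pictured with a small sketch. The function name `extract_features` and the specific features (package count, pinned versus ranged specifiers, duplicates) are assumptions chosen for illustration; the features actually used by `train_conflict_model.py` may differ.

```python
import re

# Hypothetical feature set; the real training script may use different features.
_SPEC_RE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(.*)$")

def extract_features(requirements_text: str) -> dict:
    """Turn requirements.txt-style text into a flat dict of numeric features."""
    names = []
    pinned = ranged = unpinned = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = _SPEC_RE.match(line)
        if not match:
            continue
        name, spec = match.group(1).lower(), match.group(2).strip()
        names.append(name)
        if spec.startswith("=="):
            pinned += 1      # exact pin, e.g. numpy==1.24.0
        elif spec:
            ranged += 1      # any other specifier, e.g. pandas>=2.0
        else:
            unpinned += 1    # bare package name
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unpinned": unpinned,
        "num_duplicates": len(names) - len(set(names)),
    }

sample = "numpy==1.24.0\npandas>=2.0  # data frames\nrequests\nnumpy<1.20\n"
features = extract_features(sample)
print(features)
```

A dict like this can be converted to a fixed-order list of numbers before it is passed to a classifier.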

### Step 3: Generate Package Embeddings

```bash
python generate_embeddings.py
```

This will:
- Load a sentence transformer model
- Generate embeddings for common Python packages
- Save embeddings to `models/package_embeddings.json`
- Save model info to `models/embedding_info.json`

**Expected Output:**
- Embeddings file: ~5-10 MB
- Embedding dimension: 384
- Number of packages: 100+

## Model Files Structure

After training, you should have:

```
code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata
```

## Integration in Main App

The models are automatically loaded when available:

1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings
2. **Package Similarity**: Enhances spell-checking with semantic matching

### Features

- **Graceful Fallback**: If models aren't available, the app works with rule-based methods
- **Lazy Loading**: Models load only when needed
- **Error Handling**: ML failures don't break the app
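
The three behaviours above can be combined in one small wrapper. This is a minimal sketch, not the app's actual implementation: the class name `LazyPredictor`, the fallback rule in `rule_based_check`, and the placeholder feature vector are all hypothetical.

```python
import pickle
from pathlib import Path

def rule_based_check(requirements_text):
    """Hypothetical fallback rule: flag a package pinned to two different versions."""
    pins = {}
    for line in requirements_text.splitlines():
        name, sep, version = line.strip().partition("==")
        if not sep:
            continue  # not an exact pin
        name, version = name.strip().lower(), version.strip()
        if pins.setdefault(name, version) != version:
            return True
    return False

class LazyPredictor:
    """Loads a pickled model on first use and falls back to rules if that fails."""

    def __init__(self, model_path="models/conflict_predictor.pkl"):
        self.model_path = Path(model_path)
        self._model = None
        self._load_attempted = False

    def _get_model(self):
        # Lazy loading: the file is read only on the first prediction.
        if not self._load_attempted:
            self._load_attempted = True
            try:
                # Only unpickle files you trust; pickle can execute arbitrary code.
                with self.model_path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception:
                self._model = None  # graceful fallback: stay rule-based
        return self._model

    def predict(self, requirements_text):
        """Return (has_conflict, source), where source is 'ml' or 'rules'."""
        model = self._get_model()
        if model is not None:
            try:
                # Placeholder features; real ones must match the training script.
                features = [[len(requirements_text.splitlines())]]
                return bool(model.predict(features)[0]), "ml"
            except Exception:
                pass  # error handling: an ML failure must not break the app
        return rule_based_check(requirements_text), "rules"

predictor = LazyPredictor("models/does_not_exist.pkl")
print(predictor.predict("numpy==1.24.0\nnumpy==1.19.0"))
```

Recording that a load was attempted means a missing model file costs one failed read rather than one per prediction.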

## Usage in Code

### Conflict Prediction

```python
from ml_models import ConflictPredictor

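# Example input: the contents of a requirements.txt file
requirements_text = "numpy==1.24.0\npandas>=2.0.0"

predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)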

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")
```

### Package Similarity

```python
from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns (package, score) pairs, e.g. [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
```
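
For intuition, a similarity lookup over pre-computed vectors can be sketched as a cosine-similarity ranking. This assumes the embeddings are available as a `{package_name: vector}` mapping and that the query has already been encoded into a vector; `rank_similar` and the 3-dimensional toy vectors are hypothetical and stand in for the real 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (norm_a * norm_b)

def rank_similar(query_vec, embeddings, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data for illustration only.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]
ranked = rank_similar(query, toy_embeddings, top_k=2)
print(ranked)
```

With roughly a hundred packages, a linear scan like this is cheap, so no approximate nearest-neighbour index is needed.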

## Hugging Face Spaces Deployment

### Option 1: Include Models in Repo

1. Train models locally
2. Commit model files to the repo
3. Models load automatically on Spaces

**Pros**: Simple, no external dependencies  
**Cons**: Larger repo size (~7-15 MB, based on the file sizes listed above)

### Option 2: Upload to Hugging Face Hub

1. Train models locally
2. Upload to Hugging Face Hub:
   ```python
   from huggingface_hub import upload_file
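   # upload_file takes keyword-only arguments; path_in_repo is required
   upload_file(
       path_or_fileobj="models/conflict_predictor.pkl",
       path_in_repo="conflict_predictor.pkl",
       repo_id="your-username/conflict-predictor",
   )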
   ```
3. Load from Hub in app:
   ```python
   from huggingface_hub import hf_hub_download
   model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
   ```

**Pros**: Smaller repo, version control for models  
**Cons**: Requires internet connection at startup

## Performance

- **Conflict Prediction**: <10 ms per prediction
- **Embedding Lookup**: <1 ms (pre-computed) or ~50 ms (on-the-fly)
- **Model Loading**: ~1-2 seconds at startup
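
These figures depend on hardware, so it is worth measuring them on the target machine. The sketch below is one way to do that; `time_call` and `dummy_predict` are hypothetical names, and `dummy_predict` is only a stand-in to be replaced with the real `predictor.predict` call.

```python
import time

def time_call(fn, *args, repeats=100):
    """Return the mean wall-clock time of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats

def dummy_predict(text):
    # Stand-in for predictor.predict; swap in the real call to measure it.
    return len(text.splitlines()) > 1

mean_ms = time_call(dummy_predict, "numpy==1.24.0\npandas>=2.0")
print(f"mean latency: {mean_ms:.4f} ms")
```

Averaging over many repeats smooths out one-off costs such as the first lazy model load.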

## Troubleshooting

### Models Not Loading

- Check that `models/` directory exists
- Verify model files are present
- Check file permissions
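
The checks above can be scripted. This is a minimal sketch that assumes the file names from the Model Files Structure section; the helper name `check_model_files` is hypothetical.

```python
import os
from pathlib import Path

EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)

def check_model_files(models_dir="models"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
    return problems

for problem in check_model_files():
    print(problem)
```

Run it from the directory that contains `models/`, since the default path is relative.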

### Low Prediction Accuracy

- Retrain with more data
- Adjust feature engineering
- Try different model parameters

### Embeddings Not Working

- Ensure `sentence-transformers` is installed
- Check internet connection (for first-time model download)
- Verify embeddings file format
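
To verify the embeddings file, a quick structural check helps. The sketch below assumes the file is a JSON object mapping each package name to a list of 384 numbers; that layout is an assumption, so adjust it to whatever `generate_embeddings.py` actually writes. The helper name `validate_embeddings` is hypothetical.

```python
import json

def validate_embeddings(path, expected_dim=384):
    """Check an assumed {"package": [float, ...]} layout; return a list of problems."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError) as exc:  # missing file or invalid JSON
        return [f"cannot read {path}: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for name, vec in data.items():
        if not isinstance(vec, list) or not all(
            isinstance(x, (int, float)) for x in vec
        ):
            problems.append(f"{name}: not a list of numbers")
        elif len(vec) != expected_dim:
            problems.append(f"{name}: dimension {len(vec)}, expected {expected_dim}")
    return problems

for problem in validate_embeddings("models/package_embeddings.json"):
    print(problem)
```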

## Future Improvements

- [ ] Train on larger, real-world dataset
- [ ] Add version-specific embeddings
- [ ] Implement online learning
- [ ] Add confidence intervals
- [ ] Support for custom model paths