# ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

## Overview

The project includes two ML models:

1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
2. **Package Embeddings**: Pre-computed semantic embeddings of common Python packages, used for similarity matching

## Training the Models

### Step 1: Install Training Dependencies

```bash
pip install scikit-learn sentence-transformers numpy
```

### Step 2: Train Conflict Prediction Model

```bash
cd "code to upload"
python train_conflict_model.py
```

This will:
- Load the synthetic dataset (`synthetic_requirements_dataset.json`)
- Extract features from requirements
- Train a Random Forest classifier
- Save the model to `models/conflict_predictor.pkl`
- Display accuracy and feature importance

**Expected Output:**
- Model size: ~2-5 MB
- Test accuracy: ~85-95% (depending on dataset)
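
The feature extraction step is not shown in this guide. As a rough illustration only, the sketch below turns a requirements file into a few numeric features of the kind a Random Forest could consume. The function name `extract_features` and the specific features are assumptions made for this example; they are not necessarily what `train_conflict_model.py` computes.

```python
import re

# Matches a package name, optional extras, and an optional version specifier,
# e.g. "numpy>=1.21". Simplified for illustration; not the full PEP 508 grammar.
_REQ_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*(.*)$")


def extract_features(requirements_text: str) -> dict[str, int]:
    """Count simple structural properties of a requirements file (hypothetical features)."""
    names = []
    pinned = ranged = unconstrained = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options such as "-r"
            continue
        match = _REQ_RE.match(line)
        if match is None:
            continue
        name, _extras, spec = match.groups()
        names.append(name.lower())
        if "==" in spec:
            pinned += 1  # exact pin, e.g. "numpy==1.21.0"
        elif spec:
            ranged += 1  # any other specifier, e.g. "pandas>=1.3"
        else:
            unconstrained += 1  # bare package name
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unconstrained": unconstrained,
        "num_duplicates": len(names) - len(set(names)),
    }


features = extract_features("numpy==1.21.0\npandas>=1.3\nrequests\nnumpy<1.20  # duplicate\n")
print(features)
```

A dictionary like this can be converted to a fixed-order list of values before being passed to a classifier.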

### Step 3: Generate Package Embeddings

```bash
python generate_embeddings.py
```

This will:
- Load a sentence transformer model
- Generate embeddings for common Python packages
- Save embeddings to `models/package_embeddings.json`
- Save model info to `models/embedding_info.json`

**Expected Output:**
- Embeddings file: ~5-10 MB
- Embedding dimension: 384
- Number of packages: 100+

## Model Files Structure

After training, you should have:

```
code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata
```

## Integration in Main App

The models are automatically loaded when available:

1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings
2. **Package Similarity**: Enhances spell-checking with semantic matching

### Features

- **Graceful Fallback**: If models aren't available, the app works with rule-based methods
- **Lazy Loading**: Models load only when needed
- **Error Handling**: ML failures don't break the app
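
One way to combine lazy loading with a graceful fallback is sketched below. The `LazyModel` class and `analysis_mode` function are hypothetical names used for illustration; the app's actual loading code in `ml_models` may be structured differently.

```python
import pickle
from pathlib import Path
from typing import Any, Optional


class LazyModel:
    """Load a pickled model on first use; return None if loading fails."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._model: Optional[Any] = None
        self._attempted = False  # ensures the load is tried at most once

    def get(self) -> Optional[Any]:
        if not self._attempted:
            self._attempted = True
            try:
                # Only unpickle files you trust: pickle can execute arbitrary code.
                with self._path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except (OSError, pickle.UnpicklingError, EOFError, ImportError, AttributeError) as exc:
                # Missing, corrupt, or incompatible file: continue without ML.
                print(f"ML model unavailable ({exc}); falling back to rule-based checks")
        return self._model


def analysis_mode(lazy: LazyModel) -> str:
    """Report which code path the caller would take."""
    return "ml" if lazy.get() is not None else "rule-based"


# The path below is deliberately nonexistent to demonstrate the fallback.
lazy_model = LazyModel("models/does_not_exist.pkl")
mode = analysis_mode(lazy_model)
print(mode)
```

Nothing is read from disk until `get()` is first called, and a failed load is remembered so the app does not retry on every request.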

## Usage in Code

### Conflict Prediction

```python
from ml_models import ConflictPredictor

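# Example input: the raw contents of a requirements.txt file
requirements_text = "requests==2.31.0\nflask>=2.0\n"

predictor = ConflictPredictor()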
has_conflict, confidence = predictor.predict(requirements_text)

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")
```

### Package Similarity

```python
from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
```
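
For intuition, a similarity lookup over pre-computed vectors can be sketched with plain cosine similarity. This is not the `PackageEmbeddings` implementation: the standalone `find_similar` function, the name-to-vector dictionary layout, and the toy 3-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(
    query_vec: list[float],
    embeddings: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for the real 384-dimensional embeddings.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]  # in practice, the embedding of the (possibly misspelled) name
results = find_similar(query, toy_embeddings, top_k=2)
print([name for name, _score in results])
```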

## Hugging Face Spaces Deployment

### Option 1: Include Models in Repo

1. Train models locally
2. Commit model files to the repo
3. Models load automatically on Spaces

**Pros**: Simple, no external dependencies  
**Cons**: Larger repo size (~10-15 MB)

### Option 2: Upload to Hugging Face Hub

1. Train models locally
2. Upload to Hugging Face Hub:
   ```python
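   from huggingface_hub import upload_file
   # upload_file takes keyword-only arguments; path_in_repo is required
   upload_file(path_or_fileobj="models/conflict_predictor.pkl", path_in_repo="conflict_predictor.pkl",
               repo_id="your-username/conflict-predictor")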
   ```
3. Load from Hub in app:
   ```python
   from huggingface_hub import hf_hub_download
   model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
   ```

**Pros**: Smaller repo, version control for models  
**Cons**: Requires an internet connection at startup unless the file is already in the local cache

## Performance

- **Conflict Prediction**: <10ms per prediction
- **Embedding Lookup**: <1ms (pre-computed) or ~50ms (on-the-fly)
- **Model Loading**: ~1-2 seconds at startup
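
These figures depend on hardware and model size, so it may be worth measuring them in your own environment. The helper below is a minimal sketch using only the standard library; `time_call` is a hypothetical name, and the lambda is a stand-in workload to be replaced with a real call such as `predictor.predict(requirements_text)`.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 100) -> dict[str, float]:
    """Return the median and maximum wall-clock latency of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Stand-in workload; swap in the prediction or lookup call you want to measure.
stats = time_call(lambda: sum(range(1000)), repeats=50)
print(f"median {stats['median_ms']:.3f} ms, max {stats['max_ms']:.3f} ms")
```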

## Troubleshooting

### Models Not Loading

- Check that `models/` directory exists
- Verify model files are present
- Check file permissions
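
The three checks above can be scripted. The sketch below assumes it is run from the directory that contains `models/`; the function name `check_model_files` is hypothetical, while the file names are the ones listed under Model Files Structure.

```python
import os
from pathlib import Path

EXPECTED_FILES = [
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
]


def check_model_files(models_dir: str = "models") -> dict[str, str]:
    """Map each expected file to 'ok', 'missing', or 'unreadable'."""
    root = Path(models_dir)
    report = {}
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            report[name] = "missing"  # also covers a missing models/ directory
        elif not os.access(path, os.R_OK):
            report[name] = "unreadable"  # file exists but permissions block reading
        else:
            report[name] = "ok"
    return report


report = check_model_files("models")
for name, status in report.items():
    print(f"{name}: {status}")
```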

### Low Prediction Accuracy

- Retrain with more data
- Adjust feature engineering
- Try different model parameters

### Embeddings Not Working

- Ensure `sentence-transformers` is installed
- Check internet connection (for first-time model download)
- Verify embeddings file format
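
To verify the embeddings file format, a small validator can help. The sketch below assumes `package_embeddings.json` is a JSON object mapping each package name to a list of numbers; that layout is an assumption, so adjust the checks if `generate_embeddings.py` writes a different structure. The demo uses a temporary file with 4-dimensional toy vectors instead of the real 384-dimensional ones.

```python
import json
import tempfile
from pathlib import Path


def validate_embeddings(path: str, expected_dim: int = 384) -> list[str]:
    """Return a list of problems found; an empty list means the file looks valid."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
        return [f"cannot read {path}: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for name, vec in data.items():
        if not isinstance(vec, list) or len(vec) != expected_dim:
            problems.append(f"{name}: expected a list of {expected_dim} numbers")
    return problems


# Self-contained demo: "pandas" deliberately has the wrong dimension.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "package_embeddings.json"
    demo.write_text(json.dumps({"numpy": [0.1] * 4, "pandas": [0.2] * 3}), encoding="utf-8")
    problems = validate_embeddings(str(demo), expected_dim=4)
print(problems)
```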

## Future Improvements

- [ ] Train on larger, real-world dataset
- [ ] Add version-specific embeddings
- [ ] Implement online learning
- [ ] Add confidence intervals
- [ ] Support for custom model paths