# ML Models Integration Guide
This document explains how to train and use the ML models for conflict prediction and package similarity.
## Overview
The project includes two ML models:
1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
2. **Package Embeddings**: Pre-computed semantic embeddings of common Python packages, used for similarity matching
## Training the Models
### Step 1: Install Training Dependencies
```bash
pip install scikit-learn sentence-transformers numpy
```
### Step 2: Train Conflict Prediction Model
```bash
cd "code to upload"
python train_conflict_model.py
```
This will:
- Load the synthetic dataset (`synthetic_requirements_dataset.json`)
- Extract features from requirements
- Train a Random Forest classifier
- Save the model to `models/conflict_predictor.pkl`
- Display accuracy and feature importance
**Expected Output:**
- Model size: ~2-5 MB
- Test accuracy: ~85-95% (depending on dataset)
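The feature set used by `train_conflict_model.py` is not documented here. As a rough illustration of what "extract features from requirements" can mean, the sketch below turns a requirements text into a small numeric vector. The function name `extract_features` and the four chosen features are assumptions made for this example, not the script's actual implementation.

```python
import re

# Hypothetical helper: the real train_conflict_model.py may use different features.
# Matches a package name followed by an optional version specifier.
_SPEC_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def extract_features(requirements_text: str) -> list[float]:
    """Map a requirements.txt-style string to a fixed-length numeric vector.

    Returns [total packages, pinned (==), bounded (<, >, ~=, !=), unconstrained].
    """
    n_packages = n_pinned = n_bounded = n_unconstrained = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _SPEC_RE.match(line)
        if match is None:  # e.g. option lines such as "-r other.txt"
            continue
        spec = match.group(2)
        n_packages += 1
        if "==" in spec:
            n_pinned += 1
        elif any(op in spec for op in ("<", ">", "~=", "!=")):
            n_bounded += 1
        else:
            n_unconstrained += 1
    return [float(n_packages), float(n_pinned), float(n_bounded), float(n_unconstrained)]


features = extract_features("numpy==1.24.0\npandas>=2.0\nrequests  # http client\n")
print(features)
```

A vector like this could be passed to a scikit-learn classifier; which features actually help is an empirical question for the dataset at hand.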
### Step 3: Generate Package Embeddings
```bash
python generate_embeddings.py
```
This will:
- Load a sentence transformer model
- Generate embeddings for common Python packages
- Save embeddings to `models/package_embeddings.json`
- Save model info to `models/embedding_info.json`
**Expected Output:**
- Embeddings file: ~5-10 MB
- Embedding dimension: 384
- Number of packages: 100 or more
## Model Files Structure
After training, you should have:
```
code to upload/
├── models/
│   ├── conflict_predictor.pkl     # Classification model
│   ├── package_embeddings.json    # Pre-computed embeddings
│   └── embedding_info.json        # Model metadata
```
## Integration in Main App
The models are automatically loaded when available:
1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings
2. **Package Similarity**: Enhances spell-checking with semantic matching
### Features
- **Graceful Fallback**: If models aren't available, the app works with rule-based methods
- **Lazy Loading**: Models load only when needed
- **Error Handling**: ML failures don't break the app
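One way these three properties can fit together is sketched below. `LazyModel` and `predict_with_fallback` are hypothetical names introduced for illustration, and the app's own loader may be organized differently. The model is read from disk only on first use, and any load or prediction failure routes the call to a rule-based function instead of raising.

```python
import pickle
from pathlib import Path


class LazyModel:
    """Load a pickled model on first use; return None if it cannot be loaded."""

    def __init__(self, path):
        self.path = Path(path)
        self._model = None
        self._attempted = False

    def get(self):
        if not self._attempted:  # try to load at most once
            self._attempted = True
            try:
                # Only unpickle files you trust: pickle can execute arbitrary code.
                with self.path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except (OSError, pickle.UnpicklingError, EOFError) as exc:
                print(f"ML model unavailable ({exc}); using rule-based fallback")
        return self._model


def predict_with_fallback(text, holder, ml_predict, rule_based):
    """Use ml_predict(model, text) when a model is loaded, else rule_based(text)."""
    model = holder.get()
    if model is None:
        return rule_based(text)
    try:
        return ml_predict(model, text)
    except Exception:  # an ML failure must not break the app
        return rule_based(text)


# Demonstration with a path that does not exist, so the fallback is taken.
holder = LazyModel("models/missing_model.pkl")
result = predict_with_fallback(
    "numpy==1.24.0",
    holder,
    ml_predict=lambda model, text: "ml",
    rule_based=lambda text: "rules",
)
print(result)
```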
## Usage in Code
### Conflict Prediction
```python
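from ml_models import ConflictPredictor

# Example input: the contents of a requirements.txt file
requirements_text = "numpy==1.24.0\npandas>=2.0\n"

predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)
if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")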
```
### Package Similarity
```python
from ml_models import PackageEmbeddings
embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns (name, score) pairs, e.g. [('numpy', 0.95), ('scipy', 0.72), ...]
best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
```
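How `find_similar` ranks candidates is not spelled out in this guide. A common approach for pre-computed embeddings is cosine similarity, sketched below in plain Python. The function names and the toy three-dimensional vectors are assumptions for illustration. The real embeddings have 384 dimensions, and the query vector would come from the sentence transformer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, embeddings, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data only: real package embeddings are not 3-dimensional.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "requests": [0.0, 0.0, 1.0],
}
ranked = rank_by_similarity([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(ranked)
```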
## Hugging Face Spaces Deployment
### Option 1: Include Models in Repo
1. Train models locally
2. Commit model files to the repo
3. Models load automatically on Spaces
**Pros**: Simple, no external dependencies
**Cons**: Larger repo size (~10-15 MB)
### Option 2: Upload to Hugging Face Hub
1. Train models locally
2. Upload to Hugging Face Hub:
```python
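from huggingface_hub import upload_file

# upload_file takes keyword arguments; path_in_repo is the destination name in the repo
upload_file(
    path_or_fileobj="models/conflict_predictor.pkl",
    path_in_repo="conflict_predictor.pkl",
    repo_id="your-username/conflict-predictor",
)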
```
3. Load from Hub in app:
```python
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
```
**Pros**: Smaller repo, version control for models
**Cons**: Requires internet connection at startup
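The startup dependency on the network can be softened by preferring a local copy when one exists. The helper below is a minimal sketch under that assumption. `resolve_model_path` is a hypothetical name, and `huggingface_hub` is imported only when a download is actually needed.

```python
from pathlib import Path


def resolve_model_path(local_path, repo_id, filename):
    """Return a usable model path, downloading from the Hub only if needed."""
    local = Path(local_path)
    if local.is_file():
        return str(local)  # local copy found: no network access required
    # Imported lazily so the app still starts without huggingface_hub installed
    # as long as the local file is present.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

Called as `resolve_model_path("models/conflict_predictor.pkl", "your-username/conflict-predictor", "conflict_predictor.pkl")`, this returns the local file when it exists and otherwise falls back to the Hub download shown above.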
## Performance
- **Conflict Prediction**: <10ms per prediction
- **Embedding Lookup**: <1ms (pre-computed) or ~50ms (on-the-fly)
- **Model Loading**: ~1-2 seconds at startup
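These figures will likely vary with hardware and model size, so it can be worth measuring them in your own environment. The sketch below uses a hypothetical helper, `median_latency_ms`, to time any callable with `time.perf_counter`. The summed range is only a stand-in workload for something like a prediction call.

```python
import statistics
import time


def median_latency_ms(fn, *args, repeats=100):
    """Call fn(*args) `repeats` times and return the median latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace with e.g. a call to the predictor in your app.
latency = median_latency_ms(sum, range(1000), repeats=20)
print(f"median latency: {latency:.3f} ms")
```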
## Troubleshooting
### Models Not Loading
- Check that `models/` directory exists
- Verify model files are present
- Check file permissions
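The three checks above can be run in one pass with a small script. This is a sketch, and `diagnose_models` is a hypothetical helper name. The file names are those listed under "Model Files Structure".

```python
import os
from pathlib import Path

EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)


def diagnose_models(models_dir="models"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
    return problems


for problem in diagnose_models("models"):
    print(problem)
```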
### Low Prediction Accuracy
- Retrain with more data
- Adjust feature engineering
- Try different model parameters
### Embeddings Not Working
- Ensure `sentence-transformers` is installed
- Check internet connection (for first-time model download)
- Verify embeddings file format
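The layout of `package_embeddings.json` is not specified in this guide. Assuming it is a JSON object mapping each package name to a list of 384 numbers, a format check could look like the sketch below. Both function names are hypothetical, and the check should be adapted if the file is structured differently.

```python
import json


def validate_embeddings(data, expected_dim=384):
    """Return a list of problems in a {name: vector} mapping (assumed layout)."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for name, vec in data.items():
        is_numeric_list = isinstance(vec, list) and all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in vec
        )
        if not is_numeric_list:
            problems.append(f"{name}: vector is not a list of numbers")
        elif len(vec) != expected_dim:
            problems.append(f"{name}: expected {expected_dim} values, found {len(vec)}")
    return problems


def validate_embeddings_file(path, expected_dim=384):
    """Load a JSON file and validate it; raises if the file is not valid JSON."""
    with open(path, encoding="utf-8") as fh:
        return validate_embeddings(json.load(fh), expected_dim)


# Toy check with 2-dimensional vectors.
issues = validate_embeddings({"numpy": [0.1, 0.2], "pandas": [0.1]}, expected_dim=2)
print(issues)
```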
## Future Improvements
- [ ] Train on larger, real-world dataset
- [ ] Add version-specific embeddings
- [ ] Implement online learning
- [ ] Add confidence intervals
- [ ] Support for custom model paths