# ML Models Integration Guide
This document explains how to train and use the ML models for conflict prediction and package similarity.
## Overview
The project includes two ML models:
1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
2. **Package Embeddings**: Pre-computed semantic embeddings for common Python packages for similarity matching
## Training the Models
### Step 1: Install Training Dependencies
```bash
pip install scikit-learn sentence-transformers numpy
```
### Step 2: Train Conflict Prediction Model
```bash
cd "code to upload"
python train_conflict_model.py
```
This will:
- Load the synthetic dataset (`synthetic_requirements_dataset.json`)
- Extract features from requirements
- Train a Random Forest classifier
- Save the model to `models/conflict_predictor.pkl`
- Display accuracy and feature importance
**Expected Output:**
- Model size: ~2-5 MB
- Test accuracy: ~85-95% (depending on dataset)
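The "extract features from requirements" step can be pictured with a minimal sketch like the one below. The feature names and the `extract_features` helper are hypothetical and chosen only for illustration; the features actually used by `train_conflict_model.py` may differ.

```python
import re

# Hypothetical parser: a package name followed by an optional version specifier.
SPEC_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def extract_features(requirements_text: str) -> dict:
    """Turn raw requirements text into simple numeric features (illustrative only)."""
    names = []
    pinned = ranged = unpinned = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = SPEC_RE.match(line)
        if match is None:
            continue
        name, spec = match.group(1).lower(), match.group(2).strip()
        names.append(name)
        if "==" in spec:
            pinned += 1      # exact pin, e.g. numpy==1.21.0
        elif spec:
            ranged += 1      # any other specifier, e.g. pandas>=1.3
        else:
            unpinned += 1    # bare package name
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unpinned": unpinned,
        "num_duplicates": len(names) - len(set(names)),
    }
```

A dictionary like this would typically be converted to a fixed-order numeric vector before being passed to a classifier.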
### Step 3: Generate Package Embeddings
```bash
python generate_embeddings.py
```
This will:
- Load a sentence transformer model
- Generate embeddings for common Python packages
- Save embeddings to `models/package_embeddings.json`
- Save model info to `models/embedding_info.json`
**Expected Output:**
- Embeddings file: ~5-10 MB
- Embedding dimension: 384
- Number of packages: ~100+
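To confirm the expected output after generation, a small check along these lines may help. It assumes that `package_embeddings.json` is a JSON object mapping each package name to a list of floats; that layout is an assumption, not something this guide specifies, and `check_embeddings` is a hypothetical helper.

```python
import json
from pathlib import Path


def check_embeddings(embeddings, expected_dim=384):
    """Return a list of problems found in a {package_name: vector} mapping."""
    if not isinstance(embeddings, dict) or not embeddings:
        return ["expected a non-empty JSON object mapping package names to vectors"]
    problems = []
    for name, vector in embeddings.items():
        if not isinstance(vector, list):
            problems.append(f"{name}: vector is not a list")
        elif len(vector) != expected_dim:
            problems.append(f"{name}: dimension {len(vector)}, expected {expected_dim}")
        elif not all(isinstance(x, (int, float)) for x in vector):
            problems.append(f"{name}: non-numeric entry")
    return problems


def check_embeddings_file(path="models/package_embeddings.json", expected_dim=384):
    """Load the embeddings file (assumed layout) and validate every vector."""
    with Path(path).open(encoding="utf-8") as fh:
        return check_embeddings(json.load(fh), expected_dim)
```

An empty list from `check_embeddings_file()` means every vector has the expected dimension.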
## Model Files Structure
After training, you should have:
```
code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata
```
## Integration in Main App
The models are automatically loaded when available:
1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings
2. **Package Similarity**: Enhances spell-checking with semantic matching
### Features
- **Graceful Fallback**: If the models aren't available, the app falls back to rule-based methods
- **Lazy Loading**: Models load only when needed
- **Error Handling**: ML failures don't break the app
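The three behaviours listed above can be combined in one small wrapper. The following is a minimal sketch, not the project's actual `ml_models` code: `LazyConflictModel`, `rule_based_check`, and the shape of `features` are all hypothetical, and the pickled object is only assumed to expose a scikit-learn style `predict` method.

```python
import pickle
from pathlib import Path


def rule_based_check(features):
    # Placeholder rule (hypothetical): features[0] is assumed to be a duplicate count.
    return features[0] > 0


class LazyConflictModel:
    """Loads a pickled model on first use and falls back to rules when unavailable."""

    def __init__(self, model_path="models/conflict_predictor.pkl"):
        self.model_path = Path(model_path)
        self._model = None
        self._load_attempted = False

    def _get_model(self):
        # Lazy loading: the file is read only on the first prediction.
        if not self._load_attempted:
            self._load_attempted = True
            try:
                with self.model_path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except (OSError, EOFError, pickle.UnpicklingError, ImportError, AttributeError):
                self._model = None  # graceful fallback: keep running without ML
        return self._model

    def predict(self, features):
        """Return (has_conflict, source), where source is 'ml' or 'rules'."""
        model = self._get_model()
        if model is not None:
            try:
                return bool(model.predict([features])[0]), "ml"
            except Exception:
                pass  # error handling: an ML failure must not break the app
        return rule_based_check(features), "rules"
```

Because `pickle.load` can execute arbitrary code, a wrapper like this should only be pointed at model files from a trusted source.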
## Usage in Code
### Conflict Prediction
```python
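from ml_models import ConflictPredictor

predictor = ConflictPredictor()
# Example input (assumed format): the raw text of a requirements.txt file
requirements_text = "numpy==1.21.0\npandas>=1.3.0\n"
has_conflict, confidence = predictor.predict(requirements_text)
if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")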
```
### Package Similarity
```python
from ml_models import PackageEmbeddings
embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]
best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
```
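Similarity scores like the ones returned above are commonly computed as cosine similarity between embedding vectors. The sketch below shows that ranking step on toy data; whether `PackageEmbeddings` uses exactly this measure is an assumption, and `top_k_similar` and the 3-dimensional vectors are made up for illustration (the real vectors have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vector, embeddings, top_k=3):
    """Rank stored packages by cosine similarity to query_vector, best first."""
    scored = [(name, cosine_similarity(query_vector, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for this example.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
ranked = top_k_similar([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(ranked)  # numpy ranks first, scipy second
```

In the real pipeline the query vector would come from encoding the (possibly misspelled) package name with the sentence transformer rather than being supplied by hand.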
## Hugging Face Spaces Deployment
### Option 1: Include Models in Repo
1. Train models locally
2. Commit model files to the repo
3. Models load automatically on Spaces
- **Pros**: Simple, no external dependencies
- **Cons**: Larger repo size (~10-15 MB)
### Option 2: Upload to Hugging Face Hub
1. Train models locally
2. Upload to Hugging Face Hub:
```python
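from huggingface_hub import upload_file
# upload_file takes keyword-only arguments; path_in_repo is the destination name in the Hub repo
upload_file(path_or_fileobj="models/conflict_predictor.pkl", path_in_repo="conflict_predictor.pkl",
            repo_id="your-username/conflict-predictor")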
```
3. Load from Hub in app:
```python
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
```
- **Pros**: Smaller repo, version control for models
- **Cons**: Requires an internet connection at startup unless the file is already in the local Hub cache
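The two options can also be combined: use a bundled file when it exists and download from the Hub only otherwise. The helper below is a hedged sketch; `resolve_model_path` and its `downloader` parameter are hypothetical names, and only `hf_hub_download(repo_id=..., filename=...)` is taken from the snippet above.

```python
from pathlib import Path


def resolve_model_path(local_path, repo_id, filename, downloader=None):
    """Prefer a local copy of the model; otherwise fetch it from the Hub."""
    local = Path(local_path)
    if local.is_file():
        return str(local)
    if downloader is None:
        # Imported here so the Hub client is only needed when no local copy exists.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download
    return downloader(repo_id=repo_id, filename=filename)
```

Passing a custom `downloader` makes the function easy to exercise in tests without network access.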
## Performance
- **Conflict Prediction**: <10ms per prediction
- **Embedding Lookup**: <1ms (pre-computed) or ~50ms (on-the-fly)
- **Model Loading**: ~1-2 seconds at startup
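These figures depend on hardware and model size, so it can be worth measuring them locally. A minimal timing helper, using only the standard library, might look like this; `time_call` is a hypothetical name, not part of the project.

```python
import time


def time_call(func, *args, repeats=100):
    """Return the mean wall-clock time per call in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


# Example with a built-in function; swap in predictor.predict and a
# requirements string to measure conflict prediction instead.
mean_ms = time_call(sorted, [3, 1, 2], repeats=1000)
print(f"{mean_ms:.4f} ms per call")
```

The first call to a lazily loaded model includes load time, so it is usually better to make one warm-up call before timing.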
## Troubleshooting
### Models Not Loading
- Check that `models/` directory exists
- Verify model files are present
- Check file permissions
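The three checks above can be scripted. The sketch below uses the file names from the Model Files Structure section; `diagnose_models` is a hypothetical helper and the default `models` path assumes the script runs from the project directory.

```python
import os
from pathlib import Path

EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)


def diagnose_models(models_dir="models"):
    """Return {file_name: status} for each expected model file."""
    base = Path(models_dir)
    if not base.is_dir():
        return {name: "models directory missing" for name in EXPECTED_FILES}
    report = {}
    for name in EXPECTED_FILES:
        path = base / name
        if not path.is_file():
            report[name] = "missing"
        elif not os.access(path, os.R_OK):
            report[name] = "not readable"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for file_name, status in diagnose_models().items():
        print(f"{file_name}: {status}")
```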
### Low Prediction Accuracy
- Retrain with more data
- Adjust feature engineering
- Try different model parameters
### Embeddings Not Working
- Ensure `sentence-transformers` is installed
- Check internet connection (for first-time model download)
- Verify embeddings file format
## Future Improvements
- [ ] Train on larger, real-world dataset
- [ ] Add version-specific embeddings
- [ ] Implement online learning
- [ ] Add confidence intervals
- [ ] Support for custom model paths