| # ML Models Integration Guide | |
| This document explains how to train and use the ML models for conflict prediction and package similarity. | |
| ## Overview | |
| The project includes two ML models: | |
| 1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts | |
| 2. **Package Embeddings**: Pre-computed semantic embeddings of common Python packages, used for similarity matching | |
| ## Training the Models | |
| ### Step 1: Install Training Dependencies | |
| ```bash | |
| pip install scikit-learn sentence-transformers numpy | |
| ``` | |
| ### Step 2: Train Conflict Prediction Model | |
| ```bash | |
| cd "code to upload" | |
| python train_conflict_model.py | |
| ``` | |
| This will: | |
| - Load the synthetic dataset (`synthetic_requirements_dataset.json`) | |
| - Extract features from requirements | |
| - Train a Random Forest classifier | |
| - Save the model to `models/conflict_predictor.pkl` | |
| - Display accuracy and feature importance | |
| **Expected Output:** | |
| - Model size: ~2-5 MB | |
| - Test accuracy: ~85-95% (depending on dataset) | |
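
The feature-extraction step can be pictured with a small sketch. The function name `extract_features` and the four features it computes (package count, pinned count, upper-bound count, duplicate count) are illustrative assumptions; the features that `train_conflict_model.py` actually uses may differ.

```python
import re

# Hypothetical feature extractor; the real training script's features may differ.
_SPEC_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")


def extract_features(requirements_text: str) -> list[float]:
    """Turn a requirements.txt-style string into a fixed-length numeric vector."""
    names = []
    pinned = 0
    upper_bounded = 0
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _SPEC_RE.match(line)
        if match is None:
            continue
        name, spec = match.group(1).lower(), match.group(2)
        names.append(name)
        if "==" in spec:
            pinned += 1          # exact pin such as numpy==1.21.0
        if "<" in spec:
            upper_bounded += 1   # upper bound such as pandas<2.0
    duplicates = len(names) - len(set(names))
    return [float(len(names)), float(pinned), float(upper_bounded), float(duplicates)]


sample = "numpy==1.21.0\npandas<2.0  # cap\nnumpy>=1.24\n"
features = extract_features(sample)
print(features)
```

A vector like this is the kind of input a Random Forest classifier can be trained on; a fixed length keeps every sample the same shape regardless of how many packages the file lists.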
| ### Step 3: Generate Package Embeddings | |
| ```bash | |
| python generate_embeddings.py | |
| ``` | |
| This will: | |
| - Load a sentence transformer model | |
| - Generate embeddings for common Python packages | |
| - Save embeddings to `models/package_embeddings.json` | |
| - Save model info to `models/embedding_info.json` | |
| **Expected Output:** | |
| - Embeddings file: ~5-10 MB | |
| - Embedding dimension: 384 | |
| - Number of packages: 100 or more | |
| ## Model Files Structure | |
| After training, you should have: | |
| ``` | |
| code to upload/ | |
| ├── models/ | |
| │   ├── conflict_predictor.pkl # Classification model | |
| │   ├── package_embeddings.json # Pre-computed embeddings | |
| │   └── embedding_info.json # Model metadata | |
| ``` | |
| ## Integration in Main App | |
| The models are automatically loaded when available: | |
| 1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings | |
| 2. **Package Similarity**: Enhances spell-checking with semantic matching | |
| ### Features | |
| - **Graceful Fallback**: If models aren't available, the app falls back to rule-based methods | |
| - **Lazy Loading**: Models load only when needed | |
| - **Error Handling**: ML failures don't break the app | |
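
The three behaviours above can be sketched together. This is a minimal illustration under assumptions, not the app's actual code: `LazyModel`, `rule_based_check`, and `has_conflict` are hypothetical names, and the loaded object is assumed to expose a `predict(text)` method that returns a truthy value when a conflict is expected.

```python
import pickle
from pathlib import Path


class LazyModel:
    """Load a pickled model on first use; return None instead of raising."""

    def __init__(self, path):
        self.path = Path(path)
        self._model = None
        self._tried = False

    def get(self):
        if not self._tried:
            self._tried = True  # attempt the load only once
            try:
                with self.path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception:
                self._model = None  # missing or unreadable file: use the fallback
        return self._model


def rule_based_check(requirements_text):
    """Placeholder rule (an assumption): flag a package listed more than once."""
    names = [
        line.split("=")[0].split("<")[0].split(">")[0].strip().lower()
        for line in requirements_text.splitlines()
        if line.strip()
    ]
    return len(names) != len(set(names))


def has_conflict(requirements_text, lazy_model):
    model = lazy_model.get()
    if model is not None:
        try:
            return bool(model.predict(requirements_text))
        except Exception:
            pass  # an ML failure must not break the app
    return rule_based_check(requirements_text)


lazy = LazyModel("models/does_not_exist.pkl")
result = has_conflict("numpy==1.21.0\nnumpy>=1.24", lazy)
print(result)
```

Because the load is deferred to the first `get()` call, importing the module stays cheap, and a missing file only costs one failed attempt.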
| ## Usage in Code | |
| ### Conflict Prediction | |
| ```python | |
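| from ml_models import ConflictPredictor | |
| # Example input: the raw text of a requirements.txt file | |
| requirements_text = "numpy==1.21.0\npandas>=2.0" | |
| predictor = ConflictPredictor() | |
| # predict() returns a pair: conflict flag and confidence | |
| has_conflict, confidence = predictor.predict(requirements_text) | |
| if has_conflict: | |
|     print(f"Conflict predicted with {confidence:.1%} confidence") | |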
| ``` | |
| ### Package Similarity | |
| ```python | |
| from ml_models import PackageEmbeddings | |
| embeddings = PackageEmbeddings() | |
| similar = embeddings.find_similar("numpyy", top_k=3) | |
| # Returns (name, score) pairs, e.g. [('numpy', 0.95), ('scipy', 0.72), ...] | |
| best_match = embeddings.get_best_match("pandaz") | |
| # Returns the closest known package name, e.g. 'pandas' | |
| ``` | |
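
How a lookup over pre-computed embeddings might rank candidates can be sketched with plain cosine similarity. The internals of `PackageEmbeddings` are not shown in this guide, so the standalone `find_similar` function below is an assumption, and the 3-dimensional vectors are made-up toy values (the real embeddings have 384 dimensions).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query_vec, embeddings, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only
toy = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
ranked = find_similar([0.9, 0.1, 0.0], toy, top_k=2)
print([name for name, _ in ranked])
```

With around 100 packages a linear scan like this is cheap; an index structure would only matter for much larger vocabularies.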
| ## Hugging Face Spaces Deployment | |
| ### Option 1: Include Models in Repo | |
| 1. Train models locally | |
| 2. Commit model files to the repo | |
| 3. Models load automatically on Spaces | |
| **Pros**: Simple, no external dependencies | |
| **Cons**: Larger repo size (~10-15 MB) | |
| ### Option 2: Upload to Hugging Face Hub | |
| 1. Train models locally | |
| 2. Upload to Hugging Face Hub: | |
| ```python | |
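| from huggingface_hub import upload_file | |
| # upload_file takes keyword-only arguments; path_in_repo is required | |
| upload_file(path_or_fileobj="models/conflict_predictor.pkl", | |
|             path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor") | |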
| ``` | |
| 3. Load from Hub in app: | |
| ```python | |
| from huggingface_hub import hf_hub_download | |
| model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl") | |
| ``` | |
| **Pros**: Smaller repo, version control for models | |
| **Cons**: Requires an internet connection at startup | |
| ## Performance | |
| - **Conflict Prediction**: <10ms per prediction | |
| - **Embedding Lookup**: <1ms (pre-computed) or ~50ms (on-the-fly) | |
| - **Model Loading**: ~1-2 seconds at startup | |
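
These figures depend on hardware, so it can be useful to measure them locally. The helper below is a minimal sketch; `measure_latency_ms` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in for `predictor.predict`.

```python
import time


def measure_latency_ms(fn, *args, repeats=100):
    """Return the mean wall-clock time of fn(*args) in milliseconds."""
    fn(*args)  # warm-up call so one-time setup such as lazy loading is excluded
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats


def fake_predict(text):
    # Hypothetical stand-in; replace with predictor.predict to time the real model.
    return len(text.splitlines()) > 3, 0.5


latency = measure_latency_ms(fake_predict, "numpy==1.21.0\npandas<2.0")
print(f"mean latency: {latency:.4f} ms")
```

The warm-up call matters here because models load lazily: without it, the first iteration would fold the load time into the per-prediction average.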
| ## Troubleshooting | |
| ### Models Not Loading | |
| - Check that `models/` directory exists | |
| - Verify model files are present | |
| - Check file permissions | |
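
The three checks above can be run in one pass with a short script. The file names come from the structure listed earlier in this guide; the function name `diagnose_models` is a hypothetical helper, not part of the app.

```python
import os
from pathlib import Path

EXPECTED_FILES = [
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
]


def diagnose_models(models_dir="models"):
    """Return a list of problems found; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
    return problems


for problem in diagnose_models():
    print(problem)
```

Run it from the directory that contains `models/`, since the default path is relative to the current working directory.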
| ### Low Prediction Accuracy | |
| - Retrain with more data | |
| - Adjust feature engineering | |
| - Try different model parameters | |
| ### Embeddings Not Working | |
| - Ensure `sentence-transformers` is installed | |
| - Check internet connection (for first-time model download) | |
| - Verify embeddings file format | |
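
To verify the embeddings file format, a small validator can help. The layout assumed here, a JSON object mapping each package name to a list of 384 numbers, is a guess based on the expected output listed earlier; adjust it if `generate_embeddings.py` writes a different structure.

```python
import json


def validate_embeddings(data, expected_dim=384):
    """Check an assumed {package_name: [number, ...]} layout; return problems found."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for name, vec in data.items():
        if not isinstance(vec, list):
            problems.append(f"{name}: embedding is not a list")
        elif len(vec) != expected_dim:
            problems.append(f"{name}: expected {expected_dim} values, got {len(vec)}")
        elif not all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in vec
        ):
            problems.append(f"{name}: non-numeric value found")
    return problems


# In-memory demo with toy 3-value vectors. For the real file, load
# models/package_embeddings.json with json.load and keep expected_dim=384.
demo = json.loads('{"numpy": [0.1, 0.2, 0.3], "pandas": [0.1, 0.2]}')
issues = validate_embeddings(demo, expected_dim=3)
print(issues)
```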
| ## Future Improvements | |
| - [ ] Train on larger, real-world dataset | |
| - [ ] Add version-specific embeddings | |
| - [ ] Implement online learning | |
| - [ ] Add confidence intervals | |
| - [ ] Support for custom model paths | |