# ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

## Overview

The project includes two ML models:

1. **Conflict Prediction Model**: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
2. **Package Embeddings**: Pre-computed semantic embeddings for common Python packages, used for similarity matching

## Training the Models

### Step 1: Install Training Dependencies

```bash
pip install scikit-learn sentence-transformers numpy
```

### Step 2: Train Conflict Prediction Model

```bash
cd "code to upload"
python train_conflict_model.py
```

This will:

- Load the synthetic dataset (`synthetic_requirements_dataset.json`)
- Extract features from requirements
- Train a Random Forest classifier
- Save the model to `models/conflict_predictor.pkl`
- Display accuracy and feature importance

**Expected Output:**

- Model size: ~2-5 MB
- Test accuracy: ~85-95% (depending on dataset)

### Step 3: Generate Package Embeddings

```bash
python generate_embeddings.py
```

This will:

- Load a sentence transformer model
- Generate embeddings for common Python packages
- Save embeddings to `models/package_embeddings.json`
- Save model info to `models/embedding_info.json`

**Expected Output:**

- Embeddings file: ~5-10 MB
- Embedding dimension: 384
- Number of packages: ~100+

## Model Files Structure

After training, you should have:

```
code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata
```

## Integration in Main App

The models are automatically loaded when available:

1. **Conflict Prediction**: Runs before detailed analysis to provide early warnings
2. **Package Similarity**: Enhances spell-checking with semantic matching

### Features

- **Graceful Fallback**: If models aren't available, the app falls back to rule-based methods
- **Lazy Loading**: Models load only when needed
- **Error Handling**: ML failures don't break the app

## Usage in Code

### Conflict Prediction

```python
from ml_models import ConflictPredictor

# Example input: the raw text of a requirements file
requirements_text = "numpy==1.21.0\npandas>=1.5.0"

predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")
```

### Package Similarity

```python
from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()

similar = embeddings.find_similar("numpyy", top_k=3)
# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
```

## Hugging Face Spaces Deployment

### Option 1: Include Models in Repo

1. Train models locally
2. Commit model files to the repo
3. Models load automatically on Spaces

**Pros**: Simple, no external dependencies

**Cons**: Larger repo size (~10-15 MB)

### Option 2: Upload to Hugging Face Hub

1. Train models locally
2. Upload to Hugging Face Hub:

   ```python
   from huggingface_hub import upload_file

   # upload_file takes keyword arguments; path_in_repo is the destination name in the Hub repo
   upload_file(
       path_or_fileobj="models/conflict_predictor.pkl",
       path_in_repo="conflict_predictor.pkl",
       repo_id="your-username/conflict-predictor",
   )
   ```

3. Load from Hub in app:

   ```python
   from huggingface_hub import hf_hub_download

   # Downloads the file (or reuses the local cache) and returns its local path
   model_path = hf_hub_download(
       repo_id="your-username/conflict-predictor",
       filename="conflict_predictor.pkl",
   )
   ```

**Pros**: Smaller repo, version control for models

**Cons**: Requires internet connection at startup

## Performance

- **Conflict Prediction**: <10 ms per prediction
- **Embedding Lookup**: <1 ms (pre-computed) or ~50 ms (on-the-fly)
- **Model Loading**: ~1-2 seconds at startup

## Troubleshooting

### Models Not Loading

- Check that the `models/` directory exists
- Verify model files are present
- Check file permissions

### Low Prediction Accuracy

- Retrain with more data
- Adjust feature engineering
- Try different model parameters

### Embeddings Not Working

- Ensure `sentence-transformers` is installed
- Check internet connection (for first-time model download)
- Verify embeddings file format

## Future Improvements

- [ ] Train on larger, real-world dataset
- [ ] Add version-specific embeddings
- [ ] Implement online learning
- [ ] Add confidence intervals
- [ ] Support for custom model paths
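
For reference, the graceful fallback and lazy loading behaviour listed under Features could look roughly like the sketch below. It is not the project's actual code: `LazyModel` and `check_conflicts` are hypothetical names, the model is assumed to be a plain pickle file, and the returned strings only mark which path was taken. The path argument also hints at how custom model paths might be supported.

```python
import pickle
from pathlib import Path
from typing import Any, Optional


class LazyModel:
    """Hypothetical helper: loads a pickled model on first use, at most once."""

    def __init__(self, path: str = "models/conflict_predictor.pkl") -> None:
        self.path = Path(path)
        self._model: Optional[Any] = None
        self._tried = False  # guards against repeated load attempts

    def get(self) -> Optional[Any]:
        """Return the loaded model, or None if it is missing or unreadable."""
        if not self._tried:
            self._tried = True
            try:
                with self.path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except (OSError, EOFError, pickle.UnpicklingError,
                    AttributeError, ImportError) as exc:
                # Graceful fallback: report the problem and continue without ML.
                print(f"ML model unavailable ({exc!r}); using rule-based checks")
        return self._model


def check_conflicts(requirements_text: str, model: LazyModel) -> str:
    """Pick the ML path when a model is loaded, else the rule-based path."""
    if model.get() is None:
        return "rule-based"  # placeholder for the app's rule-based analysis
    return "ml"  # placeholder for calling the model on requirements_text


if __name__ == "__main__":
    # With no model file on disk this prints the notice and "rule-based".
    print(check_conflicts("numpy==1.21.0", LazyModel("models/missing.pkl")))
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.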