Yash Sakhale
Initial commit: Python Dependency Compatibility Board with ML and LLM features

ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

Overview

The project includes two ML models:

  1. Conflict Prediction Model: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
  2. Package Embeddings: Pre-computed semantic embeddings of common Python packages, used for similarity matching

Training the Models

Step 1: Install Training Dependencies

pip install scikit-learn sentence-transformers numpy

Step 2: Train Conflict Prediction Model

cd "code to upload"
python train_conflict_model.py

This will:

  • Load the synthetic dataset (synthetic_requirements_dataset.json)
  • Extract features from requirements
  • Train a Random Forest classifier
  • Save the model to models/conflict_predictor.pkl
  • Display accuracy and feature importance

Expected Output:

  • Model size: ~2-5 MB
  • Test accuracy: ~85-95% (depending on dataset)
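
The "Extract features from requirements" step is not spelled out in this guide. The following is a minimal sketch of what such a step could look like, assuming the input is pip-style requirements text. The `extract_features` helper and its feature names are hypothetical and are not taken from train_conflict_model.py; extras and environment markers are not handled.

```python
import re

# Hypothetical feature extractor; the real training script may use different features.
_SPEC = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")

def extract_features(requirements_text: str) -> dict:
    """Turn pip-style requirements text into a flat dict of numeric features."""
    names = []
    pinned = ranged = unpinned = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        match = _SPEC.match(line)
        if match is None:
            continue  # skip option lines such as "-r other.txt"
        name, spec = match.group(1).lower(), match.group(2).strip()
        names.append(name)
        if spec.startswith("=="):
            pinned += 1      # exact pin, e.g. numpy==1.26.0
        elif spec:
            ranged += 1      # any other specifier, e.g. pandas>=2.0
        else:
            unpinned += 1    # bare package name
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unpinned": unpinned,
        "num_duplicates": len(names) - len(set(names)),
    }

features = extract_features(
    "numpy==1.26.0\npandas>=2.0\nrequests\nnumpy<1.20  # duplicate entry\n"
)
print(features)
```

A dict like this can be converted to a fixed-order numeric vector before it is passed to a classifier.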

Step 3: Generate Package Embeddings

python generate_embeddings.py

This will:

  • Load a sentence transformer model
  • Generate embeddings for common Python packages
  • Save embeddings to models/package_embeddings.json
  • Save model info to models/embedding_info.json

Expected Output:

  • Embeddings file: ~5-10 MB
  • Embedding dimension: 384
  • Number of packages: 100 or more

Model Files Structure

After training, you should have:

code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata

Integration in Main App

The models are automatically loaded when available:

  1. Conflict Prediction: Runs before detailed analysis to provide early warnings
  2. Package Similarity: Enhances spell-checking with semantic matching

Features

  • Graceful Fallback: If models aren't available, the app works with rule-based methods
  • Lazy Loading: Models load only when needed
  • Error Handling: ML failures don't break the app
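
The three behaviours listed above can be combined in one small wrapper. This is a sketch only: `LazyConflictModel` and `rule_based_conflict` are hypothetical names, the fallback rule is a toy, and the code assumes the pickled object accepts raw requirements text in `predict` (a bare Random Forest would need a feature vector instead).

```python
import pickle
from pathlib import Path

def rule_based_conflict(requirements_text):
    """Toy fallback rule: the same package pinned with == to two different versions."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = (part.strip() for part in line.split("==", 1))
        if pins.setdefault(name.lower(), version) != version:
            return True
    return False

class LazyConflictModel:
    """Hypothetical wrapper: loads the model on first use and falls back to rules."""

    def __init__(self, model_path="models/conflict_predictor.pkl"):
        self.model_path = Path(model_path)
        self._model = None
        self._load_attempted = False

    def _get_model(self):
        # Lazy loading: the file is read only on the first prediction.
        if not self._load_attempted:
            self._load_attempted = True
            try:
                with self.model_path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception:
                # Missing, unreadable, or corrupt file: stay on the rule-based path.
                self._model = None
        return self._model

    def predict(self, requirements_text):
        """Return (has_conflict, source), where source is 'ml' or 'rules'."""
        model = self._get_model()
        if model is not None:
            try:
                return bool(model.predict([requirements_text])[0]), "ml"
            except Exception:
                pass  # an ML failure must not break the app
        return rule_based_conflict(requirements_text), "rules"

model = LazyConflictModel("no/such/model.pkl")
print(model.predict("numpy==1.0\nnumpy==2.0"))
```

Because `pickle.load` can execute arbitrary code, a wrapper like this should only be pointed at model files from a trusted source.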

Usage in Code

Conflict Prediction

from ml_models import ConflictPredictor

predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")

Package Similarity

from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Example return value (scores are illustrative): [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
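
To illustrate the ranking behind calls like these, here is a self-contained sketch of nearest-neighbour lookup by cosine similarity. It is not the `PackageEmbeddings` implementation: `toy_embed` uses character-bigram counts as a stand-in for the sentence-transformer vectors, and the functions take an explicit `known_packages` list, which is an assumption made to keep the example runnable without model files.

```python
import math
from collections import Counter

def toy_embed(name):
    """Stand-in embedding: character-bigram counts (NOT a sentence transformer)."""
    name = name.lower()
    return Counter(name[i:i + 2] for i in range(len(name) - 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_similar(query, known_packages, top_k=3):
    """Return up to top_k (package, score) pairs, highest similarity first."""
    q = toy_embed(query)
    scored = [(pkg, cosine(q, toy_embed(pkg))) for pkg in known_packages]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

def get_best_match(query, known_packages, threshold=0.5):
    """Return the closest package name, or None if nothing clears the threshold."""
    results = find_similar(query, known_packages, top_k=1)
    if results and results[0][1] >= threshold:
        return results[0][0]
    return None

KNOWN = ["numpy", "pandas", "scipy", "requests"]
print(find_similar("numpyy", KNOWN, top_k=2))
print(get_best_match("pandaz", KNOWN))
```

The threshold keeps the lookup from suggesting a package when nothing is reasonably close; the value 0.5 is arbitrary and would need tuning for real embeddings.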

Hugging Face Spaces Deployment

Option 1: Include Models in Repo

  1. Train models locally
  2. Commit model files to the repo
  3. Models load automatically on Spaces

Pros: Simple, no external dependencies
Cons: Larger repo size (~10-15 MB)

Option 2: Upload to Hugging Face Hub

  1. Train models locally
  2. Upload to Hugging Face Hub:
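    from huggingface_hub import upload_file
    # upload_file takes keyword arguments; path_in_repo is the filename used later for download
    upload_file(path_or_fileobj="models/conflict_predictor.pkl", path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor")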
    
  3. Load from Hub in app:
    from huggingface_hub import hf_hub_download
    model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
    

Pros: Smaller repo, version control for models
Cons: Requires an internet connection at startup whenever the file is not already in the local cache
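
The two options can also be combined: use a committed file when it exists and download from the Hub otherwise. The sketch below assumes this local-first order is acceptable. `resolve_model_path` and its `download` parameter are hypothetical; `download` exists only so that the function can be exercised without network access.

```python
from pathlib import Path

def resolve_model_path(local_path, repo_id, filename, download=None):
    """Prefer a local file; otherwise fetch from the Hub; return None if both fail."""
    local = Path(local_path)
    if local.is_file():
        return str(local)
    try:
        if download is None:
            # Imported lazily so the app still starts without huggingface_hub installed.
            from huggingface_hub import hf_hub_download
            download = hf_hub_download
        return download(repo_id=repo_id, filename=filename)
    except Exception:
        # No network, missing repo, or missing package: the caller can fall back to rules.
        return None

# Usage with a fake downloader standing in for hf_hub_download.
def fake_download(repo_id, filename):
    return f"/cache/{repo_id}/{filename}"

path = resolve_model_path(
    "models/does_not_exist.pkl",
    repo_id="your-username/conflict-predictor",
    filename="conflict_predictor.pkl",
    download=fake_download,
)
print(path)
```

Returning `None` instead of raising keeps this consistent with the graceful-fallback behaviour described earlier.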

Performance

  • Conflict Prediction: <10ms per prediction
  • Embedding Lookup: <1ms (pre-computed) or ~50ms (on-the-fly)
  • Model Loading: ~1-2 seconds at startup
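
These figures depend on hardware and model size, so it is worth measuring them locally. Below is a minimal timing sketch; `median_latency_ms` is a hypothetical helper and `sorted` is only a stand-in workload.

```python
import statistics
import time

def median_latency_ms(fn, *args, repeats=100):
    """Call fn(*args) `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; replace with e.g. predictor.predict and a real requirements string.
latency = median_latency_ms(sorted, list(range(1000)), repeats=50)
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow first call, such as one that triggers lazy loading, does not dominate the result.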

Troubleshooting

Models Not Loading

  • Check that models/ directory exists
  • Verify model files are present
  • Check file permissions
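
The three checks above can be scripted. This sketch assumes the file names shown in the Model Files Structure section; `diagnose_models` is a hypothetical helper, not part of the app.

```python
import os
from pathlib import Path

EXPECTED_FILES = [
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
]

def diagnose_models(models_dir="models"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
    return problems

for problem in diagnose_models("models"):
    print(problem)
```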

Low Prediction Accuracy

  • Retrain with more data
  • Adjust feature engineering
  • Try different model parameters

Embeddings Not Working

  • Ensure sentence-transformers is installed
  • Check internet connection (for first-time model download)
  • Verify embeddings file format
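
The guide does not document the layout of the embeddings file. The sketch below assumes a JSON object that maps each package name to a list of 384 numbers; if the real file is laid out differently, the checks need adjusting. Both function names are hypothetical.

```python
import json
import math

def validate_embeddings(data, expected_dim=384):
    """Check an already-parsed embeddings object; return a list of problems found."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for name, vector in data.items():
        if not isinstance(vector, list):
            problems.append(f"{name}: embedding is not a list")
        elif len(vector) != expected_dim:
            problems.append(f"{name}: expected {expected_dim} values, got {len(vector)}")
        elif not all(
            isinstance(x, (int, float)) and not isinstance(x, bool) and math.isfinite(x)
            for x in vector
        ):
            problems.append(f"{name}: contains non-numeric or non-finite values")
    return problems

def validate_embeddings_file(path="models/package_embeddings.json", expected_dim=384):
    """Parse the file at `path` and run the same checks."""
    with open(path, encoding="utf-8") as fh:
        return validate_embeddings(json.load(fh), expected_dim)

print(validate_embeddings({"numpy": [0.0] * 384, "pandas": [0.1, 0.2]}))
```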

Future Improvements

  • Train on larger, real-world dataset
  • Add version-specific embeddings
  • Implement online learning
  • Add confidence intervals
  • Support custom model paths