python-dependency-compatibility-board / ML_MODELS_README.md

Yash Sakhale

Initial commit: Python Dependency Compatibility Board with ML and LLM features

329b91e about 1 month ago


ML Models Integration Guide

This document explains how to train and use the ML models for conflict prediction and package similarity.

Overview

The project includes two ML models:

Conflict Prediction Model: A Random Forest classifier that predicts whether a set of dependencies will conflict
Package Embeddings: Pre-computed semantic embeddings of common Python packages, used for similarity matching

Training the Models

Step 1: Install Training Dependencies

pip install scikit-learn sentence-transformers numpy

Step 2: Train Conflict Prediction Model

cd "code to upload"
python train_conflict_model.py

This will:

Load the synthetic dataset (synthetic_requirements_dataset.json)
Extract features from requirements
Train a Random Forest classifier
Save the model to models/conflict_predictor.pkl
Display accuracy and feature importance

Expected Output:

Model size: ~2-5 MB
Test accuracy: ~85-95% (depending on dataset)
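
The feature-extraction step in Step 2 might look roughly like the sketch below. It is illustrative only: this guide does not document the features that train_conflict_model.py actually uses, so the `extract_features` helper and every feature name here are hypothetical.

```python
import re

# Hypothetical feature extractor. The real features used by
# train_conflict_model.py are not documented in this guide.
_SPEC = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*(.*)$")

def extract_features(requirements_text: str) -> dict:
    """Turn a requirements.txt-style string into simple numeric features."""
    names, pinned, ranged, unpinned = [], 0, 0, 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _SPEC.match(line)
        if not match:
            continue  # skips option lines such as "-r other.txt"
        name, _extras, spec = match.groups()
        names.append(name.lower())
        if spec.startswith("=="):
            pinned += 1      # exact pin, e.g. numpy==1.26.0
        elif spec:
            ranged += 1      # any other specifier, e.g. pandas>=2.0
        else:
            unpinned += 1    # bare package name
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unpinned": unpinned,
        "num_duplicates": len(names) - len(set(names)),
    }

features = extract_features(
    "numpy==1.26.0\npandas>=2.0\nrequests\nnumpy<1.20  # old pin\n"
)
print(features)
```

A dictionary like this can be converted to a fixed-order numeric vector before it is passed to a classifier.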

Step 3: Generate Package Embeddings

python generate_embeddings.py

This will:

Load a sentence transformer model
Generate embeddings for common Python packages
Save embeddings to models/package_embeddings.json
Save model info to models/embedding_info.json

Expected Output:

Embeddings file: ~5-10 MB
Embedding dimension: 384
Number of packages: 100+

Model Files Structure

After training, you should have:

code to upload/
├── models/
│   ├── conflict_predictor.pkl      # Classification model
│   ├── package_embeddings.json     # Pre-computed embeddings
│   └── embedding_info.json         # Model metadata

Integration in Main App

The models are automatically loaded when available:

Conflict Prediction: Runs before detailed analysis to provide early warnings
Package Similarity: Enhances spell-checking with semantic matching

Features

Graceful Fallback: If the models aren't available, the app falls back to rule-based methods
Lazy Loading: Models load only when needed
Error Handling: ML failures don't break the app
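
A minimal sketch of how these three behaviours can fit together is shown below. It is not the app's actual implementation: `LazyConflictModel` and `rule_based_check` are hypothetical names, and the sketch assumes the pickled object exposes a `predict` method that accepts a list of raw requirements strings.

```python
import pickle
from pathlib import Path

def rule_based_check(requirements_text):
    """Hypothetical rule: flag a package pinned to two different versions."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = (part.strip() for part in line.split("==", 1))
        if pins.setdefault(name.lower(), version) != version:
            return True
    return False

class LazyConflictModel:
    """Loads a pickled model on first use and falls back to rules on failure."""

    def __init__(self, model_path="models/conflict_predictor.pkl"):
        self.model_path = Path(model_path)
        self._model = None
        self._load_attempted = False

    def _get_model(self):
        # Lazy loading: the file is read only on the first call.
        if not self._load_attempted:
            self._load_attempted = True
            try:
                # Only unpickle files from a trusted source.
                with self.model_path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception:
                # Error handling: a missing or corrupt file must not crash the app.
                self._model = None
        return self._model

    def predict(self, requirements_text):
        model = self._get_model()
        if model is None:
            # Graceful fallback: use the rule-based check instead.
            return rule_based_check(requirements_text), "rules"
        # Assumption: the model accepts raw requirement strings.
        return bool(model.predict([requirements_text])[0]), "ml"

fallback = LazyConflictModel("does/not/exist.pkl")
print(fallback.predict("numpy==1.0\nnumpy==2.0"))
```

Recording that a load was attempted, rather than retrying on every call, keeps a missing model from adding file-system overhead to each prediction.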

Usage in Code

Conflict Prediction

from ml_models import ConflictPredictor

predictor = ConflictPredictor()
has_conflict, confidence = predictor.predict(requirements_text)

if has_conflict:
    print(f"Conflict predicted with {confidence:.1%} confidence")

Package Similarity

from ml_models import PackageEmbeddings

embeddings = PackageEmbeddings()
similar = embeddings.find_similar("numpyy", top_k=3)
# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]

best_match = embeddings.get_best_match("pandaz")
# Returns: 'pandas'
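
For intuition, a lookup over pre-computed embeddings can be sketched as a cosine-similarity ranking. The code below is an assumption about how such a lookup might work, not the body of `PackageEmbeddings.find_similar`; the three-dimensional vectors are made-up toy values, whereas the real embeddings have 384 dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar(query_vec, embeddings, top_k=3):
    """Rank package names by similarity to query_vec, best match first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy vectors for illustration only; not real package embeddings.
toy_embeddings = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]  # stands in for the embedding of a misspelled name
ranked = find_similar(query, toy_embeddings, top_k=2)
print([name for name, _score in ranked])
```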

Hugging Face Spaces Deployment

Option 1: Include Models in Repo

Train models locally
Commit model files to the repo
Models load automatically on Spaces

Pros: Simple, no external dependencies
Cons: Larger repo size (~10-15 MB)

Option 2: Upload to Hugging Face Hub

Train models locally

Upload to Hugging Face Hub:

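from huggingface_hub import upload_file
# upload_file takes keyword-only arguments, path_in_repo is required, and the repo must already exist
upload_file(path_or_fileobj="models/conflict_predictor.pkl",
            path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor")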

Load from Hub in app:

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")

Pros: Smaller repo, version control for models
Cons: Requires internet connection at startup

Performance

Conflict Prediction: <10ms per prediction
Embedding Lookup: <1ms (pre-computed) or ~50ms (on-the-fly)
Model Loading: ~1-2 seconds at startup

Troubleshooting

Models Not Loading

Check that models/ directory exists
Verify model files are present
Check file permissions
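
The three checks above can be scripted. The following is a small sketch, assuming the file names from the Model Files Structure section and a models/ directory relative to the current working directory; `check_models` is a hypothetical helper, not part of the project.

```python
import os
from pathlib import Path

# File names taken from the Model Files Structure section.
EXPECTED_FILES = (
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
)

def check_models(models_dir="models"):
    """Return a list of problems; an empty list means every check passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
    return problems

for problem in check_models():
    print(problem)
```

Run it from the same directory the app starts in, since a relative models/ path resolves against the current working directory.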

Low Prediction Accuracy

Retrain with more data
Adjust feature engineering
Try different model parameters

Embeddings Not Working

Ensure sentence-transformers is installed
Check internet connection (for first-time model download)
Verify embeddings file format
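
To verify the embeddings file format, a check along the following lines may help. The layout of package_embeddings.json is not specified in this guide, so the sketch assumes a JSON object that maps each package name to a list of numbers; `validate_embeddings` is a hypothetical helper.

```python
import json

def validate_embeddings(data, expected_dim=384):
    """Return a list of problems found in an already-parsed embeddings object.

    Assumed layout: {"package-name": [float, float, ...], ...}
    """
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []
    for name, vector in data.items():
        if not isinstance(vector, list):
            problems.append(f"{name}: vector is not a list")
        elif len(vector) != expected_dim:
            problems.append(
                f"{name}: expected {expected_dim} values, got {len(vector)}"
            )
        elif not all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in vector
        ):
            problems.append(f"{name}: non-numeric value in vector")
    return problems

# Tiny in-memory sample with a deliberately short vector for "pandas".
sample = json.loads('{"numpy": [0.1, 0.2, 0.3], "pandas": [0.4, 0.5]}')
print(validate_embeddings(sample, expected_dim=3))
```

For the real file, parse it with `json.load` and keep the default `expected_dim=384`, which matches the embedding dimension listed under Step 3.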

Future Improvements

Train on a larger, real-world dataset
Add version-specific embeddings
Implement online learning
Add confidence intervals
Support custom model paths