python-dependency-compatibility-board / ML_MODELS_README.md

Yash Sakhale

Initial commit: Python Dependency Compatibility Board with ML and LLM features

329b91e 28 days ago

	# ML Models Integration Guide

	This document explains how to train and use the ML models for conflict prediction and package similarity.

	## Overview

	The project includes two ML models:

	1. Conflict Prediction Model: A Random Forest classifier that predicts whether a set of dependencies will have conflicts
	2. Package Embeddings: Pre-computed semantic embeddings of common Python packages, used for similarity matching

	## Training the Models

	### Step 1: Install Training Dependencies

	```bash
	pip install scikit-learn sentence-transformers numpy
	```

	### Step 2: Train Conflict Prediction Model

	```bash
	cd "code to upload"
	python train_conflict_model.py
	```

	This will:
	- Load the synthetic dataset (`synthetic_requirements_dataset.json`)
	- Extract features from requirements
	- Train a Random Forest classifier
	- Save the model to `models/conflict_predictor.pkl`
	- Display accuracy and feature importance

	Expected Output:
	- Model size: ~2-5 MB
	- Test accuracy: ~85-95% (depending on dataset)
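
The feature-extraction step is not spelled out in this guide. As a rough illustration only, the sketch below turns a requirements string into a few numeric features of the kind a Random Forest could consume. The function name `extract_features` and the chosen features are assumptions made for this example; the features actually used by `train_conflict_model.py` may differ.

```python
import re

# Hypothetical feature extractor for illustration; not taken from
# train_conflict_model.py.
SPEC_RE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(.*)$")


def extract_features(requirements_text: str) -> dict:
    """Count simple structural properties of a requirements file."""
    names = []
    pinned = ranged = unpinned = 0
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = SPEC_RE.match(line)
        if not match:
            continue
        name, spec = match.group(1).lower(), match.group(2).strip()
        names.append(name)
        if spec.startswith("=="):
            pinned += 1        # exact pin, e.g. numpy==1.26.0
        elif spec:
            ranged += 1        # any other specifier, e.g. pandas>=2.0
        else:
            unpinned += 1      # bare package name
    return {
        "num_packages": len(names),
        "num_pinned": pinned,
        "num_ranged": ranged,
        "num_unpinned": unpinned,
        "num_duplicates": len(names) - len(set(names)),
    }
```

A dictionary like this can be converted to a fixed-order list of numbers before being passed to a classifier.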

	### Step 3: Generate Package Embeddings

	```bash
	python generate_embeddings.py
	```

	This will:
	- Load a sentence transformer model
	- Generate embeddings for common Python packages
	- Save embeddings to `models/package_embeddings.json`
	- Save model info to `models/embedding_info.json`

	Expected Output:
	- Embeddings file: ~5-10 MB
	- Embedding dimension: 384
	- Number of packages: 100 or more

	## Model Files Structure

	After training, you should have:

	```
	code to upload/
	├── models/
	│   ├── conflict_predictor.pkl     # Classification model
	│   ├── package_embeddings.json    # Pre-computed embeddings
	│   └── embedding_info.json        # Model metadata
	```

	## Integration in Main App

	The models are automatically loaded when available:

	1. Conflict Prediction: Runs before detailed analysis to provide early warnings
	2. Package Similarity: Enhances spell-checking with semantic matching

	### Features

	- Graceful Fallback: If models aren't available, the app works with rule-based methods
	- Lazy Loading: Models load only when needed
	- Error Handling: ML failures don't break the app
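
The three behaviours listed above can be combined in one small wrapper. The following is a minimal sketch under stated assumptions, not the app's actual code: `LazyModel` and `rule_based` are hypothetical names, and the fallback rule (flag a conflict when a package name appears twice) is invented for the example.

```python
import pickle
import re
from pathlib import Path


class LazyModel:
    """Load a pickled model on first use; fall back to a rule when unavailable."""

    def __init__(self, path, fallback):
        self.path = Path(path)
        self.fallback = fallback   # callable used when the model cannot be used
        self._model = None
        self._tried = False

    def _load(self):
        if not self._tried:        # lazy loading: attempt the load only once
            self._tried = True
            try:
                # Only unpickle files you trust; pickle can execute code.
                with self.path.open("rb") as fh:
                    self._model = pickle.load(fh)
            except Exception:
                self._model = None  # missing or unreadable file: use fallback
        return self._model

    def predict(self, text):
        model = self._load()
        if model is None:
            return self.fallback(text)
        try:
            return model.predict(text)
        except Exception:
            return self.fallback(text)  # an ML failure must not break the app


def rule_based(text):
    """Hypothetical rule: report a conflict if a package name appears twice."""
    names = [
        re.split(r"[=<>!~\s]", line.strip(), maxsplit=1)[0].lower()
        for line in text.splitlines()
        if line.strip()
    ]
    return len(names) != len(set(names)), 0.5
```

With no model file on disk, `LazyModel("models/conflict_predictor.pkl", rule_based).predict(text)` simply returns the rule-based result.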

	## Usage in Code

	### Conflict Prediction

	```python
	from ml_models import ConflictPredictor

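	# Example input: the contents of a requirements.txt file as one string
	requirements_text = "numpy==1.26.0\npandas>=2.0\n"

	predictor = ConflictPredictor()
	has_conflict, confidence = predictor.predict(requirements_text)

	if has_conflict:
	    print(f"Conflict predicted with {confidence:.1%} confidence")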
	```

	### Package Similarity

	```python
	from ml_models import PackageEmbeddings

	embeddings = PackageEmbeddings()
	similar = embeddings.find_similar("numpyy", top_k=3)
	# Returns: [('numpy', 0.95), ('scipy', 0.72), ...]

	best_match = embeddings.get_best_match("pandaz")
	# Returns: 'pandas'
	```
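
How `find_similar` ranks candidates is not described here. A common approach, shown below as a sketch only, is to score every stored vector by cosine similarity against a query vector and keep the top results. The standalone `cosine` and `find_similar` functions and the toy 3-dimensional vectors are assumptions for illustration; the real embeddings have 384 dimensions, and the query vector would come from the sentence transformer model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query_vec, embeddings, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors for illustration only.
toy = {
    "numpy": [1.0, 0.0, 0.0],
    "scipy": [0.8, 0.6, 0.0],
    "flask": [0.0, 0.0, 1.0],
}
result = find_similar([1.0, 0.0, 0.0], toy, top_k=2)
print(result)  # "numpy" ranks first, "scipy" second
```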

	## Hugging Face Spaces Deployment

	### Option 1: Include Models in Repo

	1. Train models locally
	2. Commit model files to the repo
	3. Models load automatically on Spaces

	- Pros: Simple, no external dependencies
	- Cons: Larger repo size (~10-15 MB)

	### Option 2: Upload to Hugging Face Hub

	1. Train models locally
	2. Upload to Hugging Face Hub:
	```python
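	from huggingface_hub import upload_file
	# upload_file takes keyword-only arguments; path_in_repo is the file's name inside the Hub repo
	upload_file(path_or_fileobj="models/conflict_predictor.pkl",
	            path_in_repo="conflict_predictor.pkl", repo_id="your-username/conflict-predictor")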
	```
	3. Load from Hub in app:
	```python
	from huggingface_hub import hf_hub_download
	model_path = hf_hub_download(repo_id="your-username/conflict-predictor", filename="conflict_predictor.pkl")
	```

	- Pros: Smaller repo, version control for models
	- Cons: Requires internet connection at startup

	## Performance

	- Conflict Prediction: <10 ms per prediction
	- Embedding Lookup: <1 ms (pre-computed) or ~50 ms (on-the-fly)
	- Model Loading: ~1-2 seconds at startup
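
These figures will vary with hardware and model size. To check them on your own machine, a small timing helper such as the one below may be useful. It is a sketch: `time_call` and `dummy_predict` are hypothetical names, and `dummy_predict` only stands in for a real `predictor.predict` call.

```python
import time


def time_call(fn, *args, repeats=100):
    """Return the mean wall-clock time of fn(*args) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) * 1000 / repeats


def dummy_predict(text):
    """Stand-in for a real model call; replace with predictor.predict."""
    return ("==" in text, 0.5)


mean_ms = time_call(dummy_predict, "numpy==1.26.0")
print(f"mean latency: {mean_ms:.4f} ms")
```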

	## Troubleshooting

	### Models Not Loading

	- Check that `models/` directory exists
	- Verify model files are present
	- Check file permissions
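
The checks above can be run in one pass. The sketch below assumes the three file names from the Model Files Structure section; `check_models` is a hypothetical helper, not part of the project.

```python
import os
from pathlib import Path

# File names taken from the "Model Files Structure" section.
EXPECTED_FILES = [
    "conflict_predictor.pkl",
    "package_embeddings.json",
    "embedding_info.json",
]


def check_models(models_dir="models"):
    """Return a list of problems found; an empty list means all checks passed."""
    root = Path(models_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"not readable: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
    return problems
```

Calling `check_models()` from the directory that contains `models/` prints nothing by itself; print the returned list to see which check failed.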

	### Low Prediction Accuracy

	- Retrain with more data
	- Adjust feature engineering
	- Try different model parameters

	### Embeddings Not Working

	- Ensure `sentence-transformers` is installed
	- Check internet connection (for first-time model download)
	- Verify embeddings file format
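
The layout of `package_embeddings.json` is not documented here. Assuming it is a JSON object that maps each package name to a list of 384 numbers, a format check could look like the sketch below. `validate_embeddings` is a hypothetical helper; adjust it if the real file is structured differently.

```python
import json


def validate_embeddings(path, expected_dim=384):
    """Check the assumed {name: [float, ...]} layout; return the package count."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping package names to vectors")
    for name, vec in data.items():
        if not isinstance(vec, list) or len(vec) != expected_dim:
            raise ValueError(f"{name}: expected a list of {expected_dim} numbers")
        if not all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in vec
        ):
            raise ValueError(f"{name}: vector contains non-numeric values")
    return len(data)
```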

	## Future Improvements

	- [ ] Train on larger, real-world dataset
	- [ ] Add version-specific embeddings
	- [ ] Implement online learning
	- [ ] Add confidence intervals
	- [ ] Support for custom model paths