# Crawl-Priority-ML: Crawl Budget and Indexability Priority Modeling
Type: Academic | Domain: SEO, Crawl Efficiency
Hugging Face: syeedalireza/crawl-priority-ml
Predict crawl priority and indexability at URL level from structure and signals to optimize crawl budget.
## Author
Alireza Aminzadeh
- Hugging Face: syeedalireza
- LinkedIn: alirezaaminzadeh
- Email: alireza.aminzadeh@hotmail.com
## Problem

Large sites must prioritize which URLs to crawl and index. ML can estimate "crawl value" and indexability from URL structure, depth, and page signals.
## Approach
- Features: URL depth, path tokens, internal links, last-modified, content length, canonical/redirect flags, etc.
- Targets: Crawl priority (ordinal or score), indexable (binary).
- Models: XGBoost/LightGBM for tabular features; optional URL embedding.
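To illustrate the URL-structure features above, a helper along these lines (hypothetical; not part of the repository) could derive depth and path tokens from a raw URL:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Derive simple structural features from a URL (illustrative helper)."""
    path = urlparse(url).path
    tokens = [t for t in path.split("/") if t]  # drop empty segments
    return {
        "depth": len(tokens),        # how deep the page sits in the path
        "path_tokens": tokens,       # usable for an optional URL embedding
        "is_root": len(tokens) == 0,
    }

print(url_features("https://example.com/blog/2024/crawl-budget"))
# {'depth': 3, 'path_tokens': ['blog', '2024', 'crawl-budget'], 'is_root': False}
```

Query strings and trailing slashes are deliberately ignored here; a production feature extractor would need explicit rules for them.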
## Tech Stack
| Category | Tools |
|---|---|
| ML | scikit-learn, XGBoost, LightGBM |
| Data | pandas, NumPy |
| Evaluation | sklearn metrics |
## Setup

```bash
pip install -r requirements.txt
```
## Usage

```bash
python train.py
python inference.py --input data/urls.csv --output priorities.csv
```
## Project structure

```
05_crawl-priority-ml/
├── config.py
├── train.py              # Priority (regression) and indexable (classification)
├── inference.py
├── requirements.txt
├── .env.example
├── data/
│   ├── crawl_features.csv   # Sample: features + priority, indexable
│   └── urls.csv             # Sample inference input (features only)
└── models/
```
## Data

- Sample data (included): `data/crawl_features.csv` (training: features plus `priority`, `indexable` targets) and `data/urls.csv` (inference: the same feature columns only).
- Feature columns: `depth`, `internal_links`, `content_length`, `word_count`, `is_canonical`. Targets: `priority`, `indexable`.
- Set `DATA_PATH` in `.env` to use another file.
## License
MIT.