# Crawl-Priority-ML: Crawl Budget and Indexability Priority Modeling
Type: Academic | Domain: SEO, Crawl Efficiency
Hugging Face: syeedalireza/crawl-priority-ml
Predict crawl priority and indexability at URL level from structure and signals to optimize crawl budget.
## Author
Alireza Aminzadeh
- Hugging Face: syeedalireza
- LinkedIn: alirezaaminzadeh
- Email: alireza.aminzadeh@hotmail.com
## Problem

Large sites must prioritize which URLs to crawl and index. ML can estimate "crawl value" and indexability from URL structure, depth, and page signals.
## Approach
- Features: URL depth, path tokens, internal links, last-modified, content length, canonical/redirect flags, etc.
- Targets: Crawl priority (ordinal or score), indexable (binary).
- Models: XGBoost/LightGBM for tabular features; optional URL embedding.
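To illustrate the URL-structure features above, a helper along these lines (hypothetical; not part of the repository) could derive depth and path tokens from a raw URL:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Derive simple structural features from a URL (illustrative helper)."""
    path = urlparse(url).path
    tokens = [t for t in path.split("/") if t]  # drop empty segments
    return {
        "depth": len(tokens),        # how deep the page sits in the path
        "path_tokens": tokens,       # usable for an optional URL embedding
        "is_root": len(tokens) == 0,
    }

print(url_features("https://example.com/blog/2024/crawl-budget"))
# {'depth': 3, 'path_tokens': ['blog', '2024', 'crawl-budget'], 'is_root': False}
```

Query strings and trailing slashes are deliberately ignored here; a production feature extractor would need explicit rules for them.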
## Tech Stack
| Category | Tools |
|---|---|
| ML | scikit-learn, XGBoost, LightGBM |
| Data | pandas, NumPy |
| Evaluation | sklearn metrics |
## Setup

```bash
pip install -r requirements.txt
```
## Usage

```bash
python train.py
python inference.py --input data/urls.csv --output priorities.csv
```
## Project structure

```
05_crawl-priority-ml/
├── config.py
├── train.py              # Priority (regression) and indexable (classification)
├── inference.py
├── requirements.txt
├── .env.example
├── data/
│   ├── crawl_features.csv   # Sample: features + priority, indexable
│   └── urls.csv             # Sample inference input (features only)
└── models/
```
## Data

- Sample data (included): `data/crawl_features.csv` (training: features plus `priority`, `indexable` targets) and `data/urls.csv` (inference: the same feature columns only).
- Feature columns: `depth`, `internal_links`, `content_length`, `word_count`, `is_canonical`. Targets: `priority`, `indexable`.
- Set `DATA_PATH` in `.env` to use another file.
## License
MIT.