Crawl-Priority-ML: Crawl Budget and Indexability Priority Modeling

Type: Academic | Domain: SEO, Crawl Efficiency
Hugging Face: syeedalireza/crawl-priority-ml

Predict crawl priority and indexability at the URL level from URL structure and page signals, to help optimize crawl budget.

Author

Alireza Aminzadeh

Problem

Large sites must prioritize which URLs to crawl and index. ML can estimate a "crawl value" and indexability for each URL from its structure, depth, and page signals.
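As an illustration of the structural signals mentioned above, a small stdlib-only helper can derive depth and path features from a raw URL. This is a hypothetical sketch, not part of the released code:

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Derive simple structural signals from a URL (hypothetical helper)."""
    parts = urlparse(url)
    tokens = [t for t in parts.path.split("/") if t]
    return {
        "depth": len(tokens),                    # how deep in the site tree
        "path_length": len(parts.path),          # raw path length in characters
        "has_query": int(bool(parts.query)),     # query strings often signal duplicates
        "is_file": int("." in (tokens[-1] if tokens else "")),  # e.g. .html, .pdf
    }

print(url_features("https://example.com/blog/2024/seo-guide?ref=x"))
```

In practice these columns would be joined with crawl-log signals (internal links, last-modified) before training.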

Approach

  • Features: URL depth, path tokens, internal links, last-modified, content length, canonical/redirect flags, etc.
  • Targets: Crawl priority (ordinal or score), indexable (binary).
  • Models: XGBoost/LightGBM for tabular features; optional URL embedding.

Tech Stack

Category     Tools
ML           scikit-learn, XGBoost, LightGBM
Data         pandas, NumPy
Evaluation   sklearn metrics

Setup

pip install -r requirements.txt

Usage

python train.py
python inference.py --input data/urls.csv --output priorities.csv
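Conceptually, the inference command above reads a feature table, scores each row with the trained models, and writes the results. A self-contained sketch of that flow (the tiny inline frame stands in for the real training data and for data/urls.csv; inference.py's actual logic may differ):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Tiny training stand-in (in the project, models/ would hold already-trained models)
train = pd.DataFrame({
    "depth": [1, 2, 5, 3, 6, 2],
    "internal_links": [90, 40, 2, 15, 1, 60],
    "content_length": [9000, 4000, 600, 2500, 300, 5000],
    "word_count": [1400, 700, 90, 400, 40, 900],
    "is_canonical": [1, 1, 0, 1, 0, 1],
})
reg = GradientBoostingRegressor().fit(train, [0.9, 0.6, 0.1, 0.4, 0.05, 0.7])
clf = GradientBoostingClassifier().fit(train, [1, 1, 0, 1, 0, 1])

urls = train.copy()  # stands in for pd.read_csv("data/urls.csv")
out = urls.assign(priority=reg.predict(urls), indexable=clf.predict(urls))
out.to_csv("priorities.csv", index=False)  # mirrors the --output flag
```

The output keeps one row per input URL, with the original feature columns plus the two predicted columns appended.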

Project structure

05_crawl-priority-ml/
├── config.py
├── train.py           # Priority (regression) and indexable (classification)
├── inference.py
├── requirements.txt
├── .env.example
├── data/
│   ├── crawl_features.csv   # Sample: features + priority, indexable
│   └── urls.csv             # Sample inference input (features only)
└── models/

Data

  • Sample data (included): data/crawl_features.csv (training: features + priority, indexable), data/urls.csv (inference: same feature columns only).
  • Feature columns: depth, internal_links, content_length, word_count, is_canonical. Targets: priority, indexable.
  • Set DATA_PATH in .env if using another file.

License

MIT.
