---
library_name: scikit-learn
tags:
- jobs
- classification
- tf-idf
pipeline_tag: text-classification
---

# Jobs job-category classifier (sklearn)

This repo holds the trained artifact consumed by **jobs-shared**
(`jobs_shared.ai.categorizers.pipeline`) for the findjobs taxonomy.

- **Weights:** `category.joblib`, a `joblib`-serialized dict whose `model` key holds a
  scikit-learn `Pipeline` (`TfidfVectorizer` + `LogisticRegression`), alongside the
  artifact keys `fields` and `input_joiner`. Compressed with `joblib.dump(..., compress=3)`.
- **Upstream:** Produced by `scripts/train/train_category.py`.
- **HF repo:** `gateswang00/job_classifier`

### Load locally

```python
import joblib
from huggingface_hub import hf_hub_download

# Download the serialized artifact from the Hub (or reuse the local cache).
path = hf_hub_download(repo_id="gateswang00/job_classifier", filename="category.joblib")
artifact = joblib.load(path)  # dict-like artifact, not a bare estimator
clf = artifact["model"]  # sklearn Pipeline
# Names of the record fields used as model input; falls back to a default list.
fields = artifact.get("fields", ["title", "llm_skills", "description"])
print(fields)
```
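The artifact also carries `fields` and `input_joiner`, which suggests the pipeline expects one string built by joining the listed fields. A minimal sketch of inference under that assumption follows. `join_fields` and `predict_category` are hypothetical helpers, not part of `jobs_shared`; the fallback joiner `"\n"` is a guess, and list-valued fields (for example `llm_skills`) are assumed to be flattened with spaces.

```python
from typing import Any, Mapping, Sequence


def join_fields(record: Mapping[str, Any], fields: Sequence[str], joiner: str) -> str:
    """Build a single input string from the configured fields (hypothetical helper)."""
    parts = []
    for name in fields:
        value = record.get(name)
        if value is None:
            continue  # missing fields are skipped (assumption)
        if isinstance(value, (list, tuple)):
            # Flatten list-valued fields into one string (assumption).
            value = " ".join(str(v) for v in value)
        text = str(value).strip()
        if text:
            parts.append(text)
    return joiner.join(parts)


def predict_category(artifact: Mapping[str, Any], record: Mapping[str, Any]) -> str:
    """Predict one category label for a single job record (hypothetical helper)."""
    fields = artifact.get("fields", ["title", "llm_skills", "description"])
    joiner = artifact.get("input_joiner", "\n")  # default joiner is an assumption
    text = join_fields(record, fields, joiner)
    # A fitted sklearn Pipeline accepts an iterable of raw strings.
    return str(artifact["model"].predict([text])[0])
```

With the `artifact` loaded above, a call such as `predict_category(artifact, {"title": "Data Engineer", "llm_skills": ["python", "sql"], "description": "Build batch pipelines"})` would return one class label. If the real joining logic in `jobs_shared` differs, predictions from this sketch may not match production.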

### Training metadata snapshot

```
categorizer_filter: ['qwen2.5:7b', 'qwen3-jobs-classifier']
categorizer_mix: {'qwen2.5:7b': 8086}
category_source_filter: "category_source IS DISTINCT FROM 'rules'"
category_source_mix: {'(null)': 8167}
llm_skills_coverage: 0.7056641108088053
min_per_class: 50
n_classes: 14
n_rows: 8086
random_state: 42
source: 'jobs.job_categorized JOIN jobs.jobs_found LEFT JOIN LATERAL jobs.job_extracted'
test_size: 0.2
trained_at: '2026-05-23T15:57:36.400167+00:00'
```
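The `min_per_class`, `test_size`, and `random_state` entries read like a rare-class filter followed by a seeded hold-out split. The sketch below illustrates that reading only; it is not taken from `scripts/train/train_category.py`. `drop_rare_classes` and `split_sizes` are hypothetical names, and the size arithmetic assumes scikit-learn's `train_test_split` rounding (test count rounded up, remainder used for training) applied to all `n_rows`.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def drop_rare_classes(labels: Sequence[str], min_per_class: int) -> List[int]:
    """Return indices of rows whose class has at least `min_per_class` examples."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_per_class]


def split_sizes(n_rows: int, test_size: float) -> Tuple[int, int]:
    """(train, test) row counts using scikit-learn-style rounding (assumption)."""
    n_test = math.ceil(test_size * n_rows)  # test share is rounded up
    n_train = n_rows - n_test               # remainder goes to training
    return n_train, n_test


# Values copied from the metadata snapshot.
n_train, n_test = split_sizes(n_rows=8086, test_size=0.2)
print(n_train, n_test)
```

Under these assumptions the snapshot would correspond to 6468 training rows and 1618 held-out rows; the actual counts depend on how the training script applies the split.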

Replace this README’s license/frontmatter via the Hugging Face model card UI if needed.