dima806 committed
Commit 1a584f9 · verified · 1 parent: 561c0bd

Upload 38 files

.github/workflows/ci.yml ADDED
@@ -0,0 +1,29 @@
+ name: CI
+
+ on:
+   push:
+     branches: ["**"]
+
+ jobs:
+   lint-and-test:
+     runs-on: ubuntu-latest
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: "3.12"
+
+       - name: Install uv
+         uses: astral-sh/setup-uv@v5
+
+       - name: Install dependencies
+         run: uv sync --all-extras
+
+       - name: Lint
+         run: make lint
+
+       - name: Test
+         run: make test
.gitignore CHANGED
@@ -217,4 +217,4 @@ data/*.zip
  # models/*.joblib

  # LLM
- .llm/
+ .llm/
.pre-commit-config.yaml ADDED
@@ -0,0 +1,31 @@
+ repos:
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v5.0.0
+     hooks:
+       - id: trailing-whitespace
+       - id: end-of-file-fixer
+       - id: mixed-line-ending
+         args: [--fix=lf]
+       - id: check-yaml
+       - id: check-toml
+       - id: check-json
+       - id: check-added-large-files
+         args: [--maxkb=1000]
+       - id: check-merge-conflict
+       - id: debug-statements
+
+   - repo: local
+     hooks:
+       - id: format
+         name: ruff format
+         entry: make format
+         language: system
+         types: [python]
+         pass_filenames: false
+
+       - id: lint
+         name: ruff lint
+         entry: make lint
+         language: system
+         types: [python]
+         pass_filenames: false
Claude.md CHANGED
@@ -1,229 +1,303 @@
  # Claude Development Guide

  ## Project Overview
- This is a minimal, local-first ML application built in Python that predicts developer salaries using Stack Overflow Developer Survey data. The project emphasizes clarity and simplicity over production completeness.

  ## Tech Stack
- - **Python 3.11+**
- - **uv** - Package & virtual environment management
- - **pandas** - Data manipulation
- - **scikit-learn** - ML modeling
- - **pydantic** - Input validation
- - **streamlit** - Web UI
- - **xgboost** - Advanced gradient boosting (optional)

  ## Project Structure
- ```
  .
  ├── data/
- │   └── survey_results_public.csv  # Stack Overflow survey data
  ├── models/
- │   └── model.pkl                  # Serialized trained model
  ├── src/
- │   ├── schema.py                  # Pydantic validation models
- │   ├── train.py                   # Model training script
- │   └── infer.py                   # Inference utilities
- ├── app.py                         # Streamlit web application
- ├── example_inference.py           # Example inference script
- ├── pyproject.toml                 # Project dependencies (uv)
- ├── uv.lock                        # Locked dependencies
- └── README.md                      # Project documentation
  ```

  ## Setup & Installation

- ### Initial Setup
  ```bash
- # The virtual environment is already created at .venv/
- # Activate it:
- source .venv/bin/activate  # On Linux/Mac
- # or
- .venv\Scripts\activate  # On Windows
-
- # Install/sync dependencies with uv:
  uv sync
  ```

- ### Adding New Dependencies
  ```bash
- uv add <package-name>
  ```

- ## Key Workflows

- ### Training the Model
  ```bash
- python src/train.py
  ```
- This will:
- - Load data from `data/survey_results_public.csv`
- - Clean and preprocess features
- - Train the regression model
- - Save model to `models/model.pkl`

- ### Running the Streamlit App
  ```bash
- streamlit run app.py
  ```
- Opens a browser interface for salary predictions.

- ### Running Inference Programmatically
  ```python
  from src.schema import SalaryInput
  from src.infer import predict_salary

  input_data = SalaryInput(
-     country="United States",
      years_code=5.0,
-     education_level="Bachelor's degree",
-     dev_type="Developer, back-end",
      industry="Software Development",
-     age="25-34 years old"
  )
  salary = predict_salary(input_data)
  ```

  ## Key Files

  ### [src/schema.py](src/schema.py)
- Contains Pydantic models for:
- - Input validation (`SalaryInput`)
- - Type safety across the application

  ### [src/train.py](src/train.py)
- Training pipeline:
- - Data loading and cleaning
- - Feature engineering
- - Model training
- - Model persistence

  ### [src/infer.py](src/infer.py)
- Inference utilities:
- - Model loading
- - Prediction logic
- - Validated input processing

  ### [app.py](app.py)
- Streamlit UI:
- - User input forms
- - Real-time predictions
- - Results visualization
-
- ## Development Guidelines
-
- ### Code Style
- - Keep code simple and readable
- - Total codebase should remain under ~200 lines
- - Focus on clarity over cleverness
- - Use type hints where helpful
-
- ### Data Requirements
- The dataset must include these columns:
- - `Country` - Developer location
- - `YearsCode` - Total years of coding (including education)
- - `EdLevel` - Education level
- - `DevType` - Developer type
- - `Industry` - Industry the developer works in
- - `Age` - Developer's age range
- - `ConvertedCompYearly` - Annual salary (target variable)
-
- ### Model Expectations
- - Basic regression model (LinearRegression or similar)
- - Simple feature encoding (one-hot for categoricals)
- - No hyperparameter tuning required
- - Focus on working end-to-end pipeline
-
- ## Common Tasks
-
- ### Debugging Training Issues
- 1. Check if data file exists: `ls -la data/`
- 2. Verify CSV columns: `head -1 data/survey_results_public.csv`
- 3. Check for missing values in target column
- 4. Review data types and encoding
-
- ### Updating Features
- 1. Modify `SalaryInput` schema in [src/schema.py](src/schema.py)
- 2. Update feature extraction in [src/train.py](src/train.py)
- 3. Update inference logic in [src/infer.py](src/infer.py)
- 4. Update UI inputs in [app.py](app.py)
- 5. Retrain the model
-
- ### Testing Predictions
- ```python
- # Quick test in Python REPL
- from src.infer import predict_salary
- from src.schema import SalaryInput

- test_input = SalaryInput(
-     country="United States",
-     years_code=3.0,
-     education_level="Bachelor's degree",
-     dev_type="Developer, back-end",
-     industry="Software Development",
-     age="25-34 years old"
- )
- print(predict_salary(test_input))
- ```

- ## Non-Goals (Intentionally Excluded)
- - Cloud deployment or serving
- - Hyperparameter tuning
- - Model registry or experiment tracking
- - Advanced feature engineering
- - Production monitoring
- - API endpoints (beyond Streamlit)

- ## Useful Commands

- ```bash
- # Check environment
- which python
- python --version

- # Verify uv installation
- uv --version

- # List installed packages
- uv pip list

- # Run with specific Python version
- uv run python src/train.py

- # Clean generated files
- rm -f models/model.pkl

- # Check data file size
- du -h data/survey_results_public.csv
  ```

- ## Troubleshooting

- ### Model file not found
- - Run training first: `python src/train.py`
- - Check file exists: `ls -la models/model.pkl`

- ### Missing dependencies
- - Sync environment: `uv sync`
- - Verify pyproject.toml has all required packages

- ### Data file issues
- - Ensure CSV is in `data/` directory
- - Check file encoding (should be UTF-8)
- - Verify required columns exist

- ### Streamlit won't start
- - Check port 8501 is available
- - Try specifying port: `streamlit run app.py --server.port 8502`

  ## Additional Resources
- - [PRD](.llm/prd.md) - Full product requirements
- - [README.md](README.md) - Project readme
- - [Stack Overflow Survey](https://insights.stackoverflow.com/survey) - Data source
-
- ## Working with Claude Code
- When asking Claude to help with this project:
- - Reference specific files using markdown links: [filename](path)
- - Be specific about which component needs changes
- - Mention if you need training, inference, or UI updates
- - Provide error messages in full when debugging
- - Ask for explanations of model choices if unclear
  # Claude Development Guide

  ## Project Overview
+
+ A minimal, local-first ML application that predicts developer salaries using Stack Overflow
+ Developer Survey data. Built with Python 3.12, XGBoost, Pydantic v2, and Streamlit. Emphasises
+ clarity and simplicity over production completeness.

  ## Tech Stack
+
+ - **Python 3.12+**
+ - **uv** — Package & virtual environment management
+ - **pandas** — Data manipulation
+ - **xgboost** — Gradient boosting model (primary model)
+ - **scikit-learn** — Cross-validation and train/test split
+ - **optuna** — Hyperparameter optimisation
+ - **pydantic** — Input schema validation (v2)
+ - **streamlit** — Web UI
+ - **ruff** — Linting and formatting
+ - **radon** — Cyclomatic complexity and maintainability metrics
+ - **bandit** — Static security analysis
+ - **pip-audit** — Dependency vulnerability scanning
+ - **pre-commit** — Git hook management

  ## Project Structure
+
+ ```text
  .
+ ├── .github/
+ │   └── workflows/
+ │       └── ci.yml                 # GitHub Actions CI (lint + test on every push)
+ ├── config/
+ │   ├── model_parameters.yaml      # Model config, guardrails, cardinality settings
+ │   ├── optuna_config.yaml         # Optuna search space and trial settings
+ │   ├── valid_categories.yaml      # Valid input categories (generated by training)
+ │   └── currency_rates.yaml        # Per-country currency rates (generated by training)
  ├── data/
+ │   └── survey_results_public.csv  # Stack Overflow survey data (download required)
  ├── models/
+ │   └── model.pkl                  # Trained model artifact (generated by training)
  ├── src/
+ │   ├── __init__.py
+ │   ├── schema.py                  # Pydantic input model (SalaryInput)
+ │   ├── preprocessing.py           # Feature engineering (one-hot encoding, scaling)
+ │   ├── train.py                   # Training pipeline
+ │   ├── tune.py                    # Optuna hyperparameter optimisation
+ │   └── infer.py                   # Inference with runtime category guardrails
+ ├── tests/
+ │   ├── conftest.py                # Shared pytest fixtures
+ │   ├── test_schema.py             # Pydantic validation tests
+ │   ├── test_infer.py              # Inference and guardrail tests
+ │   ├── test_train.py              # Training pipeline helper tests
+ │   ├── test_preprocessing.py      # Feature engineering tests
+ │   ├── test_tune.py               # Optuna tuning tests
+ │   └── test_feature_impact.py     # Model sanity — each feature affects predictions
+ ├── app.py                         # Streamlit web app
+ ├── example_inference.py           # Programmatic usage examples
+ ├── Makefile                       # Developer workflow commands
+ ├── .pre-commit-config.yaml        # Pre-commit hooks (format, lint, standard checks)
+ ├── pyproject.toml                 # Project metadata and dependencies (uv)
+ └── README.md                      # Project documentation + HuggingFace Space config
  ```

  ## Setup & Installation

  ```bash
+ # Install dependencies
  uv sync
+
+ # Install pre-commit hooks (once, after cloning)
+ uv run pre-commit install
  ```

+ ## Key Workflows
+
+ All common tasks are available via the Makefile. Run `make help` to list targets.
+
+ ### Full quality check
+
  ```bash
+ make check  # lint + test + complexity + maintainability + audit + security
  ```

+ ### Individual targets
+
+ | Target | What it does |
+ | ------ | ------------ |
+ | `make lint` | ruff check (style + errors) |
+ | `make format` | ruff format (auto-format) |
+ | `make test` | pytest — all tests |
+ | `make coverage` | pytest with HTML coverage report |
+ | `make complexity` | radon cyclomatic complexity |
+ | `make maintainability` | radon maintainability index |
+ | `make audit` | pip-audit dependency vulnerability scan |
+ | `make security` | bandit static security analysis |
+ | `make tune` | Optuna hyperparameter search |
+
+ ### Training the model
+
+ ```bash
+ uv run python -m src.train
+ ```
+
+ Generates:
+ - `models/model.pkl` — trained XGBoost model
+ - `config/valid_categories.yaml` — valid input values for runtime guardrails
+ - `config/currency_rates.yaml` — per-country median currency conversion rates
+
+ ### Hyperparameter tuning (optional, run before training)

  ```bash
+ make tune
+ # or
+ uv run python -m src.tune --n-trials 50
  ```

+ Reads search space from `config/optuna_config.yaml`, writes best parameters back into
+ `config/model_parameters.yaml`.
+
+ ### Running the Streamlit app
+
  ```bash
+ uv run streamlit run app.py
  ```

+ ### Running inference programmatically
+
  ```python
  from src.schema import SalaryInput
  from src.infer import predict_salary

  input_data = SalaryInput(
+     country="United States of America",
      years_code=5.0,
+     work_exp=3.0,
+     education_level="Bachelor's degree (B.A., B.S., B.Eng., etc.)",
+     dev_type="Developer, full-stack",
      industry="Software Development",
+     age="25-34 years old",
+     ic_or_pm="Individual contributor",
+     org_size="20 to 99 employees",
  )
  salary = predict_salary(input_data)
  ```

+ Valid values for each categorical field are listed in `config/valid_categories.yaml`
+ (generated at training time).
+
+ ## Data Requirements
+
+ The `survey_results_public.csv` must include these columns:
+
+ | Column | Description |
+ | ------ | ----------- |
+ | `Country` | Developer's country of residence |
+ | `YearsCode` | Total years coding (including education) |
+ | `WorkExp` | Years of professional work experience |
+ | `EdLevel` | Highest education level |
+ | `DevType` | Primary developer role |
+ | `Industry` | Industry the developer works in |
+ | `Age` | Age range |
+ | `ICorPM` | Individual contributor or people manager |
+ | `OrgSize` | Organisation size (number of employees) |
+ | `ConvertedCompYearly` | Annual salary in USD (target variable) |
+
+ ## Input Validation (Two Layers)
+
+ ### Layer 1 — Pydantic schema (`src/schema.py`)
+
+ All 9 fields are required. `years_code` and `work_exp` must be `>= 0`. Validated at
+ object construction time — raises `ValidationError` on failure.
+
+ ### Layer 2 — Runtime guardrails (`src/infer.py`)
+
+ Each categorical field is checked against `config/valid_categories.yaml` at inference
+ time. Raises `ValueError` with a clear message on invalid input.
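The Layer 2 check can be sketched in a few lines (illustrative category values only — the real sets are generated into `config/valid_categories.yaml` by training, and the actual helper names in `src/infer.py` may differ):

```python
# Layer 2 guardrail sketch. Illustrative values only — the real code
# loads these sets from config/valid_categories.yaml at inference time.
VALID_CATEGORIES = {
    "country": {"United States of America", "Germany"},
    "age": {"25-34 years old", "35-44 years old"},
}


def check_category(field: str, value: str) -> None:
    """Raise ValueError if `value` was never seen during training."""
    if value not in VALID_CATEGORIES[field]:
        raise ValueError(
            f"Invalid {field}: {value!r}. "
            "See config/valid_categories.yaml for accepted values."
        )


check_category("country", "Germany")  # a known category passes silently
```

A check like this runs for every categorical field before the model input is built.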
+
  ## Key Files

  ### [src/schema.py](src/schema.py)
+
+ Pydantic v2 `SalaryInput` model — defines all 9 required input fields, types, and
+ constraints. The JSON schema example in the docstring is the canonical usage example.
+
+ ### [src/preprocessing.py](src/preprocessing.py)
+
+ `prepare_features(df)` — takes a raw DataFrame and returns an encoded feature matrix:
+
+ - Unicode apostrophe normalisation on all categorical columns
+ - Rare category → "Other" normalisation
+ - Missing numeric values → 0
+ - Missing categoricals → "Unknown"
+ - One-hot encoding (training uses `drop_first=True`; inference uses `drop_first=False`
+   then reindexes to match training columns)
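The train/inference column alignment in that last bullet can be sketched with pandas (toy column names, not the project's real feature set):

```python
import pandas as pd

# Training side: one-hot encode with drop_first=True and remember columns.
train = pd.DataFrame({"country": ["US", "DE", "US"], "years_code": [5, 8, 2]})
train_X = pd.get_dummies(train, columns=["country"], drop_first=True)
train_columns = list(train_X.columns)  # persisted alongside the model

# Inference side: a single row only produces dummies for its own values...
row = pd.DataFrame({"country": ["DE"], "years_code": [3]})
row_X = pd.get_dummies(row, columns=["country"], drop_first=False)

# ...so reindex to the training columns, filling absent dummies with 0
# (this also drops any dummy the training set never produced).
row_X = row_X.reindex(columns=train_columns, fill_value=0)
```

Without the `reindex`, the inference matrix would have different columns than the model was trained on.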

  ### [src/train.py](src/train.py)
+
+ Full training pipeline:
+
+ 1. Load CSV, filter salaries (min/max percentile per country)
+ 2. `apply_cardinality_reduction` — collapse rare categories to "Other"
+ 3. `drop_other_rows` — remove rows with "Other" in specified columns
+ 4. `prepare_features` — encode features
+ 5. 5-fold CV with MAPE metric
+ 6. Train final XGBoost model on full data with early stopping
+ 7. Save `models/model.pkl`, `config/valid_categories.yaml`, `config/currency_rates.yaml`
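The CV metric in step 5 is mean absolute percentage error; a plain-Python sketch (the pipeline itself may use the scikit-learn equivalent):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


# Two predictions, each off by 10% of the true salary.
print(mape([100_000, 50_000], [110_000, 45_000]))  # 10.0
```

MAPE is a natural fit here because salaries span orders of magnitude across countries, so a relative error is more comparable than an absolute one.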
+
+ ### [src/tune.py](src/tune.py)
+
+ Optuna study over the search space in `config/optuna_config.yaml`. Writes best
+ parameters back into `config/model_parameters.yaml` after the study completes.

  ### [src/infer.py](src/infer.py)
+
+ `predict_salary(SalaryInput)` validates categories, builds a single-row DataFrame,
+ runs `prepare_features`, reindexes to training columns, returns float USD salary.
+ `get_local_currency(country, salary)` converts to local currency using rates from
+ `config/currency_rates.yaml`.
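The currency step amounts to a per-country rate lookup; a sketch with a hypothetical helper and made-up rates (the real values are generated into `config/currency_rates.yaml`, and `get_local_currency`'s exact return type may differ):

```python
# Hypothetical (currency, rate) table for illustration only; the real
# per-country median rates are written by training.
RATES = {
    "United States of America": ("USD", 1.0),
    "Germany": ("EUR", 0.9),
}


def to_local(country: str, salary_usd: float) -> str:
    # Fall back to USD for countries without a stored rate.
    currency, rate = RATES.get(country, ("USD", 1.0))
    return f"{salary_usd * rate:,.0f} {currency}"


print(to_local("Germany", 80_000))  # 72,000 EUR
```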

  ### [app.py](app.py)

+ Streamlit UI with dropdowns populated from `config/valid_categories.yaml` (only values
+ that appeared in the training data). Shows USD + local currency side by side.

+ ## Updating Features

+ When adding a new input feature, update **all** of the following in order:

+ 1. `config/model_parameters.yaml` — add to `features.cardinality.drop_other_from` if applicable
+ 2. `src/schema.py` — add field to `SalaryInput`
+ 3. `src/preprocessing.py` — add to `_categorical_cols` (or numeric handling)
+ 4. `src/train.py` — add to `CATEGORICAL_FEATURES` and `usecols`
+ 5. `src/infer.py` — add validation block and DataFrame column
+ 6. `app.py` — add selectbox, default, sidebar entry, `SalaryInput` construction
+ 7. `tests/conftest.py` — add to `sample_salary_input` fixture
+ 8. `tests/test_schema.py` — assert field, add missing-field test
+ 9. `tests/test_infer.py` — add invalid-value test
+ 10. `tests/test_feature_impact.py` — add to all `base_input` dicts, add impact test
+ 11. `tests/test_preprocessing.py` — add column to all `pd.DataFrame(...)` fixtures
+ 12. `tests/test_train.py` — add column to `_make_salary_df` and all test DataFrames
+ 13. `README.md` — required columns, valid categories list, code example
+ 14. `example_inference.py` — add to all `SalaryInput` calls
+ 15. Retrain: `uv run python -m src.train`
+
+ ## Versioning

+ Follows [Semantic Versioning](https://semver.org/):

+ - **MAJOR** — new required input field, incompatible model artifact, renamed API
+ - **MINOR** — new optional field, new supported country, new Makefile target
+ - **PATCH** — bug fix, model retrain with same schema, config tuning

+ Current version: `2.0.0` (added `OrgSize` required field).

+ Update `pyproject.toml` before tagging:

+ ```bash
+ git tag v2.0.0
+ git push origin v2.0.0
  ```

+ ## Code Style

+ - Line length: **79 characters** (enforced by ruff `E501`)
+ - Formatter: `ruff format` (run via `make format`)
+ - Linter: `ruff check` (run via `make lint`)
+ - No `# noqa` suppressions — fix the code instead

+ ## Common Debugging
+
+ ```bash
+ # Check data columns
+ head -1 data/survey_results_public.csv | tr ',' '\n'

+ # Verify model exists
+ ls -lh models/model.pkl

+ # Check valid categories after training
+ cat config/valid_categories.yaml
+
+ # Run a single test file
+ uv run pytest tests/test_infer.py -v
+
+ # Run pre-commit manually
+ uv run pre-commit run --all-files
+ ```
+
+ ## Troubleshooting
+
+ | Symptom | Fix |
+ | ------- | --- |
+ | `FileNotFoundError: model.pkl` | Run `uv run python -m src.train` |
+ | `FileNotFoundError: valid_categories.yaml` | Same — generated by training |
+ | `ValidationError` on `SalaryInput` | Check all 9 fields are present and non-negative numerics |
+ | `ValueError: Invalid ...` at inference | Value not in `config/valid_categories.yaml`; retrain or use a listed value |
+ | `E501` ruff errors | Lines > 79 chars — split strings, use variables, or wrap lists |
+ | Tests fail after adding a feature | Check the "Updating Features" checklist above |

  ## Additional Resources
+
+ - [README.md](README.md) — User-facing documentation and HuggingFace Space config
+ - [Stack Overflow Survey](https://insights.stackoverflow.com/survey) — Data source
+ - [semver.org](https://semver.org/) — Versioning reference
README.md CHANGED
@@ -14,14 +14,14 @@ license: apache-2.0

  # Developer Salary Prediction

- A minimal, local-first ML application that predicts developer salaries using Stack Overflow Developer Survey data. Built with Python, scikit-learn, Pydantic, and Streamlit.

  ## Features

  - 🎯 XGBoost (gradient boosting) model for salary prediction
- - ✅ Input validation with Pydantic
  - 🌐 Interactive web UI with Streamlit
- - 📊 Trained on Stack Overflow Developer Survey data
  - 🔧 Easy setup with `uv` package manager

  ## Quick Start
@@ -37,14 +37,15 @@ uv sync
  Download the Stack Overflow Developer Survey CSV file:

  1. Visit: https://insights.stackoverflow.com/survey
- 2. Download the latest survey results (2024 or 2025)
  3. Extract the `survey_results_public.csv` file
  4. Place it in the `data/` directory:
- ```
  data/survey_results_public.csv
  ```

- **Required columns:** `Country`, `YearsCode`, `WorkExp`, `EdLevel`, `DevType`, `Industry`, `Age`, `ICorPM`, `ConvertedCompYearly`

  ### 3. Train the Model

@@ -53,11 +54,14 @@ uv run python -m src.train
  ```

  This will:
  - Load configuration from `config/model_parameters.yaml`
- - Load and preprocess the survey data (with cardinality reduction)
- - Train an XGBoost model with early stopping
- - Save the model to `models/model.pkl`
- - Generate `config/valid_categories.yaml` with valid country, education, developer type, industry, age, and IC/PM values

  ### 4. Run the Streamlit App

@@ -67,11 +71,68 @@ uv run streamlit run app.py
  ```

  The app will open in your browser at `http://localhost:8501`

  ## Usage

  ### Web Interface

  Launch the Streamlit app and enter:
  - **Country**: Developer's country
  - **Years of Coding (Total)**: Total years coding including education
  - **Years of Professional Work Experience**: Years of professional work experience
@@ -80,18 +141,17 @@ Launch the Streamlit app and enter:
  - **Industry**: Industry the developer works in
  - **Age**: Developer's age range
  - **IC or PM**: Individual contributor or people manager

- Click "Predict Salary" to see the estimated annual salary.

  ### Programmatic Usage

- **Quick example:**
-
  ```python
  from src.schema import SalaryInput
  from src.infer import predict_salary

- # Create input
  input_data = SalaryInput(
      country="United States of America",
      years_code=5.0,
@@ -100,10 +160,10 @@ input_data = SalaryInput(
      dev_type="Developer, full-stack",
      industry="Software Development",
      age="25-34 years old",
-     ic_or_pm="Individual contributor"
  )

- # Get prediction
  salary = predict_salary(input_data)
  print(f"Estimated salary: ${salary:,.0f}")
  ```
@@ -114,186 +174,365 @@ print(f"Estimated salary: ${salary:,.0f}")
  uv run python example_inference.py
  ```

- This will show predictions for multiple sample scenarios (junior, mid-level, senior developers, different countries).

- ## Input Validation

- The model validates inputs against actual training data categories:

- - **Valid Countries**: Only countries from `config/valid_categories.yaml` (~21 countries)
- - **Valid Education Levels**: Only education levels from training data (~9 levels)
- - **Valid Developer Types**: Only developer types from training data (~20 types)
- - **Valid Industries**: Only industries from training data (~15 industries)
- - **Valid Age Ranges**: Only age ranges from training data (~7 ranges)
- - **Valid IC/PM Values**: Only IC/PM values from training data (~3 values)

- The Streamlit app uses dropdown menus with only valid options. If you use the programmatic API with invalid values, you'll get a helpful error message pointing to the valid categories file.

- **Example validation:**
  ```python
  from src.infer import predict_salary
  from src.schema import SalaryInput

- # This will raise ValueError - Japan not in training data after cardinality reduction
- invalid_input = SalaryInput(
-     country="Japan",  # Invalid!
-     years_code=5.0,
-     work_exp=3.0,
-     education_level="Bachelor's degree (B.A., B.S., B.Eng., etc.)",
-     dev_type="Developer, back-end",
-     industry="Software Development",
-     age="25-34 years old",
-     ic_or_pm="Individual contributor"
- )
  ```

  **View valid categories:**
  ```bash
  cat config/valid_categories.yaml
  ```

- ## Configuration

- Model parameters are centralized in [config/model_parameters.yaml](config/model_parameters.yaml). You can customize:

- - **Data Processing**: Salary thresholds, percentile bounds, train/test split ratio
- - **Feature Engineering**: Cardinality reduction settings (max categories, min frequency)
- - **Model Hyperparameters**: Learning rate, tree depth, early stopping, etc.
- - **Training Settings**: Verbosity, model save path

- **To modify parameters:**

  ```bash
- # Edit the config file
- nano config/model_parameters.yaml

- # Then retrain the model
- uv run python -m src.train
  ```

  **Example parameter changes:**
  ```yaml
  # Increase model complexity
  model:
-   max_depth: 8          # Default: 6
    n_estimators: 10000   # Default: 5000

  # Keep more categories
  features:
    cardinality:
-     max_categories: 30  # Default: 20
-     min_frequency: 100  # Default: 50
  ```

  ## Project Structure

- ```
  .
  ├── config/
- │   ├── model_parameters.yaml      # Model configuration
- │   └── valid_categories.yaml      # Valid input categories (generated)
  ├── data/
  │   └── survey_results_public.csv  # Stack Overflow survey data (download required)
  ├── models/
- │   └── model.pkl                  # Trained model (generated)
  ├── src/
- │   ├── __init__.py                # Package initialization
- │   ├── schema.py                  # Pydantic models
- │   ├── preprocessing.py           # Feature engineering utilities
- │   ├── train.py                   # Training script
- │   └── infer.py                   # Inference utilities
  ├── app.py                         # Streamlit web app
- ├── example_inference.py           # Example inference script
  ├── pyproject.toml                 # Project dependencies
- └── README.md                      # This file
  ```

  ## Tech Stack

  - **Python 3.12+**
- - **uv** - Package manager
- - **pandas** - Data manipulation
- - **xgboost** - Gradient boosting model
- - **scikit-learn** - ML utilities (train/test split)
- - **pydantic** - Data validation
- - **streamlit** - Web UI

  ## Development

  For detailed development information, see [Claude.md](Claude.md).

  ### Re-training the Model

  If you want to use a different survey year or update the model:

  ```bash
- # Place new CSV in data/ directory
  uv run python -m src.train
  ```

  ### Running Tests

- **Quick one-liner test:**
  ```bash
- uv run python -c "from src.schema import SalaryInput; from src.infer import predict_salary; test = SalaryInput(country='United States of America', years_code=5.0, work_exp=3.0, education_level='Bachelor'\''s degree (B.A., B.S., B.Eng., etc.)', dev_type='Developer, full-stack', industry='Software Development', age='25-34 years old', ic_or_pm='Individual contributor'); print(f'Prediction: \${predict_salary(test):,.0f}')"
- ```

- **Or run the full example script:**
- ```bash
- uv run python example_inference.py
  ```

- ## Deployment

- ### Hugging Face Spaces

- This application is Docker-ready for deployment on Hugging Face Spaces:

- **1. Build the Docker image:**
  ```bash
- docker build -t developer-salary-predictor .
  ```

- **2. Test locally:**
  ```bash
- docker run -p 8501:8501 developer-salary-predictor
  ```

- Then visit `http://localhost:8501`

- **3. Deploy to Hugging Face:**

- 1. Create a new Space on [Hugging Face](https://huggingface.co/new-space)
- 2. Select "Docker" as the SDK
- 3. Clone your Space repository
- 4. Copy these files to your Space:

- ```text
- Dockerfile
- requirements.txt
- app.py
- src/
- config/
- models/
- ```

- 5. Push to your Space:
  ```bash
- git add .
- git commit -m "Initial deployment"
- git push
  ```

- **Note:** The pre-trained model (`models/model.pkl`) and configuration (`config/valid_categories.yaml`) are included in the Docker image. If you want to use a different model, retrain locally first, then rebuild the Docker image.

- ### Alternative: Local Deployment

  **Using uv (recommended for development):**
  ```bash
  uv run streamlit run app.py
  ```

  **Using pip:**
  ```bash
  pip install -r requirements.txt
  streamlit run app.py
@@ -302,24 +541,34 @@ streamlit run app.py
  ## Troubleshooting

  ### "Model file not found"
  - Run `uv run python -m src.train` first to generate the model

  ### "Data file not found"
  - Download the Stack Overflow survey CSV and place it in `data/`

  ### "Configuration file not found"
  - The `config/model_parameters.yaml` file should exist in the project root
  - Check that you're running commands from the project root directory

  ### Dependencies issues
  - Run `uv sync` to ensure all packages are installed

  ## Design Principles

- - **Simplicity**: Under 200 lines of code total
- - **Clarity**: Easy to understand and modify
- - **Local-first**: No cloud dependencies
- - **Hackable**: Plain Python, no complex frameworks

  ## License
# Developer Salary Prediction

+ A minimal, local-first ML application that predicts developer salaries using Stack Overflow Developer Survey data. Built with Python, XGBoost, Pydantic, and Streamlit.

## Features

- 🎯 XGBoost (gradient boosting) model for salary prediction
+ - ✅ Input validation with Pydantic (schema) and runtime guardrails (valid categories)
- 🌐 Interactive web UI with Streamlit
+ - 📊 Trained on Stack Overflow 2025 Developer Survey data
- 🔧 Easy setup with `uv` package manager

## Quick Start
 
Download the Stack Overflow Developer Survey CSV file:

1. Visit: https://insights.stackoverflow.com/survey
+ 2. Download the latest survey results (2025)
3. Extract the `survey_results_public.csv` file
4. Place it in the `data/` directory:
+
+ ```text
data/survey_results_public.csv
```

+ **Required columns:** `Country`, `YearsCode`, `WorkExp`, `EdLevel`, `DevType`, `Industry`, `Age`, `ICorPM`, `OrgSize`, `ConvertedCompYearly`
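Before training, it is worth confirming a freshly downloaded CSV actually carries these columns. A minimal stdlib sketch (the real pipeline reads the file with pandas; `missing_columns` is an illustrative helper, not part of the repo):

```python
import csv
import io

# Columns the training pipeline expects (listed in the README above)
REQUIRED = {
    "Country", "YearsCode", "WorkExp", "EdLevel", "DevType",
    "Industry", "Age", "ICorPM", "OrgSize", "ConvertedCompYearly",
}

def missing_columns(csv_text: str) -> set[str]:
    """Return the required columns absent from the CSV's header row."""
    header = set(next(csv.reader(io.StringIO(csv_text))))
    return REQUIRED - header

# Example: a header that forgot WorkExp
sample = "Country,YearsCode,EdLevel,DevType,Industry,Age,ICorPM,OrgSize,ConvertedCompYearly\n"
```

For a real file, pass `open(path).read()` (or just its first line) instead of `sample`.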
### 3. Train the Model

```bash
uv run python -m src.train
```

This will:
+
- Load configuration from `config/model_parameters.yaml`
+ - Filter salaries and reduce cardinality of categorical features
+ - Run 5-fold cross-validation and report mean MAPE per fold
+ - Train a final XGBoost model on the full dataset with early stopping
+ - Save the model artifact to `models/model.pkl`
+ - Generate `config/valid_categories.yaml` — valid input values for runtime guardrails
+ - Generate `config/currency_rates.yaml` — per-country median currency conversion rates
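The last step above boils down to "group conversion rates by country, take the median". A stdlib sketch of that idea (the row structure here is hypothetical; the real code in `src/train.py` derives rates from the survey's compensation columns):

```python
from collections import defaultdict
from statistics import median

def median_rates(rows: list[dict]) -> dict[str, float]:
    """Per-country median of observed currency conversion rates."""
    by_country: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_country[row["Country"]].append(row["rate"])
    # Median is robust to the occasional mis-reported rate
    return {country: median(r) for country, r in by_country.items()}

rates = median_rates([
    {"Country": "Germany", "rate": 0.92},
    {"Country": "Germany", "rate": 0.94},
    {"Country": "Germany", "rate": 0.90},
])
# median of [0.90, 0.92, 0.94] -> 0.92
```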
### 4. Run the Streamlit App

```bash
uv run streamlit run app.py
```

The app will open in your browser at `http://localhost:8501`

+ ## Development Cycle
+
+ The full development workflow from data to deployment:
+
+ ```text
+ data/ ──► (optional) tune ──► train ──► test ──► commit ──► CI passes ──► deploy
+ ```
+
+ ### Step-by-step
+
+ #### 1. (Optional) Tune hyperparameters
+
+ Run Optuna to search for optimal XGBoost hyperparameters. The search space is
+ defined in `config/optuna_config.yaml`. Best parameters are written directly
+ back into `config/model_parameters.yaml`.
+
+ ```bash
+ make tune
+ # or with a custom number of trials:
+ uv run python -m src.tune --n-trials 50
+ ```
+
+ #### 2. Train the model
+
+ ```bash
+ uv run python -m src.train
+ ```
+
+ #### 3. Check code quality (lint + test + complexity + security)
+
+ ```bash
+ make check
+ ```
+
+ This runs all quality gates in sequence:
+
+ | Target | Tool | What it checks |
+ | ------ | ---- | -------------- |
+ | `make lint` | ruff | Style and linting errors |
+ | `make format` | ruff | Auto-formats code |
+ | `make test` | pytest | Unit and integration tests |
+ | `make coverage` | pytest-cov | Test coverage report |
+ | `make complexity` | radon CC | Cyclomatic complexity |
+ | `make maintainability` | radon MI | Maintainability index |
+ | `make audit` | pip-audit | Dependency vulnerability scan |
+ | `make security` | bandit | Static security analysis |
+
+ `make check` runs lint, test, complexity, maintainability, audit, and security together.
+ `make all` is an alias for `make check`.
+
+ #### 4. Run all pre-commit checks manually
+
+ ```bash
+ uv run pre-commit run --all-files
+ ```
+
## Usage

### Web Interface

Launch the Streamlit app and enter:
+
- **Country**: Developer's country
- **Years of Coding (Total)**: Total years coding including education
- **Years of Professional Work Experience**: Years of professional work experience
- **Industry**: Industry the developer works in
- **Age**: Developer's age range
- **IC or PM**: Individual contributor or people manager
+ - **Organization Size**: Approximate number of employees at the developer's company

+ Click "Predict Salary" to see the estimated annual salary in USD plus a local
+ currency equivalent where available.

### Programmatic Usage

```python
from src.schema import SalaryInput
from src.infer import predict_salary

input_data = SalaryInput(
    country="United States of America",
    years_code=5.0,
    dev_type="Developer, full-stack",
    industry="Software Development",
    age="25-34 years old",
+     ic_or_pm="Individual contributor",
+     org_size="20 to 99 employees",
)

salary = predict_salary(input_data)
print(f"Estimated salary: ${salary:,.0f}")
```
 
uv run python example_inference.py
```

+ ## Input Validation and Guardrails
+
+ Validation is enforced at two layers:
+
+ ### Layer 1 — Pydantic schema (`src/schema.py`)
+
+ Checked at object construction time:

+ - All 9 fields are required
+ - `years_code` must be `>= 0`
+ - `work_exp` must be `>= 0`
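`SalaryInput` expresses these constraints with Pydantic `Field(..., ge=0)`; the same pattern, sketched with only the stdlib for illustration (`SalaryInputSketch` is a hypothetical stand-in, not the real model):

```python
from dataclasses import dataclass

@dataclass
class SalaryInputSketch:
    """Stdlib stand-in for the Pydantic model (illustration only)."""
    years_code: float
    work_exp: float

    def __post_init__(self) -> None:
        # Mirrors Pydantic's Field(..., ge=0): reject negative experience
        for name in ("years_code", "work_exp"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be >= 0")

ok = SalaryInputSketch(years_code=5.0, work_exp=3.0)  # constructs fine
```

With the real model, the same violation raises a `pydantic.ValidationError` instead of a plain `ValueError`.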
+ ### Layer 2 — Runtime category guardrails (`src/infer.py`)

+ Checked at inference time against `config/valid_categories.yaml`, which is
+ generated during training to reflect only categories that appeared frequently
+ enough in the training data (controlled by `features.cardinality.min_frequency`
+ in `config/model_parameters.yaml`):

+ - **Valid Countries** (~21) — low-frequency countries collapsed to `Other`, which is then dropped
+ - **Valid Education Levels** (~9)
+ - **Valid Developer Types** (~20) — `Other` dropped
+ - **Valid Industries** (~15) — `Other` dropped
+ - **Valid Age Ranges** (~7) — `Other` dropped
+ - **Valid IC/PM Values** (~3) — `Other` dropped
+ - **Valid Organization Sizes** (~8) — `Other` dropped
+
+ Passing an invalid value raises a `ValueError` with a message pointing to
+ `config/valid_categories.yaml`.
+
+ **Example:**

```python
from src.infer import predict_salary
from src.schema import SalaryInput

+ # Raises ValueError: "Invalid country: 'Japan'. Check config/valid_categories.yaml"
+ predict_salary(SalaryInput(country="Japan", ...))
```

**View valid categories:**
+
```bash
cat config/valid_categories.yaml
```

+ ### Model guardrails (`config/model_parameters.yaml`)

+ The `guardrails` section defines thresholds used during training evaluation:

+ ```yaml
+ guardrails:
+   max_mape_per_category: 100  # max acceptable MAPE per category (%)
+   max_abs_pct_diff: 100       # max acceptable absolute % difference
+ ```
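How such a threshold gets applied can be sketched in plain Python (the actual logic lives in `guardrail_evaluation.py`; the sample salaries below are made up):

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

MAX_MAPE_PER_CATEGORY = 100  # threshold from the guardrails config above

# Hypothetical (actual, predicted) salary pairs per category
by_category = {
    "Germany": ([80_000.0, 90_000.0], [88_000.0, 81_000.0]),
}

# Flag any category whose MAPE exceeds the configured ceiling
violations = {
    cat: mape(a, p)
    for cat, (a, p) in by_category.items()
    if mape(a, p) > MAX_MAPE_PER_CATEGORY
}
```

With the default threshold of 100%, only categories the model gets badly wrong on average would appear in `violations`.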
+
+ ## Testing

+ Tests live in `tests/` and cover all major modules:
+
+ | File | What it tests |
+ | ---- | ------------- |
+ | `test_schema.py` | Pydantic validation — required fields, `ge=0` constraints |
+ | `test_infer.py` | Inference pipeline — valid predictions, `ValueError` on invalid categories, currency lookup |
+ | `test_train.py` | Training helpers — salary filtering, cardinality reduction, valid category extraction, currency rate computation |
+ | `test_preprocessing.py` | Feature engineering — one-hot encoding, numeric transforms |
+ | `test_tune.py` | Optuna helpers — parameter sampling, objective function construction, best-param saving |
+ | `test_feature_impact.py` | Model sanity — changing each input feature (country, education, dev type, etc.) produces a distinct prediction |
+
+ Run all tests:

```bash
+ make test
+ ```

+ Run with coverage:
+
+ ```bash
+ make coverage
```

+ ## Configuration
+
+ All runtime parameters are centralised in two YAML files:
+
+ ### `config/model_parameters.yaml`
+
+ Controls data processing, feature engineering, model hyperparameters, training
+ settings, and guardrail thresholds. You can customise:
+
+ - **Data Processing**: Salary thresholds, percentile bounds, train/test split ratio
+ - **Feature Engineering**: Cardinality reduction settings (max categories, min frequency)
+ - **Model Hyperparameters**: Learning rate, tree depth, early stopping, etc.
+ - **Training Settings**: Verbosity, model save path
+ - **Guardrails**: MAPE thresholds for model evaluation
+
**Example parameter changes:**
+
```yaml
# Increase model complexity
model:
+   max_depth: 8          # Default: 3
  n_estimators: 10000   # Default: 5000

# Keep more categories
features:
  cardinality:
+     max_categories: 50  # Default: 30
+     min_frequency: 20   # Default: 50
```
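The cardinality settings above collapse rare category values into `Other` before encoding. The core idea, sketched in plain Python (the real implementation lives in `src/train.py`; `reduce_cardinality` here is illustrative):

```python
from collections import Counter

def reduce_cardinality(values, max_categories=30, min_frequency=50):
    """Keep frequent categories; map everything else to 'Other'."""
    counts = Counter(values)
    # At most `max_categories` survivors, each seen at least `min_frequency` times
    keep = {
        cat
        for cat, n in counts.most_common(max_categories)
        if n >= min_frequency
    }
    return [v if v in keep else "Other" for v in values]

vals = ["US"] * 60 + ["DE"] * 55 + ["XY"] * 3
reduced = reduce_cardinality(vals)
# "US" and "DE" survive; the 3 "XY" rows collapse to "Other"
```

This is why `config/valid_categories.yaml`, generated at train time, only lists the surviving categories.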

+ ### `config/optuna_config.yaml`
+
+ Controls the Optuna hyperparameter search — search space (type, bounds, log
+ scale), number of trials, CV folds, and fixed parameters that are not tuned
+ (e.g. `n_estimators`, `random_state`).
+
## Project Structure

+ ```text
.
+ ├── .github/
+ │   └── workflows/
+ │       └── ci.yml                # GitHub Actions CI (lint + test)
├── config/
+ │   ├── model_parameters.yaml     # Model configuration and guardrails
+ │   ├── optuna_config.yaml        # Optuna hyperparameter search space
+ │   ├── valid_categories.yaml     # Valid input categories (generated by training)
+ │   └── currency_rates.yaml       # Per-country currency rates (generated by training)
├── data/
│   └── survey_results_public.csv  # Stack Overflow survey data (download required)
├── models/
+ │   └── model.pkl                 # Trained model artifact (generated by training)
├── src/
+ │   ├── __init__.py
+ │   ├── schema.py                 # Pydantic input model
+ │   ├── preprocessing.py          # Feature engineering (one-hot encoding, scaling)
+ │   ├── train.py                  # Training pipeline
+ │   ├── tune.py                   # Optuna hyperparameter optimisation
+ │   └── infer.py                  # Inference with runtime guardrails
├── tests/
+ │   ├── conftest.py               # Shared pytest fixtures
+ │   ├── test_schema.py
+ │   ├── test_infer.py
+ │   ├── test_train.py
+ │   ├── test_preprocessing.py
+ │   ├── test_tune.py
+ │   └── test_feature_impact.py
├── app.py                         # Streamlit web app
+ ├── example_inference.py          # Inference usage examples
+ ├── Makefile                      # Developer workflow commands
+ ├── .pre-commit-config.yaml       # Pre-commit hooks
├── pyproject.toml                 # Project dependencies
+ └── README.md                     # This file (also Hugging Face Space config)
```

## Tech Stack

- **Python 3.12+**
+ - **uv** — Package manager
+ - **pandas** — Data manipulation
+ - **xgboost** — Gradient boosting model
+ - **scikit-learn** — Cross-validation and train/test split
+ - **optuna** — Hyperparameter optimisation
+ - **pydantic** — Input schema validation
+ - **streamlit** — Web UI
+ - **ruff** — Linting and formatting
+ - **radon** — Complexity and maintainability metrics
+ - **bandit** — Static security analysis
+ - **pip-audit** — Dependency vulnerability scanning

## Development

For detailed development information, see [Claude.md](Claude.md).
 
+ ### Code Quality
+
+ #### Pre-commit hooks
+
+ The project uses [pre-commit](https://pre-commit.com) to enforce code quality checks before each commit. Hooks are defined in [.pre-commit-config.yaml](.pre-commit-config.yaml) and run:
+
+ - **ruff format** — auto-formats Python files (`make format`)
+ - **ruff lint** — checks for linting errors (`make lint`)
+ - **Standard checks** — trailing whitespace, end-of-file newline, LF line endings, valid YAML/TOML/JSON, large files, merge conflict markers, stray debug statements
+
+ **Install hooks** (once, after cloning):
+
+ ```bash
+ uv run pre-commit install
+ ```
+
+ Hooks will then run automatically on every `git commit`. To run them manually against all files:
+
+ ```bash
+ uv run pre-commit run --all-files
+ ```
+
+ #### GitHub Actions CI
+
+ A CI workflow ([.github/workflows/ci.yml](.github/workflows/ci.yml)) runs automatically on every push to any branch. It:
+
+ 1. Sets up Python 3.12 and installs `uv`
+ 2. Installs all dependencies (`uv sync --all-extras`)
+ 3. Runs `make lint` — ruff linting
+ 4. Runs `make test` — full pytest suite
+
+ The workflow must pass before merging changes.
+
### Re-training the Model

If you want to use a different survey year or update the model:

```bash
+ # 1. Place new CSV in data/
+ # 2. (Optional) tune first
+ make tune
+ # 3. Retrain
uv run python -m src.train
```

### Running Tests

```bash
+ # Run all tests
+ make test
+
+ # Run with coverage report
+ make coverage
+
+ # Run a specific test file
+ uv run pytest tests/test_infer.py -v
```

+ ## Versioning

+ This project follows [Semantic Versioning](https://semver.org/) (`MAJOR.MINOR.PATCH`):
+
+ | Version bump | When to use | Examples |
+ | --- | --- | --- |
+ | **MAJOR** | Breaking changes to the public interface | New required input field, incompatible model artifact format, renamed API |
+ | **MINOR** | Backward-compatible new features | New optional input field, new supported country, new Makefile target, UI addition |
+ | **PATCH** | Backward-compatible fixes and improvements | Bug fixes, model retrain with same schema, config tuning, dependency updates |

+ **Pre-release suffixes** (for work in progress):
+
+ ```text
+ v1.0.0-alpha.1  # early development, unstable
+ v1.0.0-beta.1   # feature-complete, under testing
+ v1.0.0-rc.1     # release candidate, final validation
+ ```
+
+ Tags are applied on `main` after a successful CI run:

```bash
+ git tag v2.0.0
+ git push origin v2.0.0
+ ```
+
+ ## Branching Strategy
+
+ The project uses a **GitFlow-inspired** branching model:
+
+ ```text
+ main ◄──────────────────────────────────── hotfix/v2.0.1
+   ▲                                             │
+   │ merge + tag                                 │
+   │                                      (urgent fix)
+ develop ◄──── feature/add-currency-display
+         ◄──── feature/new-dev-types
+         ◄──── fix/invalid-category-message
+
+         └──► release/v2.1.0 ──► (final testing) ──► main + tag v2.1.0
  ```

+ ### Branches
+
+ | Branch | Purpose | Merges into |
+ | ------ | ------- | ----------- |
+ | `main` | Production-ready code, always deployable. Tagged on every release. | — |
+ | `develop` | Integration branch for completed features. Base for new work. | `main` via release branch |
+ | `feature/<name>` | New features or improvements (e.g. `feature/add-local-currency`) | `develop` |
+ | `fix/<name>` | Non-urgent bug fixes (e.g. `fix/guardrail-error-message`) | `develop` |
+ | `release/v<semver>` | Release preparation — version bump, changelog, final QA | `main` and back to `develop` |
+ | `hotfix/v<semver>` | Urgent production fixes (e.g. `hotfix/v2.0.1`) | `main` and back to `develop` |
+
+ ### Rules
+
+ - **`main`** is protected — no direct pushes; merge only via PR after CI passes
+ - **`develop`** is the default branch for day-to-day work
+ - Branch names use lowercase kebab-case: `feature/optuna-cv-splits`
+ - Every merge to `main` is tagged with a semver version
+ - Hotfixes branch off `main` directly and merge back to both `main` and `develop`
+
+ ### Typical workflow
+
```bash
+ # Start a new feature
+ git checkout develop
+ git pull origin develop
+ git checkout -b feature/add-local-currency
+
+ # ... work, commit, push ...
+ git push -u origin feature/add-local-currency
+
+ # Open a PR into develop, CI must pass before merging
+
+ # Prepare a release
+ git checkout -b release/v2.1.0 develop
+ # bump version in pyproject.toml, update changelog
+ git push -u origin release/v2.1.0
+ # Open PR into main, merge, tag
+
+ git tag v2.1.0
+ git push origin v2.1.0
```

+ ## Deployment
+
+ ### Hugging Face Spaces

+ The app is deployed on [Hugging Face Spaces](https://huggingface.co/spaces) using the Docker SDK. The Space configuration is embedded in the frontmatter at the top of this README, which Hugging Face reads automatically:

+ - **SDK**: Docker (runs the `Dockerfile` in the repo root)
+ - **Port**: 8501 (Streamlit default)
+ - **License**: Apache 2.0

+ To deploy your own copy:
+
+ 1. Create a new Space on [Hugging Face](https://huggingface.co/new-space) and select "Docker" as the SDK
+ 2. Push this repository to your Space:

```bash
+ git remote add space https://huggingface.co/spaces/<your-username>/<your-space-name>
+ git push space main
```

+ **Note:** The pre-trained model (`models/model.pkl`) and configuration (`config/valid_categories.yaml`, `config/currency_rates.yaml`) must be present before building the Docker image. Train locally first if needed.

+ ### Local Docker
+
+ **Build and run:**
+
+ ```bash
+ docker build -t developer-salary-predictor .
+ docker run -p 8501:8501 developer-salary-predictor
+ ```
+
+ Then visit `http://localhost:8501`
+
+ ### Local (without Docker)

**Using uv (recommended for development):**
+
```bash
uv run streamlit run app.py
```

**Using pip:**
+
```bash
pip install -r requirements.txt
streamlit run app.py
```
 
## Troubleshooting

### "Model file not found"
+
- Run `uv run python -m src.train` first to generate the model

+ ### "Valid categories file not found"
+
+ - Run `uv run python -m src.train` — training generates both `models/model.pkl`
+   and `config/valid_categories.yaml`
+
### "Data file not found"
+
- Download the Stack Overflow survey CSV and place it in `data/`

### "Configuration file not found"
+
- The `config/model_parameters.yaml` file should exist in the project root
- Check that you're running commands from the project root directory

### Dependencies issues
+
- Run `uv sync` to ensure all packages are installed

## Design Principles

+ - **Simplicity**: Minimal codebase, easy to read and modify
+ - **Separation of concerns**: Schema validation, preprocessing, training, and inference are distinct modules
+ - **Config-driven**: All tunable parameters in YAML — no magic numbers in code
+ - **Local-first**: No cloud dependencies for training or inference
+ - **Testable**: Every public function has unit tests; model sanity covered by feature-impact tests

## License
 
app.py CHANGED
@@ -33,6 +33,7 @@ with st.sidebar:
        - Industry
        - Age
        - Individual contributor or people manager
+         - Organization size
        """
    )
    st.info("💡 Tip: Results are estimates based on survey averages.")
@@ -45,6 +46,7 @@ with st.sidebar:
    st.write(f"**Industries:** {len(valid_categories['Industry'])} available")
    st.write(f"**Age Ranges:** {len(valid_categories['Age'])} available")
    st.write(f"**IC/PM Roles:** {len(valid_categories['ICorPM'])} available")
+     st.write(f"**Org Sizes:** {len(valid_categories['OrgSize'])} available")
    st.caption("Only values from the training data are shown in the dropdowns.")

# Main input form
@@ -59,6 +61,7 @@ valid_dev_types = valid_categories["DevType"]
valid_industries = valid_categories["Industry"]
valid_ages = valid_categories["Age"]
valid_icorpm = valid_categories["ICorPM"]
+ valid_org_sizes = valid_categories["OrgSize"]

# Set default values (if available)
default_country = (
@@ -87,6 +90,11 @@ default_icorpm = (
    if "Individual contributor" in valid_icorpm
    else valid_icorpm[0]
)
+ default_org_size = (
+     "20 to 99 employees"
+     if "20 to 99 employees" in valid_org_sizes
+     else valid_org_sizes[0]
+ )

with col1:
    country = st.selectbox(
@@ -150,6 +158,13 @@ ic_or_pm = st.selectbox(
    help="Are you an individual contributor or people manager?",
)

+ org_size = st.selectbox(
+     "Organization Size",
+     options=valid_org_sizes,
+     index=valid_org_sizes.index(default_org_size),
+     help="Approximate number of employees at the developer's company",
+ )
+
# Prediction button
if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
    try:
@@ -163,6 +178,7 @@ if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
            industry=industry,
            age=age,
            ic_or_pm=ic_or_pm,
+             org_size=org_size,
        )

        # Make prediction
config/model_parameters.yaml CHANGED
@@ -16,6 +16,7 @@ features:
    - Industry
    - Age
    - ICorPM
+     - OrgSize
  encoding:
    drop_first: true
model:
config/valid_categories.yaml CHANGED
@@ -93,3 +93,13 @@ Age:
ICorPM:
- Individual contributor
- People manager
+ OrgSize:
+ - 1,000 to 4,999 employees
+ - 10,000 or more employees
+ - 100 to 499 employees
+ - 20 to 99 employees
+ - 5,000 to 9,999 employees
+ - 500 to 999 employees
+ - I don't know
+ - Just me - I am a freelancer, sole proprietor, etc.
+ - Less than 20 employees
example_inference.py CHANGED
@@ -24,6 +24,7 @@ def main():
        industry="Software Development",
        age="25-34 years old",
        ic_or_pm="Individual contributor",
+         org_size="20 to 99 employees",
    )

    print(f"Country: {input_data_1.country}")
@@ -34,6 +35,7 @@ def main():
    print(f"Industry: {input_data_1.industry}")
    print(f"Age: {input_data_1.age}")
    print(f"IC or PM: {input_data_1.ic_or_pm}")
+     print(f"Organization Size: {input_data_1.org_size}")

    salary_1 = predict_salary(input_data_1)
    print(f"💰 Predicted Salary: ${salary_1:,.2f} USD/year")
@@ -51,6 +53,7 @@ def main():
        industry="Fintech",
        age="18-24 years old",
        ic_or_pm="Individual contributor",
+         org_size="20 to 99 employees",
    )

    print(f"Country: {input_data_2.country}")
@@ -61,6 +64,7 @@ def main():
    print(f"Industry: {input_data_2.industry}")
    print(f"Age: {input_data_2.age}")
    print(f"IC or PM: {input_data_2.ic_or_pm}")
+     print(f"Organization Size: {input_data_2.org_size}")

    salary_2 = predict_salary(input_data_2)
    print(f"💰 Predicted Salary: ${salary_2:,.2f} USD/year")
@@ -78,6 +82,7 @@ def main():
        industry="Banking/Financial Services",
        age="35-44 years old",
        ic_or_pm="People manager",
+         org_size="1,000 to 4,999 employees",
    )

    print(f"Country: {input_data_3.country}")
@@ -88,6 +93,7 @@ def main():
    print(f"Industry: {input_data_3.industry}")
    print(f"Age: {input_data_3.age}")
    print(f"IC or PM: {input_data_3.ic_or_pm}")
+     print(f"Organization Size: {input_data_3.org_size}")

    salary_3 = predict_salary(input_data_3)
    print(f"💰 Predicted Salary: ${salary_3:,.2f} USD/year")
@@ -105,6 +111,7 @@ def main():
        industry="Manufacturing",
        age="25-34 years old",
        ic_or_pm="Individual contributor",
+         org_size="100 to 499 employees",
    )

    print(f"Country: {input_data_4.country}")
@@ -115,6 +122,7 @@ def main():
    print(f"Industry: {input_data_4.industry}")
    print(f"Age: {input_data_4.age}")
    print(f"IC or PM: {input_data_4.ic_or_pm}")
+     print(f"Organization Size: {input_data_4.org_size}")

    salary_4 = predict_salary(input_data_4)
    print(f"💰 Predicted Salary: ${salary_4:,.2f} USD/year")
guardrail_evaluation.py CHANGED
@@ -172,12 +172,8 @@ def compute_category_metrics(
def format_table(metrics_df: pd.DataFrame) -> str:
    """Format metrics DataFrame as a markdown table."""
    lines = []
- header = (
-     "| Category | Count | MAPE (%) | Mean Actual ($) | Mean Predicted ($) | Abs % Diff |"
- )
- sep = (
-     "|----------|------:|---------:|----------------:|-------------------:|-----------:|"
- )
+ header = "| Category | Count | MAPE (%) | Mean Actual ($) | Mean Predicted ($) | Abs % Diff |"
+ sep = "|----------|------:|---------:|----------------:|-------------------:|-----------:|"
    lines.append(header)
    lines.append(sep)
models/model.pkl CHANGED
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
- oid sha256:7a22e7a728aeb84f766e9acbef698afe0b4733a3385eed44d1663dc771d68be2
- size 1851836
+ oid sha256:295c1996202ba1a93f502705986ac96ff3e802d4009479a22d82d52b0b5e7f42
+ size 1830437
pyproject.toml CHANGED
@@ -1,6 +1,6 @@
[project]
name = "developer-salary-prediction"
- version = "0.1.0"
+ version = "2.0.0"
description = "Simple ML app for predicting developer salaries using Stack Overflow survey data"
readme = "README.md"
requires-python = ">=3.12"
@@ -17,6 +17,7 @@ dependencies = [
    "radon>=6.0.1",
    "pip-audit>=2.10.0",
    "bandit>=1.9.3",
+     "pre-commit>=4.5.1",
]

[project.optional-dependencies]
src/infer.py CHANGED
@@ -113,6 +113,13 @@ def predict_salary(data: SalaryInput) -> float:
            f"Check config/valid_categories.yaml for all valid values."
        )

+     if data.org_size not in valid_categories["OrgSize"]:
+         raise ValueError(
+             f"Invalid organization size: '{data.org_size}'. "
+             f"Must be one of {len(valid_categories['OrgSize'])} valid sizes. "
+             f"Check config/valid_categories.yaml for all valid values."
+         )
+
    # Create a DataFrame with the input data
    input_df = pd.DataFrame(
        {
@@ -124,6 +131,7 @@ def predict_salary(data: SalaryInput) -> float:
            "Industry": [data.industry],
            "Age": [data.age],
            "ICorPM": [data.ic_or_pm],
+             "OrgSize": [data.org_size],
        }
    )
src/preprocessing.py CHANGED
@@ -78,7 +78,8 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
     during training and inference, preventing data leakage and inconsistencies.

     Args:
-        df: DataFrame with columns: Country, YearsCode, WorkExp, EdLevel, DevType, Industry, Age, ICorPM
+        df: DataFrame with columns: Country, YearsCode, WorkExp, EdLevel,
+            DevType, Industry, Age, ICorPM, OrgSize.
         NOTE: During training, cardinality reduction should be applied to df
         BEFORE calling this function. During inference, valid_categories.yaml
         ensures only valid (already-reduced) categories are used.
@@ -98,14 +99,23 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:

     # Normalize Unicode apostrophes to regular apostrophes for consistency
     # This handles cases where data has \u2019 (') instead of '
-    for col in ["Country", "EdLevel", "DevType", "Industry", "Age", "ICorPM"]:
+    _categorical_cols = [
+        "Country",
+        "EdLevel",
+        "DevType",
+        "Industry",
+        "Age",
+        "ICorPM",
+        "OrgSize",
+    ]
+    for col in _categorical_cols:
         if col in df_processed.columns:
             df_processed[col] = df_processed[col].str.replace(
                 "\u2019", "'", regex=False
             )

     # Normalize "Other" category variants (e.g. "Other (please specify):" -> "Other")
-    for col in ["Country", "EdLevel", "DevType", "Industry", "Age", "ICorPM"]:
+    for col in _categorical_cols:
         if col in df_processed.columns:
             df_processed[col] = normalize_other_categories(df_processed[col])

@@ -125,6 +135,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
     df_processed["Industry"] = df_processed["Industry"].fillna("Unknown")
     df_processed["Age"] = df_processed["Age"].fillna("Unknown")
     df_processed["ICorPM"] = df_processed["ICorPM"].fillna("Unknown")
+    df_processed["OrgSize"] = df_processed["OrgSize"].fillna("Unknown")

     # NOTE: Cardinality reduction is NOT applied here
     # It should be applied during training BEFORE calling this function
@@ -140,6 +151,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
         "Industry",
         "Age",
         "ICorPM",
+        "OrgSize",
     ]
     df_features = df_processed[feature_cols]

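The preprocessing hunks above thread `OrgSize` through the same cleanup as the other categorical columns: normalize the Unicode right single quote (`\u2019`) to `'` and fill missing values with `"Unknown"`. A minimal stdlib sketch of that cleanup — `clean_value` is a hypothetical stand-in for the pandas `str.replace`/`fillna` operations the real `prepare_features` uses:

```python
# The full categorical column list after this commit, including OrgSize.
CATEGORICAL_COLS = [
    "Country", "EdLevel", "DevType", "Industry", "Age", "ICorPM", "OrgSize",
]

def clean_value(value):
    # Missing values become the "Unknown" sentinel, as in the fillna calls.
    if value is None:
        return "Unknown"
    # Normalize the Unicode right single quote to a plain apostrophe.
    return value.replace("\u2019", "'")

row = {"Country": "C\u00f4te d\u2019Ivoire", "OrgSize": None}
cleaned = {col: clean_value(row[col]) for col in row}
```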
src/schema.py CHANGED
@@ -18,6 +18,7 @@ class SalaryInput(BaseModel):
                 "industry": "Software Development",
                 "age": "25-34 years old",
                 "ic_or_pm": "Individual contributor",
+                "org_size": "20 to 99 employees",
             }
         ]
     }
@@ -39,3 +40,6 @@ class SalaryInput(BaseModel):
     industry: str = Field(..., description="Industry the developer works in")
     age: str = Field(..., description="Developer's age range")
     ic_or_pm: str = Field(..., description="Individual contributor or people manager")
+    org_size: str = Field(
+        ..., description="Size of the organisation the developer works for"
+    )
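Because `org_size` is declared with `Field(...)`, it is a required field: omitting it fails at construction time. A rough stdlib analogue using dataclasses — `SalaryInputSketch` is a hypothetical stand-in for the pydantic model, and raises `TypeError` where pydantic would raise `ValidationError`:

```python
from dataclasses import dataclass

@dataclass
class SalaryInputSketch:
    country: str
    org_size: str  # newly required field, mirroring the schema change above

ok = SalaryInputSketch(country="Poland", org_size="20 to 99 employees")

try:
    SalaryInputSketch(country="Poland")  # org_size omitted
    raised = False
except TypeError:
    raised = True
```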
src/train.py CHANGED
@@ -11,7 +11,15 @@ from sklearn.model_selection import KFold, train_test_split

 from src.preprocessing import prepare_features, reduce_cardinality

-CATEGORICAL_FEATURES = ["Country", "EdLevel", "DevType", "Industry", "Age", "ICorPM"]
+CATEGORICAL_FEATURES = [
+    "Country",
+    "EdLevel",
+    "DevType",
+    "Industry",
+    "Age",
+    "ICorPM",
+    "OrgSize",
+]


 def filter_salaries(df: pd.DataFrame, config: dict) -> pd.DataFrame:
@@ -160,6 +168,7 @@ def main():
         "Industry",
         "Age",
         "ICorPM",
+        "OrgSize",
         "Currency",
         "CompTotal",
         "ConvertedCompYearly",
src/tune.py CHANGED
@@ -36,15 +36,11 @@ def sample_params(trial: optuna.Trial, search_space: dict) -> dict:
             params[name] = trial.suggest_int(name, spec["low"], spec["high"])
         elif param_type == "float":
             log = spec.get("log", False)
-            params[name] = trial.suggest_float(
-                name, spec["low"], spec["high"], log=log
-            )
+            params[name] = trial.suggest_float(name, spec["low"], spec["high"], log=log)
     return params


-def build_objective(
-    X: pd.DataFrame, y: pd.Series, optuna_config: dict
-) -> callable:
+def build_objective(X: pd.DataFrame, y: pd.Series, optuna_config: dict) -> callable:
    """Build an Optuna objective function for XGBoost CV evaluation.

    Args:
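The `sample_params` function reformatted above dispatches on the search-space spec's `type` to the matching `trial.suggest_*` call. A self-contained sketch with a stand-in trial object instead of optuna — `FakeTrial` and the spec values are illustrative assumptions, not the project's real search space:

```python
import math
import random

class FakeTrial:
    """Stand-in for optuna.Trial, mimicking the two suggest methods used."""

    def suggest_int(self, name, low, high):
        return random.randint(low, high)

    def suggest_float(self, name, low, high, log=False):
        if log:
            # Sample uniformly in log space, as optuna does with log=True.
            return math.exp(random.uniform(math.log(low), math.log(high)))
        return random.uniform(low, high)

def sample_params(trial, search_space):
    # Same dispatch shape as src/tune.py: spec type selects the suggest call.
    params = {}
    for name, spec in search_space.items():
        if spec["type"] == "int":
            params[name] = trial.suggest_int(name, spec["low"], spec["high"])
        elif spec["type"] == "float":
            params[name] = trial.suggest_float(
                name, spec["low"], spec["high"], log=spec.get("log", False)
            )
    return params

space = {
    "max_depth": {"type": "int", "low": 3, "high": 10},
    "learning_rate": {"type": "float", "low": 1e-3, "high": 0.3, "log": True},
}
params = sample_params(FakeTrial(), space)
```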
tests/conftest.py CHANGED
@@ -18,6 +18,7 @@ def sample_salary_input():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

tests/test_feature_impact.py CHANGED
@@ -14,6 +14,7 @@ def test_years_experience_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     years_tests = [0, 2, 5, 10, 20]
@@ -37,6 +38,7 @@ def test_country_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     test_countries = [
@@ -71,12 +73,14 @@ def test_education_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     test_education = [
         e
         for e in [
-            "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",
+            "Secondary school (e.g. American high school, "
+            "German Realschule or Gymnasium, etc.)",
             "Some college/university study without earning a degree",
             "Associate degree (A.A., A.S., etc.)",
             "Bachelor's degree (B.A., B.S., B.Eng., etc.)",
@@ -106,6 +110,7 @@ def test_devtype_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     test_devtypes = [
@@ -141,6 +146,7 @@ def test_industry_impact():
         "dev_type": "Developer, full-stack",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     test_industries = [
@@ -176,6 +182,7 @@ def test_age_impact():
         "dev_type": "Developer, full-stack",
         "industry": "Software Development",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     test_ages = [
@@ -210,6 +217,7 @@ def test_work_exp_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "org_size": "20 to 99 employees",
     }

     work_exp_tests = [0, 1, 3, 5, 10, 20]
@@ -219,7 +227,8 @@ def test_work_exp_impact():
         predictions.append(predict_salary(input_data))

     assert len(set(predictions)) >= len(predictions) - 1, (
-        f"Expected at least {len(predictions) - 1} unique predictions, got {len(set(predictions))}"
+        f"Expected at least {len(predictions) - 1} unique predictions, "
+        f"got {len(set(predictions))}"
     )


@@ -233,6 +242,7 @@ def test_icorpm_impact():
         "dev_type": "Developer, full-stack",
         "industry": "Software Development",
         "age": "25-34 years old",
+        "org_size": "20 to 99 employees",
     }

     test_icorpm = [
@@ -251,6 +261,31 @@ def test_icorpm_impact():
     )


+def test_org_size_impact():
+    """Test that changing organization size changes prediction."""
+    base_input = {
+        "country": "United States of America",
+        "years_code": 5.0,
+        "work_exp": 3.0,
+        "education_level": "Bachelor's degree (B.A., B.S., B.Eng., etc.)",
+        "dev_type": "Developer, full-stack",
+        "industry": "Software Development",
+        "age": "25-34 years old",
+        "ic_or_pm": "Individual contributor",
+    }
+
+    test_org_sizes = valid_categories["OrgSize"][:5]
+
+    predictions = []
+    for org_size in test_org_sizes:
+        input_data = SalaryInput(**base_input, org_size=org_size)
+        predictions.append(predict_salary(input_data))
+
+    assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+    )
+
+
 def test_combined_features():
     """Test that combining different features produces expected variations."""
     test_cases = [
@@ -263,6 +298,7 @@ def test_combined_features():
             "Software Development",
             "18-24 years old",
             "Individual contributor",
+            "20 to 99 employees",
         ),
         (
             "Germany",
@@ -273,6 +309,7 @@ def test_combined_features():
             "Manufacturing",
             "25-34 years old",
             "Individual contributor",
+            "100 to 499 employees",
         ),
         (
             "United States of America",
@@ -283,6 +320,7 @@ def test_combined_features():
             "Fintech",
             "35-44 years old",
             "People manager",
+            "1,000 to 4,999 employees",
         ),
         (
             "Poland",
@@ -293,6 +331,7 @@ def test_combined_features():
             "Healthcare",
             "45-54 years old",
             "Individual contributor",
+            "20 to 99 employees",
         ),
         (
             "Brazil",
@@ -303,6 +342,7 @@ def test_combined_features():
             "Government",
             "25-34 years old",
             "Individual contributor",
+            "20 to 99 employees",
         ),
     ]

@@ -316,6 +356,7 @@ def test_combined_features():
         industry,
         age,
         icorpm,
+        org_size,
     ) in test_cases:
         if (
             country not in valid_categories["Country"]
@@ -324,6 +365,7 @@ def test_combined_features():
             or industry not in valid_categories["Industry"]
             or age not in valid_categories["Age"]
             or icorpm not in valid_categories["ICorPM"]
+            or org_size not in valid_categories["OrgSize"]
         ):
             continue

@@ -336,6 +378,7 @@ def test_combined_features():
             industry=industry,
             age=age,
             ic_or_pm=icorpm,
+            org_size=org_size,
         )
         predictions.append(predict_salary(input_data))

tests/test_infer.py CHANGED
@@ -55,6 +55,13 @@ def test_invalid_ic_or_pm(sample_salary_input):
         predict_salary(SalaryInput(**sample_salary_input))


+def test_invalid_org_size(sample_salary_input):
+    """Invalid organization size raises ValueError."""
+    sample_salary_input["org_size"] = "Megacorp 10M+"
+    with pytest.raises(ValueError, match="Invalid organization size"):
+        predict_salary(SalaryInput(**sample_salary_input))
+
+
 def test_get_local_currency_unknown_country():
     """get_local_currency returns None for unknown country."""
     result = get_local_currency("Narnia", 100000)
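`test_invalid_org_size` expects `predict_salary` to reject an out-of-vocabulary `OrgSize` with a `ValueError` whose message matches "Invalid organization size". A sketch of the kind of membership check that would satisfy it — `VALID_ORG_SIZES` and `check_org_size` are hypothetical stand-ins for `valid_categories["OrgSize"]` and the validation inside `predict_salary`:

```python
# Hypothetical subset of the valid category values loaded at inference time.
VALID_ORG_SIZES = {"20 to 99 employees", "100 to 499 employees"}

def check_org_size(org_size):
    # Reject any value not seen during training, with the message the
    # test's pytest.raises(match=...) pattern would match on.
    if org_size not in VALID_ORG_SIZES:
        raise ValueError(f"Invalid organization size: {org_size!r}")

try:
    check_org_size("Megacorp 10M+")
    raised = False
except ValueError:
    raised = True
```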
tests/test_preprocessing.py CHANGED
@@ -86,6 +86,7 @@ class TestPrepareFeatures:
                 "Industry": ["Software Development"],
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
+                "OrgSize": ["20 to 99 employees"],
             }
         )
         result = prepare_features(df)
@@ -104,6 +105,7 @@ class TestPrepareFeatures:
                 "Industry": ["Software Development"],
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
+                "OrgSize": ["20 to 99 employees"],
             }
         )
         result = prepare_features(df)
@@ -122,6 +124,7 @@ class TestPrepareFeatures:
                 "Industry": ["Software Development", "Healthcare"],
                 "Age": ["25-34 years old", "35-44 years old"],
                 "ICorPM": ["Individual contributor", "People manager"],
+                "OrgSize": ["20 to 99 employees", "100 to 499 employees"],
             }
         )
         result = prepare_features(df)
@@ -143,6 +146,7 @@ class TestPrepareFeatures:
                 "Industry": ["Software Development"],
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
+                "OrgSize": ["20 to 99 employees"],
             }
         )
         result = prepare_features(df)
@@ -161,6 +165,7 @@ class TestPrepareFeatures:
                 "Industry": [None],
                 "Age": [None],
                 "ICorPM": [None],
+                "OrgSize": [None],
             }
         )
         result = prepare_features(df)
@@ -181,6 +186,7 @@ class TestPrepareFeatures:
                 "Industry": ["Software Development"],
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
+                "OrgSize": ["20 to 99 employees"],
             }
         )
         original_country = df["Country"].iloc[0]
tests/test_schema.py CHANGED
@@ -17,6 +17,7 @@ def test_valid_input(sample_salary_input):
     assert result.industry == sample_salary_input["industry"]
     assert result.age == sample_salary_input["age"]
     assert result.ic_or_pm == sample_salary_input["ic_or_pm"]
+    assert result.org_size == sample_salary_input["org_size"]


 def test_negative_years_code(sample_salary_input):
@@ -44,6 +45,7 @@ def test_missing_country():
             industry="Software Development",
             age="25-34 years old",
             ic_or_pm="Individual contributor",
+            org_size="20 to 99 employees",
         )


@@ -58,6 +60,22 @@ def test_missing_education_level():
             industry="Software Development",
             age="25-34 years old",
             ic_or_pm="Individual contributor",
+            org_size="20 to 99 employees",
+        )
+
+
+def test_missing_org_size():
+    """Missing org_size raises ValidationError."""
+    with pytest.raises(ValidationError):
+        SalaryInput(
+            country="United States of America",
+            years_code=5.0,
+            work_exp=3.0,
+            education_level="Bachelor's degree (B.A., B.S., B.Eng., etc.)",
+            dev_type="Developer, full-stack",
+            industry="Software Development",
+            age="25-34 years old",
+            ic_or_pm="Individual contributor",
         )

tests/test_train.py CHANGED
@@ -34,6 +34,7 @@ def _make_salary_df(countries=None, salaries=None, n=100) -> pd.DataFrame:
             "Industry": ["Software Development"] * n,
             "Age": ["25-34 years old"] * n,
             "ICorPM": ["Individual contributor"] * n,
+            "OrgSize": ["20 to 99 employees"] * n,
             "Currency": ["USD United States Dollar"] * n,
             "CompTotal": salaries,
             "ConvertedCompYearly": salaries,
@@ -140,6 +141,7 @@ class TestDropOtherRows:
                 "Industry": ["SW", "SW", "SW"],
                 "Age": ["25-34", "25-34", "25-34"],
                 "ICorPM": ["IC", "IC", "IC"],
+                "OrgSize": ["Small", "Small", "Small"],
             }
         )
         config = {
@@ -164,6 +166,7 @@ class TestDropOtherRows:
                 "Industry": ["SW", "SW"],
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
+                "OrgSize": ["Small", "Small"],
             }
         )
         config = {
@@ -187,6 +190,7 @@ class TestDropOtherRows:
                 "Industry": ["SW", "SW"],
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
+                "OrgSize": ["Small", "Small"],
             }
         )
         config = {
@@ -214,15 +218,17 @@ class TestExtractValidCategories:
                 "Industry": ["SW", "Fin", "SW"],
                 "Age": ["25-34", "35-44", "25-34"],
                 "ICorPM": ["IC", "PM", "IC"],
+                "OrgSize": ["Small", "Large", "Small"],
             }
         )
         result = extract_valid_categories(df)
         assert result["Country"] == ["Germany", "USA"]
         assert result["EdLevel"] == ["BS", "MS"]
         assert result["ICorPM"] == ["IC", "PM"]
+        assert result["OrgSize"] == ["Large", "Small"]

     def test_all_categorical_features_present(self):
-        """All 6 categorical features are present as keys."""
+        """All 7 categorical features are present as keys."""
         df = pd.DataFrame(
             {
                 "Country": ["USA"],
@@ -231,6 +237,7 @@ class TestExtractValidCategories:
                 "Industry": ["SW"],
                 "Age": ["25-34"],
                 "ICorPM": ["IC"],
+                "OrgSize": ["Small"],
             }
         )
         result = extract_valid_categories(df)
@@ -241,6 +248,7 @@ class TestExtractValidCategories:
             "Industry",
             "Age",
             "ICorPM",
+            "OrgSize",
         }

     def test_excludes_nan_values(self):
@@ -253,6 +261,7 @@ class TestExtractValidCategories:
                 "Industry": ["SW", "SW"],
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
+                "OrgSize": ["Small", "Small"],
             }
         )
         result = extract_valid_categories(df)
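The updated `TestExtractValidCategories` assertions imply that `extract_valid_categories` returns the sorted unique non-null values per column (so `["Small", "Large", "Small"]` becomes `["Large", "Small"]`). A stdlib sketch of that behaviour over a dict-of-lists "frame" — the real function operates on a pandas DataFrame, so this is an illustration of the contract, not the implementation:

```python
def extract_valid_categories(frame):
    # Sorted unique values per column, skipping missing entries.
    return {
        col: sorted({v for v in values if v is not None})
        for col, values in frame.items()
    }

frame = {
    "ICorPM": ["IC", "PM", "IC"],
    "OrgSize": ["Small", "Large", None],
}
result = extract_valid_categories(frame)
# result["OrgSize"] sorts alphabetically: ["Large", "Small"]
```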
tests/test_tune.py CHANGED
@@ -87,9 +87,7 @@ class TestSaveBestParams:
                 "max_depth": 6,
             },
         }
-        with tempfile.NamedTemporaryFile(
-            mode="w", suffix=".yaml", delete=False
-        ) as f:
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
             yaml.dump(config, f)
             tmp_path = Path(f.name)

@@ -110,9 +108,7 @@ class TestSaveBestParams:
             "model": {"n_estimators": 5000, "max_depth": 6},
             "training": {"verbose": False},
         }
-        with tempfile.NamedTemporaryFile(
-            mode="w", suffix=".yaml", delete=False
-        ) as f:
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
             yaml.dump(config, f)
             tmp_path = Path(f.name)

@@ -134,9 +130,7 @@ class TestSaveBestParams:
                 "n_jobs": -1,
             },
         }
-        with tempfile.NamedTemporaryFile(
-            mode="w", suffix=".yaml", delete=False
-        ) as f:
+        with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
             yaml.dump(config, f)
             tmp_path = Path(f.name)

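The reformatted tests above all use the same `NamedTemporaryFile` pattern: write a config with `delete=False`, keep the path, and read the file back after the `with` block. A stdlib sketch of that round trip — plain text stands in here for the `yaml.dump` call these tests use:

```python
import tempfile
from pathlib import Path

with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
    f.write("n_estimators: 5000\n")
    tmp_path = Path(f.name)

# delete=False keeps the file after the with block closes it, so it can
# still be read by path...
content = tmp_path.read_text()
tmp_path.unlink()  # ...but it must then be removed manually.
```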
uv.lock CHANGED
@@ -127,6 +127,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/e6/ad/3cc14f097111b4de0040c83a525973216457bbeeb63739ef1ed275c1c021/certifi-2026.1.4-py3-none-any.whl", hash = "sha256:9943707519e4add1115f44c2bc244f782c0249876bf51b6599fee1ffbedd685c", size = 152900, upload-time = "2026-01-04T02:42:40.15Z" },
 ]

+[[package]]
+name = "cfgv"
+version = "3.5.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/4e/b5/721b8799b04bf9afe054a3899c6cf4e880fcf8563cc71c15610242490a0c/cfgv-3.5.0.tar.gz", hash = "sha256:d5b1034354820651caa73ede66a6294d6e95c1b00acc5e9b098e917404669132", size = 7334, upload-time = "2025-11-19T20:55:51.612Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/db/3c/33bac158f8ab7f89b2e59426d5fe2e4f63f7ed25df84c036890172b412b5/cfgv-3.5.0-py2.py3-none-any.whl", hash = "sha256:a8dc6b26ad22ff227d2634a65cb388215ce6cc96bbcc5cfde7641ae87e8dacc0", size = 7445, upload-time = "2025-11-19T20:55:50.744Z" },
+]
+
 [[package]]
 name = "charset-normalizer"
 version = "3.4.4"
@@ -328,7 +337,7 @@ wheels = [

 [[package]]
 name = "developer-salary-prediction"
-version = "0.1.0"
+version = "1.0.0"
 source = { virtual = "." }
 dependencies = [
     { name = "bandit" },
@@ -336,6 +345,7 @@ dependencies = [
     { name = "optuna" },
     { name = "pandas" },
     { name = "pip-audit" },
+    { name = "pre-commit" },
     { name = "pydantic" },
     { name = "pyyaml" },
     { name = "radon" },
@@ -363,6 +373,7 @@ requires-dist = [
     { name = "pandas", specifier = ">=2.0.0" },
     { name = "pip-audit", specifier = ">=2.10.0" },
     { name = "pip-audit", marker = "extra == 'dev'", specifier = ">=2.7.0" },
+    { name = "pre-commit", specifier = ">=4.5.1" },
     { name = "pydantic", specifier = ">=2.0.0" },
     { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
     { name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=6.0.0" },
@@ -376,13 +387,22 @@ requires-dist = [
 ]
 provides-extras = ["dev"]

+[[package]]
+name = "distlib"
+version = "0.4.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/96/8e/709914eb2b5749865801041647dc7f4e6d00b549cfe88b65ca192995f07c/distlib-0.4.0.tar.gz", hash = "sha256:feec40075be03a04501a973d81f633735b4b69f98b05450592310c0f401a4e0d", size = 614605, upload-time = "2025-07-17T16:52:00.465Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16", size = 469047, upload-time = "2025-07-17T16:51:58.613Z" },
+]
+
 [[package]]
 name = "filelock"
-version = "3.24.0"
+version = "3.24.3"
 source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/00/cd/fa3ab025a8f9772e8a9146d8fd8eef6d62649274d231ca84249f54a0de4a/filelock-3.24.0.tar.gz", hash = "sha256:aeeab479339ddf463a1cdd1f15a6e6894db976071e5883efc94d22ed5139044b", size = 37166, upload-time = "2026-02-14T16:05:28.723Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/73/92/a8e2479937ff39185d20dd6a851c1a63e55849e447a55e798cc2e1f49c65/filelock-3.24.3.tar.gz", hash = "sha256:011a5644dc937c22699943ebbfc46e969cdde3e171470a6e40b9533e5a72affa", size = 37935, upload-time = "2026-02-19T00:48:20.543Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/d9/dd/d7e7f4f49180e8591c9e1281d15ecf8e7f25eb2c829771d9682f1f9fe0c8/filelock-3.24.0-py3-none-any.whl", hash = "sha256:eebebb403d78363ef7be8e236b63cc6760b0004c7464dceaba3fd0afbd637ced", size = 23977, upload-time = "2026-02-14T16:05:27.578Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/0f/5d0c71a1aefeb08efff26272149e07ab922b64f46c63363756224bd6872e/filelock-3.24.3-py3-none-any.whl", hash = "sha256:426e9a4660391f7f8a810d71b0555bce9008b0a1cc342ab1f6947d37639e002d", size = 24331, upload-time = "2026-02-19T00:48:18.465Z" },
 ]

 [[package]]
@@ -448,6 +468,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/e1/2b/98c7f93e6db9977aaee07eb1e51ca63bd5f779b900d362791d3252e60558/greenlet-3.3.1-cp314-cp314t-win_amd64.whl", hash = "sha256:301860987846c24cb8964bdec0e31a96ad4a2a801b41b4ef40963c1b44f33451", size = 233181, upload-time = "2026-01-23T15:33:00.29Z" },
 ]

+[[package]]
+name = "identify"
+version = "2.6.16"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/5b/8d/e8b97e6bd3fb6fb271346f7981362f1e04d6a7463abd0de79e1fda17c067/identify-2.6.16.tar.gz", hash = "sha256:846857203b5511bbe94d5a352a48ef2359532bc8f6727b5544077a0dcfb24980", size = 99360, upload-time = "2026-01-12T18:58:58.201Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b8/58/40fbbcefeda82364720eba5cf2270f98496bdfa19ea75b4cccae79c698e6/identify-2.6.16-py2.py3-none-any.whl", hash = "sha256:391ee4d77741d994189522896270b787aed8670389bfd60f326d677d64a6dfb0", size = 99202, upload-time = "2026-01-12T18:58:56.627Z" },
+]
+
 [[package]]
 name = "idna"
 version = "3.11"
@@ -687,6 +716,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/03/cc/7cb74758e6df95e0c4e1253f203b6dd7f348bf2f29cf89e9210a2416d535/narwhals-2.16.0-py3-none-any.whl", hash = "sha256:846f1fd7093ac69d63526e50732033e86c30ea0026a44d9b23991010c7d1485d", size = 443951, upload-time = "2026-02-02T10:30:58.635Z" },
 ]

+[[package]]
+name = "nodeenv"
+version = "1.10.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/24/bf/d1bda4f6168e0b2e9e5958945e01910052158313224ada5ce1fb2e1113b8/nodeenv-1.10.0.tar.gz", hash = "sha256:996c191ad80897d076bdfba80a41994c2b47c68e224c542b48feba42ba00f8bb", size = 55611, upload-time = "2025-12-20T14:08:54.006Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/88/b2/d0896bdcdc8d28a7fc5717c305f1a861c26e18c05047949fb371034d98bd/nodeenv-1.10.0-py2.py3-none-any.whl", hash = "sha256:5bb13e3eed2923615535339b3c620e76779af4cb4c6a90deccc9e36b274d3827", size = 23438, upload-time = "2025-12-20T14:08:52.782Z" },
+]
+
 [[package]]
 name = "numpy"
 version = "2.4.2"
@@ -982,6 +1020,22 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
 ]

+[[package]]
+name = "pre-commit"
+version = "4.5.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "cfgv" },

 [[package]]
 name = "protobuf"
 version = "6.33.5"
@@ -1790,6 +1844,20 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/39/08/aaaad47bc4e9dc8c725e68f9d04865dbcb2052843ff09c97b08904852d84/urllib3-2.6.3-py3-none-any.whl", hash = "sha256:bf272323e553dfb2e87d9bfd225ca7b0f467b919d7bbd355436d3fd37cb0acd4", size = 131584, upload-time = "2026-01-07T16:24:42.685Z" },
 ]

 [[package]]
 name = "watchdog"
 version = "6.0.0"
1029
+ { name = "identify" },
1030
+ { name = "nodeenv" },
1031
+ { name = "pyyaml" },
1032
+ { name = "virtualenv" },
1033
+ ]
1034
+ sdist = { url = "https://files.pythonhosted.org/packages/40/f1/6d86a29246dfd2e9b6237f0b5823717f60cad94d47ddc26afa916d21f525/pre_commit-4.5.1.tar.gz", hash = "sha256:eb545fcff725875197837263e977ea257a402056661f09dae08e4b149b030a61", size = 198232, upload-time = "2025-12-16T21:14:33.552Z" }
1035
+ wheels = [
1036
+ { url = "https://files.pythonhosted.org/packages/5d/19/fd3ef348460c80af7bb4669ea7926651d1f95c23ff2df18b9d24bab4f3fa/pre_commit-4.5.1-py2.py3-none-any.whl", hash = "sha256:3b3afd891e97337708c1674210f8eba659b52a38ea5f822ff142d10786221f77", size = 226437, upload-time = "2025-12-16T21:14:32.409Z" },
1037
+ ]
1038
+
1039
  [[package]]
1040
  name = "protobuf"
1041
  version = "6.33.5"
 
1844
  { url = "https://files.pythonhosted.org/packages/39/08/aaaad47bc4e9dc8c725e68f9d04865dbcb2052843ff09c97b08904852d84/urllib3-2.6.3-py3-none-any.whl", hash = "sha256:bf272323e553dfb2e87d9bfd225ca7b0f467b919d7bbd355436d3fd37cb0acd4", size = 131584, upload-time = "2026-01-07T16:24:42.685Z" },
1845
  ]
1846
 
1847
+ [[package]]
1848
+ name = "virtualenv"
1849
+ version = "20.38.0"
1850
+ source = { registry = "https://pypi.org/simple" }
1851
+ dependencies = [
1852
+ { name = "distlib" },
1853
+ { name = "filelock" },
1854
+ { name = "platformdirs" },
1855
+ ]
1856
+ sdist = { url = "https://files.pythonhosted.org/packages/d2/03/a94d404ca09a89a7301a7008467aed525d4cdeb9186d262154dd23208709/virtualenv-20.38.0.tar.gz", hash = "sha256:94f39b1abaea5185bf7ea5a46702b56f1d0c9aa2f41a6c2b8b0af4ddc74c10a7", size = 5864558, upload-time = "2026-02-19T07:48:02.385Z" }
1857
+ wheels = [
1858
+ { url = "https://files.pythonhosted.org/packages/42/d7/394801755d4c8684b655d35c665aea7836ec68320304f62ab3c94395b442/virtualenv-20.38.0-py3-none-any.whl", hash = "sha256:d6e78e5889de3a4742df2d3d44e779366325a90cf356f15621fddace82431794", size = 5837778, upload-time = "2026-02-19T07:47:59.778Z" },
1859
+ ]
1860
+
1861
  [[package]]
1862
  name = "watchdog"
1863
  version = "6.0.0"