Spaces: dima806/developer_salary_prediction

dima806 committed on Feb 24

Commit eeeaee6 (verified) · 1 parent: 2cc5253

Upload 39 files

Files changed (19)

.gitignore +1 -1
Claude.md +22 -15
Makefile +8 -1
README.md +7 -2
app.py +14 -0
config/model_parameters.yaml +9 -8
config/valid_categories.yaml +6 -0
models/model.pkl +2 -2
src/infer.py +8 -0
src/preprocess.py +1 -0
src/preprocessing.py +4 -1
src/schema.py +2 -0
src/train.py +8 -0
src/tune.py +1 -0
tests/conftest.py +1 -0
tests/test_feature_impact.py +19 -9
tests/test_preprocessing.py +8 -1
tests/test_schema.py +3 -0
tests/test_train.py +10 -1

.gitignore CHANGED

@@ -186,7 +186,7 @@ cython_debug/
 #  that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
 #  and can be added to the global gitignore or merged into this file. However, if you prefer,
 #  you could uncomment the following to ignore the entire vscode folder
-# .vscode/
 # Ruff stuff:
 .ruff_cache/

 #  that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
 #  and can be added to the global gitignore or merged into this file. However, if you prefer,
 #  you could uncomment the following to ignore the entire vscode folder
+.vscode/
 # Ruff stuff:
 .ruff_cache/

Claude.md CHANGED

@@ -95,6 +95,8 @@ make check   # lint + test + complexity + maintainability + audit + security
 | `make security` | bandit static security analysis |
 | `make pre-process` | Validate data + generate config artifacts (no model) |
 | `make tune` | Optuna hyperparameter search |
 ### Training the model
@@ -141,6 +143,7 @@ input_data = SalaryInput(
     age="25-34 years old",
     ic_or_pm="Individual contributor",
     org_size="20 to 99 employees",
 )
 salary = predict_salary(input_data)
 ```
@@ -163,13 +166,14 @@ The `survey_results_public.csv` must include these columns:
 | `Age` | Age range |
 | `ICorPM` | Individual contributor or people manager |
 | `OrgSize` | Organisation size (number of employees) |
 | `ConvertedCompYearly` | Annual salary in USD (target variable) |
 ## Input Validation (Two Layers)
 ### Layer 1 — Pydantic schema (`src/schema.py`)
-All 9 fields are required. `years_code` and `work_exp` must be `>= 0`. Validated at
 object construction time — raises `ValidationError` on failure.
 ### Layer 2 — Runtime guardrails (`src/infer.py`)
@@ -181,7 +185,7 @@ time. Raises `ValueError` with a clear message on invalid input.
 ### [src/schema.py](src/schema.py)
-Pydantic v2 `SalaryInput` model — defines all 9 required input fields, types, and
 constraints. The JSON schema example in the docstring is the canonical usage example.
 ### [src/preprocessing.py](src/preprocessing.py)
@@ -232,17 +236,20 @@ When adding a new input feature, update **all** of the following in order:
 2. `src/schema.py` — add field to `SalaryInput`
 3. `src/preprocessing.py` — add to `_categorical_cols` (or numeric handling)
 4. `src/train.py` — add to `CATEGORICAL_FEATURES` and `usecols`
-5. `src/infer.py` — add validation block and DataFrame column
-6. `app.py` — add selectbox, default, sidebar entry, `SalaryInput` construction
-7. `tests/conftest.py` — add to `sample_salary_input` fixture
-8. `tests/test_schema.py` — assert field, add missing-field test
-9. `tests/test_infer.py` — add invalid-value test
-10. `tests/test_feature_impact.py` — add to all `base_input` dicts, add impact test
-11. `tests/test_preprocessing.py` — add column to all `pd.DataFrame(...)` fixtures
-12. `tests/test_train.py` — add column to `_make_salary_df` and all test DataFrames
-13. `README.md` — required columns, valid categories list, code example
-14. `example_inference.py` — add to all `SalaryInput` calls
-15. Retrain: `uv run python -m src.train`
 ## Versioning
@@ -252,7 +259,7 @@ Follows [Semantic Versioning](https://semver.org/):
 - **MINOR** — new optional field, new supported country, new Makefile target
 - **PATCH** — bug fix, model retrain with same schema, config tuning
-Current version: `2.0.0` (added `OrgSize` required field).
 Update `pyproject.toml` before tagging:
@@ -293,7 +300,7 @@ uv run pre-commit run --all-files
 | ------- | --- |
 | `FileNotFoundError: model.pkl` | Run `uv run python -m src.train` |
 | `FileNotFoundError: valid_categories.yaml` | Same — generated by training |
-| `ValidationError` on `SalaryInput` | Check all 9 fields are present and non-negative numerics |
 | `ValueError: Invalid ...` at inference | Value not in `config/valid_categories.yaml`; retrain or use a listed value |
 | `E501` ruff errors | Lines > 79 chars — split strings, use variables, or wrap lists |
 | Tests fail after adding a feature | Check the "Updating Features" checklist above |

 | `make security` | bandit static security analysis |
 | `make pre-process` | Validate data + generate config artifacts (no model) |
 | `make tune` | Optuna hyperparameter search |
+| `make ci` | Mirror of GitHub Actions CI (lint + test) |
+| `make pre-commit` | Run all pre-commit hooks against every file |
 ### Training the model
     age="25-34 years old",
     ic_or_pm="Individual contributor",
     org_size="20 to 99 employees",
+    employment="Employed",
 )
 salary = predict_salary(input_data)
 ```
 | `Age` | Age range |
 | `ICorPM` | Individual contributor or people manager |
 | `OrgSize` | Organisation size (number of employees) |
+| `Employment` | Current employment status |
 | `ConvertedCompYearly` | Annual salary in USD (target variable) |
 ## Input Validation (Two Layers)
 ### Layer 1 — Pydantic schema (`src/schema.py`)
+All 10 fields are required. `years_code` and `work_exp` must be `>= 0`. Validated at
 object construction time — raises `ValidationError` on failure.
 ### Layer 2 — Runtime guardrails (`src/infer.py`)
 ### [src/schema.py](src/schema.py)
+Pydantic v2 `SalaryInput` model — defines all 10 required input fields, types, and
 constraints. The JSON schema example in the docstring is the canonical usage example.
 ### [src/preprocessing.py](src/preprocessing.py)
 2. `src/schema.py` — add field to `SalaryInput`
 3. `src/preprocessing.py` — add to `_categorical_cols` (or numeric handling)
 4. `src/train.py` — add to `CATEGORICAL_FEATURES` and `usecols`
+5. `src/tune.py` — add to `usecols`
+6. `src/preprocess.py` — add to `REQUIRED_COLUMNS`
+7. `src/infer.py` — add validation block and DataFrame column
+8. `app.py` — add selectbox, default, sidebar entry, `SalaryInput` construction
+9. `tests/conftest.py` — add to `sample_salary_input` fixture
+10. `tests/test_schema.py` — assert field, add missing-field test
+11. `tests/test_infer.py` — add invalid-value test
+12. `tests/test_feature_impact.py` — add to all `base_input` dicts, add impact test
+13. `tests/test_preprocessing.py` — add column to all `pd.DataFrame(...)` fixtures
+14. `tests/test_train.py` — add column to `_make_salary_df` and all test DataFrames
+15. `README.md` — required columns, valid categories list, code example
+16. `Claude.md` — data requirements table, field counts, code example, version
+17. `example_inference.py` — add to all `SalaryInput` calls
+18. Retrain: `uv run python -m src.train`
 ## Versioning
 - **MINOR** — new optional field, new supported country, new Makefile target
 - **PATCH** — bug fix, model retrain with same schema, config tuning
+Current version: `3.0.0` (added `Employment` required field).
 Update `pyproject.toml` before tagging:
 | ------- | --- |
 | `FileNotFoundError: model.pkl` | Run `uv run python -m src.train` |
 | `FileNotFoundError: valid_categories.yaml` | Same — generated by training |
+| `ValidationError` on `SalaryInput` | Check all 10 fields are present and non-negative numerics |
 | `ValueError: Invalid ...` at inference | Value not in `config/valid_categories.yaml`; retrain or use a listed value |
 | `E501` ruff errors | Lines > 79 chars — split strings, use variables, or wrap lists |
 | Tests fail after adding a feature | Check the "Updating Features" checklist above |

Makefile CHANGED

@@ -1,5 +1,5 @@
 .PHONY: lint format test coverage complexity maintainability audit security \
-        tune pre-process train app smoke-test guardrails check all
 lint:
 	uv run ruff check .
@@ -52,6 +52,13 @@ smoke-test:
 guardrails:
 	uv run python guardrail_evaluation.py
 # CI gate: fast checks that require no model or training data
 check: lint test complexity maintainability audit security

 .PHONY: lint format test coverage complexity maintainability audit security \
+        tune pre-process train app smoke-test guardrails check ci pre-commit all
 lint:
 	uv run ruff check .
 guardrails:
 	uv run python guardrail_evaluation.py
+# Mirrors GitHub Actions CI (.github/workflows/ci.yml): lint + test
+ci: lint test
+# Runs all pre-commit hooks against every file (.pre-commit-config.yaml)
+pre-commit:
+	uv run pre-commit run --all-files
 # CI gate: fast checks that require no model or training data
 check: lint test complexity maintainability audit security

README.md CHANGED

@@ -45,7 +45,7 @@ Download the Stack Overflow Developer Survey CSV file:
    data/survey_results_public.csv
    ```
-**Required columns:** `Country`, `YearsCode`, `WorkExp`, `EdLevel`, `DevType`, `Industry`, `Age`, `ICorPM`, `OrgSize`, `ConvertedCompYearly`
 ### 3. Train the Model
@@ -109,6 +109,8 @@ This runs all quality gates in sequence:
 | Target | Tool | What it checks |
 | ------ | ---- | -------------- |
 | `make lint` | ruff | Style and linting errors |
 | `make format` | ruff | Auto-formats code |
 | `make test` | pytest | Unit and integration tests |
@@ -142,6 +144,7 @@ Launch the Streamlit app and enter:
 - **Age**: Developer's age range
 - **IC or PM**: Individual contributor or people manager
 - **Organization Size**: Approximate number of employees at the developer's company
 Click "Predict Salary" to see the estimated annual salary in USD plus a local
 currency equivalent where available.
@@ -162,6 +165,7 @@ input_data = SalaryInput(
     age="25-34 years old",
     ic_or_pm="Individual contributor",
     org_size="20 to 99 employees",
 )
 salary = predict_salary(input_data)
@@ -182,7 +186,7 @@ Validation is enforced at two layers:
 Checked at object construction time:
-- All 9 fields are required
 - `years_code` must be `>= 0`
 - `work_exp` must be `>= 0`
@@ -200,6 +204,7 @@ in `config/model_parameters.yaml`):
 - **Valid Age Ranges** (~7) — `Other` dropped
 - **Valid IC/PM Values** (~3) — `Other` dropped
 - **Valid Organization Sizes** (~8) — `Other` dropped
 Passing an invalid value raises a `ValueError` with a message pointing to
 `config/valid_categories.yaml`.

    data/survey_results_public.csv
    ```
+**Required columns:** `Country`, `YearsCode`, `WorkExp`, `EdLevel`, `DevType`, `Industry`, `Age`, `ICorPM`, `OrgSize`, `Employment`, `ConvertedCompYearly`
 ### 3. Train the Model
 | Target | Tool | What it checks |
 | ------ | ---- | -------------- |
+| `make ci` | ruff + pytest | Mirrors GitHub Actions CI (lint + test) |
+| `make pre-commit` | pre-commit | All hooks from `.pre-commit-config.yaml` against every file |
 | `make lint` | ruff | Style and linting errors |
 | `make format` | ruff | Auto-formats code |
 | `make test` | pytest | Unit and integration tests |
 - **Age**: Developer's age range
 - **IC or PM**: Individual contributor or people manager
 - **Organization Size**: Approximate number of employees at the developer's company
+- **Employment Status**: Current employment status
 Click "Predict Salary" to see the estimated annual salary in USD plus a local
 currency equivalent where available.
     age="25-34 years old",
     ic_or_pm="Individual contributor",
     org_size="20 to 99 employees",
+    employment="Employed",
 )
 salary = predict_salary(input_data)
 Checked at object construction time:
+- All 10 fields are required
 - `years_code` must be `>= 0`
 - `work_exp` must be `>= 0`
 - **Valid Age Ranges** (~7) — `Other` dropped
 - **Valid IC/PM Values** (~3) — `Other` dropped
 - **Valid Organization Sizes** (~8) — `Other` dropped
+- **Valid Employment Statuses** (~5)
 Passing an invalid value raises a `ValueError` with a message pointing to
 `config/valid_categories.yaml`.

app.py CHANGED

@@ -34,6 +34,7 @@ with st.sidebar:
         - Age
         - Individual contributor or people manager
         - Organization size
         """
     )
     st.info("💡 Tip: Results are estimates based on survey averages.")
@@ -47,6 +48,7 @@ with st.sidebar:
     st.write(f"**Age Ranges:** {len(valid_categories['Age'])} available")
     st.write(f"**IC/PM Roles:** {len(valid_categories['ICorPM'])} available")
     st.write(f"**Org Sizes:** {len(valid_categories['OrgSize'])} available")
     st.caption("Only values from the training data are shown in the dropdowns.")
 # Main input form
@@ -62,6 +64,7 @@ valid_industries = valid_categories["Industry"]
 valid_ages = valid_categories["Age"]
 valid_icorpm = valid_categories["ICorPM"]
 valid_org_sizes = valid_categories["OrgSize"]
 # Set default values (if available)
 default_country = (
@@ -95,6 +98,9 @@ default_org_size = (
     if "20 to 99 employees" in valid_org_sizes
     else valid_org_sizes[0]
 )
 with col1:
     country = st.selectbox(
@@ -165,6 +171,13 @@ org_size = st.selectbox(
     help="Approximate number of employees at the developer's company",
 )
 # Prediction button
 if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
     try:
@@ -179,6 +192,7 @@ if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
             age=age,
             ic_or_pm=ic_or_pm,
             org_size=org_size,
         )
         # Make prediction

         - Age
         - Individual contributor or people manager
         - Organization size
+        - Employment status
         """
     )
     st.info("💡 Tip: Results are estimates based on survey averages.")
     st.write(f"**Age Ranges:** {len(valid_categories['Age'])} available")
     st.write(f"**IC/PM Roles:** {len(valid_categories['ICorPM'])} available")
     st.write(f"**Org Sizes:** {len(valid_categories['OrgSize'])} available")
+    st.write(f"**Employment:** {len(valid_categories['Employment'])} available")
     st.caption("Only values from the training data are shown in the dropdowns.")
 # Main input form
 valid_ages = valid_categories["Age"]
 valid_icorpm = valid_categories["ICorPM"]
 valid_org_sizes = valid_categories["OrgSize"]
+valid_employment = valid_categories["Employment"]
 # Set default values (if available)
 default_country = (
     if "20 to 99 employees" in valid_org_sizes
     else valid_org_sizes[0]
 )
+default_employment = (
+    "Employed" if "Employed" in valid_employment else valid_employment[0]
+)
 with col1:
     country = st.selectbox(
     help="Approximate number of employees at the developer's company",
 )
+employment = st.selectbox(
+    "Employment Status",
+    options=valid_employment,
+    index=valid_employment.index(default_employment),
+    help="Current employment status",
+)
 # Prediction button
 if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
     try:
             age=age,
             ic_or_pm=ic_or_pm,
             org_size=org_size,
+            employment=employment,
         )
         # Make prediction
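
The `default_employment` lines added in this file pick "Employed" when it is available and otherwise fall back to the first valid option. A minimal sketch of that pattern as a reusable helper is shown here. `pick_default` is a hypothetical name that does not appear in the repository, and the sketch assumes the options list is non-empty.

```python
from typing import Sequence


def pick_default(options: Sequence[str], preferred: str) -> str:
    """Return `preferred` if it is a valid option, else the first option.

    Hypothetical helper (not part of the commit); assumes `options` is
    non-empty, as the app's valid-category lists are expected to be.
    """
    if not options:
        raise ValueError("options must contain at least one value")
    return preferred if preferred in options else options[0]


# Values copied from the Employment list in config/valid_categories.yaml.
valid_employment = [
    "Employed",
    "Independent contractor, freelancer, or self-employed",
    "Not employed",
    "Retired",
    "Student",
]

default_employment = pick_default(valid_employment, "Employed")
# Index that a selectbox-style widget would use for its initial selection.
default_index = valid_employment.index(default_employment)
```

The fallback keeps the later `.index(...)` lookup from raising when a retrained model no longer lists the preferred value.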

config/model_parameters.yaml CHANGED

@@ -17,21 +17,22 @@ features:
     - Age
     - ICorPM
     - OrgSize
   encoding:
     drop_first: true
 model:
   n_estimators: 5000
-  learning_rate: 0.020926294479210576
-  max_depth: 5
-  min_child_weight: 18
   random_state: 42
   n_jobs: -1
   early_stopping_rounds: 50
-  subsample: 0.9191289771331972
-  colsample_bytree: 0.5333460923651799
-  reg_alpha: 0.00021933676399241674
-  reg_lambda: 1.6854320949984984
-  gamma: 3.8247794752407254
 training:
   verbose: false
   save_model: true

     - Age
     - ICorPM
     - OrgSize
+    - Employment
   encoding:
     drop_first: true
 model:
   n_estimators: 5000
+  learning_rate: 0.056803456466335424
+  max_depth: 4
+  min_child_weight: 16
   random_state: 42
   n_jobs: -1
   early_stopping_rounds: 50
+  subsample: 0.9378495066287903
+  colsample_bytree: 0.589604213410477
+  reg_alpha: 1.2493619591455039
+  reg_lambda: 0.006641605590505938
+  gamma: 1.269496538435438
 training:
   verbose: false
   save_model: true

config/valid_categories.yaml CHANGED

@@ -107,3 +107,9 @@ OrgSize:
 - I don't know
 - Just me - I am a freelancer, sole proprietor, etc.
 - Less than 20 employees

 - I don't know
 - Just me - I am a freelancer, sole proprietor, etc.
 - Less than 20 employees
+Employment:
+- Employed
+- Independent contractor, freelancer, or self-employed
+- Not employed
+- Retired
+- Student

models/model.pkl CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ea5bae7edfb8d4b29391e413aedfc94b5335b9bb86ede04e03a646a561e255af
-size 3338897

 version https://git-lfs.github.com/spec/v1
+oid sha256:7be828632107b924dc18532601f3fc6bc1da3d0c6b8e3976a85102dc9787b7a3
+size 1244154

src/infer.py CHANGED

@@ -120,6 +120,13 @@ def predict_salary(data: SalaryInput) -> float:
             f"Check config/valid_categories.yaml for all valid values."
         )
     # Create a DataFrame with the input data
     input_df = pd.DataFrame(
         {
@@ -132,6 +139,7 @@ def predict_salary(data: SalaryInput) -> float:
             "Age": [data.age],
             "ICorPM": [data.ic_or_pm],
             "OrgSize": [data.org_size],
         }
     )

             f"Check config/valid_categories.yaml for all valid values."
         )
+    if data.employment not in valid_categories["Employment"]:
+        raise ValueError(
+            f"Invalid employment status: '{data.employment}'. "
+            f"Must be one of {valid_categories['Employment']}. "
+            f"Check config/valid_categories.yaml for all valid values."
+        )
     # Create a DataFrame with the input data
     input_df = pd.DataFrame(
         {
             "Age": [data.age],
             "ICorPM": [data.ic_or_pm],
             "OrgSize": [data.org_size],
+            "Employment": [data.employment],
         }
     )
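
The new `Employment` guardrail follows the same shape as the existing per-field checks: look the value up in the loaded category lists and raise `ValueError` with a pointer to the YAML file. A rough sketch of how such a check could be factored into one function is given here. `check_category` is a hypothetical helper, not something this commit adds, and the category dict is a shortened stand-in for the real file.

```python
from typing import Mapping, Sequence


def check_category(
    field_label: str,
    value: str,
    column: str,
    valid_categories: Mapping[str, Sequence[str]],
) -> None:
    """Raise ValueError if `value` is not listed for `column`.

    Hypothetical helper; the commit itself repeats an inline `if` block
    for each field instead.
    """
    allowed = valid_categories[column]
    if value not in allowed:
        raise ValueError(
            f"Invalid {field_label}: '{value}'. "
            f"Must be one of {list(allowed)}. "
            f"Check config/valid_categories.yaml for all valid values."
        )


# Shortened, illustrative stand-in for config/valid_categories.yaml.
valid_categories = {
    "Employment": ["Employed", "Not employed", "Retired", "Student"],
}

# A listed value passes silently.
check_category("employment status", "Employed", "Employment", valid_categories)

# An unlisted value is rejected before any DataFrame is built.
try:
    check_category("employment status", "Intern", "Employment", valid_categories)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```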

src/preprocess.py CHANGED

@@ -35,6 +35,7 @@ REQUIRED_COLUMNS = [
     "Age",
     "ICorPM",
     "OrgSize",
     "Currency",
     "CompTotal",
     "ConvertedCompYearly",

     "Age",
     "ICorPM",
     "OrgSize",
+    "Employment",
     "Currency",
     "CompTotal",
     "ConvertedCompYearly",

src/preprocessing.py CHANGED

@@ -79,7 +79,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
     Args:
         df: DataFrame with columns: Country, YearsCode, WorkExp, EdLevel,
-            DevType, Industry, Age, ICorPM, OrgSize.
             NOTE: During training, cardinality reduction should be applied to df
             BEFORE calling this function. During inference, valid_categories.yaml
             ensures only valid (already-reduced) categories are used.
@@ -107,6 +107,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
         "Age",
         "ICorPM",
         "OrgSize",
     ]
     for col in _categorical_cols:
         if col in df_processed.columns:
@@ -136,6 +137,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
     df_processed["Age"] = df_processed["Age"].fillna("Unknown")
     df_processed["ICorPM"] = df_processed["ICorPM"].fillna("Unknown")
     df_processed["OrgSize"] = df_processed["OrgSize"].fillna("Unknown")
     # NOTE: Cardinality reduction is NOT applied here
     # It should be applied during training BEFORE calling this function
@@ -152,6 +154,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
         "Age",
         "ICorPM",
         "OrgSize",
     ]
     df_features = df_processed[feature_cols]

     Args:
         df: DataFrame with columns: Country, YearsCode, WorkExp, EdLevel,
+            DevType, Industry, Age, ICorPM, OrgSize, Employment.
             NOTE: During training, cardinality reduction should be applied to df
             BEFORE calling this function. During inference, valid_categories.yaml
             ensures only valid (already-reduced) categories are used.
         "Age",
         "ICorPM",
         "OrgSize",
+        "Employment",
     ]
     for col in _categorical_cols:
         if col in df_processed.columns:
     df_processed["Age"] = df_processed["Age"].fillna("Unknown")
     df_processed["ICorPM"] = df_processed["ICorPM"].fillna("Unknown")
     df_processed["OrgSize"] = df_processed["OrgSize"].fillna("Unknown")
+    df_processed["Employment"] = df_processed["Employment"].fillna("Unknown")
     # NOTE: Cardinality reduction is NOT applied here
     # It should be applied during training BEFORE calling this function
         "Age",
         "ICorPM",
         "OrgSize",
+        "Employment",
     ]
     df_features = df_processed[feature_cols]

src/schema.py CHANGED

@@ -19,6 +19,7 @@ class SalaryInput(BaseModel):
                     "age": "25-34 years old",
                     "ic_or_pm": "Individual contributor",
                     "org_size": "20 to 99 employees",
                 }
             ]
         }
@@ -43,3 +44,4 @@ class SalaryInput(BaseModel):
     org_size: str = Field(
         ..., description="Size of the organisation the developer works for"
     )

                     "age": "25-34 years old",
                     "ic_or_pm": "Individual contributor",
                     "org_size": "20 to 99 employees",
+                    "employment": "Employed",
                 }
             ]
         }
     org_size: str = Field(
         ..., description="Size of the organisation the developer works for"
     )
+    employment: str = Field(..., description="Current employment status")

src/train.py CHANGED

@@ -19,6 +19,7 @@ CATEGORICAL_FEATURES = [
     "Age",
     "ICorPM",
     "OrgSize",
 ]
@@ -169,6 +170,7 @@ def main():
             "Age",
             "ICorPM",
             "OrgSize",
             "Currency",
             "CompTotal",
             "ConvertedCompYearly",
@@ -279,6 +281,12 @@ def main():
     for icorpm, count in top_icorpm.items():
         print(f"  - {icorpm}: {count:,} ({count / len(df) * 100:.1f}%)")
     # Show YearsCode statistics
     print("\n💼 Years of Coding Experience:")
     print(f"  - Min: {df['YearsCode'].min():.1f}")

     "Age",
     "ICorPM",
     "OrgSize",
+    "Employment",
 ]
             "Age",
             "ICorPM",
             "OrgSize",
+            "Employment",
             "Currency",
             "CompTotal",
             "ConvertedCompYearly",
     for icorpm, count in top_icorpm.items():
         print(f"  - {icorpm}: {count:,} ({count / len(df) * 100:.1f}%)")
+    # Show employment distribution
+    print("\n💼 Employment Distribution:")
+    top_employment = df["Employment"].value_counts()
+    for emp, count in top_employment.items():
+        print(f"  - {emp}: {count:,} ({count / len(df) * 100:.1f}%)")
     # Show YearsCode statistics
     print("\n💼 Years of Coding Experience:")
     print(f"  - Min: {df['YearsCode'].min():.1f}")
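
The employment distribution printout added to `main()` counts each category and reports its share of all rows. The same formatting can be sketched without pandas, using `collections.Counter` in place of `value_counts()`. `distribution_lines` and the sample data are illustrative assumptions, not code from the repository.

```python
from collections import Counter
from typing import Iterable, List


def distribution_lines(values: Iterable[str]) -> List[str]:
    """Format each category as '  - name: count (pct%)', most common first.

    Hypothetical helper mirroring the print loop in the commit, with
    Counter swapped in for pandas' value_counts().
    """
    items = list(values)
    total = len(items)
    counts = Counter(items)
    return [
        f"  - {name}: {count:,} ({count / total * 100:.1f}%)"
        for name, count in counts.most_common()
    ]


# Made-up sample: three employed respondents and one student.
sample = ["Employed"] * 3 + ["Student"]
lines = distribution_lines(sample)
for line in lines:
    print(line)
```

One difference worth noting: pandas' `value_counts()` skips missing values by default, while the commit divides by `len(df)`, so the printed percentages may not sum to 100 when the column contains nulls.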

src/tune.py CHANGED

@@ -149,6 +149,7 @@ def main():
             "Age",
             "ICorPM",
             "OrgSize",
             "Currency",
             "CompTotal",
             "ConvertedCompYearly",

             "Age",
             "ICorPM",
             "OrgSize",
+            "Employment",
             "Currency",
             "CompTotal",
             "ConvertedCompYearly",

tests/conftest.py CHANGED

@@ -19,6 +19,7 @@ def sample_salary_input():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }

         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }

tests/test_feature_impact.py CHANGED

@@ -15,6 +15,7 @@ def test_years_experience_impact():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     years_tests = [0, 2, 5, 10, 20]
@@ -24,7 +25,7 @@ def test_years_experience_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -39,6 +40,7 @@ def test_country_impact():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     test_countries = [
@@ -59,7 +61,7 @@ def test_country_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -74,6 +76,7 @@ def test_education_impact():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     test_education = [
@@ -96,7 +99,7 @@ def test_education_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -111,6 +114,7 @@ def test_devtype_impact():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     test_devtypes = [
@@ -132,7 +136,7 @@ def test_devtype_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -147,6 +151,7 @@ def test_industry_impact():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     test_industries = [
@@ -168,7 +173,7 @@ def test_industry_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -183,6 +188,7 @@ def test_age_impact():
         "industry": "Software Development",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     test_ages = [
@@ -203,7 +209,7 @@ def test_age_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -218,6 +224,7 @@ def test_work_exp_impact():
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
     }
     work_exp_tests = [0, 1, 3, 5, 10, 20]
@@ -243,6 +250,7 @@ def test_icorpm_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "org_size": "20 to 99 employees",
     }
     test_icorpm = [
@@ -257,7 +265,7 @@ def test_icorpm_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -272,6 +280,7 @@ def test_org_size_impact():
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
     }
     test_org_sizes = valid_categories["OrgSize"][:5]
@@ -282,7 +291,7 @@ def test_org_size_impact():
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )
@@ -379,9 +388,10 @@ def test_combined_features():
             age=age,
             ic_or_pm=icorpm,
             org_size=org_size,
         )
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
-        f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
     )

         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     years_tests = [0, 2, 5, 10, 20]
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     test_countries = [
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     test_education = [
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     test_devtypes = [
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     test_industries = [
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "industry": "Software Development",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     test_ages = [
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     work_exp_tests = [0, 1, 3, 5, 10, 20]
         "industry": "Software Development",
         "age": "25-34 years old",
         "org_size": "20 to 99 employees",
+        "employment": "Employed",
     }
     test_icorpm = [
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
         "industry": "Software Development",
         "age": "25-34 years old",
         "ic_or_pm": "Individual contributor",
+        "employment": "Employed",
     }
     test_org_sizes = valid_categories["OrgSize"][:5]
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
             age=age,
             ic_or_pm=icorpm,
             org_size=org_size,
+            employment="Employed",
         )
         predictions.append(predict_salary(input_data))
     assert len(set(predictions)) == len(predictions), (
+        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
     )
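
Every impact test in this file repeats the same check: vary one field over a fixed base input and require that each prediction is distinct. A minimal sketch of that pattern is shown here, using a stub in place of the trained model. `assert_all_unique` and `stub_predict` are hypothetical names, and the stub's linear formula is invented purely so the example runs without `model.pkl`.

```python
from typing import Dict, List


def assert_all_unique(predictions: List[float]) -> None:
    """Fail if any two predictions are identical (same message as the tests)."""
    assert len(set(predictions)) == len(predictions), (
        f"Expected {len(predictions)} unique, got {len(set(predictions))}"
    )


def stub_predict(features: Dict[str, object]) -> float:
    """Hypothetical stand-in for predict_salary: linear in years_code."""
    return 50_000.0 + 2_000.0 * float(features["years_code"])


# Base input trimmed to two fields for illustration.
base_input = {"years_code": 5, "employment": "Employed"}

# Vary one field while holding the rest of the input constant.
predictions = [
    stub_predict({**base_input, "years_code": years})
    for years in [0, 2, 5, 10, 20]
]
assert_all_unique(predictions)
```

Sharing the assertion in one helper would also keep the shortened message in a single place, which is relevant given the 79-character line limit mentioned in `Claude.md`.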

tests/test_preprocessing.py CHANGED

@@ -87,6 +87,7 @@ class TestPrepareFeatures:
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
             }
         )
         result = prepare_features(df)
@@ -106,6 +107,7 @@ class TestPrepareFeatures:
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
             }
         )
         result = prepare_features(df)
@@ -125,6 +127,7 @@ class TestPrepareFeatures:
                 "Age": ["25-34 years old", "35-44 years old"],
                 "ICorPM": ["Individual contributor", "People manager"],
                 "OrgSize": ["20 to 99 employees", "100 to 499 employees"],
             }
         )
         result = prepare_features(df)
@@ -148,6 +151,7 @@ class TestPrepareFeatures:
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
             }
         )
         result = prepare_features(df)
@@ -167,10 +171,11 @@ class TestPrepareFeatures:
                 "Age": [None],
                 "ICorPM": [None],
                 "OrgSize": [None],
             }
         )
         result = prepare_features(df)
-        # Categoricals filled with "Unknown" → one-hot columns contain "Unknown"
         unknown_cols = [c for c in result.columns if "Unknown" in c]
         assert len(unknown_cols) > 0
@@ -185,6 +190,7 @@ class TestPrepareFeatures:
             "Age": ["25-34 years old"],
             "ICorPM": ["Individual contributor"],
             "OrgSize": ["20 to 99 employees"],
         }
         df_usa = pd.DataFrame({"Country": ["United States of America"], **base})
         df_deu = pd.DataFrame({"Country": ["Germany"], **base})
@@ -210,6 +216,7 @@ class TestPrepareFeatures:
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
             }
         )
         original_country = df["Country"].iloc[0]

                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
+                "Employment": ["Employed"],
             }
         )
         result = prepare_features(df)
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
+                "Employment": ["Employed"],
             }
         )
         result = prepare_features(df)
                 "Age": ["25-34 years old", "35-44 years old"],
                 "ICorPM": ["Individual contributor", "People manager"],
                 "OrgSize": ["20 to 99 employees", "100 to 499 employees"],
+                "Employment": ["Employed", "Employed"],
             }
         )
         result = prepare_features(df)
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
+                "Employment": ["Employed"],
             }
         )
         result = prepare_features(df)
                 "Age": [None],
                 "ICorPM": [None],
                 "OrgSize": [None],
+                "Employment": [None],
             }
         )
         result = prepare_features(df)
+        # Categoricals filled with "Unknown" → one-hot encodes "Unknown"
         unknown_cols = [c for c in result.columns if "Unknown" in c]
         assert len(unknown_cols) > 0
             "Age": ["25-34 years old"],
             "ICorPM": ["Individual contributor"],
             "OrgSize": ["20 to 99 employees"],
+            "Employment": ["Employed"],
         }
         df_usa = pd.DataFrame({"Country": ["United States of America"], **base})
         df_deu = pd.DataFrame({"Country": ["Germany"], **base})
                 "Age": ["25-34 years old"],
                 "ICorPM": ["Individual contributor"],
                 "OrgSize": ["20 to 99 employees"],
+                "Employment": ["Employed"],
             }
         )
         original_country = df["Country"].iloc[0]

tests/test_schema.py CHANGED

@@ -46,6 +46,7 @@ def test_missing_country():
             age="25-34 years old",
             ic_or_pm="Individual contributor",
             org_size="20 to 99 employees",
         )
@@ -61,6 +62,7 @@ def test_missing_education_level():
             age="25-34 years old",
             ic_or_pm="Individual contributor",
             org_size="20 to 99 employees",
         )
@@ -76,6 +78,7 @@ def test_missing_org_size():
             industry="Software Development",
             age="25-34 years old",
             ic_or_pm="Individual contributor",
         )

             age="25-34 years old",
             ic_or_pm="Individual contributor",
             org_size="20 to 99 employees",
+            employment="Employed",
         )
             age="25-34 years old",
             ic_or_pm="Individual contributor",
             org_size="20 to 99 employees",
+            employment="Employed",
         )
             industry="Software Development",
             age="25-34 years old",
             ic_or_pm="Individual contributor",
+            employment="Employed",
         )

tests/test_train.py CHANGED

@@ -35,6 +35,7 @@ def _make_salary_df(countries=None, salaries=None, n=100) -> pd.DataFrame:
             "Age": ["25-34 years old"] * n,
             "ICorPM": ["Individual contributor"] * n,
             "OrgSize": ["20 to 99 employees"] * n,
             "Currency": ["USD United States Dollar"] * n,
             "CompTotal": salaries,
             "ConvertedCompYearly": salaries,
@@ -142,6 +143,7 @@ class TestDropOtherRows:
                 "Age": ["25-34", "25-34", "25-34"],
                 "ICorPM": ["IC", "IC", "IC"],
                 "OrgSize": ["Small", "Small", "Small"],
             }
         )
         config = {
@@ -167,6 +169,7 @@ class TestDropOtherRows:
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
                 "OrgSize": ["Small", "Small"],
             }
         )
         config = {
@@ -191,6 +194,7 @@ class TestDropOtherRows:
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
                 "OrgSize": ["Small", "Small"],
             }
         )
         config = {
@@ -219,6 +223,7 @@ class TestExtractValidCategories:
                 "Age": ["25-34", "35-44", "25-34"],
                 "ICorPM": ["IC", "PM", "IC"],
                 "OrgSize": ["Small", "Large", "Small"],
             }
         )
         result = extract_valid_categories(df)
@@ -226,9 +231,10 @@ class TestExtractValidCategories:
         assert result["EdLevel"] == ["BS", "MS"]
         assert result["ICorPM"] == ["IC", "PM"]
         assert result["OrgSize"] == ["Large", "Small"]
     def test_all_categorical_features_present(self):
-        """All 7 categorical features are present as keys."""
         df = pd.DataFrame(
             {
                 "Country": ["USA"],
@@ -238,6 +244,7 @@ class TestExtractValidCategories:
                 "Age": ["25-34"],
                 "ICorPM": ["IC"],
                 "OrgSize": ["Small"],
             }
         )
         result = extract_valid_categories(df)
@@ -249,6 +256,7 @@ class TestExtractValidCategories:
             "Age",
             "ICorPM",
             "OrgSize",
         }
     def test_excludes_nan_values(self):
@@ -262,6 +270,7 @@ class TestExtractValidCategories:
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
                 "OrgSize": ["Small", "Small"],
             }
         )
         result = extract_valid_categories(df)

             "Age": ["25-34 years old"] * n,
             "ICorPM": ["Individual contributor"] * n,
             "OrgSize": ["20 to 99 employees"] * n,
+            "Employment": ["Employed"] * n,
             "Currency": ["USD United States Dollar"] * n,
             "CompTotal": salaries,
             "ConvertedCompYearly": salaries,
                 "Age": ["25-34", "25-34", "25-34"],
                 "ICorPM": ["IC", "IC", "IC"],
                 "OrgSize": ["Small", "Small", "Small"],
+                "Employment": ["FT", "FT", "FT"],
             }
         )
         config = {
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
                 "OrgSize": ["Small", "Small"],
+                "Employment": ["FT", "FT"],
             }
         )
         config = {
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
                 "OrgSize": ["Small", "Small"],
+                "Employment": ["FT", "FT"],
             }
         )
         config = {
                 "Age": ["25-34", "35-44", "25-34"],
                 "ICorPM": ["IC", "PM", "IC"],
                 "OrgSize": ["Small", "Large", "Small"],
+                "Employment": ["FT", "PT", "FT"],
             }
         )
         result = extract_valid_categories(df)
         assert result["EdLevel"] == ["BS", "MS"]
         assert result["ICorPM"] == ["IC", "PM"]
         assert result["OrgSize"] == ["Large", "Small"]
+        assert result["Employment"] == ["FT", "PT"]
     def test_all_categorical_features_present(self):
+        """All 8 categorical features are present as keys."""
         df = pd.DataFrame(
             {
                 "Country": ["USA"],
                 "Age": ["25-34"],
                 "ICorPM": ["IC"],
                 "OrgSize": ["Small"],
+                "Employment": ["FT"],
             }
         )
         result = extract_valid_categories(df)
             "Age",
             "ICorPM",
             "OrgSize",
+            "Employment",
         }
     def test_excludes_nan_values(self):
                 "Age": ["25-34", "25-34"],
                 "ICorPM": ["IC", "IC"],
                 "OrgSize": ["Small", "Small"],
+                "Employment": ["FT", "FT"],
             }
         )
         result = extract_valid_categories(df)
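
The `TestExtractValidCategories` cases above expect one key per categorical column, values sorted, and NaN entries excluded. The sketch below reproduces that behaviour under those assumptions. `sketch_valid_categories` and its `columns` parameter are hypothetical; the project's `extract_valid_categories` may be implemented differently.

```python
import pandas as pd


# Hypothetical helper mirroring what the tests expect; not the project's code.
def sketch_valid_categories(
    df: pd.DataFrame, columns: list[str]
) -> dict[str, list[str]]:
    """Return the sorted unique non-null values for each requested column."""
    return {
        col: sorted(df[col].dropna().unique().tolist())
        for col in columns
    }


df = pd.DataFrame(
    {
        "OrgSize": ["Small", "Large", "Small"],
        # None stands in for a missing survey answer and should be dropped.
        "Employment": ["FT", None, "PT"],
    }
)
result = sketch_valid_categories(df, ["OrgSize", "Employment"])
```

Sorting makes the output deterministic, which is why the tests can compare against literal lists such as `["FT", "PT"]`.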