dima806 committed on
Commit
eeeaee6
·
verified ·
1 Parent(s): 2cc5253

Upload 39 files

.gitignore CHANGED
@@ -186,7 +186,7 @@ cython_debug/
  # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
  # and can be added to the global gitignore or merged into this file. However, if you prefer,
  # you could uncomment the following to ignore the entire vscode folder
- # .vscode/
+ .vscode/
 
  # Ruff stuff:
  .ruff_cache/
Claude.md CHANGED
@@ -95,6 +95,8 @@ make check # lint + test + complexity + maintainability + audit + security
  | `make security` | bandit static security analysis |
  | `make pre-process` | Validate data + generate config artifacts (no model) |
  | `make tune` | Optuna hyperparameter search |
+ | `make ci` | Mirror of GitHub Actions CI (lint + test) |
+ | `make pre-commit` | Run all pre-commit hooks against every file |
 
  ### Training the model
 
@@ -141,6 +143,7 @@ input_data = SalaryInput(
  age="25-34 years old",
  ic_or_pm="Individual contributor",
  org_size="20 to 99 employees",
+ employment="Employed",
  )
  salary = predict_salary(input_data)
  ```
@@ -163,13 +166,14 @@ The `survey_results_public.csv` must include these columns:
  | `Age` | Age range |
  | `ICorPM` | Individual contributor or people manager |
  | `OrgSize` | Organisation size (number of employees) |
+ | `Employment` | Current employment status |
  | `ConvertedCompYearly` | Annual salary in USD (target variable) |
 
  ## Input Validation (Two Layers)
 
  ### Layer 1 — Pydantic schema (`src/schema.py`)
 
- All 9 fields are required. `years_code` and `work_exp` must be `>= 0`. Validated at
+ All 10 fields are required. `years_code` and `work_exp` must be `>= 0`. Validated at
  object construction time — raises `ValidationError` on failure.
 
  ### Layer 2 — Runtime guardrails (`src/infer.py`)
@@ -181,7 +185,7 @@ time. Raises `ValueError` with a clear message on invalid input.
 
  ### [src/schema.py](src/schema.py)
 
- Pydantic v2 `SalaryInput` model — defines all 9 required input fields, types, and
+ Pydantic v2 `SalaryInput` model — defines all 10 required input fields, types, and
  constraints. The JSON schema example in the docstring is the canonical usage example.
 
  ### [src/preprocessing.py](src/preprocessing.py)
@@ -232,17 +236,20 @@ When adding a new input feature, update **all** of the following in order:
  2. `src/schema.py` — add field to `SalaryInput`
  3. `src/preprocessing.py` — add to `_categorical_cols` (or numeric handling)
  4. `src/train.py` — add to `CATEGORICAL_FEATURES` and `usecols`
- 5. `src/infer.py` — add validation block and DataFrame column
- 6. `app.py` — add selectbox, default, sidebar entry, `SalaryInput` construction
- 7. `tests/conftest.py` — add to `sample_salary_input` fixture
- 8. `tests/test_schema.py` — assert field, add missing-field test
- 9. `tests/test_infer.py` — add invalid-value test
- 10. `tests/test_feature_impact.py` — add to all `base_input` dicts, add impact test
- 11. `tests/test_preprocessing.py` — add column to all `pd.DataFrame(...)` fixtures
- 12. `tests/test_train.py` — add column to `_make_salary_df` and all test DataFrames
- 13. `README.md` — required columns, valid categories list, code example
- 14. `example_inference.py` — add to all `SalaryInput` calls
- 15. Retrain: `uv run python -m src.train`
+ 5. `src/tune.py` — add to `usecols`
+ 6. `src/preprocess.py` — add to `REQUIRED_COLUMNS`
+ 7. `src/infer.py` — add validation block and DataFrame column
+ 8. `app.py` — add selectbox, default, sidebar entry, `SalaryInput` construction
+ 9. `tests/conftest.py` — add to `sample_salary_input` fixture
+ 10. `tests/test_schema.py` — assert field, add missing-field test
+ 11. `tests/test_infer.py` — add invalid-value test
+ 12. `tests/test_feature_impact.py` — add to all `base_input` dicts, add impact test
+ 13. `tests/test_preprocessing.py` — add column to all `pd.DataFrame(...)` fixtures
+ 14. `tests/test_train.py` — add column to `_make_salary_df` and all test DataFrames
+ 15. `README.md` required columns, valid categories list, code example
+ 16. `Claude.md` — data requirements table, field counts, code example, version
+ 17. `example_inference.py` — add to all `SalaryInput` calls
+ 18. Retrain: `uv run python -m src.train`
 
  ## Versioning
 
@@ -252,7 +259,7 @@ Follows [Semantic Versioning](https://semver.org/):
  - **MINOR** — new optional field, new supported country, new Makefile target
  - **PATCH** — bug fix, model retrain with same schema, config tuning
 
- Current version: `2.0.0` (added `OrgSize` required field).
+ Current version: `3.0.0` (added `Employment` required field).
 
  Update `pyproject.toml` before tagging:
 
@@ -293,7 +300,7 @@ uv run pre-commit run --all-files
  | ------- | --- |
  | `FileNotFoundError: model.pkl` | Run `uv run python -m src.train` |
  | `FileNotFoundError: valid_categories.yaml` | Same — generated by training |
- | `ValidationError` on `SalaryInput` | Check all 9 fields are present and non-negative numerics |
+ | `ValidationError` on `SalaryInput` | Check all 10 fields are present and non-negative numerics |
  | `ValueError: Invalid ...` at inference | Value not in `config/valid_categories.yaml`; retrain or use a listed value |
  | `E501` ruff errors | Lines > 79 chars — split strings, use variables, or wrap lists |
  | Tests fail after adding a feature | Check the "Updating Features" checklist above |
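The "Updating Features" checklist in Claude.md now names 18 places to touch, so a missed file is easy to overlook. One way to spot-check it is to scan the listed files for the new feature name. The sketch below is an illustration only: `files_missing_feature` is a hypothetical helper that is not part of the repository, and the demo runs against a throwaway directory with made-up file contents.

```python
import tempfile
from pathlib import Path


def files_missing_feature(root: Path, paths: list, feature: str) -> list:
    """Return the listed files that are absent or never mention `feature`."""
    missing = []
    for rel in paths:
        target = root / rel
        if not target.is_file():
            missing.append(rel)
        elif feature not in target.read_text(encoding="utf-8"):
            missing.append(rel)
    return missing


# Demo on a temporary directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "schema.py").write_text("employment: str\n", encoding="utf-8")
    (root / "tune.py").write_text("usecols = ['OrgSize']\n", encoding="utf-8")
    report = files_missing_feature(
        root, ["schema.py", "tune.py", "preprocess.py"], "employment"
    )

print(report)
```

The match is a case-sensitive substring test, so a real check would have to search for both the snake_case field name (`employment`) and the survey column name (`Employment`), depending on the file.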
Makefile CHANGED
@@ -1,5 +1,5 @@
  .PHONY: lint format test coverage complexity maintainability audit security \
- tune pre-process train app smoke-test guardrails check all
+ tune pre-process train app smoke-test guardrails check ci pre-commit all
 
  lint:
  uv run ruff check .
@@ -52,6 +52,13 @@ smoke-test:
  guardrails:
  uv run python guardrail_evaluation.py
 
+ # Mirrors GitHub Actions CI (.github/workflows/ci.yml): lint + test
+ ci: lint test
+
+ # Runs all pre-commit hooks against every file (.pre-commit-config.yaml)
+ pre-commit:
+ uv run pre-commit run --all-files
+
  # CI gate: fast checks that require no model or training data
  check: lint test complexity maintainability audit security
 
README.md CHANGED
@@ -45,7 +45,7 @@ Download the Stack Overflow Developer Survey CSV file:
  data/survey_results_public.csv
  ```
 
- **Required columns:** `Country`, `YearsCode`, `WorkExp`, `EdLevel`, `DevType`, `Industry`, `Age`, `ICorPM`, `OrgSize`, `ConvertedCompYearly`
+ **Required columns:** `Country`, `YearsCode`, `WorkExp`, `EdLevel`, `DevType`, `Industry`, `Age`, `ICorPM`, `OrgSize`, `Employment`, `ConvertedCompYearly`
 
  ### 3. Train the Model
 
@@ -109,6 +109,8 @@ This runs all quality gates in sequence:
 
  | Target | Tool | What it checks |
  | ------ | ---- | -------------- |
+ | `make ci` | ruff + pytest | Mirrors GitHub Actions CI (lint + test) |
+ | `make pre-commit` | pre-commit | All hooks from `.pre-commit-config.yaml` against every file |
  | `make lint` | ruff | Style and linting errors |
  | `make format` | ruff | Auto-formats code |
  | `make test` | pytest | Unit and integration tests |
@@ -142,6 +144,7 @@ Launch the Streamlit app and enter:
  - **Age**: Developer's age range
  - **IC or PM**: Individual contributor or people manager
  - **Organization Size**: Approximate number of employees at the developer's company
+ - **Employment Status**: Current employment status
 
  Click "Predict Salary" to see the estimated annual salary in USD plus a local
  currency equivalent where available.
@@ -162,6 +165,7 @@ input_data = SalaryInput(
  age="25-34 years old",
  ic_or_pm="Individual contributor",
  org_size="20 to 99 employees",
+ employment="Employed",
  )
 
  salary = predict_salary(input_data)
@@ -182,7 +186,7 @@ Validation is enforced at two layers:
 
  Checked at object construction time:
 
- - All 9 fields are required
+ - All 10 fields are required
  - `years_code` must be `>= 0`
  - `work_exp` must be `>= 0`
 
@@ -200,6 +204,7 @@ in `config/model_parameters.yaml`):
  - **Valid Age Ranges** (~7) — `Other` dropped
  - **Valid IC/PM Values** (~3) — `Other` dropped
  - **Valid Organization Sizes** (~8) — `Other` dropped
+ - **Valid Employment Statuses** (~5)
 
  Passing an invalid value raises a `ValueError` with a message pointing to
  `config/valid_categories.yaml`.
app.py CHANGED
@@ -34,6 +34,7 @@ with st.sidebar:
  - Age
  - Individual contributor or people manager
  - Organization size
+ - Employment status
  """
  )
  st.info("💡 Tip: Results are estimates based on survey averages.")
@@ -47,6 +48,7 @@ with st.sidebar:
  st.write(f"**Age Ranges:** {len(valid_categories['Age'])} available")
  st.write(f"**IC/PM Roles:** {len(valid_categories['ICorPM'])} available")
  st.write(f"**Org Sizes:** {len(valid_categories['OrgSize'])} available")
+ st.write(f"**Employment:** {len(valid_categories['Employment'])} available")
  st.caption("Only values from the training data are shown in the dropdowns.")
 
  # Main input form
@@ -62,6 +64,7 @@ valid_industries = valid_categories["Industry"]
  valid_ages = valid_categories["Age"]
  valid_icorpm = valid_categories["ICorPM"]
  valid_org_sizes = valid_categories["OrgSize"]
+ valid_employment = valid_categories["Employment"]
 
  # Set default values (if available)
  default_country = (
@@ -95,6 +98,9 @@ default_org_size = (
  if "20 to 99 employees" in valid_org_sizes
  else valid_org_sizes[0]
  )
+ default_employment = (
+ "Employed" if "Employed" in valid_employment else valid_employment[0]
+ )
 
  with col1:
  country = st.selectbox(
@@ -165,6 +171,13 @@ org_size = st.selectbox(
  help="Approximate number of employees at the developer's company",
  )
 
+ employment = st.selectbox(
+ "Employment Status",
+ options=valid_employment,
+ index=valid_employment.index(default_employment),
+ help="Current employment status",
+ )
+
  # Prediction button
  if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
  try:
@@ -179,6 +192,7 @@ if st.button("🔮 Predict Salary", type="primary", use_container_width=True):
  age=age,
  ic_or_pm=ic_or_pm,
  org_size=org_size,
+ employment=employment,
  )
 
  # Make prediction
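The default-selection logic added to `app.py` prefers a known value and otherwise falls back to the first available option, then converts that value to a list position for the dropdown. A minimal sketch of that pattern as a reusable helper, assuming nothing beyond what the diff shows: `pick_default` and the sample list are hypothetical and not taken from the repository.

```python
from typing import Sequence


def pick_default(options: Sequence[str], preferred: str) -> str:
    """Return `preferred` if it is an available option, else the first option."""
    if not options:
        raise ValueError("options must not be empty")
    return preferred if preferred in options else options[0]


# Hypothetical category list; the app reads the real one from
# config/valid_categories.yaml.
valid_employment = ["Employed", "Not employed", "Student"]
default_employment = pick_default(valid_employment, "Employed")

# A selectbox takes the position of the default, hence the .index() lookup.
default_index = valid_employment.index(default_employment)
```

The fallback matters because the category lists are generated at training time, so a hard-coded default such as `"Employed"` is not guaranteed to be present after a retrain.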
config/model_parameters.yaml CHANGED
@@ -17,21 +17,22 @@ features:
  - Age
  - ICorPM
  - OrgSize
+ - Employment
  encoding:
  drop_first: true
  model:
  n_estimators: 5000
- learning_rate: 0.020926294479210576
- max_depth: 5
- min_child_weight: 18
+ learning_rate: 0.056803456466335424
+ max_depth: 4
+ min_child_weight: 16
  random_state: 42
  n_jobs: -1
  early_stopping_rounds: 50
- subsample: 0.9191289771331972
- colsample_bytree: 0.5333460923651799
- reg_alpha: 0.00021933676399241674
- reg_lambda: 1.6854320949984984
- gamma: 3.8247794752407254
+ subsample: 0.9378495066287903
+ colsample_bytree: 0.589604213410477
+ reg_alpha: 1.2493619591455039
+ reg_lambda: 0.006641605590505938
+ gamma: 1.269496538435438
  training:
  verbose: false
  save_model: true
config/valid_categories.yaml CHANGED
@@ -107,3 +107,9 @@ OrgSize:
  - I don't know
  - Just me - I am a freelancer, sole proprietor, etc.
  - Less than 20 employees
+ Employment:
+ - Employed
+ - Independent contractor, freelancer, or self-employed
+ - Not employed
+ - Retired
+ - Student
models/model.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ea5bae7edfb8d4b29391e413aedfc94b5335b9bb86ede04e03a646a561e255af
- size 3338897
+ oid sha256:7be828632107b924dc18532601f3fc6bc1da3d0c6b8e3976a85102dc9787b7a3
+ size 1244154
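`models/model.pkl` is tracked with Git LFS, so the diff shows only the pointer file: a spec version, the SHA-256 of the real blob, and its size in bytes. A minimal sketch of reading those fields, assuming the three-line `key value` layout shown above. `parse_lfs_pointer` is a hypothetical helper, not part of the repository.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


new_pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7be828632107b924dc18532601f3fc6bc1da3d0c6b8e3976a85102dc9787b7a3
size 1244154
"""
info = parse_lfs_pointer(new_pointer)

old_size = 3338897  # size recorded in the previous pointer
shrink = 1 - info["size"] / old_size  # fraction by which the file shrank
```

Comparing the two `size` values shows the retrained model file is a little over a third of its previous size, which fits the shallower `max_depth` in the new config, although the diff alone does not establish the cause.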
src/infer.py CHANGED
@@ -120,6 +120,13 @@ def predict_salary(data: SalaryInput) -> float:
  f"Check config/valid_categories.yaml for all valid values."
  )
 
+ if data.employment not in valid_categories["Employment"]:
+ raise ValueError(
+ f"Invalid employment status: '{data.employment}'. "
+ f"Must be one of {valid_categories['Employment']}. "
+ f"Check config/valid_categories.yaml for all valid values."
+ )
+
  # Create a DataFrame with the input data
  input_df = pd.DataFrame(
  {
@@ -132,6 +139,7 @@ def predict_salary(data: SalaryInput) -> float:
  "Age": [data.age],
  "ICorPM": [data.ic_or_pm],
  "OrgSize": [data.org_size],
+ "Employment": [data.employment],
  }
  )
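Each categorical guardrail in `predict_salary` has the same shape: look up the allowed values, then raise `ValueError` with a message that points at `config/valid_categories.yaml`. A minimal sketch of that check written as one loop over a field-to-category mapping. The `FIELD_TO_CATEGORY` table, the `check_categories` helper, and the inline category lists are assumptions for illustration; the repository appears to keep one explicit block per field.

```python
# Hypothetical subset of config/valid_categories.yaml, inlined for the sketch.
valid_categories = {
    "OrgSize": ["20 to 99 employees", "Less than 20 employees"],
    "Employment": ["Employed", "Not employed", "Retired", "Student"],
}

# Maps an input attribute name to its key in valid_categories.
FIELD_TO_CATEGORY = {"org_size": "OrgSize", "employment": "Employment"}


def check_categories(data: dict) -> None:
    """Raise ValueError for the first field whose value is not allowed."""
    for field, category in FIELD_TO_CATEGORY.items():
        allowed = valid_categories[category]
        if data[field] not in allowed:
            raise ValueError(
                f"Invalid {field}: '{data[field]}'. "
                f"Must be one of {allowed}. "
                "Check config/valid_categories.yaml for all valid values."
            )


# A valid input passes silently.
check_categories({"org_size": "20 to 99 employees", "employment": "Employed"})

# An unknown employment value is rejected with a descriptive message.
try:
    check_categories({"org_size": "20 to 99 employees", "employment": "Intern"})
except ValueError as exc:
    error_message = str(exc)
```

A table-driven loop means a new feature needs one mapping entry instead of a copied block, at the cost of less tailored wording in each error message.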
 
src/preprocess.py CHANGED
@@ -35,6 +35,7 @@ REQUIRED_COLUMNS = [
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  "Currency",
  "CompTotal",
  "ConvertedCompYearly",
src/preprocessing.py CHANGED
@@ -79,7 +79,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
 
  Args:
  df: DataFrame with columns: Country, YearsCode, WorkExp, EdLevel,
- DevType, Industry, Age, ICorPM, OrgSize.
+ DevType, Industry, Age, ICorPM, OrgSize, Employment.
  NOTE: During training, cardinality reduction should be applied to df
  BEFORE calling this function. During inference, valid_categories.yaml
  ensures only valid (already-reduced) categories are used.
@@ -107,6 +107,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  ]
  for col in _categorical_cols:
  if col in df_processed.columns:
@@ -136,6 +137,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
  df_processed["Age"] = df_processed["Age"].fillna("Unknown")
  df_processed["ICorPM"] = df_processed["ICorPM"].fillna("Unknown")
  df_processed["OrgSize"] = df_processed["OrgSize"].fillna("Unknown")
+ df_processed["Employment"] = df_processed["Employment"].fillna("Unknown")
 
  # NOTE: Cardinality reduction is NOT applied here
  # It should be applied during training BEFORE calling this function
@@ -152,6 +154,7 @@ def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  ]
  df_features = df_processed[feature_cols]
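`prepare_features` fills missing categorical values with the literal `"Unknown"` before encoding, and the config in this commit sets `drop_first: true`. A plain-Python sketch of what those two steps do to a single column. It is not the repository's pandas code: `fill_unknown`, `one_hot`, and the `Column_value` naming are assumptions for illustration.

```python
def fill_unknown(values):
    """Replace missing entries (None) with the placeholder category."""
    return ["Unknown" if v is None else v for v in values]


def one_hot(column, values, drop_first=True):
    """One-hot encode `values`; drop the first sorted level if requested."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level becomes the all-zeros baseline
    return {
        f"{column}_{level}": [int(v == level) for v in values]
        for level in levels
    }


employment = fill_unknown(["Employed", None, "Student"])
encoded = one_hot("Employment", employment)
print(encoded)
```

With the first level dropped, a row for the baseline category is encoded as all zeros, which avoids a redundant column that is fully determined by the others.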
 
src/schema.py CHANGED
@@ -19,6 +19,7 @@ class SalaryInput(BaseModel):
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
  ]
  }
@@ -43,3 +44,4 @@ class SalaryInput(BaseModel):
  org_size: str = Field(
  ..., description="Size of the organisation the developer works for"
  )
+ employment: str = Field(..., description="Current employment status")
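The new field is a required string like the other categorical inputs, while the numeric fields must be non-negative according to the project docs changed in this commit. The sketch below mimics that contract using only the standard library, so it is a stand-in for the Pydantic model rather than a copy of it: `MiniSalaryInput` is a hypothetical name, it covers three of the ten fields, its numeric types are assumed, and it raises `TypeError` or `ValueError` instead of Pydantic's `ValidationError`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniSalaryInput:
    """Stdlib stand-in: every field is required, numerics must be >= 0."""

    years_code: float
    work_exp: float
    employment: str

    def __post_init__(self) -> None:
        for name in ("years_code", "work_exp"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be >= 0")


ok = MiniSalaryInput(years_code=5, work_exp=3, employment="Employed")

# Omitting a required field fails at construction time.
try:
    MiniSalaryInput(years_code=5, work_exp=3)
    missing_field_rejected = False
except TypeError:
    missing_field_rejected = True

# A negative numeric fails the range check.
try:
    MiniSalaryInput(years_code=-1, work_exp=3, employment="Employed")
    negative_rejected = False
except ValueError:
    negative_rejected = True
```

Because `employment` has no default, any caller that still builds the input with the previous nine fields fails immediately, which is why the docs in this commit treat a new required field as a major version bump.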
src/train.py CHANGED
@@ -19,6 +19,7 @@ CATEGORICAL_FEATURES = [
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  ]
 
 
@@ -169,6 +170,7 @@ def main():
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  "Currency",
  "CompTotal",
  "ConvertedCompYearly",
@@ -279,6 +281,12 @@ def main():
  for icorpm, count in top_icorpm.items():
  print(f" - {icorpm}: {count:,} ({count / len(df) * 100:.1f}%)")
 
+ # Show employment distribution
+ print("\n💼 Employment Distribution:")
+ top_employment = df["Employment"].value_counts()
+ for emp, count in top_employment.items():
+ print(f" - {emp}: {count:,} ({count / len(df) * 100:.1f}%)")
+
  # Show YearsCode statistics
  print("\n💼 Years of Coding Experience:")
  print(f" - Min: {df['YearsCode'].min():.1f}")
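The training script now prints how the `Employment` values are distributed, as a count and a share of all rows. The same summary can be sketched with `collections.Counter` in place of pandas `value_counts()`. The sample rows below are made up for illustration and do not reflect the survey data.

```python
from collections import Counter

# Made-up sample standing in for df["Employment"].
employment = ["Employed"] * 7 + ["Student"] * 2 + ["Retired"]

counts = Counter(employment)
total = len(employment)

# most_common() orders by descending count, like value_counts().
lines = [
    f" - {status}: {count:,} ({count / total * 100:.1f}%)"
    for status, count in counts.most_common()
]
print("\n".join(lines))
```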
src/tune.py CHANGED
@@ -149,6 +149,7 @@ def main():
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  "Currency",
  "CompTotal",
  "ConvertedCompYearly",
tests/conftest.py CHANGED
@@ -19,6 +19,7 @@ def sample_salary_input():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
 
tests/test_feature_impact.py CHANGED
@@ -15,6 +15,7 @@ def test_years_experience_impact():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  years_tests = [0, 2, 5, 10, 20]
@@ -24,7 +25,7 @@ def test_years_experience_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -39,6 +40,7 @@ def test_country_impact():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  test_countries = [
@@ -59,7 +61,7 @@ def test_country_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -74,6 +76,7 @@ def test_education_impact():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  test_education = [
@@ -96,7 +99,7 @@ def test_education_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -111,6 +114,7 @@ def test_devtype_impact():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  test_devtypes = [
@@ -132,7 +136,7 @@ def test_devtype_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -147,6 +151,7 @@ def test_industry_impact():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  test_industries = [
@@ -168,7 +173,7 @@ def test_industry_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -183,6 +188,7 @@ def test_age_impact():
  "industry": "Software Development",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  test_ages = [
@@ -203,7 +209,7 @@ def test_age_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -218,6 +224,7 @@ def test_work_exp_impact():
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  work_exp_tests = [0, 1, 3, 5, 10, 20]
@@ -243,6 +250,7 @@ def test_icorpm_impact():
  "industry": "Software Development",
  "age": "25-34 years old",
  "org_size": "20 to 99 employees",
+ "employment": "Employed",
  }
 
  test_icorpm = [
@@ -257,7 +265,7 @@ def test_icorpm_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -272,6 +280,7 @@ def test_org_size_impact():
  "industry": "Software Development",
  "age": "25-34 years old",
  "ic_or_pm": "Individual contributor",
+ "employment": "Employed",
  }
 
  test_org_sizes = valid_categories["OrgSize"][:5]
@@ -282,7 +291,7 @@ def test_org_size_impact():
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
 
 
@@ -379,9 +388,10 @@ def test_combined_features():
  age=age,
  ic_or_pm=icorpm,
  org_size=org_size,
+ employment="Employed",
  )
  predictions.append(predict_salary(input_data))
 
  assert len(set(predictions)) == len(predictions), (
- f"Expected {len(predictions)} unique predictions, got {len(set(predictions))}"
+ f"Expected {len(predictions)} unique, got {len(set(predictions))}"
  )
tests/test_preprocessing.py CHANGED
@@ -87,6 +87,7 @@ class TestPrepareFeatures:
  "Age": ["25-34 years old"],
  "ICorPM": ["Individual contributor"],
  "OrgSize": ["20 to 99 employees"],
+ "Employment": ["Employed"],
  }
  )
  result = prepare_features(df)
@@ -106,6 +107,7 @@ class TestPrepareFeatures:
  "Age": ["25-34 years old"],
  "ICorPM": ["Individual contributor"],
  "OrgSize": ["20 to 99 employees"],
+ "Employment": ["Employed"],
  }
  )
  result = prepare_features(df)
@@ -125,6 +127,7 @@ class TestPrepareFeatures:
  "Age": ["25-34 years old", "35-44 years old"],
  "ICorPM": ["Individual contributor", "People manager"],
  "OrgSize": ["20 to 99 employees", "100 to 499 employees"],
+ "Employment": ["Employed", "Employed"],
  }
  )
  result = prepare_features(df)
@@ -148,6 +151,7 @@ class TestPrepareFeatures:
  "Age": ["25-34 years old"],
  "ICorPM": ["Individual contributor"],
  "OrgSize": ["20 to 99 employees"],
+ "Employment": ["Employed"],
  }
  )
  result = prepare_features(df)
@@ -167,10 +171,11 @@ class TestPrepareFeatures:
  "Age": [None],
  "ICorPM": [None],
  "OrgSize": [None],
+ "Employment": [None],
  }
  )
  result = prepare_features(df)
- # Categoricals filled with "Unknown" → one-hot columns contain "Unknown"
+ # Categoricals filled with "Unknown" → one-hot encodes "Unknown"
  unknown_cols = [c for c in result.columns if "Unknown" in c]
  assert len(unknown_cols) > 0

@@ -185,6 +190,7 @@ class TestPrepareFeatures:
  "Age": ["25-34 years old"],
  "ICorPM": ["Individual contributor"],
  "OrgSize": ["20 to 99 employees"],
+ "Employment": ["Employed"],
  }
  df_usa = pd.DataFrame({"Country": ["United States of America"], **base})
  df_deu = pd.DataFrame({"Country": ["Germany"], **base})
@@ -210,6 +216,7 @@ class TestPrepareFeatures:
  "Age": ["25-34 years old"],
  "ICorPM": ["Individual contributor"],
  "OrgSize": ["20 to 99 employees"],
+ "Employment": ["Employed"],
  }
  )
  original_country = df["Country"].iloc[0]
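The missing-value hunk above expects categoricals such as `Employment` to be filled with "Unknown" before one-hot encoding. A minimal standard-library sketch of that behaviour follows; `one_hot_with_unknown` and the `column_value` naming scheme are hypothetical illustrations, not the project's `prepare_features`, which operates on pandas DataFrames.

```python
def one_hot_with_unknown(rows, categorical_columns):
    """One-hot encode categorical columns, mapping missing values to "Unknown".

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    encoded = []
    for row in rows:
        out = {}
        for col in categorical_columns:
            value = row.get(col)
            if value is None:
                value = "Unknown"  # fill missing categoricals before encoding
            out[f"{col}_{value}"] = 1
        encoded.append(out)
    # Give every row the same set of columns, defaulting absent ones to 0.
    all_columns = sorted({name for out in encoded for name in out})
    return [{name: out.get(name, 0) for name in all_columns} for out in encoded]


rows = [
    {"Age": None, "Employment": None},
    {"Age": "25-34 years old", "Employment": "Employed"},
]
encoded = one_hot_with_unknown(rows, ["Age", "Employment"])
# Same check as the test in the diff: some encoded columns mention "Unknown".
unknown_cols = [c for c in encoded[0] if "Unknown" in c]
```

Under these assumptions, a row whose categoricals are all missing still produces one active column per feature, which is what the `len(unknown_cols) > 0` assertion checks.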
tests/test_schema.py CHANGED
@@ -46,6 +46,7 @@ def test_missing_country():
  age="25-34 years old",
  ic_or_pm="Individual contributor",
  org_size="20 to 99 employees",
+ employment="Employed",
  )


@@ -61,6 +62,7 @@ def test_missing_education_level():
  age="25-34 years old",
  ic_or_pm="Individual contributor",
  org_size="20 to 99 employees",
+ employment="Employed",
  )


@@ -76,6 +78,7 @@ def test_missing_org_size():
  industry="Software Development",
  age="25-34 years old",
  ic_or_pm="Individual contributor",
+ employment="Employed",
  )


tests/test_train.py CHANGED
@@ -35,6 +35,7 @@ def _make_salary_df(countries=None, salaries=None, n=100) -> pd.DataFrame:
  "Age": ["25-34 years old"] * n,
  "ICorPM": ["Individual contributor"] * n,
  "OrgSize": ["20 to 99 employees"] * n,
+ "Employment": ["Employed"] * n,
  "Currency": ["USD United States Dollar"] * n,
  "CompTotal": salaries,
  "ConvertedCompYearly": salaries,
@@ -142,6 +143,7 @@ class TestDropOtherRows:
  "Age": ["25-34", "25-34", "25-34"],
  "ICorPM": ["IC", "IC", "IC"],
  "OrgSize": ["Small", "Small", "Small"],
+ "Employment": ["FT", "FT", "FT"],
  }
  )
  config = {
@@ -167,6 +169,7 @@ class TestDropOtherRows:
  "Age": ["25-34", "25-34"],
  "ICorPM": ["IC", "IC"],
  "OrgSize": ["Small", "Small"],
+ "Employment": ["FT", "FT"],
  }
  )
  config = {
@@ -191,6 +194,7 @@ class TestDropOtherRows:
  "Age": ["25-34", "25-34"],
  "ICorPM": ["IC", "IC"],
  "OrgSize": ["Small", "Small"],
+ "Employment": ["FT", "FT"],
  }
  )
  config = {
@@ -219,6 +223,7 @@ class TestExtractValidCategories:
  "Age": ["25-34", "35-44", "25-34"],
  "ICorPM": ["IC", "PM", "IC"],
  "OrgSize": ["Small", "Large", "Small"],
+ "Employment": ["FT", "PT", "FT"],
  }
  )
  result = extract_valid_categories(df)
@@ -226,9 +231,10 @@ class TestExtractValidCategories:
  assert result["EdLevel"] == ["BS", "MS"]
  assert result["ICorPM"] == ["IC", "PM"]
  assert result["OrgSize"] == ["Large", "Small"]
+ assert result["Employment"] == ["FT", "PT"]

  def test_all_categorical_features_present(self):
- """All 7 categorical features are present as keys."""
+ """All 8 categorical features are present as keys."""
  df = pd.DataFrame(
  {
  "Country": ["USA"],
@@ -238,6 +244,7 @@ class TestExtractValidCategories:
  "Age": ["25-34"],
  "ICorPM": ["IC"],
  "OrgSize": ["Small"],
+ "Employment": ["FT"],
  }
  )
  result = extract_valid_categories(df)
@@ -249,6 +256,7 @@ class TestExtractValidCategories:
  "Age",
  "ICorPM",
  "OrgSize",
+ "Employment",
  }

  def test_excludes_nan_values(self):
@@ -262,6 +270,7 @@ class TestExtractValidCategories:
  "Age": ["25-34", "25-34"],
  "ICorPM": ["IC", "IC"],
  "OrgSize": ["Small", "Small"],
+ "Employment": ["FT", "FT"],
  }
  )
  result = extract_valid_categories(df)
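The `TestExtractValidCategories` hunks above imply that each categorical column, now including `Employment`, maps to a sorted list of its distinct non-missing values. A minimal sketch of that contract, assuming plain Python lists in place of a DataFrame; `sketch_extract_valid_categories` and `_is_missing` are hypothetical names, not the project's implementation.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sketch_extract_valid_categories(columns):
    """Map each column name to the sorted list of its distinct non-missing values.

    `columns` is a dict of column name -> list of raw values. This is an
    illustrative stand-in, not the repository's extract_valid_categories.
    """
    result = {}
    for name, values in columns.items():
        present = {v for v in values if not _is_missing(v)}
        result[name] = sorted(present)
    return result


columns = {
    "OrgSize": ["Small", "Large", "Small"],
    "Employment": ["FT", "PT", "FT", None, float("nan")],
}
valid = sketch_extract_valid_categories(columns)
```

With these inputs the sketch yields the same shape the tests assert on: alphabetically ordered categories with missing entries excluded.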