
# Testing and Validation Documentation

This document provides a comprehensive overview of the testing and validation strategies employed in the Hopcroft project. It consolidates the technical details, execution commands, and analysis reports from Behavioral Testing, Deepchecks, Great Expectations, and Ruff.


## 1. Behavioral Testing

- **Report Source:** `reports/behavioral/`
- **Status:** All Tests Passed (36/36)
- **Last Run:** November 15, 2025
- **Model:** Random Forest + TF-IDF (SMOTE oversampling)
- **Execution Time:** ~8 minutes

Behavioral testing evaluates the model's capabilities and robustness beyond simple accuracy metrics.

### Test Categories & Results

| Category | Tests | Status | Description |
|---|---|---|---|
| Invariance Tests | 9 | Passed | Ensure model predictions remain stable under perturbations that shouldn't affect the outcome (e.g., changing variable names, minor typos). |
| Directional Tests | 10 | Passed | Verify that specific changes to the input cause expected changes in the output (e.g., adding specific keywords should increase the probability of related skills). |
| Minimum Functionality Tests | 17 | Passed | Check basic capabilities and sanity checks (e.g., simple inputs produce valid outputs). |
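As an illustration, an invariance test can be sketched in the pytest style as follows. Note that `predict_skills` and its keyword lookup are hypothetical stand-ins, not the project's actual API; the real tests exercise the Random Forest + TF-IDF pipeline.

```python
# Hypothetical sketch of an invariance test. `predict_skills` is a toy
# stand-in for the real model, used only to show the test structure.

def normalize(text: str) -> str:
    # Collapse case and whitespace so superficial edits do not matter.
    return " ".join(text.lower().split())

def predict_skills(text: str) -> set:
    # Toy keyword model standing in for the Random Forest + TF-IDF pipeline.
    vocabulary = {"python": "Python", "docker": "DevOps"}
    tokens = set(normalize(text).split())
    return {skill for word, skill in vocabulary.items() if word in tokens}

def test_invariance_to_case_and_whitespace():
    base = predict_skills("Fix the Python build script")
    perturbed = predict_skills("  fix   the PYTHON build script ")
    assert base == perturbed  # prediction must not change under perturbation
```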

### How to Regenerate

To run the behavioral tests and generate the JSON report:

```bash
python -m pytest tests/behavioral/ \
  --ignore=tests/behavioral/test_model_training.py \
  --json-report \
  --json-report-file=reports/behavioral/behavioral_tests_report.json \
  -v
```

## 2. Deepchecks Validation

- **Report Source:** `reports/deepchecks/`
- **Status:** Cleaned Data is Production-Ready (Score: 96%)
- **Last Run:** November 16, 2025

Deepchecks was used to validate the integrity of the dataset before and after cleaning. The validation process confirmed that the `data_cleaning.py` pipeline successfully resolved critical data quality issues.

### Dataset Statistics: Before vs. After Cleaning

| Metric | Before Cleaning | After Cleaning | Difference |
|---|---|---|---|
| Total Samples | 7,154 | 6,673 | -481 duplicates (6.72%) |
| Duplicates | 481 | 0 | RESOLVED |
| Data Leakage | Present | 0 samples | RESOLVED |
| Label Conflicts | Present | 0 | RESOLVED |
| Train/Test Split | N/A | 5,338 / 1,335 | 80/20 Stratified |
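The reported split sizes can be reproduced with scikit-learn. This is a simplified sketch: the real task is multi-label, while `train_test_split`'s `stratify` argument takes a single label vector, so a stand-in label is used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 6,673 cleaned samples; scikit-learn takes ceil(0.2 * 6673) = 1,335 test
# samples, leaving 5,338 for training, matching the table above.
rng = np.random.default_rng(42)
X = np.arange(6673).reshape(-1, 1)   # stand-in feature matrix
y = rng.integers(0, 2, size=6673)    # stand-in single label for stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 5338 1335
```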

### Validation Suites: Detailed Results

#### A. Data Integrity Suite (12 checks)

**Score:** 92% (7 Passed, 2 Non-Critical Failures, 2 Null)

- **PASSED:** Data Duplicates (0), Conflicting Labels (0), Mixed Nulls, Mixed Data Types, String Mismatch, String Length, Feature Label Correlation.
- **FAILED (Non-Critical/Acceptable):**
  1. Single Value in Column: some TF-IDF features are all zeros.
  2. Feature-Feature Correlation: high correlation between features.

#### B. Train-Test Validation Suite (12 checks)

**Score:** 100% (12 Passed)

- **PASSED (CRITICAL):** Train Test Samples Mix (0 leakage).
- **PASSED:** Datasets Size Comparison (80/20), New Label in Test (0), Feature Drift (< 0.025), Label Drift (0.0), Multivariate Drift.

### Interpretation of Results & Important Notes

The validation identified two "failures" that are actually expected behavior for this type of data:

1. **Features with Only Zeros (Non-Critical):**
   - Reason: TF-IDF creates sparse features. If a specific word (feature) never appears in the specific subset being tested, its column will be all zeros.
   - Impact: None. The model simply ignores these features.
2. **High Feature Correlation (Non-Critical):**
   - Reason: Linguistic terms naturally co-occur (e.g., "machine" and "learning", "python" and "code").
   - Impact: Slight multicollinearity, which Random Forest handles well.
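The all-zero-column behavior is easy to reproduce with scikit-learn's `TfidfVectorizer`; the toy corpus below is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning with python", "python code review", "docker deployment"]
vec = TfidfVectorizer()
vec.fit(docs)

# Transform a subset that never mentions "docker": its column is all zeros,
# which trips Deepchecks' "Single Value in Column" check but is harmless.
subset = vec.transform(["machine learning with python", "python code review"])
docker_col = vec.vocabulary_["docker"]
print(subset[:, docker_col].nnz)  # 0 non-zero entries in that column
```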

### Recommendations & Next Steps

1. **Model Retraining:** Now that the data is cleaned and leakage-free, the models should be retrained to obtain reliable performance metrics.
2. **Continuous Monitoring:** Use `run_all_deepchecks.py` in CI/CD pipelines to prevent regressions.

### How to Use the Tests

**Run Complete Validation (Recommended):**

```bash
python tests/deepchecks/run_all_deepchecks.py
```

**Run Specific Suites:**

```bash
# Data Integrity Only
python tests/deepchecks/test_data_integrity.py

# Train-Test Validation Only
python tests/deepchecks/test_train_test_validation.py

# Compare Original vs Cleaned
python tests/deepchecks/run_all_tests_comparison.py
```

## 3. Great Expectations Data Validation

- **Report Source:** `tests/great expectations/`
- **Status:** All 10 Tests Passed on Cleaned Data

Great Expectations provides a rigorous suite of 10 tests to validate the data pipeline at various stages.

### Detailed Test Descriptions

#### TEST 1: Raw Database Validation

- **Purpose:** Validates the integrity and schema of the `nlbse_tool_competition_data_by_issue` table. Ensures data source integrity before expensive feature engineering.
- **Checks:** Row count (7,000-10,000), column count (220-230), required columns present.
- **Result:** PASS. Schema is valid.

#### TEST 2: TF-IDF Feature Matrix Validation

- **Purpose:** Validates the statistical properties of the TF-IDF features. Ensures the feature matrix is suitable for ML algorithms.
- **Checks:** No NaN/Inf values, all values >= 0, at least 1 non-zero feature per sample.
- **Original Data:** FAIL (25 samples had 0 features due to empty text).
- **Cleaned Data:** PASS (sparse samples removed).
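These checks can be expressed as plain NumPy/SciPy assertions. This is a simplified equivalent of the Great Expectations suite, shown on a toy matrix rather than the real feature matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy TF-IDF-like matrix standing in for the real feature matrix.
X = csr_matrix(np.array([[0.3, 0.0], [0.0, 0.7], [0.5, 0.2]]))

assert np.all(np.isfinite(X.data))    # no NaN or Inf values
assert (X.data >= 0).all()            # TF-IDF weights are non-negative
assert (X.getnnz(axis=1) >= 1).all()  # each sample has >= 1 non-zero feature
print("TEST 2 checks passed")
```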

#### TEST 3: Multi-Label Binary Format Validation

- **Purpose:** Ensures the label matrix is binary ({0, 1}) for `MultiOutputClassifier`. Missing labels would invalidate training.
- **Checks:** Values in {0, 1}, correct dimensions.
- **Result:** PASS.

#### TEST 4: Feature-Label Consistency Validation

- **Purpose:** Validates alignment between the X and Y matrices. Misalignment causes catastrophic training failures.
- **Checks:** Row counts match, no empty vectors.
- **Original Data:** FAIL (empty feature vectors present).
- **Cleaned Data:** PASS (perfect alignment).

#### TEST 5: Label Distribution & Stratification

- **Purpose:** Ensures labels have enough samples for stratified splitting. Labels with insufficient samples cause stratification failures.
- **Checks:** Minimum 5 occurrences per label.
- **Original Data:** FAIL (75 labels had 0 occurrences).
- **Cleaned Data:** PASS (rare labels removed).
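A minimal sketch of the rare-label filtering this test enforces, using a toy binary label matrix (the threshold of 5 comes from the check above; the matrix values are illustrative):

```python
import numpy as np

# Drop labels with fewer than 5 positive samples so that stratified
# splitting cannot fail. Y is a toy binary label matrix.
MIN_PER_LABEL = 5
Y = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1],
              [1, 0, 1], [1, 0, 0], [1, 0, 1]])
counts = Y.sum(axis=0)      # per-label positive counts: [6, 0, 4]
keep = counts >= MIN_PER_LABEL
Y_kept = Y[:, keep]         # only the first label survives the filter
print(Y_kept.shape)  # (6, 1)
```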

#### TEST 6: Feature Sparsity & SMOTE Compatibility

- **Purpose:** Ensures feature density is sufficient for nearest-neighbor algorithms (SMOTE/ADASYN).
- **Checks:** Minimum 10 non-zero features per sample.
- **Original Data:** FAIL (31.5% of samples had < 10 features).
- **Cleaned Data:** PASS (incompatible samples removed).

#### TEST 7: Multi-Output Classifier Compatibility

- **Purpose:** Validates the multi-label structure. Insufficient multi-label samples would indicate an inappropriate architecture.
- **Checks:** > 50% of samples have multiple labels.
- **Result:** PASS (strong multi-label characteristics).

#### TEST 8: Duplicate Samples Detection

- **Purpose:** Detects duplicate feature vectors to prevent leakage.
- **Original Data:** FAIL (481 duplicates found).
- **Cleaned Data:** PASS (0 duplicates).
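One simple way to count duplicate feature vectors, shown on a toy matrix (a sketch of the idea, not the suite's actual implementation):

```python
import numpy as np

# Count duplicate feature vectors before splitting.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # row 2 repeats row 0
n_duplicates = len(X) - len(np.unique(X, axis=0))
print(n_duplicates)  # 1
```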

#### TEST 9: Train-Test Separation Validation

- **Purpose:** CRITICAL. Validates that there is no data leakage between the train and test sets.
- **Checks:** The intersection of the train and test sets must be empty.
- **Result:** PASS (cleaned data only).
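The empty-intersection check can be sketched by hashing rows to bytes and comparing sets (illustrative data; not the suite's actual code):

```python
import numpy as np

# The intersection of train and test rows must be empty.
X_train = np.array([[1, 0], [0, 1]])
X_test = np.array([[1, 1]])
train_keys = {row.tobytes() for row in X_train}
test_keys = {row.tobytes() for row in X_test}
assert not (train_keys & test_keys)  # no sample appears in both splits
print("no leakage detected")
```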

#### TEST 10: Label Consistency Validation

- **Purpose:** Ensures identical features have identical labels. Inconsistency indicates ground-truth errors.
- **Original Data:** FAIL (640 samples with conflicting labels).
- **Cleaned Data:** PASS (resolved via majority voting).
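The majority-voting resolution mentioned above can be sketched like this (toy label tuples; the real pipeline's tie-breaking rules are not specified here):

```python
from collections import Counter

# Identical feature vectors with conflicting label sets are resolved
# by majority vote across the duplicates.
labels_for_duplicates = [("python",), ("python",), ("java",)]
majority = Counter(labels_for_duplicates).most_common(1)[0][0]
print(majority)  # ('python',)
```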

### Running the Tests

```bash
python "tests/great expectations/test_gx.py"
```

## 4. Ruff Code Quality Analysis

- **Report Source:** `reports/ruff/`
- **Status:** All Issues Resolvable
- **Last Analysis:** November 17, 2025
- **Total Issues:** 28

Static code analysis was performed using Ruff to ensure code quality and adherence to PEP 8 standards.

### Issue Breakdown by File

| File | Issues | Severity | Key Findings |
|---|---|---|---|
| `data_cleaning.py` | 16 | Low/Med | Unsorted imports (I001), unused imports (F401), f-strings without placeholders (F541), comparison to `False` (E712) |
| `modeling/train.py` | 7 | Low/Med | Unused `SMOTE` import, unused variable `n_labels`, f-strings without placeholders |
| `features.py` | 2 | Low | Unused `nltk` import |
| `dataset.py` | 2 | Low | Unused `DB_PATH` import |
| `mlsmote.py` | 1 | Low | Unsorted imports |

### Configuration & Compliance

- **Command:** `ruff check . --output-format json --output-file reports/ruff/ruff_report.json`
- **Standards:** PEP 8 (Pass), Black-compatible (line length 88), isort (Pass).
- **Fixability:** 100% of issues can be fixed (26 automatically, 2 manually).
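A `pyproject.toml` configuration enforcing the standards above could look like the following. This is an illustrative sketch; the repository's actual Ruff configuration may differ.

```toml
[tool.ruff]
line-length = 88  # Black-compatible

[tool.ruff.lint]
# pycodestyle (E), pyflakes (F), and import sorting (I) cover the
# reported rule codes: I001, F401, F541, E712.
select = ["E", "F", "I"]
```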

## Conclusion

The project code quality is high, with only minor style and import issues that do not affect functionality but should be cleaned up for maintainability.