
# Testing and Validation Documentation

This document provides a comprehensive overview of the testing and validation strategies employed in the Hopcroft project. It consolidates the technical details, execution commands, and analysis reports from Behavioral Testing, Deepchecks, Great Expectations, and Ruff.


## 1. Behavioral Testing

- **Report Source:** `reports/behavioral/`
- **Status:** All Tests Passed (36/36)
- **Last Run:** November 15, 2025
- **Model:** Random Forest + TF-IDF (SMOTE oversampling)
- **Execution Time:** ~8 minutes

Behavioral testing evaluates the model's capabilities and robustness beyond simple accuracy metrics.

### Test Categories & Results

| Category | Tests | Status | Description |
|---|---|---|---|
| Invariance Tests | 9 | Passed | Ensure model predictions remain stable under perturbations that shouldn't affect the outcome (e.g., changing variable names, minor typos). |
| Directional Tests | 10 | Passed | Verify that specific changes to the input cause expected changes in the output (e.g., adding specific keywords should increase the probability of related skills). |
| Minimum Functionality Tests | 17 | Passed | Check basic capabilities and sanity checks (e.g., simple inputs produce valid outputs). |
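As an illustration, an invariance test can be sketched in the pytest style as follows. Note that `predict_skills` and its keyword lookup are hypothetical stand-ins, not the project's actual API; the real tests exercise the Random Forest + TF-IDF pipeline.

```python
# Hypothetical sketch of an invariance test. `predict_skills` is a toy
# stand-in for the real model, used only to show the test structure.

def normalize(text: str) -> str:
    # Collapse case and whitespace so superficial edits do not matter.
    return " ".join(text.lower().split())

def predict_skills(text: str) -> set:
    # Toy keyword model standing in for the Random Forest + TF-IDF pipeline.
    vocabulary = {"python": "Python", "docker": "DevOps"}
    tokens = set(normalize(text).split())
    return {skill for word, skill in vocabulary.items() if word in tokens}

def test_invariance_to_case_and_whitespace():
    base = predict_skills("Fix the Python build script")
    perturbed = predict_skills("  fix   the PYTHON build script ")
    assert base == perturbed  # prediction must not change under perturbation
```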

### How to Regenerate

To run the behavioral tests and generate the JSON report:

```bash
python -m pytest tests/behavioral/ \
  --ignore=tests/behavioral/test_model_training.py \
  --json-report \
  --json-report-file=reports/behavioral/behavioral_tests_report.json \
  -v
```

## 2. Deepchecks Validation

- **Report Source:** `reports/deepchecks/`
- **Status:** Cleaned Data is Production-Ready (Score: 96%)
- **Last Run:** November 16, 2025

Deepchecks was used to validate the integrity of the dataset before and after cleaning. The validation process confirmed that the `data_cleaning.py` pipeline successfully resolved critical data quality issues.

### Dataset Statistics: Before vs. After Cleaning

| Metric | Before Cleaning | After Cleaning | Difference |
|---|---|---|---|
| Total Samples | 7,154 | 6,673 | -481 duplicates (6.72%) |
| Duplicates | 481 | 0 | RESOLVED |
| Data Leakage | Present | 0 samples | RESOLVED |
| Label Conflicts | Present | 0 | RESOLVED |
| Train/Test Split | N/A | 5,338 / 1,335 | 80/20 Stratified |
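The reported split sizes can be reproduced with scikit-learn. This is a simplified sketch: the real task is multi-label, while `train_test_split`'s `stratify` argument takes a single label vector, so a stand-in label is used here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 6,673 cleaned samples; scikit-learn takes ceil(0.2 * 6673) = 1,335 test
# samples, leaving 5,338 for training, matching the table above.
rng = np.random.default_rng(42)
X = np.arange(6673).reshape(-1, 1)   # stand-in feature matrix
y = rng.integers(0, 2, size=6673)    # stand-in single label for stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 5338 1335
```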

### Validation Suites: Detailed Results

#### A. Data Integrity Suite (12 checks)

**Score:** 92% (7 Passed, 2 Non-Critical Failures, 2 Null)

- **PASSED:** Data Duplicates (0), Conflicting Labels (0), Mixed Nulls, Mixed Data Types, String Mismatch, String Length, Feature Label Correlation.
- **FAILED (Non-Critical/Acceptable):**
  1. Single Value in Column: some TF-IDF features are all zeros.
  2. Feature-Feature Correlation: high correlation between features.

#### B. Train-Test Validation Suite (12 checks)

**Score:** 100% (12 Passed)

- **PASSED (CRITICAL):** Train Test Samples Mix (0 leakage).
- **PASSED:** Datasets Size Comparison (80/20), New Label in Test (0), Feature Drift (< 0.025), Label Drift (0.0), Multivariate Drift.

### Interpretation of Results & Important Notes

The validation identified two "failures" that are actually expected behavior for this type of data:

1. **Features with Only Zeros (Non-Critical):**
   - Reason: TF-IDF creates sparse features. If a specific word (feature) never appears in the specific subset being tested, its column will be all zeros.
   - Impact: None. The model simply ignores these features.
2. **High Feature Correlation (Non-Critical):**
   - Reason: Linguistic terms naturally co-occur (e.g., "machine" and "learning", "python" and "code").
   - Impact: Slight multicollinearity, which Random Forest handles well.
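The all-zero-column behavior is easy to reproduce with scikit-learn's `TfidfVectorizer`; the toy corpus below is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning with python", "python code review", "docker deployment"]
vec = TfidfVectorizer()
vec.fit(docs)

# Transform a subset that never mentions "docker": its column is all zeros,
# which trips Deepchecks' "Single Value in Column" check but is harmless.
subset = vec.transform(["machine learning with python", "python code review"])
docker_col = vec.vocabulary_["docker"]
print(subset[:, docker_col].nnz)  # 0 non-zero entries in that column
```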

### Recommendations & Next Steps

1. **Model Retraining:** Now that the data is cleaned and leakage-free, the models should be retrained to obtain reliable performance metrics.
2. **Continuous Monitoring:** Use `run_all_deepchecks.py` in CI/CD pipelines to prevent regressions.

### How to Use the Tests

**Run Complete Validation (Recommended):**

```bash
python tests/deepchecks/run_all_deepchecks.py
```

**Run Specific Suites:**

```bash
# Data Integrity Only
python tests/deepchecks/test_data_integrity.py

# Train-Test Validation Only
python tests/deepchecks/test_train_test_validation.py

# Compare Original vs Cleaned
python tests/deepchecks/run_all_tests_comparison.py
```

## 3. Great Expectations Data Validation

- **Report Source:** `tests/great expectations/`
- **Status:** All 10 Tests Passed on Cleaned Data

Great Expectations provides a rigorous suite of 10 tests to validate the data pipeline at various stages.

### Detailed Test Descriptions

#### TEST 1: Raw Database Validation

- **Purpose:** Validates the integrity and schema of the `nlbse_tool_competition_data_by_issue` table. Ensures data source integrity before expensive feature engineering.
- **Checks:** Row count (7,000-10,000), column count (220-230), required columns present.
- **Result:** PASS. Schema is valid.

#### TEST 2: TF-IDF Feature Matrix Validation

- **Purpose:** Validates the statistical properties of the TF-IDF features. Ensures the feature matrix is suitable for ML algorithms.
- **Checks:** No NaN/Inf values, all values >= 0, at least 1 non-zero feature per sample.
- **Original Data:** FAIL (25 samples had 0 features due to empty text).
- **Cleaned Data:** PASS (sparse samples removed).
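These checks can be expressed as plain NumPy/SciPy assertions. This is a simplified equivalent of the Great Expectations suite, shown on a toy matrix rather than the real feature matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy TF-IDF-like matrix standing in for the real feature matrix.
X = csr_matrix(np.array([[0.3, 0.0], [0.0, 0.7], [0.5, 0.2]]))

assert np.all(np.isfinite(X.data))    # no NaN or Inf values
assert (X.data >= 0).all()            # TF-IDF weights are non-negative
assert (X.getnnz(axis=1) >= 1).all()  # each sample has >= 1 non-zero feature
print("TEST 2 checks passed")
```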

#### TEST 3: Multi-Label Binary Format Validation

- **Purpose:** Ensures the label matrix is binary ({0, 1}) for `MultiOutputClassifier`. Missing labels would invalidate training.
- **Checks:** Values in {0, 1}, correct dimensions.
- **Result:** PASS.

#### TEST 4: Feature-Label Consistency Validation

- **Purpose:** Validates alignment between the X and Y matrices. Misalignment causes catastrophic training failures.
- **Checks:** Row counts match, no empty vectors.
- **Original Data:** FAIL (empty feature vectors present).
- **Cleaned Data:** PASS (perfect alignment).

#### TEST 5: Label Distribution & Stratification

- **Purpose:** Ensures labels have enough samples for stratified splitting. Labels with insufficient samples cause stratification failures.
- **Checks:** Minimum 5 occurrences per label.
- **Original Data:** FAIL (75 labels had 0 occurrences).
- **Cleaned Data:** PASS (rare labels removed).
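A minimal sketch of the rare-label filtering this test enforces, using a toy binary label matrix (the threshold of 5 comes from the check above; the matrix values are illustrative):

```python
import numpy as np

# Drop labels with fewer than 5 positive samples so that stratified
# splitting cannot fail. Y is a toy binary label matrix.
MIN_PER_LABEL = 5
Y = np.array([[1, 0, 1], [1, 0, 0], [1, 0, 1],
              [1, 0, 1], [1, 0, 0], [1, 0, 1]])
counts = Y.sum(axis=0)      # per-label positive counts: [6, 0, 4]
keep = counts >= MIN_PER_LABEL
Y_kept = Y[:, keep]         # only the first label survives the filter
print(Y_kept.shape)  # (6, 1)
```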

#### TEST 6: Feature Sparsity & SMOTE Compatibility

- **Purpose:** Ensures feature density is sufficient for nearest-neighbor algorithms (SMOTE/ADASYN).
- **Checks:** Minimum 10 non-zero features per sample.
- **Original Data:** FAIL (31.5% of samples had < 10 features).
- **Cleaned Data:** PASS (incompatible samples removed).

#### TEST 7: Multi-Output Classifier Compatibility

- **Purpose:** Validates the multi-label structure. Insufficient multi-label samples would indicate an inappropriate architecture.
- **Checks:** > 50% of samples have multiple labels.
- **Result:** PASS (strong multi-label characteristics).

#### TEST 8: Duplicate Samples Detection

- **Purpose:** Detects duplicate feature vectors to prevent leakage.
- **Original Data:** FAIL (481 duplicates found).
- **Cleaned Data:** PASS (0 duplicates).
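One simple way to count duplicate feature vectors, shown on a toy matrix (a sketch of the idea, not the suite's actual implementation):

```python
import numpy as np

# Count duplicate feature vectors before splitting.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # row 2 repeats row 0
n_duplicates = len(X) - len(np.unique(X, axis=0))
print(n_duplicates)  # 1
```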

#### TEST 9: Train-Test Separation Validation

- **Purpose:** CRITICAL. Validates that there is no data leakage between the train and test sets.
- **Checks:** The intersection of the train and test sets must be empty.
- **Result:** PASS (cleaned data only).
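The empty-intersection check can be sketched by hashing rows to bytes and comparing sets (illustrative data; not the suite's actual code):

```python
import numpy as np

# The intersection of train and test rows must be empty.
X_train = np.array([[1, 0], [0, 1]])
X_test = np.array([[1, 1]])
train_keys = {row.tobytes() for row in X_train}
test_keys = {row.tobytes() for row in X_test}
assert not (train_keys & test_keys)  # no sample appears in both splits
print("no leakage detected")
```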

#### TEST 10: Label Consistency Validation

- **Purpose:** Ensures identical features have identical labels. Inconsistency indicates ground-truth errors.
- **Original Data:** FAIL (640 samples with conflicting labels).
- **Cleaned Data:** PASS (resolved via majority voting).
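The majority-voting resolution mentioned above can be sketched like this (toy label tuples; the real pipeline's tie-breaking rules are not specified here):

```python
from collections import Counter

# Identical feature vectors with conflicting label sets are resolved
# by majority vote across the duplicates.
labels_for_duplicates = [("python",), ("python",), ("java",)]
majority = Counter(labels_for_duplicates).most_common(1)[0][0]
print(majority)  # ('python',)
```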

### Running the Tests

```bash
python "tests/great expectations/test_gx.py"
```

## 4. Ruff Code Quality Analysis

- **Report Source:** `reports/ruff/`
- **Status:** All Issues Resolvable
- **Last Analysis:** November 17, 2025
- **Total Issues:** 28

Static code analysis was performed using Ruff to ensure code quality and adherence to PEP 8 standards.

### Issue Breakdown by File

| File | Issues | Severity | Key Findings |
|---|---|---|---|
| `data_cleaning.py` | 16 | Low/Med | Unsorted imports (I001), unused imports (F401), f-strings without placeholders (F541), comparison to `False` (E712) |
| `modeling/train.py` | 7 | Low/Med | Unused `SMOTE` import, unused variable `n_labels`, f-strings without placeholders |
| `features.py` | 2 | Low | Unused `nltk` import |
| `dataset.py` | 2 | Low | Unused `DB_PATH` import |
| `mlsmote.py` | 1 | Low | Unsorted imports |

### Configuration & Compliance

- **Command:** `ruff check . --output-format json --output-file reports/ruff/ruff_report.json`
- **Standards:** PEP 8 (Pass), Black-compatible (line length 88), isort (Pass).
- **Fixability:** 100% of issues can be fixed (26 automatically, 2 manually).
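A `pyproject.toml` configuration enforcing the standards above could look like the following. This is an illustrative sketch; the repository's actual Ruff configuration may differ.

```toml
[tool.ruff]
line-length = 88  # Black-compatible

[tool.ruff.lint]
# pycodestyle (E), pyflakes (F), and import sorting (I) cover the
# reported rule codes: I001, F401, F541, E712.
select = ["E", "F", "I"]
```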

## Conclusion

The project code quality is high, with only minor style and import issues that do not affect functionality but should be cleaned up for maintainability.