# Testing and Validation Documentation

This document provides a comprehensive overview of the testing and validation strategies employed in the Hopcroft project. It consolidates the technical details, execution commands, and analysis reports from Behavioral Testing, Deepchecks, Great Expectations, and Ruff.

---

## 1. Behavioral Testing

**Report Source:** `reports/behavioral/`
**Status:** All Tests Passed (36/36)
**Last Run:** November 15, 2025
**Model:** Random Forest + TF-IDF (SMOTE oversampling)
**Execution Time:** ~8 minutes

Behavioral testing evaluates the model's capabilities and robustness beyond simple accuracy metrics.

### Test Categories & Results

| Category | Tests | Status | Description |
|----------|-------|--------|-------------|
| **Invariance Tests** | 9 | **Passed** | Ensure model predictions remain stable under perturbations that should not affect the outcome (e.g., changing variable names, minor typos). |
| **Directional Tests** | 10 | **Passed** | Verify that specific changes to the input cause expected changes in the output (e.g., adding specific keywords should increase the probability of related skills). |
| **Minimum Functionality Tests** | 17 | **Passed** | Check basic capabilities and sanity checks (e.g., simple inputs produce valid outputs). |

### How to Regenerate

To run the behavioral tests and generate the JSON report:

```bash
python -m pytest tests/behavioral/ \
    --ignore=tests/behavioral/test_model_training.py \
    --json-report \
    --json-report-file=reports/behavioral/behavioral_tests_report.json \
    -v
```

---

## 2. Deepchecks Validation

**Report Source:** `reports/deepchecks/`
**Status:** Cleaned Data is Production-Ready (Score: 96%)
**Last Run:** November 16, 2025

Deepchecks was used to validate the integrity of the dataset before and after cleaning. The validation process confirmed that the `data_cleaning.py` pipeline successfully resolved the critical data quality issues.

### Dataset Statistics: Before vs. After Cleaning
| Metric | Before Cleaning | After Cleaning | Difference |
|--------|-----------------|----------------|------------|
| **Total Samples** | 7,154 | 6,673 | -481 duplicates (6.72%) |
| **Duplicates** | 481 | **0** | **RESOLVED** |
| **Data Leakage** | Present | **0 samples** | **RESOLVED** |
| **Label Conflicts** | Present | **0** | **RESOLVED** |
| **Train/Test Split** | N/A | 5,338 / 1,335 | 80/20 Stratified |

### Validation Suites: Detailed Results

#### A. Data Integrity Suite (12 checks)

**Score:** 92% (7 Passed, 2 Non-Critical Failures, 2 Null)

* **PASSED:** Data Duplicates (0), Conflicting Labels (0), Mixed Nulls, Mixed Data Types, String Mismatch, String Length, Feature Label Correlation.
* **FAILED (Non-Critical/Acceptable):**
  1. **Single Value in Column:** Some TF-IDF features are all zeros.
  2. **Feature-Feature Correlation:** High correlation between features.

#### B. Train-Test Validation Suite (12 checks)

**Score:** 100% (12 Passed)

* **PASSED (CRITICAL):** **Train Test Samples Mix (0 leakage)**.
* **PASSED:** Datasets Size Comparison (80/20), New Label in Test (0), Feature Drift (< 0.025), Label Drift (0.0), Multivariate Drift.

### Interpretation of Results & Important Notes

The validation identified two "failures" that are actually **expected behavior** for this type of data:

1. **Features with Only Zeros (Non-Critical):**
   * *Reason:* TF-IDF creates sparse features. If a specific word (feature) never appears in the subset being tested, its column will be all zeros.
   * *Impact:* None. The model simply ignores these features.
2. **High Feature Correlation (Non-Critical):**
   * *Reason:* Linguistic terms naturally co-occur (e.g., "machine" and "learning", "python" and "code").
   * *Impact:* Slight multicollinearity, which Random Forest handles well.

### Recommendations & Next Steps

1. **Model Retraining:** Now that the data is cleaned and leakage-free, the models should be retrained to obtain reliable performance metrics.
2. **Continuous Monitoring:** Use `run_all_deepchecks.py` in CI/CD pipelines to prevent regressions.

### How to Use the Tests

**Run Complete Validation (Recommended):**

```bash
python tests/deepchecks/run_all_deepchecks.py
```

**Run Specific Suites:**

```bash
# Data Integrity Only
python tests/deepchecks/test_data_integrity.py

# Train-Test Validation Only
python tests/deepchecks/test_train_test_validation.py

# Compare Original vs Cleaned
python tests/deepchecks/run_all_tests_comparison.py
```

---

## 3. Great Expectations Data Validation

**Report Source:** `tests/great expectations/`
**Status:** All 10 Tests Passed on Cleaned Data

Great Expectations provides a rigorous suite of 10 tests that validate the data pipeline at various stages.

### Detailed Test Descriptions

#### TEST 1: Raw Database Validation

* **Purpose:** Validates the integrity and schema of the `nlbse_tool_competition_data_by_issue` table, ensuring data source integrity before expensive feature engineering.
* **Checks:** Row count (7,000-10,000), column count (220-230), required columns present.
* **Result:** **PASS**. Schema is valid.

#### TEST 2: TF-IDF Feature Matrix Validation

* **Purpose:** Validates the statistical properties of the TF-IDF features, ensuring the feature matrix is suitable for ML algorithms.
* **Checks:** No NaN/Inf values, all values >= 0, at least 1 non-zero feature per sample.
* **Original Data:** **FAIL** (25 samples had 0 features due to empty text).
* **Cleaned Data:** **PASS** (sparse samples removed).

#### TEST 3: Multi-Label Binary Format Validation

* **Purpose:** Ensures the label matrix is binary {0, 1} for MultiOutputClassifier. Missing labels would invalidate training.
* **Checks:** Values in {0, 1}, correct dimensions.
* **Result:** **PASS**.

#### TEST 4: Feature-Label Consistency Validation

* **Purpose:** Validates alignment between the X and Y matrices. Misalignment causes catastrophic training failures.
* **Checks:** Row counts match, no empty vectors.
* **Original Data:** **FAIL** (empty feature vectors present).
* **Cleaned Data:** **PASS** (perfect alignment).

#### TEST 5: Label Distribution & Stratification

* **Purpose:** Ensures labels have enough samples for stratified splitting. Labels with insufficient samples cause stratification failures.
* **Checks:** Minimum 5 occurrences per label.
* **Original Data:** **FAIL** (75 labels had 0 occurrences).
* **Cleaned Data:** **PASS** (rare labels removed).

#### TEST 6: Feature Sparsity & SMOTE Compatibility

* **Purpose:** Ensures feature density is sufficient for nearest-neighbor algorithms (SMOTE/ADASYN).
* **Checks:** Minimum 10 non-zero features per sample.
* **Original Data:** **FAIL** (31.5% of samples had < 10 features).
* **Cleaned Data:** **PASS** (incompatible samples removed).

#### TEST 7: Multi-Output Classifier Compatibility

* **Purpose:** Validates the multi-label structure. Insufficient multi-label samples would indicate an inappropriate architecture.
* **Checks:** >50% of samples have multiple labels.
* **Result:** **PASS** (strong multi-label characteristics).

#### TEST 8: Duplicate Samples Detection

* **Purpose:** Detects duplicate feature vectors to prevent leakage.
* **Original Data:** **FAIL** (481 duplicates found).
* **Cleaned Data:** **PASS** (0 duplicates).

#### TEST 9: Train-Test Separation Validation

* **Purpose:** **CRITICAL**. Validates that no data leaks between the train and test sets.
* **Checks:** The intersection of the train and test sets must be empty.
* **Result:** **PASS** (cleaned data only).

#### TEST 10: Label Consistency Validation

* **Purpose:** Ensures identical features have identical labels. Inconsistency indicates ground-truth errors.
* **Original Data:** **FAIL** (640 samples with conflicting labels).
* **Cleaned Data:** **PASS** (resolved via majority voting).

### Running the Tests

```bash
python "tests/great expectations/test_gx.py"
```
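To make the leakage-related checks concrete, the logic behind TEST 8 (duplicate detection) and TEST 9 (train-test separation) can be sketched as plain, dependency-free Python. This is an illustrative restatement of what those expectations verify, not the project's actual Great Expectations implementation; the function names are hypothetical.

```python
def count_duplicates(rows):
    """TEST 8 sketch: count duplicate feature vectors within one dataset."""
    seen, dupes = set(), 0
    for row in rows:
        key = tuple(row)
        dupes += key in seen  # True adds 1, False adds 0
        seen.add(key)
    return dupes


def leakage_count(train_rows, test_rows):
    """TEST 9 sketch: count test samples that also occur in the training set."""
    train_set = {tuple(r) for r in train_rows}
    return sum(1 for r in test_rows if tuple(r) in train_set)


if __name__ == "__main__":
    train = [[0.0, 0.4], [0.1, 0.0], [0.2, 0.3]]
    test = [[0.5, 0.5], [0.1, 0.0]]       # second row also appears in train
    print(count_duplicates(train))        # prints 0
    print(leakage_count(train, test))     # prints 1 -> would fail TEST 9
```

A passing run of the real suite corresponds to both counts being exactly zero on the cleaned train/test matrices.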
---

## 4. Ruff Code Quality Analysis

**Report Source:** `reports/ruff/`
**Status:** All Issues Resolvable
**Last Analysis:** November 17, 2025
**Total Issues:** 28

Static code analysis was performed with Ruff to ensure code quality and adherence to the PEP 8 standard.

### Issue Breakdown by File

| File | Issues | Severity | Key Findings |
|------|--------|----------|--------------|
| `data_cleaning.py` | 16 | Low/Med | Unsorted imports (I001), unused imports (F401), f-strings without placeholders (F541), comparison to `False` (E712). |
| `modeling/train.py` | 7 | Low/Med | Unused `SMOTE` import, unused variable `n_labels`, f-strings. |
| `features.py` | 2 | Low | Unused `nltk` import. |
| `dataset.py` | 2 | Low | Unused `DB_PATH` import. |
| `mlsmote.py` | 1 | Low | Unsorted imports. |

### Configuration & Compliance

* **Command:** `ruff check . --output-format json --output-file reports/ruff/ruff_report.json`
* **Standards:** PEP 8 (Pass), Black-compatible (line length 88), isort (Pass).
* **Fixability:** 100% of issues can be fixed (26 automatically, 2 manually).

### Conclusion

The project's code quality is high, with only minor style and import issues that do not affect functionality but should be cleaned up for maintainability.
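To make the rule codes in the table above concrete, here is an illustrative before/after sketch of the two most common in-line findings, E712 and F541. The snippets are hypothetical examples of these rule violations, not code taken from `data_cleaning.py`; most findings of this kind (26 of the 28) are auto-fixable with `ruff check . --fix`.

```python
# E712: equality comparison to False.
# PEP 8 prefers an identity check (`is False`) or plain truthiness (`not f`).
flags = [True, False, True]
before_e712 = [f for f in flags if f == False]  # flagged by E712
after_e712 = [f for f in flags if f is False]   # E712-clean, same result here

# F541: f-string without any placeholders -- the `f` prefix does nothing.
before_f541 = f"Cleaning complete"  # flagged by F541
after_f541 = "Cleaning complete"    # plain string literal, identical value

print(before_e712 == after_e712)  # prints True
print(before_f541 == after_f541)  # prints True
```

The remaining codes in the table (I001 unsorted imports, F401 unused imports) are whole-line findings that Ruff likewise rewrites automatically.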