# Testing and Validation Documentation
This document provides a comprehensive overview of the testing and validation strategies employed in the Hopcroft project, consolidating the technical details, execution commands, and analysis reports from Behavioral Testing, Deepchecks, Great Expectations, and Ruff.
---
## 1. Behavioral Testing
**Report Source:** `reports/behavioral/`
**Status:** All Tests Passed (36/36)
**Last Run:** November 15, 2025
**Model:** Random Forest + TF-IDF (SMOTE oversampling)
**Execution Time:** ~8 minutes
Behavioral testing evaluates the model's capabilities and robustness beyond simple accuracy metrics.
### Test Categories & Results
| Category | Tests | Status | Description |
|----------|-------|--------|-------------|
| **Invariance Tests** | 9 | **Passed** | Ensure model predictions remain stable under perturbations that shouldn't affect the outcome (e.g., changing variable names, minor typos). |
| **Directional Tests** | 10 | **Passed** | Verify that specific changes to the input cause expected changes in the output (e.g., adding specific keywords should increase probability of related skills). |
| **Minimum Functionality Tests** | 17 | **Passed** | Check basic capabilities and sanity checks (e.g., simple inputs produce valid outputs). |
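The three categories above can be sketched in plain pytest style. The `predict_skills` function below is a hypothetical stand-in for the real Random Forest + TF-IDF pipeline (the actual tests live in `tests/behavioral/`); it exists only to illustrate the invariance and directional patterns.

```python
# Minimal sketch of invariance and directional tests.
# predict_skills is a hypothetical stub, not the project's real model.
def predict_skills(text: str) -> set[str]:
    """Stand-in for the trained Random Forest + TF-IDF pipeline."""
    skills = set()
    if "python" in text.lower():
        skills.add("python")
    if "machine learning" in text.lower():
        skills.add("ml")
    return skills

def test_invariance_to_typos():
    # A minor typo elsewhere in the text should not change the prediction.
    original = "Fix the Python crash in the parser"
    perturbed = "Fix the Python crash in teh parser"
    assert predict_skills(original) == predict_skills(perturbed)

def test_directional_keyword():
    # Adding a related keyword should only add labels, never remove them.
    base = "Refactor the logging module"
    augmented = base + " and add machine learning metrics"
    assert predict_skills(base) <= predict_skills(augmented)

test_invariance_to_typos()
test_directional_keyword()
```

The key design point is that neither test asserts a specific accuracy number; both assert a *relationship* between predictions, which is what makes behavioral tests robust to retraining.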
### How to Regenerate
To run the behavioral tests and generate the JSON report:
```bash
python -m pytest tests/behavioral/ \
--ignore=tests/behavioral/test_model_training.py \
--json-report \
--json-report-file=reports/behavioral/behavioral_tests_report.json \
-v
```
---
## 2. Deepchecks Validation
**Report Source:** `reports/deepchecks/`
**Status:** Cleaned Data is Production-Ready (Score: 96%)
**Last Run:** November 16, 2025
Deepchecks was used to validate the integrity of the dataset before and after cleaning. The validation process confirmed that the `data_cleaning.py` pipeline successfully resolved critical data quality issues.
### Dataset Statistics: Before vs. After Cleaning
| Metric | Before Cleaning | After Cleaning | Difference |
|--------|-----------------|----------------|------------|
| **Total Samples** | 7,154 | 6,673 | -481 duplicates (6.72%) |
| **Duplicates** | 481 | **0** | **RESOLVED** |
| **Data Leakage** | Present | **0 samples** | **RESOLVED** |
| **Label Conflicts** | Present | **0** | **RESOLVED** |
| **Train/Test Split** | N/A | 5,338 / 1,335 | 80/20 Stratified |
### Validation Suites Detailed Results
#### A. Data Integrity Suite (12 checks)
**Score:** 92% (7 Passed, 2 Non-Critical Failures, 2 Null)
* **PASSED:** Data Duplicates (0), Conflicting Labels (0), Mixed Nulls, Mixed Data Types, String Mismatch, String Length, Feature Label Correlation.
* **FAILED (Non-Critical/Acceptable):**
1. **Single Value in Column:** Some TF-IDF features are all zeros.
2. **Feature-Feature Correlation:** High correlation between features.
#### B. Train-Test Validation Suite (12 checks)
**Score:** 100% (12 Passed)
* **PASSED (CRITICAL):** **Train Test Samples Mix (0 leakage)**.
* **PASSED:** Datasets Size Comparison (80/20), New Label in Test (0), Feature Drift (< 0.025), Label Drift (0.0), Multivariate Drift.
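The critical Train Test Samples Mix check boils down to verifying that no feature row appears in both splits. A minimal sketch of that idea (using row hashing; the actual Deepchecks implementation differs) looks like this:

```python
import hashlib

def row_fingerprints(rows):
    """Hash each feature row so exact copies across splits are detectable."""
    return {hashlib.sha256(repr(tuple(r)).encode()).hexdigest() for r in rows}

def leaked_samples(train_rows, test_rows):
    """Number of distinct test rows that also appear in the train set."""
    return len(row_fingerprints(train_rows) & row_fingerprints(test_rows))

# Toy example: disjoint rows -> zero leakage, matching the suite's result.
train = [[0.0, 0.3, 0.0], [0.1, 0.0, 0.2]]
test = [[0.0, 0.0, 0.9]]
assert leaked_samples(train, test) == 0
```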
### Interpretation of Results & Important Notes
The validation identified two "failures" that are actually **expected behavior** for this type of data:
1. **Features with Only Zeros (Non-Critical):**
* *Reason:* TF-IDF creates sparse features. If a specific word (feature) never appears in the specific subset being tested, its column will be all zeros.
* *Impact:* None. The model simply ignores these features.
2. **High Feature Correlation (Non-Critical):**
* *Reason:* Linguistic terms naturally co-occur (e.g., "machine" and "learning", "python" and "code").
* *Impact:* Slight multicollinearity, which Random Forest handles well.
### Recommendations & Next Steps
1. **Model Retraining:** Now that the data is cleaned and leakage-free, the models should be retrained to obtain reliable performance metrics.
2. **Continuous Monitoring:** Use `run_all_deepchecks.py` in CI/CD pipelines to prevent regression.
### How to Use the Tests
**Run Complete Validation (Recommended):**
```bash
python tests/deepchecks/run_all_deepchecks.py
```
**Run Specific Suites:**
```bash
# Data Integrity Only
python tests/deepchecks/test_data_integrity.py
# Train-Test Validation Only
python tests/deepchecks/test_train_test_validation.py
# Compare Original vs Cleaned
python tests/deepchecks/run_all_tests_comparison.py
```
---
## 3. Great Expectations Data Validation
**Report Source:** `tests/great expectations/`
**Status:** All 10 Tests Passed on Cleaned Data
Great Expectations provides a rigorous suite of 10 tests to validate the data pipeline at various stages.
### Detailed Test Descriptions
#### TEST 1: Raw Database Validation
* **Purpose:** Validates integrity/schema of `nlbse_tool_competition_data_by_issue` table. Ensures data source integrity before expensive feature engineering.
* **Checks:** Row count (7000-10000), Column count (220-230), Required columns present.
* **Result:** **PASS**. Schema is valid.
#### TEST 2: TF-IDF Feature Matrix Validation
* **Purpose:** Validates statistical properties of TF-IDF features. Ensures feature matrix is suitable for ML algorithms.
* **Checks:** No NaN/Inf, values >= 0, at least 1 non-zero feature per sample.
* **Original Data:** **FAIL** (25 samples had 0 features due to empty text).
* **Cleaned Data:** **PASS** (Sparse samples removed).
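The three checks in TEST 2 can be sketched as a single pass over a dense feature matrix. This is an illustrative pure-Python version, not the project's Great Expectations implementation:

```python
import math

def validate_tfidf_matrix(X):
    """Flag rows violating TEST 2: NaN/Inf, negative values, or all zeros."""
    failures = []
    for i, row in enumerate(X):
        if any(math.isnan(v) or math.isinf(v) for v in row):
            failures.append((i, "nan_or_inf"))
        elif any(v < 0 for v in row):
            failures.append((i, "negative_value"))
        elif not any(v != 0 for v in row):
            failures.append((i, "all_zero"))  # empty text -> no features
    return failures

# A sample with no non-zero features (like the 25 failing rows) is flagged.
X = [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
assert validate_tfidf_matrix(X) == [(1, "all_zero")]
```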
#### TEST 3: Multi-Label Binary Format Validation
* **Purpose:** Ensures label matrix is binary {0,1} for MultiOutputClassifier. Missing labels would invalidate training.
* **Checks:** Values in {0,1}, correct dimensions.
* **Result:** **PASS**.
#### TEST 4: Feature-Label Consistency Validation
* **Purpose:** Validates alignment between X and Y matrices. Misalignment causes catastrophic training failures.
* **Checks:** Row counts match, no empty vectors.
* **Original Data:** **FAIL** (Empty feature vectors present).
* **Cleaned Data:** **PASS** (Perfect alignment).
#### TEST 5: Label Distribution & Stratification
* **Purpose:** Ensures labels have enough samples for stratified splitting. Labels with insufficient samples cause stratification failures.
* **Checks:** Min 5 occurrences per label.
* **Original Data:** **FAIL** (75 labels had 0 occurrences).
* **Cleaned Data:** **PASS** (Rare labels removed).
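The stratification check in TEST 5 amounts to counting positives per label column. A minimal sketch, assuming `Y` is a binary label matrix given as a list of rows:

```python
def rare_labels(Y, min_count=5):
    """Return indices of labels with fewer than min_count positive samples.

    Such labels break stratified splitting and are the ones the cleaning
    pipeline removes (75 labels had 0 occurrences in the original data).
    """
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    return [j for j, c in enumerate(counts) if c < min_count]

Y = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1]]
# Label 0 appears 5 times (kept); label 1 appears once (flagged).
assert rare_labels(Y) == [1]
```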
#### TEST 6: Feature Sparsity & SMOTE Compatibility
* **Purpose:** Ensures feature density is sufficient for nearest-neighbor algorithms (SMOTE/ADASYN).
* **Checks:** Min 10 non-zero features per sample.
* **Original Data:** **FAIL** (31.5% samples < 10 features).
* **Cleaned Data:** **PASS** (Incompatible samples removed).
#### TEST 7: Multi-Output Classifier Compatibility
* **Purpose:** Validates multi-label structure. Insufficient multi-label samples would indicate inappropriate architecture.
* **Checks:** >50% samples have multiple labels.
* **Result:** **PASS** (Strong multi-label characteristics).
#### TEST 8: Duplicate Samples Detection
* **Purpose:** Detects duplicate feature vectors to prevent leakage.
* **Original Data:** **FAIL** (481 duplicates found).
* **Cleaned Data:** **PASS** (0 duplicates).
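Exact-duplicate detection as in TEST 8 can be sketched by hashing each feature vector and counting repeats (the counting convention here matches the report: 481 duplicates removed means 481 extra copies beyond the first occurrence):

```python
from collections import Counter

def count_duplicate_rows(X):
    """Number of rows that are exact copies of an earlier row."""
    counts = Counter(tuple(row) for row in X)
    return sum(c - 1 for c in counts.values() if c > 1)

X = [[0.0, 0.5], [0.1, 0.2], [0.0, 0.5]]
assert count_duplicate_rows(X) == 1  # one extra copy of [0.0, 0.5]
```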
#### TEST 9: Train-Test Separation Validation
* **Purpose:** **CRITICAL**. Validates no data leakage between train and test sets.
* **Checks:** Intersection of Train and Test sets must be empty.
* **Result:** **PASS** (Cleaned data only).
#### TEST 10: Label Consistency Validation
* **Purpose:** Ensures identical features have identical labels. Inconsistency indicates ground truth errors.
* **Original Data:** **FAIL** (640 samples with conflicting labels).
* **Cleaned Data:** **PASS** (Resolved via majority voting).
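The majority-voting resolution used for TEST 10 can be sketched as follows. This is a simplified single-label illustration (the real pipeline operates on multi-label rows), with hypothetical sample data:

```python
from collections import Counter, defaultdict

def resolve_conflicts(samples):
    """Resolve conflicting labels per feature vector by majority vote.

    samples: list of (features_tuple, label) pairs. For each distinct
    feature vector, keep the most common label; ties go to the label
    seen first (Counter preserves insertion order on equal counts).
    """
    by_features = defaultdict(list)
    for features, label in samples:
        by_features[features].append(label)
    return {f: Counter(labels).most_common(1)[0][0]
            for f, labels in by_features.items()}

samples = [((0.1, 0.2), "bug"), ((0.1, 0.2), "bug"), ((0.1, 0.2), "feature")]
assert resolve_conflicts(samples) == {(0.1, 0.2): "bug"}
```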
### Running the Tests
```bash
python "tests/great expectations/test_gx.py"
```
---
## 4. Ruff Code Quality Analysis
**Report Source:** `reports/ruff/`
**Status:** All Issues Resolvable
**Last Analysis:** November 17, 2025
**Total Issues:** 28
Static code analysis was performed using Ruff to ensure code quality and adherence to PEP 8 standards.
### Issue Breakdown by File
| File | Issues | Severity | Key Findings |
|------|--------|----------|--------------|
| `data_cleaning.py` | 16 | Low/Med | Unsorted imports (I001), Unused imports (F401), f-strings without placeholders (F541), Comparison to False (E712). |
| `modeling/train.py` | 7 | Low/Med | Unused `SMOTE` import, Unused variable `n_labels`, f-strings. |
| `features.py` | 2 | Low | Unused `nltk` import. |
| `dataset.py` | 2 | Low | Unused `DB_PATH` import. |
| `mlsmote.py` | 1 | Low | Unsorted imports. |
### Configuration & Compliance
* **Command:** `ruff check . --output-format json --output-file reports/ruff/ruff_report.json`
* **Standards:** PEP 8 (Pass), Black compatible (line length 88), isort (Pass).
* **Fixability:** 100% of issues can be fixed (26 automatically, 2 manually).
### Conclusion
The project code quality is high, with only minor style and import issues that do not affect functionality but should be cleaned up for maintainability.