# Behavioral Testing for Skill Classification Model

This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"**.

## Overview

Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:

### 1. **Invariance Tests** (`test_invariance.py`)

Tests that verify certain input transformations do **NOT** significantly change the model's predictions.

**Examples:**

- **Typo robustness**: "Fixed bug" vs "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs "resolve" should not affect predictions
- **Case insensitivity**: "API" vs "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions
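The shape of such a check can be sketched as a helper that compares prediction sets before and after a perturbation (a sketch with assumed names: `predict_text` stands in for the fixture described under Test Configuration, and the 0.7 threshold is illustrative):

```python
# Sketch of an invariance check: the perturbed input should keep most of
# the original predictions. `predict_text` maps raw text to a list of
# predicted skill indices; the 0.7 threshold is illustrative.

def check_invariance(predict_text, original: str, perturbed: str,
                     threshold: float = 0.7) -> bool:
    pred_orig = set(predict_text(original))
    pred_pert = set(predict_text(perturbed))
    union = pred_orig | pred_pert
    if not union:  # both prediction sets empty: trivially invariant
        return True
    similarity = len(pred_orig & pred_pert) / len(union)  # Jaccard similarity
    return similarity >= threshold
```

For a robust model, `check_invariance(predict_text, "Fixed bug", "Fixd bug")` should hold.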
**Run only invariance tests:**

```bash
pytest tests/behavioral/test_invariance.py -v
```

### 2. **Directional Tests** (`test_directional.py`)

Tests that verify that specific changes to the input lead to **PREDICTABLE** changes in predictions.

**Examples:**

- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills
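A directional expectation can be sketched as: appending context should introduce the expected label among the newly added predictions (hypothetical helper; `predict_text` is the fixture described under Test Configuration):

```python
# Sketch of a directional check: appending context (e.g. "in Java") should
# add the expected label to the prediction set. Helper name is illustrative.

def check_directional(predict_text, base: str, added_context: str,
                      expected_label) -> bool:
    pred_base = set(predict_text(base))
    pred_aug = set(predict_text(f"{base} {added_context}"))
    # The expected label should appear among the newly added predictions.
    return expected_label in (pred_aug - pred_base)
```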
**Run only directional tests:**

```bash
pytest tests/behavioral/test_directional.py -v
```

### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)

Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.

**Examples:**

- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills
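One way such a case can be expressed (a sketch; `predict_with_labels` is the fixture described under Test Configuration, and the expected label names are assumptions):

```python
# Sketch of a minimum functionality check: a clear-cut input should yield
# at least `min_predictions` labels, including one of the expected ones.

def run_mft_case(predict_with_labels, text: str, expected_any: set,
                 min_predictions: int = 1) -> None:
    predicted = set(predict_with_labels(text))
    assert len(predicted) >= min_predictions, (
        f"Too few predictions for {text!r}: {predicted}")
    assert predicted & expected_any, (
        f"Expected one of {expected_any} for {text!r}, got {predicted}")
```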
**Run only MFT tests:**

```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```

### 4. **Model Training Tests** (`test_model_training.py`)

Tests that verify the model training process works correctly.

**Examples:**

- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: The model should improve during training (F1 above a random baseline)
- **Overfitting on a single batch**: The model should be able to memorize a small dataset
- **Training on CPU**: Training should work on CPU
- **Training on multiple cores**: Training should work with parallel processing
- **Training on GPU**: Should detect a GPU if available (skipped if no GPU is present)
- **Reproducibility**: The same random seed should give identical results
- **More data improves performance**: A larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly
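The reproducibility check boils down to fixing every random seed the pipeline uses before each run. A minimal sketch (stdlib `random` stands in here; the real training tests would also seed numpy and torch, e.g. via `np.random.seed` and `torch.manual_seed`):

```python
import random

def set_seed(seed: int = 42) -> None:
    # The real training tests would also call np.random.seed(seed) and
    # torch.manual_seed(seed); stdlib random stands in for this sketch.
    random.seed(seed)

def reproducible_draw(seed: int, n: int = 3) -> list:
    """Simulate one 'training run': seed first, then draw random numbers."""
    set_seed(seed)
    return [random.random() for _ in range(n)]
```

With seeding in place, `reproducible_draw(42) == reproducible_draw(42)` holds, while different seeds diverge.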
**Run only training tests:**

```bash
pytest tests/behavioral/test_model_training.py -v
```

**Note:** Training tests use small subsets of the data for speed. They verify that the training pipeline works correctly, not that the model achieves optimal performance.

## Prerequisites

Before running the behavioral tests, ensure you have:

1. **Trained model**: A trained model must exist in the `models/` directory
   - Default: `random_forest_tfidf_gridsearch_smote.pkl`
   - Fallback: `random_forest_tfidf_gridsearch.pkl`
2. **Feature extraction**: TF-IDF features must be generated
   - Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`
3. **Database**: The SkillScope database must be available
   - Run: `make data` to download it if needed
4. **Dependencies**: Install the test dependencies

   ```bash
   pip install -r requirements.txt
   ```

## Running the Tests

### Run all behavioral tests:

```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py

# Or run all tests (will fail if PyTorch is not installed)
pytest tests/behavioral/ -v
```

### Run specific test categories:

```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v

# Directional tests only
pytest tests/behavioral/test_directional.py -v

# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```

### Run with markers:

```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v

# Run only directional tests
pytest tests/behavioral/ -m directional -v

# Run only MFT tests
pytest tests/behavioral/ -m mft -v

# Run only training tests
pytest tests/behavioral/ -m training -v
```
### Run specific test:

```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```

### Run with output:

```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s

# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
```

## Understanding Test Results

### Successful Test

```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```

The model correctly maintained its predictions despite typos.

### Failed Test

```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```

The model's predictions changed significantly with typos (similarity below the 0.7 threshold).

### Common Failure Reasons

1. **Invariance test failures**: The model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: The model doesn't respond appropriately to meaningful changes
3. **MFT failures**: The model fails on basic, clear-cut examples

## Test Configuration

### Fixtures (in `conftest.py`)

- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text
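For orientation, the core of `predict_text` could look roughly like this (a sketch under assumptions: the pickle format, the model's multi-label 0/1 output, and the helper names are illustrative, not taken from the real `conftest.py`):

```python
import pickle
from pathlib import Path

# Assumed default path from the Prerequisites section; the real fixture
# may fall back to random_forest_tfidf_gridsearch.pkl.
MODEL_PATH = Path("models/random_forest_tfidf_gridsearch_smote.pkl")

def load_model(path: Path = MODEL_PATH):
    with path.open("rb") as fh:
        return pickle.load(fh)

def make_predict_text(model, vectorizer):
    """Build a text -> predicted-label-indices closure from model + vectorizer."""
    def predict_text(text: str) -> list:
        features = vectorizer.transform([text])
        row = model.predict(features)[0]  # assumed: one row of 0/1 label flags
        return [i for i, flag in enumerate(row) if flag]
    return predict_text
```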
### Thresholds

The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":

- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted

These thresholds can be adjusted in the test files based on your model's behavior.
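The Jaccard similarity referenced here (called `jaccard_similarity` in the example under "Extending the Tests") is the intersection over union of the two prediction sets. A plausible implementation:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Intersection over union of two label sets; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, `{1, 2, 3}` vs `{1, 2, 4}` share 2 of 4 distinct labels, so the similarity is 0.5 and would fail a 0.7 invariance threshold.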
## Interpreting Results

### Good Model Behavior:

- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests

### Problematic Model Behavior:

- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)

## Extending the Tests

To add new behavioral tests:

1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`

Example:

```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```

## Integration with CI/CD

Add this to your CI/CD pipeline:

```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```

## References

- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP Models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`

## Troubleshooting

### "Model not found" error

```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```

### "Features not found" error

```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```

### "Database not found" error

```bash
# Download the data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```

### Import errors

```bash
# Reinstall dependencies
pip install -r requirements.txt
```

### pytest not found

```bash
pip install pytest
```

### "No module named 'torch'" error (for training tests)

```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch
# Or skip the training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```

## Contact

For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.