# Behavioral Testing for Skill Classification Model

This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020), "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList"**.

## Overview

Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:

### 1. **Invariance Tests** (`test_invariance.py`)

Tests that verify certain transformations of the input should **NOT** change the model's predictions significantly.

**Examples:**

- **Typo robustness**: "Fixed bug" vs. "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs. "resolve" should not affect predictions
- **Case insensitivity**: "API" vs. "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions

**Run only invariance tests:**

```bash
pytest tests/behavioral/test_invariance.py -v
```

### 2. **Directional Tests** (`test_directional.py`)

Tests that verify specific changes to the input lead to **PREDICTABLE** changes in predictions.

**Examples:**

- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills

**Run only directional tests:**

```bash
pytest tests/behavioral/test_directional.py -v
```

### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)

Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.

**Examples:**

- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills

**Run only MFT tests:**

```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```

### 4. **Model Training Tests** (`test_model_training.py`)

Tests that verify the model training process works correctly.

**Examples:**

- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: Model should improve during training (F1 > random baseline)
- **Overfitting on a single batch**: Model should be able to memorize a small dataset
- **Training on CPU**: Should work on CPU
- **Training on multiple cores**: Should work with parallel processing
- **Training on GPU**: Should detect the GPU if available (skipped if no GPU)
- **Reproducibility**: Same random seed should give identical results
- **More data improves performance**: A larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly

**Run only training tests:**

```bash
pytest tests/behavioral/test_model_training.py -v
```

**Note:** Training tests use small subsets of the data for speed. They verify that the training pipeline works correctly, not that the model achieves optimal performance.

## Prerequisites

Before running the behavioral tests, ensure you have:

1. **Trained model**: A trained model must exist in the `models/` directory
   - Default: `random_forest_tfidf_gridsearch_smote.pkl`
   - Fallback: `random_forest_tfidf_gridsearch.pkl`
2. **Feature extraction**: TF-IDF features must be generated
   - Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`
3. **Database**: The SkillScope database must be available
   - Run: `make data` to download it if needed
4. **Dependencies**: Install test dependencies

   ```bash
   pip install -r requirements.txt
   ```

## Running the Tests

### Run all behavioral tests:

```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py

# Or run all tests (will fail if PyTorch is not installed)
pytest tests/behavioral/ -v
```

### Run specific test categories:

```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v

# Directional tests only
pytest tests/behavioral/test_directional.py -v

# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```

### Run with markers:

```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v

# Run only directional tests
pytest tests/behavioral/ -m directional -v

# Run only MFT tests
pytest tests/behavioral/ -m mft -v

# Run only training tests
pytest tests/behavioral/ -m training -v
```

### Run a specific test:

```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```

### Run with output:

```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s

# Show detailed output and stop on the first failure
pytest tests/behavioral/ -v -s -x
```

## Understanding Test Results

### Successful Test

```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```

The model correctly maintained its predictions despite typos.
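The similarity score these tests report is a set-overlap measure over predicted labels. A minimal sketch of a Jaccard-style helper (a hypothetical implementation for illustration; the project's own `jaccard_similarity` helper may differ):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Example: predicted skill sets before and after introducing a typo
pred_orig = {"programming", "debugging", "testing"}
pred_typo = {"programming", "debugging"}
print(jaccard_similarity(pred_orig, pred_typo))  # 2/3 ≈ 0.667
```

A score of 1.0 means the two prediction sets are identical; invariance tests then assert the score stays above a chosen threshold.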
### Failed Test

```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```

The model's predictions changed significantly with typos (similarity below the 0.7 threshold).

### Common Failure Reasons

1. **Invariance test failures**: The model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: The model does not respond appropriately to meaningful changes
3. **MFT failures**: The model fails on basic, clear-cut examples

## Test Configuration

### Fixtures (in `conftest.py`)

- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text

### Thresholds

The tests use similarity thresholds (Jaccard similarity) to determine whether predictions are "similar enough":

- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted

These thresholds can be adjusted in the test files based on your model's behavior.

## Interpreting Results

### Good Model Behavior:

- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests

### Problematic Model Behavior:

- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)

## Extending the Tests

To add new behavioral tests:

1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`

Example:

```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"

    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))

    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```

## Integration with CI/CD

Add to your CI/CD pipeline:

```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```

## References

- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP Models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`

## Troubleshooting

### "Model not found" error

```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```

### "Features not found" error

```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```

### "Database not found" error

```bash
# Download the data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```

### Import errors

```bash
# Reinstall dependencies
pip install -r requirements.txt
```

### pytest not found

```bash
pip install pytest
```

### "No module named 'torch'" error (for training tests)

```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch

# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```

## Contact

For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.