Behavioral Testing for Skill Classification Model
This directory contains behavioral tests for the skill classification model, following the methodology described in Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP models with CheckList".
Overview
Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:
1. Invariance Tests (test_invariance.py)
Tests that verify that certain input transformations do NOT significantly change the model's predictions.
Examples:
- Typo robustness: "Fixed bug" vs "Fixd bug" should produce similar predictions
- Synonym substitution: "fix" vs "resolve" should not affect predictions
- Case insensitivity: "API" vs "api" should produce identical results
- Punctuation robustness: Extra punctuation should not change predictions
- URL/code snippet noise: URLs and code blocks should not affect core predictions
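The pattern behind these checks can be sketched as follows. This is a minimal, self-contained illustration: `fake_predict` is a hypothetical keyword stub standing in for the suite's `predict_text` fixture, not the real model.

```python
# Invariance sketch: predictions for a clean description and a typo'd
# variant should overlap heavily (high Jaccard similarity).
# `fake_predict` is a hypothetical stub, NOT the real classifier.

def fake_predict(text: str) -> set:
    """Toy predictor: maps keywords to skill labels."""
    text = text.lower()
    skills = set()
    if "bug" in text:
        skills.add("debugging")
    if "fix" in text or "fixd" in text:   # tolerate the typo'd form
        skills.add("programming")
    return skills

def jaccard_similarity(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

pred_clean = fake_predict("Fixed bug")
pred_typo = fake_predict("Fixd bug")
similarity = jaccard_similarity(pred_clean, pred_typo)
assert similarity >= 0.7, f"Typos changed predictions too much: {similarity}"
```

The real tests follow this shape, with the stub replaced by the trained model's `predict_text` fixture.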
Run only invariance tests:
pytest tests/behavioral/test_invariance.py -v
2. Directional Tests (test_directional.py)
Tests that verify that specific changes to the input lead to PREDICTABLE changes in the predictions.
Examples:
- Adding language keywords: Adding "Java" or "Python" should affect language-related predictions
- Adding data structure keywords: Adding "HashMap" should influence data structure predictions
- Adding error handling context: Adding "exception handling" should affect error handling predictions
- Adding API context: Adding "REST API" should influence API-related predictions
- Increasing technical detail: More specific descriptions should maintain or add relevant skills
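A directional check can be sketched like this. Again `fake_predict` is a hypothetical stub, and the expectation is monotone: adding a keyword should add a prediction without dropping the existing ones.

```python
# Directional sketch: appending a language keyword should ADD a
# language-related prediction while preserving the original ones.
# `fake_predict` is a hypothetical stub, NOT the real classifier.

def fake_predict(text: str) -> set:
    text = text.lower()
    skills = set()
    if "api" in text or "endpoint" in text:
        skills.add("api_development")
    if "java" in text:
        skills.add("java")
    return skills

base = fake_predict("Created an API endpoint")
augmented = fake_predict("Created an API endpoint in Java")

assert base <= augmented, "Original skills should be preserved"
assert "java" in augmented - base, "Keyword should add a new prediction"
```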
Run only directional tests:
pytest tests/behavioral/test_directional.py -v
3. Minimum Functionality Tests (MFT) (test_minimum_functionality.py)
Tests that verify the model performs well on basic, straightforward examples where the expected output is clear.
Examples:
- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills
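An MFT can be sketched as a table of clear-cut inputs with a minimum expectation per input. The keyword stub below is hypothetical; the real tests call the model through the `predict_text` fixture.

```python
# MFT sketch: on unambiguous inputs the model must predict at least
# one relevant skill. `fake_predict` is a hypothetical stub.

def fake_predict(text: str) -> set:
    text = text.lower()
    skills = set()
    if "sql" in text or "query" in text:
        skills.add("databases")
    if "docker" in text:
        skills.add("devops")
    return skills

cases = {
    "SQL query optimization": "databases",
    "Configured Docker": "devops",
}
for text, expected in cases.items():
    predictions = fake_predict(text)
    assert predictions, f"No skills predicted for: {text!r}"
    assert expected in predictions, f"{expected!r} missing for {text!r}"
```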
Run only MFT tests:
pytest tests/behavioral/test_minimum_functionality.py -v
4. Model Training Tests (test_model_training.py)
Tests that verify the model training process works correctly.
Examples:
- Training completes without errors: Training should finish successfully
- Decreasing loss: Model should improve during training (F1 > random baseline)
- Overfitting on a single batch: Model should be able to memorize a small dataset
- Training on CPU: Should work on CPU
- Training on multiple cores: Should work with parallel processing
- Training on GPU: Should detect GPU if available (skipped if no GPU)
- Reproducibility: Same random seed should give identical results
- More data improves performance: Larger dataset should improve or maintain performance
- Model saves/loads correctly: Trained models should persist correctly
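The reproducibility check boils down to the sketch below: fixing the random seed must make two runs byte-identical. A seeded shuffle stands in here for a full stochastic training loop; `pseudo_train` is a hypothetical stand-in, not the project's training code.

```python
# Reproducibility sketch: the same seed must give identical results.
import random

def pseudo_train(seed: int) -> list:
    """Hypothetical stand-in for one stochastic training run."""
    rng = random.Random(seed)   # seed everything up front
    data = list(range(10))
    rng.shuffle(data)           # stand-in for stochastic updates
    return data

run_a = pseudo_train(seed=42)
run_b = pseudo_train(seed=42)
assert run_a == run_b, "Same seed must give identical results"
```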
Run only training tests:
pytest tests/behavioral/test_model_training.py -v
Note: Training tests use small subsets of data for speed. They verify the training pipeline works correctly, not that the model achieves optimal performance.
Prerequisites
Before running the behavioral tests, ensure you have:
Trained model: A trained model must exist in the models/ directory
- Default: random_forest_tfidf_gridsearch_smote.pkl
- Fallback: random_forest_tfidf_gridsearch.pkl
Feature extraction: TF-IDF features must be generated
- Run: make features or python -m hopcroft_skill_classification_tool_competition.features
Database: The SkillScope database must be available
- Run: make data to download if needed
Dependencies: Install test dependencies
pip install -r requirements.txt
Running the Tests
Run all behavioral tests:
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
# Or run all tests (will fail if PyTorch not installed)
pytest tests/behavioral/ -v
Run specific test categories:
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v
# Directional tests only
pytest tests/behavioral/test_directional.py -v
# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
Run with markers:
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v
# Run only directional tests
pytest tests/behavioral/ -m directional -v
# Run only MFT tests
pytest tests/behavioral/ -m mft -v
# Run only training tests
pytest tests/behavioral/ -m training -v
Run specific test:
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
Run with output:
# Show print statements during tests
pytest tests/behavioral/ -v -s
# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
Understanding Test Results
Successful Test
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
The model correctly maintained predictions despite typos.
Failed Test
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
The model's predictions changed significantly with typos (similarity < 0.7 threshold).
Common Failure Reasons
- Invariance test failures: Model is too sensitive to noise (typos, punctuation, etc.)
- Directional test failures: Model doesn't respond appropriately to meaningful changes
- MFT failures: Model fails on basic, clear-cut examples
Test Configuration
Fixtures (in conftest.py)
- trained_model: Loads the trained model from disk
- tfidf_vectorizer: Loads or reconstructs the TF-IDF vectorizer
- label_names: Gets the list of skill label names
- predict_text(text): Predicts skill indices from raw text
- predict_with_labels(text): Predicts skill label names from raw text
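Conceptually, the prediction fixtures compose the vectorizer, the classifier, and the label names. The sketch below shows that composition with hypothetical stubs (`StubVectorizer`, `StubModel`, and the label list are all assumptions, not the project's actual conftest.py code).

```python
# Plain-function sketch of what `predict_with_labels` composes:
# vectorize raw text, run the classifier, map indices to label names.
# Both stub classes are hypothetical stand-ins.

class StubVectorizer:
    def transform(self, texts):
        # Real code returns a TF-IDF matrix; the stub echoes the input.
        return texts

class StubModel:
    def predict(self, features):
        # Real code returns a binary indicator row per input text.
        return [[1, 0, 1] for _ in features]

LABEL_NAMES = ["programming", "databases", "testing"]  # assumed labels

def predict_with_labels(text, vectorizer, model, label_names):
    """Map one raw text to the set of predicted skill label names."""
    features = vectorizer.transform([text])
    indicator_row = model.predict(features)[0]
    return {name for name, flag in zip(label_names, indicator_row) if flag}

skills = predict_with_labels(
    "Fixed bug", StubVectorizer(), StubModel(), LABEL_NAMES
)
```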
Thresholds
The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":
- Invariance tests: Typically 0.6-0.8 similarity required
- Directional tests: Predictions should differ meaningfully
- MFT tests: At least 1-2 skills should be predicted
These thresholds can be adjusted in the test files based on your model's behavior.
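The Jaccard similarity the thresholds refer to can be written as a short helper (the test suite's own implementation may differ in detail, e.g. in how it treats two empty sets):

```python
# Jaccard similarity of two prediction sets: |A ∩ B| / |A ∪ B|.
# Defined here as 1.0 when both sets are empty (identical predictions).

def jaccard_similarity(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

sim = jaccard_similarity({"python", "testing"}, {"python", "devops"})
assert abs(sim - 1 / 3) < 1e-9   # 1 shared skill out of 3 distinct
```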
Interpreting Results
Good Model Behavior:
- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests
Problematic Model Behavior:
- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)
Extending the Tests
To add new behavioral tests:
- Choose the appropriate category (invariance/directional/MFT)
- Add a new test method to the corresponding test class
- Use the predict_text or predict_with_labels fixtures
- Add appropriate assertions and print statements for debugging
- Add the corresponding marker: @pytest.mark.invariance, @pytest.mark.directional, or @pytest.mark.mft
Example:
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
Integration with CI/CD
Add to your CI/CD pipeline:
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
References
- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP models with CheckList. ACL 2020.
- Project documentation: docs/
- Model training: hopcroft_skill_classification_tool_competition/modeling/train.py
Troubleshooting
"Model not found" error
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
"Features not found" error
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
"Database not found" error
# Download data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
Import errors
# Reinstall dependencies
pip install -r requirements.txt
pytest not found
pip install pytest
"No module named 'torch'" error (for training tests)
# Install PyTorch (required only for test_model_training.py)
pip install torch
# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
Contact
For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.