# Behavioral Testing for Skill Classification Model
This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP models with CheckList"**.
## Overview
Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:
### 1. **Invariance Tests** (`test_invariance.py`)
Tests that verify certain input transformations do **NOT** significantly change the model's predictions.
**Examples:**
- **Typo robustness**: "Fixed bug" vs "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs "resolve" should not affect predictions
- **Case insensitivity**: "API" vs "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions
**Run only invariance tests:**
```bash
pytest tests/behavioral/test_invariance.py -v
```
### 2. **Directional Tests** (`test_directional.py`)
Tests that verify specific changes to the input lead to **PREDICTABLE** changes in predictions.
**Examples:**
- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills
**Run only directional tests:**
```bash
pytest tests/behavioral/test_directional.py -v
```
### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)
Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.
**Examples:**
- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills
**Run only MFT tests:**
```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```
### 4. **Model Training Tests** (`test_model_training.py`)
Tests that verify the model training process works correctly.
**Examples:**
- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: Model should improve during training (F1 > random baseline)
- **Overfitting on single batch**: Model should be able to memorize a small dataset
- **Training on CPU**: Should work on CPU
- **Training on multiple cores**: Should work with parallel processing
- **Training on GPU**: Should detect GPU if available (skipped if no GPU)
- **Reproducibility**: Same random seed should give identical results
- **More data improves performance**: Larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly
**Run only training tests:**
```bash
pytest tests/behavioral/test_model_training.py -v
```
**Note:** Training tests use small subsets of data for speed. They verify the training pipeline works correctly, not that the model achieves optimal performance.
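The reproducibility check above can be sketched without the real pipeline; this toy version uses scikit-learn on synthetic data (the actual tests in `test_model_training.py` use PyTorch and may be structured differently):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_and_predict(seed):
    """Train a tiny classifier on fixed synthetic data with the given seed."""
    rng = np.random.RandomState(0)  # the data itself is held constant
    X = rng.rand(100, 10)
    y = (X[:, 0] > 0.5).astype(int)
    clf = RandomForestClassifier(n_estimators=10, random_state=seed)
    clf.fit(X, y)
    return clf.predict(X)


def test_reproducibility_sketch():
    """The same seed must give bit-identical predictions."""
    assert (train_and_predict(42) == train_and_predict(42)).all()
```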
## Prerequisites
Before running the behavioral tests, ensure you have:
1. **Trained model**: A trained model must exist in `models/` directory
- Default: `random_forest_tfidf_gridsearch_smote.pkl`
- Fallback: `random_forest_tfidf_gridsearch.pkl`
2. **Feature extraction**: TF-IDF features must be generated
- Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`
3. **Database**: The SkillScope database must be available
- Run: `make data` to download if needed
4. **Dependencies**: Install test dependencies
```bash
pip install -r requirements.txt
```
## Running the Tests
### Run all behavioral tests:
```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
# Or run all tests (will fail if PyTorch not installed)
pytest tests/behavioral/ -v
```
### Run specific test categories:
```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v
# Directional tests only
pytest tests/behavioral/test_directional.py -v
# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```
### Run with markers:
```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v
# Run only directional tests
pytest tests/behavioral/ -m directional -v
# Run only MFT tests
pytest tests/behavioral/ -m mft -v
# Run only training tests
pytest tests/behavioral/ -m training -v
```
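The `-m` filters above only work cleanly if the markers are registered; a sketch of the registration, assuming a `pytest.ini` at the repository root (your project may register them in `pyproject.toml` or `setup.cfg` instead):

```ini
[pytest]
markers =
    invariance: input transformations must not change predictions
    directional: input changes must shift predictions predictably
    mft: basic examples must produce clear predictions
    training: model training pipeline tests
```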
### Run specific test:
```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```
### Run with output:
```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s
# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
```
## Understanding Test Results
### Successful Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```
The model correctly maintained predictions despite typos.
### Failed Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```
The model's predictions changed significantly with typos (similarity < 0.7 threshold).
### Common Failure Reasons
1. **Invariance test failures**: Model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: Model doesn't respond appropriately to meaningful changes
3. **MFT failures**: Model fails on basic, clear-cut examples
## Test Configuration
### Fixtures (in `conftest.py`)
- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text
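The `predict_text` fixture might be structured like this sketch (the artifact paths and the multilabel output format are assumptions; check the real `conftest.py`):

```python
import pickle

import pytest


def make_predictor(model, vectorizer):
    """Build a text -> predicted-skill-indices callable from fitted objects."""
    def _predict(text):
        features = vectorizer.transform([text])
        # For a multilabel indicator output, the nonzero column indices
        # are the predicted skill indices.
        return sorted(model.predict(features).nonzero()[1])
    return _predict


@pytest.fixture(scope="session")
def predict_text():
    # Hypothetical artifact paths; adjust to your repository layout.
    with open("models/random_forest_tfidf_gridsearch_smote.pkl", "rb") as fh:
        model = pickle.load(fh)
    with open("models/tfidf_vectorizer.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    return make_predictor(model, vectorizer)
```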
### Thresholds
The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":
- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted
These thresholds can be adjusted in the test files based on your model's behavior.
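To make the thresholds concrete, here is how Jaccard similarity behaves on small label sets (a sketch; real predictions are sets of skill label indices):

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Identical prediction sets pass any threshold.
print(jaccard_similarity({3, 7, 12}, {3, 7, 12}))  # 1.0
# Two of three labels shared: 2 / 4 = 0.5, which fails a 0.6 cutoff.
print(jaccard_similarity({3, 7, 12}, {3, 7, 99}))  # 0.5
```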
## Interpreting Results
### Good Model Behavior:
- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests
### Problematic Model Behavior:
- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)
## Extending the Tests
To add new behavioral tests:
1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`
Example:
```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```
## Integration with CI/CD
Add to your CI/CD pipeline:
```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```
## References
- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`
## Troubleshooting
### "Model not found" error
```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```
### "Features not found" error
```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```
### "Database not found" error
```bash
# Download data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```
### Import errors
```bash
# Reinstall dependencies
pip install -r requirements.txt
```
### pytest not found
```bash
pip install pytest
```
### "No module named 'torch'" error (for training tests)
```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch
# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```
## Contact
For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.