# Behavioral Testing for Skill Classification Model

This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList"**.

## Overview

Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:

### 1. **Invariance Tests** (`test_invariance.py`)
Tests verifying that certain transformations of the input should **NOT** significantly change the model's predictions.

**Examples:**
- **Typo robustness**: "Fixed bug" vs "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs "resolve" should not affect predictions
- **Case insensitivity**: "API" vs "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions

**Run only invariance tests:**
```bash
pytest tests/behavioral/test_invariance.py -v
```

### 2. **Directional Tests** (`test_directional.py`)
Tests verifying that specific changes to the input lead to **PREDICTABLE** changes in predictions.

**Examples:**
- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills

**Run only directional tests:**
```bash
pytest tests/behavioral/test_directional.py -v
```
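The directional expectation can be expressed as a small helper independent of pytest. Both `directional_check` and the "no skills lost" rule below are illustrative assumptions, not taken from the actual test files:

```python
def directional_check(pred_base, pred_modified, gained_label):
    """Directional expectation: adding context (e.g. the word "Java")
    should introduce the related label without dropping skills that
    were already predicted. The no-loss rule is an assumption here."""
    pred_base, pred_modified = set(pred_base), set(pred_modified)
    assert gained_label in pred_modified, f"{gained_label!r} was not gained"
    assert pred_base <= pred_modified, "previously predicted skills were lost"

# Hypothetical predictions before and after adding "Java" to the text:
directional_check(["programming"], ["programming", "java"], "java")
```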

### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)
Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.

**Examples:**
- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills

**Run only MFT tests:**
```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```
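The MFT pattern can be sketched without the real model; `stub_predict` below is a stand-in for the `predict_with_labels` fixture, and its keyword-to-skill map is invented for illustration:

```python
def mft_check(predict_with_labels, text, expected_any):
    """Minimum functionality: for a clear-cut input, at least one of
    the expected skills must appear in the prediction."""
    predicted = set(predict_with_labels(text))
    assert predicted & set(expected_any), (
        f"none of {sorted(expected_any)} predicted for {text!r}, "
        f"got {sorted(predicted)}"
    )

def stub_predict(text):
    """Stand-in for the real fixture; the keyword map is hypothetical."""
    keywords = {"sql": "databases", "docker": "devops", "rest api": "apis"}
    return [skill for kw, skill in keywords.items() if kw in text.lower()]

mft_check(stub_predict, "SQL query optimization", {"databases"})
mft_check(stub_predict, "Configured Docker", {"devops"})
```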

### 4. **Model Training Tests** (`test_model_training.py`)
Tests that verify the model training process works correctly.

**Examples:**
- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: the model should improve during training (F1 above a random baseline)
- **Overfitting on a single batch**: the model should be able to memorize a small dataset
- **Training on CPU**: Should work on CPU
- **Training on multiple cores**: Should work with parallel processing
- **Training on GPU**: Should detect GPU if available (skipped if no GPU)
- **Reproducibility**: Same random seed should give identical results
- **More data improves performance**: Larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly

**Run only training tests:**
```bash
pytest tests/behavioral/test_model_training.py -v
```

**Note:** Training tests use small subsets of data for speed. They verify the training pipeline works correctly, not that the model achieves optimal performance.

## Prerequisites

Before running the behavioral tests, ensure you have:

1. **Trained model**: A trained model must exist in the `models/` directory
   - Default: `random_forest_tfidf_gridsearch_smote.pkl`
   - Fallback: `random_forest_tfidf_gridsearch.pkl`

2. **Feature extraction**: TF-IDF features must be generated
   - Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`

3. **Database**: The SkillScope database must be available
   - Run: `make data` to download if needed

4. **Dependencies**: Install test dependencies
   ```bash
   pip install -r requirements.txt
   ```

## Running the Tests

### Run all behavioral tests:
```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py

# Or run all tests (will fail if PyTorch not installed)
pytest tests/behavioral/ -v
```

### Run specific test categories:
```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v

# Directional tests only
pytest tests/behavioral/test_directional.py -v

# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```

### Run with markers:
```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v

# Run only directional tests
pytest tests/behavioral/ -m directional -v

# Run only MFT tests
pytest tests/behavioral/ -m mft -v

# Run only training tests
pytest tests/behavioral/ -m training -v
```
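For the `-m` selections above to run cleanly (without `PytestUnknownMarkWarning`), the markers must be registered. A sketch, assuming they are not already declared in the project's pytest configuration:

```ini
# pytest.ini (or [tool.pytest.ini_options] in pyproject.toml)
[pytest]
markers =
    invariance: invariance behavioral tests
    directional: directional behavioral tests
    mft: minimum functionality tests
    training: model training tests
```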

### Run specific test:
```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```

### Run with output:
```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s

# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
```

## Understanding Test Results

### Successful Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```
The model correctly maintained predictions despite typos.

### Failed Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```
The model's predictions changed significantly with typos (similarity < 0.7 threshold).

### Common Failure Reasons

1. **Invariance test failures**: Model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: Model doesn't respond appropriately to meaningful changes
3. **MFT failures**: Model fails on basic, clear-cut examples

## Test Configuration

### Fixtures (in `conftest.py`)

- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text

### Thresholds

The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":

- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted

These thresholds can be adjusted in the test files based on your model's behavior.
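The similarity measure referenced above is plain Jaccard similarity over the two sets of predicted labels. A minimal sketch (the empty-set convention of 1.0 is an assumption):

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over two sets of predicted labels.
    Two empty predictions count as identical (assumed convention)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

jaccard_similarity({"python", "apis"}, {"python", "testing"})  # → 1/3
```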

## Interpreting Results

### Good Model Behavior:
- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests

### Problematic Model Behavior:
- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)

## Extending the Tests

To add new behavioral tests:

1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`

Example:
```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```

## Integration with CI/CD

Add to your CI/CD pipeline:

```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```

## References

- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP Models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`

## Troubleshooting

### "Model not found" error
```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```

### "Features not found" error
```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```

### "Database not found" error
```bash
# Download data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```

### Import errors
```bash
# Reinstall dependencies
pip install -r requirements.txt
```

### pytest not found
```bash
pip install pytest
```

### "No module named 'torch'" error (for training tests)
```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch

# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```

## Contact

For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.