# Behavioral Testing for Skill Classification Model
This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020), "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList"**.
## Overview
Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:
### 1. **Invariance Tests** (`test_invariance.py`)
Tests that verify that certain transformations of the input do **NOT** significantly change the model's predictions.
**Examples:**
- **Typo robustness**: "Fixed bug" vs "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs "resolve" should not affect predictions
- **Case insensitivity**: "API" vs "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions
**Run only invariance tests:**
```bash
pytest tests/behavioral/test_invariance.py -v
```
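The invariance checks above boil down to comparing prediction sets with a Jaccard-similarity helper (the thresholds section below describes the same metric). A minimal sketch, where `predict` stands in for the project's `predict_text` fixture and the 0.7 threshold mirrors the one used in these docs:

```python
# Minimal sketch of an invariance check. `predict` is a stand-in for the
# project's `predict_text` fixture.

def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def check_invariance(predict, original: str, perturbed: str,
                     threshold: float = 0.7) -> bool:
    """Predictions on a perturbed input should stay close to the original."""
    sim = jaccard_similarity(set(predict(original)), set(predict(perturbed)))
    return sim >= threshold
```

`check_invariance(predict_text, "Fixed bug", "Fixd bug")` then expresses the typo-robustness expectation as a single boolean.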
### 2. **Directional Tests** (`test_directional.py`)
Tests that verify specific changes to the input lead to **PREDICTABLE** changes in predictions.
**Examples:**
- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills
**Run only directional tests:**
```bash
pytest tests/behavioral/test_directional.py -v
```
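A directional expectation can be sketched the same way: the added context should both change the prediction set and pull it in the expected direction. `predict` below stands in for the project's `predict_with_labels` fixture:

```python
# Minimal sketch of a directional check. `predict` is a stand-in for the
# project's `predict_with_labels` fixture.

def check_directional(predict, base: str, addition: str,
                      expected_label: str) -> bool:
    before = set(predict(base))
    after = set(predict(f"{base} {addition}"))
    # The change must be visible AND point in the expected direction.
    return after != before and expected_label in after
```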
### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)
Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.
**Examples:**
- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills
**Run only MFT tests:**
```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```
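An MFT assertion is the simplest of the three: on a clear-cut input the model must produce a non-empty prediction containing the expected skill. A sketch, with `predict` again standing in for the project's `predict_with_labels` fixture:

```python
# Minimal sketch of a minimum functionality check.

def check_mft(predict, text: str, expected_label: str) -> bool:
    labels = set(predict(text))
    # Must predict something, and that something must include the
    # expected skill.
    return bool(labels) and expected_label in labels
```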
### 4. **Model Training Tests** (`test_model_training.py`)
Tests that verify the model training process works correctly.
**Examples:**
- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: Model should improve during training (F1 > random baseline)
- **Overfitting on single batch**: Model should be able to memorize small dataset
- **Training on CPU**: Should work on CPU
- **Training on multiple cores**: Should work with parallel processing
- **Training on GPU**: Should detect GPU if available (skipped if no GPU)
- **Reproducibility**: Same random seed should give identical results
- **More data improves performance**: Larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly
**Run only training tests:**
```bash
pytest tests/behavioral/test_model_training.py -v
```
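The reproducibility test hinges on fixing every source of randomness before training. A toy illustration with the stdlib `random` module (the real tests would also seed NumPy/PyTorch, so treat this purely as the principle, not the project's code):

```python
import random

def toy_training_run(seed: int, data: list) -> list:
    """Stand-in for a training run: with a fixed seed, stochastic steps
    such as shuffling and sampling repeat exactly."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)             # e.g. batch order
    return rng.sample(shuffled, 3)    # e.g. a sampled validation split

# Same seed -> identical "result", which is what the reproducibility
# test asserts about the real training pipeline.
assert toy_training_run(42, list(range(10))) == toy_training_run(42, list(range(10)))
```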
**Note:** Training tests use small subsets of data for speed. They verify the training pipeline works correctly, not that the model achieves optimal performance.
## Prerequisites
Before running the behavioral tests, ensure you have:
1. **Trained model**: A trained model must exist in `models/` directory
- Default: `random_forest_tfidf_gridsearch_smote.pkl`
- Fallback: `random_forest_tfidf_gridsearch.pkl`
2. **Feature extraction**: TF-IDF features must be generated
- Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`
3. **Database**: The SkillScope database must be available
- Run: `make data` to download if needed
4. **Dependencies**: Install test dependencies
```bash
pip install -r requirements.txt
```
## Running the Tests
### Run all behavioral tests:
```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
# Or run all tests (will fail if PyTorch not installed)
pytest tests/behavioral/ -v
```
### Run specific test categories:
```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v
# Directional tests only
pytest tests/behavioral/test_directional.py -v
# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```
### Run with markers:
```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v
# Run only directional tests
pytest tests/behavioral/ -m directional -v
# Run only MFT tests
pytest tests/behavioral/ -m mft -v
# Run only training tests
pytest tests/behavioral/ -m training -v
```
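Custom markers such as these must be registered with pytest, or they raise `PytestUnknownMarkWarning` (and fail under `--strict-markers`). Assuming the project does not already register them, a typical `pytest.ini` entry looks like:

```ini
[pytest]
markers =
    invariance: invariance behavioral tests
    directional: directional behavioral tests
    mft: minimum functionality tests
    training: model training tests
```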
### Run specific test:
```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```
### Run with output:
```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s
# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
```
## Understanding Test Results
### Successful Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```
The model correctly maintained predictions despite typos.
### Failed Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```
The model's predictions changed significantly with typos (similarity < 0.7 threshold).
### Common Failure Reasons
1. **Invariance test failures**: Model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: Model doesn't respond appropriately to meaningful changes
3. **MFT failures**: Model fails on basic, clear-cut examples
## Test Configuration
### Fixtures (in `conftest.py`)
- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text
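A hypothetical sketch of how these fixtures could be wired in `conftest.py` (the vectorizer path, `pickle` usage, and the multi-label decoding are assumptions for illustration, not the project's exact code; the model path is the default from Prerequisites):

```python
import pickle
import pytest

@pytest.fixture(scope="session")
def trained_model():
    # Default model path from the Prerequisites section.
    with open("models/random_forest_tfidf_gridsearch_smote.pkl", "rb") as f:
        return pickle.load(f)

@pytest.fixture(scope="session")
def tfidf_vectorizer():
    # Hypothetical vectorizer path -- adjust to the project's layout.
    with open("models/tfidf_vectorizer.pkl", "rb") as f:
        return pickle.load(f)

@pytest.fixture
def predict_text(trained_model, tfidf_vectorizer):
    def _predict(text: str) -> list[int]:
        X = tfidf_vectorizer.transform([text])
        # Multi-label prediction: return indices of active skill labels.
        return [i for i, y in enumerate(trained_model.predict(X)[0]) if y == 1]
    return _predict
```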
### Thresholds
The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":
- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted
These thresholds can be adjusted in the test files based on your model's behavior.
## Interpreting Results
### Good Model Behavior:
- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests
### Problematic Model Behavior:
- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)
## Extending the Tests
To add new behavioral tests:
1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`
Example:
```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```
## Integration with CI/CD
Add to your CI/CD pipeline:
```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```
## References
- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP Models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`
## Troubleshooting
### "Model not found" error
```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```
### "Features not found" error
```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```
### "Database not found" error
```bash
# Download data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```
### Import errors
```bash
# Reinstall dependencies
pip install -r requirements.txt
```
### pytest not found
```bash
pip install pytest
```
### "No module named 'torch'" error (for training tests)
```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch
# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```
## Contact
For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.