# Behavioral Testing for Skill Classification Model

This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020), "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList"**.

## Overview

Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:

### 1. **Invariance Tests** (`test_invariance.py`)

Tests that verify certain transformations of the input should **NOT** change the model's predictions significantly.

**Examples:**

- **Typo robustness**: "Fixed bug" vs. "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs. "resolve" should not affect predictions
- **Case insensitivity**: "API" vs. "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions

**Run only invariance tests:**

```bash
pytest tests/behavioral/test_invariance.py -v
```

### 2. **Directional Tests** (`test_directional.py`)

Tests that verify specific changes to the input lead to **PREDICTABLE** changes in predictions.

**Examples:**

- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills

**Run only directional tests:**

```bash
pytest tests/behavioral/test_directional.py -v
```

### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)

Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.

**Examples:**

- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills

**Run only MFT tests:**

```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```

### 4. **Model Training Tests** (`test_model_training.py`)

Tests that verify the model training process works correctly.

**Examples:**

- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: Model should improve during training (F1 > random baseline)
- **Overfitting on a single batch**: Model should be able to memorize a small dataset
- **Training on CPU**: Should work on CPU
- **Training on multiple cores**: Should work with parallel processing
- **Training on GPU**: Should detect the GPU if available (skipped if no GPU)
- **Reproducibility**: Same random seed should give identical results
- **More data improves performance**: A larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly

**Run only training tests:**

```bash
pytest tests/behavioral/test_model_training.py -v
```

**Note:** Training tests use small subsets of the data for speed. They verify that the training pipeline works correctly, not that the model achieves optimal performance.

## Prerequisites

Before running the behavioral tests, ensure you have:

1. **Trained model**: A trained model must exist in the `models/` directory
   - Default: `random_forest_tfidf_gridsearch_smote.pkl`
   - Fallback: `random_forest_tfidf_gridsearch.pkl`
2. **Feature extraction**: TF-IDF features must be generated
   - Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`
3. **Database**: The SkillScope database must be available
   - Run: `make data` to download it if needed
4. **Dependencies**: Install test dependencies

   ```bash
   pip install -r requirements.txt
   ```

## Running the Tests

### Run all behavioral tests:

```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py

# Or run all tests (will fail if PyTorch is not installed)
pytest tests/behavioral/ -v
```

### Run specific test categories:

```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v

# Directional tests only
pytest tests/behavioral/test_directional.py -v

# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```

### Run with markers:

```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v

# Run only directional tests
pytest tests/behavioral/ -m directional -v

# Run only MFT tests
pytest tests/behavioral/ -m mft -v

# Run only training tests
pytest tests/behavioral/ -m training -v
```

### Run a specific test:

```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```

### Run with output:

```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s

# Show detailed output and stop on the first failure
pytest tests/behavioral/ -v -s -x
```

## Understanding Test Results

### Successful Test

```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```

The model correctly maintained its predictions despite typos.
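The similarity score these tests report is a set-overlap measure over predicted labels. A minimal sketch of a Jaccard-style helper (a hypothetical implementation for illustration; the project's own `jaccard_similarity` helper may differ):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Example: predicted skill sets before and after introducing a typo
pred_orig = {"programming", "debugging", "testing"}
pred_typo = {"programming", "debugging"}
print(jaccard_similarity(pred_orig, pred_typo))  # 2/3 ≈ 0.667
```

A score of 1.0 means the two prediction sets are identical; invariance tests then assert the score stays above a chosen threshold.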
### Failed Test

```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```

The model's predictions changed significantly with typos (similarity below the 0.7 threshold).

### Common Failure Reasons

1. **Invariance test failures**: The model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: The model does not respond appropriately to meaningful changes
3. **MFT failures**: The model fails on basic, clear-cut examples

## Test Configuration

### Fixtures (in `conftest.py`)

- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text

### Thresholds

The tests use similarity thresholds (Jaccard similarity) to determine whether predictions are "similar enough":

- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted

These thresholds can be adjusted in the test files based on your model's behavior.

## Interpreting Results

### Good Model Behavior:

- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests

### Problematic Model Behavior:

- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)

## Extending the Tests

To add new behavioral tests:

1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`

Example:

```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"

    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))

    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```

## Integration with CI/CD

Add to your CI/CD pipeline:

```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```

## References

- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP Models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`

## Troubleshooting

### "Model not found" error

```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```

### "Features not found" error

```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```

### "Database not found" error

```bash
# Download the data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```

### Import errors

```bash
# Reinstall dependencies
pip install -r requirements.txt
```

### pytest not found

```bash
pip install pytest
```

### "No module named 'torch'" error (for training tests)

```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch

# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```

## Contact

For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.