Behavioral Testing for Skill Classification Model
This directory contains behavioral tests for the skill classification model, following the methodology described in Ribeiro et al. (2020) "Beyond Accuracy: Behavioral Testing of NLP models with CheckList".
Overview
Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:
1. Invariance Tests (test_invariance.py)
Tests that verify that certain input transformations do NOT significantly change the model's predictions.
Examples:
- Typo robustness: "Fixed bug" vs "Fixd bug" should produce similar predictions
- Synonym substitution: "fix" vs "resolve" should not affect predictions
- Case insensitivity: "API" vs "api" should produce identical results
- Punctuation robustness: Extra punctuation should not change predictions
- URL/code snippet noise: URLs and code blocks should not affect core predictions
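The pattern behind these checks can be sketched as follows. This is a minimal, self-contained illustration: `fake_predict` is a hypothetical keyword stub standing in for the suite's `predict_text` fixture, not the real model.

```python
# Invariance sketch: predictions for a clean description and a typo'd
# variant should overlap heavily (high Jaccard similarity).
# `fake_predict` is a hypothetical stub, NOT the real classifier.

def fake_predict(text: str) -> set:
    """Toy predictor: maps keywords to skill labels."""
    text = text.lower()
    skills = set()
    if "bug" in text:
        skills.add("debugging")
    if "fix" in text or "fixd" in text:   # tolerate the typo'd form
        skills.add("programming")
    return skills

def jaccard_similarity(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

pred_clean = fake_predict("Fixed bug")
pred_typo = fake_predict("Fixd bug")
similarity = jaccard_similarity(pred_clean, pred_typo)
assert similarity >= 0.7, f"Typos changed predictions too much: {similarity}"
```

The real tests follow this shape, with the stub replaced by the trained model's `predict_text` fixture.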
Run only invariance tests:
pytest tests/behavioral/test_invariance.py -v
2. Directional Tests (test_directional.py)
Tests that verify that specific changes to the input lead to PREDICTABLE changes in the predictions.
Examples:
- Adding language keywords: Adding "Java" or "Python" should affect language-related predictions
- Adding data structure keywords: Adding "HashMap" should influence data structure predictions
- Adding error handling context: Adding "exception handling" should affect error handling predictions
- Adding API context: Adding "REST API" should influence API-related predictions
- Increasing technical detail: More specific descriptions should maintain or add relevant skills
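A directional check can be sketched like this. Again `fake_predict` is a hypothetical stub, and the expectation is monotone: adding a keyword should add a prediction without dropping the existing ones.

```python
# Directional sketch: appending a language keyword should ADD a
# language-related prediction while preserving the original ones.
# `fake_predict` is a hypothetical stub, NOT the real classifier.

def fake_predict(text: str) -> set:
    text = text.lower()
    skills = set()
    if "api" in text or "endpoint" in text:
        skills.add("api_development")
    if "java" in text:
        skills.add("java")
    return skills

base = fake_predict("Created an API endpoint")
augmented = fake_predict("Created an API endpoint in Java")

assert base <= augmented, "Original skills should be preserved"
assert "java" in augmented - base, "Keyword should add a new prediction"
```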
Run only directional tests:
pytest tests/behavioral/test_directional.py -v
3. Minimum Functionality Tests (MFT) (test_minimum_functionality.py)
Tests that verify the model performs well on basic, straightforward examples where the expected output is clear.
Examples:
- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills
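An MFT can be sketched as a table of clear-cut inputs with a minimum expectation per input. The keyword stub below is hypothetical; the real tests call the model through the `predict_text` fixture.

```python
# MFT sketch: on unambiguous inputs the model must predict at least
# one relevant skill. `fake_predict` is a hypothetical stub.

def fake_predict(text: str) -> set:
    text = text.lower()
    skills = set()
    if "sql" in text or "query" in text:
        skills.add("databases")
    if "docker" in text:
        skills.add("devops")
    return skills

cases = {
    "SQL query optimization": "databases",
    "Configured Docker": "devops",
}
for text, expected in cases.items():
    predictions = fake_predict(text)
    assert predictions, f"No skills predicted for: {text!r}"
    assert expected in predictions, f"{expected!r} missing for {text!r}"
```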
Run only MFT tests:
pytest tests/behavioral/test_minimum_functionality.py -v
4. Model Training Tests (test_model_training.py)
Tests that verify the model training process works correctly.
Examples:
- Training completes without errors: Training should finish successfully
- Decreasing loss: Model should improve during training (F1 > random baseline)
- Overfitting on a single batch: Model should be able to memorize a small dataset
- Training on CPU: Should work on CPU
- Training on multiple cores: Should work with parallel processing
- Training on GPU: Should detect GPU if available (skipped if no GPU)
- Reproducibility: Same random seed should give identical results
- More data improves performance: Larger dataset should improve or maintain performance
- Model saves/loads correctly: Trained models should persist correctly
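The reproducibility check boils down to the sketch below: fixing the random seed must make two runs byte-identical. A seeded shuffle stands in here for a full stochastic training loop; `pseudo_train` is a hypothetical stand-in, not the project's training code.

```python
# Reproducibility sketch: the same seed must give identical results.
import random

def pseudo_train(seed: int) -> list:
    """Hypothetical stand-in for one stochastic training run."""
    rng = random.Random(seed)   # seed everything up front
    data = list(range(10))
    rng.shuffle(data)           # stand-in for stochastic updates
    return data

run_a = pseudo_train(seed=42)
run_b = pseudo_train(seed=42)
assert run_a == run_b, "Same seed must give identical results"
```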
Run only training tests:
pytest tests/behavioral/test_model_training.py -v
Note: Training tests use small subsets of data for speed. They verify the training pipeline works correctly, not that the model achieves optimal performance.
Prerequisites
Before running the behavioral tests, ensure you have:
Trained model: A trained model must exist in the models/ directory
- Default: random_forest_tfidf_gridsearch_smote.pkl
- Fallback: random_forest_tfidf_gridsearch.pkl
Feature extraction: TF-IDF features must be generated
- Run: make features or python -m hopcroft_skill_classification_tool_competition.features
Database: The SkillScope database must be available
- Run: make data to download if needed
Dependencies: Install test dependencies
pip install -r requirements.txt
Running the Tests
Run all behavioral tests:
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
# Or run all tests (will fail if PyTorch not installed)
pytest tests/behavioral/ -v
Run specific test categories:
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v
# Directional tests only
pytest tests/behavioral/test_directional.py -v
# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
Run with markers:
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v
# Run only directional tests
pytest tests/behavioral/ -m directional -v
# Run only MFT tests
pytest tests/behavioral/ -m mft -v
# Run only training tests
pytest tests/behavioral/ -m training -v
Run specific test:
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
Run with output:
# Show print statements during tests
pytest tests/behavioral/ -v -s
# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
Understanding Test Results
Successful Test
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
The model correctly maintained predictions despite typos.
Failed Test
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
The model's predictions changed significantly with typos (similarity < 0.7 threshold).
Common Failure Reasons
- Invariance test failures: Model is too sensitive to noise (typos, punctuation, etc.)
- Directional test failures: Model doesn't respond appropriately to meaningful changes
- MFT failures: Model fails on basic, clear-cut examples
Test Configuration
Fixtures (in conftest.py)
- trained_model: Loads the trained model from disk
- tfidf_vectorizer: Loads or reconstructs the TF-IDF vectorizer
- label_names: Gets the list of skill label names
- predict_text(text): Predicts skill indices from raw text
- predict_with_labels(text): Predicts skill label names from raw text
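Conceptually, the prediction fixtures compose the vectorizer, the classifier, and the label names. The sketch below shows that composition with hypothetical stubs (`StubVectorizer`, `StubModel`, and the label list are all assumptions, not the project's actual conftest.py code).

```python
# Plain-function sketch of what `predict_with_labels` composes:
# vectorize raw text, run the classifier, map indices to label names.
# Both stub classes are hypothetical stand-ins.

class StubVectorizer:
    def transform(self, texts):
        # Real code returns a TF-IDF matrix; the stub echoes the input.
        return texts

class StubModel:
    def predict(self, features):
        # Real code returns a binary indicator row per input text.
        return [[1, 0, 1] for _ in features]

LABEL_NAMES = ["programming", "databases", "testing"]  # assumed labels

def predict_with_labels(text, vectorizer, model, label_names):
    """Map one raw text to the set of predicted skill label names."""
    features = vectorizer.transform([text])
    indicator_row = model.predict(features)[0]
    return {name for name, flag in zip(label_names, indicator_row) if flag}

skills = predict_with_labels(
    "Fixed bug", StubVectorizer(), StubModel(), LABEL_NAMES
)
```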
Thresholds
The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":
- Invariance tests: Typically 0.6-0.8 similarity required
- Directional tests: Predictions should differ meaningfully
- MFT tests: At least 1-2 skills should be predicted
These thresholds can be adjusted in the test files based on your model's behavior.
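The Jaccard similarity the thresholds refer to can be written as a short helper (the test suite's own implementation may differ in detail, e.g. in how it treats two empty sets):

```python
# Jaccard similarity of two prediction sets: |A ∩ B| / |A ∪ B|.
# Defined here as 1.0 when both sets are empty (identical predictions).

def jaccard_similarity(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

sim = jaccard_similarity({"python", "testing"}, {"python", "devops"})
assert abs(sim - 1 / 3) < 1e-9   # 1 shared skill out of 3 distinct
```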
Interpreting Results
Good Model Behavior:
- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests
Problematic Model Behavior:
- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)
Extending the Tests
To add new behavioral tests:
- Choose the appropriate category (invariance/directional/MFT)
- Add a new test method to the corresponding test class
- Use the predict_text or predict_with_labels fixtures
- Add appropriate assertions and print statements for debugging
- Add the corresponding marker: @pytest.mark.invariance, @pytest.mark.directional, or @pytest.mark.mft
Example:
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
Integration with CI/CD
Add to your CI/CD pipeline:
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
References
- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP models with CheckList. ACL 2020.
- Project documentation: docs/
- Model training: hopcroft_skill_classification_tool_competition/modeling/train.py
Troubleshooting
"Model not found" error
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
"Features not found" error
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
"Database not found" error
# Download data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
Import errors
# Reinstall dependencies
pip install -r requirements.txt
pytest not found
pip install pytest
"No module named 'torch'" error (for training tests)
# Install PyTorch (required only for test_model_training.py)
pip install torch
# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
Contact
For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.