# Behavioral Testing for Skill Classification Model
This directory contains behavioral tests for the skill classification model, following the methodology described in **Ribeiro et al. (2020), "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList"**.
## Overview
Behavioral tests go beyond traditional accuracy metrics to verify that the model behaves correctly in specific scenarios. The tests are organized into four categories:
### 1. **Invariance Tests** (`test_invariance.py`)
Tests that verify that certain transformations of the input do **NOT** significantly change the model's predictions.
**Examples:**
- **Typo robustness**: "Fixed bug" vs "Fixd bug" should produce similar predictions
- **Synonym substitution**: "fix" vs "resolve" should not affect predictions
- **Case insensitivity**: "API" vs "api" should produce identical results
- **Punctuation robustness**: Extra punctuation should not change predictions
- **URL/code snippet noise**: URLs and code blocks should not affect core predictions
**Run only invariance tests:**
```bash
pytest tests/behavioral/test_invariance.py -v
```
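The invariance checks above boil down to comparing prediction sets with a Jaccard-similarity helper (the thresholds section below describes the same metric). A minimal sketch, where `predict` stands in for the project's `predict_text` fixture and the 0.7 threshold mirrors the one used in these docs:

```python
# Minimal sketch of an invariance check. `predict` is a stand-in for the
# project's `predict_text` fixture.

def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def check_invariance(predict, original: str, perturbed: str,
                     threshold: float = 0.7) -> bool:
    """Predictions on a perturbed input should stay close to the original."""
    sim = jaccard_similarity(set(predict(original)), set(predict(perturbed)))
    return sim >= threshold
```

`check_invariance(predict_text, "Fixed bug", "Fixd bug")` then expresses the typo-robustness expectation as a single boolean.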
### 2. **Directional Tests** (`test_directional.py`)
Tests that verify specific changes to the input lead to **PREDICTABLE** changes in predictions.
**Examples:**
- **Adding language keywords**: Adding "Java" or "Python" should affect language-related predictions
- **Adding data structure keywords**: Adding "HashMap" should influence data structure predictions
- **Adding error handling context**: Adding "exception handling" should affect error handling predictions
- **Adding API context**: Adding "REST API" should influence API-related predictions
- **Increasing technical detail**: More specific descriptions should maintain or add relevant skills
**Run only directional tests:**
```bash
pytest tests/behavioral/test_directional.py -v
```
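A directional expectation can be sketched the same way: the added context should both change the prediction set and pull it in the expected direction. `predict` below stands in for the project's `predict_with_labels` fixture:

```python
# Minimal sketch of a directional check. `predict` is a stand-in for the
# project's `predict_with_labels` fixture.

def check_directional(predict, base: str, addition: str,
                      expected_label: str) -> bool:
    before = set(predict(base))
    after = set(predict(f"{base} {addition}"))
    # The change must be visible AND point in the expected direction.
    return after != before and expected_label in after
```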
### 3. **Minimum Functionality Tests (MFT)** (`test_minimum_functionality.py`)
Tests that verify the model performs well on **basic, straightforward examples** where the expected output is clear.
**Examples:**
- Simple bug fix: "Fixed null pointer exception" → should predict programming skills
- Database work: "SQL query optimization" → should predict database skills
- API development: "Created REST API endpoint" → should predict API skills
- Testing work: "Added unit tests" → should predict testing skills
- DevOps work: "Configured Docker" → should predict DevOps skills
- Complex multi-skill tasks: Should predict multiple relevant skills
**Run only MFT tests:**
```bash
pytest tests/behavioral/test_minimum_functionality.py -v
```
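An MFT assertion is the simplest of the three: on a clear-cut input the model must produce a non-empty prediction containing the expected skill. A sketch, with `predict` again standing in for the project's `predict_with_labels` fixture:

```python
# Minimal sketch of a minimum functionality check.

def check_mft(predict, text: str, expected_label: str) -> bool:
    labels = set(predict(text))
    # Must predict something, and that something must include the
    # expected skill.
    return bool(labels) and expected_label in labels
```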
### 4. **Model Training Tests** (`test_model_training.py`)
Tests that verify the model training process works correctly.
**Examples:**
- **Training completes without errors**: Training should finish successfully
- **Decreasing loss**: Model should improve during training (F1 > random baseline)
- **Overfitting on single batch**: Model should be able to memorize small dataset
- **Training on CPU**: Should work on CPU
- **Training on multiple cores**: Should work with parallel processing
- **Training on GPU**: Should detect GPU if available (skipped if no GPU)
- **Reproducibility**: Same random seed should give identical results
- **More data improves performance**: Larger dataset should improve or maintain performance
- **Model saves/loads correctly**: Trained models should persist correctly
**Run only training tests:**
```bash
pytest tests/behavioral/test_model_training.py -v
```
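The reproducibility test hinges on fixing every source of randomness before training. A toy illustration with the stdlib `random` module (the real tests would also seed NumPy/PyTorch, so treat this purely as the principle, not the project's code):

```python
import random

def toy_training_run(seed: int, data: list) -> list:
    """Stand-in for a training run: with a fixed seed, stochastic steps
    such as shuffling and sampling repeat exactly."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)             # e.g. batch order
    return rng.sample(shuffled, 3)    # e.g. a sampled validation split

# Same seed -> identical "result", which is what the reproducibility
# test asserts about the real training pipeline.
assert toy_training_run(42, list(range(10))) == toy_training_run(42, list(range(10)))
```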
**Note:** Training tests use small subsets of data for speed. They verify the training pipeline works correctly, not that the model achieves optimal performance.
## Prerequisites
Before running the behavioral tests, ensure you have:
1. **Trained model**: A trained model must exist in `models/` directory
- Default: `random_forest_tfidf_gridsearch_smote.pkl`
- Fallback: `random_forest_tfidf_gridsearch.pkl`
2. **Feature extraction**: TF-IDF features must be generated
- Run: `make features` or `python -m hopcroft_skill_classification_tool_competition.features`
3. **Database**: The SkillScope database must be available
- Run: `make data` to download if needed
4. **Dependencies**: Install test dependencies
```bash
pip install -r requirements.txt
```
## Running the Tests
### Run all behavioral tests:
```bash
# Run all behavioral tests (excluding training tests that require PyTorch)
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
# Or run all tests (will fail if PyTorch not installed)
pytest tests/behavioral/ -v
```
### Run specific test categories:
```bash
# Invariance tests only
pytest tests/behavioral/test_invariance.py -v
# Directional tests only
pytest tests/behavioral/test_directional.py -v
# Minimum functionality tests only
pytest tests/behavioral/test_minimum_functionality.py -v
```
### Run with markers:
```bash
# Run only invariance tests
pytest tests/behavioral/ -m invariance -v
# Run only directional tests
pytest tests/behavioral/ -m directional -v
# Run only MFT tests
pytest tests/behavioral/ -m mft -v
# Run only training tests
pytest tests/behavioral/ -m training -v
```
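Custom markers such as these must be registered with pytest, or they raise `PytestUnknownMarkWarning` (and fail under `--strict-markers`). Assuming the project does not already register them, a typical `pytest.ini` entry looks like:

```ini
[pytest]
markers =
    invariance: invariance behavioral tests
    directional: directional behavioral tests
    mft: minimum functionality tests
    training: model training tests
```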
### Run specific test:
```bash
pytest tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness -v
```
### Run with output:
```bash
# Show print statements during tests
pytest tests/behavioral/ -v -s
# Show detailed output and stop on first failure
pytest tests/behavioral/ -v -s -x
```
## Understanding Test Results
### Successful Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness PASSED
```
The model correctly maintained predictions despite typos.
### Failed Test
```
tests/behavioral/test_invariance.py::TestInvariance::test_typo_robustness FAILED
AssertionError: Typos changed predictions too much. Similarity: 0.45
```
The model's predictions changed significantly with typos (similarity < 0.7 threshold).
### Common Failure Reasons
1. **Invariance test failures**: Model is too sensitive to noise (typos, punctuation, etc.)
2. **Directional test failures**: Model doesn't respond appropriately to meaningful changes
3. **MFT failures**: Model fails on basic, clear-cut examples
## Test Configuration
### Fixtures (in `conftest.py`)
- **`trained_model`**: Loads the trained model from disk
- **`tfidf_vectorizer`**: Loads or reconstructs the TF-IDF vectorizer
- **`label_names`**: Gets the list of skill label names
- **`predict_text(text)`**: Predicts skill indices from raw text
- **`predict_with_labels(text)`**: Predicts skill label names from raw text
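A hypothetical sketch of how these fixtures could be wired in `conftest.py` (the vectorizer path, `pickle` usage, and the multi-label decoding are assumptions for illustration, not the project's exact code; the model path is the default from Prerequisites):

```python
import pickle
import pytest

@pytest.fixture(scope="session")
def trained_model():
    # Default model path from the Prerequisites section.
    with open("models/random_forest_tfidf_gridsearch_smote.pkl", "rb") as f:
        return pickle.load(f)

@pytest.fixture(scope="session")
def tfidf_vectorizer():
    # Hypothetical vectorizer path -- adjust to the project's layout.
    with open("models/tfidf_vectorizer.pkl", "rb") as f:
        return pickle.load(f)

@pytest.fixture
def predict_text(trained_model, tfidf_vectorizer):
    def _predict(text: str) -> list[int]:
        X = tfidf_vectorizer.transform([text])
        # Multi-label prediction: return indices of active skill labels.
        return [i for i, y in enumerate(trained_model.predict(X)[0]) if y == 1]
    return _predict
```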
### Thresholds
The tests use similarity thresholds (Jaccard similarity) to determine if predictions are "similar enough":
- **Invariance tests**: Typically 0.6-0.8 similarity required
- **Directional tests**: Predictions should differ meaningfully
- **MFT tests**: At least 1-2 skills should be predicted
These thresholds can be adjusted in the test files based on your model's behavior.
## Interpreting Results
### Good Model Behavior:
- [PASS] High similarity on invariance tests (predictions stable despite noise)
- [PASS] Meaningful changes on directional tests (predictions respond to context)
- [PASS] Non-empty, relevant predictions on MFT tests
### Problematic Model Behavior:
- [FAIL] Low similarity on invariance tests (too sensitive to noise)
- [FAIL] No changes on directional tests (not learning from context)
- [FAIL] Empty or irrelevant predictions on MFT tests (not learning basic patterns)
## Extending the Tests
To add new behavioral tests:
1. Choose the appropriate category (invariance/directional/MFT)
2. Add a new test method to the corresponding test class
3. Use the `predict_text` or `predict_with_labels` fixtures
4. Add appropriate assertions and print statements for debugging
5. Add the corresponding marker: `@pytest.mark.invariance`, `@pytest.mark.directional`, or `@pytest.mark.mft`
Example:
```python
@pytest.mark.invariance
def test_my_new_invariance_test(self, predict_text):
    """Test that X doesn't affect predictions."""
    original = "Some text"
    modified = "Some modified text"
    pred_orig = set(predict_text(original))
    pred_mod = set(predict_text(modified))
    similarity = jaccard_similarity(pred_orig, pred_mod)
    assert similarity >= 0.7, f"Similarity too low: {similarity}"
```
## Integration with CI/CD
Add to your CI/CD pipeline:
```yaml
- name: Run Behavioral Tests
  run: |
    pytest tests/behavioral/ -v --tb=short
```
## References
- Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). **Beyond Accuracy: Behavioral Testing of NLP Models with CheckList**. ACL 2020.
- Project documentation: `docs/`
- Model training: `hopcroft_skill_classification_tool_competition/modeling/train.py`
## Troubleshooting
### "Model not found" error
```bash
# Train a model first
python -m hopcroft_skill_classification_tool_competition.modeling.train baseline
# or
python -m hopcroft_skill_classification_tool_competition.modeling.train smote
```
### "Features not found" error
```bash
# Generate features
make features
# or
python -m hopcroft_skill_classification_tool_competition.features
```
### "Database not found" error
```bash
# Download data
make data
# or
python -m hopcroft_skill_classification_tool_competition.dataset
```
### Import errors
```bash
# Reinstall dependencies
pip install -r requirements.txt
```
### pytest not found
```bash
pip install pytest
```
### "No module named 'torch'" error (for training tests)
```bash
# Install PyTorch (required only for test_model_training.py)
pip install torch
# Or skip training tests
pytest tests/behavioral/ -v --ignore=tests/behavioral/test_model_training.py
```
## Contact
For questions or issues with the behavioral tests, please refer to the main project documentation or open an issue on GitHub.